Tag: Acceleration

  • DFlash: A New Paradigm for LLM Inference Acceleration with Block Diffusion

    If you’ve ever served a large language model in production, you know the pain: autoregressive decoding is slow. Every token depends on the one before it, turning your powerful GPU into a token factory churning out results one at a time. The problem is especially acute with the latest reasoning models like OpenAI’s o1 or DeepSeek-R1, where long chain-of-thought sequences can make inference take minutes instead of seconds.

    Speculative decoding has been the go-to solution: use a small draft model to propose tokens, then verify them all in parallel with the target model. But even state-of-the-art methods like EAGLE-3 top out at 2–3× speedup because they still draft autoregressively, one token at a time. The drafter itself is sequential, so it becomes the new bottleneck.

    Enter DFlash, a new framework from Z Lab at UC San Diego that fundamentally changes how drafting works. By replacing the autoregressive drafter with a block diffusion model, DFlash can generate an entire block of tokens in a single parallel forward pass. The results are striking: over 6× lossless acceleration on Qwen3-8B, nearly 2.5× faster than EAGLE-3.

    How speculative decoding works (recap)

    Speculative decoding, first introduced by Leviathan et al. in 2023, follows a simple draft-and-verify loop:

    1. A lightweight draft model proposes K future tokens
    2. The target LLM verifies all K tokens in a single forward pass
    3. Accepted tokens are kept; rejected tokens trigger a redraft from that point

    The key insight is that the target model (the slow part) only runs once per block instead of once per token. But all existing methods (Medusa, EAGLE, EAGLE-2, EAGLE-3) draft autoregressively: token 1, then token 2, then token 3. The drafter is cheap per token, but its drafting time still grows linearly with draft length, so sequential drafting caps the achievable speedup.
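
    To make the loop concrete, here is a minimal sketch of one draft-and-verify round in PyTorch-flavored Python. It is an illustration, not any library's implementation: target and drafter are hypothetical callables standing in for the real models, and the acceptance rule shown is the standard one from Leviathan et al., which is what makes the procedure lossless.

    import torch

    def speculative_decode_step(target, drafter, ctx, k=4):
        # Hypothetical interfaces (assumptions, not a real API):
        #   drafter(ctx, k) -> draft_tokens (k,), draft_probs (k, vocab)
        #   target(tokens)  -> next-token probs for the last k+1 positions
        draft_tokens, draft_probs = drafter(ctx, k)
        # The slow target model runs ONCE over the context plus all k drafts.
        target_probs = target(torch.cat([ctx, draft_tokens]))
        accepted = []
        for i, tok in enumerate(draft_tokens.tolist()):
            p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
            # Lossless acceptance rule: accept with probability min(1, p_t/p_d).
            if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
                accepted.append(tok)
            else:
                # On rejection, resample from the residual (p_t - p_d)+ and stop.
                # This correction preserves the target's exact output distribution.
                residual = torch.clamp(target_probs[i] - draft_probs[i], min=0)
                accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
                break
        else:
            # All k drafts accepted: the extra target position yields a bonus token.
            accepted.append(torch.multinomial(target_probs[k], 1).item())
        return accepted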

    DFlash: Parallel drafting with block diffusion

    DFlash replaces the autoregressive drafter with a block diffusion model. Here’s what that means:

    Instead of generating tokens left-to-right, a block diffusion model receives the target model’s hidden states and predicts tokens for K masked positions simultaneously. A single denoising step fills all K positions at once: true parallel generation.

    The architecture combines two innovations:

    1. Block diffusion drafting: The drafter uses block diffusion (also known as parallel diffusion or dLLM techniques) to denoise a block of masked tokens in one forward pass, drawing on the growing body of research into diffusion language models by Nie et al. (Large Language Diffusion Models), Arriola et al. (Block Diffusion), and Wu et al. (Fast-dLLM v2).
    2. Context conditioning via deep key-value injection: Instead of asking a tiny diffusion model to reason from scratch, DFlash conditions the drafter on context features extracted from the target model. This fuses the target’s deep reasoning with the drafter’s parallel speed, achieving high acceptance rates of 89%+ on models like Qwen3-8B. Both ideas are illustrated in the sketch after this list.
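
    The following illustrative PyTorch module is a sketch under stated assumptions, not the DFlash architecture: every layer choice, name, and dimension below is invented for illustration. It shows only the mechanics described above: all K draft positions start as masks and are predicted in a single forward pass, and the block is conditioned on hidden states taken from the target model (standing in for the paper’s deep key-value injection).

    import torch
    import torch.nn as nn

    class BlockDiffusionDrafter(nn.Module):
        # Illustrative only: a one-step denoiser over K masked positions,
        # conditioned on the target model's hidden states via cross-attention.
        def __init__(self, vocab_size, d_model, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size + 1, d_model)  # +1 for [MASK]
            self.mask_id = vocab_size
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, target_hidden, k):
            # Start from K [MASK] embeddings: every draft position exists up front.
            b = target_hidden.size(0)
            masks = torch.full((b, k), self.mask_id, device=target_hidden.device)
            x = self.embed(masks)
            # Condition the masked block on the target's context features.
            x, _ = self.cross_attn(x, target_hidden, target_hidden)
            x = self.block(x)
            logits = self.head(x)          # (b, k, vocab_size)
            # One denoising step predicts all K positions simultaneously.
            return logits.argmax(dim=-1)   # (b, k) draft tokens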

    The numbers

    From the DFlash paper (arXiv:2602.06036, published February 2026):

    • 6× lossless speedup on Qwen3-8B compared to standard autoregressive decoding
    • 2.5× faster than EAGLE-3 (the previous state of the art)
    • 89%+ acceptance rate on Qwen3-8B, meaning the drafter’s proposals match the target model almost 9 out of 10 times
    • Lossless by design — speculative decoding preserves the target model’s exact output distribution

    Independent benchmarks from Spheron Network on Llama 3.3 70B with H100 PCIe GPUs show estimated throughput of ~9,000 tokens/sec with DFlash, versus ~3,600 for EAGLE-3 and ~2,600 for standard speculative decoding with a separate draft model: roughly a 2.5× throughput gain over EAGLE-3, which Spheron estimates as an 87% cost reduction per million output tokens.

    Supported models and ecosystem

    DFlash has gained rapid adoption. The open-source repository (z-lab/dflash) has accumulated over 3,600 stars on GitHub since its release. Draft checkpoints are available on Hugging Face for a growing list of models including:

    • Qwen3 and Qwen3.5 family (4B through 122B-A10B variants, including Mixture-of-Experts)
    • Gemma 4 (26B-A4B and 31B)
    • GPT-OSS (20B and 120B)
    • MiniMax-M2.5 and Kimi-K2.5
    • Qwen3-Coder and Qwen3-Coder-Next
    • Llama-3.1-8B (UltraChat fine-tune)

    Checkpoints for DeepSeek-V4, MiniMax-M2.7, and GLM-5.1 have been announced as coming soon. The authors have also pledged to open-source their training recipe, enabling the community to train DFlash drafters for any model.

    Production integrations

    DFlash isn’t just academic research — it’s already integrated into production inference frameworks:

    • SGLang: Full support with --speculative-algorithm DFLASH
    • vLLM: Core DFlash support landed in v0.20.1+, with Docker images for complex models like Gemma 4
    • Google TPUs: UCSD researchers (including the co-inventor of PagedAttention) successfully ported DFlash to Google’s TPU/JAX stack, achieving 3× speedups on TPUs
    • Apple Silicon (MLX): Community implementations and official MLX support, tested on M5 Pro
    • Transformers: Simple API for quick experimentation with model.spec_generate() (see the example after this list)
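
    For quick experimentation, the Transformers path mentioned above might look roughly like this. Caveat: spec_generate is the entry point named in this article, but the argument names (draft_model, max_new_tokens) and the draft-checkpoint id below are assumptions, not a documented signature; check the z-lab/dflash repository and the model cards before relying on them.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
    # Hypothetical draft-checkpoint id; the real ids live on the Hugging Face Hub.
    draft = AutoModelForCausalLM.from_pretrained("z-lab/dflash-qwen3-8b", device_map="auto")

    inputs = tok("Explain speculative decoding in one paragraph.", return_tensors="pt").to(model.device)
    out = model.spec_generate(**inputs, draft_model=draft, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))

    On the serving side, the equivalent switch is the SGLang flag from the list above (--speculative-algorithm DFLASH), which enables DFlash without touching application code.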

    DDTree: Pushing further with draft trees

    A follow-up paper, “Accelerating Speculative Decoding with Block Diffusion Draft Trees” (arXiv:2604.12989), introduces DDTree (Diffusion Draft Tree) — a method that constructs a draft tree from DFlash’s per-position distributions. Instead of a single linear draft, DDTree uses a best-first search to select the most promising continuations under a fixed node budget, then verifies them all in one forward pass using tree attention. This extends DFlash’s parallel drafting into a tree-based approach, squeezing even more acceleration from the same infrastructure.
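
    The search itself is generic enough to sketch. Below is a best-first expansion under a fixed node budget, which is the pattern the paragraph describes; it is not the DDTree implementation, and expand (the drafter’s per-position top-k lookup) is a hypothetical callable.

    import heapq
    import math

    def build_draft_tree(root_topk, expand, node_budget=64, top_k=4):
        # Heap entries: (negative cumulative log-prob, path of token ids),
        # so the globally most promising unexpanded node pops first.
        heap = [(-math.log(p), (tok,)) for tok, p in root_topk[:top_k]]
        heapq.heapify(heap)
        tree = []
        while heap and len(tree) < node_budget:
            neg_logp, path = heapq.heappop(heap)
            tree.append(path)  # keep this node in the draft tree
            for tok, p in expand(path)[:top_k]:
                # Children score by cumulative log-prob along their path.
                heapq.heappush(heap, (neg_logp - math.log(p), path + (tok,)))
        # All paths in `tree` are then verified in ONE target forward pass
        # using tree attention, as described above.
        return tree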

    What this means for practitioners

    If you’re running LLM inference at scale, DFlash represents the most significant advance in speculative decoding since EAGLE-3. The combination of true parallel drafting and context conditioning means:

    • Lower latency: 6× speedup translates directly to faster response times, especially for long outputs
    • Lower cost: The ~87% reduction in cost per million tokens (based on H100 benchmarks) is substantial at scale
    • Lossless output: Unlike compression or distillation, speculative decoding preserves the exact model distribution — your outputs are identical to standard decoding
    • Easy to adopt: Drop-in support in SGLang, vLLM, and Transformers means you can enable it without changing your application code

    The main caveat: DFlash requires a pre-trained draft checkpoint for your target model. But with growing coverage across the Qwen, Gemma, Llama, and OpenAI model families — and the upcoming training recipe — this barrier should disappear quickly.

    References

    1. Chen, J., Liang, Y., Liu, Z. “DFlash: Block Diffusion for Flash Speculative Decoding.” arXiv:2602.06036, February 2026.
    2. Leviathan, Y., Kalman, M., Matias, Y. “Fast Inference from Transformers via Speculative Decoding.” ICML 2023.
    3. Li, Y. et al. “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test-Time Discrepancy.” 2025.
    4. Cai, T. et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” 2024.
    5. Nie, S. et al. “Large Language Diffusion Models.” 2025.
    6. Arriola, M. et al. “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models.” 2025.
    7. Wu, T. et al. “Fast-dLLM v2: Efficient Block-Diffusion LLM.” 2025.
    8. Zhang, H. et al. “Achieving 3X Speedups with Diffusion-Style Speculative Decoding on Google TPUs.” Google Developers Blog, April 2026.