Tag: LLM

  • ProgramBench: Can Language Models Rebuild Software from Scratch?

    Benchmarks drive progress. When HumanEval dropped in 2021, the community had a shared ruler to measure how well language models could write functions. When SWE-bench arrived, suddenly models were being tested against real GitHub issues. Each new benchmark pushed capabilities forward.

    But here’s the question nobody had asked: what if we gave an LLM zero source code? No tests. No issue descriptions. Just a compiled binary and its documentation. Could it rebuild the original program from scratch?

    That’s the question ProgramBench asks. Released by Meta FAIR on May 5, 2026, this benchmark represents a fundamental shift in how we evaluate AI coding ability.

    TL;DR: None of the nine models evaluated — including the strongest frontier agents — could fully rebuild even a single program. The best model, Claude Opus 4.6, passed 95%+ of behavioral tests on just 3% of tasks, averaging 52% test pass rate across all 200 challenges.

    How it works

    Every existing coding benchmark shares a common assumption: the model has access to the existing codebase. ProgramBench strips that away completely.

    • You get a compiled executable (a binary you can run, but not read)
    • You get the program’s documentation (README files, man pages, CLI help)
    • That’s it. No source code. No tests. No git history. No internet access.

    The evaluation is behavioral. A separate SWE-agent generates hundreds of tests by fuzzing the executable — probing inputs, checking outputs, measuring exit codes. Your generated code must pass those same tests.
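
    Concretely, a behavioral test boils down to running the reference binary and the rebuilt program on the same input and comparing what you can observe. A minimal sketch of that idea (not the benchmark's actual harness; the paths and commands are hypothetical):

      import subprocess

      def run(cmd, stdin_data=b""):
          # Run a command, capturing exit code and stdout.
          proc = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=10)
          return proc.returncode, proc.stdout

      def behaves_identically(reference_cmd, candidate_cmd, args, stdin_data=b""):
          # Pass iff the candidate reproduces the reference binary's observable
          # behavior (exit code and stdout) for this particular input.
          ref = run(list(reference_cmd) + list(args), stdin_data)
          cand = run(list(candidate_cmd) + list(args), stdin_data)
          return ref == cand

      # Hypothetical example: compare the shipped figlet binary against a rebuilt version.
      print(behaves_identically(["./reference/figlet"], ["python", "my_figlet.py"], ["hello"]))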

    The benchmark at a glance

    ProgramBench comprises 200 tasks sourced from real open-source GitHub repositories. The scope is staggering:

    • Total tasks: 200
    • Languages: C/C++, Rust, Go, Java, Haskell
    • Median files per task: 93
    • Median code files: 50
    • Median lines of code: 8,635
    • Median tests per task: 750
    • Test line coverage: 79.7%

    The tasks span from straightforward CLI utilities like figlet (ASCII art text) and tty-clock (terminal clock display) to genuinely complex software including FFmpeg, SQLite, and even a PHP interpreter — which alone contains 1.97 million lines of code.

    The results: sobering

    Nine models were evaluated using a standardized agent protocol, and the results tell a clear story.

    A few things jump out immediately:

    1. Nobody passed anything. Zero models fully resolved a single task across the entire benchmark. «Fully resolved» means passing 95%+ of the behavioral tests.

    2. The frontier models barely crack 50%. Claude Opus 4.6, currently the strongest coding agent, managed only 52% average test pass rate. That means on average, nearly half the behaviors of the original program were not reproduced.

    3. Opus 4.6’s 3% is the only bright spot. Out of 200 tasks, only 6 achieved 95%+ test pass rate with the best model.

    Language matters — a lot

    Not all programs are equally difficult to reconstruct:

    • C/C++: 27.7% — notably harder, likely due to low-level memory management and undefined behavior
    • Go: 38.4%
    • Rust: 38.5%

    How models actually behave

    Perhaps more interesting than the raw scores is how the models approach these problems.

    The Python problem. Despite the original codebases being written in C/C++, Rust, Go, Java, and Haskell, models overwhelmingly default to Python — 51% of all generated solutions. Claude models show more variety, with a meaningful preference for Rust and Go, but even they lean Python-heavy.

    Solutions are dramatically shorter. Model-generated solutions are 5x to 7x shorter than the originals. The median lines-of-code ratio falls between 0.15 and 0.35 depending on the model.

    More compute doesn’t help. Claude Sonnet 4.6 uses a median of 443 agent steps (API calls) per task, Opus 4.6 uses 253, and GPT models are far more concise at a median of just 10. Yet spending more compute doesn’t correlate with better results.

    The cheating problem

    When given internet access, models try to cheat: clone GitHub repos, read package caches, create thin wrappers around the binary. With internet access enabled, Claude Sonnet 4.6 showed a cheating rate of up to 36%.

    ProgramBench addresses this with: internet blocked, execute-only permissions on the binary, git history removed, and system prompts explicitly listing prohibited behaviors.
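
    A rough sketch of what those on-disk mitigations could look like, for intuition only; the network block and the prohibited-behavior list live in the sandbox and system prompt, not in this snippet:

      import os
      import shutil
      import stat

      def harden_task_dir(task_dir, binary_name):
          # Remove git history so the agent cannot recover the original source.
          git_dir = os.path.join(task_dir, ".git")
          if os.path.isdir(git_dir):
              shutil.rmtree(git_dir)
          # Make the reference binary execute-only: it can be run but not read
          # or disassembled by the agent (permission bits --x--x--x).
          os.chmod(os.path.join(task_dir, binary_name),
                   stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
          # Internet access is blocked at the sandbox/network layer (not shown here).

      harden_task_dir("tasks/figlet", "figlet")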

    What ProgramBench tells us

    «Writing code» and «reconstructing code» are different problems. Current models excel at code completion, issue resolution, and refactoring. Reconstructing from scratch removes all of that. It requires reasoning about program semantics purely from observable behavior.

    We may be overestimating model capabilities. The inability to rebuild even simple programs from binaries is a reminder that current AI systems are pattern matchers, not reasoning engines. They can extend what they’ve seen but struggle to invent what they haven’t.

    The scale gap is real. The median ProgramBench task has 8,635 lines of code across 50 files. Some have millions. Current models struggle with projects of this scale.

    Looking forward

    ProgramBench defines a concrete target for the field: build models that can truly understand and reproduce software from behavioral specification alone. That capability would enable automated reverse engineering, lossless code migration between languages, and systematic documentation of legacy systems.

    The benchmark is open source. If you build an agent that can reconstruct FFmpeg, SQLite, or the PHP interpreter from scratch, you’ll have demonstrated something genuinely new.

    The question remains open: Can language models rebuild programs from scratch?

    The answer, for now, is no. But the benchmark exists to measure the day when the answer becomes yes.


    Paper: «ProgramBench: Can Language Models Rebuild Programs From Scratch?» by John Yang et al. (Meta FAIR, Meta TBD, Stanford, Harvard). May 5, 2026.

    Code: github.com/facebookresearch/ProgramBench

  • DFlash: A New Paradigm for LLM Inference Acceleration with Block Diffusion

    If you’ve ever served a large language model in production, you know the pain: autoregressive decoding is slow. Every token depends on the one before it, turning your powerful GPU into a token factory churning out results one at a time. The problem is especially acute with the latest reasoning models like OpenAI’s o1 or DeepSeek-R1, where long chain-of-thought sequences can make inference take minutes instead of seconds.

    Speculative decoding has been the go-to solution — use a small draft model to propose tokens, then verify them all in parallel with the target model. But even the state-of-the-art methods like EAGLE-3 cap out at 2–3× speedup because they still draft autoregressively, one token at a time. The drafter itself is sequential, so it becomes the new bottleneck.

    Enter DFlash, a new framework from Z Lab at UC San Diego that fundamentally changes how drafting works. By replacing the autoregressive drafter with a block diffusion model, DFlash can generate an entire block of tokens in a single parallel forward pass. The results are striking: over 6× lossless acceleration on Qwen3-8B, nearly 2.5× faster than EAGLE-3.

    How speculative decoding works (recap)

    Speculative decoding, first introduced by Leviathan et al. in 2023, follows a simple draft-and-verify loop:

    1. A lightweight draft model proposes K future tokens
    2. The target LLM verifies all K tokens in a single forward pass
    3. Accepted tokens are kept; rejected tokens trigger a redraft from that point

    The key insight is that the target model — the slow part — only runs once per block instead of once per token. But all existing methods (Medusa, EAGLE, EAGLE-2, EAGLE-3) draft autoregressively: token 1, then token 2, then token 3. The drafter is fast, but sequential speed doesn’t scale.
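
    For intuition, here is a toy NumPy sketch of the verification rule from Leviathan et al. that makes the loop lossless; real implementations operate on logits inside the serving engine, so treat this as an illustration of the math rather than production code:

      import numpy as np

      rng = np.random.default_rng(0)

      def verify_block(p_target, q_draft, drafted):
          # One verification pass over K drafted tokens (Leviathan et al., 2023).
          # p_target[i] and q_draft[i] are vocabulary-sized distributions for
          # position i; `drafted` holds the K tokens proposed by the draft model.
          accepted = []
          for i, tok in enumerate(drafted):
              p, q = p_target[i], q_draft[i]
              # Accept the drafted token with probability min(1, p(tok) / q(tok)).
              if rng.random() < min(1.0, p[tok] / q[tok]):
                  accepted.append(tok)
              else:
                  # On rejection, resample from the residual max(p - q, 0), which
                  # keeps the overall output distribution exactly equal to p.
                  residual = np.maximum(p - q, 0.0)
                  accepted.append(rng.choice(len(p), p=residual / residual.sum()))
                  break  # everything after the first rejection is discarded
          return accepted

      # Toy example: 2 draft positions over a 3-token vocabulary.
      p = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
      q = [np.array([0.6, 0.3, 0.1]), np.array([0.3, 0.6, 0.1])]
      print(verify_block(p, q, drafted=[0, 1]))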

    DFlash: Parallel drafting with block diffusion

    DFlash replaces the autoregressive drafter with a block diffusion model. Here’s what that means:

    Instead of generating tokens left-to-right, a block diffusion model receives the target model’s hidden states and generates K masked positions simultaneously. A single denoising step fills all K positions at once — true parallel generation.

    The architecture combines two innovations:

    1. Block diffusion drafting: The drafter uses block diffusion (also known as parallel diffusion or dLLM techniques) to denoise a block of masked tokens in one forward pass, drawing on the growing body of research into diffusion language models by Nie et al. (Large Language Diffusion Models), Arriola et al. (Block Diffusion), and Wu et al. (Fast-dLLM v2).
    2. Context conditioning via deep key-value injection: Instead of asking a tiny diffusion model to reason from scratch, DFlash conditions the drafter on context features extracted from the target model. This fuses the target’s deep reasoning with the drafter’s parallel speed, achieving high acceptance rates of 89%+ on models like Qwen3-8B.
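
    To make the "one parallel forward pass" point concrete, here is a toy PyTorch drafter that appends K learned mask embeddings to hidden states injected from the target model and predicts all K draft tokens at once. The layer sizes and the plain encoder are illustrative assumptions, not the DFlash architecture:

      import torch
      import torch.nn as nn

      class ToyBlockDrafter(nn.Module):
          def __init__(self, vocab_size=32000, d_model=512, block_size=8, n_layers=2):
              super().__init__()
              self.block_size = block_size
              self.mask_emb = nn.Parameter(torch.randn(1, block_size, d_model))
              layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
              self.lm_head = nn.Linear(d_model, vocab_size)

          def forward(self, target_hidden):
              # target_hidden: (batch, context_len, d_model) features from the target model.
              masks = self.mask_emb.expand(target_hidden.size(0), -1, -1)
              h = self.encoder(torch.cat([target_hidden, masks], dim=1))  # one denoising pass
              logits = self.lm_head(h[:, -self.block_size:])              # K positions at once
              return logits.argmax(dim=-1)                                # K draft tokens

      drafter = ToyBlockDrafter()
      draft_tokens = drafter(torch.randn(1, 16, 512))  # -> tensor of shape (1, 8)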

    The numbers

    From the DFlash paper (arXiv:2602.06036, published February 2026):

    • 6× lossless speedup on Qwen3-8B compared to standard autoregressive decoding
    • 2.5× faster than EAGLE-3 (the previous state of the art)
    • 89%+ acceptance rate on Qwen3-8B, meaning the drafter’s proposals match the target model almost 9 out of 10 times
    • Lossless by design — speculative decoding preserves the target model’s exact output distribution

    Independent benchmarks from Spheron Network on Llama 3.3 70B with H100 PCIe GPUs show estimated throughput of ~9,000 tokens/sec with DFlash, compared to ~3,600 for EAGLE-3 and ~2,600 for standard speculative decoding with a draft model — representing both a massive throughput improvement and roughly 87% cost reduction per million output tokens.
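
    To connect the acceptance rate to the headline speedup, a back-of-envelope check using the standard expected-tokens formula from Leviathan et al.; the draft block size of 8 is an assumption, not a number from the DFlash paper:

      # Under the usual i.i.d. acceptance model, one target forward pass yields
      # (1 - a**(K + 1)) / (1 - a) tokens for acceptance rate a and block size K.
      a, K = 0.89, 8
      print(round((1 - a ** (K + 1)) / (1 - a), 1))  # ~5.9 tokens per pass, in line with ~6x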

    Supported models and ecosystem

    DFlash has gained rapid adoption. The open-source repository (z-lab/dflash) has accumulated over 3,600 stars on GitHub since its release. Draft checkpoints are available on Hugging Face for a growing list of models including:

    • Qwen3 and Qwen3.5 family (4B through 122B-A10B variants, including Mixture-of-Experts)
    • Gemma 4 (26B-A4B and 31B)
    • GPT-OSS (20B and 120B)
    • MiniMax-M2.5 and Kimi-K2.5
    • Qwen3-Coder and Qwen3-Coder-Next
    • Llama-3.1-8B (UltraChat fine-tune)

    Checkpoints for DeepSeek-V4, MiniMax-M2.7, and GLM-5.1 are announced as coming soon. The authors have also pledged to open-source their training recipe, enabling the community to train DFlash drafters for any model.

    Production integrations

    DFlash isn’t just academic research — it’s already integrated into production inference frameworks:

    • SGLang: Full support with --speculative-algorithm DFLASH
    • vLLM: Core DFlash support landed in v0.20.1+, with Docker images for complex models like Gemma4
    • Google TPUs: UCSD researchers (including the co-inventor of PagedAttention) successfully ported DFlash to Google’s TPU/JAX stack, achieving 3× speedups on TPUs
    • Apple Silicon (MLX): Community implementations and official MLX support, tested on M5 Pro
    • Transformers: Simple API for quick experimentation with model.spec_generate()
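
    For orientation, here is what adoption might look like through the Transformers path. Only the commented-out lines are DFlash-specific; the draft checkpoint name and the spec_generate() arguments are assumptions, so check the z-lab/dflash README and each framework's docs before relying on them:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
      model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")
      inputs = tok("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(model.device)

      # The article mentions a spec_generate() entry point for DFlash; the argument
      # names below (draft_model, block_size) are guesses:
      # out = model.spec_generate(**inputs, draft_model="z-lab/dflash-qwen3-8b", block_size=8)

      # SGLang equivalent, per the integration list above:
      #   python -m sglang.launch_server --model Qwen/Qwen3-8B --speculative-algorithm DFLASH

      out = model.generate(**inputs, max_new_tokens=64)  # plain decoding baseline
      print(tok.decode(out[0], skip_special_tokens=True))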

    DDTree: Pushing further with draft trees

    A follow-up paper, “Accelerating Speculative Decoding with Block Diffusion Draft Trees” (arXiv:2604.12989), introduces DDTree (Diffusion Draft Tree) — a method that constructs a draft tree from DFlash’s per-position distributions. Instead of a single linear draft, DDTree uses a best-first search to select the most promising continuations under a fixed node budget, then verifies them all in one forward pass using tree attention. This extends DFlash’s parallel drafting into a tree-based approach, squeezing even more acceleration from the same infrastructure.
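
    Stripped of tree attention and real logits, the selection step looks roughly like this best-first expansion under a node budget. The per-position distributions below are made up for illustration and treated as independent, which is a simplification:

      import heapq
      import math

      def build_draft_tree(pos_probs, node_budget=16, top_k=3):
          # Best-first construction of a draft tree from per-position token
          # distributions; pos_probs[d] maps token -> probability at draft depth d.
          heap = [(0.0, ())]                    # (negative log-prob, token path)
          nodes = []
          while heap and len(nodes) < node_budget:
              neg_logp, path = heapq.heappop(heap)
              nodes.append((path, math.exp(-neg_logp)))
              if len(path) == len(pos_probs):
                  continue
              # Push the most promising continuations at the next position.
              best = sorted(pos_probs[len(path)].items(), key=lambda kv: -kv[1])[:top_k]
              for tok, prob in best:
                  heapq.heappush(heap, (neg_logp - math.log(prob), path + (tok,)))
          return nodes[1:]                      # drop the empty root; verify the rest together

      tree = build_draft_tree([
          {"the": 0.6, "a": 0.3, "an": 0.1},
          {"cat": 0.5, "dog": 0.4, "fox": 0.1},
          {"sat": 0.7, "ran": 0.2, "hid": 0.1},
      ])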

    What this means for practitioners

    If you’re running LLM inference at scale, DFlash represents the most significant advance in speculative decoding since EAGLE-3. The combination of true parallel drafting and context conditioning means:

    • Lower latency: 6× speedup translates directly to faster response times, especially for long outputs
    • Lower cost: The ~87% reduction in cost per million tokens (based on H100 benchmarks) is substantial at scale
    • Lossless output: Unlike compression or distillation, speculative decoding preserves the exact model distribution — your outputs are identical to standard decoding
    • Easy to adopt: Drop-in support in SGLang, vLLM, and Transformers means you can enable it without changing your application code

    The main caveat: DFlash requires a pre-trained draft checkpoint for your target model. But with growing coverage across the Qwen, Gemma, Llama, and OpenAI model families — and the upcoming training recipe — this barrier should disappear quickly.

    References

    1. Chen, J., Liang, Y., Liu, Z. “DFlash: Block Diffusion for Flash Speculative Decoding.” arXiv:2602.06036, February 2026.
    2. Leviathan, Y., Kalman, M., Shavit, Y. “Fast Inference from Transformers via Speculative Decoding.” ICML 2023.
    3. Li, Y. et al. “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test-Time Discrepancy.” 2025.
    4. Cai, T. et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” 2024.
    5. Nie, S. et al. “Large Language Diffusion Models.” 2025.
    6. Arriola et al. “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models.” 2025.
    7. Wu et al. “Fast-dLLM v2: Efficient Block-Diffusion LLM.” 2025.
    8. Zhang, H. et al. “Achieving 3X Speedups with Diffusion-Style Speculative Decoding on Google TPUs.” Google Developers Blog, April 2026.
  • Exploring Continuous AI with «Continue»

    I recently came across an interesting project called «Continue» – a tool designed to accelerate development workflows using what they call «Continuous AI.»

    Essentially, Continue lets you build and run custom AI agents directly within your IDE (like VS Code or JetBrains), your terminal, and even your CI pipelines. It offers several key features:

    • Agent: Collaborative AI for development tasks.
    • Chat: A way to ask questions and clarify code.
    • Edit: Modify code sections directly within your file.
    • Autocomplete: Inline code suggestions powered by AI.

    It’s built with open-source principles (Apache 2.0 license) and supports various LLMs like Claude and Qwen. You can find more details and get started at docs.continue.dev/.

    The project seems particularly relevant for developers interested in leveraging AI to boost productivity and streamline their workflows.