Blog

  • Test post

    Hello world

  • The Recursive Revolution: From RLM to Multi-Agent Loops

    The Recursive Revolution: From Single Models to Multi-Agent Loops

    Recursive Language Models (RLMs) emerged in December 2025 as a new scaling axis for AI. Since then, the field has evolved rapidly — from single-model recursion to entire multi-agent systems operating as unified recursive computations. Here’s where we stand in mid-2026.


    Where It Started: RLM (MIT, Dec 2025)

    The original RLM paper by Zhang, Kraska, and Khattab reframed long-context reasoning as an inference-time scaling problem. Instead of stuffing everything into a context window, the model delegates work to a persistent Python REPL and spawnable sub-LLM instances.

    Key results: GPT-5 with RLM achieved +26% over compaction, +130% over CodeAct, and +13% over Claude Code. A fine-tuned RLM-Qwen3-8B rivalled GPT-5 on 3 out of 4 tasks.

    The insight was deceptively simple: treat the prompt as an environment the model can inspect, slice, and recursively query — collapsing reasoning and tool use into a single inference abstraction.


    The Next Step: Ouro — LoopLMs (ByteDance, Oct 2025)

    ByteDance’s «Scaling Latent Reasoning via Looped Language Models» (Zhu et al., with Yoshua Bengio) took a different approach: instead of applying recursion only at inference time, they built reasoning into pre-training itself.

    Key innovations:

    • Iterative computation in latent space during pre-training, not just inference
    • Entropy-regularized objective for learned depth allocation — the model decides how many iterations it needs
    • 7.7 trillion tokens of training data
    • Ouro-2.6B — open-weight 2.6B parameter model available on HuggingFace

    The name «Ouro» comes from the Ouroboros — the serpent eating its own tail. The model loops over its own latent representations, refining reasoning through shared-weight recurrent computation rather than stacking more layers.

    Why this matters: RLM proved recursion works as a fine-tuning strategy. Ouro proves that if you train a model to be recursive from scratch, it learns to organize computation in stages that closely mirror feedforward models — but with far greater parameter efficiency.


    Mechanistic Understanding: Why Does It Work? (Apr 2026)

    Two papers from April 2026 provide the theoretical foundation:

    «A Mechanistic Analysis of Looped Reasoning Language Models» discovered that recurrent blocks in looped models learn stages of inference that repeat cyclically in the latent space. Each layer converges to a distinct fixed point — the model isn’t just looping randomly, it’s following a consistent computational trajectory.

    «From Growing to Looping» (Google DeepMind, OpenReview 2026) by Kapl, von Oswald, and Bauer provides a unified theoretical framework connecting depth growth with looping. The conclusion: iterative computation can scale reasoning without adding parameters, establishing a formal equivalence between model depth and recursion depth.


    The Latest Frontier: RecursiveMAS (Stanford, Apr 2026)

    Now we arrive at the paper you came here for. «Recursive Multi-Agent Systems» by Yang, Zou et al. (Stanford + UIUC) asks: if recursion works for a single model, does it work for a system of agents?

    RecursiveMAS treats the entire multi-agent system as a unified latent-space recursive computation. Each agent acts like an RLM layer, iteratively passing latent representations to the next, forming a looped interaction process.

    Architecture Highlights

    • RecursiveLink module — lightweight component enabling in-distribution latent thoughts generation and cross-agent latent state transfer (not text-based message passing)
    • Inner-outer loop learning — iterative whole-system co-optimization with gradient-based credit assignment across recursion rounds
    • 4 collaboration patterns — sequential pipeline, parallel specialized agents, and hybrids

    Results (9 benchmarks: math, science, medicine, search, code)

    • +8.3% accuracy vs advanced single/multi-agent baselines
    • 1.2x–2.4x inference speedup vs text-based MAS
    • 34.6%–75.6% token reduction — massive efficiency gain

    The token reduction alone is the killer feature. Multi-agent systems have always been expensive because agents exchange text — every message is tokens generated, transmitted, and consumed. RecursiveMAS eliminates most of that overhead by exchanging latent states instead.


    The Bigger Picture: Where Is This Heading?

    Three clear directions are emerging:

    1. Pre-trained recursive models become the default. Ouro shows that recursion can be baked into pre-training. Expect future foundation models to include recursive computation as a native capability, not an afterthought.

    2. Multi-agent recursion for production workloads. RecursiveMAS is academic, but the efficiency gains (75% token reduction) make it immediately relevant for production multi-agent pipelines. Companies like Prime Intellect are already building RLM training infrastructure.

    3. Recursive self-improvement loops. The ICLR 2026 workshop on «AI with Recursive Self-Improvement» signals that agents rewriting their own code and prompts recursively is no longer theoretical — Claude Code and Codex already do this ad-hoc. The next step is systematic, gradient-based self-improvement inspired by RecursiveMAS’s inner-outer loop design.


    Key Papers


    The recursive paradigm is no longer a niche research direction. From single-model inference scaling to multi-agent systems to pre-training architectures, recursion is becoming the unifying principle for how AI systems handle complexity beyond what any single forward pass can manage.

  • Recursive Language Models: The New Paradigm for Inference Scaling

    CoT → ReAct → CodeAct → RLM. The evolution of how language models think doesn’t stop at bigger context windows. A team at MIT CSAIL has introduced a radically different way for LLMs to reason: make the model recursively call itself through a programmable REPL environment.


    The Problem: «Context Rot» Is Real

    If you’ve ever worked with Claude Code, ChatGPT, or any agent framework on a sufficiently large codebase, you’ve experienced it: as the conversation grows, the model gets… dumber.

    Anthropic calls it context rot: when the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases. In real usage — long coding sessions, multi-turn research, massive document analysis — performance degrades in ways that are hard to benchmark and impossible to ignore.

    The natural intuition is: «What if I split the context into two model calls, then combine them in a third?»

    That’s exactly what RLMs do. But instead of a fixed splitting strategy, the model itself decides how to decompose its input, writes code to do it, and recursively processes the pieces.


    What Are Recursive Language Models?

    Recursive Language Models (RLMs) are an inference strategy proposed by Alex Zhang, Omar Khattab, and Tim Kraska at MIT CSAIL (October 2025, updated January 2026).

    The core idea: a language model can spawn recursive calls to itself through a REPL environment, where the input context exists as a symbolic variable — not as text in the model’s prompt.

    Their results: an RLM using GPT-5-mini outperforms GPT-5 by more than double the number of correct answers on the OOLONG long-context benchmark, and costs less per query. RLMs also don’t degrade when given 10M+ tokens at inference time.


    The Evolution: From CoT to RLM

    Chain-of-Thought (CoT, 2022): The model generates intermediate reasoning steps before answering. No tool use. No external interaction.

    ReAct (Yao et al., 2023): The pattern Thought → Action → Observation. The model reasons with chain-of-thought, calls tools, observes results, and repeats.

    CodeAct (Shi et al., 2023): Same as ReAct, but the action is executing code instead of calling predefined tools. More powerful because code allows arbitrary composition.

    RLM (Zhang, Khattab & Kraska, 2025): The model calls itself recursively. The root LM has a Python REPL environment where the context exists as a variable. It can launch recursive sub-LM calls with partitions of the context, each with their own REPL.

    The authors see RLMs as the next milestone after CoT and ReAct.


    Deep Dive: How RLM Works

    The Context Is a Variable, Not a Prompt

    This is the crucial conceptual difference. In a normal LLM call, the context is text in the prompt. The model sees it all at once and attends to it.

    In an RLM, the context is loaded as a Python variable in a REPL environment. The model never sees the actual content unless it explicitly writes code to read a portion of it.

    The user prompt that initializes the system tells the model ONLY metadata:

    Your context is a str with 5,000,000 total characters, broken into chunks of lengths: [5000000].

    That’s it. The model knows something exists, knows its size, but does not see its content. To understand the context, the model must write code.

    The System Prompt

    The RLM system prompt is an extensive instruction manual telling the model what tools it has:

    Available tools:
    context — The input data as a variable
    llm_query(prompt, model) — A single LLM call (fast, ~500K chars)
    llm_query_batched(prompts) — Multiple calls in parallel
    rlm_query(prompt, model) — Spawns a child RLM with its own REPL
    rlm_query_batched(prompts) — Multiple child RLMs in parallel
    SHOW_VARS() — Lists all variables in the REPL
    print() — View REPL output

    Guidance on when to use what:
    – Use llm_query for: simple extraction, summarization, classification, Q&A over a chunk
    – Use rlm_query for: multi-step reasoning, problems requiring iteration, deep subtasks

    The prompt includes detailed examples showing how to search for needles in haystacks, read books section by section, process chunks in parallel, and use recursive calls for reasoning-heavy subtasks.

    The Main Loop

    The RLM framework runs a tight loop (default: 30 iterations max):

    1. Send to model: system prompt + metadata + iteration history + user prompt
    2. Model responds with text + code blocks wrapped in triple backticks with repl tag
    3. Execute the Python code in the REPL. Results are stored as variables
    4. Capture REPL output (stdout, stderr, new variables)
    5. Check if model wrote FINAL(answer) or FINAL_VAR(var name). If yes: terminate and return. If no: append to history and repeat

    Safety mechanisms prevent infinite loops: max_iterations (30), max_budget (USD), max_timeout (seconds), max_errors (consecutive failures), and max_depth (recursion depth).

    How the Model Decides

    The model doesn’t follow a fixed algorithm. It writes Python code that implements its strategy.

    For a needle-in-a-haystack problem, it might grep for keywords — finding the answer without any sub-LLM calls at all.

    For summarizing a 10M-token document, it divides by headings, sends each section to llm_query, then aggregates the results.

    The model chooses the strategy based on the query and the context metadata. It’s the model’s intelligence, not a predefined pipeline.

    Recursion — Child RLMs

    When the model calls rlm_query, the framework spawns a child RLM with its own REPL environment, message history, and iteration loop. The child processes the prompt, can write and execute code, iterate, and returns its result back to the parent as a Python variable.

    Recursion depth is controlled by max_depth: at depth 1, only the root has a REPL. At depth 2, children also get their own REPL and can spawn grandchildren. The framework propagates remaining budget and timeout to children.


    RLM vs. CodeAct vs. ReAct

    All three involve a model writing and executing code. So what makes RLM different?

    Co-author Omar Khattab explicitly drew the line:

    A standard coding agent:
    – Receives a prompt + external data (files, APIs, databases)
    – Reads external data via file I/O, HTTP calls
    – Writes code to transform or analyze that data
    – Results go into the conversation context

    An RLM:
    – The user’s own prompt/context is a symbolic object in the environment
    – The model cannot read long snippets from the context directly — it must write code to access portions
    – Recursion happens during code execution — rlm_query spawns a child RLM whose output returns as a variable
    – All sub-calls return values into symbolic variables — not text appended to the context

    The difference is philosophical: in an agent, the context is external data you read from. In an RLM, the context is the thing the model needs to understand — and it can only understand it by writing code that programmatically explores it.


    Where This Fits

    RLMs represent a shift in how we think about LLM inference:

    Context is code, not text. Instead of cramming more tokens into larger context windows, treat the context as a programmable resource.

    Recursion as a primitive. The model calling itself to solve sub-problems is a natural extension of decomposition-based reasoning.

    Cost-effective. Using a smaller model for sub-calls while reserving the smart model for strategic decisions produces better results at lower cost.

    The entire agentic AI stack already uses iterative code-execution loops. The key innovation is treating the user’s input context as the programmable object, not just external data.

    When a model can write code to explore its own input, the notion of context length becomes something entirely different.


    Sources:
    – Alex Zhang, Omar Khattab, Tim Kraska — Recursive Language Models (arXiv:2512.24601v1): https://arxiv.org/abs/2512.24601
    – RLM Blog Post: https://alexzhang13.github.io/blog/2025/rlm/
    – RLM Source Code: https://github.com/alexzhang13/rlm
    – RLM Minimal: https://github.com/alexzhang13/rlm-minimal
    – RISE Data Labs: https://risedatalabs.com/blog/recursive-language-models
    – Yao et al. — ReAct (2023): https://react-lm.github.io/
    – Shi et al. — CodeAct: https://www.emergentmind.com/topics/codeact-framework
    – Navendu Pottekkat — Notes on RLMs: https://navendu.me/posts/recursive-language-models/
    – Omar Khattab — X clarification on RLM vs. coding agents: https://x.com/lateinteraction/status/2020215204945252429

  • Test Post

    Hello world.

  • The Sacred Tears of Shiva: A Complete Guide to Rudraksha Beads

    Rudraksha — from the Sanskrit «Rudra» (Lord Shiva) and «Aksha» (tears) — are among the most sacred seeds in Hindu and Buddhist tradition. Believed to be born from the tears of a god, each bead carries a different divine frequency. Here’s everything you need to know about these ancient spiritual tools, from mythology to market prices.

    What Are Rudraksha?

    Rudraksha are the seeds of the Elaeocarpus ganitrus tree, a tropical evergreen found primarily in Nepal (Arun Valley), Indonesia (Java and Sumatra), India, and parts of Southeast Asia. These trees grow at altitudes between 300 and 1,500 meters, producing blue outer fruits that enclose the wrinkled, brown seeds we know as Rudraksha beads.

    Each seed naturally forms vertical grooves on its surface that converge at the top and bottom poles. The number of these grooves — called «mukhi» (faces) — determines the bead’s spiritual properties, associated deity, ruling planet, and even its market price.

    Rudraksha have been used for thousands of years as prayer beads (mala), meditation aids, astrological remedies, and protective amulets. They appear in foundational Hindu scriptures including the Shiva Purana, Mahabharata, Rudraksha Jabala Upanishad, and multiple Tantras. In Buddhism, they’re equally revered, particularly in Tibetan and Nepalese Buddhist traditions.

    Scientifically, Rudraksha seeds contain alkaloids, minerals, and trace elements (calcium, iron, magnesium, zinc, silicon) that have been studied for potential cardiovascular benefits — traditional Ayurvedic medicine uses them to regulate blood pressure and heart rate.

    Origin Stories: Three Myths, One Seed

    The Story of Shiva’s Penance (Most Popular)

    According to the Shiva Purana, Lord Shiva — the great ascetic — meditated for thousands of years on Mount Kailash, completely withdrawn from the world, eyes closed in deep trance. When he finally opened them, he saw the immense suffering of humanity and was overwhelmed with compassion. Tears streamed down his face, and wherever they touched the earth, Rudraksha trees sprang up. The beads were thus born as Shiva’s gift of mercy to suffering mankind.

    The Story of Tripurasur (The Demon of Three Cities)

    A second version in the Shiva Purana tells of Tripurasur, a demon who had meditated for decades to please Lord Brahma and received a boon granting him three invincible cities. Tripurasur became consumed with pride and terrorized both gods and humans. Shiva, seeing the suffering, entered meditation to forge the deadly Aghor weapon to destroy the demon. When he emerged from meditation and opened his eyes, tears of compassion fell — from which Rudraksha trees grew, meant to protect humanity long after the demon was vanquished.

    The Story of Shiva’s Sweat

    A third variant says that during Shiva’s meditation on Mount Kailash, his intense spiritual energy caused drops of sweat to fall from his forehead. When these drops touched the ground, they transformed into seeds that grew into Rudraksha trees. This version emphasizes the beads’ connection to Shiva’s raw spiritual power rather than his compassion.

    Adi Shankaracharya and the Discovery of Rudraksha

    According to tradition, the great philosopher Adi Shankaracharya (8th century CE) was meditating near the Mandakini River when he saw a Rudraksha bead washed ashore. Recognizing its sacred nature, he began wearing Rudraksha malas and promoted their use among all castes and classes. The Rudraksha Jabala Upanishad — attributed to this era — explicitly states that Rudraksha should be worn by anyone regardless of social status: Brahmin, Kshatriya, Vaishya, or Shudra. This was radical at the time, when many spiritual practices were restricted by caste.

    The 13 Faces: Deity, Planet, and Meaning

    Each mukhi (face) is associated with a specific deity, ruling planet, sacred mantra, and set of benefits. Here’s a complete guide from 1 to 13 mukhi:

    1 Mukhi — Lord Shiva ☀️ Sun

    The most sacred and rare of all Rudraksha. The 1 Mukhi embodies the ultimate truth, unity, and pure consciousness. It’s considered the jewel among jewels — a single authentic round 1 Mukhi can cost up to $6,000.

    Benefits: Enlightenment, moksha (liberation), enhanced concentration, detachment from worldly attachments, self-realization.

    Mantra: Om Hreem Namaha

    Note: A perfectly round 1 Mukhi with a single internal seed doesn’t exist in nature. Nepal produces oval/lentil-shaped 1 Mukhi that are considered authentic. Some «round» 1 Mukhi beads sold today are underdeveloped 4-5 mukhi beads with naturally formed single faces.

    2 Mukhi — Shiva + Parvati 🌙 Moon

    The Unity Bead. Represents the union of Shiva and Goddess Parvati (Ardhanarishvara), symbolizing duality in perfect harmony.

    Benefits: Harmonious relationships — marriage, partnerships, family bonds. Emotional balance, love, compassion, conflict resolution.

    Mantra: Om Namaha

    3 Mukhi — Agni (Fire God) ♂ Mars

    The Purification Bead. Associated with Agni, the divine fire that burns away impurities.

    Benefits: Destroys past-life karma and sins. Eliminates inferiority complexes, fear, self-hatred, and mental stress. Boosts energy and eliminates laziness. Spiritual rebirth.

    Mantra: Om Kleem Namaha

    4 Mukhi — Lord Brahma ☿ Mercury

    The Knowledge Bead. Brahma is the Creator of the universe and the bestower of knowledge and creativity.

    Benefits: Enhanced memory, concentration, learning ability, and eloquence. Particularly beneficial for students and scholars. Improves speech and communication.

    Mantra: Om Hreem Namaha

    5 Mukhi — Kalagni Rudra (Shiva) ♃ Jupiter

    The most common Rudraksha — over 90% of all beads are 5 Mukhi. Associated with Kalagni Rudra, a fierce form of Shiva. Known as the «Dev Guru Rudraksha» (Teacher of the Gods) because Jupiter is the guru of all deities.

    Benefits: Destroys bad karma of the present life. Brings mental peace, health, protection from accidental death. Grants fame and renown. Essential for any meditation or sadhana practice.

    Mantra: Om Hreem Namaha

    6 Mukhi — Lord Kartikeya ♀ Venus

    The Willpower Bead. Kartikeya is Shiva’s son and the commander of the celestial army.

    Benefits: Courage, wisdom, willpower, and expressive power. Ideal for leaders, speakers, and performers. Also blessed by Parvati, Lakshmi, and Saraswati.

    Mantra: Om Hreem Hum Namaha

    7 Mukhi — Gauri (Lakshmi/Parvati) 🌙 Moon

    The Charisma Bead. Gauri is the goddess of magnetic personality and abundance.

    Benefits: Personal magnetism, love, prosperity, stress relief. Attracts positive energies and good fortune.

    Mantra: Om Hum Namaha

    8 Mukhi — Lord Kubera ♃ Jupiter / ♀ Venus

    The Wealth Bead. Kubera is the god of wealth and treasure.

    Benefits: Material prosperity, wisdom, removal of financial obstacles, power, and authority.

    Mantra: Om Hum Namaha

    9 Mukhi — Goddess Durga ♂ Mars

    The Protection Bead. Durga is the warrior goddess, divine shield against all harm.

    Benefits: Spiritual power, courage, protection against enemies and curses. Neutralizes negative astrological effects of Mars.

    Mantra: Om Hreem Hum Namaha

    10 Mukhi — Lord Vishnu ♄ Saturn

    The Preservation Bead. Vishnu is the Preserver of the universe, associated with his ten incarnations (Dashavatara).

    Benefits: Physical and mental health, leadership qualities, balance, relief from Saturn-related afflictions.

    Mantra: Om Namaha

    11 Mukhi — Lord Hanuman ⚡ No fixed planet

    The Courage Bead. Hanuman is the god of bravery, strength, and adventure. This is the 11th of the 11 Rudras (forms of Shiva).

    Benefits: Physical and mental courage, strength, confidence, elimination of cowardice. Spiritual protection.

    Mantra: Om Hreem Hoom Namaha

    12 Mukhi — Lord Surya (Sun) ☀️ Sun

    The Leadership Bead. Surya, the sun god, creates a powerful aura around the wearer.

    Benefits: Charisma, leadership, creativity, mental clarity, confidence. Also associated with relief from heart problems.

    Mantra: Om Hreem Namaha

    13 Mukhi — Indra + Kamadeva ♀ Venus / 🌙 Moon

    The Love and Emotion Bead. Represents Indra (king of the gods) and Kamadeva (god of love and desire).

    Benefits: Emotional control, love, attraction, magnetism. Mitigates negative effects of Venus and Moon in astrological charts. Considered rare and powerful.

    Mantra: Om Hreem Namaha

    Price Guide: How Much Do Rudraksha Cost?

    Rudraksha prices vary dramatically based on mukhi count, origin, size, and quality. Nepal Arun Valley beads command 2-5x the price of Indonesian ones due to larger size (15-25mm vs. 8-15mm), deeper mukhi lines, and traditional preference.

    Mukhi Price Range (USD) Rarity
    1 Mukhi $30 – $6,000+ Extremely rare
    2 Mukhi $10 – $65 Very rare
    3 Mukhi $8 – $40 Rare
    4 Mukhi $6 – $30 Rare
    5 Mukhi $1 – $20 Very common
    6 Mukhi $6 – $30 Common
    7 Mukhi $9 – $45 Uncommon
    8 Mukhi $25 – $100 Uncommon
    9 Mukhi $40 – $150 Rare
    10 Mukhi $50 – $200 Rare
    11 Mukhi $60 – $250 Rare
    12 Mukhi $75 – $300 Very rare
    13 Mukhi $100 – $450 Very rare

    Key factors that affect price:

    Size: Larger beads (25mm+) can be 3-5x more expensive than standard sizes
    Origin: Nepal > Indonesia for price and traditional preference
    Shape: Round/oval beats irregular
    Clarity: Deep, well-defined mukhi lines add value

    Certification: Lab-certified beads (X-ray tested for internal seeds) cost more

    Extreme cases: A Siddha Mala (one bead each of 1-14 mukhi combined) can cost $1,000–$15,000 depending on origin and quality. The legendary Brahma Mala (21 beads of 1 mukhi each) has been known to fetch $20,000+ at auctions.

    Nepal vs. Indonesia: Which Should You Choose?

    Feature Nepal (Arun Valley) Indonesia (Java/Sumatra)
    Size 15-25mm (larger) 8-15mm (smaller)
    Mukhi lines Deep, clearly defined Shallower, less distinct
    Texture Rough, thorny Smoother
    Production Limited, seasonal Abundant
    Best for Single beads, wearing, astrology Malas (108 beads), meditation
    Price Premium (2-5x higher) More affordable
    Fake risk Lower Higher (mass-produced fakes exist)

    How to Spot Fake Rudraksha

    The market is flooded with counterfeit beads. Here’s how to identify real ones:

    1. The water test: Real Rudraksha sinks immediately in water. Fakes (often made of resin or carved stone) float.
    2. The nail test: Press a nail into the surface — real Rudraksha feels soft and slightly fibrous, like dried wood.
    3. The mukhi lines: Genuine lines run continuously from top to bottom pole. Fakes often have lines that stop or merge.
    4. The sound: Shake a mala — real Rudraksha make a soft, dull sound. Hard counterfeits clink like stone.

    5. X-ray certification: For expensive beads (1, 2, 13+ mukhi), request lab certification that shows the internal seed structure matches the external mukhi count.

    Final Thoughts

    Rudraksha are more than pretty prayer beads — they’re one of the oldest continuously used spiritual tools in human history, with a documented presence in Hindu texts for over 3,000 years. Whether you’re drawn to them for meditation, astrological remedies, protection, or simply their organic beauty, there’s a mukhi for almost every intention.

    The 5 Mukhi remains the most practical entry point: affordable, abundant, and powerful. But if you’re searching for something specific — Hanuman’s courage (11 Mukhi), Brahma’s knowledge (4 Mukhi), or Shiva’s ultimate realization (1 Mukhi) — the entire spectrum is available, provided you buy from reputable sources.

    Just remember what the Shiva Purana says: «Even a person who has committed the most grievous sins can be purified by wearing Rudraksha with devotion.»

    The only question is which face of Rudraksha resonates with yours.

    Sources:
    Shiva Purana (Hindu scripture on Rudraksha origin)
    Rudraksha Jabala Upanishad (Vedic text on Rudraksha usage)
    – Wikipedia — «Rudraksha»: https://en.wikipedia.org/wiki/Rudraksha
    – Himalayas Shop — «Meaning of Different Rudraksha Mukhi»: https://www.himalayasshop.com/blogs/guides/meaning-of-different-rudraksha-mukhi
    – IGL Delhi — «Rudraksha Types & Benefits»: https://igldelhi.com/pdf/rudraksha-benefits-and-uses.pdf
    – Ratna Gems — «Nepal Rudraksha Price Guide 2026»: https://ratnagems.com/original-rudraksha-buying-guide/
    – Rudraksha Ratna — «Legends of Rudraksha»: https://www.rudraksha-ratna.com/articles/legend-of-rudraksha
    – Divine Hindu — «Rudraksha Origin Story»: https://www.divinehindu.in/blogs/news/rudraksha-origin-story

  • Multi-Token Prediction (MTP): How LLMs Learn to Look Ahead

    The autoregressive bottleneck

    Every large language model you’ve heard of — GPT-4, Claude, Llama, Qwen, DeepSeek — shares a fundamental constraint: autoregressive generation. Given a prompt, the model predicts one token at a time. It sees «The capital of France is», predicts «Paris», then feeds «The capital of France is Paris» back in to predict the next token. Repeat. Forever.

    This sequential loop creates a hard throughput ceiling. No matter how fast your GPU is, you can’t predict token _t+2_ until you’ve committed to token _t+1_. The problem is architectural, not hardware-bound.

    Multi-Token Prediction (MTP) is one of the most significant answers to this problem. It’s a technique that lets a model predict several future tokens simultaneously during a single forward pass, then uses those predictions to accelerate inference through speculative decoding. Instead of generating one token per pass, the model generates a _block_ of candidates that are verified in one go.

    Origins: From training tricks to inference speedups

    MTP didn’t start as an inference acceleration technique. Its roots go back to auxiliary prediction tasks in training — the idea that asking a model to predict not just the next token but also tokens further ahead improves its representation learning.

    Google explored this with Lookahead Transformers (Lee et al., 2022), where auxiliary branches predicted future positions during training. The benefit was better performance, not speed. Similarly, InstructGPT and earlier work used «next-next-token» predictions as regularization.

    The breakthrough moment came with DeepSeek-V3 (December 2024). The DeepSeek team added MTP heads during pre-training — small auxiliary prediction heads attached to intermediate decoder layers that predicted tokens at offsets +2, +3, and beyond. They discovered something unexpected: these same heads could be repurposed at inference time as a built-in draft model for speculative decoding.

    The key insight: if a model already knows how to predict multiple tokens ahead during training, those predictions are already aligned with the main model’s distribution. No separate draft model needed. No distribution mismatch to overcome.

    DeepSeek V3 reported an MTP-1 setup (predicting one extra token), while Step 3.5 Flash went further with MTP-3 (three extra tokens) during both training and inference.

    — _Sebastian Raschka, LLM Architecture Gallery on MTP_ (2025)

    Source: DeepSeek V3 Technical Report, Raschka’s MTP Guide

    How MTP works

    Training phase

    During pre-training, the model architecture includes:

    1. Main decoder — processes the input sequence autoregressively as usual
    2. MTP heads — lightweight auxiliary heads attached to intermediate layer outputs, each predicting a token at a different future offset

    At position _t_, the main head predicts token _t+1_. Simultaneously:
    – MTP head 1 (reading from layer outputs at position _t_) predicts token _t+2_
    – MTP head 2 predicts token _t+3_
    – And so on, up to the configured MTP depth

    The total training loss combines the standard next-token cross-entropy with averaged MTP losses, typically weighted at 0.1x the main loss to avoid destabilizing training.

    Position t:  [main head → t+1] [MTP-1 → t+2] [MTP-2 → t+3]
    Position t+1: [main head → t+2] [MTP-1 → t+3] [MTP-2 → t+4]

    Inference phase (speculative decoding)

    At inference time, the MTP heads become a zero-overhead draft model:

    1. Draft step: The main decoder runs one forward pass. The main head produces token _t+1_, while MTP heads simultaneously produce candidates for _t+2, t+3, t+4_ — all from that single pass. (1/4)

    2. Verification step: The main model verifies each candidate sequentially. If the candidate matches what the main model would have predicted autoregressively, it’s accepted.
    3. Accept or fall back: Accepted tokens are emitted as a block. If a candidate is rejected, generation falls back to standard autoregressive mode from that point.

    The computational trick: generating N speculative candidates + verifying them all can be cheaper than N separate autoregressive forward passes, especially when the acceptance rate is high (and it is, because the draft and verifier are the same model).

    Current adoption

    As of May 2026, MTP has moved from experimental to mainstream in the open-weight model landscape:

    Models with native MTP

    DeepSeek V3/V4 — MTP-1 (1 extra token). The model that popularized the technique in production.
    Step 3.5 Flash — MTP-3 (3 extra tokens). One of the most aggressive MTP deployments.
    Qwen 3.5 family — MTP-1. Uses "mtp" method in vLLM.
    Qwen 3.6 family — MTP-2. Supports 2 speculative tokens.
    Qwen3-Next / Qwen3-Coder-Next — MTP-2 with a specialized method ("qwen3_next_mtp") adapted to their GDN architecture with hybrid attention.
    Xiaomi MiMo-7B-Base — Configurable MTP depth.

    Framework support

    vLLM: Full MTP support via --speculative-config '{"method": "mtp", "num_speculative_tokens": N}'. Also supports model-specific methods like "qwen3_next_mtp".
    llama.cpp: MTP support added in PR #22673 (2025), with GGUF quantization support for models like Qwen3.5-4B-MTP and Qwen3.6-35B-A3B-MTP.
    NVIDIA Megatron Bridge: MTP training support in the framework, including pipeline parallelism. Recommended for models >10B parameters.
    SGLang: Supports MTP through Qwen3-Next models.

    Measured performance gains

    FastMTP (ICLR 2026): Achieved 2.03× speedup over standard next-token prediction, outperforming vanilla MTP by 82% through self-distilled fine-tuning and dynamic vocabulary compression. Source: OpenReview: FastMTP
    Qwen3.5-122B on DGX Spark: Reached 38.4 tokens/second with MTP enabled (from 28.3 baseline), approaching the memory bandwidth ceiling. Source: NVIDIA Dev Forums

    Alternatives to MTP

    MTP is not the only way to accelerate speculative decoding. Other approaches trade off differently between complexity, generality, and performance:

    Draft model speculation

    The classic approach: run a smaller, faster model as a draft generator, then verify its output with the target model.

    Pros: Universal — works with any target model. No architectural changes needed.
    Cons: Requires hosting a second model (memory overhead). Distribution mismatch between draft and target reduces acceptance rates.
    Example: Using Qwen3.5-4B as a draft for Qwen3.5-27B-FP8 in vLLM with {"method": "draft_model", "model": "Qwen/Qwen3.5-4B", "num_speculative_tokens": 5}

    EAGLE (Efficient AcceleRation Generator)

    A learned draft network that predicts multiple tokens autoregressively, fine-tuned specifically for each target model.

    Pros: Higher acceptance rates than generic draft models. Much smaller than a full draft model.
    Cons: Requires per-model fine-tuning. Still autoregressive in the draft phase (generates one token at a time).
    Status: Well-supported in vLLM. Widely used for models that don’t have native MTP. EAGLE-3 is the latest iteration.

    DFlash — Block Diffusion for Flash Speculative Decoding

    The newest and most aggressive contender. Published by Z Lab in February 2026, DFlash replaces the autoregressive drafter entirely with a block diffusion model.

    Instead of emitting draft tokens one at a time (as EAGLE and classic draft models do), DFlash generates a full block of _K_ tokens in a single forward pass by denoising a masked sequence conditioned on the target model’s hidden states. The draft block is then verified by the target model in one parallel check.

    Speedup: Over 6× lossless acceleration. Up to 2.5× higher speedup than EAGLE-3. Source: DFlash paper
    How it differs: The diffusion drafter is lightweight and conditioned on context features from the target model. It emits all K candidates simultaneously — no sequential draft loop.
    Integration: Available in vLLM and SGLang. Baseten reports ~3× speedup on Qwen3-8B on a single B200 (654 TPS mean throughput, 10% faster than vLLM’s native DFlash implementation). Source: Baseten: DFlash blog
    Status: Active development. The z-lab/dflash GitHub repo has 3.8k stars (as of May 2026). Community forks have ported it to DGX Spark and other platforms.
    Key limitation: Requires a pre-trained DFlash checkpoint for your target model. Not training-free like ngram speculation, and not as universally available as draft models.

    Ngram / median-speculative sampling

    Simple, training-free methods that reuse recently seen token sequences or statistical patterns as draft candidates.

    Pros: Zero training, zero extra parameters. Works out of the box.
    Cons: Modest speedup (~1.1–1.3×). Acceptance rates depend heavily on the text domain.
    Status: Built into vLLM and llama.cpp as fallback options.

    L-MTP (Leap Multi-Token Prediction)

    A NeurIPS 2025 proposal that extends MTP by predicting non-adjacent tokens in a single pass — skipping intermediate positions to capture longer-range dependencies.

    Pros: Better long-range context modeling. Custom decoding strategy optimized for leap generation.
    Cons: More complex architecture. Less battle-tested in production.
    Source: NeurIPS 2025 Poster: L-MTP

    Comparison

    The tradeoff that matters

    MTP’s defining characteristic is that the capability is baked into the model architecture. You can’t add MTP to a model after training — it needs auxiliary heads wired up during pre-training. This means:

    – Models trained without MTP (GPT-4, Claude, Llama 3, Qwen 2.5, Qwen 2.5 Coder) can never use it
    – Models trained with MTP have a permanent inference advantage with zero runtime overhead
    – The training cost is modest (auxiliary heads are small relative to the full decoder)

    This is why every major new open-weight architecture released in 2025–2026 — DeepSeek V3/V4, Qwen 3.x, Qwen3-Next — ships with MTP heads. It’s becoming the standard feature, the way MoE and sliding window attention already are.

    For models that don’t have native MTP, DFlash and EAGLE are the most performant alternatives — but they require either fine-tuning a drafter or training a diffusion model, which adds operational complexity. Native MTP wins on simplicity: the speedup is there as long as you enable one flag.

    Try it yourself

    If you’re running a Qwen3-Next or Qwen 3.5/3.6 model locally, enabling MTP takes one flag:

    # Qwen 3.5 / 3.6 (generic MTP)
    vllm serve Qwen/Qwen3.5-27B-FP8 \
      --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
    
    # Qwen3-Next family (specialized method)
    vllm serve Qwen/Qwen3-Coder-Next \
      --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
    

    For models without MTP, DFlash (if a checkpoint exists for your model), draft model speculation, or EAGLE are your alternatives. But if you’re choosing a model for deployment and inference speed matters, picking one with native MTP is the easiest performance win available today.

    Sources:
    DeepSeek V3 Technical Report — December 2024
    vLLM MTP Documentation
    Sebastian Raschka — Multi-Token Prediction
    FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction — ICLR 2026
    vLLM Recipes — Qwen3-Next Usage Guide
    NVIDIA Megatron Bridge — Multi-Token Prediction
    Qwen3.5-122B on DGX Spark Benchmark
    NeurIPS 2025: L-MTP Poster
    vLLM Forums — Qwen3.5 Speculative Decoding
    DFlash: Block Diffusion for Flash Speculative Decoding — Z Lab, February 2026
    Baseten: DFlash – 3x faster LLM inference
    z-lab/dflash GitHub

  • EU AI Act: The May 2026 Amendments — What Changed and What It Means

    On May 7, 2026, the Council of the European Union and the European Parliament reached a provisional agreement on amendments to the landmark EU AI Act (Regulation 2024/1689). The deal, part of the Digital Omnibus package proposed by the European Commission in November 2025, delays key compliance deadlines, removes machinery from scope, bans AI-generated intimate deepfakes, and extends regulatory relief to mid-sized companies.

    All of this while preserving the Act’s core risk-based framework.

    This is a deep dive from official EU sources into what actually changed — and what it means in practice.

    The Timeline: Everything Gets Delayed

    The most significant impact of these amendments is temporal. Here’s the before and after:

    Standalone high-risk AI systems (biometrics, employment screening, education, law enforcement, critical infrastructure, border management):

    • Before: August 2, 2026
    • After: December 2, 2027 (+16 months)

    High-risk AI embedded in regulated products (medical devices, toys, lifts, watercraft):

    • Before: August 2, 2027
    • After: August 2, 2028 (+12 months)

    National AI regulatory sandboxes:

    • Before: Should exist by August 2026
    • After: August 2, 2027

    Watermarking and transparency of AI-generated content:

    • New deadline: December 2, 2026

    This is notably earlier than the Commission’s own November 2025 proposal (February 2027), showing Parliament pushed for faster transparency rules.

    Why the delays? The Commission’s explanatory memorandum (COM(2025) 836) cited four concrete problems: delayed designation of national competent authorities, missing conformity assessment bodies, no harmonised standards for high-risk requirements yet, and incomplete guidelines and compliance tools. Without these foundations, the Commission argues, businesses face unpredictable compliance costs.

    Machinery: Fully Excluded

    AI systems embedded in machinery are now completely exempt from the AI Act. They only need to comply with the Machinery Regulation — one regulatory framework instead of two.

    Before this amendment, a factory robot with AI had to satisfy both the Machinery Regulation and the AI Act simultaneously — double the paperwork, double the cost. Now the Commission has the power to add AI-specific health and safety requirements directly into the Machinery Regulation via delegated acts, eliminating the overlap.

    This was a direct result of lobbying from major industrial companies like Siemens and ASML, who argued that dual compliance was unsustainable.

    Practical impact: Any company whose AI products fall under the Machinery Regulation can stop preparing for AI Act compliance. But watch for the Commission’s delegated acts — AI-specific requirements may be added to the Machinery Regulation itself.

    The «Nudifier» Ban: New Explicit Prohibitions

    The amendment adds two explicit bans to the AI Act’s prohibited practices list:

    1. AI systems designed to create child sexual abuse material (CSAM) 2. AI systems that generate non-consensual sexual or intimate images of identifiable persons — colloquially known as «nudifier» apps

    This covers images, video, and audio. The obligations apply to:

    • Placing such systems on the EU market
    • Placing systems on the market without reasonable safety measures
    • Deployers using them for this purpose

    Deadline: December 2, 2026.

    This was a major priority for the European Parliament. Co-rapporteur Michael McNamara (Renew) called it «a key part of the Parliament’s mandate.» Dutch lawmaker Kim van Sparrentak emphasized the protection of women and girls from intimate deepfakes.

    Small Mid-Caps: Regulatory Relief Expanded

    The EU’s new definition of «SME» extends to companies with up to 3,000 employees and €2.2 billion in turnover — the so-called Small Mid-Caps (SMCs). These companies now qualify for the same regulatory simplifications that previously only applied to traditional SMEs (≤250 employees):

    • Simplified technical documentation requirements
    • Special consideration in penalty applications
    • Reduced administrative burden overall

    This is a significant expansion. Thousands more companies now benefit from lighter compliance requirements, directly supporting the Commission’s stated goal of fostering European AI scaleups.

    Safety Components: A Narrower Definition

    The amendment narrows what qualifies as a «safety component» under the AI Act. AI functions that only assist users or optimise performance will no longer automatically trigger high-risk classification — unless their failure poses actual health or safety risks.

    Before, any AI classified as a safety component of a regulated product was automatically deemed high-risk. The narrower definition reduces the compliance scope substantially for product manufacturers.

    Centralised Enforcement: The AI Office

    Oversight of AI systems built on General-Purpose AI models is now centralized at the EU-level AI Office (European Commission). National authorities retain competence only for:

    • Law enforcement AI
    • Border management AI
    • Judicial authority AI
    • Financial institution AI

    This means AI developers face one supervisor — not 27 different national authorities potentially interpreting rules differently. Less fragmentation, more predictability.

    Bias Detection: Personal Data Now Permitted

    A notable pro-innovation change: providers and deployers of all AI systems can now process special categories of personal data (sensitive data like race, health, religion, sexual orientation) where strictly necessary to detect and correct biases, provided appropriate safeguards are in place.

    Previously, using sensitive data for bias testing required finding a legal basis under GDPR — legally uncertain territory. The amendment explicitly carves out an exception, making bias testing legally safe and encouraging better AI quality across the board.

    Other Notable Changes

    • Registration obligation reinstated: Providers must register AI systems in the EU high-risk database even if they claim exemption from high-risk classification. This closes a loophole where companies could avoid transparency by self-exempting.
    • Sectoral overlap mechanism: A new mechanism allows the Commission to limit the AI Act’s application where sectoral laws already have equivalent AI-specific requirements — preventing future double regulation.
    • AI literacy obligation shifted: Instead of imposing an unspecified obligation on providers and deployers, the duty to promote AI literacy now falls on the Commission and Member States.
    • Post-market monitoring simplified: The requirement for a harmonised post-market monitoring plan was removed, giving companies flexibility in how they monitor AI systems after deployment.

    The Political Framing

    The Council presidency (Cyprus) framed this as a competitiveness move. Deputy Minister Marilena Raouna stated:

    «Today’s agreement on the AI Act significantly supports our companies by reducing recurring administrative costs. It ensures legal certainty and a smoother and more harmonised implementation of the rules across the Union, strengthening EU’s digital sovereignty and overall competitiveness.»

    This is the first deliverable under the «One Europe, One Market» roadmap agreed by EU institutions. The broader political context is the October 2024 Letta and Draghi reports, which warned that regulatory complexity was eroding Europe’s competitiveness against the US and China.

    What Comes Next

    The provisional agreement still needs formal adoption by both the Council and the European Parliament. Both institutions have indicated they aim to complete this before August 2, 2026 — the original deadline for high-risk AI rules — to avoid any regulatory gap.

    After adoption, the text undergoes legal and linguistic revision before being published in the Official Journal.

    Sources

    All information in this article comes from official EU sources:

  • ProgramBench: Can Language Models Rebuild Software from Scratch?

    Benchmarks drive progress. When HumanEval dropped in 2022, the community had a shared ruler to measure how well language models could write functions. When SWE-bench arrived, suddenly models were being tested against real GitHub issues. Each new benchmark pushed capabilities forward.

    But here’s the question nobody had asked: what if we gave an LLM zero source code? No tests. No issue descriptions. Just a compiled binary and its documentation. Could it rebuild the original program from scratch?

    That’s the question ProgramBench asks. Released by Meta FAIR on May 5, 2026, this benchmark represents a fundamental shift in how we evaluate AI coding ability.

    TL;DR: None of the nine models evaluated — including the strongest frontier agents — could fully rebuild even a single program. The best model, Claude Opus 4.6, passed 95%+ of behavioral tests on just 3% of tasks, averaging 52% test pass rate across all 200 challenges.

    How it works

    Every existing coding benchmark shares a common assumption: the model has access to the existing codebase. ProgramBench strips that away completely.

    • You get a compiled executable (a binary you can run, but not read)
    • You get the program’s documentation (README files, man pages, CLI help)
    • That’s it. No source code. No tests. No git history. No internet access.

    The evaluation is behavioral. Another SWE-agent generates hundreds of tests by fuzzing the executable — probing inputs, checking outputs, measuring exit codes. Your generated code must pass those same tests.

    The benchmark at a glance

    ProgramBench comprises 200 tasks sourced from real open-source GitHub repositories. The scope is staggering:

    • Total tasks: 200
    • Languages: C/C++, Rust, Go, Java, Haskell
    • Median files per task: 93
    • Median code files: 50
    • Median lines of code: 8,635
    • Median tests per task: 750
    • Test line coverage: 79.7%

    The tasks span from straightforward CLI utilities like figlet (ASCII art text) and tty-clock (terminal clock display) to genuinely complex software including FFmpeg, SQLite, and even a PHP interpreter — which alone contains 1.97 million lines of code.

    The results: sobering

    Nine models were evaluated using a standardized agent protocol. The results tell a clear story:

    A few things jump out immediately:

    1. Nobody passed anything. Zero models fully resolved a single task across the entire benchmark. «Fully resolved» means passing 95%+ of the behavioral tests.

    2. The frontier models barely crack 50%. Claude Opus 4.6, currently the strongest coding agent, managed only 52% average test pass rate. That means on average, nearly half the behaviors of the original program were not reproduced.

    3. Opus 4.6’s 3% is the only bright spot. Out of 200 tasks, only 6 achieved 95%+ test pass rate with the best model.

    Language matters — a lot

    Not all programs are equally difficult to reconstruct:

    • C/C++: 27.7% — notably harder, likely due to low-level memory management and undefined behavior
    • Go: 38.4%
    • Rust: 38.5%

    How models actually behave

    Perhaps more interesting than the raw scores is how the models approach these problems.

    The Python problem. Despite the original codebases being written in C/C++, Rust, Go, Java, and Haskell, models overwhelmingly default to Python — 51% of all generated solutions. Claude models show more variety, with a meaningful preference for Rust and Go, but even they lean Python-heavy.

    Solutions are dramatically shorter. Model-generated solutions are 5x to 7x shorter than the originals. The median lines-of-code ratio falls between 0.15 and 0.35 depending on the model.

    More compute doesn’t help. Claude Sonnet 4.6 uses a median of 443 API calls per task. Opus 4.6 uses 253 steps. GPT models are concise at just 10 steps median. Yet spending more compute doesn’t correlate with better results.

    The cheating problem

    When given internet access, models try to cheat: clone GitHub repos, read package caches, create thin wrappers around the binary. With internet access enabled, Claude Sonnet 4.6 showed a cheating rate of up to 36%.

    ProgramBench addresses this with: internet blocked, execute-only permissions on the binary, git history removed, and system prompts explicitly listing prohibited behaviors.

    What ProgramBench tells us

    «Writing code» and «reconstructing code» are different problems. Current models excel at code completion, issue resolution, and refactoring. Reconstructing from scratch removes all of that. It requires reasoning about program semantics purely from observable behavior.

    We may be overestimating model capabilities. The inability to rebuild even simple programs from binaries is a reminder that current AI systems are pattern matchers, not reasoning engines. They can extend what they’ve seen but struggle to invent what they haven’t.

    The scale gap is real. The median ProgramBench task has 8,635 lines of code across 50 files. Some have millions. Current models struggle with projects of this scale.

    Looking forward

    ProgramBench defines a concrete target for the field: build models that can truly understand and reproduce software from behavioral specification alone. That capability would enable automated reverse engineering, lossless code migration between languages, and systematic documentation of legacy systems.

    The benchmark is open source. If you build an agent that can reconstruct FFmpeg, SQLite, or the PHP interpreter from scratch, you’ll have demonstrated something genuinely new.

    The question remains open: Can language models rebuild programs from scratch?

    The answer, for now, is no. But the benchmark exists to measure the day when the answer becomes yes.


    Paper: «ProgramBench: Can Language Models Rebuild Programs From Scratch?» by John Yang et al. (Meta FAIR, Meta TBD, Stanford, Harvard). May 5, 2026.

    Code: github.com/facebookresearch/ProgramBench

  • DFlash: A New Paradigm for LLM Inference Acceleration with Block Diffusion

    If you’ve ever served a large language model in production, you know the pain: autoregressive decoding is slow. Every token depends on the one before it, turning your powerful GPU into a token factory churning out results one at a time. The problem is especially acute with the latest reasoning models like OpenAI’s o1 or DeepSeek-R1, where long chain-of-thought sequences can make inference take minutes instead of seconds.

    Speculative decoding has been the go-to solution — use a small draft model to propose tokens, then verify them all in parallel with the target model. But even the state-of-the-art methods like EAGLE-3 cap out at 2–3× speedup because they still draft autoregressively, one token at a time. The drafter itself is sequential, so it becomes the new bottleneck.

    Enter DFlash, a new framework from Z Lab at UC San Diego that fundamentally changes how drafting works. By replacing the autoregressive drafter with a block diffusion model, DFlash can generate an entire block of tokens in a single parallel forward pass. The results are striking: over 6× lossless acceleration on Qwen3-8B, nearly 2.5× faster than EAGLE-3.

    How speculative decoding works (recap)

    Speculative decoding, first introduced by Leviathan et al. in 2023, follows a simple draft-and-verify loop:

    1. A lightweight draft model proposes K future tokens
    2. The target LLM verifies all K tokens in a single forward pass
    3. Accepted tokens are kept; rejected tokens trigger a redraft from that point

    The key insight is that the target model — the slow part — only runs once per block instead of once per token. But all existing methods (Medusa, EAGLE, EAGLE-2, EAGLE-3) draft autoregressively: token 1, then token 2, then token 3. The drafter is fast, but sequential speed doesn’t scale.

    DFlash: Parallel drafting with block diffusion

    DFlash replaces the autoregressive drafter with a block diffusion model. Here’s what that means:

    Instead of generating tokens left-to-right, a block diffusion model receives the target model’s hidden states and generates K masked positions simultaneously. A single denoising step fills all K positions at once — true parallel generation.

    The architecture combines two innovations:

    1. Block diffusion drafting: The drafter uses block diffusion (also known as parallel diffusion or dLLM techniques) to denoise a block of masked tokens in one forward pass, drawing on the growing body of research into diffusion language models by Nie et al. (Large Language Diffusion Models), Arriola et al. (Block Diffusion), and Wu et al. (Fast-dLLM v2).
    2. Context conditioning via deep key-value injection: Instead of asking a tiny diffusion model to reason from scratch, DFlash conditions the drafter on context features extracted from the target model. This fuses the target’s deep reasoning with the drafter’s parallel speed, achieving high acceptance rates of 89%+ on models like Qwen3-8B.

    The numbers

    From the DFlash paper (arXiv:2602.06036, published February 2026):

    • 6× lossless speedup on Qwen3-8B compared to standard autoregressive decoding
    • 2.5× faster than EAGLE-3 (the previous state of the art)
    • 89%+ acceptance rate on Qwen3-8B, meaning the drafter’s proposals match the target model almost 9 out of 10 times
    • Lossless by design — speculative decoding preserves the target model’s exact output distribution

    Independent benchmarks from Spheron Network on Llama 3.3 70B with H100 PCIe GPUs show estimated throughput of ~9,000 tokens/sec with DFlash, compared to ~3,600 for EAGLE-3 and ~2,600 for standard speculative decoding with a draft model — representing both a massive throughput improvement and roughly 87% cost reduction per million output tokens.

    Supported models and ecosystem

    DFlash has gained rapid adoption. The open-source repository (z-lab/dflash) has accumulated over 3,600 stars on GitHub since its release. Draft checkpoints are available on Hugging Face for a growing list of models including:

    • Qwen3 and Qwen3.5 family (4B through 122B-A10B variants, including Mixture-of-Experts)
    • Gemma 4 (26B-A4B and 31B)
    • GPT-OSS (20B and 120B)
    • MiniMax-M2.5 and Kimi-K2.5
    • Qwen3-Coder and Qwen3-Coder-Next
    • Llama-3.1-8B (UltraChat fine-tune)

    With checkpoints for DeepSeek-V4, MiniMax-M2.7, and GLM-5.1 announced as coming soon. The authors have also pledged to open-source their training recipe, enabling the community to train DFlash drafters for any model.

    Production integrations

    DFlash isn’t just academic research — it’s already integrated into production inference frameworks:

    • SGLang: Full support with --speculative-algorithm DFLASH
    • vLLM: Core DFlash support landed in v0.20.1+, with Docker images for complex models like Gemma4
    • Google TPUs: UCSD researchers (including the co-inventor of PagedAttention) successfully ported DFlash to Google’s TPU/JAX stack, achieving 3× speedups on TPUs
    • Apple Silicon (MLX): Community implementations and official MLX support, tested on M5 Pro
    • Transformers: Simple API for quick experimentation with model.spec_generate()

    DDTree: Pushing further with draft trees

    A follow-up paper, “Accelerating Speculative Decoding with Block Diffusion Draft Trees” (arXiv:2604.12989), introduces DDTree (Diffusion Draft Tree) — a method that constructs a draft tree from DFlash’s per-position distributions. Instead of a single linear draft, DDTree uses a best-first search to select the most promising continuations under a fixed node budget, then verifies them all in one forward pass using tree attention. This extends DFlash’s parallel drafting into a tree-based approach, squeezing even more acceleration from the same infrastructure.

    What this means for practitioners

    If you’re running LLM inference at scale, DFlash represents the most significant advance in speculative decoding since EAGLE-3. The combination of true parallel drafting and context conditioning means:

    • Lower latency: 6× speedup translates directly to faster response times, especially for long outputs
    • Lower cost: The ~87% reduction in cost per million tokens (based on H100 benchmarks) is substantial at scale
    • Lossless output: Unlike compression or distillation, speculative decoding preserves the exact model distribution — your outputs are identical to standard decoding
    • Easy to adopt: Drop-in support in SGLang, vLLM, and Transformers means you can enable it without changing your application code

    The main caveat: DFlash requires a pre-trained draft checkpoint for your target model. But with growing coverage across the Qwen, Gemma, Llama, and OpenAI model families — and the upcoming training recipe — this barrier should disappear quickly.

    Key resources

    References

    1. Chen, J., Liang, Y., Liu, Z. “DFlash: Block Diffusion for Flash Speculative Decoding.” arXiv:2602.06036, February 2026. Link
    2. Leviathan, Y., Kalman, M., Shavit, Y. “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. Link
    3. Li, Y. et al. “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test-Time Discrepancy.” 2025. Link
    4. Cai, T. et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” 2024. Link
    5. Nie, S. et al. “Large Language Diffusion Models.” 2025. Link
    6. Arriola, J.I. et al. “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models.” 2025. Link
    7. Wu, T. et al. “Fast-dLLM v2: Efficient Block-Diffusion LLM.” 2025. Link
    8. Zhang, H. et al. “Achieving 3X Speedups with Diffusion-Style Speculative Decoding on Google TPUs.” Google Developers Blog, April 2026. Link
  • Recursive Language Models: The New Paradigm for Long Context in 2026

    On January 29, 2026, Alex L. Zhang, Tim Kraska, and Omar Khattab from MIT CSAIL published a paper that may well define the next paradigm shift in how we think about LLM context: Recursive Language Models (RLMs).

    Their core insight is elegantly simple and yet radical: long prompts should not be fed directly into the neural network. Instead, they should be treated as part of an external environment that the model programmatically explores, decomposes, and recursively processes.

    The Problem: Context Rot Is Real

    We’ve watched context windows grow from 4K to 200K to 1M+ tokens. But even frontier models suffer from context rot — quality degrades steeply as prompts get longer, even within their stated limits. As the authors put it:

    «Though we expect context lengths to steadily rise through improvements to training, architecture, and infrastructure, we are interested in whether it is possible to scale the context size of general-purpose LLMs by orders of magnitude.»

    Current solutions like context compaction or summarization are fundamentally lossy. They assume some details early in the prompt can safely be forgotten. For tasks requiring dense access across the entire input, this is unacceptable.

    The RLM Architecture: Prompts as Environment

    An RLM exposes the same external interface as a standard LLM — it accepts a string prompt and produces a string response. But internally, the design is completely different:

    1. REPL as External Memory: Given a prompt P, the RLM initializes a Python Read-Eval-Print Loop where P is stored as a variable — not as context tokens.

    2. Programmatic Exploration: The LLM writes code to inspect, slice, search, and transform P. It sees metadata (length, structure) but never loads the full text into its attention window.

    3. Recursive Sub-Calling: The model spawns child agents via llm_query() or llm_batch() to process targeted snippets. Sub-agent responses are returned as variables in the parent’s REPL, not injected directly into context.

    4. Iterative Answer Refinement: The final answer emerges through multiple REPL iterations. The model writes to an answer variable, refines it across calls, and signals completion when answer["ready"] = True.

    This is essentially an out-of-core algorithm applied to language models — a concept borrowed from database systems that process datasets far larger than available RAM by managing data fetching intelligently.

    Results: 100x Context, Better Quality

    The benchmarks are compelling. Across four diverse long-context tasks, RLMs:

    Benchmark GPT-5 (base) RLM(GPT-5) Improvement
    CodeQA 24% 62% +38pp
    S-NIAH (1M tokens) ~20% ~80% 4x
    OOLONG (various lengths) Degrades severely Stable Orders of magnitude

    Key metrics:
    – RLMs handle inputs up to two orders of magnitude beyond native context windows (10M+ tokens tested)
    – Token efficiency is 2-3x better than base models on long-context tasks
    – Per-query cost is comparable or cheaper than sending everything at once
    – Performance remains stable even where vanilla models collapse

    RLM-Qwen3-8B: A Natively Trained Recursive Model

    Perhaps the most exciting part of the paper: the authors post-trained RLM-Qwen3-8B, the first model trained natively to operate in the recursive paradigm. It outperforms the underlying Qwen3-8B by 28.3% on average and approaches vanilla GPT-5 quality on three long-context tasks.

    This suggests the recursive paradigm isn’t just a clever inference trick — models can learn to reason recursively as a fundamental capability.

    Why This Matters

    I see three reasons RLMs are significant:

    1. Context scaling shifts from hardware to algorithm. Instead of waiting for better KV-cache compression or larger context windows, RLMs solve long-context processing through clever data management.

    2. The separation of storage and computation is elegant. The REPL holds the data; the model holds the reasoning. Each operates at its optimal scale. This mirrors how compilers and operating systems have worked for decades.

    3. Sub-agents can be cheaper models. The root agent orchestrates; child agents process. This is a natural fit for model tiering — use GPT-5 for orchestration and a cheaper model for bulk processing of context chunks.

    Practical Implementations

    Several implementations have already emerged:

    Official code: alexzhang13/rlm by the paper authors
    fast-rlm: avbiswas/fast-rlm — a minimal implementation with Deno/Pyodide, including a TUI log viewer for inspecting run histories. Works with any OpenAI-compatible API. AVB also made an excellent 50-minute visual tutorial walking through implementation from scratch.
    Prime Intellect: intellect-3 — integrated RLM into their training infrastructure with OOLONG benchmark results

    Limitations and Open Questions

    RLMs aren’t a silver bullet:

    Latency: The iterative nature means RLMs are inherently slower than single-pass inference. Each REPL cycle requires an LLM call.
    Code quality matters: The approach depends on the model’s ability to write effective Python for decomposition. Poor code = poor results.
    Complexity: Setting up and debugging an RLM pipeline is more involved than sending a prompt to an API.
    Training gap: While RLM-Qwen3-8B shows native training works, most practitioners will use vanilla models wrapped in the RLM framework, which requires careful system prompting.

    My Take

    This paper feels like a step toward what language models should always have been: agents that manage their own information flow rather than passive recipients of context dumped into an attention window.

    The parallels with existing multi-agent orchestration (like the delegate_task pattern used by assistants like Hermes) are clear, but RLMs formalize it and push it to its logical extreme — the model decides when to recurse, what context to pass, and how to structure subtasks, all autonomously within a REPL environment.

    I expect we’ll see this pattern emerge in production agent systems over the next 6-12 months, especially for document analysis, codebase understanding, and long-horizon search tasks where context lengths routinely exceed what any attention mechanism can handle efficiently.

    Sources:
    Zhang, Kraska, Khattab — «Recursive Language Models» (arXiv:2512.24601), MIT CSAIL, January 2026
    alexzhang13/rlm — Official implementation
    avbiswas/fast-rlm — Minimal implementation + tutorial
    AVB — «Recursive Language Models (RLMs)» video tutorial (YouTube, 2026)
    Prime Intellect — RLM benchmark analysis