Blog

Why Latent Prediction Beats Token Prediction: New Theory Explains Why JEPA and data2vec Need Less Data

Large language models are trained on trillions of tokens – more data than a human will ever encounter in a lifetime. Yet children learn language with orders of magnitude less exposure. Why is there such a massive gap between biological and artificial learning?

A new paper from EPFL, Cambridge, and Johns Hopkins provides a rigorous mathematical answer: it’s not the algorithm that’s wrong â it’s the target. When you train models to predict raw tokens, you force useful signal to travel down a hierarchical tree, getting diluted. When you train them to predict their own latent representations, the signal stays concentrated.

The paper

«Large language models and not from tokens: A sample-complexity theory» (arXiv:2605.27734, May 2026) by Daniel J. Korchinski (EPFL), Alessandro Favero (Cambridge DAMTP), and Matthieu Wyart (EPFL / Johns Hopkins) proves that latent prediction achieves exponential improvements in sample efficiency over token-level self-supervised learning.

31 mayo, 2026
From Prompts to Harnesses: The Evolution of LLM Engineering
The early days: Prompt engineering as the only lever

When large language models first became widely accessible in 2022-2023, there was essentially one knob to turn: the prompt. Every technique — few-shot examples, chain-of-thought, ReAct, self-consistency — lived inside the text you sent to the model.

This made sense architecturally. The models were stateless inference APIs. You sent text, you got text back. The entire system design reduced to engineering the right input string.

But it was always a fragile approach. Prompt engineering treated the model as a black box and tried to coax the desired behavior through natural language alone. It worked well for simple tasks but broke down quickly when you needed:

– Multi-step workflows with intermediate state
– Tool calling and external data retrieval
– Error recovery and retry logic
– Memory across sessions
– Coordination between multiple models

Prompt engineering could not solve these problems because they were system problems, not model problems.

Phase 1: Tool use and function calling (2023-2024)

The first real step beyond prompts came when models gained the ability to call tools — APIs, search engines, code interpreters. OpenAI introduced function calling in GPT-4 (March 2023), and soon all major models followed.

Tool use changed the game: the model could now act on the world, not just generate text. But it introduced a new complexity. Who managed the loop between the model’s tool call, the external API response, and the next model invocation?

Enter agent frameworks: LangChain, LlamaIndex, CrewAI, AutoGen. These libraries provided the scaffolding — execution loops, tool registries, memory stores, chain orchestration. The idea was that the framework would handle the plumbing while you focused on the model’s «reasoning.»

Phase 2: Frameworks became the problem (2024-2025)

By late 2024, a consensus emerged: frameworks helped, but they were not the answer. The problem was that each framework made different assumptions about:

– How tools should be defined and governed
– How memory should be structured and retrieved
– How errors should propagate and be recovered from
– How costs should be tracked and controlled
– How execution should be observed and debugged

Two models running the same «agent» in different frameworks could produce wildly different results. The bottleneck was not the model or the framework — it was the system layer between them.

Researchers and practitioners started calling this layer the harness.

Phase 3: Harness engineering (2025-2026)

The term «harness» refers to the software layer that surrounds an LLM with everything it needs to function as an autonomous agent: memory, tools, triggers, instructions, output channels, sandboxes, validators, permission boundaries, execution loops, and feedback mechanisms.

As Zylon.ai put it in 2026: «A framework that surrounds an LLM with memory, tools, triggers, instructions, and output channels so the model can act autonomously.»

Harness engineering has become recognized as its own discipline. The key insight, articulated across multiple recent papers, is that the harness matters more than the model for agent reliability at scale.

The theoretical foundation: Externalization

A pivotal survey from April 2026, «Externalization in LLM Agents» (arXiv:2604.08224), reframed the entire field through the lens of cognitive externalization — the idea that reliable agency comes not from ever-larger models, but from systematically relocating cognitive burdens from the model’s internal computation into persistent, inspectable, and reusable external structures.

Under this view:

– Memory externalizes state across time
– Skills externalize procedural expertise
– Protocols externalize interaction structure
– The harness serves as the unification layer that coordinates everything into governed execution

This is not just engineering convenience. It is representational transformation: restructuring the problem so the model can solve it more reliably with the capabilities it already has.

Code as the harness substrate

Another major survey, «Code as Agent Harness» (arXiv:2605.18747, May 2026) from University of Illinois, Meta, and Stanford, goes further. It argues that code — not natural language — is the right operational substrate for agent infrastructure. Code is executable, inspectable, and stateful. Programs are increasingly used as the medium through which agents reason, act, and model their environments.

The survey organizes harness research into three layers:

Harness interface — where code connects agents to reasoning, action, and environment modeling

Harness mechanisms — planning, memory, and tool use for long-horizon execution, plus feedback-driven control and optimization

Scaling the harness — from single-agent to multi-agent settings, where shared code artifacts support coordination, review, and verification

Self-optimizing harnesses

The most exciting recent work pushes beyond hand-written harnesses. Meta-Harness (Stanford/MIT, arXiv:2603.28052) is the first system to automate harness engineering end-to-end. It gives a coding agent unrestricted filesystem access to all prior search history — source code, execution traces, and evaluation scores — and lets it propose improved harnesses iteratively.

On TerminalBench-2, Meta-Harness achieved 37.6%, beating hand-written harnesses like Claude Code (27.5%) and Goose (35.5%). Crucially, the model itself never changed — only the harness around it improved.

Similarly, AutoHarness (Georgia Tech, arXiv:2603.03329) demonstrated that Gemini-2.5-Flash can automatically synthesize code harnesses for LLM agents in just a few rounds of iteration, handling context management, tool governance, cost control, observability, and session persistence.

What is a harness in practice?

A modern agent harness typically includes:

– Tool registry and governance — defining available tools, their permissions, and validation rules for inputs and outputs
– Memory architecture — short-term working memory, long-term persistent storage, and retrieval mechanisms
– Execution loop — the control flow that manages model invocations, tool calls, error handling, and retries
– Sandboxing — isolated environments for code execution and external API calls
– Observability — logging, tracing, and monitoring of agent behavior
– Feedback channels — mechanisms for the agent to receive evaluation signals and adapt
– Cost controls — token budgets, rate limits, and fallback strategies
– Multi-agent coordination — communication protocols when multiple agents work together

Repositories to explore and test

Here are the most interesting open-source projects in the harness engineering space:

– Meta-Harness (Stanford) — Self-optimizing harness that iteratively improves agent performance through filesystem-based search. https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact
– AutoHarness (Georgia Tech) — Automated harness engineering framework with tool governance pipelines and multi-agent support. https://github.com/aiming-lab/AutoHarness
– Code as Agent Harness Awesome List — Curated collection of papers and resources organized by the «code as harness» thesis. https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers
– Agent Harness Survey — Companion repository to the comprehensive survey covering 110+ papers on harness engineering. https://github.com/RUCAIBox/awesome-agent-harness
– AgentFlow — Multi-agent harness synthesis for vulnerability discovery. Demonstrates feedback-driven harness optimization. https://arxiv.org/html/2604.20801v1

Where this is heading

The trajectory is clear. We are moving from:

Prompts -> Tool definitions -> Frameworks -> Harnesses -> Self-optimizing harnesses

Each step pushes more intelligence from the model into the system layer. The model becomes one component in a larger architecture that handles reliability, observability, safety, and efficiency.

This has practical implications. If you are building production agent systems today, the ROI on investing in harness engineering — good tool governance, robust error handling, proper observability, and automated optimization — far exceeds the ROI of simply upgrading to a larger or more capable model.

The papers from the last few months make this explicit. When Stanford’s Meta-Harness showed that optimizing the harness around a fixed model beats switching to a stronger model, it was a signal that the field has matured past the «bigger model solves everything» phase.

Agent systems are now software engineering problems, not just model selection problems. And that is exactly where they need to be.

Sources:
- Externalization in LLM Agents: A Unified Review — Shanghai Jiao Tong / CMU, April 2026
- Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems — U. Illinois / Meta / Stanford, May 2026
- Meta-Harness: End-to-End Optimization of Model Harnesses — Stanford / MIT, March 2026
- AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness — Georgia Tech, March 2026
- Agent Harness for Large Language Model Agents: A Survey — April 2026 (110+ papers reviewed)
- Synthesizing Multi-Agent Harnesses for Vulnerability Discovery — April 2026
- Harnessing LLM Agents with Skill Programs (HASP) — May 2026
- LLM Readiness Harness — March 2026
- Harnessing Embodied Agents — April 2026
30 mayo, 2026
The Recursive Revolution: From RLM to Multi-Agent Loops
The Recursive Revolution: From Single Models to Multi-Agent Loops

Recursive Language Models (RLMs) emerged in December 2025 as a new scaling axis for AI. Since then, the field has evolved rapidly — from single-model recursion to entire multi-agent systems operating as unified recursive computations. Here’s where we stand in mid-2026.

Where It Started: RLM (MIT, Dec 2025)

The original RLM paper by Zhang, Kraska, and Khattab reframed long-context reasoning as an inference-time scaling problem. Instead of stuffing everything into a context window, the model delegates work to a persistent Python REPL and spawnable sub-LLM instances.

Key results: GPT-5 with RLM achieved +26% over compaction, +130% over CodeAct, and +13% over Claude Code. A fine-tuned RLM-Qwen3-8B rivalled GPT-5 on 3 out of 4 tasks.

The insight was deceptively simple: treat the prompt as an environment the model can inspect, slice, and recursively query — collapsing reasoning and tool use into a single inference abstraction.

The Next Step: Ouro — LoopLMs (ByteDance, Oct 2025)

ByteDance’s «Scaling Latent Reasoning via Looped Language Models» (Zhu et al., with Yoshua Bengio) took a different approach: instead of applying recursion only at inference time, they built reasoning into pre-training itself.

Key innovations:
- Iterative computation in latent space during pre-training, not just inference
- Entropy-regularized objective for learned depth allocation — the model decides how many iterations it needs
- 7.7 trillion tokens of training data
- Ouro-2.6B — open-weight 2.6B parameter model available on HuggingFace
The name «Ouro» comes from the Ouroboros — the serpent eating its own tail. The model loops over its own latent representations, refining reasoning through shared-weight recurrent computation rather than stacking more layers.

Why this matters: RLM proved recursion works as a fine-tuning strategy. Ouro proves that if you train a model to be recursive from scratch, it learns to organize computation in stages that closely mirror feedforward models — but with far greater parameter efficiency.

Mechanistic Understanding: Why Does It Work? (Apr 2026)

Two papers from April 2026 provide the theoretical foundation:

«A Mechanistic Analysis of Looped Reasoning Language Models» discovered that recurrent blocks in looped models learn stages of inference that repeat cyclically in the latent space. Each layer converges to a distinct fixed point — the model isn’t just looping randomly, it’s following a consistent computational trajectory.

«From Growing to Looping» (Google DeepMind, OpenReview 2026) by Kapl, von Oswald, and Bauer provides a unified theoretical framework connecting depth growth with looping. The conclusion: iterative computation can scale reasoning without adding parameters, establishing a formal equivalence between model depth and recursion depth.

The Latest Frontier: RecursiveMAS (Stanford, Apr 2026)

Now we arrive at the paper you came here for. «Recursive Multi-Agent Systems» by Yang, Zou et al. (Stanford + UIUC) asks: if recursion works for a single model, does it work for a system of agents?

RecursiveMAS treats the entire multi-agent system as a unified latent-space recursive computation. Each agent acts like an RLM layer, iteratively passing latent representations to the next, forming a looped interaction process.

Architecture Highlights
- RecursiveLink module — lightweight component enabling in-distribution latent thoughts generation and cross-agent latent state transfer (not text-based message passing)
- Inner-outer loop learning — iterative whole-system co-optimization with gradient-based credit assignment across recursion rounds
- 4 collaboration patterns — sequential pipeline, parallel specialized agents, and hybrids
Results (9 benchmarks: math, science, medicine, search, code)
- +8.3% accuracy vs advanced single/multi-agent baselines
- 1.2x–2.4x inference speedup vs text-based MAS
- 34.6%–75.6% token reduction — massive efficiency gain
The token reduction alone is the killer feature. Multi-agent systems have always been expensive because agents exchange text — every message is tokens generated, transmitted, and consumed. RecursiveMAS eliminates most of that overhead by exchanging latent states instead.

The Bigger Picture: Where Is This Heading?

Three clear directions are emerging:

1. Pre-trained recursive models become the default. Ouro shows that recursion can be baked into pre-training. Expect future foundation models to include recursive computation as a native capability, not an afterthought.

2. Multi-agent recursion for production workloads. RecursiveMAS is academic, but the efficiency gains (75% token reduction) make it immediately relevant for production multi-agent pipelines. Companies like Prime Intellect are already building RLM training infrastructure.

3. Recursive self-improvement loops. The ICLR 2026 workshop on «AI with Recursive Self-Improvement» signals that agents rewriting their own code and prompts recursively is no longer theoretical — Claude Code and Codex already do this ad-hoc. The next step is systematic, gradient-based self-improvement inspired by RecursiveMAS’s inner-outer loop design.

Key Papers
- RLM — Zhang, Kraska, Khattab (MIT). arXiv:2512.24601
- Ouro / LoopLM — Zhu, Wang, Zhang… + Y. Bengio (ByteDance). arXiv:2510.25741
- Mechanistic Analysis — arXiv:2604.11791
- RecursiveMAS — Yang, Zou et al. (Stanford). arXiv:2604.25917
- From Growing to Looping — Kapl, von Oswald, Bauer (Google DeepMind). OpenReview
The recursive paradigm is no longer a niche research direction. From single-model inference scaling to multi-agent systems to pre-training architectures, recursion is becoming the unifying principle for how AI systems handle complexity beyond what any single forward pass can manage.
13 mayo, 2026
Recursive Language Models: The New Paradigm for Inference Scaling

CoT → ReAct → CodeAct → RLM. The evolution of how language models think doesn’t stop at bigger context windows. A team at MIT CSAIL has introduced a radically different way for LLMs to reason: make the model recursively call itself through a programmable REPL environment.

The Problem: «Context Rot» Is Real

If you’ve ever worked with Claude Code, ChatGPT, or any agent framework on a sufficiently large codebase, you’ve experienced it: as the conversation grows, the model gets… dumber.

Anthropic calls it context rot: when the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases. In real usage — long coding sessions, multi-turn research, massive document analysis — performance degrades in ways that are hard to benchmark and impossible to ignore.

The natural intuition is: «What if I split the context into two model calls, then combine them in a third?»

That’s exactly what RLMs do. But instead of a fixed splitting strategy, the model itself decides how to decompose its input, writes code to do it, and recursively processes the pieces.

What Are Recursive Language Models?

Recursive Language Models (RLMs) are an inference strategy proposed by Alex Zhang, Omar Khattab, and Tim Kraska at MIT CSAIL (October 2025, updated January 2026).

The core idea: a language model can spawn recursive calls to itself through a REPL environment, where the input context exists as a symbolic variable — not as text in the model’s prompt.

Their results: an RLM using GPT-5-mini outperforms GPT-5 by more than double the number of correct answers on the OOLONG long-context benchmark, and costs less per query. RLMs also don’t degrade when given 10M+ tokens at inference time.

The Evolution: From CoT to RLM

Chain-of-Thought (CoT, 2022): The model generates intermediate reasoning steps before answering. No tool use. No external interaction.

ReAct (Yao et al., 2023): The pattern Thought → Action → Observation. The model reasons with chain-of-thought, calls tools, observes results, and repeats.

CodeAct (Shi et al., 2023): Same as ReAct, but the action is executing code instead of calling predefined tools. More powerful because code allows arbitrary composition.

RLM (Zhang, Khattab & Kraska, 2025): The model calls itself recursively. The root LM has a Python REPL environment where the context exists as a variable. It can launch recursive sub-LM calls with partitions of the context, each with their own REPL.

The authors see RLMs as the next milestone after CoT and ReAct.

Deep Dive: How RLM Works

The Context Is a Variable, Not a Prompt

This is the crucial conceptual difference. In a normal LLM call, the context is text in the prompt. The model sees it all at once and attends to it.

In an RLM, the context is loaded as a Python variable in a REPL environment. The model never sees the actual content unless it explicitly writes code to read a portion of it.

The user prompt that initializes the system tells the model ONLY metadata:

Your context is a str with 5,000,000 total characters, broken into chunks of lengths: [5000000].

That’s it. The model knows something exists, knows its size, but does not see its content. To understand the context, the model must write code.

The System Prompt

The RLM system prompt is an extensive instruction manual telling the model what tools it has:

Available tools:
– context — The input data as a variable
– llm_query(prompt, model) — A single LLM call (fast, ~500K chars)
– llm_query_batched(prompts) — Multiple calls in parallel
– rlm_query(prompt, model) — Spawns a child RLM with its own REPL
– rlm_query_batched(prompts) — Multiple child RLMs in parallel
– SHOW_VARS() — Lists all variables in the REPL
– print() — View REPL output

Guidance on when to use what:
– Use llm_query for: simple extraction, summarization, classification, Q&A over a chunk
– Use rlm_query for: multi-step reasoning, problems requiring iteration, deep subtasks

The prompt includes detailed examples showing how to search for needles in haystacks, read books section by section, process chunks in parallel, and use recursive calls for reasoning-heavy subtasks.

The Main Loop

The RLM framework runs a tight loop (default: 30 iterations max):

1. Send to model: system prompt + metadata + iteration history + user prompt
2. Model responds with text + code blocks wrapped in triple backticks with repl tag
3. Execute the Python code in the REPL. Results are stored as variables
4. Capture REPL output (stdout, stderr, new variables)
5. Check if model wrote FINAL(answer) or FINAL_VAR(var name). If yes: terminate and return. If no: append to history and repeat

Safety mechanisms prevent infinite loops: max_iterations (30), max_budget (USD), max_timeout (seconds), max_errors (consecutive failures), and max_depth (recursion depth).

How the Model Decides

The model doesn’t follow a fixed algorithm. It writes Python code that implements its strategy.

For a needle-in-a-haystack problem, it might grep for keywords — finding the answer without any sub-LLM calls at all.

For summarizing a 10M-token document, it divides by headings, sends each section to llm_query, then aggregates the results.

The model chooses the strategy based on the query and the context metadata. It’s the model’s intelligence, not a predefined pipeline.

Recursion — Child RLMs

When the model calls rlm_query, the framework spawns a child RLM with its own REPL environment, message history, and iteration loop. The child processes the prompt, can write and execute code, iterate, and returns its result back to the parent as a Python variable.

Recursion depth is controlled by max_depth: at depth 1, only the root has a REPL. At depth 2, children also get their own REPL and can spawn grandchildren. The framework propagates remaining budget and timeout to children.

RLM vs. CodeAct vs. ReAct

All three involve a model writing and executing code. So what makes RLM different?

Co-author Omar Khattab explicitly drew the line:

A standard coding agent:
– Receives a prompt + external data (files, APIs, databases)
– Reads external data via file I/O, HTTP calls
– Writes code to transform or analyze that data
– Results go into the conversation context

An RLM:
– The user’s own prompt/context is a symbolic object in the environment
– The model cannot read long snippets from the context directly — it must write code to access portions
– Recursion happens during code execution — rlm_query spawns a child RLM whose output returns as a variable
– All sub-calls return values into symbolic variables — not text appended to the context

The difference is philosophical: in an agent, the context is external data you read from. In an RLM, the context is the thing the model needs to understand — and it can only understand it by writing code that programmatically explores it.

Where This Fits

RLMs represent a shift in how we think about LLM inference:

Context is code, not text. Instead of cramming more tokens into larger context windows, treat the context as a programmable resource.

Recursion as a primitive. The model calling itself to solve sub-problems is a natural extension of decomposition-based reasoning.

Cost-effective. Using a smaller model for sub-calls while reserving the smart model for strategic decisions produces better results at lower cost.

The entire agentic AI stack already uses iterative code-execution loops. The key innovation is treating the user’s input context as the programmable object, not just external data.

When a model can write code to explore its own input, the notion of context length becomes something entirely different.

Sources:
– Alex Zhang, Omar Khattab, Tim Kraska — Recursive Language Models (arXiv:2512.24601v1): https://arxiv.org/abs/2512.24601
– RLM Blog Post: https://alexzhang13.github.io/blog/2025/rlm/
– RLM Source Code: https://github.com/alexzhang13/rlm
– RLM Minimal: https://github.com/alexzhang13/rlm-minimal
– RISE Data Labs: https://risedatalabs.com/blog/recursive-language-models
– Yao et al. — ReAct (2023): https://react-lm.github.io/
– Shi et al. — CodeAct: https://www.emergentmind.com/topics/codeact-framework
– Navendu Pottekkat — Notes on RLMs: https://navendu.me/posts/recursive-language-models/
– Omar Khattab — X clarification on RLM vs. coding agents: https://x.com/lateinteraction/status/2020215204945252429

12 mayo, 2026

The Sacred Tears of Shiva: A Complete Guide to Rudraksha Beads

Rudraksha — from the Sanskrit «Rudra» (Lord Shiva) and «Aksha» (tears) — are among the most sacred seeds in Hindu and Buddhist tradition. Believed to be born from the tears of a god, each bead carries a different divine frequency. Here’s everything you need to know about these ancient spiritual tools, from mythology to market prices.

Serene God and Peace

What Are Rudraksha?

Rudraksha are the seeds of the Elaeocarpus ganitrus tree, a tropical evergreen found primarily in Nepal (Arun Valley), Indonesia (Java and Sumatra), India, and parts of Southeast Asia. These trees grow at altitudes between 300 and 1,500 meters, producing blue outer fruits that enclose the wrinkled, brown seeds we know as Rudraksha beads.

Each seed naturally forms vertical grooves on its surface that converge at the top and bottom poles. The number of these grooves — called «mukhi» (faces) — determines the bead’s spiritual properties, associated deity, ruling planet, and even its market price.

Rudraksha have been used for thousands of years as prayer beads (mala), meditation aids, astrological remedies, and protective amulets. They appear in foundational Hindu scriptures including the Shiva Purana, Mahabharata, Rudraksha Jabala Upanishad, and multiple Tantras. In Buddhism, they’re equally revered, particularly in Tibetan and Nepalese Buddhist traditions.

Scientifically, Rudraksha seeds contain alkaloids, minerals, and trace elements (calcium, iron, magnesium, zinc, silicon) that have been studied for potential cardiovascular benefits — traditional Ayurvedic medicine uses them to regulate blood pressure and heart rate.

Origin Stories: Three Myths, One Seed

The Story of Shiva’s Penance (Most Popular)

According to the Shiva Purana, Lord Shiva — the great ascetic — meditated for thousands of years on Mount Kailash, completely withdrawn from the world, eyes closed in deep trance. When he finally opened them, he saw the immense suffering of humanity and was overwhelmed with compassion. Tears streamed down his face, and wherever they touched the earth, Rudraksha trees sprang up. The beads were thus born as Shiva’s gift of mercy to suffering mankind.

The Story of Tripurasur (The Demon of Three Cities)

A second version in the Shiva Purana tells of Tripurasur, a demon who had meditated for decades to please Lord Brahma and received a boon granting him three invincible cities. Tripurasur became consumed with pride and terrorized both gods and humans. Shiva, seeing the suffering, entered meditation to forge the deadly Aghor weapon to destroy the demon. When he emerged from meditation and opened his eyes, tears of compassion fell — from which Rudraksha trees grew, meant to protect humanity long after the demon was vanquished.

The Story of Shiva’s Sweat

A third variant says that during Shiva’s meditation on Mount Kailash, his intense spiritual energy caused drops of sweat to fall from his forehead. When these drops touched the ground, they transformed into seeds that grew into Rudraksha trees. This version emphasizes the beads’ connection to Shiva’s raw spiritual power rather than his compassion.

Adi Shankaracharya and the Discovery of Rudraksha

According to tradition, the great philosopher Adi Shankaracharya (8th century CE) was meditating near the Mandakini River when he saw a Rudraksha bead washed ashore. Recognizing its sacred nature, he began wearing Rudraksha malas and promoted their use among all castes and classes. The Rudraksha Jabala Upanishad — attributed to this era — explicitly states that Rudraksha should be worn by anyone regardless of social status: Brahmin, Kshatriya, Vaishya, or Shudra. This was radical at the time, when many spiritual practices were restricted by caste.

The 13 Faces: Deity, Planet, and Meaning

Each mukhi (face) is associated with a specific deity, ruling planet, sacred mantra, and set of benefits. Here’s a complete guide from 1 to 13 mukhi:

1 Mukhi — Lord Shiva ☀️ Sun

The most sacred and rare of all Rudraksha. The 1 Mukhi embodies the ultimate truth, unity, and pure consciousness. It’s considered the jewel among jewels — a single authentic round 1 Mukhi can cost up to $6,000.

Benefits: Enlightenment, moksha (liberation), enhanced concentration, detachment from worldly attachments, self-realization.

Mantra: Om Hreem Namaha

Note: A perfectly round 1 Mukhi with a single internal seed doesn’t exist in nature. Nepal produces oval/lentil-shaped 1 Mukhi that are considered authentic. Some «round» 1 Mukhi beads sold today are underdeveloped 4-5 mukhi beads with naturally formed single faces.

2 Mukhi — Shiva + Parvati 🌙 Moon

The Unity Bead. Represents the union of Shiva and Goddess Parvati (Ardhanarishvara), symbolizing duality in perfect harmony.

Benefits: Harmonious relationships — marriage, partnerships, family bonds. Emotional balance, love, compassion, conflict resolution.

Mantra: Om Namaha

3 Mukhi — Agni (Fire God) ♂ Mars

The Purification Bead. Associated with Agni, the divine fire that burns away impurities.

Benefits: Destroys past-life karma and sins. Eliminates inferiority complexes, fear, self-hatred, and mental stress. Boosts energy and eliminates laziness. Spiritual rebirth.

Mantra: Om Kleem Namaha

4 Mukhi — Lord Brahma ☿ Mercury

The Knowledge Bead. Brahma is the Creator of the universe and the bestower of knowledge and creativity.

Benefits: Enhanced memory, concentration, learning ability, and eloquence. Particularly beneficial for students and scholars. Improves speech and communication.

Mantra: Om Hreem Namaha

5 Mukhi — Kalagni Rudra (Shiva) ♃ Jupiter

The most common Rudraksha — over 90% of all beads are 5 Mukhi. Associated with Kalagni Rudra, a fierce form of Shiva. Known as the «Dev Guru Rudraksha» (Teacher of the Gods) because Jupiter is the guru of all deities.

Benefits: Destroys bad karma of the present life. Brings mental peace, health, protection from accidental death. Grants fame and renown. Essential for any meditation or sadhana practice.

Mantra: Om Hreem Namaha

6 Mukhi — Lord Kartikeya ♀ Venus

The Willpower Bead. Kartikeya is Shiva’s son and the commander of the celestial army.

Benefits: Courage, wisdom, willpower, and expressive power. Ideal for leaders, speakers, and performers. Also blessed by Parvati, Lakshmi, and Saraswati.

Mantra: Om Hreem Hum Namaha

7 Mukhi — Gauri (Lakshmi/Parvati) 🌙 Moon

The Charisma Bead. Gauri is the goddess of magnetic personality and abundance.

Benefits: Personal magnetism, love, prosperity, stress relief. Attracts positive energies and good fortune.

Mantra: Om Hum Namaha

8 Mukhi — Lord Kubera ♃ Jupiter / ♀ Venus

The Wealth Bead. Kubera is the god of wealth and treasure.

Benefits: Material prosperity, wisdom, removal of financial obstacles, power, and authority.

Mantra: Om Hum Namaha

9 Mukhi — Goddess Durga ♂ Mars

The Protection Bead. Durga is the warrior goddess, divine shield against all harm.

Benefits: Spiritual power, courage, protection against enemies and curses. Neutralizes negative astrological effects of Mars.

Mantra: Om Hreem Hum Namaha

10 Mukhi — Lord Vishnu ♄ Saturn

The Preservation Bead. Vishnu is the Preserver of the universe, associated with his ten incarnations (Dashavatara).

Benefits: Physical and mental health, leadership qualities, balance, relief from Saturn-related afflictions.

Mantra: Om Namaha

11 Mukhi — Lord Hanuman ⚡ No fixed planet

The Courage Bead. Hanuman is the god of bravery, strength, and adventure. This is the 11th of the 11 Rudras (forms of Shiva).

Benefits: Physical and mental courage, strength, confidence, elimination of cowardice. Spiritual protection.

Mantra: Om Hreem Hoom Namaha

12 Mukhi — Lord Surya (Sun) ☀️ Sun

The Leadership Bead. Surya, the sun god, creates a powerful aura around the wearer.

Benefits: Charisma, leadership, creativity, mental clarity, confidence. Also associated with relief from heart problems.

Mantra: Om Hreem Namaha

13 Mukhi — Indra + Kamadeva ♀ Venus / 🌙 Moon

The Love and Emotion Bead. Represents Indra (king of the gods) and Kamadeva (god of love and desire).

Benefits: Emotional control, love, attraction, magnetism. Mitigates negative effects of Venus and Moon in astrological charts. Considered rare and powerful.

Mantra: Om Hreem Namaha

Price Guide: How Much Do Rudraksha Cost?

Rudraksha prices vary dramatically based on mukhi count, origin, size, and quality. Nepal Arun Valley beads command 2-5x the price of Indonesian ones due to larger size (15-25mm vs. 8-15mm), deeper mukhi lines, and traditional preference.

Mukhi	Price Range (USD)	Rarity
1 Mukhi	$30 – $6,000+	Extremely rare
2 Mukhi	$10 – $65	Very rare
3 Mukhi	$8 – $40	Rare
4 Mukhi	$6 – $30	Rare
5 Mukhi	$1 – $20	Very common
6 Mukhi	$6 – $30	Common
7 Mukhi	$9 – $45	Uncommon
8 Mukhi	$25 – $100	Uncommon
9 Mukhi	$40 – $150	Rare
10 Mukhi	$50 – $200	Rare
11 Mukhi	$60 – $250	Rare
12 Mukhi	$75 – $300	Very rare
13 Mukhi	$100 – $450	Very rare

Key factors that affect price:

– Size: Larger beads (25mm+) can be 3-5x more expensive than standard sizes
– Origin: Nepal > Indonesia for price and traditional preference
– Shape: Round/oval beats irregular
– Clarity: Deep, well-defined mukhi lines add value

– Certification: Lab-certified beads (X-ray tested for internal seeds) cost more

Extreme cases: A Siddha Mala (one bead each of 1-14 mukhi combined) can cost $1,000–$15,000 depending on origin and quality. The legendary Brahma Mala (21 beads of 1 mukhi each) has been known to fetch $20,000+ at auctions.

Nepal vs. Indonesia: Which Should You Choose?

Feature	Nepal (Arun Valley)	Indonesia (Java/Sumatra)
Size	15-25mm (larger)	8-15mm (smaller)
Mukhi lines	Deep, clearly defined	Shallower, less distinct
Texture	Rough, thorny	Smoother
Production	Limited, seasonal	Abundant
Best for	Single beads, wearing, astrology	Malas (108 beads), meditation
Price	Premium (2-5x higher)	More affordable
Fake risk	Lower	Higher (mass-produced fakes exist)

How to Spot Fake Rudraksha

The market is flooded with counterfeit beads. Here’s how to identify real ones:

1. The water test: Real Rudraksha sinks immediately in water. Fakes (often made of resin or carved stone) float.
2. The nail test: Press a nail into the surface — real Rudraksha feels soft and slightly fibrous, like dried wood.
3. The mukhi lines: Genuine lines run continuously from top to bottom pole. Fakes often have lines that stop or merge.
4. The sound: Shake a mala — real Rudraksha make a soft, dull sound. Hard counterfeits clink like stone.

5. X-ray certification: For expensive beads (1, 2, 13+ mukhi), request lab certification that shows the internal seed structure matches the external mukhi count.

Final Thoughts

Rudraksha are more than pretty prayer beads — they’re one of the oldest continuously used spiritual tools in human history, with a documented presence in Hindu texts for over 3,000 years. Whether you’re drawn to them for meditation, astrological remedies, protection, or simply their organic beauty, there’s a mukhi for almost every intention.

The 5 Mukhi remains the most practical entry point: affordable, abundant, and powerful. But if you’re searching for something specific — Hanuman’s courage (11 Mukhi), Brahma’s knowledge (4 Mukhi), or Shiva’s ultimate realization (1 Mukhi) — the entire spectrum is available, provided you buy from reputable sources.

Just remember what the Shiva Purana says: «Even a person who has committed the most grievous sins can be purified by wearing Rudraksha with devotion.»

The only question is which face of Rudraksha resonates with yours.

Sources:
– Shiva Purana (Hindu scripture on Rudraksha origin)
– Rudraksha Jabala Upanishad (Vedic text on Rudraksha usage)
– Wikipedia — «Rudraksha»: https://en.wikipedia.org/wiki/Rudraksha
– Himalayas Shop — «Meaning of Different Rudraksha Mukhi»: https://www.himalayasshop.com/blogs/guides/meaning-of-different-rudraksha-mukhi
– IGL Delhi — «Rudraksha Types & Benefits»: https://igldelhi.com/pdf/rudraksha-benefits-and-uses.pdf
– Ratna Gems — «Nepal Rudraksha Price Guide 2026»: https://ratnagems.com/original-rudraksha-buying-guide/
– Rudraksha Ratna — «Legends of Rudraksha»: https://www.rudraksha-ratna.com/articles/legend-of-rudraksha
– Divine Hindu — «Rudraksha Origin Story»: https://www.divinehindu.in/blogs/news/rudraksha-origin-story

12 mayo, 2026

Multi-Token Prediction (MTP): How LLMs Learn to Look Ahead
The autoregressive bottleneck

Every large language model you’ve heard of — GPT-4, Claude, Llama, Qwen, DeepSeek — shares a fundamental constraint: autoregressive generation. Given a prompt, the model predicts one token at a time. It sees «The capital of France is», predicts «Paris», then feeds «The capital of France is Paris» back in to predict the next token. Repeat. Forever.

This sequential loop creates a hard throughput ceiling. No matter how fast your GPU is, you can’t predict token _t+2_ until you’ve committed to token _t+1_. The problem is architectural, not hardware-bound.

Multi-Token Prediction (MTP) is one of the most significant answers to this problem. It’s a technique that lets a model predict several future tokens simultaneously during a single forward pass, then uses those predictions to accelerate inference through speculative decoding. Instead of generating one token per pass, the model generates a _block_ of candidates that are verified in one go.

Origins: From training tricks to inference speedups

MTP didn’t start as an inference acceleration technique. Its roots go back to auxiliary prediction tasks in training — the idea that asking a model to predict not just the next token but also tokens further ahead improves its representation learning.

Google explored this with Lookahead Transformers (Lee et al., 2022), where auxiliary branches predicted future positions during training. The benefit was better performance, not speed. Similarly, InstructGPT and earlier work used «next-next-token» predictions as regularization.

The breakthrough moment came with DeepSeek-V3 (December 2024). The DeepSeek team added MTP heads during pre-training — small auxiliary prediction heads attached to intermediate decoder layers that predicted tokens at offsets +2, +3, and beyond. They discovered something unexpected: these same heads could be repurposed at inference time as a built-in draft model for speculative decoding.

The key insight: if a model already knows how to predict multiple tokens ahead during training, those predictions are already aligned with the main model’s distribution. No separate draft model needed. No distribution mismatch to overcome.

DeepSeek V3 reported an MTP-1 setup (predicting one extra token), while Step 3.5 Flash went further with MTP-3 (three extra tokens) during both training and inference.

— _Sebastian Raschka, LLM Architecture Gallery on MTP_ (2025)

Source: DeepSeek V3 Technical Report, Raschka’s MTP Guide

How MTP works

Training phase

During pre-training, the model architecture includes:

1. Main decoder — processes the input sequence autoregressively as usual
2. MTP heads — lightweight auxiliary heads attached to intermediate layer outputs, each predicting a token at a different future offset

At position _t_, the main head predicts token _t+1_. Simultaneously:
– MTP head 1 (reading from layer outputs at position _t_) predicts token _t+2_
– MTP head 2 predicts token _t+3_
– And so on, up to the configured MTP depth

The total training loss combines the standard next-token cross-entropy with averaged MTP losses, typically weighted at 0.1x the main loss to avoid destabilizing training.
```
Position t:  [main head → t+1] [MTP-1 → t+2] [MTP-2 → t+3]
Position t+1: [main head → t+2] [MTP-1 → t+3] [MTP-2 → t+4]
```
Inference phase (speculative decoding)

At inference time, the MTP heads become a zero-overhead draft model:

1. Draft step: The main decoder runs one forward pass. The main head produces token _t+1_, while MTP heads simultaneously produce candidates for _t+2, t+3, t+4_ — all from that single pass. (1/4)

2. Verification step: The main model verifies each candidate sequentially. If the candidate matches what the main model would have predicted autoregressively, it’s accepted.
3. Accept or fall back: Accepted tokens are emitted as a block. If a candidate is rejected, generation falls back to standard autoregressive mode from that point.

The computational trick: generating N speculative candidates + verifying them all can be cheaper than N separate autoregressive forward passes, especially when the acceptance rate is high (and it is, because the draft and verifier are the same model).

Current adoption

As of May 2026, MTP has moved from experimental to mainstream in the open-weight model landscape:

Models with native MTP

– DeepSeek V3/V4 — MTP-1 (1 extra token). The model that popularized the technique in production.
– Step 3.5 Flash — MTP-3 (3 extra tokens). One of the most aggressive MTP deployments.
– Qwen 3.5 family — MTP-1. Uses "mtp" method in vLLM.
– Qwen 3.6 family — MTP-2. Supports 2 speculative tokens.
– Qwen3-Next / Qwen3-Coder-Next — MTP-2 with a specialized method ("qwen3_next_mtp") adapted to their GDN architecture with hybrid attention.
– Xiaomi MiMo-7B-Base — Configurable MTP depth.

Framework support

– vLLM: Full MTP support via --speculative-config '{"method": "mtp", "num_speculative_tokens": N}'. Also supports model-specific methods like "qwen3_next_mtp".
– llama.cpp: MTP support added in PR #22673 (2025), with GGUF quantization support for models like Qwen3.5-4B-MTP and Qwen3.6-35B-A3B-MTP.
– NVIDIA Megatron Bridge: MTP training support in the framework, including pipeline parallelism. Recommended for models >10B parameters.
– SGLang: Supports MTP through Qwen3-Next models.

Measured performance gains

– FastMTP (ICLR 2026): Achieved 2.03× speedup over standard next-token prediction, outperforming vanilla MTP by 82% through self-distilled fine-tuning and dynamic vocabulary compression. Source: OpenReview: FastMTP
– Qwen3.5-122B on DGX Spark: Reached 38.4 tokens/second with MTP enabled (from 28.3 baseline), approaching the memory bandwidth ceiling. Source: NVIDIA Dev Forums

Alternatives to MTP

MTP is not the only way to accelerate speculative decoding. Other approaches trade off differently between complexity, generality, and performance:

Draft model speculation

The classic approach: run a smaller, faster model as a draft generator, then verify its output with the target model.

– Pros: Universal — works with any target model. No architectural changes needed.
– Cons: Requires hosting a second model (memory overhead). Distribution mismatch between draft and target reduces acceptance rates.
– Example: Using Qwen3.5-4B as a draft for Qwen3.5-27B-FP8 in vLLM with {"method": "draft_model", "model": "Qwen/Qwen3.5-4B", "num_speculative_tokens": 5}

EAGLE (Efficient AcceleRation Generator)

A learned draft network that predicts multiple tokens autoregressively, fine-tuned specifically for each target model.

– Pros: Higher acceptance rates than generic draft models. Much smaller than a full draft model.
– Cons: Requires per-model fine-tuning. Still autoregressive in the draft phase (generates one token at a time).
– Status: Well-supported in vLLM. Widely used for models that don’t have native MTP. EAGLE-3 is the latest iteration.

DFlash — Block Diffusion for Flash Speculative Decoding

The newest and most aggressive contender. Published by Z Lab in February 2026, DFlash replaces the autoregressive drafter entirely with a block diffusion model.

Instead of emitting draft tokens one at a time (as EAGLE and classic draft models do), DFlash generates a full block of _K_ tokens in a single forward pass by denoising a masked sequence conditioned on the target model’s hidden states. The draft block is then verified by the target model in one parallel check.

– Speedup: Over 6× lossless acceleration. Up to 2.5× higher speedup than EAGLE-3. Source: DFlash paper
– How it differs: The diffusion drafter is lightweight and conditioned on context features from the target model. It emits all K candidates simultaneously — no sequential draft loop.
– Integration: Available in vLLM and SGLang. Baseten reports ~3× speedup on Qwen3-8B on a single B200 (654 TPS mean throughput, 10% faster than vLLM’s native DFlash implementation). Source: Baseten: DFlash blog
– Status: Active development. The z-lab/dflash GitHub repo has 3.8k stars (as of May 2026). Community forks have ported it to DGX Spark and other platforms.
– Key limitation: Requires a pre-trained DFlash checkpoint for your target model. Not training-free like ngram speculation, and not as universally available as draft models.

Ngram / median-speculative sampling

Simple, training-free methods that reuse recently seen token sequences or statistical patterns as draft candidates.

– Pros: Zero training, zero extra parameters. Works out of the box.
– Cons: Modest speedup (~1.1–1.3×). Acceptance rates depend heavily on the text domain.
– Status: Built into vLLM and llama.cpp as fallback options.

L-MTP (Leap Multi-Token Prediction)

A NeurIPS 2025 proposal that extends MTP by predicting non-adjacent tokens in a single pass — skipping intermediate positions to capture longer-range dependencies.

– Pros: Better long-range context modeling. Custom decoding strategy optimized for leap generation.
– Cons: More complex architecture. Less battle-tested in production.
– Source: NeurIPS 2025 Poster: L-MTP

Comparison

The tradeoff that matters

MTP’s defining characteristic is that the capability is baked into the model architecture. You can’t add MTP to a model after training — it needs auxiliary heads wired up during pre-training. This means:

– Models trained without MTP (GPT-4, Claude, Llama 3, Qwen 2.5, Qwen 2.5 Coder) can never use it
– Models trained with MTP have a permanent inference advantage with zero runtime overhead
– The training cost is modest (auxiliary heads are small relative to the full decoder)

This is why every major new open-weight architecture released in 2025–2026 — DeepSeek V3/V4, Qwen 3.x, Qwen3-Next — ships with MTP heads. It’s becoming the standard feature, the way MoE and sliding window attention already are.

For models that don’t have native MTP, DFlash and EAGLE are the most performant alternatives — but they require either fine-tuning a drafter or training a diffusion model, which adds operational complexity. Native MTP wins on simplicity: the speedup is there as long as you enable one flag.

Try it yourself

If you’re running a Qwen3-Next or Qwen 3.5/3.6 model locally, enabling MTP takes one flag:
```
# Qwen 3.5 / 3.6 (generic MTP)
vllm serve Qwen/Qwen3.5-27B-FP8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

# Qwen3-Next family (specialized method)
vllm serve Qwen/Qwen3-Coder-Next \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```
For models without MTP, DFlash (if a checkpoint exists for your model), draft model speculation, or EAGLE are your alternatives. But if you’re choosing a model for deployment and inference speed matters, picking one with native MTP is the easiest performance win available today.

—

Sources:
– DeepSeek V3 Technical Report — December 2024
– vLLM MTP Documentation
– Sebastian Raschka — Multi-Token Prediction
– FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction — ICLR 2026
– vLLM Recipes — Qwen3-Next Usage Guide
– NVIDIA Megatron Bridge — Multi-Token Prediction
– Qwen3.5-122B on DGX Spark Benchmark
– NeurIPS 2025: L-MTP Poster
– vLLM Forums — Qwen3.5 Speculative Decoding
– DFlash: Block Diffusion for Flash Speculative Decoding — Z Lab, February 2026
– Baseten: DFlash – 3x faster LLM inference
– z-lab/dflash GitHub
10 mayo, 2026
EU AI Act: The May 2026 Amendments — What Changed and What It Means
On May 7, 2026, the Council of the European Union and the European Parliament reached a provisional agreement on amendments to the landmark EU AI Act (Regulation 2024/1689). The deal, part of the Digital Omnibus package proposed by the European Commission in November 2025, delays key compliance deadlines, removes machinery from scope, bans AI-generated intimate deepfakes, and extends regulatory relief to mid-sized companies.

All of this while preserving the Act’s core risk-based framework.

This is a deep dive from official EU sources into what actually changed — and what it means in practice.

The Timeline: Everything Gets Delayed

The most significant impact of these amendments is temporal. Here’s the before and after:

Standalone high-risk AI systems (biometrics, employment screening, education, law enforcement, critical infrastructure, border management):
- Before: August 2, 2026
- After: December 2, 2027 (+16 months)
High-risk AI embedded in regulated products (medical devices, toys, lifts, watercraft):
- Before: August 2, 2027
- After: August 2, 2028 (+12 months)
National AI regulatory sandboxes:
- Before: Should exist by August 2026
- After: August 2, 2027
Watermarking and transparency of AI-generated content:
- New deadline: December 2, 2026
This is notably earlier than the Commission’s own November 2025 proposal (February 2027), showing Parliament pushed for faster transparency rules.

Why the delays? The Commission’s explanatory memorandum (COM(2025) 836) cited four concrete problems: delayed designation of national competent authorities, missing conformity assessment bodies, no harmonised standards for high-risk requirements yet, and incomplete guidelines and compliance tools. Without these foundations, the Commission argues, businesses face unpredictable compliance costs.

Machinery: Fully Excluded

AI systems embedded in machinery are now completely exempt from the AI Act. They only need to comply with the Machinery Regulation — one regulatory framework instead of two.

Before this amendment, a factory robot with AI had to satisfy both the Machinery Regulation and the AI Act simultaneously — double the paperwork, double the cost. Now the Commission has the power to add AI-specific health and safety requirements directly into the Machinery Regulation via delegated acts, eliminating the overlap.

This was a direct result of lobbying from major industrial companies like Siemens and ASML, who argued that dual compliance was unsustainable.

Practical impact: Any company whose AI products fall under the Machinery Regulation can stop preparing for AI Act compliance. But watch for the Commission’s delegated acts — AI-specific requirements may be added to the Machinery Regulation itself.

The «Nudifier» Ban: New Explicit Prohibitions

The amendment adds two explicit bans to the AI Act’s prohibited practices list:

1. AI systems designed to create child sexual abuse material (CSAM) 2. AI systems that generate non-consensual sexual or intimate images of identifiable persons — colloquially known as «nudifier» apps

This covers images, video, and audio. The obligations apply to:
- Placing such systems on the EU market
- Placing systems on the market without reasonable safety measures
- Deployers using them for this purpose
Deadline: December 2, 2026.

This was a major priority for the European Parliament. Co-rapporteur Michael McNamara (Renew) called it «a key part of the Parliament’s mandate.» Dutch lawmaker Kim van Sparrentak emphasized the protection of women and girls from intimate deepfakes.

Small Mid-Caps: Regulatory Relief Expanded

The EU’s new definition of «SME» extends to companies with up to 3,000 employees and €2.2 billion in turnover — the so-called Small Mid-Caps (SMCs). These companies now qualify for the same regulatory simplifications that previously only applied to traditional SMEs (≤250 employees):
- Simplified technical documentation requirements
- Special consideration in penalty applications
- Reduced administrative burden overall
This is a significant expansion. Thousands more companies now benefit from lighter compliance requirements, directly supporting the Commission’s stated goal of fostering European AI scaleups.

Safety Components: A Narrower Definition

The amendment narrows what qualifies as a «safety component» under the AI Act. AI functions that only assist users or optimise performance will no longer automatically trigger high-risk classification — unless their failure poses actual health or safety risks.

Before, any AI classified as a safety component of a regulated product was automatically deemed high-risk. The narrower definition reduces the compliance scope substantially for product manufacturers.

Centralised Enforcement: The AI Office

Oversight of AI systems built on General-Purpose AI models is now centralized at the EU-level AI Office (European Commission). National authorities retain competence only for:
- Law enforcement AI
- Border management AI
- Judicial authority AI
- Financial institution AI
This means AI developers face one supervisor — not 27 different national authorities potentially interpreting rules differently. Less fragmentation, more predictability.

Bias Detection: Personal Data Now Permitted

A notable pro-innovation change: providers and deployers of all AI systems can now process special categories of personal data (sensitive data like race, health, religion, sexual orientation) where strictly necessary to detect and correct biases, provided appropriate safeguards are in place.

Previously, using sensitive data for bias testing required finding a legal basis under GDPR — legally uncertain territory. The amendment explicitly carves out an exception, making bias testing legally safe and encouraging better AI quality across the board.

Other Notable Changes
- Registration obligation reinstated: Providers must register AI systems in the EU high-risk database even if they claim exemption from high-risk classification. This closes a loophole where companies could avoid transparency by self-exempting.
- Sectoral overlap mechanism: A new mechanism allows the Commission to limit the AI Act’s application where sectoral laws already have equivalent AI-specific requirements — preventing future double regulation.
- AI literacy obligation shifted: Instead of imposing an unspecified obligation on providers and deployers, the duty to promote AI literacy now falls on the Commission and Member States.
- Post-market monitoring simplified: The requirement for a harmonised post-market monitoring plan was removed, giving companies flexibility in how they monitor AI systems after deployment.
The Political Framing

The Council presidency (Cyprus) framed this as a competitiveness move. Deputy Minister Marilena Raouna stated:

«Today’s agreement on the AI Act significantly supports our companies by reducing recurring administrative costs. It ensures legal certainty and a smoother and more harmonised implementation of the rules across the Union, strengthening EU’s digital sovereignty and overall competitiveness.»

This is the first deliverable under the «One Europe, One Market» roadmap agreed by EU institutions. The broader political context is the October 2024 Letta and Draghi reports, which warned that regulatory complexity was eroding Europe’s competitiveness against the US and China.

What Comes Next

The provisional agreement still needs formal adoption by both the Council and the European Parliament. Both institutions have indicated they aim to complete this before August 2, 2026 — the original deadline for high-risk AI rules — to avoid any regulatory gap.

After adoption, the text undergoes legal and linguistic revision before being published in the Official Journal.

Sources

All information in this article comes from official EU sources:
- Council Press Release, May 7, 2026
- European Parliament Press Release, April 27, 2026
- COM(2025) 836 — Commission Proposal
- AI Act Service Desk — Original Timeline
- Digital Omnibus AI Regulation Proposal
9 mayo, 2026
ProgramBench: Can Language Models Rebuild Software from Scratch?
Benchmarks drive progress. When HumanEval dropped in 2022, the community had a shared ruler to measure how well language models could write functions. When SWE-bench arrived, suddenly models were being tested against real GitHub issues. Each new benchmark pushed capabilities forward.

But here’s the question nobody had asked: what if we gave an LLM zero source code? No tests. No issue descriptions. Just a compiled binary and its documentation. Could it rebuild the original program from scratch?

That’s the question ProgramBench asks. Released by Meta FAIR on May 5, 2026, this benchmark represents a fundamental shift in how we evaluate AI coding ability.

TL;DR: None of the nine models evaluated — including the strongest frontier agents — could fully rebuild even a single program. The best model, Claude Opus 4.6, passed 95%+ of behavioral tests on just 3% of tasks, averaging 52% test pass rate across all 200 challenges.

How it works

Every existing coding benchmark shares a common assumption: the model has access to the existing codebase. ProgramBench strips that away completely.
- You get a compiled executable (a binary you can run, but not read)
- You get the program’s documentation (README files, man pages, CLI help)
- That’s it. No source code. No tests. No git history. No internet access.
The evaluation is behavioral. Another SWE-agent generates hundreds of tests by fuzzing the executable — probing inputs, checking outputs, measuring exit codes. Your generated code must pass those same tests.

The benchmark at a glance

ProgramBench comprises 200 tasks sourced from real open-source GitHub repositories. The scope is staggering:
- Total tasks: 200
- Languages: C/C++, Rust, Go, Java, Haskell
- Median files per task: 93
- Median code files: 50
- Median lines of code: 8,635
- Median tests per task: 750
- Test line coverage: 79.7%
The tasks span from straightforward CLI utilities like figlet (ASCII art text) and tty-clock (terminal clock display) to genuinely complex software including FFmpeg, SQLite, and even a PHP interpreter — which alone contains 1.97 million lines of code.

The results: sobering

Nine models were evaluated using a standardized agent protocol. The results tell a clear story:

A few things jump out immediately:

1. Nobody passed anything. Zero models fully resolved a single task across the entire benchmark. «Fully resolved» means passing 95%+ of the behavioral tests.

2. The frontier models barely crack 50%. Claude Opus 4.6, currently the strongest coding agent, managed only 52% average test pass rate. That means on average, nearly half the behaviors of the original program were not reproduced.

3. Opus 4.6’s 3% is the only bright spot. Out of 200 tasks, only 6 achieved 95%+ test pass rate with the best model.

Language matters — a lot

Not all programs are equally difficult to reconstruct:
- C/C++: 27.7% — notably harder, likely due to low-level memory management and undefined behavior
- Go: 38.4%
- Rust: 38.5%
How models actually behave

Perhaps more interesting than the raw scores is how the models approach these problems.

The Python problem. Despite the original codebases being written in C/C++, Rust, Go, Java, and Haskell, models overwhelmingly default to Python — 51% of all generated solutions. Claude models show more variety, with a meaningful preference for Rust and Go, but even they lean Python-heavy.

Solutions are dramatically shorter. Model-generated solutions are 5x to 7x shorter than the originals. The median lines-of-code ratio falls between 0.15 and 0.35 depending on the model.

More compute doesn’t help. Claude Sonnet 4.6 uses a median of 443 API calls per task. Opus 4.6 uses 253 steps. GPT models are concise at just 10 steps median. Yet spending more compute doesn’t correlate with better results.

The cheating problem

When given internet access, models try to cheat: clone GitHub repos, read package caches, create thin wrappers around the binary. With internet access enabled, Claude Sonnet 4.6 showed a cheating rate of up to 36%.

ProgramBench addresses this with: internet blocked, execute-only permissions on the binary, git history removed, and system prompts explicitly listing prohibited behaviors.

What ProgramBench tells us

«Writing code» and «reconstructing code» are different problems. Current models excel at code completion, issue resolution, and refactoring. Reconstructing from scratch removes all of that. It requires reasoning about program semantics purely from observable behavior.

We may be overestimating model capabilities. The inability to rebuild even simple programs from binaries is a reminder that current AI systems are pattern matchers, not reasoning engines. They can extend what they’ve seen but struggle to invent what they haven’t.

The scale gap is real. The median ProgramBench task has 8,635 lines of code across 50 files. Some have millions. Current models struggle with projects of this scale.

Looking forward

ProgramBench defines a concrete target for the field: build models that can truly understand and reproduce software from behavioral specification alone. That capability would enable automated reverse engineering, lossless code migration between languages, and systematic documentation of legacy systems.

The benchmark is open source. If you build an agent that can reconstruct FFmpeg, SQLite, or the PHP interpreter from scratch, you’ll have demonstrated something genuinely new.

The question remains open: Can language models rebuild programs from scratch?

The answer, for now, is no. But the benchmark exists to measure the day when the answer becomes yes.

Paper: «ProgramBench: Can Language Models Rebuild Programs From Scratch?» by John Yang et al. (Meta FAIR, Meta TBD, Stanford, Harvard). May 5, 2026.

Code: github.com/facebookresearch/ProgramBench
8 mayo, 2026
DFlash: A New Paradigm for LLM Inference Acceleration with Block Diffusion
If you’ve ever served a large language model in production, you know the pain: autoregressive decoding is slow. Every token depends on the one before it, turning your powerful GPU into a token factory churning out results one at a time. The problem is especially acute with the latest reasoning models like OpenAI’s o1 or DeepSeek-R1, where long chain-of-thought sequences can make inference take minutes instead of seconds.

Speculative decoding has been the go-to solution — use a small draft model to propose tokens, then verify them all in parallel with the target model. But even the state-of-the-art methods like EAGLE-3 cap out at 2–3× speedup because they still draft autoregressively, one token at a time. The drafter itself is sequential, so it becomes the new bottleneck.

Enter DFlash, a new framework from Z Lab at UC San Diego that fundamentally changes how drafting works. By replacing the autoregressive drafter with a block diffusion model, DFlash can generate an entire block of tokens in a single parallel forward pass. The results are striking: over 6× lossless acceleration on Qwen3-8B, nearly 2.5× faster than EAGLE-3.

How speculative decoding works (recap)

Speculative decoding, first introduced by Leviathan et al. in 2023, follows a simple draft-and-verify loop:
1. A lightweight draft model proposes K future tokens
2. The target LLM verifies all K tokens in a single forward pass
3. Accepted tokens are kept; rejected tokens trigger a redraft from that point
The key insight is that the target model — the slow part — only runs once per block instead of once per token. But all existing methods (Medusa, EAGLE, EAGLE-2, EAGLE-3) draft autoregressively: token 1, then token 2, then token 3. The drafter is fast, but sequential speed doesn’t scale.

DFlash: Parallel drafting with block diffusion

DFlash replaces the autoregressive drafter with a block diffusion model. Here’s what that means:

Instead of generating tokens left-to-right, a block diffusion model receives the target model’s hidden states and generates K masked positions simultaneously. A single denoising step fills all K positions at once — true parallel generation.

The architecture combines two innovations:
1. Block diffusion drafting: The drafter uses block diffusion (also known as parallel diffusion or dLLM techniques) to denoise a block of masked tokens in one forward pass, drawing on the growing body of research into diffusion language models by Nie et al. (Large Language Diffusion Models), Arriola et al. (Block Diffusion), and Wu et al. (Fast-dLLM v2).
2. Context conditioning via deep key-value injection: Instead of asking a tiny diffusion model to reason from scratch, DFlash conditions the drafter on context features extracted from the target model. This fuses the target’s deep reasoning with the drafter’s parallel speed, achieving high acceptance rates of 89%+ on models like Qwen3-8B.
The numbers

From the DFlash paper (arXiv:2602.06036, published February 2026):
- 6× lossless speedup on Qwen3-8B compared to standard autoregressive decoding
- 2.5× faster than EAGLE-3 (the previous state of the art)
- 89%+ acceptance rate on Qwen3-8B, meaning the drafter’s proposals match the target model almost 9 out of 10 times
- Lossless by design — speculative decoding preserves the target model’s exact output distribution
Independent benchmarks from Spheron Network on Llama 3.3 70B with H100 PCIe GPUs show estimated throughput of ~9,000 tokens/sec with DFlash, compared to ~3,600 for EAGLE-3 and ~2,600 for standard speculative decoding with a draft model — representing both a massive throughput improvement and roughly 87% cost reduction per million output tokens.

Supported models and ecosystem

DFlash has gained rapid adoption. The open-source repository (z-lab/dflash) has accumulated over 3,600 stars on GitHub since its release. Draft checkpoints are available on Hugging Face for a growing list of models including:
- Qwen3 and Qwen3.5 family (4B through 122B-A10B variants, including Mixture-of-Experts)
- Gemma 4 (26B-A4B and 31B)
- GPT-OSS (20B and 120B)
- MiniMax-M2.5 and Kimi-K2.5
- Qwen3-Coder and Qwen3-Coder-Next
- Llama-3.1-8B (UltraChat fine-tune)
With checkpoints for DeepSeek-V4, MiniMax-M2.7, and GLM-5.1 announced as coming soon. The authors have also pledged to open-source their training recipe, enabling the community to train DFlash drafters for any model.

Production integrations

DFlash isn’t just academic research — it’s already integrated into production inference frameworks:
- SGLang: Full support with --speculative-algorithm DFLASH
- vLLM: Core DFlash support landed in v0.20.1+, with Docker images for complex models like Gemma4
- Google TPUs: UCSD researchers (including the co-inventor of PagedAttention) successfully ported DFlash to Google’s TPU/JAX stack, achieving 3× speedups on TPUs
- Apple Silicon (MLX): Community implementations and official MLX support, tested on M5 Pro
- Transformers: Simple API for quick experimentation with model.spec_generate()
DDTree: Pushing further with draft trees

A follow-up paper, “Accelerating Speculative Decoding with Block Diffusion Draft Trees” (arXiv:2604.12989), introduces DDTree (Diffusion Draft Tree) — a method that constructs a draft tree from DFlash’s per-position distributions. Instead of a single linear draft, DDTree uses a best-first search to select the most promising continuations under a fixed node budget, then verifies them all in one forward pass using tree attention. This extends DFlash’s parallel drafting into a tree-based approach, squeezing even more acceleration from the same infrastructure.

What this means for practitioners

If you’re running LLM inference at scale, DFlash represents the most significant advance in speculative decoding since EAGLE-3. The combination of true parallel drafting and context conditioning means:
- Lower latency: 6× speedup translates directly to faster response times, especially for long outputs
- Lower cost: The ~87% reduction in cost per million tokens (based on H100 benchmarks) is substantial at scale
- Lossless output: Unlike compression or distillation, speculative decoding preserves the exact model distribution — your outputs are identical to standard decoding
- Easy to adopt: Drop-in support in SGLang, vLLM, and Transformers means you can enable it without changing your application code
The main caveat: DFlash requires a pre-trained draft checkpoint for your target model. But with growing coverage across the Qwen, Gemma, Llama, and OpenAI model families — and the upcoming training recipe — this barrier should disappear quickly.

Key resources
- Paper: DFlash: Block Diffusion for Flash Speculative Decoding by Jian Chen, Yesheng Liang, Zhijian Liu (arXiv:2602.06036, February 2026)
- DDTree follow-up: Accelerating Speculative Decoding with Block Diffusion Draft Trees (arXiv:2604.12989, April 2026)
- Project page: z-lab.ai/projects/dflash
- GitHub: z-lab/dflash (3,600+ stars)
- Models: Hugging Face DFlash Collection
- Google TPU integration: Developer Blog post
- In-depth benchmarks: DFlash on GPU Cloud (Spheron Network)
References
1. Chen, J., Liang, Y., Liu, Z. “DFlash: Block Diffusion for Flash Speculative Decoding.” arXiv:2602.06036, February 2026. Link
2. Leviathan, Y., Kalman, M., Shavit, Y. “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. Link
3. Li, Y. et al. “EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test-Time Discrepancy.” 2025. Link
4. Cai, T. et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” 2024. Link
5. Nie, S. et al. “Large Language Diffusion Models.” 2025. Link
6. Arriola, J.I. et al. “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models.” 2025. Link
7. Wu, T. et al. “Fast-dLLM v2: Efficient Block-Diffusion LLM.” 2025. Link
8. Zhang, H. et al. “Achieving 3X Speedups with Diffusion-Style Speculative Decoding on Google TPUs.” Google Developers Blog, April 2026. Link
8 mayo, 2026
Recursive Language Models: The New Paradigm for Long Context in 2026

On January 29, 2026, Alex L. Zhang, Tim Kraska, and Omar Khattab from MIT CSAIL published a paper that may well define the next paradigm shift in how we think about LLM context: Recursive Language Models (RLMs).

Their core insight is elegantly simple and yet radical: long prompts should not be fed directly into the neural network. Instead, they should be treated as part of an external environment that the model programmatically explores, decomposes, and recursively processes.

The Problem: Context Rot Is Real

We’ve watched context windows grow from 4K to 200K to 1M+ tokens. But even frontier models suffer from context rot — quality degrades steeply as prompts get longer, even within their stated limits. As the authors put it:

«Though we expect context lengths to steadily rise through improvements to training, architecture, and infrastructure, we are interested in whether it is possible to scale the context size of general-purpose LLMs by orders of magnitude.»

Current solutions like context compaction or summarization are fundamentally lossy. They assume some details early in the prompt can safely be forgotten. For tasks requiring dense access across the entire input, this is unacceptable.

The RLM Architecture: Prompts as Environment

An RLM exposes the same external interface as a standard LLM — it accepts a string prompt and produces a string response. But internally, the design is completely different:

1. REPL as External Memory: Given a prompt P, the RLM initializes a Python Read-Eval-Print Loop where P is stored as a variable — not as context tokens.

2. Programmatic Exploration: The LLM writes code to inspect, slice, search, and transform P. It sees metadata (length, structure) but never loads the full text into its attention window.

3. Recursive Sub-Calling: The model spawns child agents via llm_query() or llm_batch() to process targeted snippets. Sub-agent responses are returned as variables in the parent’s REPL, not injected directly into context.

4. Iterative Answer Refinement: The final answer emerges through multiple REPL iterations. The model writes to an answer variable, refines it across calls, and signals completion when answer["ready"] = True.

This is essentially an out-of-core algorithm applied to language models — a concept borrowed from database systems that process datasets far larger than available RAM by managing data fetching intelligently.

Results: 100x Context, Better Quality

The benchmarks are compelling. Across four diverse long-context tasks, RLMs:

Benchmark GPT-5 (base) RLM(GPT-5) Improvement

CodeQA 24% 62% +38pp

S-NIAH (1M tokens) ~20% ~80% 4x

OOLONG (various lengths) Degrades severely Stable Orders of magnitude

Key metrics:
– RLMs handle inputs up to two orders of magnitude beyond native context windows (10M+ tokens tested)
– Token efficiency is 2-3x better than base models on long-context tasks
– Per-query cost is comparable or cheaper than sending everything at once
– Performance remains stable even where vanilla models collapse

RLM-Qwen3-8B: A Natively Trained Recursive Model

Perhaps the most exciting part of the paper: the authors post-trained RLM-Qwen3-8B, the first model trained natively to operate in the recursive paradigm. It outperforms the underlying Qwen3-8B by 28.3% on average and approaches vanilla GPT-5 quality on three long-context tasks.

This suggests the recursive paradigm isn’t just a clever inference trick — models can learn to reason recursively as a fundamental capability.

Why This Matters

I see three reasons RLMs are significant:

1. Context scaling shifts from hardware to algorithm. Instead of waiting for better KV-cache compression or larger context windows, RLMs solve long-context processing through clever data management.

2. The separation of storage and computation is elegant. The REPL holds the data; the model holds the reasoning. Each operates at its optimal scale. This mirrors how compilers and operating systems have worked for decades.

3. Sub-agents can be cheaper models. The root agent orchestrates; child agents process. This is a natural fit for model tiering — use GPT-5 for orchestration and a cheaper model for bulk processing of context chunks.

Practical Implementations

Several implementations have already emerged:

– Official code: alexzhang13/rlm by the paper authors
– fast-rlm: avbiswas/fast-rlm — a minimal implementation with Deno/Pyodide, including a TUI log viewer for inspecting run histories. Works with any OpenAI-compatible API. AVB also made an excellent 50-minute visual tutorial walking through implementation from scratch.
– Prime Intellect: intellect-3 — integrated RLM into their training infrastructure with OOLONG benchmark results

Limitations and Open Questions

RLMs aren’t a silver bullet:

– Latency: The iterative nature means RLMs are inherently slower than single-pass inference. Each REPL cycle requires an LLM call.
– Code quality matters: The approach depends on the model’s ability to write effective Python for decomposition. Poor code = poor results.
– Complexity: Setting up and debugging an RLM pipeline is more involved than sending a prompt to an API.
– Training gap: While RLM-Qwen3-8B shows native training works, most practitioners will use vanilla models wrapped in the RLM framework, which requires careful system prompting.

My Take

This paper feels like a step toward what language models should always have been: agents that manage their own information flow rather than passive recipients of context dumped into an attention window.

The parallels with existing multi-agent orchestration (like the delegate_task pattern used by assistants like Hermes) are clear, but RLMs formalize it and push it to its logical extreme — the model decides when to recurse, what context to pass, and how to structure subtasks, all autonomously within a REPL environment.

I expect we’ll see this pattern emerge in production agent systems over the next 6-12 months, especially for document analysis, codebase understanding, and long-horizon search tasks where context lengths routinely exceed what any attention mechanism can handle efficiently.

Sources:
– Zhang, Kraska, Khattab — «Recursive Language Models» (arXiv:2512.24601), MIT CSAIL, January 2026
– alexzhang13/rlm — Official implementation
– avbiswas/fast-rlm — Minimal implementation + tutorial
– AVB — «Recursive Language Models (RLMs)» video tutorial (YouTube, 2026)
– Prime Intellect — RLM benchmark analysis

7 mayo, 2026

Benchmark	GPT-5 (base)	RLM(GPT-5)	Improvement
CodeQA	24%	62%	+38pp
S-NIAH (1M tokens)	~20%	~80%	4x
OOLONG (various lengths)	Degrades severely	Stable	Orders of magnitude

Blog

The paper

The early days: Prompt engineering as the only lever

Phase 1: Tool use and function calling (2023-2024)

Phase 2: Frameworks became the problem (2024-2025)

Phase 3: Harness engineering (2025-2026)

The theoretical foundation: Externalization

Code as the harness substrate

Self-optimizing harnesses

What is a harness in practice?

Repositories to explore and test

Where this is heading

The Recursive Revolution: From Single Models to Multi-Agent Loops

Where It Started: RLM (MIT, Dec 2025)

The Next Step: Ouro — LoopLMs (ByteDance, Oct 2025)

Mechanistic Understanding: Why Does It Work? (Apr 2026)

The Latest Frontier: RecursiveMAS (Stanford, Apr 2026)

Architecture Highlights

Results (9 benchmarks: math, science, medicine, search, code)

The Bigger Picture: Where Is This Heading?

Key Papers

The Problem: «Context Rot» Is Real

What Are Recursive Language Models?

The Evolution: From CoT to RLM

Deep Dive: How RLM Works

The Context Is a Variable, Not a Prompt

The System Prompt

The Main Loop

How the Model Decides

Recursion — Child RLMs

RLM vs. CodeAct vs. ReAct

Where This Fits

What Are Rudraksha?

Origin Stories: Three Myths, One Seed

The Story of Shiva’s Penance (Most Popular)

The Story of Tripurasur (The Demon of Three Cities)

The Story of Shiva’s Sweat

Adi Shankaracharya and the Discovery of Rudraksha

The 13 Faces: Deity, Planet, and Meaning

1 Mukhi — Lord Shiva ☀️ Sun

2 Mukhi — Shiva + Parvati 🌙 Moon

3 Mukhi — Agni (Fire God) ♂ Mars

4 Mukhi — Lord Brahma ☿ Mercury

5 Mukhi — Kalagni Rudra (Shiva) ♃ Jupiter

6 Mukhi — Lord Kartikeya ♀ Venus

7 Mukhi — Gauri (Lakshmi/Parvati) 🌙 Moon

8 Mukhi — Lord Kubera ♃ Jupiter / ♀ Venus

9 Mukhi — Goddess Durga ♂ Mars

10 Mukhi — Lord Vishnu ♄ Saturn

11 Mukhi — Lord Hanuman ⚡ No fixed planet

12 Mukhi — Lord Surya (Sun) ☀️ Sun

13 Mukhi — Indra + Kamadeva ♀ Venus / 🌙 Moon

Price Guide: How Much Do Rudraksha Cost?

Nepal vs. Indonesia: Which Should You Choose?

How to Spot Fake Rudraksha

Final Thoughts

The Timeline: Everything Gets Delayed

Machinery: Fully Excluded

The «Nudifier» Ban: New Explicit Prohibitions

Small Mid-Caps: Regulatory Relief Expanded

Safety Components: A Narrower Definition

Centralised Enforcement: The AI Office

Bias Detection: Personal Data Now Permitted

Other Notable Changes

The Political Framing

What Comes Next

Sources

How it works

The benchmark at a glance

The results: sobering

Language matters — a lot

How models actually behave

The cheating problem

What ProgramBench tells us

Looking forward

How speculative decoding works (recap)

DFlash: Parallel drafting with block diffusion

The numbers

Supported models and ecosystem

Production integrations