Tag: Inference Scaling

  • Recursive Language Models: The New Paradigm for Long Context in 2026

    On January 29, 2026, Alex L. Zhang, Tim Kraska, and Omar Khattab from MIT CSAIL published a paper that may well define the next paradigm shift in how we think about LLM context: Recursive Language Models (RLMs).

    Their core insight is elegantly simple yet radical: long prompts should not be fed directly into the neural network. Instead, they should be treated as part of an external environment that the model programmatically explores, decomposes, and recursively processes.

    The Problem: Context Rot Is Real

    We’ve watched context windows grow from 4K to 200K to 1M+ tokens. But even frontier models suffer from context rot — quality degrades steeply as prompts get longer, even within their stated limits. As the authors put it:

    "Though we expect context lengths to steadily rise through improvements to training, architecture, and infrastructure, we are interested in whether it is possible to scale the context size of general-purpose LLMs by orders of magnitude."

    Current solutions like context compaction or summarization are fundamentally lossy. They assume some details early in the prompt can safely be forgotten. For tasks requiring dense access across the entire input, this is unacceptable.

    The RLM Architecture: Prompts as Environment

    An RLM exposes the same external interface as a standard LLM — it accepts a string prompt and produces a string response. But internally, the design is completely different:

    1. REPL as External Memory: Given a prompt P, the RLM initializes a Python Read-Eval-Print Loop where P is stored as a variable — not as context tokens.

    2. Programmatic Exploration: The LLM writes code to inspect, slice, search, and transform P. It sees metadata (length, structure) but never loads the full text into its attention window.

    3. Recursive Sub-Calling: The model spawns child agents via llm_query() or llm_batch() to process targeted snippets. Sub-agent responses are returned as variables in the parent’s REPL, not injected directly into context.

    4. Iterative Answer Refinement: The final answer emerges through multiple REPL iterations. The model writes to an answer variable, refines it across calls, and signals completion by setting answer["ready"] = True (a minimal sketch of this loop follows below).
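
    To make the loop concrete, here is a minimal sketch of how such a control loop might be wired up. It is an illustration under assumptions, not the authors' implementation: the names call_root_llm, llm_query, and run_rlm are placeholders, and a real system would sandbox the exec() call.

    def call_root_llm(prompt: str) -> str:
        # Stand-in for a chat-completion call to the root (orchestrating) model.
        raise NotImplementedError("wire this to any OpenAI-compatible client")

    def llm_query(snippet: str, task: str) -> str:
        # Stand-in for a recursive sub-call that processes one targeted snippet.
        raise NotImplementedError

    def run_rlm(long_prompt: str, question: str, max_steps: int = 10) -> str:
        # The full prompt lives in the REPL namespace as a variable P;
        # it is never placed in the root model's context window.
        env = {
            "P": long_prompt,
            "llm_query": llm_query,
            "answer": {"text": "", "ready": False},
        }
        for _ in range(max_steps):
            # The root model only sees metadata (here, just the length of P)
            # and replies with Python that inspects, slices, and sub-queries P.
            code = call_root_llm(
                f"len(P) = {len(long_prompt)}. Question: {question}. "
                "Reply with Python that uses P, llm_query(), and answer."
            )
            exec(code, env)  # run the model's code inside the REPL environment
            if env["answer"]["ready"]:  # the model signals completion itself
                break
        return env["answer"]["text"]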

    This is essentially an out-of-core algorithm applied to language models — a concept borrowed from database systems that process datasets far larger than available RAM by managing data fetching intelligently.

    Results: 100x Context, Better Quality

    The benchmarks are compelling. Across four diverse long-context tasks, RLMs deliver large gains over the base model:

    Benchmark                  GPT-5 (base)        RLM(GPT-5)   Improvement
    CodeQA                     24%                 62%          +38 pp
    S-NIAH (1M tokens)         ~20%                ~80%         4x
    OOLONG (various lengths)   Degrades severely   Stable       Orders of magnitude

    Key metrics:
    – RLMs handle inputs up to two orders of magnitude beyond native context windows (10M+ tokens tested)
    – Token efficiency is 2-3x better than base models on long-context tasks
    – Per-query cost is comparable to, or cheaper than, sending everything at once (see the back-of-envelope sketch after this list)
    – Performance remains stable even where vanilla models collapse
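
    The cost claim is easiest to see with a back-of-envelope calculation. The numbers below are purely hypothetical assumptions for illustration; they are not figures from the paper.

    INPUT_TOKENS = 1_000_000        # hypothetical long prompt
    PRICE_PER_M = 1.0               # hypothetical $ per 1M input tokens

    # Baseline: stuff the entire prompt into one call.
    baseline_cost = INPUT_TOKENS / 1e6 * PRICE_PER_M

    # RLM (hypothetical trace): the root model reads only metadata and its own
    # code (~5K tokens per step over ~10 steps) and delegates ~15 relevant
    # 20K-token chunks to sub-calls, instead of attending over the whole prompt.
    rlm_tokens = 10 * 5_000 + 15 * 20_000
    rlm_cost = rlm_tokens / 1e6 * PRICE_PER_M

    print(baseline_cost, rlm_cost)  # 1.0 vs 0.35 under these assumptions

    Whether an RLM ends up cheaper in practice depends on how much of the prompt the task actually touches; dense tasks that need nearly everything will narrow or erase this gap.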

    RLM-Qwen3-8B: A Natively Trained Recursive Model

    Perhaps the most exciting part of the paper: the authors post-trained RLM-Qwen3-8B, the first model trained natively to operate in the recursive paradigm. It outperforms the underlying Qwen3-8B by 28.3% on average and approaches vanilla GPT-5 quality on three long-context tasks.

    This suggests the recursive paradigm isn’t just a clever inference trick — models can learn to reason recursively as a fundamental capability.

    Why This Matters

    I see three reasons RLMs are significant:

    1. Context scaling shifts from hardware to algorithm. Instead of waiting for better KV-cache compression or larger context windows, RLMs solve long-context processing through clever data management.

    2. The separation of storage and computation is elegant. The REPL holds the data; the model holds the reasoning. Each operates at its optimal scale. This mirrors how compilers and operating systems have worked for decades.

    3. Sub-agents can be cheaper models. The root agent orchestrates; child agents process. This is a natural fit for model tiering: use GPT-5 for orchestration and a cheaper model for bulk processing of context chunks (a sketch follows below).
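
    As a rough illustration of how such tiering could be wired in, the snippet below routes the orchestration step and the recursive sub-calls to different models. The chat helper and both model names are placeholders, not part of the paper or any specific library.

    ROOT_MODEL = "frontier-model"   # hypothetical: strong model for orchestration
    WORKER_MODEL = "small-model"    # hypothetical: cheap model for bulk chunk work

    def chat(model: str, prompt: str) -> str:
        # Stand-in for any OpenAI-compatible chat-completion call.
        raise NotImplementedError

    def llm_query(snippet: str, task: str) -> str:
        # Recursive sub-calls process raw context chunks with the cheaper model.
        return chat(WORKER_MODEL, f"{task}\n\n{snippet}")

    def plan_next_step(repl_state: str) -> str:
        # Only the orchestration step pays for the frontier model.
        return chat(ROOT_MODEL, repl_state)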

    Practical Implementations

    Several implementations have already emerged:

    Official code: alexzhang13/rlm by the paper authors
    fast-rlm: avbiswas/fast-rlm — a minimal implementation with Deno/Pyodide, including a TUI log viewer for inspecting run histories. Works with any OpenAI-compatible API. AVB also made an excellent 50-minute visual tutorial walking through implementation from scratch.
    Prime Intellect: intellect-3 — integrated RLM into their training infrastructure with OOLONG benchmark results

    Limitations and Open Questions

    RLMs aren’t a silver bullet:

    Latency: The iterative nature means RLMs are inherently slower than single-pass inference. Each REPL cycle requires an LLM call.
    Code quality matters: The approach depends on the model’s ability to write effective Python for decomposition. Poor code = poor results.
    Complexity: Setting up and debugging an RLM pipeline is more involved than sending a prompt to an API.
    Training gap: While RLM-Qwen3-8B shows native training works, most practitioners will use vanilla models wrapped in the RLM framework, which requires careful system prompting.

    My Take

    This paper feels like a step toward what language models should always have been: agents that manage their own information flow rather than passive recipients of context dumped into an attention window.

    The parallels with existing multi-agent orchestration (such as the delegate_task pattern used by assistants like Hermes) are clear, but RLMs formalize the idea and push it to its logical extreme: the model decides when to recurse, what context to pass, and how to structure subtasks, all autonomously within a REPL environment.

    I expect we’ll see this pattern emerge in production agent systems over the next 6-12 months, especially for document analysis, codebase understanding, and long-horizon search tasks where context lengths routinely exceed what any attention mechanism can handle efficiently.

    Sources:
    Zhang, Kraska, and Khattab, "Recursive Language Models" (arXiv:2512.24601), MIT CSAIL, January 2026
    alexzhang13/rlm — Official implementation
    avbiswas/fast-rlm — Minimal implementation + tutorial
    AVB — «Recursive Language Models (RLMs)» video tutorial (YouTube, 2026)
    Prime Intellect — RLM benchmark analysis