Tag: Production Agents

  • What to Learn, Build, and Skip in AI Agents (2026)

    Summary of the article by Rohit (@rohit4verse), a full-stack engineer specializing in Applied AI, published on X on April 29, 2026. Original: x.com/rohit4verse/status/2049548305408131349


    Every day brings a new framework, a new benchmark, a new «10x» launch. The question stops being «how do I keep up?» and becomes «what is actually signal here?»

    The author spent two years building in this space, cracked multiple offers north of $250k, and now runs the technical side of a stealth company. Here is what he sends to someone asking «what should I actually be paying attention to right now?»

    The Filter

    You need a filter, not a feed. Run every launch through five tests before it touches your stack:

    1. Will this matter in two years? Wrappers around frontier models — probably not.
    2. Has someone you respect built something real and written about it honestly? Postmortems count. Marketing posts do not.
    3. Does adopting it require you to throw away your tracing, retries, or config? Frameworks-trying-to-be-platforms have a 90% mortality rate.
    4. What does it cost you to skip this for six months? For most launches, the answer is nothing.
    5. Can you measure whether it actually helps your agents? If you cannot, you are guessing.

    «When something new launches, write down what you would need to see in six months to believe it matters. Then come back and check.»

    What to Learn

    Focus on concepts that survive model swaps and paradigm shifts. Understand them deeply and you can pick up any new tool in a weekend.

    • Context engineering. Context is state. Every token of irrelevant noise costs reasoning quality. By step eight of a ten-step task, the original goal can be buried under tool output. The teams that ship reliable agents actively summarize, compress, and prune — they think about the context window the way an experienced engineer thinks about RAM (see the pruning sketch after this list).
    • Tool design. Five to ten well-named tools beat twenty mediocre ones. Tool names should read like English verb phrases. Error messages should be feedback the model can act on (tool example below).
    • The orchestrator-subagent pattern. Naive multi-agent systems fail catastrophically. The pattern that works: an orchestrator that delegates narrowly scoped read-only tasks to isolated subagents, then synthesizes their results. Default to single-agent. Reach for orchestrator-subagent only when you hit a real wall (pattern sketch below).
    • Evals and golden datasets. Every team that ships reliable agents has evals. Every team that does not, does not. This is the single highest-leverage habit in the field. An eval is a unit test that holds the agent honest while everything else changes underneath it. Build a labeled set on day one — fifty examples hand-labeled in an afternoon (eval sketch below). There is no excuse.
    • Think-act-observe loop with file-system-as-state. The model is stateless. The harness has to be stateful. The harness is doing more work than the model in any production agent worth its compute bill (loop sketch below).
    • MCP conceptually. Do not just learn how to call MCP servers — learn the model: a clean separation between agent capabilities, tools, and resources, with an extensible auth and transport story underneath (server sketch below).
    • Sandboxing as a primitive. Process isolation, network egress controls, secret scoping, auth boundaries. Not a feature you add when a customer asks — primitive infrastructure.
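
    A minimal sketch of that pruning pass, in Python. The message shape, the four-characters-per-token heuristic, and the summarize() stub are illustrative assumptions, not the article's code:

        def estimate_tokens(text: str) -> int:
            return len(text) // 4  # rough heuristic: ~4 chars per token in English

        def summarize(text: str) -> str:
            # Stand-in for a cheap model call that compresses old tool output.
            return text[:400] + " ...[compressed]"

        def prune_context(messages: list[dict], budget: int = 8_000) -> list[dict]:
            """Keep the goal and recent turns verbatim; compress older tool output."""
            system, *history = messages      # messages[0] carries the original goal
            kept, used = [], estimate_tokens(system["content"])
            for msg in reversed(history):    # walk newest-first
                cost = estimate_tokens(msg["content"])
                if used + cost > budget and msg["role"] == "tool":
                    # Old tool output is the first thing to compress.
                    msg = {**msg, "content": summarize(msg["content"])}
                    cost = estimate_tokens(msg["content"])
                if used + cost > budget:
                    break                    # drop everything older than this
                kept.append(msg)
                used += cost
            return [system, *reversed(kept)]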
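
    What those tool-design rules look like in practice: an English verb-phrase name, a tight schema, and errors the model can act on. The schema shape follows the common JSON-schema tool convention; the invoice data and the list_customers hint are invented for the example:

        INVOICES = [{"id": "INV-1042", "customer": "Acme Corp", "amount": 1200}]

        SEARCH_INVOICES_TOOL = {
            "name": "search_invoices_by_customer",  # reads like an English verb phrase
            "description": "Find invoices matching a customer name. Returns up to 20 rows.",
            "input_schema": {
                "type": "object",
                "properties": {"customer": {"type": "string"}},
                "required": ["customer"],
            },
        }

        def search_invoices_by_customer(customer: str) -> str:
            matches = [i for i in INVOICES if customer.lower() in i["customer"].lower()]
            if not matches:
                # Feedback the model can act on, not a bare exception.
                return (f"No invoices match '{customer}'. Try a shorter substring, "
                        "or call list_customers to see valid names.")
            return "\n".join(f"{i['id']}  {i['customer']}  ${i['amount']}" for i in matches[:20])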
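
    A sketch of the orchestrator-subagent pattern under those constraints. call_model() is a hypothetical wrapper around whatever model client you use; the article prescribes no API:

        READ_ONLY_TOOLS: list[dict] = []  # e.g. web search, file reads; nothing that mutates

        def call_model(prompt: str, tools: list[dict] | None = None) -> str:
            raise NotImplementedError("wire in your model client here")

        def run_subagent(task: str) -> str:
            """Each subagent gets a fresh, isolated context and read-only tools."""
            return call_model(
                f"You are a research subagent. Task: {task}\n"
                "Use only the provided read-only tools. Return a concise report.",
                tools=READ_ONLY_TOOLS,
            )

        def orchestrate(goal: str, subtasks: list[str]) -> str:
            reports = [run_subagent(t) for t in subtasks]  # narrowly scoped, isolated
            return call_model(                             # orchestrator synthesizes
                f"Goal: {goal}\n\nSubagent reports:\n"
                + "\n---\n".join(reports)
                + "\n\nSynthesize a final answer."
            )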
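
    The eval habit in its simplest runnable form: a labeled JSONL file and a pass rate that gates deploys. The file name, record shape, and substring grader are assumptions; start this simple and swap in an LLM judge later:

        import json

        def run_evals(agent, path: str = "golden_set.jsonl", threshold: float = 0.9) -> float:
            with open(path) as f:
                cases = [json.loads(line) for line in f]
            passed = sum(
                1 for case in cases
                if case["expected"].lower() in agent(case["input"]).lower()
            )
            score = passed / len(cases)
            # Gate on this: a prompt or model change that drops the score fails CI.
            assert score >= threshold, f"eval regression: {score:.0%} < {threshold:.0%}"
            return score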
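
    The loop itself, with the file system as the harness's state. call_model_json() stands in for a model call that returns a JSON action; the {"tool": ..., "args": ...} shape is an assumption:

        import json
        from pathlib import Path

        STATE_DIR = Path("agent_state")

        def call_model_json(goal: str, history: list[dict]) -> str:
            raise NotImplementedError("model client returning a JSON action goes here")

        def agent_loop(goal: str, tools: dict, max_steps: int = 10) -> str:
            STATE_DIR.mkdir(exist_ok=True)
            history: list[dict] = []
            for step in range(max_steps):
                action = json.loads(call_model_json(goal, history))        # think
                if action["tool"] == "finish":
                    return action["args"]["answer"]
                observation = tools[action["tool"]](**action["args"])      # act
                history.append({"action": action, "observation": observation})  # observe
                # The model is stateless; the harness persists every step to disk,
                # so a crash or restart resumes instead of starting over.
                (STATE_DIR / f"step_{step}.json").write_text(json.dumps(history[-1]))
            return "stopped: max steps reached"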
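
    And the server side of that separation, assuming the FastMCP helper from the official MCP Python SDK (pip install mcp); the server name, tool, and resource here are invented for illustration:

        from mcp.server.fastmcp import FastMCP

        mcp = FastMCP("invoice-server")

        @mcp.tool()
        def search_invoices_by_customer(customer: str) -> str:
            """Find invoices matching a customer name."""
            return f"no invoices found for {customer!r}"  # real lookup goes here

        @mcp.resource("invoices://recent")
        def recent_invoices() -> str:
            """Read-only resource: the last 20 invoices."""
            return "INV-1042  Acme Corp  $1200"

        if __name__ == "__main__":
            mcp.run()  # stdio transport by default; swapping transports is config, not a rewrite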

    What to Build With (April 2026)

    Pick boringly here. These picks will shift, but slowly.

    • Orchestration: LangGraph (production default). Mastra for TypeScript. Pydantic AI for type-safety fans.
    • Protocol: MCP, full stop. The registry has crossed the point where you can almost always find a server before you need to build one.
    • Memory: Mem0 (chat personalization). Zep (production conversational). Letta (multi-day coherence). Most teams will not need this — add only when you can articulate the failure mode it solves.
    • Observability: Langfuse (OSS default). LangSmith (if a LangChain shop). Braintrust (research-style evals).
    • Sandbox: E2B (code execution). Browserbase (browser automation). Do not run unsandboxed code execution. Ever.
    • Models: Sonnet 4.6 is the cost-performance sweet spot. GPT-5.4/5.5 for CLI reasoning. Gemini 2.5/3 for long context. DeepSeek-V3.2 or Qwen 3.6 when cost matters. Treat models as swappable. Re-evaluate quarterly, not weekly.

    What to Skip

    The cost of skipping is low. The time saved is large.

    • AutoGen / AG2 for production (stalled releases, abstractions do not match production needs)
    • CrewAI for new production builds (demos well, but engineers have moved off it)
    • Microsoft Semantic Kernel (unless locked into the Microsoft enterprise stack)
    • DSPy (niche — for optimizing prompt programs at scale)
    • Standalone code-writing agents as an architecture choice (interesting research, not a production-default pattern)
    • SWE-bench and OSWorld leaderboard chasing (nearly every public benchmark can be gamed)
    • Naive parallel multi-agent architectures (five agents chatting over shared memory falls apart in production)
    • Per-seat SaaS pricing for new agent products (the market has moved to outcome- and usage-based pricing)
    • The next framework on Hacker News this week. Wait six months.

    How to Actually Move

    A sequence that is boring but works:

    1. Pick one outcome that already matters. Some specific problem you are suffering from right now. This constrains every subsequent decision.
    2. Set up tracing and evals before you ship anything. Fifty labeled examples is enough. The cost of building this later is roughly 10x the cost of building it now.
    3. Start with a single-agent loop. LangGraph or Pydantic AI. Three to seven well-designed tools. The file system or a database as state.
    4. Treat the agent as a product, not a project. Every prompt change, every model swap, every tool change goes through evals before deployment.
    5. Add scope only when you have earned it. Let failure modes pull subagents, memory frameworks, and browser use in — do not pre-architect.
    6. Watch your unit economics from day one. A $0.50/run PoC becomes $50K/month at moderate volume (back-of-envelope math below). Teams that do not see it coming get a CFO meeting they do not enjoy.
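
    A back-of-envelope check of those step-6 numbers; the run volume is an assumption chosen to show how fast per-run cost compounds:

        cost_per_run = 0.50        # dollars, the PoC figure
        runs_per_month = 100_000   # "moderate volume": roughly 3,300 runs a day
        print(f"${cost_per_run * runs_per_month:,.0f}/month")  # -> $50,000/month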

    The Actual Point

    The conventional path — pick a stack, master it for years, climb a ladder — worked when the stack was stable for a decade. The stack now changes every quarter. The people winning stopped optimizing for stack mastery and started optimizing for taste, primitives, and ship velocity.

    «You do not need to learn everything. You need to learn the things that compound and skip the things that do not. Build things. Put them on the internet. The era rewards people who make the thing more than people who can describe the thing. There has never been a better window to be the one making.»


    Source: Original article by Rohit (@rohit4verse) on X, April 29, 2026.