Self-Improving AI Agent Hierarchies: A Living Experiment

An evolving multi-agent system that writes, audits, and supervises itself — generating 100 unique Tetris games without human intervention.

Published: April 24, 2026
Author: Don Berto Rascazzione
Tags: AI Agents, Multi-Agent Systems, Reinforcement Learning, Autonomous Systems, Experiment

The Experiment

What happens when you chain AI agents in a strict hierarchy, where each one supervises the one below it, and each one can modify its subordinate’s instructions?

I built a three-tier autonomous system that generates 100 unique Tetris game variants, deploys them to a public gallery, and progressively evolves its own capabilities. The system has been running since April 24, 2026, producing one new game every 15 minutes, each one more sophisticated than the last.

The public gallery lives at xof.es/tetris/

This is a living experiment. The system is still running.

Architecture

The system consists of three cron jobs chained in a strict hierarchy. Each agent only knows its direct subordinate. Lower agents cannot see higher agents. Communication flows one way: top-down modification, bottom-up reporting.

┌─────────────────────────────────────────────────┐
│ Supervisor (every 3 hours)                      │
│ Can modify: Auditor cron prompt                 │
│ Can read: Template evolution, variant count     │
│ Cannot see: Generator cron                      │
└──────────┬──────────────────────────────────────┘
           │ modifies
           ▼
┌─────────────────────────────────────────────────┐
│ Auditor (every 1 hour)                          │
│ Can modify: Generator cron prompt, template     │
│ Can read: All deployed variants, template       │
│ Cannot see: Supervisor cron                     │
└──────────┬──────────────────────────────────────┘
           │ modifies
           ▼
┌─────────────────────────────────────────────────┐
│ Generator (every 15 minutes)                    │
│ Produces: One Tetris variant per cycle          │
│ Uses: Template + theme colors                   │
│ Cannot see: Other crons                         │
└─────────────────────────────────────────────────┘

The Generator

Runs every 15 minutes. Its job is mechanical (a shell sketch follows the list):

1. Pick two random words from a 70-word dictionary (NEON, VAPOR, CYBER, RAVE, etc.)
2. Read a proven HTML5 Tetris template
3. Replace CSS color placeholders with theme-specific values using sed
4. Upload the variant to the server
5. Update the gallery index page
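
Concretely, one cycle reduces to a few lines of shell. This is a minimal sketch, not the actual skill code: words.txt, the variants/ layout, the color values, and update-index.sh are all hypothetical, and only three of the placeholders are shown.

# One Generator cycle (hypothetical paths and helpers throughout)
THEME=$(shuf -n 2 words.txt | xargs)               # e.g. "NEON PUNK"
N=$(( $(ls variants | wc -l) + 1 ))                # next variant number

# Swap CSS color placeholders in the proven template; the engine JS is untouched
sed -e 's/%PRIMARY%/#ff2d95/g' \
    -e 's/%COLOR_I%/#00f0ff/g' \
    -e 's|%BG_GRADIENT%|linear-gradient(#1a0033,#330066)|g' \
    template.html > "variants/tetris-$N.html"

scp "variants/tetris-$N.html" server:/var/www/tetris/   # deploy
./update-index.sh "$N" "$THEME"                          # refresh the gallery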

The template is the key insight. The JavaScript game engine — collision detection, rotation, scoring, audio — is a single proven file that works. Each variant only changes visual styling. This avoids the bugs that occur when LLMs generate game logic from scratch: broken collision detection, ghost piece failures, rotation bugs.

The generator uses template-based substitution, not LLM-generated code. The engine is 550 lines of tested JavaScript. Each variant is ~22KB. The sed replacement takes milliseconds.

The Auditor

Runs every 60 minutes. Its job is twofold:

1. Quality audit: Check all deployed variants for bugs (missing buttons, broken placeholders, wrong file sizes, HTTP errors); a sketch of these checks follows the list
2. Evolution: Modify the template to add new features based on how many variants have been generated
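
The quality audit is equally mechanical. A sketch of the checks, assuming a hypothetical tetris-N.html naming scheme and treating a leftover %PLACEHOLDER% or an undersized file as a failure signal:

for n in $(seq 1 "$COUNT"); do            # COUNT = number of deployed variants
  url="https://xof.es/tetris/tetris-$n.html"
  page=$(curl -sf "$url") || { echo "FAIL #$n: HTTP error"; continue; }
  echo "$page" | grep -q '%[A-Z_]\+%' && echo "FAIL #$n: unreplaced placeholder"
  [ "${#page}" -lt 20000 ] && echo "WARN #$n: well under the expected ~22KB"
done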

The evolution schedule is hardcoded in the Auditor’s prompt:

Variants   Features Added
1–10       Base game with unique color themes
11–20      Screen shake on line clears, improved particles
21–30      Combo counter display, T-spin visual feedback
31–40      Background music (Web Audio API synthesized)
41–50      Ghost piece pulse animation, piece entry effects
51–60      Wall kick indicators, combo screen flash
61–70      Beat-synced background, score celebrations
71–80      Advanced statistics, animated title screen
81–90      Power-ups (bomb, clear row, freeze)
91–100     Mini-games, boss battles, unlockable themes

The Auditor can directly modify the Generator’s cron prompt, so it can change the Generator’s behavior mid-flight. If variants are too similar, it can force more diverse themes. If the template has bugs, it fixes them.

The Auditor has access to the cronjob tool. This is the key architectural decision: the Auditor can rewrite its subordinate’s instructions.
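
The rewrite itself happens through the agent framework's cronjob tool, whose exact interface I won't reproduce here; the effect, though, is equivalent to editing a prompt file in place. A hypothetical illustration in shell:

PROMPT=/etc/agent/generator-prompt.txt   # hypothetical location
# The last batch had near-duplicate themes, so tighten the instructions:
sed -i 's/Pick two random words/Pick two random words not used in the last 10 variants/' "$PROMPT"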

The Supervisor

Runs every 180 minutes (3 hours). Its job is meta-supervision:

1. Health check: Is the Auditor running on schedule? Has it failed?
2. Stagnation detection: Is the template actually growing? Are new features being added?
3. Forced evolution: If the Auditor is lazy (not modifying the template), the Supervisor rewrites the Auditor’s prompt to make it more aggressive

The Supervisor only knows the Auditor. It cannot see the Generator. If the system is broken, the Supervisor pushes the Auditor to fix it. If the Auditor is lazy, the Supervisor rewrites its prompt.
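
The stagnation check can be blunt. Here is a sketch that uses the template's byte count as a proxy for feature growth; the paths and the escalation wording are hypothetical:

PREV=$(cat /var/run/template.size 2>/dev/null || echo 0)
CURR=$(stat -c%s template.html)
if [ "$CURR" -le "$PREV" ]; then
  # No growth across a full cycle: escalate by rewriting the Auditor's prompt
  echo 'MANDATORY: add at least one template feature this cycle and report the diff.' \
    >> /etc/agent/auditor-prompt.txt
fi
echo "$CURR" > /var/run/template.size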

This creates a feedback loop: the Supervisor forces the Auditor to evolve the template, which forces the Generator to produce more sophisticated variants.

The Template Insight

The most important technical decision was separating the proven engine from the mutable style layer.

Bad approach (what the first generation did):

LLM → generates 1,200 lines of HTML5 from scratch → bugs everywhere

Good approach (template-based):

Proven template (550 lines) → sed replaces 20 color placeholders → 22KB variant, zero engine bugs

The template contains:
– A complete Tetris game engine (SRS rotation, wall kicks, 7-bag randomizer, ghost piece, hold piece, scoring, game modes)
– Web Audio API synthesized sounds (no external audio files)
– Canvas-based rendering with particle effects
– Mobile touch controls
– CSS color placeholders (%PRIMARY%, %COLOR_I%, %BG_GRADIENT%, etc.)

The Generator never touches the JavaScript. It only replaces CSS values. The Auditor evolves the JavaScript — adding screen shake, new particle systems, background music — but only after the base engine is proven stable.

This is, in spirit, a reinforcement learning loop: the environment (the Auditor) evaluates the output (the Generator's variants), then modifies the policy (the template) to improve future output.

Why This Works

Isolation prevents chaos

Each agent only knows its direct subordinate. The Supervisor cannot skip the Auditor and modify the Generator directly. This prevents conflicting instructions and creates a clean chain of accountability.

If the Supervisor wanted to change the Generator, it must go through the Auditor. This mirrors biological evolution: mutations propagate through generations, not telepathically.

Templates prevent regression

By keeping the game engine in a single file, the Auditor can add features without breaking core mechanics. The Generator never has to reason about collision detection or rotation logic. It just applies colors.

This is a common pattern in production systems: separate stable infrastructure from mutable configuration. The template is the infrastructure. The theme colors are the configuration.

Schedule differential creates batch learning

The Generator runs 4x per Auditor cycle (60 min vs 15 min). The Auditor runs 3x per Supervisor cycle (180 min vs 60 min).
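
The real jobs are agent prompts rather than shell scripts, but the cadence maps directly onto plain cron (script paths hypothetical):

*/15 * * * *  /opt/agents/generator.sh    # one variant per run
0    * * * *  /opt/agents/auditor.sh      # sees ~4 fresh variants per run
0  */3 * * *  /opt/agents/supervisor.sh   # sees ~3 Auditor cycles per run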

This matters because the Auditor evaluates a batch of 4 variants, not a single one. It can detect patterns: «These four variants are too similar» or «The particle system broke in variants 7–10.» Batch evaluation is more informative than single-sample evaluation.

The Supervisor evaluates the Auditor’s work across 3 cycles, giving it a long-term view: «The template hasn’t grown in three hours» or «Feature additions stopped after variant 25.»

Telegram delivery provides observability

All three agents deliver reports to the same Telegram chat. This creates a shared timeline:

18:45 — Generator: Variant #2 deployed (VAPOR WAVE), 22KB, PASS
19:00 — Generator: Variant #3 deployed (NEON PUNK), 22KB, PASS
19:15 — Generator: Variant #4 deployed (CYBER DISCO), 22KB, PASS
19:30 — Generator: Variant #5 deployed (RETRO BLAZE), 22KB, PASS
19:42 — Auditor: 5 variants checked, all PASS. Added screen shake to template. Updated Generator prompt with new particle density limits.
22:42 — Supervisor: Auditor active, template grew +2KB (4 features added). Evolution: GOOD.

Every change is auditable. If something breaks, you can see exactly which agent made which change and when.
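
Each report is a single Bot API sendMessage call. In shell it reduces to something like this (the token, chat id, and example message are placeholders):

curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
  -d chat_id="${CHAT_ID}" \
  -d text="Generator: Variant #6 deployed (LASER FUNK), 22KB, PASS"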

Theoretical Grounding

This architecture borrows from several research areas:

Hierarchical Reinforcement Learning (HRL): Sutton, Precup, and Singh (1999) introduced the concept of temporally abstract actions (options) in reinforcement learning, where higher-level policies select sub-policies that execute for extended periods. Our hierarchy mirrors this: the Supervisor selects a strategy (how to evolve), the Auditor executes the strategy (which features to add), and the Generator performs the low-level action (produce a variant).

Source: Sutton, R. S., Precup, D., & Singh, S. (1999). «Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning.» Artificial Intelligence 112(1-2): 181-211.

Multi-Agent Systems (MAS): Shoham and Leyton-Brown (2009) define multi-agent systems as collections of autonomous agents that can coordinate, compete, or cooperate. Our system uses a directed acyclic graph (DAG) of influence: each agent influences exactly one other agent. This is a special case of hierarchical MAS where information flow is strictly unidirectional.

Source: Shoham, Y., & Leyton-Brown, K. (2009). «Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations.» Cambridge University Press.

Program Synthesis: The template-based approach is a form of program synthesis where the template defines the program structure and the Generator fills in parameters. This avoids the combinatorial explosion of generating programs from scratch. Similar approaches are used in code generation for web development, where templates separate structure from style.

Source: Solar-Lezama, A. (2008). «Program Synthesis by Sketching.» PhD dissertation, University of California, Berkeley.

Self-Improving Systems: The concept of machines that modify their own programs dates back to Turing (1948) and Ashby’s Homeostat (1952). Modern implementations include DeepMind’s AlphaGo Zero (2017), which improved through self-play without human data, and OpenAI’s Dota 2 agent (2019), which learned through hierarchical multi-agent coordination.

Sources:
– Turing, A. M. (1948). «Intelligent Machinery.» National Physical Laboratory report.
– Ashby, W. R. (1952). «Design for a Brain.» Chapman & Hall.
– Silver, D., et al. (2017). «Mastering the game of Go without human knowledge.» Nature 550: 354-359.
– Berner, C., et al. (2019). «Dota 2 with Large Scale Deep Reinforcement Learning.» arXiv:1912.06680.

What I Learned

1. Templates beat generation

The first variant generated by the LLM from scratch had broken collision detection. The ghost piece calculation used a malformed ternary operator. The game loop spun infinitely when paused. These are the kinds of bugs that take hours to debug in hand-written code.

Switching to a template eliminated all engine bugs. The Generator never touches collision logic. It applies colors. The Auditor evolves features, but only after the engine is proven.

2. Schedule differential is critical

Running the Auditor every 60 minutes (not every 15 minutes) means it evaluates 4 variants per cycle. This batch size is enough to detect patterns but small enough to act quickly. Running the Supervisor every 3 hours gives it a strategic view without micromanaging.

3. Isolation is a feature, not a limitation

Each agent only knowing its subordinate might seem restrictive. But it prevents the common multi-agent problem of conflicting instructions. If the Supervisor could directly modify the Generator, it might contradict the Auditor’s changes. The chain of command ensures consistency.

4. Telegram delivery is the debugging interface

Having all three agents report to the same chat creates a unified timeline. When something breaks, you can see exactly what changed, when, and by whom. This is more informative than log files because it’s conversational and chronological.

5. Evolution requires explicit targets

The Auditor needs explicit feature targets («variants 11-20 add screen shake») or it tends to do nothing. Open-ended instructions like «make it better» result in stagnation. Specific targets force progress.
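
For example, a target in the Auditor’s prompt might read (hypothetical wording, not the actual prompt):

  If the deployed count is 11–20 and the template lacks screen shake on line
  clears, implement it this cycle and report the diff. Ending a cycle with
  neither a bug fix nor a new feature counts as a failure.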

The Gallery

The public gallery at xof.es/tetris/ is a retro 90s disco-themed page with:

– Animated gradient background (hot pink, electric blue, lime green, purple)
– Floating disco ball with rotation animation
– Geometric shapes floating around the page
– CRT scanline overlay effect
– Game cards with neon glow hover effects
– Three Google Fonts: Press Start 2P, Monoton, Bungee Shade
– Blinking and pulsing animations throughout
– Responsive grid layout

Each variant card shows the variant number, theme name, preview colors, and a PLAY button. The gallery auto-updates as new variants are deployed.

Is This Really Reinforcement Learning?

Technically, no. There’s no reward function, no policy gradient, no value network. The «reinforcement» comes from the Auditor evaluating output and modifying the template — which is analogous to policy improvement. The «learning» comes from the template accumulating features over time — which is analogous to updating a value function.
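
Still, the loose correspondence is easy to state:

RL concept            This system
Policy                The template plus the Generator’s prompt
Action                Deploying one variant
Reward signal         The Auditor’s PASS/FAIL audit of each batch
Policy improvement    The Auditor’s template edits
Meta-learning         The Supervisor rewriting the Auditor’s prompt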

A more accurate description is: hierarchical program synthesis with supervised evolution. The Supervisor supervises, the Auditor synthesizes features, the Generator executes.

But the RL analogy is useful because it captures the core insight: agents that evaluate their own output and modify their own policy create a feedback loop that produces improvement over time.

What’s Next

The system is still running. Variant #1 is deployed. The next 99 will be generated automatically, each one more sophisticated than the last.

I’ll update this post as the experiment progresses. Key milestones to watch:

Variant 10: Base variants complete. Auditor should start adding screen shake.
Variant 25: Particle effects and combo displays should be present.
Variant 50: Ghost piece animations and background music should be active.
Variant 75: Beat-synced backgrounds and advanced statistics should be present.
Variant 100: Power-ups, mini-games, and boss battles complete. System completes its lifecycle.

If the system breaks, I’ll document the failure mode and the fix. That’s the point of a living experiment.

Source

The skill that implements this architecture is available in my Hermes Agent setup. The template-based approach, hierarchical cron design, and evolution schedule are documented in the rl-agent-hierarchy skill.

This experiment was built using:
Hermes Agent (Nous Research) — The agent framework running the crons
Qwen3.6-27B via local inference— The model powering all three agents
CDMON shared hosting — The server hosting the gallery
Telegram Bot API — The delivery and observability channel

This is a living document. The system is still running. Check back for updates.