Field Notes · AI Infrastructure

The Supervisor Tier

Most agent systems bubble every failure up to a human supervisor at the top of the stack. The operator becomes the loop. Erlang solved this in 1986. EVE Online re-derived it for 1000-pilot battles in 2008. The agent layer is the third domain to need the pattern. The novelty is the port.

It works at ten agents. It collapses at a hundred.

A specific failure mode shows up the moment you run an agent orchestration system past about a hundred concurrent tasks. The human at the top of the stack stops thinking strategically and starts firefighting. Every red CI run, every stalled task, every silent infrastructure failure bubbles up. The operator becomes the supervision loop by default, because no other layer exists to catch what falls through.

This is the pattern in essentially every public agent framework. LangChain, CrewAI, AutoGen, the entire family. Orchestration is flat: an agent is dispatched, it runs, failures propagate upward. There is no specialist supervisor watching for stalled tasks. There is no watcher that catches an agent burning tokens without making progress. There is no autonomic loop that cancels a task when it hangs. The operator at the top of the stack does all of this manually, in real time, by reading log output and reacting.

The bottleneck is structural, not a function of operator skill. The structure makes the operator the supervision loop. At ten agents the structure works because the operator can keep up. At a hundred the operator stops keeping up. At a thousand the structure has no chance.

The most expensive component in your system is reading log output and clicking restart buttons. That is the bottleneck talking.

Erlang/OTP and the supervision tree.

Erlang, which Ericsson began developing in 1986, and the OTP framework built on top of it solved this for the telecom industry. The pattern is called a supervision tree, and that lineage has been carrying production telecom switches for roughly forty years.

The architecture is: every process has a supervisor. Supervisors do almost nothing themselves. They watch their children, restart them when they crash, and escalate to their own supervisor only when the restart strategy is exhausted. The system is shaped as a tree. Worker processes sit at the leaves, doing the actual work. A small number of supervisors sit above them, doing only supervision. The root of the tree is the only true single point of failure, and even root supervisors can be replicated when uptime requirements justify it.

"Let it crash" is the central philosophy. Workers are expected to fail. Supervisors are expected to restart them. The system tolerates worker death because death is a contained, automatic, structurally-handled event rather than a cascade that bubbles to the top. Defensive programming, try/catch everything, error-recovery sprinkled through worker logic - all of that is the wrong shape in an OTP system. Workers should be simple, do one thing, and crash hard when something goes wrong. Recovery is the supervisor's job, not the worker's.

The tooling is concrete. Each supervisor declares a restart strategy: one_for_one restarts only the failed child, rest_for_one restarts the failed child and every child started after it, and one_for_all restarts every child together when any one fails. Restart intensity limits (at most N restarts in T seconds) stop a crash loop from retrying forever; once the limit is exceeded, the supervisor itself gives up and the failure escalates to its parent. Process linking and monitoring propagate failure signals through the tree. None of this is new. It is forty years old. It runs the kind of telecom infrastructure that has been credited with nine nines of uptime.
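The restart-intensity idea ports to other runtimes with very little code. What follows is a minimal Python sketch, not OTP's implementation: the names OneForOneSupervisor and RestartLimitExceeded, and the injectable clock, are hypothetical and chosen only for illustration. It restarts a crashing child in place and gives up once more than max_restarts crashes land inside a sliding window, which is the point where a real tree would escalate to the parent supervisor.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class RestartLimitExceeded(Exception):
    """Raised when crashes exceed the allowed restart intensity."""


class OneForOneSupervisor:
    """Illustrative supervisor: restarts only the child that crashed."""

    def __init__(self, max_restarts: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_restarts = max_restarts      # the "N" in N restarts per T seconds
        self.window_s = window_s              # the "T"
        self.clock = clock                    # injectable so tests need no sleeping
        self.children: Dict[str, Callable[[], object]] = {}
        self.restart_times: Deque[float] = deque()

    def add_child(self, name: str, start: Callable[[], object]) -> None:
        self.children[name] = start

    def run_child(self, name: str) -> object:
        """Run a child until it returns; restart it each time it raises."""
        while True:
            try:
                return self.children[name]()
            except Exception as exc:
                self._record_restart(name, exc)

    def _record_restart(self, name: str, exc: Exception) -> None:
        now = self.clock()
        self.restart_times.append(now)
        # Drop restarts that have aged out of the sliding window.
        while now - self.restart_times[0] > self.window_s:
            self.restart_times.popleft()
        if len(self.restart_times) > self.max_restarts:
            # Intensity exceeded: stop retrying and let the parent decide.
            raise RestartLimitExceeded(
                f"{name}: more than {self.max_restarts} restarts "
                f"in {self.window_s}s"
            ) from exc
```

A worker registered with add_child can then be as simple as a function that either returns a result or raises; the retry policy lives entirely in the supervisor's two numbers.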

EVE Online fleet coordination at scale.

EVE Online has been running 1000+ pilot battles since 2008. The coordination structure that scales to that pilot count is, independently, a supervision tree.

A 1000-pilot fleet is led by an FC (fleet commander) who personally communicates with about ten sub-FCs. Each sub-FC handles a 100-pilot wing. Each wing has squad anchors handling 10-pilot squads. No tier talks to more than about ten direct reports. Coordination scales because no individual operator is overloaded with more conversation partners than they can track.
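The span-of-control arithmetic is worth making explicit: with about ten direct reports per tier, capacity grows tenfold per tier, so three tiers of command cover a thousand pilots. A small sketch, with tiers_needed as a hypothetical helper name:

```python
def tiers_needed(workers: int, span: int = 10) -> int:
    """Smallest number of command tiers so that span ** tiers >= workers."""
    if workers < 1 or span < 2:
        raise ValueError("need at least one worker and a span of two or more")
    tiers, capacity = 0, 1
    while capacity < workers:
        capacity *= span   # each extra tier multiplies reach by the span
        tiers += 1
    return tiers


print(tiers_needed(10))     # 1
print(tiers_needed(100))    # 2
print(tiers_needed(1000))   # 3
```

Depth grows with the logarithm of fleet size, which is why adding a tier is cheap compared with widening any one person's span.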

Information flows through specialists. Logi (healers) watch their assigned pilots' health directly via shared overlay; the FC never receives "this pilot is taking damage" reports because the logi already saw it and started repping. Scouts watch the boundaries of the grid; the FC sees only "hostile fleet jumping in" calls, not raw position data. Primary-callers announce one target at a time so DPS converges. EWAR specialists track who is being jammed or warp-scrambled. Most coordination flows laterally within a tier, through shared overlay state and broadcast comms, not vertically up to the FC.

The FC of a thousand-pilot fleet is not the supervision loop. The specialists are.

The system survives node loss because recovery is local. If a pilot dies, their logi was already repping the next pilot in the assignment list. If a logi dies, another logi covers via the shared assignment overlay. The fleet does not stop because any one node failed. Each failure is handled at the smallest scope that can handle it; only failures that genuinely require fleet-level coordination ever surface to the FC.

When the server itself is overloaded, EVE invokes TiDi (time dilation): time slows for everyone equally rather than the system failing. Degradation is graceful. Nothing crashes. The fleet keeps fighting, just slower. The system's worst-case behavior is "slow," not "down."

Same pattern as OTP. Different domain. Completely independent re-derivation, twenty-two years later, by people who almost certainly never read the OTP literature.

This isn't analogy, and it isn't coincidence. Both OTP and EVE landed on the same coordination shape because the underlying problem (supervising many failure-prone workers while keeping the whole system alive) has the same answer regardless of substrate. The shape shows up faintly in human organizations too: military command structures, corporate hierarchies, and sports teams all use span-of-control discipline (no tier talking to more than ten reports) and tiered specialization. The discipline itself isn't new.

What separates OTP and EVE from human hierarchies is the mechanism. Workers crash and supervisors restart them automatically, with no negotiation, no review meetings, no exit interviews. Human organizations don't have automatic recovery; you don't restart a failed employee, you replace them slowly over months. The recovery mechanism is what makes the pattern portable to software, and it's why an agent fleet borrows from telecom and games rather than from organizational design.

Tiering the supervisors for agent fleets.

The pattern works for agent fleets the same way it works for telecom processes and spaceship pilots. The structure has three tiers:

Tier 1 · Autonomic (code only, no LLM, no token cost)

One supervisor per failure domain.

A stall canceller watches task progress and cancels work that hasn't ticked in too long. A review-tier watchdog watches the review service's heartbeat and respawns it if it stops responding. A resource-pool monitor watches for credential or worker exhaustion. A deploy-health monitor polls infrastructure deployments. Each one is a small in-process loop on a timer. Each watches a single, well-defined failure domain. None of them call an LLM. None of them cost tokens. Most failure modes the operator currently handles by hand are this shape: a well-defined rule that fires when a specific threshold is crossed.
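A stall canceller of this kind can be sketched in a few dozen lines of Python. Everything here is an assumption made for illustration: the StallCanceller name, the heartbeat/finished/sweep methods, and the cancel callback standing in for whatever the orchestrator actually uses to stop a task. The sweep method is what the timer loop would call on each tick.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StallCanceller:
    """Cancels tasks that have not reported progress within stall_after_s."""

    stall_after_s: float
    cancel: Callable[[str], None]                 # hypothetical orchestrator hook
    clock: Callable[[], float] = time.monotonic   # injectable for testing
    last_tick: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, task_id: str) -> None:
        """Called whenever a task makes observable progress."""
        self.last_tick[task_id] = self.clock()

    def finished(self, task_id: str) -> None:
        """Stop watching a task that ended on its own."""
        self.last_tick.pop(task_id, None)

    def sweep(self) -> List[str]:
        """One pass of the timer loop: cancel and forget every stalled task."""
        now = self.clock()
        stalled = [task_id for task_id, seen in self.last_tick.items()
                   if now - seen > self.stall_after_s]
        for task_id in stalled:
            self.cancel(task_id)
            del self.last_tick[task_id]
        return stalled
```

The rule is a single threshold comparison, which is the point: nothing in this loop needs a model, so it costs no tokens however often it runs.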

Tier 2 · Judgment (selective agent dispatch on rare events)

LLM only when judgment is genuinely needed.

A failure-triage supervisor subscribes to terminal task events. For known failure patterns it takes coded action (regex match on the failure reason, predefined response). For unknown patterns it dispatches a small judgment-tier agent to read the task output and decide whether to retry, escalate, or surrender. The judgment-tier agent runs only when the answer to "what should we do?" cannot be expressed as a rule in code. Most loops never need one. The token cost is bounded because the dispatch is rare and the scope is narrow.
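The rules-first, judgment-second shape can be sketched as follows. This is an illustration under stated assumptions: FailureTriage is a hypothetical name, the example patterns are invented, and judge is a placeholder callable standing in for the dispatch of a judgment-tier agent, not a real framework API.

```python
import re
from typing import Callable, List, Tuple

Action = str  # one of "retry", "escalate", "surrender"


class FailureTriage:
    """Coded responses for known failures; a judge only for unknown ones."""

    def __init__(self, rules: List[Tuple[str, Action]],
                 judge: Callable[[str], Action]) -> None:
        # Compile once; rules are checked in the order given.
        self.rules = [(re.compile(pattern, re.IGNORECASE), action)
                      for pattern, action in rules]
        self.judge = judge        # placeholder for the judgment-tier agent
        self.judge_calls = 0      # lets the operator see how rare dispatch is

    def decide(self, failure_reason: str) -> Action:
        for pattern, action in self.rules:
            if pattern.search(failure_reason):
                return action     # known pattern: no LLM, no tokens
        self.judge_calls += 1
        return self.judge(failure_reason)


triage = FailureTriage(
    rules=[(r"rate limit|\b429\b", "retry"),
           (r"permission denied", "escalate")],
    judge=lambda reason: "surrender",   # stand-in for a real agent call
)
```

Tracking judge_calls against total decisions gives a direct measure of how much of the failure stream is still uncoded, and each new rule moves one failure class from the judgment tier down to the autonomic one.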

Tier 3 · Command (human, only on true escalation)

Strategic intent and the rare event the tree cannot handle.

The operator handles values, constraints, vision, doctrine rewrites. They do not handle routine failure recovery. They do not read log output. They do not click restart buttons. By the time anything reaches the command tier, every autonomic and judgment-tier mechanism has tried and failed, and the failure represents something genuinely novel that the tree as designed cannot resolve. The command tier is for what only humans can decide, not for what humans happen to be doing because nothing else can.

The tier split mirrors the doctrine bifurcation directly. Autonomic supervisors are compiled doctrine: deterministic rules the runtime enforces in code, with no LLM involvement and no token cost. Judgment supervisors are interpretive doctrine: judgment-shaped reactions encoded in a small agent that runs only when the rule cannot be expressed in code. The compiled/interpretive split that defines doctrine engineering is the same split that defines supervision. The substrate's architecture is the bifurcation, applied recursively from doctrine down to operations.

The vocabulary doesn't exist yet.

The supervision-tree pattern is well-trodden in two completely independent domains that predate the agent layer. It is not in the agent literature. Anyone reading LangChain or CrewAI or AutoGen tutorials in 2026 will find no equivalent vocabulary. No Supervisor type. No restart policy. No event-bus-as-fleet-comms abstraction. No autonomic-versus-judgment tier split. No "let it crash" philosophy applied to agent failure handling. The pattern is absent, not because it has been considered and rejected, but because the agent community has not needed it yet.

The reason is straightforward. The 10-agent demos that dominate the public agent literature don't need this pattern. At ten agents, the operator can be the loop, and any framework that doesn't make the operator the loop is overbuilt for the use case. The supervision tree only becomes necessary at the scale where the operator stops being able to firefight in real time. Almost nobody has reached that scale yet, so almost nobody has reached the patterns that solve it.

The Erlang community already knew. The EVE community already knew. The agent layer has been failing to find these patterns for two years because the practitioner communities don't overlap. Nobody who has run a 1000-pilot EVE fleet has also been building agent infrastructure. Nobody building agent infrastructure has been reading Erlang books from the 1990s. The patterns existed in plain sight in both source domains, but the kind of person currently working on agent orchestration tools has no reason to encounter either of them.

The patterns are in plain sight. Just not in the literature the agent layer reads.

What changes when you take the supervision tier seriously.

Adopting the supervision-tree pattern in an agent system changes the shape of the work in concrete ways.

Workers become simpler. Agents stop carrying defensive code for failure modes their supervisor handles. The agent's job becomes "do the thing, return success or crash." Recovery is not the agent's problem. Retry policy is not the agent's prompt. Token cost drops because agents are no longer asked to reason about edge cases that a supervisor handles in code for free.

Failure becomes audit data, not panic. When a task stalls, the autonomic supervisor cancels it and emits an event. The event is recorded. The system continues. No human reads anything in real time. If a pattern of stalls emerges across many tasks, the judgment-tier supervisor surfaces it for triage. The first time a human sees the failure class is when judgment is genuinely required.
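One way to picture "audit data, not panic" is a recorder that stores every failure event and surfaces a failure class only once it has recurred enough to look like a pattern. The sketch below is hypothetical throughout: FailureAudit, surface_at, and the failure-class strings are illustrative names, and a real system would persist events rather than hold them in memory.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FailureAudit:
    """Records every failure; surfaces a class once it recurs surface_at times."""

    surface_at: int
    events: List[Tuple[str, str]] = field(default_factory=list)
    counts: Counter = field(default_factory=Counter)

    def record(self, task_id: str, failure_class: str) -> Optional[str]:
        """Store the event. Return the class name only on the crossing event."""
        self.events.append((task_id, failure_class))
        self.counts[failure_class] += 1
        if self.counts[failure_class] == self.surface_at:
            return failure_class   # hand this to judgment-tier triage
        return None                # routine: logged, nobody paged
```

Because record returns a value only on the event that crosses the threshold, a recurring failure class is raised for triage once rather than on every repeat.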

The operator changes job. The operator stops being the firefighter and becomes the doctrine author. The supervisor logic itself is something the operator writes (or, more often, dispatches an agent to write under doctrine review). The operator's leverage compounds because every supervisor they ship eliminates one class of failure they no longer have to handle by hand. The job becomes authoring the supervision policy, not running it.

Specialists scale by sharding, not by scaling up one big watcher. When task density grows past what one supervision tree can handle, the answer is to shard by failure domain (tasks per repository, tasks per credential pool, tasks per region) with per-shard sub-trees and cross-shard supervisors above them. Same pattern that EVE uses to split a 1000-pilot fleet into 100-pilot wings. Same pattern OTP uses for distributed Erlang clusters. Same answer, different substrate.
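Sharding by failure domain amounts to grouping tasks on one field and giving each group its own sub-tree. A minimal sketch, assuming tasks are plain mappings with an "id" key plus whichever domain field is chosen; shard_by and the field names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def shard_by(tasks: Iterable[Mapping[str, str]],
             domain: str) -> Dict[str, List[str]]:
    """Group task ids by a failure-domain field such as "repository"."""
    shards: Dict[str, List[str]] = defaultdict(list)
    for task in tasks:
        shards[task[domain]].append(task["id"])
    return dict(shards)


tasks = [
    {"id": "t1", "repository": "api", "region": "eu"},
    {"id": "t2", "repository": "web", "region": "us"},
    {"id": "t3", "repository": "api", "region": "us"},
]
by_repo = shard_by(tasks, "repository")   # {"api": ["t1", "t3"], "web": ["t2"]}
```

Each key would then get its own set of autonomic supervisors, with a cross-shard supervisor watching only the shard-level health signals.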

The novelty isn't the pattern. It's the port.

Erlang/OTP solved fleet-scale process supervision in 1986. EVE Online re-derived the same coordination shape for 1000-pilot battles in 2008. Twenty-two years between rediscoveries, in two completely disjoint communities with no overlap in practitioners, both arriving at supervision trees with tiered restart policies, specialist supervisors per failure domain, lateral comms via shared state, and graceful degradation under load. The agent layer in 2026 is the third domain to need the pattern, and it is currently failing to find it because the practitioner communities still don't overlap.

The pattern is forty years old. The substrate is new. The work is the port.