01 · The Inheritance Is Optional
An autonomous factory gets to re-audit every practice it was handed.
Most software practices were shaped around human limits. People have finite attention, so we split work into reviewable units. People forget context, so we write the reasons down. People cannot hold a whole system in mind at once, so we lean on a green checkmark instead of rebuilding the application by hand after every change. The rituals are not arbitrary. They are compression strategies for a human-shaped constraint.
None of that makes them wrong. It makes them contingent. They were the right answer to a particular kind of mind working under a particular kind of pressure, and an autonomous factory is neither that mind nor under that pressure.
So the first move is not to run the inherited rituals faster. It is to put each one back on trial. The question is never "is this best practice." The question is "does this raise the probability that the system produces the intended outcome." The factory keeps the practices that pass. The ones that only produce the feeling of rigor, it deletes, however respectable they look.
The bags got us here. That is not a reason to carry them forever.
02 · One Direction of Authority
Tests are tools. Evals are contracts. Reality is the final approver.
Verification in an autonomous factory has an order, and the order is the entire argument.
↓ serve
E2E OUTCOME EVALS · did the system do the job
↓ serve
REAL-WORLD SIGNATURE · did the deployed substrate actually do it
The authority runs upward from the bottom, not downward from the top. A unit test earns its place only when it helps the system satisfy the outcome eval without deforming the architecture to do it. An outcome eval earns its place only when it predicts what the real system will do once it is live. And reality earns nothing and answers to nothing. It is not a higher-fidelity test environment. It is the signer.
Get the direction backwards and you get the captured factory: a green local suite treated as the destination, while the eval quietly drifts and the real path quietly dies.
03 · End-to-End Is the Contract
Everything else is instrumentation.
The claim is not that tests are useless. It is that the factory keeps confusing the instrument with the outcome.
A unit test tells you a function behaved a certain way inside a fixture. An integration test tells you selected components cooperated inside a controlled environment. A race detector tells you a conflicting access happened on a path that was exercised. A static check tells you a suspicious pattern entered the tree. A probe helps you understand one local failure for twenty minutes. All of them are useful, and not one of them answers the only question that ships: did the system do the job it exists to do.
That question has exactly one jurisdiction. The end-to-end outcome eval is the factory's executable statement of intent. It is not a tool the worker reaches for when convenient. It is the contract the work is measured against, and it is written at the boundary where the system meets the world, not inside the worker that builds it.
A test asks whether a part behaved. An eval asks whether the system did the job.
04 · Eval-First
State the outcome before the worker writes a line.
The old axis was test-first versus test-after. For autonomous systems that is no longer the interesting question. The useful default is eval-first: define the outcome at the system boundary, then let the worker decide how to reach it.
When the deployed system processes it
Then the observable state transition occurs
And the result survives the lifecycle around it
And the forbidden regressions stay absent
That is the contract, and it is fixed before implementation begins. How the worker satisfies it is the worker's business. It may write ten unit tests or none. It may stand up a probe, learn what it needs in twenty minutes, and delete it. It may reach for race detection, a stress run, a trace, a purpose-built harness. Those are implementation choices, and the factory does not reward the choice that produces the most tests. It rewards the one that produces the correct outcome and leaves the architecture clean.
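The contract shape above can be written down as a small executable structure. What follows is a minimal sketch, not the factory's eval format: `OutcomeEval`, `run_eval`, the `stimulus` field, and the toy order-recording system are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical illustration of the contract shape; every name here is assumed.
State = Dict[str, Any]


@dataclass(frozen=True)
class OutcomeEval:
    name: str
    stimulus: Any                                        # what the deployed system processes
    expected_transition: Callable[[State, State], bool]  # (before, after) -> transition occurred
    survives_lifecycle: Callable[[State], bool]          # checked after the lifecycle step runs
    # Each check returns True when a forbidden regression is present.
    forbidden: List[Callable[[State], bool]] = field(default_factory=list)


def run_eval(ev: OutcomeEval,
             process: Callable[[State, Any], State],
             lifecycle: Callable[[State], State],
             initial: State) -> List[str]:
    """Return the list of violated clauses; an empty list means the contract held."""
    before = dict(initial)
    after = process(dict(before), ev.stimulus)
    settled = lifecycle(dict(after))
    failures: List[str] = []
    if not ev.expected_transition(before, after):
        failures.append("transition")
    if not ev.survives_lifecycle(settled):
        failures.append("lifecycle")
    for i, check in enumerate(ev.forbidden):
        if check(settled):
            failures.append(f"forbidden[{i}]")
    return failures


# Toy system under evaluation: records an order id in its state.
def process(state: State, order_id: str) -> State:
    state["orders"] = state.get("orders", []) + [order_id]
    return state


def reconcile_keep(state: State) -> State:
    return state


def reconcile_wipe(state: State) -> State:
    state["orders"] = []  # a lifecycle step that quietly deletes the result
    return state


ORDER_EVAL = OutcomeEval(
    name="order is recorded and survives reconciliation",
    stimulus="order-1",
    expected_transition=lambda b, a: "order-1" in a.get("orders", [])
    and "order-1" not in b.get("orders", []),
    survives_lifecycle=lambda s: "order-1" in s.get("orders", []),
    forbidden=[lambda s: len(s.get("orders", [])) != len(set(s.get("orders", [])))],
)
```

The eval says nothing about how `process` is built. It names only the boundary behavior, so the same contract can judge any implementation the worker chooses.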
05 · The Worker Does Not Own the Contract
This is where oracle capture begins.
Give the same worker control over the thing that judges it, and the loop quietly collapses. It interprets the requirement, writes the eval, writes the implementation, edits the eval when the implementation struggles against it, observes green, and reports the task complete with a fluent paragraph explaining why green means correct. Every stage emits evidence. All of the evidence originates inside one interpretive boundary.
That is oracle capture: the instruments, and worse the contract itself, measure agreement with the worker's interpretation rather than correctness against the requirement. The suite turns green. The product gets worse. It takes no malice. A patch wrapped in confident engineering prose is easier to approve than one wrapped in honest uncertainty, so the factory preferentially accepts the confident one, and over enough repairs the explanation becomes the proxy for correctness while the underlying evidence never moves. The loop learns to reward the eloquence of the alibi over the integrity of the binary.
The defense is structural, not motivational. The contract has to live somewhere the worker cannot casually reshape. Some evals stay visible so the worker can iterate against them. Some are withheld as behavioral holdouts. Some run against production-shaped environments the worker did not build. Some are countersigned by the live substrate itself. The point is not secrecy for its own sake. It is keeping a measurement surface the worker cannot edit into agreement with itself.
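One way to make that structural is to pin the contract by digest somewhere outside the worker's workspace and refuse any verdict computed against an edited copy. The sketch below assumes that arrangement; `contract_digest`, `accept`, and the visible/holdout result dictionaries are hypothetical, and a real factory would also need to protect where the pinned digest is stored.

```python
import hashlib
from typing import Dict, Tuple


def contract_digest(eval_sources: Dict[str, str]) -> str:
    """SHA-256 over eval names and bodies, independent of dictionary order."""
    h = hashlib.sha256()
    for name in sorted(eval_sources):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(eval_sources[name].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


def accept(pinned_digest: str,
           submitted_evals: Dict[str, str],
           visible_results: Dict[str, bool],
           holdout_results: Dict[str, bool]) -> Tuple[bool, str]:
    """Accept only if the contract is unedited and both eval sets pass.

    pinned_digest is assumed to be recorded outside the worker's reach.
    holdout_results are assumed to come from evals the worker never saw.
    """
    if contract_digest(submitted_evals) != pinned_digest:
        return False, "contract edited"
    if not all(visible_results.values()):
        return False, "visible eval failed"
    if not all(holdout_results.values()):
        return False, "holdout eval failed"
    return True, "accepted"
```

The digest check runs first on purpose: a green result against an edited contract is not evidence, so it is never consulted.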
A worker that writes the oracle, passes the oracle, and narrates why the oracle proves it correct has not verified anything. It has notarized its own belief.
06 · Reality Is the Terminal Eval
Even a perfect eval is still a model of the world.
A great end-to-end eval can still miss the thing that decides whether the change is real. It can miss deployment drift, a missing binary, the wrong image tag, a bypassed route, stale configuration, a reconciliation loop that quietly deletes the result, a permission edge, a scheduler quirk, a timing condition that only exists in production. So the ladder does not stop at eval-pass.
E2E-PASS ≠ MERGED ≠ DEPLOYED ≠ VERIFIED
E2E-PASS means the contract held in an authoritative test environment. DEPLOYED means the intended artifact reached an authoritative runtime. VERIFIED means the live system exercised the intended path and produced the expected observable result. Only the last transition is a statement about reality. Every rung before it is a proposal.
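The ladder can be held as an ordered state rather than a single "done" flag. This is a minimal sketch, assuming the rungs run E2E-PASS, MERGED, DEPLOYED, VERIFIED in that order and that each rung only counts when every rung before it holds; `Rung`, `rung_reached`, and `is_done` are hypothetical names.

```python
from enum import IntEnum
from typing import Dict, Optional


class Rung(IntEnum):
    E2E_PASS = 1   # contract held in an authoritative test environment
    MERGED = 2     # change landed in the tree
    DEPLOYED = 3   # intended artifact reached an authoritative runtime
    VERIFIED = 4   # live system exercised the path and produced the result


def rung_reached(evidence: Dict[Rung, bool]) -> Optional[Rung]:
    """Highest rung for which this rung and every earlier rung has evidence."""
    reached: Optional[Rung] = None
    for rung in Rung:  # iterates in definition order
        if not evidence.get(rung, False):
            break
        reached = rung
    return reached


def is_done(evidence: Dict[Rung, bool]) -> bool:
    """Only VERIFIED is a statement about reality; everything else is a proposal."""
    return rung_reached(evidence) is Rung.VERIFIED
```

Under this sketch, a live observation with no deploy record behind it does not count as VERIFIED. That is a design choice of the illustration, made so a gap in the chain stays visible instead of being skipped over.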
A concrete case from the factory, anonymized to its shape. A class of malformed input was supposed to be recognized as a benign no-op. Instead it was misread, and the misreading triggered a rejection loop that re-fired on the same artifact more than a hundred times. The fix was understood, written, reviewed, and merged. By every internal signal, it was done. It was not. The running reviewer was launched from a container image built before the fix existed, so the new check was never on its path. The image was stale. What merged was not what ran.
Done arrived later, and from outside. A deliberately malformed case was pushed through the real boundary. The reviewer relaunched from the rebuilt image, the new check now on its path. It classified the input as the benign no-op it was and exited clean. The rejection loop never fired. The artifact survived. The recurring failure was not probably fixed. It was fixed, and the live substrate had signed for it.
Deploy is not proof. A green suite can sit on top of a dead production path and still feel finished.
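The two checks that closed the incident above can be sketched as one function: is the running image newer than the fix, and does a deliberately malformed case come out as a benign no-op when pushed through the live reviewer. Everything here is assumed for illustration, including `RunningReviewer`, `verify_fix_live`, and the `"benign-no-op"` label; the original account does not describe its tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List

BENIGN_NO_OP = "benign-no-op"  # hypothetical classification label


@dataclass(frozen=True)
class RunningReviewer:
    image_built_at: datetime          # build time of the image actually running
    classify: Callable[[str], str]    # the live classification path


def verify_fix_live(reviewer: RunningReviewer,
                    fix_merged_at: datetime,
                    malformed_case: str) -> List[str]:
    """Return the reasons the fix is not yet verified; empty means reality signed."""
    problems: List[str] = []
    # Necessary, not sufficient: an image built before the merge cannot
    # contain the fix, but a newer image is not proof that it does.
    if reviewer.image_built_at < fix_merged_at:
        problems.append("stale image: built before the fix merged")
    # The behavioral check is the one that counts: exercise the real path.
    if reviewer.classify(malformed_case) != BENIGN_NO_OP:
        problems.append("malformed case was not classified as a benign no-op")
    return problems
```

The timestamp comparison only explains a failure. The probe through the live path is what turns DEPLOYED into VERIFIED.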
07 · Keep Only What Earns It
Ruthless about inherited process. Not reckless. Ruthless.
Every practice the factory carries should justify itself, continuously, against the contract and against reality. Does this test catch real regressions, or does it only turn green. Does this fixture predict production behavior, or manufacture a race that production never had. Does this mock preserve the semantics that matter, or hide them. Does this retry repair a lifecycle bug, or conceal one. Does this mutex protect a real invariant, or silence the harness. Does this review stage produce new evidence, or summarize confidence that was already in the room.
If the answer is no, simplify. If it is unclear, measure. If it is repeatedly no, remove it without ceremony. The factory does not keep a practice forever because the industry once needed it under constraints the factory no longer has. It keeps the instruments that measurably help it cross the boundary into reality, and it deletes the ones that only help it feel finished.
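The rule above is mechanical enough to write down. In this sketch each audit of a practice records True (it earns its place), False (it does not), or None (unclear). The `repeat_threshold` of three consecutive failures is an assumption made for the example; the text says only "repeatedly".

```python
from enum import Enum
from typing import Optional, Sequence


class Action(Enum):
    KEEP = "keep"
    SIMPLIFY = "simplify"
    MEASURE = "measure"
    REMOVE = "remove"


def audit(history: Sequence[Optional[bool]], repeat_threshold: int = 3) -> Action:
    """Decide what to do with a practice from its audit answers, oldest first."""
    if not history or history[-1] is None:
        return Action.MEASURE          # no answer yet, or the latest is unclear
    if history[-1]:
        return Action.KEEP             # it currently earns its place
    # Count the run of consecutive "no" answers ending at the latest audit.
    streak = 0
    for answer in reversed(history):
        if answer is False:
            streak += 1
        else:
            break
    return Action.REMOVE if streak >= repeat_threshold else Action.SIMPLIFY
```

A practice with no audit history lands on MEASURE rather than KEEP, which matches the stance that inheritance alone is not a justification.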
The ends are the means. Make reality sign.
Tests are not the product. The pipeline is not the product. The pull request, the deploy record, the agent's confident explanation, none of them is the product. The end-to-end eval is the executable contract. The real world is the terminal eval. Everything in between is instrumentation, kept while it earns its place and deleted the moment it stops. The worker may write the code, write the tests, run the deploy, and explain why all of it is correct. It does not get to be the thing that signs. The real world must sign for it.