01 · Prove It
The most reasonable demand in this whole field, and the least answerable.
I posted that the agents I run advance their own work, review each other, and merge through a gate I don't sit in, and that the useful part isn't the code they write, it's that I stopped being the thing the pipeline waits on. A stranger replied: cut the shit, show a vid or some proof of it doing work.
Fair. It's the most reasonable thing to ask in a field this full of people claiming a sleeping robot ships their features. It's also close to unanswerable, and not because there's nothing to show. Because the thing being doubted does not fit in a frame.
I could post the commit log. Dozens of merged pull requests in a day, green checks, agent author names, timestamps at three in the morning. I could post the status the system writes about itself, the throughput counts, the repos, the overnight tally. And a real skeptic would be right to shrug at all of it, because a screenshot of a system reporting on itself proves exactly one thing: that I can produce a screenshot.
A still frame can't prove a moving thing. The doubt is about a process, and a process doesn't hold still for the camera.
So I'm not going to win the argument with a picture. What I can do is describe the machinery the picture leaves out, which is the part that actually matters and the part nobody building these demos wants to talk about, because it's the expensive, unglamorous eighty percent that doesn't photograph well.
02 · The Exhaust, Not the Engine
The commit log shows the result. It cannot show the two things you'd actually need to believe.
Here is a real day, as the system tallied it.
76
pull requests merged in 24 hours, across six repositories, every one cleared the same branch-protection gate: an independent approval I can't grant or skip.
30+
of them merged overnight and early morning, before I had started my day, with no input from me.
~15h
the supervision loop ran on its own before I opened a terminal. It was already most of the way through the day's work when I arrived.
Those numbers are real, and they are also the boring part. They are the exhaust the engine leaves behind. A commit log tells you that work landed. It cannot tell you the two things you'd actually need in order to believe the claim:
That the work is correct. Green checks mean the tests that exist passed. They don't mean the code does the right thing, and they certainly don't mean an agent didn't write a test that passes trivially to get itself unblocked. "Merged" is not "good." Anyone who has watched a model talk itself past its own quality bar knows the gap.
That I wasn't in the loop. The log shows author names, but names are cheap. The identity that does the work can be labeled anything. Whether a human was steering each merge is not visible in a list of merges. It lives in the process that produced them.
Both of those live in the machinery, not the artifact. So that's what's worth describing.
03 · An Approval I Can't Forge
No change lands without sign-off from an identity that isn't the author, and isn't me.
The first piece is structural, and it's the answer to "how do you trust work you didn't watch get made." Every change requires an approving review from an identity distinct from the one that wrote it. The agent authors the change. A separate review identity, a different principal entirely, approves it or sends it back. Only an approved, current branch with real checks is allowed to land. None of this runs on good manners. It runs on branch protection at the platform, which refuses to merge anything that hasn't cleared the independent review.
The part that matters is what branch protection refuses, including from me. It refuses a merge with no independent approval. It refuses an agent approving its own work. And with admin enforcement on, it refuses my own attempt to push something through the side door, because the operator is not exempt from the gate. I am not the review identity and I am not exempt from it. That's the only version of "merges through a gate I don't sit in" that means anything: the gate has to be one I can't sit in, not one I politely choose not to.
"A gate I don't sit in" is worth nothing if it's a gate I could sit in and chose not to. It has to be one I can't.
04 · Why It Can't Be a Script
A single session driving the chain dies before morning. Something has to outlive the session.
The popular version of this is a script that calls four agents in order, one after another, in a single run. Plan, then code, then test, then review, all inside one process babysitting the chain. It demos beautifully and it cannot do what's claimed, because that one process hits the model's per-turn limit long before a real feature is done. The orchestrator passes out before it finishes the surgery.
For work to actually run unattended for hours, across repositories, the thing advancing the chain has to outlive any single agent. The agent does its piece and exits. Something durable notices it finished, decides what comes next, and starts the next stage. Progression is owned by a layer that persists, not by a session holding its breath. That's the difference between "I kicked off a script before bed" and "the work advanced itself for fifteen hours." One is a long single breath. The other is a heartbeat.
05 · Why "Finished" Has to Mean "Trustworthy"
The whole game is what happens when a stage fails at 3am.
This is the part that separates a thing you can sleep through from a thing you can't. The naive pipeline has one move when a stage fails: stop and wait for the human. Which means "runs overnight" quietly becomes "stops the moment anything goes wrong and waits for you," and you wake up not to a finished feature but to a halted pipeline and a question. The difference between a pipeline you can sleep through and one you can't is whether failure routes instead of halts.
A concrete one, from the same day those numbers came from. Overnight, a chain failed when one of its own steps raced another. Nothing caught fire and nothing was lost. The supervision loop noticed the chain hadn't completed, an automatic triage worked out why, the runtime preserved the half-finished work on a clean salvage branch instead of discarding it, and because the failure classified as a genuine bug, it filed the underlying race as its own work item rather than papering over it. I read all of this in the morning report. The work was saved and the cause was queued before I was awake, waiting for me to carry it the last step.
What "finished" has to be able to do without me
retryRe-run a stage that failed on something transient, instead of stopping the line.
salvagePull real work out of a chain that broke and re-land it correctly, rather than discard it.
escalateFile the root cause as its own work item when the failure is a class, not a one-off.
refuseLeave a thing un-merged when it can't be made trustworthy, instead of forcing it green.
It is not airtight, and I won't pretend it is. A stage that burns its whole time budget before it has committed anything can still be killed with its work lost: no branch pushed, nothing to salvage. When that happens I find the gap in the morning, and the fix is to teach the system to route that failure the way it already routes the others. But the common failures route, and that is the difference between waking to a halted pipeline and waking to finished work.
That is the expensive eighty percent. Generating the code was never the hard part. A capable model writes a diff, scaffolds a service, writes the tests, and keeps going long after a person would have stopped. The hard part is the layer that makes "it finished" mean "it's correct," so that overnight produces work instead of trash. Nobody puts that in the thread, because it isn't a prompt you can copy and paste. It's a system you have to build and keep alive.
06 · The Honest Line
It's autonomous in the merge layer. It is not autonomous in the org.
Here is where I part company with the genre, because the genre's whole pitch depends on a lie I'm not going to tell. "Thirty merges overnight with no input from me" is true. "Zero input from me, ever" is not, and the gap between those two sentences is the entire honest version of this.
The autonomy is in a specific layer: authoring, review, progression, and merge. Inside that layer, for work already scoped, the system runs itself, including through failure, including while I sleep. That's real and it's most of the volume.
Outside that layer, I am very much still here. I write the doctrine the agents operate under, and I rewrite it when it's wrong. I decide what gets built and what doesn't. When something breaks in a way the supervision loop can't yet route around, I'm the one who fixes it, and then I make the system able to route around it next time. The work that's autonomous is the implementation tier. The judgment, the direction, and the firefighting are still mine. I didn't remove the human. I moved him upchain.
I didn't take myself out of the loop. I took myself out of the inner loop, and the inner loop is most of the work.
Anyone selling you the other version, the one where you go to sleep and wake up to a finished company, is selling the screenshot and skipping the machinery. The honest claim is narrower and far more useful: the part of the work that used to wait on me doesn't anymore, and that part is large enough to change what one person can hold.
You wanted proof in a picture. The only proof of an autonomous system is that it's still running tomorrow.
The commit log is the exhaust. The status report is the exhaust. Both are easy to produce and easy to doubt, and the doubt is fair, because neither one is the thing. The thing is the gate you can't slip past, the layer that outlives the session, and the failure that routed itself to a fix while I slept. None of that fits in a frame, which is why "show me a screenshot" is the wrong test. The right test is time. A demo runs once. A system is still merging tomorrow, and the day after, and was merging the night you weren't watching. That's the only receipt that means anything, and it's the one I can't hand you in a reply.
So here's the honest deal. I'm not going to win your skepticism with an artifact, and I'm not going to try. I'm going to keep building the thing that makes the artifact unnecessary: a view you can watch the system work in, live, instead of a picture you have to take my word on. Until then, stay skeptical. It's the correct posture. Just aim it at the right thing.