Anthropic Workshop: Build Agents That Run for Hours — Ash Prabaker & Andrew Wilson
Using contracts instead of specs helps make AI agents keep their promises.
Anthropic Workshop: Build Agents That Run for Hours — Ash Prabaker & Andrew Wilson — AI Engineer · 1:15:40
The frontier reliability problem isn’t agents that run for five seconds — it’s agents that run for hours. This Anthropic workshop tackles what changes when tasks span dozens of steps, multiple context windows, and real-world waiting time: interruptions, recovery, checkpointing, and scope drift. None of these show up in demos. All of them show up in production.
This video is part of a hand-curated collection — each video is picked one at a time, not pulled from a recommendation feed. The talks here are the ones worth a full watch if you’re trying to get serious about agentic engineering. Each written summary was drafted by an AI pass over the transcript and then edited by a human. The source video is linked at the top of every post, so if something reads off, go to the tape.
Why Builders Can’t Grade Themselves
Andrew and Ash (Anthropic, Applied AI) opened AI Engineer with a deceptively simple claim: the agents you see one-shotting browsers in demos aren’t doing the hard work. The hard work starts when an agent has to run for five, six, thirty hours — and the question isn’t whether the model is smart enough. It’s whether the scaffolding around the model can keep it honest that long. Over an hour they walked the room through Anthropic’s year of releases, the scaffold patterns that survived each model jump, and the one habit you can’t skip: reading the traces by hand.
Models and scaffolds co-evolve
Andrew’s history lesson made a point most agent posts miss. From Sonnet 3.7 (one hour of useful work in a minimal scaffold) to Opus 4.6 (twelve hours, same minimal scaffold), the model did the obvious work — but every release also shipped new scaffold primitives: artifacts, computer use, MCP, sub-agents, skills, checkpoints, server-side compaction, agent teams. The scaffold doesn’t shrink as models improve. It evolves. Each release closes one gap in the model and exposes a new gap further out, and the scaffold reorganizes around the new frontier. The Ralph loop — Geoffrey Huntley’s “run the same prompt until it’s done” pattern — went from clever hack to a shipped Claude Code plugin to mostly unnecessary in about nine months, because Sonnet 4.5 started managing its own context window.
The inversion: critics are cheap, self-critics aren’t
Here’s the part worth rereading. Everyone’s first instinct with a long-running agent is to tell it to check its own work. Ash threw that out. Pretend the agent is a generator. Spin up a second agent, a discriminator — different system prompt, different context window, harsh rubric, Playwright access — and let the two argue. The trick isn’t that the evaluator is somehow smarter. It’s that tuning a standalone critic to be picky is tractable in a way that tuning a builder to be self-critical isn’t. Same as humans. You can critique a painting in five minutes; you can’t paint one. Most teams are still wiring up self-evaluation loops and wondering why the model keeps marking its own half-built features as done. The fix is structural, not prompt-engineering: separate the contexts, give the critic real tools, and let it grade.
Contracts beat specs
The other clever move is the negotiation step. Before the generator writes a line of code, the generator and the evaluator argue — on disk, file by file — about what “done” means for the current sprint. The generator proposes a feature and a test plan. The evaluator pushes back: scope’s too big, tests are weak, you missed an edge case. They iterate until both agree. Then the generator builds, and the evaluator grades against the contract those two wrote — not against the planner’s original one-shot spec. This is the bit the Ralph loop never had. A plan.md sitting in a folder doesn’t push back. A peer with a different system prompt does. Ash’s retro game-maker demo went from “looks done, arrow keys do nothing” in a solo loop to a working sprite editor with live physics and an AI-assist sub-app — same model, same prompt, just the contract loop wrapped around it.
Taste is gradable if you write it down
Most people assume design quality can’t be evaluated because it’s subjective. Ash’s response: that’s a cop-out. Write your opinion down. Anthropic uses a four-criterion rubric — design, originality, craft, functionality — weighted toward the first two when the model is already strong at the last. Calibrate with a few-shot of reference sites and the evaluator’s taste converges on yours. You don’t get there with vague rubrics; the working version of the retro game-maker had 27 contract criteria for one app. Vague criteria produce vague critiques, and the generator shrugs and moves on.
Read the traces. There is no shortcut
Both presenters kept returning to this. The primary debugging loop isn’t running more experiments. It’s pulling agent transcripts into files and reading them line by line, finding where the model’s judgment diverged from yours, then editing the prompt for that. A second agent can help triage. You still have to read. Ash described the Claude-for-Chrome team literally closing their eyes and trying to click through web pages by feel, to build the empathy needed to write the right system prompt. Empathy for the model isn’t a soft skill — it’s the muscle that tells you which part of the scaffold to delete next, and when.
The frontier doesn’t shrink. It moves. Build the scaffold for where the model is weak today, then strip it the week the model fills that gap.

