The Agents That Work Are the Boring Ones
The industry is fixated on model capability, yet the true bottleneck to automation is the unglamorous engineering that surrounds the model.
The conversation about AI agents has lately been fixated on the model. Which one’s smarter, whose context window is longer, and who shaved another cent off a million tokens? It’s the wrong fixation. The agents quietly earning their keep inside real businesses aren’t running better models than the ones that failed; they’re running the same models, wrapped in three disciplines that have nothing to do with intelligence and everything to do with restraint.
Most teams skip the restraint and go straight for ambition. They build a “support agent,” a single program meant to answer anything a customer might ask, and then they’re surprised when it confidently invents a refund policy that doesn’t exist. The failure looks like a model problem. It isn’t. A smarter model would invent the policy more fluently. What’s missing is the work around the model, and that work follows a pattern.
Scope to a job small enough to grade
Ambition is the first thing to cut. The teams that win don’t build an agent that “handles customer support.” They build one agent that resets passwords, a different one that tracks a late shipment, and a third that processes a return. Each owns a single workflow with a finish line you can see.
The reason this works isn’t tidiness. It’s that a narrow job comes with a success metric attached. “Did the customer get back into their account?” has an answer. “Was the customer well supported?” doesn’t, not in any form you can measure on Tuesday and improve by Friday. When a logistics startup swapped its catch-all chat agent for a set of single-purpose ones, the resolution rate on order-status questions climbed sharply because the agent finally had one question to get right instead of a thousand it might get wrong.
Here’s the part teams resist: a narrow agent looks less impressive in the demo and performs better in production. The broad agent dazzles a boardroom and collapses under a real queue. You’re not choosing between an ambitious agent and a modest one. You’re choosing between an agent that works and one that demos.
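One way to make "small enough to grade" concrete is to attach a pass/fail check to each workflow and compute a resolution rate per job. The sketch below is illustrative only: the workflow names come from the examples above, while the outcome fields (`logged_in`, `status_delivered`, `return_label_issued`) and the `resolution_rate` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Workflow:
    name: str
    # Returns True only when this job's finish line was actually reached.
    succeeded: Callable[[dict], bool]


# Hypothetical outcome fields; a real system would define its own.
WORKFLOWS = {
    "password_reset": Workflow(
        "password_reset", lambda o: o.get("logged_in") is True
    ),
    "shipment_status": Workflow(
        "shipment_status", lambda o: o.get("status_delivered") is True
    ),
    "return": Workflow(
        "return", lambda o: o.get("return_label_issued") is True
    ),
}


def resolution_rate(workflow_name: str, outcomes: list[dict]) -> float:
    """Share of conversations in which the workflow's single goal was met."""
    workflow = WORKFLOWS[workflow_name]
    if not outcomes:
        return 0.0
    return sum(workflow.succeeded(o) for o in outcomes) / len(outcomes)
```

A catch-all agent has no equivalent of `succeeded`, which is the point: there is nothing to divide by.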
Put a gate between the decision and the consequence
The second discipline is a wall. Anywhere an agent can do something irreversible (spend money, send a message a customer will read, change a record other systems trust) there has to be a check between the agent’s decision and the world feeling it.
Picture a billing agent authorized to issue refunds. The naive version refunds whatever it decides to refund. The disciplined version refunds on its own only when its confidence clears a threshold and routes everything below that line to a person. This isn’t a hedge against weak models. It’s the same structure any company already trusts with its people. A junior clerk can approve a small credit on their own authority; a large one goes to a manager. Nobody calls that a failure of the clerk. They call it a control.
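A minimal sketch of that routing rule follows. The threshold values, the `RefundDecision` fields, and the amount limit (borrowed from the clerk analogy rather than stated as a requirement above) are all assumptions chosen for illustration, as is the idea that a usable confidence score is available upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundDecision:
    order_id: str
    amount_cents: int
    confidence: float  # assumed to be a score in [0, 1] produced upstream


# Illustrative values, not recommendations.
CONFIDENCE_THRESHOLD = 0.9
AUTO_APPROVE_LIMIT_CENTS = 5_000  # the "small credit" a clerk could approve


def route_refund(decision: RefundDecision) -> str:
    """Return 'auto_refund' only when every check passes, else 'human_review'."""
    if (
        decision.confidence >= CONFIDENCE_THRESHOLD
        and decision.amount_cents <= AUTO_APPROVE_LIMIT_CENTS
    ):
        return "auto_refund"
    return "human_review"
```

The default branch is the human one. Anything the gate does not explicitly approve goes to a person, so a bug in the checks fails toward review rather than toward spending money.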
Agents need the same controls, written in code instead of org charts. A workable rule: no action that touches the outside world ships without either a programmatic check or a structured output the system can validate before acting. The agent can be as creative as it likes inside the gate. Past the gate, it has to show its work in a form a machine can verify.
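As a sketch of the "structured output the system can validate" half of that rule, suppose the agent must propose every outside-world action as a JSON object. The action names, the `amount_cents` field, and the refund ceiling below are hypothetical; only the standard-library `json` module is used.

```python
import json

# Hypothetical action vocabulary and limit for illustration.
ALLOWED_ACTIONS = {"issue_refund", "send_message"}
MAX_REFUND_CENTS = 10_000


def validate_action(raw: str) -> dict:
    """Parse the agent's proposed action; raise ValueError if it fails a check."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")

    kind = action.get("type")
    if not isinstance(kind, str) or kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action type: {kind!r}")

    if kind == "issue_refund":
        amount = action.get("amount_cents")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (
            not isinstance(amount, int)
            or isinstance(amount, bool)
            or not 0 < amount <= MAX_REFUND_CENTS
        ):
            raise ValueError("refund amount missing or out of range")

    return action
```

The executor would call `validate_action` on every proposal and act only on the returned dictionary. Free-form text from the model never reaches the billing system directly.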
Teams that skip this learn the cost in public. An agent talked into selling a product for a dollar, a chatbot that promises a discount the company never offered: these aren’t really hallucinations. They’re missing walls. The model did what models do. The system let it reach the world unchecked.
Treat the agent as alive, not finished
The third discipline is the one almost everyone gets wrong because it contradicts how we think about software. We ship software and move on. The build ends, the artifact is done, and maintenance is a tax we resent. An agent doesn’t behave like that, and treating it like a finished artifact is how a working agent slowly stops working.
Consider what happens around a competent agent over six months. The product ships new features. The pricing changes. A policy gets rewritten. The agent, frozen at the moment it launched, keeps answering with last quarter’s truth, and answering confidently. Nobody touched it, so nobody expects it to be wrong. That’s what makes the failure dangerous: it stays invisible until a customer hits it.
The teams that keep agents healthy run them like a service that learns. Every real failure in production becomes a test case. You collect the questions the agent botched, turn them into an eval harness, and run that harness against every change to the prompt or the model. Before you push a new instruction, you already know whether it fixes the thing you wanted without quietly breaking nine things you’d forgotten about. It’s test-driven development pointed at behavior instead of functions. An insurer running this loop on its claims-intake agent caught a regression that would have rejected a whole category of valid claims before a single customer saw it.
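The loop described above can be sketched in a few lines. Everything here is a stand-in: `EvalCase`, the substring check, the sample questions, and `stub_agent` (which replaces a real model call) are hypothetical, and a production harness would likely use richer grading than a phrase match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str      # a question the agent once got wrong in production
    must_contain: str  # a phrase any correct answer has to include


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> list[EvalCase]:
    """Return the cases the agent fails; an empty list means no regressions."""
    return [
        case
        for case in cases
        if case.must_contain.lower() not in agent(case.question).lower()
    ]


# Hypothetical cases and policies, invented for this example.
CASES = [
    EvalCase("How long do I have to return an item?", "30 days"),
    EvalCase("Can I get a refund on a gift card?", "not refundable"),
]


def stub_agent(question: str) -> str:
    # Stand-in for a real model call with the current prompt.
    if "gift card" in question:
        return "Gift cards are not refundable."
    return "You can return items within 30 days."


failures = run_evals(stub_agent, CASES)
```

Running `run_evals` before every prompt or model change, and blocking the change whenever the returned list is non-empty, is the behavioral equivalent of a failing unit test. Each new production failure gets appended to `CASES`, so the suite grows with the agent’s history.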
This is the discipline that compounds. Scope and gates make an agent safe on day one. The eval loop is what keeps it safe on day two hundred, while the product underneath it keeps moving.
What this asks of you
None of this is exotic. Scope the work small enough to grade. Gate the actions that reach the world. Keep the thing alive with tests fed by its own failures. The reason most agents don’t do these things isn’t that the techniques are hard; it’s that they’re unglamorous, and the shine of a smart model is easy to mistake for progress.
The companies pulling real value out of agents worked out something the demos hide: the intelligence was never the scarce part. The discipline was. Build the boring agent, and you’ll have the one that’s still running next year.


