The Ops Budget Nobody Approved

When the cost of generating code drops to nearly zero, you don't just write code faster; you end up with ten times more of it, turning software from an asset into a liability.

Jun 15, 2026

The efficiencies of agentic code generation are real, but the true costs are poorly understood. An agent writes a service in four minutes that would have cost a senior engineer two days, and the line item drops from two days of salary to a few dollars of tokens. That money didn’t disappear. It moved downstream, into the build farm, the test fleet, the release pipeline, and the pager rotation, and none of those line items sits next to the four-minute win where anyone would think to add it back. You booked the discount at the keyboard. The bill comes due everywhere else.

This is the part of the agentic-engineering pitch that nobody puts on the slide. The productivity is genuine and the savings on code generation are genuine, and both are measured at exactly the wrong point in the system. You booked the discount at the keyboard. The bill arrives everywhere else.

The cheapest thing got cheaper, so we made more of it

Jevons noticed it with coal in 1865: make a resource cheaper to use and you don’t spend less on it, you use far more of it. Tokens are coal now. When generating code drops to near-free, you don’t end up with the same amount of code written faster. You end up with ten times the code, and Jeff Atwood’s old line stops being a quip and starts being a budget: software is a liability, not an asset. Every line is something to compile, test, review, deploy, secure, and eventually debug at three in the morning.

So start with the part everyone forgets to multiply. Five engineers each driving a swarm of agents don’t write five times the code. They write ten or twenty times more, and “write code” is only the first step in a long pipeline that now has to swallow all of it: build, test, review, deploy, operate. Walk that pipeline one stage at a time and apply a single test at each: take the number, multiply it by ten, and ask whether the stage holds. Most of them don’t, even in shops with the best tooling in the industry, and the ones that buckle first are the ones nobody was watching.

Testing is where the bill lands first

Compilation isn’t free, in wall-clock or in compute, and most engineers have never once looked at how much of their test infrastructure eats. They are about to find out. Here is the number that should scare you: at scale, the dependency graph grows quadratically with the codebase, not linearly. Ten times more code does not mean ten times more tests to run. It means a hundred times, maybe a thousand, because every change has to prove it didn’t break everything downstream of it, and the set of things downstream exploded.

That turns a background cost into a line item. A test suite that used to be an afterthought becomes one of the largest compute consumers in the company, and it has to run on every one of those ten-times-more commits, because the agents themselves love running tests. Tests are how an agent knows it did good work, so it runs them constantly, on its own initiative, burning your fleet to check its own homework.

And the old shipping rule quietly breaks. Today you ship when every test passes: a long conjunction of Booleans, all of them green. That works while the suite is small enough that the infrastructure running it is reliable. Run a hundred thousand tests on hardware that flakes even a little, and the odds that all all hundred thousand come back green at once approach zero. You can’t ship anything. So the rule that kept you safe, everything must be true, becomes a rule you have to abandon for something statistical, and now you are guessing which tests are the ones that matter. That is a different and much scarier way to live.

Deploys, and the safety valve nobody priced

Push the same multiplier into release and a load-bearing assumption gives way. Ask why rollbacks work at all today, and the honest answer is timing: you release software slightly slower than it takes you to detect a problem in production. That gap is the whole safety valve. A bad change lands, the graphs go red, you roll back to the last good state before too much piled on top.

Now speed up releases so they’re happening faster than detection. The bad change lands, and before your monitoring even notices, three more changes have shipped on top of it, each one entangled with the last. The rollback you reach for has to contend with all of them. The valve that made fast deployment survivable was never the deploy speed. It was the margin between shipping and noticing, and cheap code generation is the thing eating that margin.

Locally correct, globally wrong

The nastiest costs aren’t capacity at all. They’re the second-order ones, the bugs that compile clean and pass review and surface six weeks later in production. Agents are good at writing code that is easy to write. They are not reliably good at code that is easy to maintain, and they don’t carry the long-horizon judgment that keeps a system coherent. Each change can be locally correct, defensible on its own, reviewed and merged, and the whole still drifts toward globally wrong.

Distributed systems are where this gets expensive, and it helps to see why. A single program either works or it doesn’t, and you can usually reproduce the failure on your laptop. A network of services fails differently. Each service can be correct on its own and the system can still break in the seams between them: a call that times out and gets retried twice, a message that arrives out of order, two services that each hold a defensible view of the same record and disagree the moment they compare notes under load. Now multiply the parts. Ten times more services means ten times more of those seams, more retries colliding with more timeouts, more partial failures where half the system has the new state and half has the old. These are the bugs that don’t show up in any one service’s tests, because no one service is wrong. They appear only when the pieces run together at scale, which makes them the hardest failures in the field to reproduce and the most expensive to chase.

The same multiplier reaches the doors between services. An internal API that you left unhardened because “only our own code calls it” is now called by agents, and an agent doesn’t read your README or honor your conventions. It finds the endpoint, calls it, and reaches whatever data the endpoint will hand back. Every internal interface is effectively public the moment an agent can reach it, and the hardening you deferred because it felt safe inside the perimeter is suddenly load-bearing.

The team that never made it into the spreadsheet

Trace the chain and notice who is standing at the end of it. Bigger build fleets, exploding test compute, riskier deploys, a brittle million-line system where one bad config flag takes everything down, distributed bugs that take a war room to find. The people who absorb all of that are the operations and SRE teams, and their cost rarely gets entered against the agentic productivity win. The savings land on the engineering ledger. The new load lands on a team in a different org with a different budget, and the two numbers are never put on the same page.

There is a darker version, too. If a token engine becomes load-bearing, weird things happen. Picture a rollback path that depends on an agent having budget, and someone burns that budget dry an hour before the incident. The cheapest resource in the system, the thing you stopped metering because it felt free, is now the thing standing between you and recovery.

None of this argues for slowing down. The productivity is real and it is not going back in the box. The argument is narrower and harder to dodge: the savings and the costs live in different places, and a discount you book in one node while the bill comes due in five others is not a discount. It’s a transfer. The teams getting credit for the speed and the teams paying for it should at least be looking at the same ledger. Right now they aren’t, and the gap between the four-minute service and its true cost is exactly the size of an ops budget nobody approved.

The Intent Layer

Discussion about this post

Ready for more?