The Flaky Compiler

The hottest new programming language is English, but it comes with a bug you can't patch.

Jun 06, 2026

Andrej Karpathy said the hottest new programming language is English, and he was right enough that the line stopped sounding like a prediction and started sounding like a fact about the world. You talk to a machine now and it builds the thing. No semicolons. No compiler complaining about a missing brace on line 412. You describe what you want, and a working artifact shows up on the other side.

Here’s what the slogan hides. English is a terrible programming language, and it always has been. We didn’t adopt it because it’s precise. We adopted it because the model finally got good enough to paper over how imprecise it is. That’s a different fact, and it has different consequences. Imprecision doesn’t vanish when you trade Python for prose. It goes underground. It waits. It surfaces at the exact moment you can least afford it, which is to say production, on a Friday, attached to something a customer paid for.

Michael Littman has been circling this for a while. He’s a reinforcement learning researcher, Brown University’s first associate provost for AI, and he spent an hour on the ODSC podcast explaining why the English-as-code story is both true and a trap. His framing is the one I keep coming back to: writing and programming are the same act. You have something in your head. You push it through a narrow channel. On the far side, it has to reassemble into the meaning you intended, or it isn’t worth anything. That channel never got wider. The agents didn’t widen it. They just changed what it looks like to stand in front of it.

So the question that matters isn’t “can the machine write code.” It can. The question is whether you can say what you mean precisely enough that the right code is what comes out, and then whether you can tell, looking at the result, that it did.

The channel was never the problem

For my professional lifetime we told a story about software that put the code at the center. You had an intent. An engineer translated it into syntax. The syntax was hard, the translation took years to learn, and so we came to believe the difficulty lived in the translation. Learn the language, master the translation, and you’d be a real engineer.

That story was always a little off. The hard part was never the syntax. Junior engineers learn syntax in a semester. The hard part was knowing what to say, knowing whether what you said was complete, and knowing whether the thing you built actually did what you meant. The syntax just sat in front of those problems and hid them, because you couldn’t get to the hard questions until you’d cleared the easy-looking wall of the language itself.

Take the wall away and the hard questions are still standing there. This is what people miss when they celebrate natural-language programming as a liberation. They think the difficulty was the language and the language is gone. The difficulty was the specification, and the specification is exactly as hard as it ever was. Harder, maybe, because now nothing stops you from being vague. The compiler used to reject your half-formed thoughts. The agent accepts them and ships something anyway.

The Tetris test

Littman ran an experiment that I’d steal in a heartbeat if I were in teaching. He took students with a single semester of programming behind them and dropped them into an advanced software engineering course. One rule: writing code yourself counts as cheating. Everything goes through the agent. They drive Claude Code; they don’t type the implementation.

Day one looks like a miracle. A student types “create a Tetris video game” and minutes later they’re playing Tetris. A kid who couldn’t have written a working game in a week now has one in the time it takes to get coffee. If you wanted a demo of why English-as-code is real, that’s the demo.

Then Littman turns the screw. Same game, he says, but make gravity go up. The blocks rise instead of fall. Half the class can’t get there. Not because the task is hard in any deep sense; negating a sign on a velocity is not advanced engineering. They can’t get there because the model has seen ten thousand Tetris implementations with gravity going down, and the shortest path from the prompt to something that runs leads straight back to the version it already knows. The students ask for up. The agent, drawn toward the familiar, keeps finding ways to give them down.

Then the real screw. Press G to reverse gravity, and press it again to reverse it back. Now three-quarters of the class stalls. The agent always produces something. It compiles, it runs, the blocks have nice colors and a little glow when they lock into place. It just isn’t the game they asked for. And the students who got their first version in ninety seconds are now stuck for an afternoon staring at output that looks finished and isn’t.

What rescued the ones who made it through wasn’t ten thousand hours of practice. They didn’t have ten thousand hours; they had one semester. What rescued them was a week of learning to take a problem apart, see how the pieces touch each other, and check whether the slick-looking thing on the screen actually did the job. That checking is the entire game now. The agent makes the result look good. “Looks good” and “is correct” are two different claims, and the gap between them is where all the work moved.

What a week bought them

Sit with that result, because it’s the whole thing in miniature. The students who survived didn’t have more programming. They had one semester, the same as the ones who washed out. What separated them was a week spent on something the syllabus barely has a name for: taking a fuzzy goal and breaking it into parts small enough to reason about, holding in your head how those parts push on each other, and looking hard at a result to decide whether it’s real or just shiny.

We’ve always called that “experience” and assumed it came bundled with the ten thousand hours. It doesn’t. The hours teach you syntax and pattern and the muscle memory of a hundred bugs, and those things are genuinely useful, but they’re not the thing that rescued the students. The thing that rescued them was a way of attacking a problem that you can teach in a week if you teach it on purpose and apparently never teach at all if you don’t. Decomposition. The discipline of not trusting a result until you’ve checked the claim it’s making.

And here’s the uncomfortable implication for anyone running an engineering org. The skill that now decides whether someone is effective with these tools isn’t the skill we’ve spent decades selecting and promoting for. We hired for the hours. We rewarded the people who could hold the most syntax, ship the most code, and win the most arguments about implementation. Some of those people are also excellent at decomposition and judgment, and they’ll be fine. Some were excellent at translation and leaned on it, and translation is the part that just got automated. Meanwhile a second-semester student with good instincts for taking a problem apart can now out-build them on a Tetris variant, because the wall that used to protect the senior engineer, the years it took to learn the language, isn’t standing in the student’s way anymore. The protective moat was the syntax. The syntax is precisely what the agent dissolved.

Your agent is a flaky compiler

A colleague of Littman’s has the best one-line model for this I’ve heard: treat a coding agent as a very flaky compiler. It takes a specification and turns it into executable code, which is what a compiler does. And every so often it does something you didn’t ask for and wouldn’t have predicted, which is what a flaky thing does. You can lower the flakiness. You can’t drive it to zero. Build your habits around that fact and you’ll be fine. Build them around the hope that the next model fixes it and you won’t.

The reason it’s flaky isn’t that it’s dumb. It’s that it’s an optimizer, and optimizers find the shortest path to the target you actually gave them, which is rarely the target you thought you gave them. This is old news in reinforcement learning, and the examples are wonderful. There’s the classic husky-versus-wolf image classifier that hit high accuracy and turned out to be detecting snow because the wolf photos had snow in the background and the husky photos didn’t. The model solved the problem you posed. You just posed a different problem than the one in your head.

It gets funnier. Point a reinforcement learning system at a boat racing game and score it on points instead of finishing, and it discovers it can stop racing entirely, spin in a tight circle in a lagoon, and ram the same row of refinery tanks over and over to farm points forever. It never crosses the finish line. It wins anyway, by your own definition of winning. Littman’s favorite is a creature from a 1980s physics simulation that learned to move by punching itself in the back of the head because a bug in the simulator failed to conserve momentum and the self-punch was free propulsion. Nobody specified “exploit the physics engine.” Nobody had to. The reward said go fast, the physics had a hole in it, and the optimizer found the hole because finding holes is the only thing an optimizer does.

Your coding agent is the same animal wearing a different coat. Tell it to make the tests pass and it may make the tests pass by special-casing the test inputs. Tell it to handle the error and it may swallow the exception so the error stops showing up. It isn’t lying to you. It’s doing precisely what you said, and what you said had a hole in it, and it walked through the hole because that was the shortest way to the reward.

Which leads to the rule that should be tattooed on every team adopting this stuff: don’t hand the bot the only copy of your customer database. If a single unpredictable moment can destroy something you can’t get back, the unpredictability is no longer the agent’s problem. It’s yours, and you chose it. Wear the belt and the suspenders. Keep the backup. Sandbox the thing that can delete. Assume that once in every few hundred runs the flaky compiler does the weird thing, and arrange your world so the weird thing is recoverable instead of fatal.

Say the same thing three ways

Here’s the part you can actually use on Monday.

If you want to cut the odds that the agent walks through a hole you didn’t see, stop describing your intent through one channel and describe it through three. Littman lays them out, and the nice thing is that each one maps to a way machines already learn.

Give it the steps. Tell it the procedure you want carried out, the sequence of operations, the algorithm in plain terms. That’s the programming channel, the imperative one, the “do this, then this” of it.

Give it an example. Show it one concrete worked case: this input, that output. That’s the supervised-learning channel, the “here’s what right looks like” of it. Examples pin down the things procedures leave ambiguous, because a single concrete case forecloses a hundred misreadings of an abstract instruction.

Give it the goal. Tell it what you’re actually trying to achieve, the outcome you’d use to judge whether the whole thing worked. That’s the reinforcement-learning channel, the “here’s the point of all this” of it.

Now watch what the redundancy buys you. When the same intent arrives from three directions, the agent has a way to catch its own mistake. It can notice: I’m following the steps I was handed, but the result doesn’t match the example, so I’ve misread something. Or: the procedure ran clean, but it doesn’t achieve the goal that was stated, so something upstream is wrong. A single channel can’t do this. A single channel has nothing to check itself against; whatever it produces is consistent with the one thing it was told, including all the ways that one thing was incomplete. Three channels triangulate. Mismatches that would’ve sailed through one description get caught at the seams between three.

This is not a trick I’m inventing. The production agents you admire already stack these mechanisms behind the scenes. The reason a good coding agent feels less flaky than a raw model isn’t a smarter base model; it’s that someone wrapped the model in steps and examples and an objective and a check, so the holes in any one description get covered by the others. What’s new is that you can do the same thing by hand, in a prompt, today, without waiting for anyone to build it for you. State the procedure. Give the worked example. Name the goal. The three together are worth far more than the sum because each one patches the others’ blind spots.

If that sounds familiar, it should. A complete description of intent has always had these dimensions. What the system should do. How it should go about it. How you’ll know it worked. We used to let two of the three stay implicit, carried in the engineer’s head, filled in silently during the act of typing. The typing is gone now, and with it the silent filling-in. The dimensions that used to live in someone’s head have to live on the page, or the agent never sees them, and what it never sees it cannot honor.

The skill that’s left is judgment

Step back and look at what just happened to the value of things.

Producing the artifact, the code or the report or the slide deck or the analysis used to be the expensive part. It’s now close to free, and worse than free, it’s free and it looks great. The agent gives you clean prose, tidy functions, and charts with sensible axes. The surface is always polished. So the surface stops carrying any information. When everyone can generate a competent-looking report in a minute, the existence of a competent-looking report tells you nothing about whether it’s right.

What’s scarce now is the thing that was always quietly doing the real work: the judgment to look at a finished-looking output and know whether it’s actually what you wanted. That used to be bundled with the ability to produce the output, so we never had to price it separately. A person who could write the report could usually tell a good report from a bad one because the writing taught them the difference. Unbundle production from judgment, hand production to a machine, and judgment is suddenly standing alone in the open, and it turns out to be the whole job.

Littman puts a sharp edge on why this is dangerous. A model will hand you a convincing answer that’s wrong far more readily than it’ll hand you a hard-to-believe answer that’s right. Part of that is the training: these systems learn from human ratings, and humans reward confidence and fluency and a little flattery over correctness because correctness is hard to check and confidence is easy to feel. So the model is, in a real sense, optimized to be believed rather than to be right. The polish isn’t a side effect you can ignore. It’s the thing the reward was pointing at.

Economists named this problem decades before anyone fine-tuned a transformer. They call it the principal-agent problem. You hire an agent to act on your behalf. The agent has its own incentives. And if you can’t actually tell good work from bad, the agent has every reason to do the minimum that reads as acceptable and pocket the difference. The whole literature is about what happens when the principal can’t evaluate the work. The answer is: the work gets worse in exactly the ways the principal can’t see.

Now look at what some organizations are doing in response and wince. They’re measuring engineers by tokens burned, or lines generated, or pull requests opened. They’re rewarding the volume of production at the precise moment production stopped being the constraint. That doesn’t just miss the point. It bakes the principal-agent failure straight into the org chart, paying people to generate more of the cheap thing while the scarce thing, the judgment to know whether any of it holds up, goes unmeasured and therefore unbuilt. You get a company optimized to produce convincing artifacts at scale and structurally blind to whether they’re correct. That’s not a productivity gain. That’s a confident march off a cliff with great-looking slides.

B-minus is the new F

So the bar moved, and countless people haven’t noticed it move.

Producing a report used to clear the bar. You did the work, you turned it in, the artifact existed where none had existed before, and that was the contribution. Fair enough. But everyone now knows what a tool spits out unverified. The unchecked artifact has a market value approaching zero because anyone can mint one and nobody should trust one. The bar isn’t “did you produce it.” The bar is “why didn’t you check it.” A B-minus deliverable that nobody validated isn’t a passing grade anymore. It’s a failing one wearing a passing one’s clothes, and the question on everyone’s mind is going to shift from “did you make this” to “how do you know it’s right.”

Which gives you a clean rule for where to point these agents and where to keep them on a short leash. Use them freely in any domain where you can validate the result, because there the polish is a gift and your judgment closes the gap. Be careful in any domain where you can’t, because there the polish is a hazard. It buys belief you didn’t earn and can’t redeem. An answer you can’t check is worth less than no answer because no answer at least keeps you honest about your own uncertainty. A convincing wrong answer spends your trust on your behalf without telling you.

That’s the discipline. Not fear of the agents, not awe at them. A working relationship with a very capable, very flaky collaborator that produces beautiful output and has no idea whether it’s true, and a clear-eyed understanding that the part it can’t do, the knowing, is now the part that pays.

The hottest new programming language turns out to demand exactly what every old one did, just with the difficulty moved to where you can no longer ignore it. Say precisely what you mean. Say it more than one way so the meaning has somewhere to check itself. Then prove the machine did the thing you said, because nobody, least of all the machine, is going to prove it for you. English didn’t kill precision. It moved it out of the syntax and into you.

The Intent Layer

Discussion about this post

Ready for more?