When Code Looks Right but Isn’t

Why human review is poorly suited to catching agent misunderstandings

Jun 04, 2026

Agent-generated code looks right. It reads well. It often works on the first try.

But “looks right” is not sufficient for production software. The question is not whether the code is readable or plausible. The question is whether it can be trusted.

Built for human mistakes

Code review worked because humans wrote code, and humans could review it. A senior engineer could glance at a diff and spot the bug—the off-by-one error, the missing null check, the race condition. Code review was archaeology. You were reading the bones of someone’s thinking, and your job was to catch what they’d missed or misunderstood.

This works because human error follows patterns. We forget boundary conditions. We misread specifications. We over-index on common cases and forget about the edge. A trained reviewer learns to smell these mistakes. They become reflex.

Clean code can be confidently wrong

When an agent generates code, you can’t read it the way you read human code. Agent code is often cleaner than human code. It doesn’t have the accidental architecture, the shortcuts born from deadline pressure, or the documentation gaps. But clean code can be confidently wrong. An agent can misunderstand a spec in ways that produce perfectly formatted, well-structured, plausible code that does the wrong thing with absolute consistency.
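A small, hypothetical illustration of what that looks like. Suppose the spec says two sign-ups belong to the same user when their email addresses match regardless of case and surrounding whitespace. The function names and the spec itself are invented for this sketch. The first function is tidy, typed, and documented, and it is wrong every single time in the same way.

```python
def dedupe_signups(emails: list[str]) -> list[str]:
    """Return sign-up emails with duplicates removed, preserving first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for email in emails:
        # Wrong relative to the assumed spec: raw strings are compared, so
        # "Ann@Example.com" and "ann@example.com " count as different users.
        if email not in seen:
            seen.add(email)
            unique.append(email)
    return unique


def dedupe_signups_per_spec(emails: list[str]) -> list[str]:
    """Same shape, but keyed on the normalized address the assumed spec describes."""
    seen: set[str] = set()
    unique: list[str] = []
    for email in emails:
        key = email.strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

Nothing in the first version looks like a slip. There is no off-by-one and no missing null check to smell. The only way to see the problem is to know what the spec meant.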

The problem: human code review was optimized for catching human mistakes. It wasn’t optimized for catching confident misunderstandings. When a human reads an agent diff, they’re still looking for the mistakes a human would make. But agents don’t make those mistakes. They make different ones.

This is not a weakness of code review. It’s a signal that code review, as a mechanism, has to change.

The trust boundary moves upstream

Here’s what changes: the trust boundary moves from the diff itself to the boundaries around it.

The diff stops being the evidence. When humans wrote code, the diff was the artifact that mattered. You reviewed it, you approved it, and you deployed. The diff was proof that someone had made a deliberate choice. With agent-generated code, the diff is just output. It’s what came out the back of a process. The real evidence lives upstream: in the specification, in the test suite, in the type system, in the runtime constraints that box in what the code can actually do.

Think of it this way. When you review human code, you’re verifying that a person understood the problem and made the right trade-offs. When you review agent code, you’re verifying that the spec was clear enough, the tests were comprehensive enough, and the types were tight enough to force the agent toward correctness.

Tests become the primary review mechanism

Not a mechanism. The primary one. With human code, tests are one signal among many. A reviewer might catch bugs that tests miss. But with agent code, tests are the closest thing you have to unambiguous evidence that the code does what you want. An agent can’t reason about your unstated needs. It only knows what the spec says and what the tests verify. If your tests don’t cover a case, the agent has no signal telling it to handle that case. If your tests pass on generated code, that’s not luck. That’s the contract being fulfilled.

This means test suites need to be different. They need to be exhaustive where it counts: boundaries, failure modes, and the cases nobody thought to write down. You can’t rely on a human coming in later and saying, “Oh, we should handle this case.” The human review step is downstream. It’s looking at whether the generated code violates the test suite or spec, not whether the human would have written it differently. This is a relief and a constraint in equal measure.
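As a rough sketch of tests written as a contract, assume a hypothetical `shipping_fee` function and an invented spec: orders of 5000 cents or more ship free, everything else pays 499 cents, and negative subtotals are rejected. Each test pins one clause, including the ones a person would normally leave implicit.

```python
import unittest


def shipping_fee(subtotal_cents: int) -> int:
    """Hypothetical function under test: flat fee below the free-shipping threshold."""
    if subtotal_cents < 0:
        raise ValueError("subtotal cannot be negative")
    return 0 if subtotal_cents >= 5000 else 499


class ShippingFeeContract(unittest.TestCase):
    """Each test states one clause of the assumed spec explicitly."""

    def test_fee_applies_just_below_threshold(self):
        self.assertEqual(shipping_fee(4999), 499)

    def test_threshold_itself_ships_free(self):
        self.assertEqual(shipping_fee(5000), 0)

    def test_empty_cart_still_pays_fee(self):
        # An "unstated need" made explicit: someone had to decide this.
        self.assertEqual(shipping_fee(0), 499)

    def test_negative_subtotal_is_rejected(self):
        with self.assertRaises(ValueError):
            shipping_fee(-1)


def run_contract() -> unittest.TestResult:
    """Run the contract tests and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShippingFeeContract)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The empty-cart test is the interesting one. Whether an empty cart pays a fee is a product decision, and writing the test forces someone to make it before the code exists.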

Types become guardrails

A strongly typed system doesn’t just prevent bugs. It prevents categories of bugs. When an agent generates code, a tight type system becomes a boundary that forces correctness in ways that human reviewers never had to. If you’ve typed your functions precisely—if you’ve made illegal states unrepresentable—then the agent can’t generate code that violates those invariants. It can be wrong, but not in those specific ways. You’ve narrowed the surface area of what “wrong” can mean.
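Here is a minimal sketch of “illegal states unrepresentable” in Python, using an invented order lifecycle. Instead of one class with optional `paid_at` and `refunded_at` fields, each state is its own type carrying only the fields that exist in that state. The enforcement assumes a static checker such as mypy or pyright is part of the pipeline, since Python does not check annotations at runtime.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass(frozen=True)
class Pending:
    """Order created but not paid; there is no payment timestamp to misuse."""


@dataclass(frozen=True)
class Paid:
    paid_at: datetime


@dataclass(frozen=True)
class Refunded:
    paid_at: datetime
    refunded_at: datetime


# Hypothetical union for this sketch: an order is in exactly one of these states.
OrderState = Union[Pending, Paid, Refunded]


def describe(state: OrderState) -> str:
    """Handle each state explicitly; only fields that exist can be read."""
    if isinstance(state, Pending):
        return "awaiting payment"
    if isinstance(state, Paid):
        return f"paid at {state.paid_at.isoformat()}"
    return f"refunded at {state.refunded_at.isoformat()}"
```

A refunded order without a payment date, or a pending order with one, cannot be constructed through these types. Generated code can still be wrong about what to do with a refund, but it cannot be wrong in that particular way.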

This is why loosely typed code is likely to struggle more with agent-generated changes. Python can be wonderfully flexible in human hands. An experienced Python engineer can hold complex invariants in their head and write code that violates type hints when there’s a good reason. An agent has no head to hold anything in. It sees “Any” and generates code that might crash at runtime because nothing caught the mistake earlier.
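To make the contrast concrete, here is a sketch with two invented functions. The first accepts `Any`, so a misspelled key only shows up as a crash at runtime. The second uses a `TypedDict`, which a checker such as mypy can use to reject the bad call before anything runs. Python itself still does not enforce the annotation, so this assumes the checker is actually wired into the pipeline.

```python
from typing import Any, TypedDict


def total_loose(order: Any) -> float:
    # Nothing stops a caller (or a generator) from passing the wrong shape;
    # a misspelled key only surfaces as a KeyError at runtime.
    return order["price"] * order["quantity"]


class Order(TypedDict):
    price: float
    quantity: int


def total_strict(order: Order) -> float:
    # With a TypedDict, a static checker can flag {"price": 2.5, "qty": 3}
    # as the wrong shape. The runtime behavior is otherwise identical.
    return order["price"] * order["quantity"]
```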

Runtime constraints become part of the contract

Error handling. Rate limiting. Circuit breakers. Timeouts. These aren’t just defensive programming anymore. They’re part of the contract. When a human writes code that calls an external API, they might add error handling because they’ve been burned before. An agent writes error handling because the spec says the operation can fail. But the distinction doesn’t matter. What matters is that runtime constraints—things that actually execute and enforce boundaries—become part of what you’re verifying when you look at generated code.

You’re not asking, “Did the developer think to handle this?” You’re asking, “Does the system enforce this?” Because if it doesn’t, the agent won’t either.
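One way to read “does the system enforce this?” is a constraint that lives in the call path rather than in anyone’s memory. The circuit breaker below is a deliberately minimal sketch, not a production implementation. The class name, thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are refused because the breaker is open."""


class CircuitBreaker:
    """Refuse calls for `reset_after` seconds once `max_failures` happen in a row."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call refused")
            # Cool-down elapsed: allow one trial call. A single failure
            # from here re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success resets the consecutive-failure count.
        self._failures = 0
        return result
```

If every outbound call has to go through something like `breaker.call(...)`, then whether the generated code “remembered” to protect the dependency stops being a review question. The boundary enforces it either way.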

Review the spec, not just the code

This is the inversion. The human reviewer’s job shifts upstream. Instead of reviewing the diff, you’re reviewing the spec. Is it complete? Does it cover the cases that matter? Are the invariants stated clearly? Have you told the agent what failure looks like? This is harder work in some ways and easier in others. You’re not reading code. You’re reading requirements. But you’re reading them with new intensity because now they’re the thing that actually shapes the output.

The diff is still reviewed. But the review is compressed. You’re looking for obvious violations of the specification, not reading the code as if a human wrote it. Does it call functions that don’t exist? Does it assume behavior that contradicts the spec? Is there a category of error it silently ignores? These are fast questions to answer. They’re high-signal gates.
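Some of these fast questions can be partly automated. As one hedged example, the sketch below uses Python’s standard `ast` module to flag `except` blocks that do nothing, which is one narrow form of “silently ignores an error.” The function names are invented, and a real gate would likely lean on an existing linter rather than a hand-rolled check.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of `except` blocks whose body is only `pass` or `...`."""
    flagged: list[int] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if all(_is_noop(stmt) for stmt in node.body):
            flagged.append(node.lineno)
    return sorted(flagged)


def _is_noop(stmt: ast.stmt) -> bool:
    """True for statements that do nothing: `pass` or a bare `...`."""
    if isinstance(stmt, ast.Pass):
        return True
    return (isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Constant)
            and stmt.value.value is Ellipsis)
```

Run against a generated diff, a check like this answers one of the questions above in milliseconds and leaves the human free to look at what the spec did not say.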

Trust the boundaries you built

The unsettling truth is this: trust is no longer about trusting the coder. It’s about trusting the boundaries you’ve built. If you’ve specified clearly, if your tests are complete, if your types are tight, if your runtime constraints are enforced—then agent-generated code is as trustworthy as the system that produced it. And that system is entirely under your control.

What feels like a loss of human judgment is actually a transfer of judgment upstream. You can’t rely on a reviewer to catch mistakes during code review because code review is too late. The reviewer can’t see what wasn’t specified. They can’t know what the agent should have assumed. The decisions that matter happen before the code is written: in the spec, in the test design, and in the type definitions you choose.

The diff is still worth reading

This is harder in some ways. It requires discipline. It means you can’t ship incomplete specs and count on human reviewers to fill in the gaps. But it’s also more honest. You’re no longer pretending that code review is the thing that ensures correctness. You’re admitting that correctness comes from specification, from testing, from constraint. Code review becomes what it always should have been: a check that the output matches the input, not the entire quality gate.

The diff is still worth reading. But reading it doesn’t create trust. Trust comes from knowing the system that made it. And that system is built from specs, tests, types, and constraints, the things that agents can actually learn from and be held accountable to.

Aliaksei Zelianouski

Jun 4

Agree the boundary moved to tests and types. For me the only thing that keeps the result good is keeping all the responsibility on myself - that's what forces me to actually read the code, write some of it by hand, think through the use cases, read the docs. Full delegation to the AI only works when a bad result is acceptable. My home automation runs fine and I've never once looked at the code.

The Intent Layer
