Local Agents are Toys

Most agent platforms are wrappers on someone else's loop; the company selling the actual runtime hasn't shown up.

Jun 05, 2026

I’ve started noticing the cut. Every agent demo ends a beat early: right after the model returns something impressive, right before anyone asks who’s logged in. You’re meant to clap at the output and skip the questions. What user is this running as? Whose API key paid for the call? What happens to the run when the laptop sleeps? The demo is a magic trick that depends on you not looking at the hands.

A local agent is the script on someone’s laptop that calls a model, calls a tool, calls the model again. The API key lives in an environment variable. Memory is a list in scope. The whole thing runs as one user, in one process, until you close the terminal. Most agent demos circulating right now are this shape: a clean loop, a clever prompt, a result in under a minute. The video gets captioned “GPT just rewrote my codebase” and ships.

That agent is a toy. Not because it’s badly built; most local agents are fine for what they are. The problem is what they are. Single-player. One identity. One session. Lose the process and you lose the run.

A production agent does none of that. It serves many users at once, so memory and environment variables become per-user state. It requires auth, because not every caller should reach every tool. And it has to run durably: long-horizon agents take minutes to hours, and a crashed loop has to resume from its last checkpoint, not from zero.

Three changes. They sound like ops concerns. They aren’t. Each one rewrites how the agent gets built, and each one shapes what’s worth buying or betting on in the agent stack.

State stops being a variable

In a script, “memory” is a list you keep in scope. The model writes to it; you pass it back on the next call; nobody else exists. The minute a second user shows up, that design breaks. Memory has to be keyed to a user, persisted somewhere durable, loaded on every turn, and locked when two requests for the same user arrive at the same time.

Environment variables follow the same arc. Your local OPENAI_API_KEY was a global. In production, the API key, the database connection string, the per-user OAuth token, and the feature-flag overrides: all of it has to be resolved at request time, scoped to the right user, and isolated so one user’s secrets never end up in another user’s prompt.

The shape of the bug changes too. A leaky scope on your laptop is a debugging annoyance. A leaky scope in production is a data breach. The agent isn’t a function anymore. It’s a multi-tenant service that happens to call a model.

If you’re evaluating an agent product (buying it, betting your roadmap on it, sizing the company building it), this is the first question to ask. Not “how good is the model?” but “how does it keep two users’ state apart?” The answer is either “we built that” or “you build that.” There’s no third option, and the second one is much more expensive than it sounds.

Auth becomes part of the loop

The local demo gives the agent every tool you’ve got. In production that’s a non-starter. A support agent talking to a logged-in customer can hit billing endpoints scoped to that customer’s account. The same agent talking to an anonymous visitor cannot. Auth used to be a gate at the front door: the request gets in or it doesn’t. With agents, it has to be enforced inside the loop, on every tool call, with the calling user’s credentials passed through.

This pulls a second design problem along with it. The model can’t be trusted with the credentials. If an LLM can read its own API keys, prompt injection turns into credential exfiltration the first time a hostile document lands in the context window. So the credentials sit outside the model. An auth proxy injects them into outbound traffic at call time. The agent knows it called getBillingHistory; it never sees the token that authorized the call.

That’s not paranoia. It’s the only way the system survives a user pasting in a malicious email.

For a buyer, this is where most agent vendors quietly cheat. The demo runs as an admin user with full access to everything. The pilot runs the same way. The first time a customer asks “can this agent see only their own data?” the team spends a quarter rebuilding the tool layer. The capability you’re paying for isn’t agentic reasoning. It’s an auth model that survives multi-tenant traffic. Vendors that don’t have one are selling you a feature they haven’t built yet.

Durability changes what “running” means

A local agent runs until it returns. If your machine crashes, you start over. That’s fine when a turn takes two seconds. It’s catastrophic when a turn takes two hours, and turns will take two hours, because the useful agents are the ones doing real work: triaging a backlog, reconciling accounts, drafting a report against fifty source documents.

At that horizon, the agent has to be a durable process. Every step writes its result to a checkpoint store. Every restart resumes from the last checkpoint, not from the top. The control flow is no longer a function call; it’s a state machine that survives the process it was running in.

This is where the analogy to web servers stops being useful. A web request is short, stateless, and idempotent on retry. An agent run is long, stateful, and expensive to redo. The infrastructure shifts accordingly. You stop thinking in request/response and start thinking in workflow engines, event sourcing, and resumable computation. The runtime is closer to a job system than a chatbot.

If you’re picking an agent platform, this is the layer that decides whether it can do anything beyond a chat turn. A platform that can run a four-hour reconciliation job and pick it back up after a deploy is a different product from one that can stream a clever response in three seconds. Both call themselves agent runtimes. Only one of them clears the bar for actual work. If you can’t tell which one you’re buying, you’re buying the wrong one.

What this means for you

Most teams pick their stack while they’re still in toy mode. They benchmark frameworks on how fast the local loop runs, how clean the prompt API looks, and how nice the tracing dashboard is. Buyers do the same thing in reverse: they evaluate vendors on demo polish and per-token price. None of it predicts what happens when the demo gets handed to a hundred users.

The questions worth asking are the boring ones. How does this framework key state to a user? How does it pass per-user credentials through to tool calls without the model seeing them? What happens to an in-flight agent run when the process dies? If the answers are “you build it yourself,” you’re not buying a framework. You’re buying a chain abstraction with a billing surface.

The real platform underneath is a durable multi-tenant runtime with auth in every tool call. The model layer alone won’t give you that. The framework layer mostly won’t either. Whoever ships that runtime (and ships it before the rest of the market notices they need it) is selling the actual product. Everyone else is selling a wrapper around someone else’s loop.

Local agents get to assume one user, one identity, one continuous process. Production agents can’t assume any of those. The gap between the two isn’t a polish pass. It’s a different system. The teams that confuse “we have a working demo” with “we have an agent product” are about to learn that the hard way, in front of paying customers.

The Intent Layer

Discussion about this post

Ready for more?