I am Not Sold on Autonomous Coding Agents
These are my notes from messing with autonomous coding agents. Slightly dramatized. I left out the worst solutions I tried. You are welcome. It's also not an April Fools' post. I haven't gone insane yet.
I am not big on autonomous coding agents. After running them outside of demos and cleaning up after them, I feel extremely underwhelmed.
Which is slightly inconvenient, because as a consultant I keep ending up in rooms where people are very excited about them. And I sound like an old man yelling at clouds.
The pitch is always the same:
"Agents that build features, that open pull requests, that fix bugs while you sleep! Maybe even magical agents that reduce the need for engineers."
In the real world, they just add another system you have to engineer. And then:
"Do you feel the momentum? AI is moving fast. Nobody wants to be the team that missed it. We need to get to the next step and automate everything we can."
The conversations blur together. The energy is always urgency with a big touch of FOMO.
If you push a bit more, what I actually hear is: "We want speed, predictability, and fewer surprises. A way to make delivery feel like a pipeline instead of constant back-and-forth." In other words, "we are tired of uncertainty." Which is fair. Engineers want more accountability and autonomy. Teams want systems they can trust. Autonomous agents are being sold as the answer to that. They are not there, not even close.

There is a version of this "autonomous coding agents" idea that is genuinely compelling. Small, atomic changes. Reproducible steps and systems you can rebuild without surprises. That would eliminate a lot of what people call cognitive debt, the invisible baggage that accumulates until nobody remembers why things are the way they are. In theory, agents working in disciplined, incremental steps could reduce that.
In theory.
I didn't want to argue from a distance, so I tried to build a setup similar to what Ramp described in Why we built our background agent.
I knew my setup would be a bit scuffed, a bit Flintstone-y, but I tried to use similar tools and let agents loose to see where they break. There are multiple pieces that are quite bad, and even in a non-scuffed version of the setup, I am not convinced it gets significantly better.
A few things stood out, in a good way. Opencode is one. It's hackable, observable, and moves in a direction that makes sense. I would suggest using it locally or as a remote agent harness (autonomous or not). I was able to connect to a running agent and inspect what it generated through my own instance of Opencode. Very nice. I can imagine a future where tools like this become the control plane for agents. It's one of the few things that feels like it could scale beyond demos.
Another piece that stands out is observability. Standards like agent trace genuinely make sense. Being able to trace what an agent is doing turns a black box into something you can reason about. If agents are going to be part of real systems, this kind of visibility is not optional.
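To make the idea concrete, here is a minimal sketch of what tracing tool calls could look like. This is not the agent trace standard itself, just a hypothetical wrapper that records every tool invocation as a JSON event, so a session stops being a black box:

```python
import json
import time
import uuid


def traced(tool_fn, log):
    """Wrap a tool function so every call is appended to `log` as a JSON event.

    Hypothetical sketch; real trace standards define much richer schemas.
    """
    def wrapper(*args, **kwargs):
        event = {
            "id": str(uuid.uuid4()),
            "tool": tool_fn.__name__,
            "args": list(args),
            "ts": time.time(),
        }
        try:
            event["result"] = tool_fn(*args, **kwargs)
            event["status"] = "ok"
            return event["result"]
        except Exception as exc:
            event["status"] = "error"
            event["error"] = str(exc)
            raise
        finally:
            # The event is recorded whether the tool succeeded or blew up.
            log.append(json.dumps(event, default=str))
    return wrapper


def read_file(path):
    # Stand-in for a real agent tool.
    return "stub contents of " + path


trace_log = []
read = traced(read_file, trace_log)
read("README.md")
```

Once every tool call leaves an event like this behind, "what did the agent actually do" becomes a query over a log instead of guesswork.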
So, back to my scuffed nonsense. What I want is simple: agents should use the same setup as a developer. The same database, the same queues, and the same LSPs. All packaged in a sandboxed environment so the agent can't break prod.
Unfortunately, current sandboxing solutions really suck. Most of them rely on Docker containers. I don't want to run my entire environment in a container, and I don't want to run my agent in a container either. I want Postgres running in the same sandbox as the agent, so both the code and the agent can interact with the database naturally.
Right now, the tradeoff is awful: either lock the agent down so much that it can't do anything useful or give it enough access to be useful and risk it nuking your environment.
You are one bad loop away from dropping production. Agent sandboxing is far from a solved problem. So until we go full WASM nonsense (small CLI binaries on WASM will run the agentic world, tekbog predicts), we have shitty Docker containers as sandboxing. Let us know if clankercloud.ai fixes this, btw.
There is one direction that feels promising here: copy-on-write filesystems. If you can give an agent a full environment snapshot and let it mutate that safely, and back it up somehow, that's fire. You get isolation without crippling capability. Projects like agentfs are interesting in this space: agentfs offers a virtual filesystem backed by SQLite (Turso), which is great, because I can do one SQLite database per opencode session.
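The per-session idea can be sketched in a few lines. This is not agentfs; it's a toy where a plain file copy stands in for a real copy-on-write snapshot: each session gets its own fork of the base database, and whatever the agent does to it, the base stays intact:

```python
import os
import shutil
import sqlite3
import tempfile


def new_session_db(base_path):
    """Fork the base database for one agent session.

    A plain file copy stands in for a real copy-on-write snapshot here.
    """
    fd, session_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    shutil.copyfile(base_path, session_path)
    return session_path


# Build a tiny base database.
base_fd, base = tempfile.mkstemp(suffix=".db")
os.close(base_fd)
con = sqlite3.connect(base)
con.execute("CREATE TABLE users (name TEXT)")
con.execute("INSERT INTO users VALUES ('alice')")
con.commit()
con.close()

# The agent works against its own fork...
session = new_session_db(base)
s = sqlite3.connect(session)
s.execute("DELETE FROM users")  # ...and nukes the table.
s.commit()
s.close()

# ...but the base snapshot is untouched.
b = sqlite3.connect(base)
count = b.execute("SELECT COUNT(*) FROM users").fetchone()[0]
b.close()
print(count)  # -> 1
```

A real CoW filesystem gives you this without paying for a full copy, which is exactly why that direction is interesting.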
The agent development environment is a horrible bottleneck. If there is a path where this works, it starts with reproducibility.
Nix. Nix! Nix, btw. I use Nix. Have I said I use Nix?
Can you describe your system well enough to rebuild it exactly? Most teams can't. Well, then most teams can't use Nix. They have scripts, docs, tribal knowledge, and a CI pipeline that "usually" works. Nix is annoying in all the right ways. It forces you to be explicit. nix flake check is really nice as a validation step after every opencode session.
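For readers who haven't touched Nix: here is a minimal, hypothetical sketch of a flake that pins the agent's environment and carries a check output, so nix flake check has something to validate after each session (names and packages are illustrative, not from my actual setup):

```nix
{
  description = "Agent dev environment (hypothetical sketch)";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
    in {
      # Everything the agent (and a developer) needs, pinned to one nixpkgs revision.
      devShells.${system}.default = pkgs.mkShell {
        packages = [ pkgs.postgresql pkgs.nodejs ];
      };

      # `nix flake check` builds this; wire your real test suite in here.
      checks.${system}.tests = pkgs.runCommand "tests" { } ''
        echo "run your test suite here" > $out
      '';
    };
}
```

The point is the shape, not the contents: one file that describes the environment exactly, and a single command that tells you whether the agent left the tree in a buildable state.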
I wanted to go all-in on nix-ci, with fully reproducible builds enforced at every step. Unfortunately, it didn't happen. Priorities. Sorry, @kerchove_ts. But you, the reader, should check it out. Also, surprisingly, Nix is still niche and a very hard sell to enterprise clients. I tried building a deterministic way to create a Nix flake by detecting it from your codebase... but it got seriously scope-crept. This is something I want to revisit soon.
I am convinced that "context engineering" is the right direction for any AI-assisted development. I like what Lode Coding is doing. Unfortunately, FJ isn't building it with autonomous agents in mind, so I started building my own version I named Shared Context Engineering.
The idea is a variant of context engineering that makes it easier for teams (or even multiple teams) to share context, and I am building tooling around that. I forced the autonomous agents to use SCE, and the results are… fine-ish. It's a bit more manageable than doom-looping or ralphing. Creating a checkpoint after each task in the plan helped a lot, since I can intervene and replay tasks if the clanker went off the tracks.
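The checkpoint-per-task idea is simple enough to sketch. This is not SCE itself, just a hypothetical toy: run the plan, snapshot state after every task, and if a later task goes sideways, rewind to a known-good checkpoint and replay from there:

```python
import copy


def run_plan(tasks, state, checkpoints):
    """Execute tasks in order, snapshotting state after each one.

    Hypothetical sketch; each checkpoint is a point where a human can intervene.
    """
    for task in tasks:
        state = task(state)
        checkpoints.append(copy.deepcopy(state))
    return state


def replay_from(tasks, checkpoints, index):
    """Restart the plan from checkpoint `index`, rerunning the later tasks."""
    state = copy.deepcopy(checkpoints[index])
    kept = checkpoints[: index + 1]
    for task in tasks[index + 1:]:
        state = task(state)
        kept.append(copy.deepcopy(state))
    return state, kept


# Toy tasks standing in for agent steps.
tasks = [
    lambda s: {**s, "scaffold": True},
    lambda s: {**s, "feature": "v1"},
    lambda s: {**s, "tests": "passing"},
]

checkpoints = []
final = run_plan(tasks, {}, checkpoints)

# Task 3 went off the rails? Rewind to checkpoint 1 and replay from there.
fixed, _ = replay_from(tasks, checkpoints, 1)
```

The replay is only trustworthy if the tasks are deterministic, which is exactly where the reproducibility story from the Nix section comes back in.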
I don't think autonomous coding agents are a bad idea. I think they are an unfinished one. Right now, they don't really reduce complexity. They just introduce a new layer of it, another system you have to design, monitor, and eventually clean up after. The problem isn't that the agents aren't smart enough, the real problem is everything around them.
We don't have good sandboxing, so we are stuck choosing between something that is useless and something that's dangerous. We don't have true reproducibility either, which means every run carries a bit of uncertainty. On top of that, we lack proper control planes, so visibility and intervention end up being bolted on rather than built in. And without shared, structured context, agents are left operating with just enough understanding to be risky.
Because of that, most of what exists today doesn't really hold up outside of demos. It works just well enough to impress, but not well enough to rely on.
I don't think better models will fix this. But better tooling might.