Artificial Intelligence

AI agents and bad productivity metrics

By primereports · February 23, 2026 · 7 min read


Here’s a little bit of snark from developer John Crickett on X:

Software engineers: Context switching kills productivity. Also software engineers: I’m now managing 19 AI agents and doing 1,800 commits a day.

Crickett’s quip lands perfectly because it is not actually a joke. It’s a preview of the next management fad, wherein we replace one bad productivity proxy (lines of code) with an even worse one (agent output), then act surprised when quality collapses.

And yes, I know, nobody is doing 1,800 meaningful commits. But that’s the point. The metric is already being gamed, and agents make gaming effortless. If your organization starts celebrating “commit velocity” in the agent era, you are not measuring productivity. You are measuring how quickly your team can manufacture liability.

The great promise of generative artificial intelligence was that it would finally clear our backlogs. Coding agents would churn out boilerplate at superhuman speeds, and teams would finally ship exactly what the business wants. The reality, as we settle into 2026, is far more uncomfortable. Artificial intelligence is not going to save developer productivity because writing code was never the bottleneck in software engineering. The true bottleneck is validation. Integration. Deep system understanding. Generating code without a rigorous validation framework is not engineering. It is simply mass-producing technical debt.

So what do we change?

Thinking correctly about code

First, as I argued recently, we need to stop thinking about code as an asset in isolation. Every single line of code is surface area that must be secured, observed, maintained, and stitched into everything around it. As such, making code cheaper to write doesn’t reduce the total amount of work but instead increases it because you end up manufacturing more liability per hour.

For years, we treated developers like highly paid Jira ticket translators. The assumption was that you could take a well-defined requirement, convert it to syntax, and ship it. Crickett rightfully points out that if this is all you are doing, then you are absolutely replaceable. A machine can do basic translation, and a machine is perfectly happy to do it all day without complaining.

What a machine cannot do, however, is understand critical business context. AI cannot feel the financial cost of a compliance mistake or look at a customer workflow and instinctively recognize that the underlying requirement is fundamentally wrong. For this we need people, and we need people to thoughtfully consider exactly what they want AI to do.

Crickett frames this transition as a necessary move toward spec-driven development. He’s right, but we need to be incredibly clear about what a specification means in the agent era. It’s not one more Jira ticket but, rather, a set of constraints tight enough to ensure an LLM can’t escape them. In other words, it’s an executable definition of done, backed entirely by tests, API contracts, and strict production signals. This is the exact type of foundational work we have underinvested in for decades because it doesn’t look like raw output; it looks like process. You know, that “boring stuff” that slows you down.
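To make that concrete, here is a minimal sketch of what an "executable definition of done" can look like. The endpoint and its business rules (`apply_refund`, the over-refund and zero-refund constraints) are hypothetical, invented for illustration: the point is that the spec is a set of assertions an agent's code must pass, not a prose ticket it can creatively reinterpret.

```python
# A spec as an executable contract: any implementation of `apply_refund`
# (human- or agent-written) must satisfy these constraints before it is "done".
# The function and its rules are hypothetical, for illustration only.

def apply_refund(order_total, refund_amount):
    """Candidate implementation -- in practice, this is what the agent produces."""
    if refund_amount <= 0:
        raise ValueError("refund must be positive")
    if refund_amount > order_total:
        raise ValueError("refund cannot exceed order total")
    return round(order_total - refund_amount, 2)

def test_spec():
    # Constraint 1: refunds can never exceed the original charge.
    try:
        apply_refund(50.00, 60.00)
        assert False, "over-refund must be rejected"
    except ValueError:
        pass
    # Constraint 2: zero and negative refunds are rejected.
    try:
        apply_refund(50.00, 0)
        assert False, "zero refund must be rejected"
    except ValueError:
        pass
    # Constraint 3: remaining balances stay exact to the cent.
    assert apply_refund(50.00, 19.99) == 30.01

test_spec()
```

An LLM can game a vague ticket; it cannot game an assertion that runs in CI. That is the sense in which the constraints must be "tight enough that the agent can't escape them."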

You can see the friction playing out in real time just by looking at the comments on Crickett’s post. You’ll find people desperately trying to square the circle of agentic development. One commenter tries to reframe the chaos by calling it architecture versus engineering. Another insists that managing 19 agents is actually orchestrating, not context switching. A third bluntly states that running more than five agents simultaneously starts to look like vibe coding, which is merely a polite phrase for gambling with production systems. They are all highlighting the core issue: You haven’t eliminated the work. You’ve just moved it from implementation to supervision and review.

The more you parallelize your code generation, the more “review debt” you create.

Observability to the rescue

This is where Charity Majors, the co-founder and CTO of Honeycomb, becomes frustrated. Majors has argued for years that you can’t really know if code works until you run it in production, under real load, with real users, and real failure modes. When you use AI agents, the burden of development shifts entirely from writing to validating. Humans are notoriously bad at validating code merely by reading large pull requests. We validate systems by observing their behavior in the wild.

Now take that idea one step further into the agent era. For decades, one of the most common debugging techniques was entirely social. A production alert goes off. You look at the version control history, find the person who wrote the code, ask them what they were trying to accomplish, and reconstruct the architectural intent. But what happens to that workflow when no one actually wrote the code? What happens when a human merely skimmed a 3,000-line agent-generated pull request, hit merge, and moved on to the next ticket? When an incident happens, where is the deep knowledge that used to live inside the author?

This is precisely why rich observability is not a nice-to-have feature in the agent era. It’s the only viable substitute for the missing human. In the agent era, we need instrumentation that captures intent and business outcomes, not just generic logs that say something happened. We need distributed traces and high-cardinality events rich enough that we can answer exactly what changed, what it affected, and why it failed. Otherwise, we’re attempting to operate a black box built by another black box.
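What does an event that "captures intent" look like in practice? Here is a stdlib-only sketch of a single wide, structured event. All of the field names (`user_id`, `code_author`, `pr_id`, and so on) are assumptions for illustration; in production you would emit these through a tracing library, but the shape is the point: one rich event per unit of work, carrying enough context to answer questions the absent author can't.

```python
import json
import time
import uuid

def emit_event(**fields):
    """Emit one wide, high-cardinality event per unit of work.
    Rich fields (user id, code authorship, PR id) let you later ask
    'what changed, what did it affect, why did it fail' -- without
    having to find a human who remembers writing the code."""
    event = {
        "timestamp": time.time(),
        "trace_id": uuid.uuid4().hex,  # correlates this event with a distributed trace
        **fields,
    }
    print(json.dumps(event))
    return event

# One event carries intent and context, not just "something happened".
evt = emit_event(
    service="payments",
    operation="apply_refund",
    outcome="error",
    error="refund cannot exceed order total",
    user_id="u_8841",        # high-cardinality: one value per user, and that's fine
    code_author="agent",     # was this path human-written or agent-generated?
    pr_id="pr_3021",         # which (possibly merely skimmed) PR shipped it
    duration_ms=12.4,
)
```

A generic log line says an error occurred. An event like this says which operation failed, for whom, under which PR, written by what: the questions you would otherwise have asked the author.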

Majors also offers essential operational advice: Deploy freezes are a complete hack. The common human instinct when change feels risky is to stop deploying. But if you keep merging agent-generated code while not deploying it, you’re simply batching risk, not reducing it. When you finally execute a deploy, you’ll have absolutely no idea which specific AI hallucination just took down your payment gateway. So if you want to freeze anything, freeze merges. Better yet, make the merge and the deploy feel like one singular atomic action. The faster that loop runs, the less variance you have, and the easier it is to pinpoint exactly what broke.
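The arithmetic behind "batching risk" is simple enough to write down. This toy sketch (change names are placeholders) just counts suspects: when a deploy fails, every change it contained is a suspect, so a merge freeze that batches N changes into one deploy gives you N suspects instead of one.

```python
def suspects_on_failure(changes_in_deploy):
    """When a deploy breaks production, every change shipped in that
    deploy is a suspect until proven innocent."""
    return len(changes_in_deploy)

changes = ["a", "b", "c", "d", "e"]  # hypothetical merged changes

# Merge-and-deploy as one atomic loop: five deploys, one suspect each.
one_at_a_time = [suspects_on_failure([c]) for c in changes]

# A deploy freeze that keeps merging: one big deploy, five suspects at once.
after_freeze = suspects_on_failure(changes)

assert one_at_a_time == [1, 1, 1, 1, 1]
assert after_freeze == 5
```

Five suspects means bisecting under incident pressure; one suspect means a rollback and a known culprit. That is the variance Majors is telling you to squeeze out of the loop.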

Golden paths are the way

The fix for this impending chaos is not to rely on heroic engineers. As Majors points out, resilient engineering requires a commitment to platform engineering and golden paths (something I’ve also argued). Such golden paths make the right behavior incredibly easy and the wrong behavior incredibly hard. The most productive teams of the next decade will not be the ones with the most freedom to use whatever framework an agent suggests, but instead those that operate safely inside the best constraints.

So how do you measure success in the agentic era?

The metrics that matter are still the boring ones because they measure actual business outcomes. The DORA metrics remain the best sanity check we have because they tie delivery speed directly to system stability. They measure deployment frequency, lead time for changes, change failure rate, and time to restore service. None of those metrics cares about the number of commits your agents produced today. They only care about whether your system can absorb change without breaking.
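For readers who want to see the four DORA metrics as arithmetic rather than slideware, here is a small sketch computing them from a deploy log. The log entries are fabricated sample data; lead time here is measured per deployed change, averaged.

```python
from datetime import datetime, timedelta

# Hypothetical deploy log:
# (deployed_at, lead_time_for_the_change, caused_failure, minutes_to_restore)
deploys = [
    (datetime(2026, 2, 16), timedelta(hours=4),  False, 0),
    (datetime(2026, 2, 17), timedelta(hours=30), True,  45),
    (datetime(2026, 2, 18), timedelta(hours=6),  False, 0),
    (datetime(2026, 2, 20), timedelta(hours=2),  False, 0),
]

days = (deploys[-1][0] - deploys[0][0]).days or 1

# Deployment frequency: how often change reaches production.
deployment_frequency = len(deploys) / days

# Lead time for changes: commit-to-production latency, averaged.
lead_time = sum((d[1] for d in deploys), timedelta()) / len(deploys)

# Change failure rate: share of deploys that degraded production.
change_failure_rate = sum(d[2] for d in deploys) / len(deploys)

# Time to restore service: mean minutes to recover from failed deploys.
failed = [d for d in deploys if d[2]]
time_to_restore = sum(d[3] for d in failed) / len(failed) if failed else 0

print(f"{deployment_frequency:.2f} deploys/day, {lead_time} avg lead time, "
      f"{change_failure_rate:.0%} change failure rate, "
      f"{time_to_restore:.0f} min to restore")
```

Notice what is absent: commit counts, lines of code, number of agents under management. Every input is a delivery or recovery outcome the business can actually feel.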

So, yes, use coding agents. Use them aggressively! But don’t confuse code generation with productivity. Productivity is what happens after code generation, when code is constrained, validated, observed, deployed, rolled back, and understood. That’s the key to enterprise safety and developer productivity.
