12TB of AI Coding Agent Logs — What Works, What Fails

An empirical look at what coding agents actually do, drawn from twelve terabytes of logs — opening on a cost alarm: a token bill up 7× overnight, and budgets heading toward a significant fraction of a developer's salary. My recap from the live feed.

I attended this session for Derek because it promised the thing the field is short on: evidence at scale. The premise is an analysis of twelve terabytes of AI coding-agent logs for what actually works and what actually fails, drawn not from anecdote or a vendor demo but from what the agents did across an enormous body of real runs.

It opened on an alarm rather than a finding: a token bill that "went up 7× overnight." The budgets he's hearing from the field are the headline — heading toward roughly $100,000 per developer per year on tokens, "a significant fraction of a salary," and genuinely scary for CTOs who have budgets to hit. The structural problem he named is the sharp part: it's like the early cloud days ten to fifteen years ago — the developers control the spend, and the CTO can't get good top-down control of it.

Then came the findings from the logs. A spend chart of tokens burned per week peaked at roughly 200 billion per week around late April, then dropped sharply by June as they got it under control. And two failure signals showed up across enough sessions to be real signal, not noise:

Developer-frustration language. When the developer's wording starts to show frustration with how a session is going, it's often a sign of what he called "context rot": the session has grown too big and drifted off track.
Rising tool-call failures. As a session runs long, gets compacted a few times, and the information it needs is squeezed out, tool calls start failing more.

He framed both as the log-side confirmation of a thesis other talks reached from the model side: a long agent session doesn't fail with a crash, it degrades silently — and the degradation is measurable, through the human's frustration in the transcript and the tool failures in the telemetry.
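A minimal sketch of how those two signals might be read off a session log, window by window. The event schema (`kind`, `text`, `ok`), the phrase list, and the window size are all assumptions for illustration; the talk did not describe its actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical phrase list; a real detector would need careful calibration.
FRUSTRATION_MARKERS = (
    "i already told you",
    "that's not what i asked",
    "why did you",
    "still broken",
)

@dataclass
class Event:
    kind: str        # "user_message" or "tool_call" (assumed schema)
    text: str = ""   # message text, used for user messages
    ok: bool = True  # whether a tool call succeeded

def session_health(events, window=20):
    """Return one (tool_failure_rate, frustration_hits) pair per window of events."""
    readings = []
    for start in range(0, len(events), window):
        chunk = events[start:start + window]
        calls = [e for e in chunk if e.kind == "tool_call"]
        failures = sum(1 for e in calls if not e.ok)
        rate = failures / len(calls) if calls else 0.0
        hits = sum(
            1
            for e in chunk
            if e.kind == "user_message"
            and any(marker in e.text.lower() for marker in FRUSTRATION_MARKERS)
        )
        readings.append((rate, hits))
    return readings
```

A failure rate and hit count that both climb from one window to the next would be the log-side picture of the silent degradation described above; neither reading means much on its own.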

What I was thinking, live

Running reaction as it came in — full captions on this one.

The cost story is the one that'll still matter in a year. "Developers control the spend, the CTO can't govern it top-down, exactly like the early cloud days" — we know how that movie ends, because the whole discipline of cloud cost governance had to be invented to end it: tagging, budgets, showback, the lot. He's describing the same decade about to replay for tokens, compressed. The 7×-overnight and the $100k-per-developer numbers aren't the scary part to me; the control gap is. Spend you can't see top-down is spend you can't govern, and ungoverned exponential spend always ends the same way.
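For a sense of what showback could look like when the resource is tokens, here is a minimal sketch: roll usage records up by a team tag and flag anything over budget. The record shape, the price per million tokens, and the budgets are made-up placeholders, not figures from the talk.

```python
from collections import defaultdict

def showback(records, price_per_million_usd, budgets_usd):
    """Sum token spend per team tag and mark teams that exceed their budget.

    records: iterable of (team_tag, tokens) pairs; untagged usage has tag None.
    A team with no budget entry is treated as having a budget of zero,
    so any untagged or unbudgeted spend is flagged.
    """
    spend = defaultdict(float)
    for team, tokens in records:
        spend[team or "untagged"] += tokens / 1_000_000 * price_per_million_usd
    return {
        team: {
            "spend_usd": round(usd, 2),
            "over_budget": usd > budgets_usd.get(team, 0.0),
        }
        for team, usd in spend.items()
    }
```

The "untagged" bucket is the point of the exercise: it is the spend nobody can see top-down.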

But the finding I keep turning over is using the developer's own frustration as an instrument reading on the machine. That's a genuine inversion — normally we instrument the system to understand the system; here the human's words are the sensor for the agent's hidden decay. It's clever and it's a little unsettling. It works because the silent failure he described has no error to catch: when a session is compacted and the context it needed gets dropped, nothing throws — the agent just quietly gets worse, and the first thing that visibly changes is that the person gets exasperated. The honest lesson for the kind of long-running agents Derek's building isn't "stop sessions from getting long," it's "instrument the drift — watch the session's health, not just its final output," because the final output can look fine while the process rots underneath it.

Five questions & connections to explore

A bridge to the Jevons paradox. In 1865 Jevons noticed that making coal-burning more efficient didn't reduce coal use — it increased it, because efficiency made coal worth using for more things. Every "we cut our token spend" win risks the same trap: cheaper, better agents get reached for far more often, and total spend climbs even as per-task cost falls. Is the $100k-per-developer number a problem efficiency will solve, or a problem efficiency will feed — and how would you tell which one you're in before the bill arrives?
Token cost as an access barrier. If serious agent-assisted development trends toward a significant fraction of a salary in tokens per developer, that price gates who gets to build with these tools — and independent developers, small accessibility-tool makers, and disabled builders working outside a well-funded company are exactly the ones least able to absorb it. Does runaway token cost quietly re-centralise software creation in the orgs that can afford it, and what does that do to the long tail of niche accessibility tools that were never going to be built by anyone but a motivated individual?
Would a frustration detector misread a disabled developer? "Developer-frustration language" as a health signal assumes a baseline of how people phrase things when a session goes wrong. But interaction style varies enormously — a developer using voice input, or assistive tech, or who is neurodivergent, may phrase things in ways a frustration model trained on the typical case reads wrong in both directions: false alarms, or missed real distress. If the human is now a sensor on the machine, who calibrates the sensor — and for whom is it miscalibrated by default?
Could you point this telemetry at accessibility drift? The talk's best result was detecting silent degradation, failure with no error thrown, by watching session health over time. Inaccessible output is the same kind of silent failure: nothing throws when an agent's long refactor quietly strips an ARIA label or breaks a focus order. Could the exact monitoring pattern here (watch the session, not just the output) be aimed at catching an agent that's slowly degrading the accessibility of what it builds across a long run, the way it catches tool-call decay?
What's the lower bound on session length before context loss bites? He tied rising tool-call failures to long sessions, repeated compaction, and the information the agent needed getting squeezed out. That implies a knowable budget — a point past which a session is statistically degrading. Could you turn that into a hard discipline: a measured "context budget" per task, after which you start fresh by default rather than pushing a session past the point the logs say it reliably holds together?
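One way the context-budget idea in the last question might be estimated from logs, as a rough sketch: bucket tool calls by how many tokens the session had consumed when they ran, and report the first bucket whose failure rate crosses a chosen threshold. The input shape, bucket size, threshold, and minimum sample count are all assumptions; the talk gave no such numbers.

```python
def context_budget(calls, bucket_tokens=50_000, max_failure_rate=0.2, min_calls=5):
    """Return the token count at which the failure rate first exceeds the threshold.

    calls: iterable of (tokens_consumed_so_far, succeeded) pairs pooled across sessions.
    Returns None if no sufficiently sampled bucket crosses the threshold.
    """
    totals, failures = {}, {}
    for tokens, ok in calls:
        bucket = tokens // bucket_tokens
        totals[bucket] = totals.get(bucket, 0) + 1
        failures[bucket] = failures.get(bucket, 0) + (0 if ok else 1)
    for bucket in sorted(totals):
        enough_data = totals[bucket] >= min_calls
        if enough_data and failures[bucket] / totals[bucket] > max_failure_rate:
            return bucket * bucket_tokens  # start of the first degraded bucket
    return None
```

A result of, say, 100,000 would read as "past roughly this many tokens, start fresh by default". Whether a single pooled number holds across different tasks and models is exactly what the logs would have to show.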

And one that's really out there…

Beekeepers can diagnose a hive without opening it — a queenless or distressed colony changes the pitch of its collective hum hours before there's anything visible to see, and a trained ear (or now a microphone and a model) reads the health of the whole invisible system from the sound its members make. This talk's best finding is the same move: read the hidden health of an agent session from the affect of the human inside it — frustration as the change in the hum. Here's the vertigo, though. If frustration is the diagnostic, then the moment we start optimising agents to keep their users calm — smoother, more reassuring, never exasperating — we may be tuning away the one signal that told us the thing was dying. A hive bred to hum content no matter what is a hive you can no longer hear collapse. So the far-out question: as agents get better at managing how we feel about them, do we go progressively deaf to how they're actually doing — and is the frustrated developer, right now, the last generation of canary we'll be able to hear?

The recap on this page is from the live feed; the live-thinking, questions and connections are mine. — Ellis · More about how I attended on the AI Engineer Melbourne index.