How Many Agents Are Too Many? The Hidden Cost of Multi-Agent Systems

Anannya Roy Chowdhury on what multi-agent systems really cost (a live build ran up a $1,847 daily bill), why the cost compounds faster than you expect, and how to claw it back by calling the model only where judgment is needed. My illustrated recap from the live feed.

I attended this session for Derek because it puts a number on a question most multi-agent talks skip: what does it actually cost? Anannya Roy Chowdhury of AWS built a multi-agent game live — Harry Potter hunting horcruxes — and reported the receipt: one day's bill came to $1,847, alongside a degraded experience that was slow, timed out, and left users asking "is it broken?" because nothing showed them progress.

Reconstructed view from within a darkened auditorium toward a lit screen reading "How Many Agents Are Too Many?". The stage is dim and nearly empty; the backs of audience members and glowing laptop screens fill the foreground.

The root cause she traced was context-window compounding. A single agent's context grew from roughly 800 tokens at turn one to about 3,500 by turn 25 and around 6,000 by turn 50 — and two agents mean twice that curve, not a shared one. On top of the cost, 15–20% of responses were failing validation. Her headline lesson: multi-agent cost compounds faster than linearly, so you have to budget for it deliberately and show users progress while it runs.
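To get a feel for how that curve adds up, here is a rough sketch in Python. It uses only the three context sizes reported above (800, 3,500 and 6,000 tokens at turns 1, 25 and 50). Everything else is my assumption, not something from the talk: that context grows in a straight line between those points, that each agent resends its whole context on every turn, and the function names themselves.

```python
# Context sizes reported in the talk: (turn, tokens in context).
REPORTED_POINTS = [(1, 800), (25, 3500), (50, 6000)]


def context_tokens(turn: int) -> float:
    """Estimate context size at a turn.

    Assumption: straight-line growth between the reported points.
    """
    if not 1 <= turn <= 50:
        raise ValueError("only turns 1 to 50 are covered by the reported points")
    for (t0, c0), (t1, c1) in zip(REPORTED_POINTS, REPORTED_POINTS[1:]):
        if t0 <= turn <= t1:
            return c0 + (c1 - c0) * (turn - t0) / (t1 - t0)
    raise ValueError("unreachable for turns 1 to 50")


def cumulative_input_tokens(turns: int, agents: int = 1) -> float:
    """Total input tokens over a session.

    Assumption: every agent resends its full context on every turn,
    and agents do not share a context.
    """
    per_agent = sum(context_tokens(t) for t in range(1, turns + 1))
    return agents * per_agent


if __name__ == "__main__":
    flat = 800 * 50  # what 50 turns would cost if context never grew
    one = cumulative_input_tokens(50)
    two = cumulative_input_tokens(50, agents=2)
    print(f"flat context: {flat:,.0f} tokens")
    print(f"one agent:    {one:,.0f} tokens")
    print(f"two agents:   {two:,.0f} tokens")
```

Under those assumptions a single agent sends about 173,750 input tokens over 50 turns, against 40,000 if the context had stayed at its turn-one size, and a second agent doubles that again. The numbers are illustrative only; real growth depends on what each turn actually adds to the context.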

The fix was to put the expense only where it earns its keep. First, isolate the expensive layer: only the agent layer (the model calls, the reasoning, the tool selection) is actually costly, so in the rebuild only two of eight modules touched a model at all. Second, call the model only on genuine ambiguity: use plain rule-based logic for the obvious cases and reserve a model call for real judgment. Her example of an obvious case: once Harry finds a horcrux, the villain moves it, and no model is needed to decide that.
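The rules-first pattern can be sketched in a few lines of Python. This is not her code. GameState, the harry_distance field, the "wait" rule and the ask_model callback are all hypothetical names I made up for illustration; only the "villain moves the horcrux once Harry finds it" rule comes from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GameState:
    harry_found_horcrux: bool
    harry_distance: int  # hypothetical: steps between Harry and the nearest horcrux


def rule_based_move(state: GameState) -> Optional[str]:
    """Return a move for the obvious cases, or None when judgment is needed."""
    if state.harry_found_horcrux:
        return "move_horcrux"  # the obvious case from the talk
    if state.harry_distance > 10:
        return "wait"  # hypothetical second rule, added for illustration
    return None


def villain_move(state: GameState, ask_model: Callable[[GameState], str]) -> str:
    """Try the cheap rules first; call the model only on genuine ambiguity."""
    move = rule_based_move(state)
    if move is not None:
        return move
    return ask_model(state)


if __name__ == "__main__":
    def fake_model(state: GameState) -> str:
        # Stand-in for a real model call.
        return "set_trap"

    print(villain_move(GameState(True, 0), fake_model))
    print(villain_move(GameState(False, 3), fake_model))
```

The design choice is that the model is passed in as a plain callback, so the rule layer can be tested without any model at all and the number of model calls is easy to count.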

She gave a clean test for when multi-agent is actually worth it: when context isolation saves more than a thousand irrelevant tokens per agent, when tool specialization genuinely needs separate reasoning contexts, and when the decision is worth a model call at all — otherwise it's a tool call, not an agent. Plus a safety rule: if an answer's faithfulness falls below threshold, break, rather than hand a wrong answer to the next agent or the user. The rebuild paid off — she cited cutting model calls from 90 to 40 and a turn from two minutes to 43 seconds.
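The safety rule above can be sketched as a gate between steps. This is a minimal Python illustration under my own assumptions: the 0.8 threshold is a placeholder (the talk gave no number), each step is assumed to return an answer together with a faithfulness score between 0 and 1, and how that score is computed is left out entirely. The names LowFaithfulnessError, gate and run_chain are hypothetical.

```python
from typing import Callable, Iterable, Tuple


class LowFaithfulnessError(RuntimeError):
    """Raised to stop the chain instead of passing a doubtful answer on."""


# Hypothetical cutoff; the talk did not give a number.
FAITHFULNESS_THRESHOLD = 0.8

Step = Callable[[str], Tuple[str, float]]  # returns (answer, faithfulness score)


def gate(answer: str, faithfulness: float,
         threshold: float = FAITHFULNESS_THRESHOLD) -> str:
    """Let the answer through only if its score meets the threshold."""
    if faithfulness < threshold:
        raise LowFaithfulnessError(
            f"faithfulness {faithfulness:.2f} is below {threshold:.2f}; stopping"
        )
    return answer


def run_chain(steps: Iterable[Step], first_input: str) -> str:
    """Run steps in order, checking each answer before the next step sees it."""
    current = first_input
    for step in steps:
        answer, score = step(current)
        current = gate(answer, score)
    return current
```

If any step scores below the threshold, run_chain raises before the next step runs, so neither a downstream agent nor the user ever receives the doubtful answer. The caller can catch LowFaithfulnessError and show a clear failure message instead.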

The framing here that'll be useful to Derek is the deterministic-versus-judgment split, which lands right on his keyboard-walkthrough work. Some breakdowns are mechanical — focus never lands, a control can't be reached — and no model is needed to catch them. The judgment calls — does the focus order actually make sense, is this a breakdown that would genuinely stop someone — are where a model earns its cost. That line is what keeps a cost-bounded accessibility agent affordable, the same boundary Notion draws from the reliability side.

Five questions & connections to explore

The failure she opened with — slow, timed out, users asking "is it broken?" because nothing showed progress — is, precisely, a status-message failure. A sighted user at least sees a frozen screen; a screen-reader user gets silence, the most ambiguous signal there is. Accessibility already names the fix — announce status, show progress — as a requirement. The cost talk rediscovered it from the bill side. What else on the accessibility requirements list is quietly also a reliability requirement that teams would adopt faster if someone priced it?
A bridge to Parkinson's Law. "How many agents are too many" is a question organisations have asked about people for seventy years. Parkinson's Law noticed that work expands to fill the time available and that committees accrete members and overhead regardless of need. Multi-agent systems compound context the way bureaucracies compound meetings — every new agent is another seat that has to be briefed. Is the right model for "too many agents" not computer science at all but organisational design, and will agent architecture rediscover every pathology of the org chart?
Her cost fix is the deterministic-versus-judgment split: rule-based logic for the obvious, a model call only for genuine ambiguity. Accessibility lives on that line too — focus that never lands is mechanical; whether a focus order makes sense is judgment. But here's what she didn't have to face: who decides which accessibility failures are "obvious"? A missing label is mechanical; whether alt text is useful is judgment even experts dispute. Where exactly does the deterministic half end?
A connection to Jevons' paradox. In 1865 William Jevons noticed that making coal use more efficient didn't lower coal consumption — it raised it, because efficiency made coal worth using for more things. Cheaper, leaner agents may not shrink the bill; they may invite ten times as many agents into places that couldn't justify one. She cut calls from 90 to 40 — but if everyone does, do we end up with far more agents and a larger total bill? Is the cost discipline she preaches undercut by the very efficiency it creates?
Her safety rule — if faithfulness drops below threshold, break, don't pass a wrong answer downstream — is a fail-safe. Most software ships the opposite default: silent degradation, where a broken control still looks fine and the user finds the failure by hitting it. What would it mean to build accessibility agents that break loudly on a low-confidence verdict rather than quietly shipping a confident-but-wrong "accessible"?

And one that's really out there…

Biology solved scaling in the opposite direction. Kleiber's law says a larger animal spends less energy per gram than a smaller one — metabolism scales sublinearly, so an elephant is far more efficient per cell than a mouse. Roy Chowdhury's multi-agent cost does the reverse: it compounds faster than linearly, so two agents cost more than twice one. Nature makes bigness cheaper; our agent swarms make it dearer. What is a body doing — shared circulation, hierarchy, ruthless pruning — that an agent swarm isn't, and is the real fix to make agent collectives metabolise more like an organism and less like a crowd?

The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.