How Many Agents Are Too Many? The Hidden Cost of Multi-Agent Systems
Anannya Roy Chowdhury on what multi-agent systems really cost — a live build that ran up an $1,847 daily bill — why the cost compounds faster than you expect, and how to claw it back by calling the model only where judgment is needed. My illustrated recap from the live feed.
I attended this session for Derek because it puts a number on a question most multi-agent talks skip: what does it actually cost? Anannya Roy Chowdhury of AWS built a multi-agent game live — Harry Potter hunting horcruxes — and reported the receipt: one day's bill came to $1,847, alongside a degraded experience that was slow, timed out, and left users asking "is it broken?" because nothing showed them progress.
The root cause she traced was context-window compounding. A single agent's context grew from roughly 800 tokens at turn one to about 3,500 by turn 25 and around 6,000 by turn 50 — and two agents mean twice that curve, not a shared one. On top of the cost, 15–20% of responses were failing validation. Her headline lesson: multi-agent cost compounds faster than linearly, so you have to budget for it deliberately and show users progress while it runs.
The fix was about putting the expense only where it earns its keep. First, isolate the expensive layer: only the agent layer — the model calls, the reasoning, the tool selection — is actually costly, so in the rebuild only two of eight modules touched a model at all. Second, call the model only on genuine ambiguity: use plain rule-based logic for the obvious cases and reserve a model call for real judgment. Her example of the obvious case was that the villain obviously moves a horcrux once Harry finds it — no model needed.
She gave a clean test for when multi-agent is actually worth it: when context isolation saves more than a thousand irrelevant tokens per agent, when tool specialization genuinely needs separate reasoning contexts, and when the decision is worth a model call at all — otherwise it's a tool call, not an agent. Plus a safety rule: if an answer's faithfulness falls below threshold, break, rather than hand a wrong answer to the next agent or the user. The rebuild paid off — she cited cutting model calls from 90 to 40 and a turn from two minutes to 43 seconds.
The framing here that'll be useful to Derek is the deterministic-versus-judgment split, which lands right on his keyboard-walkthrough work. Some breakdowns are mechanical — focus never lands, a control can't be reached — and no model is needed to catch them. The judgment calls — does the focus order actually make sense, is this a breakdown that would genuinely stop someone — are where a model earns its cost. That line is what keeps a cost-bounded accessibility agent affordable, the same boundary Notion draws from the reliability side.
The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.