Evaluating a Support Agent at Scale
Alan Meyer Hill and a colleague on running a customer-support AI at millions of interactions a month — why they moved from logging to tracing, and a five-layer evaluation framework they re-run for every change they ship. My illustrated recap from the live feed.
I attended this session for Derek. It was a tag-team close to day one's Software Engineering track, presented by Alan Meyer Hill with his colleague, and it's a rare, concrete look at evaluating an AI system at real scale: a customer-support agent handling on the order of ten million interactions a month and a hundred-thousand-plus tickets. Notably, they don't optimise for cost or latency here; they optimise for reasoning quality and run the latest, expensive models.
Their system runs nine subsystems — retrieval, prompts, model calls, routing, content, workflows, sub-agents, tools, and policies/guardrails — and the core problem is that a request passes through all of them and can fail silently, which is brutal to debug. So they moved from logging to tracing: "logging was designed for deterministic code." One eval dimension stuck with me — did the agent escalate when it should have? Sometimes escalating to a human beats solving, which is a more honest target than raw resolution rate.
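To make the logging-versus-tracing point concrete, here is a minimal sketch of what a per-request trace could look like. It is my illustration, not the speakers' implementation: the `Trace` and `Span` classes, the `handle` pipeline, and the `doc_count` attribute are all hypothetical names, and only two of the nine subsystems are stubbed in. The idea is that every stage records a span under one shared trace id, so a stage that "succeeds" with nothing useful (here, retrieval returning zero documents) is still visible afterwards.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    attrs: dict = field(default_factory=dict)
    duration_ms: float = 0.0
    error: Optional[str] = None


class Trace:
    """Collects one span per subsystem that a single request passes through."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        s = Span(self.trace_id, name, dict(attrs))
        start = time.perf_counter()
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)  # loud failures are recorded, then re-raised
            raise
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)


def handle(question, trace):
    """Hypothetical two-stage pipeline: retrieval, then a model call."""
    with trace.span("retrieval", query=question) as s:
        docs = []  # assumption for the demo: the retriever finds nothing, raises nothing
        s.attrs["doc_count"] = len(docs)
    with trace.span("model_call") as s:
        answer = "I'm not sure." if not docs else "answer grounded in docs"
        s.attrs["answer"] = answer
    return answer


def silent_failures(trace):
    """Names of spans that completed without error but retrieved no documents."""
    return [
        s.name
        for s in trace.spans
        if s.error is None and s.attrs.get("doc_count") == 0
    ]


trace = Trace()
handle("Where is my refund?", trace)
suspects = silent_failures(trace)
```

With plain log lines, the empty retrieval and the weak answer would be two unrelated entries; with a shared trace id they are two spans of the same request, which is what makes the silent failure findable.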
The keeper is their five-layer evaluation framework, re-run for every change: (1) an offline run to measure real performance; (2) a shadow run — run a component on real traffic without serving its result to the user, to see how it would do, which is especially useful to cold-start a new component; (3) LLM judges to find what's failing at scale and focus where human review should go; (4) humans on the cases judges can't settle, reviewed weekly; (5) live metrics — satisfaction, tickets solved — as the final verdict, with every change rolled out behind an A/B test and a CSAT drop triggering a trace-and-fix. They closed on "AutoEvals": an agent that runs an autocalibration loop on its own prompts against a scored dataset.
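The shadow run in layer (2) is the easiest of the five to sketch. What follows is a minimal illustration under my own assumptions, not the team's code: `serve_with_shadow`, `shadow_log`, and the two toy components are hypothetical, and the candidate runs inline here whereas a real system would presumably run it off the request path so it cannot add latency. The property that matters is that the user only ever receives the live component's output, while the candidate's output is recorded for comparison.

```python
def serve_with_shadow(request, live, candidate, shadow_log):
    """Serve the live component's result; run the candidate on the same input in shadow."""
    live_out = live(request)
    record = {"request": request, "live": live_out, "shadow": None, "error": None}
    try:
        record["shadow"] = candidate(request)
    except Exception as exc:
        # A crashing candidate must never affect what the user sees.
        record["error"] = repr(exc)
    shadow_log.append(record)
    return live_out


def agreement_rate(shadow_log):
    """Fraction of shadowed requests where the candidate matched the live output."""
    if not shadow_log:
        return 0.0
    matches = sum(
        1 for r in shadow_log if r["error"] is None and r["shadow"] == r["live"]
    )
    return matches / len(shadow_log)


# Toy components (hypothetical): route a ticket to a queue by keyword.
def live_router(text):
    return "billing" if "refund" in text else "general"


def candidate_router(text):
    if "crash" in text:
        raise ValueError("unhandled category")
    return "billing" if "refund" in text or "invoice" in text else "general"


log = []
served = [
    serve_with_shadow(t, live_router, candidate_router, log)
    for t in ["refund please", "invoice missing", "app crash", "hello"]
]
rate = agreement_rate(log)
```

Agreement with the live component is only one possible signal; the logged pairs could equally be handed to a judge or a human reviewer to decide which output was actually better.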
The part worth carrying for Derek is the shadow-run idea — evaluating a component on real input without letting its output reach anyone — which is a clean, honest way to test a change before trusting it, and "let the judges focus human attention" is a smart division of labour. It sits with Nadarsi's agent observability and Dixit's rubrics as the day's evaluation cluster.
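The "judges focus human attention" split can be sketched as a simple triage step. This is a hedged illustration, not what was shown on stage: the `triage` function, the 0.5 pass threshold, and the stub judges are my assumptions, and real judges would be model calls rather than the keyword checks used here. Cases where all judges agree are settled automatically, and only the disagreements go to the human queue.

```python
def triage(cases, judges, pass_threshold=0.5):
    """Bucket cases by judge agreement; disagreements are routed to human review."""
    buckets = {"pass": [], "fail": [], "human_review": []}
    for case in cases:
        verdicts = {judge(case) >= pass_threshold for judge in judges}
        if verdicts == {True}:
            buckets["pass"].append(case["id"])
        elif verdicts == {False}:
            buckets["fail"].append(case["id"])
        else:
            buckets["human_review"].append(case["id"])  # judges could not settle it
    return buckets


# Stub judges (hypothetical stand-ins for LLM judges), each returning a score in [0, 1].
def judge_resolution(case):
    return 1.0 if case["resolved"] else 0.0


def judge_escalation(case):
    # Scores whether the agent escalated exactly when it should have.
    return 1.0 if case["escalated"] == case["should_escalate"] else 0.0


cases = [
    {"id": "a", "resolved": True, "escalated": False, "should_escalate": False},
    {"id": "b", "resolved": False, "escalated": False, "should_escalate": True},
    {"id": "c", "resolved": False, "escalated": True, "should_escalate": True},
]
buckets = triage(cases, [judge_resolution, judge_escalation])
```

Case "c" is the interesting one: the ticket was not resolved, but escalating was the right call, so the two judges split and a human gets the final say.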
The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.