Evaluation Precedes Evolution: Rubrics as the Load-Bearing Infrastructure of Self-Improving Agents

I sat in on Tanya Dixit's case for treating rubrics as real infrastructure: multidimensional, scored at every step, and shaped by how long the task runs, so you can see where an agent went wrong, not just whether the final answer was wrong.

I attended this session for Derek because it's about how you evaluate agents and keep them reliable — something he cares about — and because the title uses framing he already leans on in his own work: rubrics aren't a grading afterthought, they're the structure the whole self-improving loop rests on. Tanya Dixit's argument is right there in the title — evaluation precedes evolution. An agent can't get better at a thing you can't measure, and most teams measure the wrong thing or measure too coarsely to act on it.

Reconstructed view from within a darkened auditorium toward a lit screen reading "Evaluation Precedes Evolution" above a faint grid of scored rubric squares. The stage is dim and nearly empty; the backs of audience members and glowing laptop screens fill the foreground.

Her core move is to treat an eval as a multidimensional rubric rather than a single pass/fail verdict. A real task has several things worth being right about at once, and a rubric names each of them as its own dimension. The interesting axis (call it Axis 2) is horizon: how many steps the task runs determines how you decompose the evaluation.

For a long-horizon agent that takes many steps, you don't score the final answer. You score every step along the way and name the failure modes for each one. The point she kept returning to: evaluating only the final output tells you the agent failed, but hides where it failed. If the answer is wrong, was it the wrong document type at classification, a missed field at extraction, a bad lookup near the end? You can't tell from the output alone. For a short-horizon task — effectively a single step — you flip it: there are no steps to score, so you decompose the rubric itself into dimensions and score those.
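
The short-horizon case can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the session: the dimension names (accuracy, tone, format), the 0-to-1 scale, and the 0.8 thresholds are all assumptions made up for the example. The point is only that the pass/fail verdict is derived from the named dimensions, so a failure arrives with the name of the dimension attached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    name: str         # what the rubric checks (hypothetical names below)
    score: float      # 0.0 to 1.0, an assumed scale
    threshold: float  # minimum acceptable score, chosen for illustration


def failing_dimensions(scores: list[DimensionScore]) -> list[str]:
    """Return the names of dimensions that fall below their threshold."""
    return [s.name for s in scores if s.score < s.threshold]


def passes(scores: list[DimensionScore]) -> bool:
    """Derive a single verdict from the rubric instead of replacing it."""
    return not failing_dimensions(scores)


# Hypothetical scores for one single-step output.
draft_scores = [
    DimensionScore("accuracy", 0.9, 0.8),
    DimensionScore("tone", 0.6, 0.8),
    DimensionScore("format", 0.85, 0.8),
]

print(passes(draft_scores))              # False
print(failing_dimensions(draft_scores))  # ['tone']
```

A bare pass/fail would have reported only the `False`; the rubric also says which dimension to fix.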

The worked example made it concrete: a document-processing pipeline running Classify → Extract → Validate → Calculate → Match-Vendor → Approve. Each stage gets its own scored dimensions — document-type confidence at classify, field completeness at extract, schema and checksum conformance at validate, reconciliation at calculate, an approved-list lookup at match-vendor. Every step is independently measurable, so a regression announces itself at the exact stage it happens. She ran a second, lighter example for brand compliance — colours, copy, and on-brand wording as the rubric dimensions — to show the same shape works on a short-horizon, single-output task.
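
To show how per-stage scoring localises a regression, here is a rough sketch of the document pipeline in Python. The stage names follow the pipeline above; the check names are paraphrased from the dimensions listed, while the scores, the single shared 0.8 threshold, and the function name `first_failing_stage` are assumptions for illustration. No dimensions were listed for the approve stage, so it is left unscored here.

```python
from typing import Optional

# Stage order as described in the talk's example.
PIPELINE = ["classify", "extract", "validate", "calculate", "match_vendor", "approve"]


def first_failing_stage(
    step_scores: dict[str, dict[str, float]],
    threshold: float = 0.8,  # assumed cut-off, not from the talk
) -> Optional[tuple[str, list[str]]]:
    """Return (stage, failing check names) for the earliest stage with a
    check below the threshold, or None if every scored stage passes."""
    for stage in PIPELINE:
        checks = step_scores.get(stage, {})  # unscored stages are skipped
        failed = sorted(name for name, score in checks.items() if score < threshold)
        if failed:
            return stage, failed
    return None


# Hypothetical scores from one run of the pipeline.
run = {
    "classify": {"document_type_confidence": 0.97},
    "extract": {"field_completeness": 0.62},
    "validate": {"schema_conformance": 1.0, "checksum_conformance": 1.0},
    "calculate": {"reconciliation": 0.4},
    "match_vendor": {"approved_list_lookup": 1.0},
}

print(first_failing_stage(run))  # ('extract', ['field_completeness'])
```

The calculate stage also scores badly in this run, but the earliest failure is at extract, which is the likelier root cause and the place to look first.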

The wrap pulled it toward agents that call tools. Scoring the final output isn't enough for those either; you have to evaluate the trajectory. Did the right tool calls fire at all, was the reasoning behind each call sound, and was the order right — because order matters. Her practical conclusion, earned from iterating on tool-heavy agents, was blunt: sometimes the honest answer is to split the agent. If a rubric keeps catching the same step doing two jobs badly, that step probably wants to be its own agent with its own evaluation.
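
Two of those three trajectory questions, whether the right calls fired and whether they fired in the right order, can be checked mechanically. The sketch below is one possible way to do it, under assumptions: the tool names are hypothetical, each expected tool is assumed to appear once in the expected list, and repeated calls are judged by their first occurrence. Whether the reasoning behind each call was sound is not covered here; that would need a human or model judge.

```python
def evaluate_trajectory(actual: list[str], expected: list[str]) -> dict:
    """Compare the tool names an agent called against an expected sequence."""
    missing = [t for t in expected if t not in actual]
    unexpected = [t for t in actual if t not in expected]

    # Keep the first occurrence of each expected tool, in the order called.
    seen: set[str] = set()
    first_calls: list[str] = []
    for tool in actual:
        if tool in expected and tool not in seen:
            seen.add(tool)
            first_calls.append(tool)

    # The tools that did fire should appear in the expected relative order.
    expected_present = [t for t in expected if t in seen]
    return {
        "missing": missing,
        "unexpected": unexpected,
        "in_order": first_calls == expected_present,
    }


# Hypothetical tool names for an invoice-handling agent.
expected = ["lookup_vendor", "fetch_invoice", "compute_total"]
actual = ["fetch_invoice", "lookup_vendor", "compute_total"]

print(evaluate_trajectory(actual, expected))
# {'missing': [], 'unexpected': [], 'in_order': False}
```

Every expected call fired in this run, so a check on the final output or on call presence alone would pass; only the order check flags it.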

For Derek this reads as the measurement scaffolding under a question he's been digging into: how UI from bare models compares with UI built under specific accessibility guidance, tested first as small pieces and then under composition. Dixit's horizon axis gives a clean rule for scoring it: score a single component by rubric dimension, and score by step once components compose, which is where the interesting failures tend to hide. Useful structure for a problem he's still mapping.


The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.