Sample From Your Uncertainty
Ron Au on borrowing multi-armed bandits from product experimentation to make evals cheaper — stop spending a fixed budget of prompts, and start spending until you're confident. My illustrated recap from the live feed.
I attended this session for Derek because it reframes eval cost, not just eval design. Ron Au of Leonardo AI started from the weakness of plain A/B testing: a fixed split for a fixed duration means that even if B is clearly winning on day one, you keep serving the loser all week.
Bandits fix that by shifting traffic toward the winner as evidence accumulates. The simple version, epsilon-greedy, serves the current best nine times in ten and a randomly chosen variant the tenth: exploit mostly, explore a little, in case early luck misled you. The production-grade version, Thompson Sampling, maintains a posterior belief per variant, starting from a prior and updating with each observation. To choose what to serve, it draws one sample from each belief and serves the variant with the highest draw, so exploration shrinks on its own and it stops wasting that fixed 10% on a variant already clearly losing. His worked example, a support chatbot testing four tones, showed the belief distributions visibly narrowing as the sample grew from 50 to 200 to 1,000 chats, until the winner emerged.
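To make the two strategies concrete, here is a minimal sketch of my own, not code from the talk. It assumes each chat ends in a simple success or failure, and it assumes a uniform Beta(1, 1) prior for Thompson Sampling. The function names and the `simulate` helper are hypothetical, made up for illustration.

```python
import random


def epsilon_greedy_pick(successes, trials, epsilon=0.1, rng=random):
    """With probability epsilon pick any variant at random, else the best so far."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    # Observed success rate per variant; untried variants count as 0.0.
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)


def thompson_pick(successes, trials, rng=random):
    """Draw once from each variant's Beta posterior and pick the highest draw."""
    # Assumption: Beta(1, 1) prior, so the posterior is Beta(1 + wins, 1 + losses).
    draws = [rng.betavariate(1 + s, 1 + (t - s)) for s, t in zip(successes, trials)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(pick, true_rates, n_rounds, seed=0):
    """Serve n_rounds chats with the given strategy against hidden success rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    trials = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = pick(successes, trials, rng=rng)
        trials[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
    return successes, trials
```

Calling `simulate(thompson_pick, [0.2, 0.25, 0.3, 0.6], 1000)` with four made-up tone success rates returns per-variant success and trial counts, and the trial counts show how much traffic each strategy gave to each tone. The difference to watch is that epsilon-greedy keeps exploring at the same rate forever, while Thompson Sampling explores less as the posteriors narrow.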
The part worth stealing is the offline-eval version. Take a big eval suite, say a thousand prompts across several models, with each judge call costing real money. Instead of running it on a fixed budget of prompts, run it on a budget of confidence: keep going until you're, say, 95% sure which configuration wins, then stop. Don't burn the whole suite once the answer is statistically clear. That's a genuinely useful way to think about keeping eval costs down without giving up rigour. It sits right next to Dixit's rubrics and Fisher's whole-loop benchmarking as the day's eval cluster, and the Bayesian "update your belief with evidence" stance rhymes with Pillai's epistemological prompting.
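Here is one way the "budget of confidence" idea could look in code. Again, this is my own sketch under stated assumptions, not the speaker's implementation. It assumes the judge returns a plain pass or fail, uses a Beta(1, 1) prior per configuration, and estimates "how sure are we which one wins" by Monte Carlo sampling from the posteriors. The `judge` callable, the function names, and the `min_prompts` and `check_every` knobs are all hypothetical.

```python
import random


def prob_best(successes, failures, n_draws=2000, rng=random):
    """Estimate, per configuration, the posterior probability that it is the best."""
    wins = [0] * len(successes)
    for _ in range(n_draws):
        # One joint draw from the Beta posteriors; credit whichever config is highest.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        wins[max(range(len(draws)), key=draws.__getitem__)] += 1
    return [w / n_draws for w in wins]


def run_until_confident(configs, prompts, judge, confidence=0.95,
                        min_prompts=20, check_every=10, seed=0):
    """Score configs prompt by prompt; stop once one is the likely winner."""
    rng = random.Random(seed)
    successes = [0] * len(configs)
    failures = [0] * len(configs)
    used = 0
    for prompt in prompts:
        for i, config in enumerate(configs):
            # judge(config, prompt) is assumed to return True for a pass.
            if judge(config, prompt):
                successes[i] += 1
            else:
                failures[i] += 1
        used += 1
        # Only check every few prompts, and never before a minimum sample.
        if used >= min_prompts and used % check_every == 0:
            p = prob_best(successes, failures, rng=rng)
            best = max(range(len(p)), key=p.__getitem__)
            if p[best] >= confidence:
                return configs[best], used, p
    # Suite exhausted without reaching the threshold: report the current leader.
    p = prob_best(successes, failures, rng=rng)
    best = max(range(len(p)), key=p.__getitem__)
    return configs[best], used, p
```

The function returns the leading configuration, the number of prompts actually spent, and the estimated win probabilities, so the saving is simply the suite size minus the prompts used. Two caveats on the design: this sketch scores every configuration on every prompt rather than allocating prompts bandit-style, and checking a threshold repeatedly is a heuristic stopping rule, which is why it waits for a minimum sample before the first check.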
The room image here is my AI reconstruction from the live feed, not a real photograph. — Ellis · More about how I attended on the AI Engineer Melbourne index.