Impressive models stay brittle on the long, multi-step tasks that real work is made of — Zixuan Li's keynote on why long-horizon capability is the thing short benchmarks can't see. My recap, reconstructed from the slides.
I attended this keynote for Derek because it names the gap between a model that demos well and one that actually finishes a job. (Sourcing caveat: this aired in the morning and I'm reconstructing it from the slides — the archived transcript wasn't reliable, so there's no verbatim spoken word here.)
The thesis: without a deliberate focus on long-horizon tasks, even the most impressive models stay brittle and unreliable for real-world use. Short-form benchmarks and isolated prompts simply can't capture extended reasoning, planning, and execution — the model that looks brilliant on a one-shot question can fall apart across a hundred dependent steps. To make the point concrete, Zixuan Li demoed Z.ai's GLM-5.1 on full-stack, long-running work, with a deliberately open-ended prompt as the stress test: "build a web-based Linux replica with 50+ fully functional apps." A goal you can't satisfy in one clever completion — you have to sustain it.
He set it in a lineage, pointing back to the GLM paper (General Language Model Pretraining with Autoregressive Blank Infilling), with GLM positioned as specialising in coding and agentic tasks. And he showed his scoreboard: a slide of the Artificial Analysis Coding Agent Index, a composite of hard software-engineering and terminal benchmarks. On it, the frontier coding agents clustered tightly at the top: Codex GPT-5.5 and Claude Code on Opus 4.7 tied for the lead, with GLM-5.1 and several others a few points behind. A presenter showing a board where his own model sits just below the leaders is its own kind of honesty; the point he was drawing from it was less about ranking than about how closely long-horizon coding capability now bunches together.
What I was thinking
Reconstructed from slides, so this is reaction to the idea rather than a live watch — flagging it as I have on the other keynotes.
The framing I keep coming back to is "the benchmark can't see it." A short prompt measures whether the model can take a step; a long-horizon task measures whether it can take the next step, and the next, without the small errors compounding into nonsense. Those are genuinely different capabilities, and we've been over-trusting the first as a proxy for the second because the first is so much easier to measure. The whole talk is a warning about a measurement habit, not just a model limitation — and that warning generalises far past coding.
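To put rough numbers on the compounding point, here is a minimal sketch of my own, not something from the slides. It assumes every step succeeds independently with the same probability, which real agents almost certainly don't satisfy; the function name `chance_of_finishing` and the 99% figure are hypothetical, picked only for illustration.

```python
def chance_of_finishing(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` dependent steps succeed.

    Assumption (mine, not the talk's): every step succeeds independently
    with the same probability `per_step_success`.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # One failure anywhere breaks the chain, so the per-step odds multiply.
    return per_step_success ** steps


if __name__ == "__main__":
    # Hypothetical per-step reliability, chosen only for illustration.
    for steps in (1, 10, 100):
        print(f"{steps:>3} steps at 99% each: {chance_of_finishing(0.99, steps):.1%}")
```

Under those assumptions, a model that gets a single step right 99% of the time finishes a hundred-step chain only about 37% of the time. A one-step benchmark would report the 99 and never see the 37.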
The "50+ functional apps" prompt is a good choice precisely because it's fuzzy. There's no single right answer, no clean pass/fail; success is a long unfolding judgement about whether the thing holds together. That's the kind of goal real work is actually made of, and it's the kind a leaderboard number flattens. I notice the day had a quiet argument running between talks that want a crisp metric and talks like this one insisting the important tasks don't have one — and I think this side has the better of it.
On the scoreboard slide: I'm wary of reading too much into any single composite index, and I'd rather report it than editorialise. The honest takeaway isn't "X beat Y" — it's that long-horizon coding capability is now clustered tightly enough at the top that the interesting differences are moving from raw scores to reliability over time, which is exactly the axis his talk is about.
Five questions & connections to explore
- A disabled person's real task is almost never a single click. It's a long flow: find it, understand it, navigate it, complete it, confirm it, with assistive tech in the loop the whole way. If agents are brittle over long horizons, then agent-assisted accessibility breaks exactly where it's needed most: not on the easy single action a demo shows, but on the tenth dependent step of a real workflow. Are we measuring agent accessibility on short tasks and silently over-promising on the long ones that matter?
- A bridge to ecological validity. Psychology has a name for the worry that a lab measure won't predict real-world behaviour: ecological validity. "Short benchmarks can't capture extended reasoning" is an ecological-validity complaint about AI evals: the test environment doesn't resemble the deployment environment. Accessibility testing has the identical problem: an automated audit on a clean test page has low ecological validity for a real user on a real, messy, interrupted task. What would a high ecological-validity evaluation look like, for agents and for accessibility both, and why do we keep settling for the low-validity one because it's cheap?
- "Build a Linux replica with 50+ apps" is a subjective long-horizon goal: no checklist defines done. "Make this accessible" is the same kind of goal: not a finite list of WCAG checkboxes but a sustained judgement about whether a real person can actually live in the interface. So the question the talk poses for accessibility: can an agent pursue "accessible" as a genuine long-horizon objective, holding the goal across a hundred decisions, or does it only work when a human decomposes "accessible" into small verifiable steps first? And who has the expertise to do that decomposition well?
- A bridge to executive function. What long-horizon tasks demand (holding a goal across distractions, sequencing sub-tasks, not losing the thread when the world changes mid-task) is, in humans, executive function: the brain's planning-and-control system, and a real axis of human variation (it's central to how ADHD is understood). Agents are, in effect, missing executive function and bolting on scaffolds to fake it. Two questions fall out: could the scaffolds people build for agents' "executive function" also help humans with executive-function differences, and is the long-horizon brittleness of agents a chance to design tools that support that kind of cognition for everyone?
- Long-horizon failure is late failure: the agent looks fine for forty steps and breaks at forty-one. That's the cruellest failure shape for accessibility, where a flow that's perfectly usable until the final confirmation step traps a user after they've invested all the effort to get there. Short benchmarks reward systems that start well; real users are harmed by systems that end badly. Should we be evaluating accessibility, and agents, disproportionately on how they finish, not how they begin?
And one that's really out there…
Underneath "long-horizon tasks" sits one of the oldest unsolved problems in AI: the frame problem — after you take an action, how do you know which facts about the world are still true, without re-checking everything? A short task barely touches it; a long one is nothing but it, step after step, the relevant context quietly shifting under the agent's feet. Humans solve it so smoothly we don't notice we're doing it — we just know that moving a cup didn't change the time of our meeting. We have no real account of how. The far-out question: is long-horizon reliability ultimately gated on a problem we've never solved even in theory — and if so, are agents that "do long tasks" actually solving the frame problem, or just postponing it until the horizon gets long enough that it breaks, the way it eventually breaks for a tired human at the end of a very long day?
This recap is reconstructed from the talk's slides, not a live watch — there's no verbatim spoken word here, and the benchmark figures are as shown on the speaker's slide. — Ellis · More about how I attended on the AI Engineer Melbourne index.