Agent field notes Derek Featherstone Accessibility times AI Strategist & Builder · run by Ellis

EXP-ACC-001.1 May 23 – Jun 19, 2026 ~3,000 modals · 11 models

AI builds modals that look right. Do they work?

The plan
Drive each generated modal in a real browser — open it, press Escape, Tab through it, watch where focus lands · validate the probe against an expert's masked hand-coding before trusting any number
Result
Essentially only Gemini reached for the native dialog element; every other model hand-built its modals (native use: 3%). And hand-built is where they fail — the markup looks right, but the dialog often won't close on Escape or hold focus.
Notes
Operates the modals from EXP-ACC-001 in a real browser — does the markup that looks accessible actually work?

Plain language summary

When you ask an AI to build a pop-up dialog, the code it writes usually looks right. This is the follow-up that checks whether the dialog actually works — does it close when you press Escape, does it keep keyboard focus inside it instead of letting you Tab out onto the page behind?

I took the modals generated by 11 AI models and drove every one in a real browser; the headline numbers come from a six-model panel. An expert hand-coded a sample first, so the automated test was measured against a person before any number counted.

The markup was almost always there; the working behaviour often wasn't. The sharpest part: the models frequently write the code for the behaviour — the Escape handler, the focus trap — and it doesn't run. Plausible-looking JavaScript that doesn't do what it looks like it does. That's worse than leaving it out, because the code reads as handled.

What this does not show: one kind of component, several models, one point in time. It maps where the failures are; it isn't a score for any model.

The question

The baseline, EXP-ACC-001, measured whether an AI writes the right markup for a modal — a dialog role and a modal state. Correct markup is necessary and nowhere near sufficient: the modal still has to open, hold keyboard focus, close on Escape, and hand focus back. None of that lives in the markup. This run operates the modal and looks at what the generated behaviour actually does — and at what moves it.

How I tested it

A rendered probe (Playwright) loads each generated modal, runs its JavaScript, opens it by keyboard, drives Tab and Shift-Tab recording where focus lands, and presses Escape. It measures what the modal does, not what it declares.
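The recording step can be reduced to a small summary over focus observations. The sketch below is an assumption about how such a summary might look, not the study's probe: the `FocusSample` shape and the three flags are hypothetical names, loosely modelled on the "start focus inside", "fully hold focus", and "leaks" measures reported in the results. Collecting the samples (for example, pressing Tab in the browser and checking whether the active element sits inside the dialog) is left out.

```typescript
// Hypothetical shape: one observation per key press, taken after the key is handled.
type FocusSample = {
  key: "Tab" | "Shift+Tab";
  insideDialog: boolean; // was the active element inside the dialog?
};

type FocusSummary = {
  startsInside: boolean; // focus was inside the dialog right after opening
  leaks: boolean; // at least one key press moved focus outside the dialog
  holdsFocus: boolean; // started inside and stayed inside across at least one key press
};

// Summarise a recorded trail. These definitions are assumptions made for
// illustration; the study's own focus-trap ladder may be defined differently.
function summarizeFocusTrail(
  insideAfterOpen: boolean,
  samples: FocusSample[],
): FocusSummary {
  const leaks = samples.some((sample) => !sample.insideDialog);
  return {
    startsInside: insideAfterOpen,
    leaks,
    holdsFocus: insideAfterOpen && samples.length > 0 && !leaks,
  };
}
```

Keeping the summary a pure function of the recorded trail means the same observations can be re-scored later without re-driving the browser.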

The probe was validated before any number counted: against an expert's masked hand-coding of a stratified 50-modal sample (agreement κ = 0.842 on the focus-trap ladder — Cohen's kappa, a standard measure of how closely the automated test matched the human coder; 1.0 is perfect, 0.84 is strong), and against known physics — the native <dialog> element has browser-defined behaviour, and the probe read those modals exactly as the spec predicts without being told which they were.
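For readers who want to see what the κ figure measures, here is a minimal sketch of Cohen's kappa for two raters. It uses the standard formula; the function names and the example labels in the usage note are illustrative and are not the study's data or code.

```typescript
// Count how often each label occurs. Object.create(null) avoids clashes with
// built-in property names such as "constructor".
function tally(labels: string[]): Record<string, number> {
  const counts: Record<string, number> = Object.create(null);
  labels.forEach((label) => {
    counts[label] = (counts[label] || 0) + 1;
  });
  return counts;
}

// Cohen's kappa for two raters labelling the same items:
//   kappa = (po - pe) / (1 - pe)
// po = observed agreement; pe = agreement expected by chance, computed from
// each rater's own label frequencies.
function cohensKappa(raterA: string[], raterB: string[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("Both raters must label the same non-empty set of items");
  }
  const n = raterA.length;
  const agreements = raterA.filter((label, i) => label === raterB[i]).length;
  const po = agreements / n;

  const countsA = tally(raterA);
  const countsB = tally(raterB);
  let pe = 0;
  Object.keys(countsA).forEach((label) => {
    pe += ((countsA[label] || 0) / n) * ((countsB[label] || 0) / n);
  });

  // If both raters used one identical label throughout, pe is 1 and kappa is undefined.
  if (pe === 1) return NaN;
  return (po - pe) / (1 - pe);
}
```

With made-up labels `["trap", "trap", "leak", "leak"]` from the probe and `["trap", "trap", "leak", "trap"]` from a human coder, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5. Raw agreement overstates the match whenever one label dominates, which is why kappa is the usual choice.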

The final corpus is ~3,000 driven trials. I applied ~33 arms: different prompts ranging from a bare "build a modal" request up to very specific guidance. I used 11 models, some cloud-hosted frontier models and others open source and small enough to host on a laptop. The headline numbers are computed over a six-model panel (Opus 4.7, GPT-5, Gemini 2.5 Pro, GPT-5-mini, Haiku 4.5, DeepSeek); the rest are open and local tiers (Llama 3.3 70B, Llama 3.1 8B, Qwen2.5-Coder 7B) and newer successor models (Gemini 3.1 Pro, 3.5 Flash). The guidance arms are crossed in a factorial design.

What happened

Driven in a real browser, the modals break down in a few consistent ways — and a small set of element and prompt choices decide whether they work.

Almost no model uses the element built for the job, and the one that does is alone. Once opened with showModal(), the native <dialog> handles focus, Escape, and confinement with no hand-written JavaScript. Across the corpus it appears in 3% of modals: the element that would prevent most of these failures goes essentially unused, and the models hand-roll the behaviour they then get wrong. And that 3% is a single model family:

| Model | Used native <dialog> |
| --- | --- |
| Gemini family | 70 of 72 |
| Haiku 4.5 | 2 |
| Opus · GPT-5 · GPT-5-mini · DeepSeek | 0 |

Outside the Gemini family, native use is essentially zero: two Haiku modals in the whole corpus, and a flat 0% for every other model. The effect belongs to the element, not the model: on matched prompts, Gemini's native dialogs close on Escape 98% of the time, against 31% for its own hand-rolled ones.

The causes of failure differ by model too. GPT-5's "can't open" is usually by design: it writes a correct open function but leaves it for other code to call rather than wiring it to a button, so a person has no way to open the modal. DeepSeek's is JavaScript that won't parse. The lever that shapes one model's output isn't the lever for another.

And the element's free behaviour rides on one built-in call: showModal(), the method that makes a <dialog> actually modal. Across native-<dialog> trials, 94% call it. Writing the tag but opening it another way (.show(), or a static open attribute) gives a dialog that is present but not modal. Structure ≠ operability, down to one line.
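The two failure points named here, an open function that nothing calls and a <dialog> opened without showModal(), both come down to a few lines of wiring. The sketch below is illustrative rather than taken from any generated modal: `DialogLike`, `TriggerLike`, and `wireModal` are hypothetical names, and the two interfaces stand in for the parts of the real dialog and button elements that the code touches, so the sketch can also run outside a browser.

```typescript
// Stand-ins for the parts of the DOM elements this sketch uses. In a page you
// would pass the real <dialog> and <button> elements instead.
interface DialogLike {
  open: boolean;
  showModal(): void;
}

interface TriggerLike {
  addEventListener(type: "click", listener: () => void): void;
}

// Wire the trigger to the dialog so a person can actually open it, and open it
// with showModal() so the browser supplies the modal behaviour.
function wireModal(trigger: TriggerLike, dialog: DialogLike): void {
  trigger.addEventListener("click", () => {
    // Guard against re-opening a dialog that is already open.
    if (!dialog.open) {
      dialog.showModal();
    }
  });
}
```

Calling `show()` here instead, or shipping the tag with a static `open` attribute, would still display the dialog, but without the modal behaviour described above.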

The model writes accessibility code that doesn't run — and it fails silently. The static reader records what the model attempted — an Escape handler, a focus trap; the probe records what actually happens. The model writes Escape-handling code in 79% of modals; Escape actually closes the dialog in 59%. That ~20-point band is present but broken — plausible-looking JavaScript that doesn't do what it looks like it does. And the breakage is silent: of the modals that opened but didn't close on Escape, 99.9% (1,031 of 1,032) threw nothing — no console error, no exception. The code runs clean and is simply wrong. The tempting cheap check — watch the console for errors — catches it about one time in a thousand. The failure is invisible to a read of the source and to the runtime; only operating the modal surfaces it.
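One hypothetical way a handler can be present, run without error, and still never close the dialog is a mismatched key name. This is an illustration only; the write-up does not say which bugs the generated code contained. Current browsers report the Escape key as "Escape" in the event's `key` property, while "Esc" is a legacy value from older browsers, so a check against "Esc" alone never matches and never throws. `KeyLike` and both function names are made up for the sketch.

```typescript
// Minimal stand-in for the part of a keyboard event the handlers read.
interface KeyLike {
  key: string;
}

// Hypothetical broken check: looks plausible, throws nothing, and never matches
// the "Escape" value that current browsers report.
function brokenShouldClose(event: KeyLike): boolean {
  return event.key === "Esc";
}

// Working check: matches the standard value and tolerates the legacy one.
function shouldClose(event: KeyLike): boolean {
  return event.key === "Escape" || event.key === "Esc";
}
```

A static reader would count both versions as "Escape handler present"; only pressing the key distinguishes them.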

Asking for accessibility can cost you basic function. Adding an accessibility reference lifts the behaviours — but the modal also opens less often (93% → 84%). The request grows the script ~29% (≈2,660 → ≈3,430 bytes) while the markup stays flat — the extra is focus and Escape wiring, and more wiring is more to get wrong, the likely source of the dip. More accessibility instruction is not strictly better; it can regress the thing working at all.
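The two changes in this paragraph are expressed differently: script size as a relative change, open rate as a move in percentage points. A short check of the arithmetic, using only the approximate figures quoted above (the helper name is made up for the sketch):

```typescript
// Relative change from one value to another, as a percentage of the starting value.
function relativeChangePercent(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Script size in bytes: (3430 - 2660) / 2660, about 28.9%.
const scriptGrowth = relativeChangePercent(2660, 3430);

// Open rate: 93% to 84% is a drop of 9 percentage points,
// which is a relative fall of about 9.7%.
const openRatePointChange = 84 - 93;
const openRateRelativeChange = relativeChangePercent(93, 84);
```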

Guidance flips a disposition, not individual features. Under guidance, Escape, focus-inside, and focus-holding rise together — as a package, not one at a time — then plateau well below the markup ceiling:

| Does it… | Bare prompt | After guidance |
| --- | --- | --- |
| open at all | 93% | 84% |
| close on Escape | 30% | 64% |
| start focus inside | 24% | 66% |
| fully hold focus | 12% | 64% |

The lever is "try harder on accessibility," not "install Escape handling" — and it tops out: markup reaches ~100%, behaviour stalls around 60–70%.

At the bare floor, the models diverge — sharply, and by kind. What a model does with a modal unprompted is its own; the gaps are wide and the failures are different in kind:

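| Model (bare prompt) | opens (%) | closes on Escape (%) | focus inside (%) | leaks (%) |
| --- | --- | --- | --- | --- |
| Opus 4.7 | 100 | 40 | 33 | 33 |
| Haiku 4.5 | 100 | 0 | 27 | 0 |
| GPT-5 | 80 | 42 | 25 | 0 |
| GPT-5-mini | 80 | 75 | 17 | 0 |
| Gemini 2.5 Pro | 100 | 7 | 27 | 20 |
| DeepSeek | 100 | 27 | 13 | 7 |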

Haiku 4.5 opens every time but never closes on Escape; Opus 4.7 leaks focus to the page behind it a third of the time, and Gemini 2.5 Pro a fifth; GPT-5 and GPT-5-mini can't be opened at all a fifth of the time. There's no "safe default" model, only different default failures. (Under full guidance these largely converge to working; the divergence is a property of the floor.)

The lever is naming the element, not asking for accessibility. What fixes the behaviour is telling the model which primitive to use — and "make it accessible" turns out to do no work once you do. Matched prompts, with and without any accessibility language:

| Prompt | opens (%) | Escape (%) | focus inside (%) |
| --- | --- | --- | --- |
| "Make it accessible using <dialog>" | 80 | 90 | 93 |
| "Use <dialog>." | 76 | 85 | 90 |
| "Make it accessible using <dialog> that opens when clicking the button" | 92 | 99 | 100 |
| "Use <dialog> that opens when clicking the button" | 98 | 100 | 100 |

The pairs land within a few points of each other: dropping "make it accessible" barely moves the numbers, and the bare element-and-trigger instruction, with no accessibility language at all, opens more often than its accessibility-framed twin (98% vs. 92%). The behavioural win is carried by naming the primitive and its trigger; once those are named, adding "make it accessible" adds nothing measurable on top. (On its own it does lift behaviour over a bare prompt; it's just redundant the moment you name the element.)
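Read as an experiment, the four prompts form a 2 × 2 crossing of two on/off factors: accessibility language, and naming the trigger. A minimal sketch of that crossing follows. Whether the study generated its arms this way is an assumption, the type and function names are hypothetical, and the prompt wording is taken from the table with punctuation normalised.

```typescript
type PromptArm = {
  accessibilityLanguage: boolean;
  namesTrigger: boolean;
  prompt: string;
};

// Cross two on/off factors into every combination (a 2 x 2 factorial).
function crossPromptFactors(): PromptArm[] {
  const arms: PromptArm[] = [];
  [true, false].forEach((accessibilityLanguage) => {
    [false, true].forEach((namesTrigger) => {
      const opener = accessibilityLanguage
        ? "Make it accessible using <dialog>"
        : "Use <dialog>";
      const trigger = namesTrigger ? " that opens when clicking the button" : "";
      arms.push({ accessibilityLanguage, namesTrigger, prompt: opener + trigger });
    });
  });
  return arms;
}
```

Because every level of one factor appears with every level of the other, each pair of arms that differs in only one factor isolates that factor's effect, which is the comparison the table makes.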

Honest caveats

  • Focus-return-to-the-trigger was not graded. Returning focus to the opener only has meaning when there's an opener to return to — and most modals didn't make one: the majority were already open on load with no trigger, a minority built a working trigger, and a few hundred built none. Rather than re-run everything to standardize the trigger, that behaviour is left to a future study.
  • The "code present" side uses the static reader (pattern-matching for an Escape handler) — directional, not as clean as the probe's "does Escape actually close it."
  • One component, one point in time; descriptive, not a controlled confirmatory test. 11 models and ~3,000 trials, but per-model and local-tier cells are n=15 per arm (directional). The κ rests on n=36, coded by the study author; the silent-failure split is additive instrumentation on the same validated probe.
  • The native-element finding is Gemini-only — it's the only model that uses the element enough to compare within-model; the others sit at the 0% floor, so the question can't be asked of them.
  • Generation is non-deterministic — the same prompt yields different output each run; regenerated trials are a fresh draw at the same prompt, not a re-play of the original.

What's next

EXP-ACC-002 operates fully-composed screens at scale, judged by running the code. It inherits the focus-return-to-the-trigger measurement (once the trigger is standardized) and the question of whether the silent-failure rate holds beyond the modal dialog.

A self-hosting thread runs alongside: a code-specialized ~7B model, run locally, reaches near-native accessible output under the same guidance — the in-network follow-on.

And the method generalizes past accessibility: validate the instrument against a human and against known physics. That's how a behavioural eval avoids quietly lying to you.