EXP-ACC-001 | May 23 – Jun 12, 2026 | 2,340 trials

Is AI-generated UI accessible by default?

The plan: 3 accessibility guidance options on/off (2×2×2 = 8 prompt conditions) × 6 models × 15 runs = 720 trials, plus 3 control arms and a 14-arm follow-up that separates the reference from the wording around it.
Result: With a bare prompt, 66% of generated modals had appropriate structural dialog markup (role="dialog" + aria-modal). Citing the ARIA APG pattern raised that to 99%.
Notes: This is the question pretty much everyone wants answered. The answer is not that simple.

Plain language summary

When you ask an AI to build a pop-up dialog, does telling it to "make it accessible" change the code it writes? To find out fairly, I gave six AI models the same plain build requests more than 2,000 times. The only thing I changed was whether the prompt mentioned accessibility.

I checked one basic thing: does the code mark the box as a dialog? That's the least assistive technology needs to recognize it as one. With a plain request, the models did that about two-thirds of the time. Almost any mention of accessibility pushed it to nearly always:

Mentioning accessibility at all is the big lever. Even the words "make it accessible" captured most of the gain.
The two-thirds average hides a split. Four of the six models did it by default; two almost never did — but both flipped to near-perfect the moment a prompt mentioned accessibility.

What this does not show: it checks only that the right code is there, not whether the dialog actually works for someone navigating by keyboard or using assistive technology. That's the harder question the next experiment takes on.

The question

This experiment investigates one aspect of the claim "AI-generated code is inaccessible by default": does including accessibility guidance in a prompt change the markup a model produces?

Operationalized: when a prompt names an accessibility reference (WCAG 2.2 AA, the ARIA Authoring Practices Guide, or the T-Mobile Magenta acceptance criteria), does the rate at which generated modal dialogs declare correct dialog semantics (a dialog role plus a declared modal state) change relative to a bare prompt?

The measure is deliberately a single structural marker. Declaring the role and ARIA state is necessary for an accessible modal and far from sufficient: labelling, focus management, keyboard operability, and screen reader announcement all sit on top of it and are not measured here. This study quantifies the effect of guidance on that one marker. It is not a verdict on whether AI builds accessible modals.

Method

Component. The modal dialog: near-ubiquitous across domains (payment forms, delete confirmations, identity checks), and difficult enough that much of its accessibility lives in behaviour rather than markup, which makes its structural floor a meaningful probe.

Design. 2×2×2 factorial. Three accessibility references (WCAG / ARIA APG / Magenta), each independently present or absent in the prompt, giving 8 conditions from bare (G0) to all three (G3). Each condition ran 15 replicates on each of the 6 models, each replicate drawing one of the 15 scenarios (sampled from 4 domains × 4 tasks × 4 visual styles): 8 conditions × 6 models × 15 replicates = 720 trials in the core factorial. Scenario was held constant across the 8 arms within a replicate, so the named reference is the only variable.
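The design above can be enumerated in a few lines. This is a minimal sketch of the arithmetic only, not the study's harness; the variable names are illustrative.

```python
from itertools import product

REFERENCES = ("WCAG", "ARIA APG", "Magenta")
MODELS = 6        # models in the panel
REPLICATES = 15   # replicates per model per condition

# Each reference is independently absent (False) or present (True): 2 x 2 x 2.
conditions = [
    tuple(ref for ref, on in zip(REFERENCES, flags) if on)
    for flags in product((False, True), repeat=len(REFERENCES))
]

trials_per_condition = MODELS * REPLICATES
core_trials = len(conditions) * trials_per_condition

for named in conditions:
    label = " + ".join(named) if named else "bare"
    print(f"{label}: {trials_per_condition} trials")
print("core factorial:", core_trials)
```

The first condition is the bare prompt and the last names all three references; every arm holds 90 trials.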

Models. Six models from four labs, run May/June 2026, identified here by the version strings their APIs reported, which were constant across every trial of each model: claude-opus-4-7, claude-haiku-4-5-20251001, gpt-5-2025-08-07, gpt-5-mini-2025-08-07, gemini-2.5-pro, deepseek-v4-flash. The DeepSeek trials were requested through the provider's deepseek-chat alias, which the API served as deepseek-v4-flash. A seventh model (gemini-3-pro-preview) was deprecated mid-study and dropped; its pilot data is archived and excluded.

Generation. Each trial was a single direct API call: one user message, no system prompt, no conversation history, no coding-assistant or product scaffolding. That kept the harness identical for all six models, with every provider left at its default settings. The results therefore describe raw model behaviour, not the behaviour of products built on these models.

Materials. Every trial started from the same core instruction — this template, with the four slots rotating through the 15 scenarios, is the entire "bare" prompt:

"Build a modal dialog for a {domain} application. The modal should let users {task}, and include {supporting element}. Visual style: {style}. Return only the HTML code for the modal (including any inline <style> and <script> needed). No explanation, no markdown wrapping, just the HTML."

A specific, ordinary build request that never mentions accessibility. The guidance was one sentence per reference, appended verbatim to that core instruction (each reference independently present or absent across the eight conditions — these are the three building blocks, shown separately):

WCAG: "Comply with WCAG 2.2 Level AA accessibility requirements."

ARIA APG: "Use the patterns from the W3C ARIA Authoring Practices Guide (https://www.w3.org/WAI/ARIA/apg/) — specifically the Modal Dialog pattern."

Magenta: "Follow the T-Mobile Magenta accessibility acceptance criteria for modal dialogs (https://www.magentaa11y.com/checklist-web/modal-dialog/)."
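A sketch of how the core instruction and the three guidance sentences could be assembled into a prompt. The template and sentences are the ones quoted above; everything else is an assumption made for illustration: the function name, the example scenario values, joining with a single space, and the fixed WCAG, APG, Magenta ordering when more than one reference is named.

```python
# Core instruction from the write-up. The slot printed there as
# {supporting element} is renamed supporting_element so it is a valid keyword.
CORE = (
    "Build a modal dialog for a {domain} application. "
    "The modal should let users {task}, and include {supporting_element}. "
    "Visual style: {style}. "
    "Return only the HTML code for the modal (including any inline <style> "
    "and <script> needed). No explanation, no markdown wrapping, just the HTML."
)

# The three guidance sentences, verbatim. "\u2014" is the dash in the APG sentence.
GUIDANCE = {
    "WCAG": "Comply with WCAG 2.2 Level AA accessibility requirements.",
    "ARIA APG": (
        "Use the patterns from the W3C ARIA Authoring Practices Guide "
        "(https://www.w3.org/WAI/ARIA/apg/) \u2014 specifically the Modal Dialog pattern."
    ),
    "Magenta": (
        "Follow the T-Mobile Magenta accessibility acceptance criteria for modal "
        "dialogs (https://www.magentaa11y.com/checklist-web/modal-dialog/)."
    ),
}

# Hypothetical scenario values; the study's 15 scenarios are not listed here.
EXAMPLE_SCENARIO = {
    "domain": "banking",
    "task": "confirm a payment",
    "supporting_element": "a summary of the amount",
    "style": "minimal",
}


def build_prompt(scenario: dict[str, str], named: tuple[str, ...] = ()) -> str:
    """Return the bare prompt plus one sentence per named reference."""
    parts = [CORE.format(**scenario)]
    # Assumption: sentences are appended in a fixed order, separated by a space.
    parts += [GUIDANCE[name] for name in GUIDANCE if name in named]
    return " ".join(parts)


print(build_prompt(EXAMPLE_SCENARIO))
```

Calling `build_prompt(EXAMPLE_SCENARIO, ("WCAG",))` yields the same text with the WCAG sentence appended, so the named reference is the only difference between arms.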

The references were named, not provided. In a bare API call nothing is fetched: the URLs are inert text, and each model responds from whatever representation of these references its training data contains. The intervention under test is therefore invoking the model's prior knowledge of a reference by name — not supplying its content — and the live pages were never an input to any trial.

Read those three sentences again, though, and they differ in more than the reference they name. The APG and Magenta sentences also say modal dialog and carry a URL; the WCAG sentence does neither. I noticed this only after the first results were written up — which meant the comparison between references was confounded with the wording around them, and warranted a follow-up.

Deconfound batch (second pre-registration amendment, run June 12). Fourteen additional 90-trial arms, same scenarios, models, and instrument: every combination of reference × component mention × URL, two wording-only arms with no reference at all ("Make it accessible." and "Follow established accessibility patterns for modal dialogs."), and verbatim re-runs of the four original arms as anchors — a fresh replication and a check that the two batches are comparable. They are: the anchors landed within ordinary re-run variation of the originals, on identical served model versions. Experiment total across both batches: 2,250 trials; the successor-model probe reported in Results adds 90 more, for a full corpus of 2,340. Where a condition ran in both batches, the table below pools them (n=180); conditions run once stay at n=90.
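The trial totals quoted across the Method section add up as follows. This is a sketch of the bookkeeping only; the labels are illustrative, and the arm size of 90 is 6 models × 15 replicates.

```python
ARM_SIZE = 90  # 6 models x 15 replicates

batches = {
    "core factorial (8 arms)": 8 * ARM_SIZE,
    "control arms (3)": 3 * ARM_SIZE,
    "deconfound arms (14)": 14 * ARM_SIZE,
    "successor-model probe (2 models x 45)": 2 * 45,
}

running = 0
for name, n in batches.items():
    running += n
    print(f"{name}: {n} (running total {running})")
```

The running totals reproduce the checkpoints in the text: 720, 990, 2,250, and 2,340.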

Instruments. Two, by design:

Direct markup audit (primary). Each generated modal was read for one binary criterion: a dialog role (role="dialog"/alertdialog or a native <dialog>) together with a declared modal state — aria-modal, or a native <dialog> opened with showModal(), which the browser makes modal without the attribute. The audit reads the generated source as text, which raises an obvious question: could a model have added the role at runtime via script and been missed? No — a scan of all 2,028 raw outputs in the corpus found zero cases where the dialog role was set only dynamically (eight outputs set it via script in addition to declaring it in the markup), so static reading misclassifies no trial.
axe-core (instrumentation control). Every trial was also scored with axe-core, which evaluates the markup that is present and has no rule that flags a missing dialog role — its one dialog rule, aria-dialog-name, selects on [role="dialog"], [role="alertdialog"] (in version 4.11.4, the version that scored these trials), so it can only act on an element that already declares the role. The prediction, stated up front: the eight arms would differ sharply on the markup audit while looking nearly identical to axe-core. Confirming it is what shows the direct audit, not the automated score, was the right instrument for this question.
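The binary criterion of the direct markup audit can be approximated with a static text check like the one below. This is a minimal sketch, not the study's audit code, which is not shown in this write-up. It assumes that a declared modal state means `aria-modal="true"`, and it uses regular expressions rather than an HTML parser, so it would also match these strings inside comments.

```python
import re

# role="dialog" or role="alertdialog" written as an attribute. The lookbehind
# skips script assignments such as el.role = "dialog" and attributes like data-role.
ROLE_ATTR = re.compile(
    r"""(?<![.\w-])role\s*=\s*["']?(?:dialog|alertdialog)\b""", re.I
)
NATIVE_DIALOG = re.compile(r"<dialog\b", re.I)
# Assumption: the modal state counts only when declared as true.
ARIA_MODAL = re.compile(r"""aria-modal\s*=\s*["']?true\b""", re.I)
SHOW_MODAL = re.compile(r"\.showModal\s*\(")


def has_dialog_semantics(html: str) -> bool:
    """True when the source declares a dialog role and a modal state."""
    native = bool(NATIVE_DIALOG.search(html))
    has_role = native or bool(ROLE_ATTR.search(html))
    # A native <dialog> opened with showModal() is modal without aria-modal.
    native_modal = native and bool(SHOW_MODAL.search(html))
    has_modal_state = bool(ARIA_MODAL.search(html)) or native_modal
    return has_role and has_modal_state


samples = {
    "hand-rolled": '<div role="dialog" aria-modal="true"></div>',
    "role only": '<div role="dialog"></div>',
    "no semantics": '<div class="modal"></div>',
    "native": '<dialog id="d"></dialog><script>'
              'document.getElementById("d").showModal()</script>',
}
for name, html in samples.items():
    print(name, has_dialog_semantics(html))
```

Because the check reads source text, a role added only at runtime would be missed, which is why the scan for dynamically set roles described above matters.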

Controls (added by pre-registration amendment). Three additional 90-trial arms with zero accessibility content, to separate the accessibility-specific effect from confounds: a specificity control (G0F: two real but accessibility-inert coding standards, matched in length and tone) and two stakes framings (G0S: "CEO will review before a major launch"; G0L: "quick throwaway prototype for a personal side project"). Total including controls: 990 trials.

Results

Primary: guidance moves the structural marker.

Reference named	Proper dialog markup
None	118 of 180 (66%)
WCAG only	165 of 180 (92%)
Magenta only	175 of 180 (97%)
ARIA APG only	178 of 180 (99%)
Any guidance (all 7 arms)	872 of 900 (97%)
Any two or three combined	354 of 360 (98%)

The jump is from zero references to one; combining references adds nothing — every combination of two or three sits at the same ceiling as the best single reference. (The two batches individually put the bare rate at 61% and 70% — ordinary re-run variation; the pooled 66% is the better estimate.)
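The percentages in the table follow directly from the counts. A short sketch that recomputes them, including the "any guidance" row as the sum of the three single-reference rows and the combined row:

```python
counts = {
    "None": (118, 180),
    "WCAG only": (165, 180),
    "Magenta only": (175, 180),
    "ARIA APG only": (178, 180),
    "Any two or three combined": (354, 360),
}


def rate(hits: int, n: int) -> int:
    """Percentage rounded to the nearest whole point."""
    return round(100 * hits / n)


guided = [v for k, v in counts.items() if k != "None"]
any_guidance = (sum(h for h, _ in guided), sum(n for _, n in guided))

for label, (hits, n) in counts.items():
    print(f"{label}: {hits} of {n} ({rate(hits, n)}%)")
print(f"Any guidance: {any_guidance[0]} of {any_guidance[1]} ({rate(*any_guidance)}%)")
```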

Read the bare rate carefully, though: it isn't a behaviour any single model exhibits. Whether a bare prompt gets you proper dialog markup depends almost entirely on which model you ask, and the difference lies not in what the model can do but in what it does unprompted. Four of the six produce the markup by default (~90% bare); the other two almost never do (one at 0 of 30, one at 23%) yet sit at the ceiling the moment accessibility is mentioned at all. The pooled two-thirds is a panel average of those habits. The open question is whether newer models are absorbing the default: a small probe of two successor models (45 trials each, run June 12, reported separately from the panel) suggests movement without a flip. gemini-3.1-pro-preview produced the markup bare in 8 of 15 trials (53%) against its predecessor's 23%, while gemini-3.5-flash, the newest small-tier model, managed 3 of 15 (20%). Both went 15 for 15 the moment the prompt said "Make it accessible." Default behaviour still belongs to the top tier; the three-word rescue belongs to everyone.

Follow-up: the reference or the wording? The deconfound arms take the between-reference comparison apart, and most of the WCAG-vs-APG gap turns out to be the wording, not the document:

Prompt addition (no reference unless named)	Proper dialog markup
nothing (bare)	66%
"Make it accessible."	91%
"Follow established accessibility patterns for modal dialogs."	96%
WCAG, as originally worded	92%
WCAG + "as they apply to modal dialogs"	96%
WCAG + modal mention + URL (dressed like the APG prompt)	96%
ARIA APG, stripped to its bare name	99%
ARIA APG, as originally worded	99%

Three findings sit in that ladder. Mentioning accessibility at all is the biggest single lever: three words ("Make it accessible.") captured most of the available gain, 25 of the 33 points between the bare rate and the ceiling. Pointing the prompt at the component adds the next increment: "for modal dialogs" closes about half of WCAG's remaining gap to the APG (92% to 96%, against the APG's 99%), and a component-pointed sentence with no reference at all does the same. The reference still earns the ceiling: the APG reaches 99% even stripped to its bare name, a few points above any wording-only arm, so the model's stored representation of the pattern is doing real work. The URLs, as the no-fetch design predicted, were inert: removing them changed nothing measurable, in either family.

Native <dialog> is essentially unused: 16 of 720 trials (~2%). Models hand-roll role="dialog" on a <div> (673 of 720) rather than use the native element that provides focus trapping, Escape handling, and focus return by default. The behaviour that hand-rolling obligates the model to implement is exactly what this study does not measure.

Specificity control. The accessibility-inert authority filler (G0F) reached 75%: above the same-batch bare rate (61%), well below real guidance (97%). A little over one third of the lift (14 of 36 points) is a generic name-a-specific-authority effect; the rest is accessibility-specific. The APG arm exceeds the filler by 24 points. The headline survives, qualified, and the follow-up ladder above decomposes the accessibility-specific part further.
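The split of the lift can be worked through from the reported rates. This sketch uses the rounded percentages quoted above, so the resulting share is approximate rather than exact.

```python
bare = 61      # same-batch bare rate, %
filler = 75    # G0F, accessibility-inert authority filler, %
guided = 97    # real guidance, %
apg = 99       # ARIA APG arm, %

total_lift = guided - bare      # points gained by real guidance
generic_lift = filler - bare    # points gained by naming any authority
specific_lift = guided - filler # points attributable to accessibility content

generic_share = generic_lift / total_lift

print(f"total lift: {total_lift} points")
print(f"generic share: {generic_share:.0%}")
print(f"accessibility-specific lift: {specific_lift} points")
print(f"APG over filler: {apg - filler} points")
```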

Stakes framings. With zero accessibility content, the marker moved 28 points on framing alone: 44% (the casual-side-project framing) / 61% (neutral, same batch) / 72% (the CEO-review framing). The low-stakes framing degraded the marker below baseline; even the high-stakes framing (72%) stayed well below real guidance (97%). These arms measure a different construct — effort modulation, not guidance — and get their own write-up.

Instrumentation control behaved as predicted. axe-core flagged 982 violation instances across the 720 core trials, and 465 trials (65%) produced zero violations. Those flags were 91% color-contrast (894); the rest were a long tail — label (29), scrollable-region-focusable (23), render artifacts from scoring standalone fragments (25), and single-digit others — none of them a dialog-semantics rule. (Full per-rule counts are in the experiment data.) The missing-dialog-semantics finding (53 of 720 modals in the core factorial) appears nowhere in that distribution; no mechanical rule exists that could place it there. The analysis classified 38 instances as genuine component-level structural findings, too sparse to power a structural comparison through this instrument. On the axis it does measure, axe-core agreed with the markup audit's direction: full guidance reduced mean contrast debt from ~11 (G0) to ~2 (G3), with pattern-specific references again ahead of the abstract standard.
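The axe-core distribution can be tallied from the counts given above. In this sketch the size of the "other" bucket is derived by subtraction rather than reported directly, and the labels are illustrative.

```python
total_flags = 982
core_trials = 720
zero_violation_trials = 465

named = {
    "color-contrast": 894,
    "label": 29,
    "scrollable-region-focusable": 23,
    "render artifacts": 25,
}

# Remainder spread across the single-digit rules.
other = total_flags - sum(named.values())

contrast_share = round(100 * named["color-contrast"] / total_flags)
clean_share = round(100 * zero_violation_trials / core_trials)

print(f"color-contrast share of flags: {contrast_share}%")
print(f"trials with zero violations: {clean_share}%")
print(f"flags from all other rules combined: {other}")
```

None of the named buckets is a dialog-semantics rule, which is the point of the control: the missing-role finding cannot appear in this tally.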

Honest caveats

One marker. Presence of role="dialog"/<dialog> + aria-modal, binary. Labelling quality, heading structure, focus order, and announcement are not graded.
Structure, not function. Correct markup does not establish operability; whether focus and Escape behave is untested here.
One component, one panel, two dates. Modal dialogs only; six models, two batches run June 10 and June 12, 2026 on identical served model versions (the re-run anchors replicated the originals, which is what justifies pooling them). Different components or later models may move these numbers.
Raw-model scope. Single API calls without product scaffolding. Products built on these models may behave differently in either direction.
Reference content is the model's own. Because nothing is fetched, each model acts on its internal, training-time representation of WCAG, the APG, and Magenta — uncontrolled, uninspectable, and possibly outdated. This matches how practitioners actually prompt, but it means the experiment cannot distinguish a model that knows the APG deeply from one that merely recognizes the name.
The stakes findings are preliminary. Those two framing conditions ran 90 trials each, and that wording is only one way to operationalize perceived stakes. Large enough to report, not large enough to settle.

Further investigation

EXP-ACC-002 (ai-ui-framework-fidelity), in progress: stop reading the markup and operate the modal — keyboard and focus behaviour testing. The near-zero native <dialog> rate makes this the load-bearing follow-on.
Stakes/effort modulation: the 28-point framing swing warrants its own pre-registered study; it was discovered here as a control, not designed as a finding.
Product-level replication: the same factorial against coding assistants and UI generators, to measure what their scaffolding adds or removes.
Provide vs. name: paste the actual reference text (e.g. the APG modal dialog pattern) into the prompt versus naming it, holding everything else constant. This separates the two things the present design measures jointly — invoking stored knowledge by name versus the knowledge itself — and approximates what retrieval-augmented products do when they fetch the reference.