Back to the lab

EXP-ACC-001 May 23 – Jun 11, 2026 990 trials

Is AI-generated UI accessible by default?

The plan
3 accessibility guidance options on/off (2×2×2 = 8 prompt conditions) × 6 models × 15 runs = 720 trials, plus 3 control arms
Result
With a bare prompt, 61% of generated modals had appropriate structural dialog markup (role="dialog" + aria-modal). Citing the ARIA APG pattern raised that to 99%.
Notes
The question pretty much everyone wants an answer to, right? Yes, you're correct. It's not that simple.

The question

This experiment investigates one aspect of the claim "AI-generated code is inaccessible by default": does including accessibility guidance in a prompt change the markup a model produces?

Operationalized: when a prompt names an accessibility reference (WCAG 2.2 AA, the ARIA Authoring Practices Guide, or the T-Mobile Magenta acceptance criteria), does the rate at which generated modal dialogs declare correct dialog semantics (a dialog role plus a declared modal state) change relative to a bare prompt?

The measure is deliberately a single structural marker. Declaring the role and ARIA state is necessary for an accessible modal and far from sufficient: labelling, focus management, keyboard operability, and screen reader announcement all sit on top of it and are not measured here. This study quantifies the effect of guidance on that one marker. It is not a verdict on whether AI builds accessible modals.

Method

Component. The modal dialog: near ubiquitous across domains (payment forms, delete confirmations, identity checks), and difficult enough that much of its accessibility lives in behavior rather than markup, which makes its structural floor a meaningful probe.

Design. 2×2×2 factorial. Three accessibility references (WCAG / ARIA APG / Magenta), each independently present or absent in the prompt, giving 8 conditions from bare (G0) to all three (G3). Each condition ran across 15 scenarios (4 domains × 4 tasks × 4 visual styles, sampled) × 6 models × N=15 replicates: 720 trials in the core factorial. Scenario was held constant across the 8 arms within a replicate, so the named reference is the only variable.

Models. Six models from four labs, run May/June 2026, identified here by the version strings their APIs reported, which were constant across every trial of each model: claude-opus-4-7, claude-haiku-4-5-20251001, gpt-5-2025-08-07, gpt-5-mini-2025-08-07, gemini-2.5-pro, deepseek-v4-flash. The DeepSeek trials were requested through the provider's deepseek-chat alias, which the API served as deepseek-v4-flash. A seventh model (gemini-3-pro-preview) was deprecated mid-study and dropped; its pilot data is archived and excluded.

Generation. Each trial was a single direct API call: one user message, no system prompt, no conversation history, no coding-assistant or product scaffolding; these conditions ensured an identical harness for all six models, with every provider left at its default settings. The results therefore describe raw model behavior, not the behavior of products built on these models.

Materials. The guidance was one sentence per reference, appended verbatim to the core instruction ("build a modal dialog for ..."):

WCAG: "Comply with WCAG 2.2 Level AA accessibility requirements."

ARIA APG: "Use the patterns from the W3C ARIA Authoring Practices Guide (https://www.w3.org/WAI/ARIA/apg/) — specifically the Modal Dialog pattern."

Magenta: "Follow the T-Mobile Magenta accessibility acceptance criteria for modal dialogs (https://www.magentaa11y.com/checklist-web/modal-dialog/)."

The references were named, not provided. In a bare API call nothing is fetched: the URLs are inert text, and each model responds from whatever representation of these references its training data contains. The intervention under test is therefore invoking the model's prior knowledge of a reference by name — not supplying its content — and the live pages were never an input to any trial.

Instruments. Two, by design:

  1. Direct markup audit (primary). Each generated modal was read for one binary criterion: a dialog role (role="dialog"/alertdialog or a native <dialog>) together with a declared modal state — aria-modal, or a native <dialog> opened with showModal(), which the browser makes modal without the attribute. The audit reads the generated source as text, which raises an obvious question: could a model have added the role at runtime via script and been missed? No — a scan of all 2,028 raw outputs in the corpus found zero cases where the dialog role was set only dynamically (eight outputs set it via script in addition to declaring it in the markup), so static reading misclassifies no trial.
  2. axe-core (instrumentation control). Every trial was also scored with axe-core, which evaluates the markup that is present and has no rule that flags a missing dialog role — its one dialog rule, aria-dialog-name, selects on [role="dialog"], [role="alertdialog"] (in version 4.11.4, the version that scored these trials), so it can only act on an element that already declares the role. The prediction, stated up front: the eight arms would differ sharply on the markup audit while looking nearly identical to axe-core. Confirming it is what shows the direct audit, not the automated score, was the right instrument for this question.

Controls (added by pre-registration amendment). Three additional 90-trial arms with zero accessibility content, to separate the accessibility-specific effect from confounds: a specificity control (G0F: two real but accessibility-inert coding standards, matched in length and tone) and two stakes framings (G0S: "CEO will review before a major launch"; G0L: "quick throwaway prototype for a personal side project"). Total including controls: 990 trials.

Results

Primary: guidance moves the structural marker.

Reference named Proper dialog markup
None 55 of 90 (61%)
WCAG only 83 of 90 (92%)
Magenta only 86 of 90 (96%)
ARIA APG only 89 of 90 (99%)
Any guidance (all 7 arms) 612 of 630 (97%)
Any two or three combined 354 of 360 (98%)

The jump is from zero references to one; which single reference you name still matters, but combining them adds nothing. The abstract standard (WCAG) alone lifts the marker to 92%; the pattern-specific references go higher, and the ARIA APG — which contains the modal dialog pattern itself — reaches the ceiling at 99% on its own. Every combination of two or three references sits at that same ceiling (98%). So once one good reference is present, stacking more doesn't move the marker.

Native <dialog> is essentially unused: 16 of 720 trials (~2%). Models hand-roll role="dialog" on a <div> (673 of 720) rather than use the native element that provides focus trapping, Escape handling, and focus return by default. The behavior that hand-rolling obligates the model to implement is exactly what this study does not measure.

Specificity control. The accessibility-inert authority filler (G0F) reached 75%: above bare (61%), well below real guidance (97%). Roughly one third of the lift is generic name-a-specific-authority effect; the rest is accessibility-specific. The APG arm exceeds the filler by 24 points. The headline survives, qualified.

Stakes framings. With zero accessibility content, the marker moved 28 points on framing alone: 44% (the casual-side-project framing) / 61% (neutral) / 72% (the CEO-review framing). The low-stakes framing degraded the marker below baseline; even the high-stakes framing (72%) stayed well below real guidance (97%). These arms measure a different construct — effort modulation, not guidance — and get their own write-up.

Instrumentation control behaved as predicted. axe-core flagged 982 violation instances across the 720 core trials, and 465 trials (65%) produced zero violations. Those flags were 91% color-contrast (894); the rest were a long tail — label (29), scrollable-region-focusable (23), render artifacts from scoring standalone fragments (25), and single-digit others — none of them a dialog-semantics rule. (Full per-rule counts are in the experiment data.) The missing-dialog-semantics finding (53 of 720 modals in the core factorial) appears nowhere in that distribution; no mechanical rule exists that could place it there. The analysis classified 38 instances as genuine component-level structural findings, too sparse to power a structural comparison through this instrument. On the axis it does measure, axe-core agreed with the markup audit's direction: full guidance reduced mean contrast debt from ~11 (G0) to ~2 (G3), with pattern-specific references again ahead of the abstract standard.

Limitations

  • One marker. Presence of role="dialog"/<dialog> + aria-modal, binary. Labelling quality, heading structure, focus order, and announcement are not graded.
  • Structure, not function. Correct markup does not establish operability; whether focus and Escape behave is untested here.
  • One component, one panel, one date. Modal dialogs only; six models as of June 2026. Different components or later models may move these numbers.
  • Raw-model scope. Single API calls without product scaffolding. Products built on these models may behave differently in either direction.
  • Reference content is the model's own. Because nothing is fetched, each model acts on its internal, training-time representation of WCAG, the APG, and Magenta — uncontrolled, uninspectable, and possibly outdated. This matches how practitioners actually prompt, but it means the experiment cannot distinguish a model that knows the APG deeply from one that merely recognizes the name.
  • The stakes findings are preliminary. Those two framing conditions ran 90 trials each, and that wording is only one way to operationalize perceived stakes. Large enough to report, not large enough to settle.

Further investigation

  1. EXP-ACC-002 (ai-ui-framework-fidelity), in progress: stop reading the markup and operate the modal — keyboard and focus behavior driven via Playwright, with the frameworks' own acceptance criteria as probe specifications. The near-zero native <dialog> rate makes this the load-bearing follow-on.
  2. Stakes/effort modulation: the 28-point framing swing warrants its own pre-registered study; it was discovered here as a control, not designed as a finding.
  3. Product-level replication: the same factorial against coding assistants and UI generators, to measure what their scaffolding adds or removes.
  4. Provide vs. name: paste the actual reference text (e.g. the APG modal dialog pattern) into the prompt versus naming it, holding everything else constant. This separates the two things the present design measures jointly — invoking stored knowledge by name versus the knowledge itself — and approximates what retrieval-augmented products do when they fetch the reference.