EXP-DSC-001.1 Feb 19 → Jun 13, 2026

↳ re-run of EXP-DSC-001 against sites with established, published design systems

AI-assisted design system component identification

The plan
Re-run detection and component inventory against sites with established, published design systems
Result
Identified 40, 32, 65, 22, and 14 component families across the five sites; three had a formally detectable design system, two did not but still showed clear component reuse
Notes
Detector needs iteration: it has to read into (open) shadow-DOM components, and handle sites where the design system is a thin layer over a CSS framework

Plain language summary

Large sites are usually built from a reused set of components, often a formal design system. This run tested whether AI can identify that system from the outside, on sites known to run one. It's the step before an accessibility agent could work on a site at the component level.

It identified three of the five. The two misses each had a clear, specific cause.

What this does not show: it isn't a score for any model or any site. It's a map of where this kind of detection works well and where it doesn't, and it points to where the tool has to improve before its results are reliable.

The question

Identification is the upstream step: before an accessibility agent can work on a site at the component level, it has to identify the components the site is actually built from. Whether that identification is reliable from the outside, with no ground truth, is what this re-run tests.

The first pass (EXP-DSC-001) could find component families on well-structured sites, but it couldn't separate a site that has no design system from one whose system it simply couldn't identify. This re-run removes that ambiguity by design: every site it points at is known to run a published design system, so a miss can only mean the detector failed, not that there was nothing to find.

That raised the real question: can a machine reliably identify a component system that is definitely there?

Answering it cleanly meant turning the identification mechanism into a measuring instrument: pre-registered, with the detection rules and the answer keys written down and locked before any site was scored.

How I tested it

Ten production sites, crawled by an unattended browser overnight on June 12 and 13, 2026, then scored against the locked answer keys. The core set was five sites chosen because each runs on a well-known, published design system: the cases where a miss is unambiguously a miss.

Detection was held to a strict bar: zero misses on the known-system set. A single miss fails it.

The detector itself runs no model. An LLM built its matching tables once, at the start; after that, identifying a site's components is deterministic, so the same page gives the same answer every time.
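To make the "no model at detection time" point concrete, here is a minimal sketch of a table-driven matcher. Everything in it is an assumption for illustration: the patterns, the family names, and the function name `identify_families` are hypothetical, not the detector's real tables. The only property it is meant to show is that a fixed table plus ordered output gives the same answer for the same input.

```python
import re

# Hypothetical matching table. In the experiment an LLM built the real
# tables once; these patterns and family names are invented for illustration.
MATCHING_TABLE = [
    (re.compile(r"^btn(-|$)"), "button"),
    (re.compile(r"^modal(-|$)"), "modal"),
    (re.compile(r"^card(__|--|-|$)"), "card"),
]


def identify_families(class_names):
    """Map class names to component families using only the fixed table."""
    families = set()
    for name in class_names:
        for pattern, family in MATCHING_TABLE:
            if pattern.search(name):
                families.add(family)
                break  # first matching rule wins
    # Sorting keeps the result independent of set iteration order.
    return sorted(families)


page_classes = ["btn-primary", "card__title", "nav"]
print(identify_families(page_classes))
```

Because nothing in the lookup is sampled or learned at run time, re-running it on the same list of class names cannot change the result.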

The rules were locked first, then applied. A real bug in the detector surfaced and got fixed mid-run. Both the original and the fixed version are kept and hashed, so the trail is auditable and no threshold got quietly tuned to make the result look better. One threshold that would have rescued a miss was left alone on purpose, because lowering it would have been fitting the test to the answer.
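One way to keep both versions auditable is to record a content hash for each. The sketch below assumes SHA-256 and a JSON record; the write-up says the versions were hashed but not how, so the hash choice, the labels, and the helper names `fingerprint` and `audit_record` are all hypothetical.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of a rules file's bytes."""
    return hashlib.sha256(content).hexdigest()


def audit_record(versions):
    """Build a JSON record mapping each version label to its digest."""
    record = {label: fingerprint(content) for label, content in versions.items()}
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder contents standing in for the two detector versions.
versions = {
    "detector-original": b"rules as first locked",
    "detector-fixed": b"rules after the bug fix",
}
print(audit_record(versions))
```

Any later change to either file changes its digest, so a threshold edited after the fact would show up as a mismatch against the record.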

What happened

Detection failed the strict bar of zero misses. That turned out to be the useful part.

Three of the five known-system sites were found cleanly. The two misses each traced to a single, identifiable cause, which turns "unreliable" into something more precise:

What the detector faced | Result
A system whose components carry a consistent class-name convention in the page | Found reliably, across both common naming styles
A system shipped as self-contained web components, with their structure sealed inside a shadow DOM | Missed. The naming signal never reaches the visible page, so there is nothing to read
A system layered thinly over a common CSS framework | Missed. Its own footprint was too small to clear the detection threshold

So the honest version is narrow and useful: the mechanism reliably finds design systems that expose their structure as ordinary class names in the page, and it can't detect systems that seal their structure away or barely leave a footprint of their own. That is a boundary, not a verdict of "doesn't work." The same threshold also produced false alarms on sites with no real system, where a coincidental repeated class name can look like a vocabulary, so the line errs in both directions.
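The way one threshold can err in both directions can be sketched with a toy footprint measure. This is an assumption-heavy illustration: the write-up does not say how the footprint is computed, so the share-of-class-names measure, the `ds-` prefix, and the 0.3 cut-off below are all hypothetical.

```python
def system_footprint(class_names, prefix):
    """Share of class-name occurrences on a page that carry the given prefix."""
    if not class_names:
        return 0.0
    hits = sum(1 for name in class_names if name.startswith(prefix))
    return hits / len(class_names)


def detect(class_names, prefix, threshold):
    """Report a system only when its footprint clears the threshold."""
    return system_footprint(class_names, prefix) >= threshold


THRESHOLD = 0.3  # hypothetical value, fixed before scoring

# A system that exposes its names throughout the page.
exposed = ["ds-button"] * 8 + ["hero", "footer"]
# A thin layer over a framework: one system class among framework classes.
thin_layer = ["ds-button"] + ["fw-col"] * 9
# No real system, but one class name happens to repeat a lot.
coincidence = ["box-a"] * 4 + ["misc"] * 6

print(detect(exposed, "ds-", THRESHOLD))
print(detect(thin_layer, "ds-", THRESHOLD))
print(detect(coincidence, "box-", THRESHOLD))
```

In this toy version, lowering the cut-off far enough to catch the thin layer would also admit more coincidental repeats, which is the trade-off behind leaving the locked threshold alone.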

One thing I didn't go looking for

On a site built with a major component library, the detector surfaced components that used the library's class names but weren't part of the library at all. They appear to be the team's own pieces, built on the same naming. Against the published catalogue they look like things the tool invented, but they aren't: they're real, reused components that simply aren't in the formal design system, which makes them candidates to pull into the system next.

It also sharpens what would count as a real mistake. A hallucination isn't finding reuse the catalogue doesn't list; it's the mechanism getting the identification wrong: calling a date picker a modal, or flagging a component where there is nothing.
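That distinction can be written down as a small classification rule. The sketch assumes each finding has been reviewed by a person who records what the element really is; that review step, the labels, and the function name `classify_finding` are hypothetical and not part of the current tool.

```python
def classify_finding(detected_family, true_family, catalogue):
    """Sort one detector finding into a review bucket.

    detected_family: what the detector called the element
    true_family: what a reviewer says it is, or None if nothing is there
    catalogue: set of families in the published design system
    """
    if true_family is None:
        return "misidentification: nothing there"
    if detected_family != true_family:
        return "misidentification: wrong family"
    if detected_family in catalogue:
        return "catalogued component"
    # Correctly identified and reused, but absent from the catalogue.
    return "uncatalogued reuse"


catalogue = {"modal", "button"}
print(classify_finding("modal", "date-picker", catalogue))
print(classify_finding("promo-tile", "promo-tile", catalogue))
```

Only the first two buckets count as detector errors here; "uncatalogued reuse" is kept apart because it is a finding about the site, not a fault in the tool.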

Honest caveats

  • Detection failed its pre-registered bar, and it's reported as a characterized failure, not dressed up as a pass.
  • The two misses and the false alarms are real and named.
  • Site names are withheld; the systems are described by how they ship, which is what the finding turns on.
  • The model cost was effectively zero: the only model use was that one-time table-building step.

What's next

  • Read components that are sealed inside a shadow DOM, where the detector currently sees nothing.
  • Catch design systems that sit too thinly over a framework to register.
  • Build a check for real misidentification (a date picker tagged as a modal, or a component flagged where there is nothing), kept separate from the useful case of finding reuse the catalogue doesn't list.

All of it serves one goal: component identification the accessibility agents can trust.