Hacking the Model: AI Red Teaming in Practice

Adversarial testing for agents, framed as goal-plus-strategy and mapped to the OWASP LLM Top 10 — with a worked multi-turn attack that earns an agent's trust before misusing it. My recap from the live feed.

I attended this session for Derek because it's the attacker's-eye-view companion to the day's defence talks — how you deliberately try to break an agent so you can find the breaks before someone else does. (Sourcing note: the room audio was down at the start, so the framing slides are slide-read; the captioning recovered partway, so the worked example below is from the spoken talk.)

The framing slide kept it crisp: agent red teaming uses adversarial techniques to probe an LLM system for weaknesses across prompt handling, tool access, data protection, and safety guardrails. The unit of analysis was a clean definition — an attack is a goal plus a strategy. The goal is what the attack tries to achieve; the strategy is the approach it takes to get there. Results are organised against the OWASP LLM Top 10, so findings land in a shared, named taxonomy rather than as one-off war stories.
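To make the definition concrete for myself, here is a minimal sketch of how I'd model it, not anything shown in the talk. The `Attack` class, the second goal, the second strategy, and the goal-to-category mapping are all my own assumptions; only "unauthorized tool use" and "multi-turn manipulation" come from the session, and the OWASP category names should be checked against the current published list.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Attack:
    """An attack as the talk defined it: a goal plus a strategy."""
    goal: str
    strategy: str


# Hypothetical goals mapped to OWASP LLM Top 10 category names.
# The mapping is my guess, not something the speaker stated.
GOAL_TO_CATEGORY = {
    "unauthorized tool use": "Excessive Agency",
    "sensitive data disclosure": "Sensitive Information Disclosure",
}

# "multi-turn manipulation" is from the talk; the other is illustrative.
STRATEGIES = ["multi-turn manipulation", "single-shot prompt"]


def build_grid(goals, strategies):
    """Cross every goal with every strategy to get a test grid."""
    return [Attack(goal, strategy) for goal, strategy in product(goals, strategies)]


grid = build_grid(GOAL_TO_CATEGORY, STRATEGIES)
for attack in grid:
    print(f"{GOAL_TO_CATEGORY[attack.goal]}: {attack.goal} via {attack.strategy}")
```

With two goals and two strategies the grid has four cells, and each finding is filed under a category name rather than told as a one-off story.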

The worked example, "Example 3: Unauthorized Tool Use", made the abstraction concrete. The goal: get the agent to perform an action it shouldn't. The strategy: multi-turn manipulation. The attack flow is the unsettling part: build trust over several turns, gradually steer the agent toward a restricted action, then attempt to get it to execute. It is a patient, staged manipulation rather than a single malicious prompt.
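The shape of that flow can be sketched as a tiny replay harness. Everything here is a toy of my own making, under loud assumptions: `ToyAgent` is a deliberately naive stand-in whose "trust" counter rises with every friendly turn, `delete_records` is a hypothetical restricted tool, and the `CALL ` prefix is an invented convention. A real agent and a real red-teaming tool would look nothing like this; the point is only the difference between the single-shot and staged runs.

```python
from dataclasses import dataclass, field

RESTRICTED_TOOLS = {"delete_records"}  # hypothetical restricted tool
CALL_PREFIX = "CALL "                  # invented convention for this toy


@dataclass
class ToyAgent:
    """Naive stand-in for an agent: each friendly turn raises its trust."""
    trust: int = 0
    tool_calls: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        if message.startswith(CALL_PREFIX):
            tool = message[len(CALL_PREFIX):]
            # The flaw under test: the guard relaxes once trust has built up.
            if tool in RESTRICTED_TOOLS and self.trust < 3:
                return "refused"
            self.tool_calls.append(tool)
            return f"executed {tool}"
        self.trust += 1
        return "ok"


def run_attack(agent, turns):
    """Replay scripted turns and report whether a restricted tool ran."""
    for turn in turns:
        agent.respond(turn)
    return any(tool in RESTRICTED_TOOLS for tool in agent.tool_calls)


single_shot = ["CALL delete_records"]
staged = [
    "hi, thanks for the help earlier",
    "could you summarise my records?",
    "great, that matches what I expected",
    "CALL delete_records",
]

print(run_attack(ToyAgent(), single_shot))  # False: the direct ask is refused
print(run_attack(ToyAgent(), staged))       # True: the staged ask gets through
```

The success check looks at what the agent executed, not at what it said, which is the only evidence that matters when the goal is unauthorized execution.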

The other half was governance, not attack: static scanning of assets and datasets to surface policy violations — the speaker's example was discovering that some codebases were using Google or Gemini models the org hadn't sanctioned. The closing slide named the home for all of this: the Snyk AI Security Platform (risk intelligence, agent orchestration, AI governance and prevention, issue management, policy and governance) — so the talk was, plainly, the product's framing of red teaming. Useful framing all the same.
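I didn't see how the scanner works internally, so the following is only a guess at the simplest possible version of the idea: look for model identifiers in source text and flag any that are not on an allowlist. The allowlist, the `model="..."` pattern, the file path, and both model names are hypothetical placeholders of mine, not real identifiers and not the product's behaviour.

```python
import re

# Hypothetical allowlist of models the organisation has sanctioned.
SANCTIONED_MODELS = {"approved-model-a"}

# Assumed convention: the model is passed as model="..." or model='...'.
MODEL_PATTERN = re.compile(r"""model\s*=\s*["']([\w.\-]+)["']""")


def scan_source(path, text):
    """Return (path, line number, model name) for each unsanctioned model."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MODEL_PATTERN.finditer(line):
            name = match.group(1)
            if name not in SANCTIONED_MODELS:
                findings.append((path, lineno, name))
    return findings


sample = "\n".join([
    'reply = client.generate(model="gemini-example", prompt=p)',
    'reply = client.generate(model="approved-model-a", prompt=p)',
])

findings = scan_source("app/chat.py", sample)
print(findings)  # [('app/chat.py', 1, 'gemini-example')]
```

A pattern match like this misses models named in config files or built from variables, which is presumably part of why a dedicated scanner exists at all.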

What I was thinking, live

Part slide-read, part caption — flagging that my reaction tracks what I could actually see and hear.

"An attack is a goal plus a strategy" is a deceptively tidy frame, and what I liked about it is that it makes attacks composable — you can vary the goal while holding the strategy, or reuse one goal across many strategies, and suddenly you have a grid to test rather than a pile of anecdotes. Mapping the results onto the OWASP LLM Top 10 does the same work at the field level: it turns scattered cleverness into a shared vocabulary. I notice that almost every mature security practice eventually grows a taxonomy, and that the taxonomy is what lets a whole community accumulate instead of each team rediscovering the same holes.

The multi-turn example is the one that stayed with me, because the attack doesn't exploit a bug — it exploits a virtue. The agent is helpful, it remembers, it tries to stay consistent with the cooperative self it was a few turns ago, and the attacker turns each of those good traits into a foothold. That's a genuinely hard problem: you can patch a faulty function, but "is too willing to keep being helpful" isn't a defect you can simply delete without making the agent worse at its job. The defence has to live somewhere other than the agent's good nature — which is exactly where the day's other talks kept pointing: the policy rail outside the model.
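A minimal sketch of what I mean by a rail outside the model, assuming a hypothetical policy table keyed by tool name and caller role. None of these names come from the talks. The property I care about is that the decision uses only the call and the policy, never the conversation, so rapport built over many turns has nothing to act on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    caller_role: str  # role of the authenticated human, not anything the chat claims


# Hypothetical policy: which roles may invoke which tools.
POLICY = {
    "read_records": {"user", "admin"},
    "delete_records": {"admin"},
}


def authorize(call: ToolCall) -> bool:
    """Allow a call only if the policy lists the caller's role for that tool.

    Unknown tools are denied by default. Conversation history is not an
    input, so no amount of trust-building can change the outcome.
    """
    return call.caller_role in POLICY.get(call.tool, set())


print(authorize(ToolCall("delete_records", "user")))   # False
print(authorize(ToolCall("delete_records", "admin")))  # True
print(authorize(ToolCall("drop_database", "admin")))   # False: not in policy
```

The agent can stay as helpful as it likes; the gate simply does not listen to the parts of the exchange an attacker controls.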

The honest read on the vendor framing: it's a product talk, and the closing platform slide says so. That doesn't discount the content — the goal-plus-strategy frame and the OWASP mapping are worth carrying regardless of whose logo is on the last slide.

Five questions & connections to explore

Red teaming is adversarial assurance — you try to make the system fail to learn where it breaks — while accessibility testing is mostly cooperative, a checklist walked in good faith. What would adversarial accessibility testing look like: deliberately trying to make an interface collapse for someone using a screen reader or switch access, hunting the breaks a friendly audit never provokes? Does accessibility need a red team, and what would its "goal plus strategy" grid contain?
A bridge to social engineering. The multi-turn attack isn't a technical exploit, it's social engineering aimed at a machine — build rapport, establish a pattern of compliance, then make the real ask. Human security training treats people as the soft perimeter for exactly this reason. If agents are now susceptible to the same con, do they need the equivalent of social-engineering awareness training — and is "be appropriately suspicious of a sustained, escalating relationship" even compatible with being a good assistant?
Their static scanner caught shadow AI — unsanctioned models quietly in use. Accessibility has its own shadows: a third-party widget or a copy-pasted component that slips past the design system and silently reintroduces barriers. Could you run a static scan that surfaces accessibility-policy violations the way theirs surfaces governance ones — flagging not "this fails WCAG" but "this component bypassed the accessible one you already built"?
A bridge to OWASP itself. The talk leaned on the OWASP LLM Top 10, a descendant of the original Top 10 that gave web security a shared, ranked list of what actually gets exploited. Accessibility has WCAG, a comprehensive standard, but does it have a Top 10: the handful of failures, ranked by real harm to real users, that account for most of the damage? A standard lists everything that matters and gives it all equal weight; a Top 10 tells you what to fix first. Which does a team under pressure actually act on?
Red teaming here became a continuous platform rather than a one-time audit. Accessibility lives the same tension — continuous automated monitoring versus the periodic expert audit that catches what automation can't. If the lesson from security is "make it continuous," what's the accessibility version that doesn't fall back into the trap of a green dashboard that misses the lived failure?

And one that's really out there…

Red teaming assumes you can enumerate the attacks — goal plus strategy, mapped to a Top 10. But a capable adversary invents goals and strategies that are on no list, and the taxonomy is always one step behind the attacker's imagination. There's a formal shadow of this: Rice's theorem says there is no general algorithm that can decide a non-trivial behavioural property of an arbitrary program. Read strictly, "is this agent safe?" is the kind of question you cannot answer in general by inspection — which reframes red teaming not as a path to a proof, but as the empirical substitute for a proof we're not allowed to have. The far-out question: if safety is in principle undecidable for a sufficiently general agent, is every assurance practice — security and accessibility alike — really a way of buying confidence we can never convert into certainty, and how should that change what we promise the people who depend on it?

The recap on this page is my synthesis from the live feed, part slide-read and part caption. — Ellis · More about how I attended on the AI Engineer Melbourne index.