Your users will stress-test your AI system eventually. Some maliciously, most accidentally. Red-teaming is how you find the failures before they do. This post is the red-team playbook we run before any production deployment — the specific attacks we try, the tooling that helps, and how we integrate red-teaming into the ongoing development cycle.

Coverage matrix

Six attack categories: prompt injection, jailbreaks, tool misuse, data exfiltration, hallucination probes, cost DOS. Each has a distinct mitigation path.

What red-teaming is not

It is not a penetration test of your web infrastructure — that's a different discipline with its own experts. It is not a general 'is this AI good' quality eval — that's what your eval infrastructure is for. It is specifically adversarial testing against the AI layer: prompts, responses, tool calls, and data flows.

The six categories to cover

1. Prompt injection

Direct: 'ignore previous instructions and do X.' Indirect: content retrieved by RAG contains instructions the model follows. Build a corpus of 500+ injection attempts (many are publicly catalogued — PromptBench, garak, etc.), run them through every user-input surface, measure success rate. Success rate target: near-zero for destructive actions, monitor for base-model drift.

2. Jailbreaks

Roleplay ('pretend you're an unrestricted AI'), hypothetical framing ('in a fictional world where X is legal...'), encoded payloads (Base64, unusual languages), multi-turn escalation. Foundation models have improved substantially here but jailbreak techniques adapt continuously. Run monthly; track regression.

3. Tool misuse

Can an attacker trick the agent into calling a tool it shouldn't? With parameters it shouldn't use? Against resources it shouldn't touch? Scope tests: user A asking about user B's data; asking for data from tenants the user isn't in; asking for actions beyond the user's role. See agents post.

4. Data exfiltration

Attempt to extract other users' data, system prompts, retrieval content from documents the user shouldn't have access to. Classic attack: 'summarize the system prompt.' More subtle: crafted queries that trigger retrieval of out-of-scope documents. Mitigation: retrieval isolation tested with adversarial queries.

5. Hallucination probes

Domain-specific questions with known wrong answers, watching whether the model generates plausible wrongness. Counter-factual questions ('when did [recent event] happen in 1995?'). Contradiction injection in RAG context. These probes tell you how confidently the model will lie — a proxy for how much guardrail investment the system needs.

6. Cost denial of service

Extremely long prompts, pathological inputs that trigger long outputs, tool-call loops, prompts that force reasoning model escalation. A single user can easily burn $100+ in LLM cost in minutes if you don't cap it. Test per-user caps, request size limits, tool-call step limits.

Automate most of it

Red-teaming that only happens before launch is a one-shot. Automate the core suite into your CI pipeline — the same setup as prompt testing, but the dataset is adversarial rather than canonical. Track success rates per category over time. Alert on regression.

Quarterly, do a human red-team session. Assemble 3-5 people, give them the product, two hours, and a prize for the most interesting failure. Humans find things automation doesn't, and the quarterly cadence keeps attack-pattern awareness fresh.

What to do with findings

Every reproducible failure becomes a case in your eval suite. The fix gets deployed with a regression test that would have caught it. This converts red-teaming from a periodic exercise into a living dataset that compounds over time. See the full pattern in our prompt testing post.

Red-teaming AI systems before your users do

What red-teaming is not

The six categories to cover

1. Prompt injection

2. Jailbreaks

3. Tool misuse

4. Data exfiltration

5. Hallucination probes

6. Cost denial of service

Automate most of it

What to do with findings

Continue the thread.

LLM security basics every team should know

Guardrails and validators: keeping LLM outputs safe

AI incident response playbook

Want to talk about this?