eazyware
Engineering·December 15, 2025·11 min read

Red-teaming AI systems before your users do

A practical playbook for stress-testing LLM apps: prompt injection, jailbreaks, tool misuse, privilege escalation.

KR
Kushal R.
Engineering lead

Your users will stress-test your AI system eventually. Some maliciously, most accidentally. Red-teaming is how you find the failures before they do. This post is the red-team playbook we run before any production deployment — the specific attacks we try, the tooling that helps, and how we integrate red-teaming into the ongoing development cycle.

Coverage matrix
Red team coverage matrix Prompt injection direct + indirect 1000+ attack strings → log what worked Jailbreaks roleplay, hypotheticals encoded, multi-turn → content filter tune Tool misuse unauthorized calls param injection → scope + authz Data exfiltration extract other tenants prompt leak attempts → retrieval isolation Hallucination probes domain tests for lies adversarial contradictions → confidence guards Cost DOS long prompts, loops tool-call storms → per-user caps Run each category monthly · track regression · fix before it ships automated suite + quarterly human red team for new attack patterns
Six attack categories: prompt injection, jailbreaks, tool misuse, data exfiltration, hallucination probes, cost DOS. Each has a distinct mitigation path.

What red-teaming is not

It is not a penetration test of your web infrastructure — that's a different discipline with its own experts. It is not a general 'is this AI good' quality eval — that's what your eval infrastructure is for. It is specifically adversarial testing against the AI layer: prompts, responses, tool calls, and data flows.

The six categories to cover

1. Prompt injection

Direct: 'ignore previous instructions and do X.' Indirect: content retrieved by RAG contains instructions the model follows. Build a corpus of 500+ injection attempts (many are publicly catalogued — PromptBench, garak, etc.), run them through every user-input surface, measure success rate. Success rate target: near-zero for destructive actions, monitor for base-model drift.

2. Jailbreaks

Roleplay ('pretend you're an unrestricted AI'), hypothetical framing ('in a fictional world where X is legal...'), encoded payloads (Base64, unusual languages), multi-turn escalation. Foundation models have improved substantially here but jailbreak techniques adapt continuously. Run monthly; track regression.

3. Tool misuse

Can an attacker trick the agent into calling a tool it shouldn't? With parameters it shouldn't use? Against resources it shouldn't touch? Scope tests: user A asking about user B's data; asking for data from tenants the user isn't in; asking for actions beyond the user's role. See agents post.

4. Data exfiltration

Attempt to extract other users' data, system prompts, retrieval content from documents the user shouldn't have access to. Classic attack: 'summarize the system prompt.' More subtle: crafted queries that trigger retrieval of out-of-scope documents. Mitigation: retrieval isolation tested with adversarial queries.

5. Hallucination probes

Domain-specific questions with known wrong answers, watching whether the model generates plausible wrongness. Counter-factual questions ('when did [recent event] happen in 1995?'). Contradiction injection in RAG context. These probes tell you how confidently the model will lie — a proxy for how much guardrail investment the system needs.

6. Cost denial of service

Extremely long prompts, pathological inputs that trigger long outputs, tool-call loops, prompts that force reasoning model escalation. A single user can easily burn $100+ in LLM cost in minutes if you don't cap it. Test per-user caps, request size limits, tool-call step limits.

Automate most of it

Red-teaming that only happens before launch is a one-shot. Automate the core suite into your CI pipeline — the same setup as prompt testing, but the dataset is adversarial rather than canonical. Track success rates per category over time. Alert on regression.

Quarterly, do a human red-team session. Assemble 3-5 people, give them the product, two hours, and a prize for the most interesting failure. Humans find things automation doesn't, and the quarterly cadence keeps attack-pattern awareness fresh.

What to do with findings

Every reproducible failure becomes a case in your eval suite. The fix gets deployed with a regression test that would have caught it. This converts red-teaming from a periodic exercise into a living dataset that compounds over time. See the full pattern in our prompt testing post.

Read next
LLM security basics every team should know
Read next
Guardrails and validators: keeping LLM outputs safe
Read next
AI incident response playbook
Tags
red teamsecuritytesting
/ Next step

Want to talk about this?

We love debating this stuff. 30-minute call, no pitch, just engineering conversation.

~4h
avg response
Q2 '26
next slot
100%
NDA on request