Most teams know they need to strip PII before sending data to a foundation model. Most teams underestimate how hard "strip PII" is in practice. Regex catches phone numbers but not names. NER catches names but generates noise. Structured fields are easy, but half your PII lives in free-text notes. This post is the end-to-end redaction pattern we deploy for clients in healthcare, fintech, and regulated SaaS.
What counts as PII depends on your regulator
In US healthcare (HIPAA), 18 identifier categories count as PHI. In the EU (GDPR), personal data is any information that could identify a natural person, directly or indirectly. Fintech adds account numbers, SSNs, and tax IDs. Start by writing down the exact list you need to handle. Don't rely on a vendor's default; their list is not your list.
At minimum, most projects need to handle: names (first, last, full), email addresses, phone numbers, physical addresses, dates of birth, national IDs (SSN, Aadhaar, etc.), account numbers, medical record numbers, device IDs, IP addresses. Domain-specific ones (patient IDs in healthcare, claim numbers in insurance) come on top.
The three layers
Layer 1: Pattern-based (regex)
Fast, deterministic, catches structured data. Emails, phones, SSNs, credit card numbers (Luhn-validated), IPv4/IPv6 addresses, dates in common formats, national IDs with known patterns. Build and test against your actual data — we've seen regex rulesets pass standard unit tests and then miss 40% of real records because of formatting quirks (phone numbers in 10 different formats, dates with unusual separators).
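A minimal sketch of this layer in Python, assuming only three identifier types (emails, US SSNs, card numbers). The patterns and the names `redact_patterns` and `luhn_valid` are illustrative, not taken from any library, and would need extending and testing against your own data.

```python
import re

# Illustrative patterns only; real data will need more formats than these.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Card candidates: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact_patterns(text: str) -> str:
    """Replace emails, SSNs, and Luhn-valid card numbers with type labels."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    # Only redact digit runs that pass Luhn, to avoid eating order IDs etc.
    text = CARD_RE.sub(
        lambda m: "[CARD]" if luhn_valid(m.group()) else m.group(), text
    )
    return text
```

The Luhn check is what keeps the card pattern from redacting every long digit run; a candidate that fails the checksum is left in place.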
Layer 2: NER (named entity recognition)
Catches names, organizations, locations, dates referenced in free text. Use a model trained on your domain if you can — spaCy's en_core_web_trf is a decent default; Presidio wraps spaCy plus regex into a redaction library that we use heavily. For higher precision, a small LLM specifically prompted for PII extraction is viable but slower.
Common gotcha: NER produces false positives. "Jordan" could be a person or a country. "Paris" is a city and a name. Over-aggressive redaction hurts downstream task quality, because the model needs that context to answer. The fix is to tune the confidence threshold on a real sample of your own content and accept some leakage of common ambiguous tokens rather than destroying usability.
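One way to make threshold tuning concrete is to score a hand-labeled review set at several cutoffs. The sketch below assumes you already have detector output with confidence scores and a human label per detection; the `Detection` type, the sample data, and the candidate thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str
    entity_type: str
    score: float   # detector confidence in [0, 1]
    is_pii: bool   # human label from your review set


# Hypothetical review set; in practice, label a sample of your own content.
REVIEW_SET = [
    Detection("John Smith", "PERSON", 0.95, True),
    Detection("Maria", "PERSON", 0.80, True),
    Detection("Jordan Lee", "PERSON", 0.70, True),
    Detection("Jordan", "PERSON", 0.55, False),  # the country
    Detection("Paris", "PERSON", 0.45, False),   # the city
]


def precision_recall(detections, threshold):
    """Precision and recall if we redact only detections scoring >= threshold.

    Recall here covers only entities the detector surfaced at all; PII it
    never detected has to be measured separately.
    """
    kept = [d for d in detections if d.score >= threshold]
    true_total = sum(d.is_pii for d in detections)
    true_kept = sum(d.is_pii for d in kept)
    precision = true_kept / len(kept) if kept else 1.0
    recall = true_kept / true_total if true_total else 1.0
    return precision, recall


def pick_threshold(detections, min_recall,
                   candidates=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Highest candidate threshold whose recall stays at or above min_recall."""
    passing = [t for t in candidates
               if precision_recall(detections, t)[1] >= min_recall]
    return max(passing) if passing else min(candidates)
```

Choosing the highest threshold that still meets a recall floor makes the trade-off explicit: you decide how much leakage is acceptable, then remove as many false positives as that budget allows.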
Layer 3: Reversible tokenization
When the response needs to reference the original data ("Send the contract to John"), you can't just strip it. You need to swap each PII span for a stable placeholder token ("[PERSON_1]") before the LLM call, store the placeholder-to-original mapping in a vault keyed to that request, and restore the originals in the response. Format-preserving tokenization (generating placeholders that look like the original, such as "J••• D•••" instead of "[PERSON_1]") helps when the model's output quality depends on stylistic matching.
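A rough sketch of the swap-and-restore step, assuming detection has already produced non-overlapping character spans. `TokenVault` and `redact_spans` are hypothetical names, and the in-memory dicts stand in for whatever per-request store you actually use.

```python
import re

# Matches placeholders of the form [PERSON_1], [EMAIL_12], and so on.
PLACEHOLDER_RE = re.compile(r"\[[A-Z_]+_\d+\]")


class TokenVault:
    """Per-request placeholder map (in-memory stand-in for a real vault)."""

    def __init__(self):
        self._by_value = {}   # (entity_type, original) -> placeholder
        self._by_token = {}   # placeholder -> original
        self._counters = {}   # entity_type -> last index used

    def tokenize(self, entity_type, original):
        """Return a stable placeholder: same value, same token."""
        key = (entity_type, original)
        if key not in self._by_value:
            n = self._counters.get(entity_type, 0) + 1
            self._counters[entity_type] = n
            token = f"[{entity_type}_{n}]"
            self._by_value[key] = token
            self._by_token[token] = original
        return self._by_value[key]

    def restore(self, text):
        """Swap known placeholders back; unknown ones are left untouched."""
        return PLACEHOLDER_RE.sub(
            lambda m: self._by_token.get(m.group(), m.group()), text
        )


def redact_spans(text, spans, vault):
    """Replace each (start, end, entity_type) span with its placeholder.

    Assumes spans do not overlap.
    """
    out, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        out.append(text[cursor:start])
        out.append(vault.tokenize(entity_type, text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Keeping the token stable for repeated values matters: if "John" became `[PERSON_1]` in one sentence and `[PERSON_2]` in the next, the model would treat them as two people.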
The three places this usually breaks
- Images and PDFs. Scanned documents contain PII as pixels. Text extraction is imperfect; your NER runs on whatever the OCR produces. Fix: run OCR, run your full redaction stack on the extracted text, then also redact the image regions corresponding to detected spans.
- Nested data. Picture a JSON payload with a notes field containing an email, which contains a reply thread, which contains addresses. Redaction that only processes top-level strings misses 60%+ of what is there. Fix: recursive traversal with type-aware handling.
- Logs and caches. A request is redacted going to the LLM, but the unredacted version is sitting in your request log, your LangSmith trace, your Redis cache. Treat your observability stack as in-scope — redact at the logger or at a middleware layer before any storage.
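The nested-data and logging fixes can be sketched together. This assumes a single `redact_text` function as the entry point to your redaction stack; here it only handles emails so the example stays short, and `redact_nested` and `RedactingFilter` are hypothetical names. Image-region redaction is not covered.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_text(text):
    """Stand-in for the full redaction stack; handles emails only."""
    return EMAIL_RE.sub("[EMAIL]", text)


def redact_nested(value):
    """Walk dicts, lists, and tuples, redacting every string found."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact_nested(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(redact_nested(item) for item in value)
    return value  # numbers, booleans, None pass through unchanged


class RedactingFilter(logging.Filter):
    """Redact the fully formatted log message before any handler stores it."""

    def filter(self, record):
        # Format first so PII passed via %-style args is caught too.
        record.msg = redact_text(record.getMessage())
        record.args = None
        return True
```

Attaching the filter with `handler.addFilter(RedactingFilter())` on each handler that writes to storage keeps the redaction in one place instead of relying on every call site to remember it.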
What we deploy in practice
For most clients: Presidio for layers 1+2, custom span extractors for domain-specific identifiers (patient IDs, claim numbers, whatever), a Redis-backed vault for the reversible tokenization map, keyed per-request and TTL'd aggressively. All of this sits as middleware between your business logic and the LLM client. The business code calls `llm.complete(prompt, context)` without caring about redaction; the middleware handles everything.
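As a simplified illustration of the middleware shape, the wrapper below exposes the same `complete(prompt, context)` call and handles redaction and restoration around it. It assumes both arguments and the response are plain strings, detects emails only, and uses an in-memory dict where a Redis-backed vault with a TTL would sit; `RedactingLLM` is a hypothetical name.

```python
import re
import uuid

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER_RE = re.compile(r"\[[A-Z]+_\d+\]")


class RedactingLLM:
    """Wraps any client exposing complete(prompt, context) -> str."""

    def __init__(self, client):
        self._client = client
        self._vaults = {}  # request_id -> {placeholder: original}

    def complete(self, prompt, context):
        request_id = uuid.uuid4().hex
        vault = self._vaults.setdefault(request_id, {})
        try:
            safe_prompt = self._redact(prompt, vault)
            safe_context = self._redact(context, vault)
            response = self._client.complete(safe_prompt, safe_context)
            return self._restore(response, vault)
        finally:
            # Drop the mapping as soon as the request finishes.
            self._vaults.pop(request_id, None)

    @staticmethod
    def _redact(text, vault):
        def swap(match):
            original = match.group()
            for token, value in vault.items():
                if value == original:
                    return token  # reuse the token for a repeated value
            token = f"[EMAIL_{len(vault) + 1}]"
            vault[token] = original
            return token
        return EMAIL_RE.sub(swap, text)

    @staticmethod
    def _restore(text, vault):
        return PLACEHOLDER_RE.sub(
            lambda m: vault.get(m.group(), m.group()), text
        )
```

Business code would construct `RedactingLLM(real_client)` once and call `complete` as before; the underlying client only ever sees placeholders.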
For healthcare specifically: we pair this with BAA-covered model providers and VPC-only networking. See the healthcare compliance post for the full pattern.