Prompts are code. They have inputs, outputs, edge cases, and regressions. Yet most teams write prompts the way they write Slack messages: casually, iteratively, and without tests. The cost shows up later as brittle production behavior, undetected regressions, and debugging cycles that eat weeks. This post describes how we test prompts like the software they are.

Figure: CI pipeline. Every PR runs the prompt suite in parallel, scores against baseline, and gates merge on regression. The dataset grows from every production issue.

Prompts are code

A production prompt is a function. Inputs: user query, retrieved context, conversation history, system state. Outputs: model response. A prompt change is a code change. A prompt change without testing is pushing to production without running tests. Framed this way, the testing practices we need are just the software engineering practices we already have, applied to a new artifact type.
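To make the "prompt as a function" framing concrete, here is a minimal sketch. The field names, the `render_prompt` helper, and the template text are all hypothetical and chosen for illustration; the only thing taken from the text above is the list of four inputs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptInputs:
    """The four inputs named above; field names are illustrative assumptions."""
    user_query: str
    retrieved_context: list[str] = field(default_factory=list)
    conversation_history: list[str] = field(default_factory=list)
    system_state: dict[str, str] = field(default_factory=dict)


def render_prompt(inputs: PromptInputs) -> str:
    """Pure function: the same inputs always yield the same prompt text."""
    context = "\n".join(f"- {c}" for c in inputs.retrieved_context) or "(none)"
    history = "\n".join(inputs.conversation_history) or "(none)"
    # Sort keys so dict ordering never changes the rendered prompt.
    state = ", ".join(
        f"{k}={v}" for k, v in sorted(inputs.system_state.items())
    ) or "(none)"
    return (
        "You are a support assistant.\n"  # hypothetical system instruction
        f"State: {state}\n"
        f"Context:\n{context}\n"
        f"History:\n{history}\n"
        f"User: {inputs.user_query}"
    )
```

Because rendering is deterministic, a change to the template shows up as a plain diff in the rendered output, separate from any variation in the model's response.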

What to test

Four categories of prompt tests:

Correctness: given a known input, does the output meet the spec? (exact match for structured outputs, semantic similarity for free-form, LLM-as-judge for nuanced cases)
Regression: do previously fixed bugs stay fixed? Each fixed bug becomes a permanent test case.
Edge cases: empty inputs, very long inputs, adversarial inputs, inputs in unexpected languages. All should be handled gracefully.
Consistency: the same input run multiple times should produce similar-enough outputs. Temperature 0 reduces variance but does not guarantee identical outputs, so measure consistency across repeated runs instead of assuming it.
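Two of these categories can be checked with very little code. The sketch below shows an exact-match scorer for structured (JSON) output and a consistency score over repeated runs. The function names are hypothetical, and `difflib` measures lexical overlap only: it is used here as a cheap stand-in, not as the semantic similarity or LLM-as-judge scoring mentioned above.

```python
import json
from difflib import SequenceMatcher
from itertools import combinations


def exact_json_match(output: str, expected: dict) -> bool:
    """Correctness for structured output: parse, then compare exactly."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False  # unparseable output fails the test


def lexical_similarity(a: str, b: str) -> float:
    """Character-level overlap in [0, 1]; a stand-in, not semantic similarity."""
    return SequenceMatcher(None, a, b).ratio()


def consistency(outputs: list[str]) -> float:
    """Lowest pairwise similarity across repeated runs of the same input."""
    if len(outputs) < 2:
        return 1.0
    return min(lexical_similarity(a, b) for a, b in combinations(outputs, 2))
```

Taking the minimum pairwise score rather than the mean is a design choice: a single outlier run is exactly what a consistency test should surface.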

Test structure

Each test is a triple: (input, spec, scorer). Input is the full prompt context. Spec is what the output should look like — not necessarily the exact output, but the acceptance criteria. Scorer is the function that judges whether the actual output meets the spec. See our eval infrastructure post for deep coverage of scorer types.
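As a rough illustration, the triple maps directly onto a small data structure. Everything named here (`PromptTest`, `contains_all`, `run_test`, the `must_contain` spec key, and the fake model) is an assumption made for the sketch, not the API of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

# A scorer takes the actual output and the spec, and returns pass/fail.
Scorer = Callable[[str, dict], bool]


@dataclass(frozen=True)
class PromptTest:
    name: str
    input: dict      # full prompt context: query, retrieved docs, history
    spec: dict       # acceptance criteria, not the exact expected output
    scorer: Scorer   # judges the actual output against the spec


def contains_all(output: str, spec: dict) -> bool:
    """Deterministic scorer: every required phrase must appear."""
    return all(p.lower() in output.lower() for p in spec["must_contain"])


def run_test(test: PromptTest, model: Callable[[dict], str]) -> bool:
    """Call the model on the test input, then score the result."""
    return test.scorer(model(test.input), test.spec)


def fake_model(prompt_input: dict) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return "Refunds are processed within 5 business days."


refund_test = PromptTest(
    name="refund-window",
    input={"user_query": "How long do refunds take?"},
    spec={"must_contain": ["refund", "5 business days"]},
    scorer=contains_all,
)
```

Keeping the scorer as a field on the test means one suite can mix exact-match, similarity, and judge-based cases without a separate runner for each.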

CI integration

Run prompt tests in CI on every PR. Gate merges on pass rate. This forces discipline: engineers can't sneak prompt changes through without running them against the test suite. The cost is LLM calls on every PR run: for a 200-case suite with GPT-4o-mini scorers, about $1-$5 per PR, which is cheap compared to the cost of a regression.
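A merge gate can be as small as a function that turns a pass rate into a process exit code. This is a sketch under stated assumptions: the `gate` function, the baseline value, and the tolerance parameter are hypothetical, and how results are collected depends on your test runner.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list[bool], baseline: float, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 blocks it."""
    rate = pass_rate(results)
    ok = rate >= baseline - tolerance
    status = "PASS" if ok else "FAIL"
    print(f"{status}: pass rate {rate:.1%} vs baseline {baseline:.1%}")
    return 0 if ok else 1


# In a CI script this would typically end with:
#   sys.exit(gate(results, baseline=0.97))
```

Treating an empty result list as a failure is deliberate: a suite that silently ran zero cases should not be able to pass the gate.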

Parallelize aggressively: sequential test runs take 10-20 minutes, parallel runs take 2-3 minutes. Once waits exceed 5 minutes, engineers start skipping tests, which defeats the purpose.
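Prompt tests spend most of their time waiting on network calls, so a thread pool is usually enough to parallelize them. The sketch below uses Python's standard `concurrent.futures`. The `run_suite` name, the worker count, and the fake case runner are assumptions, and a real suite would also need to respect the provider's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_suite(
    cases: list[str],
    run_case: Callable[[str], bool],
    max_workers: int = 16,
) -> list[bool]:
    """Run every case concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))


def fake_run_case(case: str) -> bool:
    """Stand-in for one model call plus scoring (hypothetical)."""
    time.sleep(0.05)  # simulate network latency
    return "fail" not in case
```

`pool.map` preserves input order, so each result can be matched back to its case by index.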

Version prompts like code

Prompts go in git, not a Notion doc. Version every change. Tag releases. When a regression hits production, git bisect the prompt history just like you would code. Without versioning, debugging is archaeology — with versioning, it's a diff.
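Bisecting can be automated with `git bisect run`, which checks out each candidate commit and runs a command: exit code 0 marks the commit good, 125 tells git to skip it, and other codes from 1 to 127 mark it bad. A minimal sketch of the exit-code logic follows. The function name and the threshold are hypothetical, and the script would still need to run your own suite to obtain the counts.

```python
def bisect_exit_code(passed: int, total: int, threshold: float) -> int:
    """Map suite results at one commit onto git bisect run's exit codes."""
    if total == 0:
        return 125  # nothing could be tested here: ask git bisect to skip
    return 0 if passed / total >= threshold else 1


# Hypothetical usage, after `git bisect start <bad> <good>`:
#   git bisect run python check_prompt.py
# where check_prompt.py runs the suite and ends with
#   sys.exit(bisect_exit_code(passed, total, threshold=0.97))
```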

A/B testing prompts in production

Eval tests prevent known regressions. A/B testing evaluates whether a prompt change is actually better in the real world. Route a percentage of traffic to the new prompt, measure downstream signals (user satisfaction, task completion, escalation rate), and ship if metrics improve or stay flat.
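One common way to route a fixed share of traffic is deterministic hash bucketing, so a given user always sees the same prompt variant. This is a sketch of that technique, not a description of our production router. The function name, the experiment key, and the variant labels are assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps buckets independent
    # across experiments for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"
```

For example, `assign_variant(user_id, "summary-prompt-v2", 0.10)` would send roughly 10% of users to the new prompt, and ramping up is a change to one number.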

Which prompts deserve A/B testing: high-volume prompts where a small quality delta compounds across many users; prompts where eval scores are ambiguous; prompts touching revenue-generating flows. Don't A/B test every minor prompt tweak — it's too slow.

Tools

Promptfoo: CLI-focused, great for CI integration, open-source. Our default for CI tests.
Langfuse: web UI for prompt management plus test runs.
Braintrust: excellent for comparison views between prompt versions.
OpenAI Evals: serviceable if you're all-in on OpenAI.

Common mistakes

Tests only on the happy path. Regressions hide in the edge cases the suite doesn't cover.
LLM-as-judge only. Judge scores are noisy; supplement them with deterministic scorers where possible.
No gate. Tests that don't block merges are aspirational, not operational.
Stale test data. The suite must grow with production. Every bug adds a case.
Testing prompts in isolation. Integration tests (prompt + retrieval + post-processing) catch problems unit tests miss.
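One way to act on the LLM-as-judge point is to run cheap deterministic checks first and call the judge only on outputs that survive them. The sketch below assumes JSON output. The `layered_score` name and the `judge` callable are hypothetical placeholders for whatever judge you use.

```python
import json
from typing import Callable


def layered_score(
    output: str,
    required_keys: set[str],
    judge: Callable[[str], bool],
) -> bool:
    """Deterministic checks first; the noisy judge runs only if they pass."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # malformed output never reaches the judge
    if not isinstance(data, dict) or not required_keys.issubset(data.keys()):
        return False  # structurally wrong: fail without spending a judge call
    return judge(output)
```

Failures caught by the deterministic layer are reproducible and cost nothing, which leaves the judge's budget and noise for the cases that actually need nuance.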

Closing

Prompt testing feels like overhead until the first regression catches you. After that it feels like the obvious minimum. Build the test suite early, gate deploys on it, grow it with every production issue. The discipline costs 10% of your AI engineering time and saves 50% of the firefighting.
