Every week, a new benchmark number. 'Our model beats GPT-4o on MMLU by 3 points.' 'State of the art on SWE-bench.' These headlines are marketing, and the underlying numbers are often gamed in specific, repeatable ways. This post covers the tactics we've observed vendors using to inflate benchmark scores, and the antidote: what your evaluation process needs to look like to see through them.

Five tactics

The five are test set contamination, cherry-picked sub-tests, unrealistic test conditions, per-task tuned prompts, and sandbagged competitors. All five show up regularly in vendor benchmark releases.

Tactic 1: Test set contamination

The benchmark's questions ended up in the training data, so the model has memorized the answers. The reported score reflects memorization, not capability. This is arguably the most common and most insidious tactic because it can happen accidentally (benchmark data leaked into the training corpus) or intentionally (the model was trained specifically to do well on a known benchmark).

Antidote: use private eval sets. Build your own benchmark from your domain and keep it private. Any public benchmark may be contaminated; your private set is your ground truth.
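A private set only stays useful if its items are not already sitting in public text. Below is a minimal sketch of one common heuristic, word n-gram overlap, for checking a candidate eval item against text you can actually search. The function names, the default of n = 8, and the idea of comparing against a local list of documents are assumptions for illustration. You generally cannot inspect a vendor's training corpus, so this only catches leaks into sources you hold.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(eval_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in any corpus document.

    Returns 0.0 when the item is shorter than n words (nothing to compare).
    """
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical usage: flag items that overlap heavily with searchable public text.
public_docs = ["the quick brown fox jumps over the lazy dog near the river bank today"]
candidate = "The quick brown fox jumps over the lazy dog near the river"
if overlap_fraction(candidate, public_docs) > 0.5:
    print("candidate item looks published already; replace it")
```

A high overlap is a reason to drop or rewrite the item. A low overlap is not proof that the item is clean.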

Tactic 2: Cherry-picked sub-tests

The benchmark has 14 categories. The vendor reports the 4 they win on. The others don't make the announcement. The announcement implies overall superiority while showing only part of the data.

Antidote: always ask for the full suite. If the vendor shows you 4 categories out of 14, ask for the other 10. If they can't or won't produce them, assume those categories went poorly.
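To see how much a reported subset can distort the picture, here is a small sketch that compares the average margin on the reported categories with the margin across the whole suite. Every category name and score below is invented for illustration, and the suite is cut to six categories to keep the example short.

```python
from statistics import mean

# Hypothetical per-category accuracy in percent. All numbers are made up.
ours = {"math": 80, "code": 75, "law": 60, "bio": 55, "history": 62, "physics": 58}
theirs = {"math": 76, "code": 71, "law": 68, "bio": 63, "history": 66, "physics": 64}
reported = ["math", "code"]  # the categories that made the announcement


def summarize(ours: dict, theirs: dict, reported: list) -> dict:
    """Compare the margin on reported categories with the full-suite margin."""
    return {
        "reported_margin": mean(ours[c] - theirs[c] for c in reported),
        "full_margin": mean(ours[c] - theirs[c] for c in ours),
        "categories_won": sum(ours[c] > theirs[c] for c in ours),
        "categories_total": len(ours),
        "unreported": sorted(set(ours) - set(reported)),
    }


summary = summarize(ours, theirs, reported)
print(summary)
# With these invented numbers the reported margin is +4 points,
# while the full-suite margin is -3 points.
```

The same model looks 4 points ahead on the announced categories and 3 points behind on the full suite, which is why the unreported categories matter.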

Tactic 3: Unrealistic test conditions

The benchmark is run in ideal conditions — tiny batch size, no latency cap, maximum-reasoning mode, best hardware. Your production workload is the opposite. The reported score bears little relationship to what you'd observe.

Antidote: ask about production settings. 'What is the p95 latency for this quality at batch size 32?' 'What's the cost per 1M tokens at this quality?' 'How does this degrade at higher concurrency?' Real answers show vendor transparency; evasion shows benchmark theater.
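If the vendor cannot answer those questions, you can measure the basics yourself. The sketch below computes a nearest-rank p95 latency and a cost per 1M tokens from your own measurements. The helper names are hypothetical, and `call` stands in for whatever client function you use to query a model; it is not any real vendor API.

```python
import math
import time
from typing import Callable


def measure_latencies(call: Callable[[str], object], prompts: list[str]) -> list[float]:
    """Time each request in seconds. `call` is a placeholder for your own client."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def cost_per_million_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Normalize observed spend to a per-1M-token figure."""
    return total_cost_usd * 1_000_000 / total_tokens


# Hypothetical usage with a stand-in for a real model call.
latencies = measure_latencies(lambda p: p.upper(), ["prompt one", "prompt two"])
print("p95 latency (s):", percentile(latencies, 95))
print("USD per 1M tokens:", cost_per_million_tokens(3.0, 1_500_000))
```

This loop issues requests one at a time. To answer the concurrency question you would need to send requests in parallel at your expected load, which this sketch does not do.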

Tactic 4: Tuned prompts per task

The vendor uses a different, carefully-tuned prompt for each benchmark. The prompts are not disclosed. Running the same model with a standard prompt produces noticeably worse results. The benchmark score reflects 'vendor's best prompt with weeks of tuning'; your results reflect 'reasonable prompt written in an afternoon.'

Antidote: require prompt disclosure, or run your own benchmarks with your own prompts. If a vendor claims quality X on benchmark Y, replicate the test with a standard prompt. The gap between 'vendor's benchmark number' and 'your standard-prompt result' is often 5-15 points — a material difference.

Tactic 5: Sandbagging competitors

When a vendor publishes 'VendorA vs us' comparisons, the vendor's prompt for themselves is tuned; VendorA's prompt is deliberately basic. Or VendorA is run at default settings while the vendor uses optimal settings. The comparison looks like a head-to-head and isn't.

Antidote: never trust vendor-published comparisons. Run your own bake-off. Same prompts, same settings, same hardware conditions. The answer you get from your own test is far more trustworthy than any vendor comparison, because you control both sides of it.
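A minimal sketch of such a bake-off, assuming each model can be wrapped as a function that takes a prompt and a shared settings object. The `RunConfig` class, the `Model` type, and the exact-match scoring are all illustrative assumptions, not any vendor's interface. The point is structural: one frozen config and one prompt template are used for every model, so nobody gets a tuned prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunConfig:
    """Settings shared by every model in the comparison (hypothetical fields)."""
    prompt_template: str
    temperature: float = 0.0
    max_tokens: int = 256


# Any callable taking (prompt, config) and returning text. Wrap real clients to fit.
Model = Callable[[str, RunConfig], str]


def bake_off(models: dict[str, Model],
             examples: list[tuple[str, str]],
             config: RunConfig) -> dict[str, float]:
    """Run every model on the same prompts and settings; return accuracy per model."""
    scores = {}
    for name, model in models.items():
        correct = 0
        for question, expected in examples:
            prompt = config.prompt_template.format(question=question)
            answer = model(prompt, config)
            correct += answer.strip().lower() == expected.strip().lower()
        scores[name] = correct / len(examples)
    return scores


# Hypothetical usage with stub models standing in for real clients.
config = RunConfig(prompt_template="Answer briefly: {question}")
examples = [("What is 2+2?", "4"), ("What is the capital of France?", "Paris")]
stubs = {
    "model_a": lambda prompt, cfg: "4" if "2+2" in prompt else "unknown",
    "model_b": lambda prompt, cfg: "4" if "2+2" in prompt else "paris",
}
print(bake_off(stubs, examples, config))
```

Exact match is only a placeholder metric. Swap in whatever scoring fits your task, as long as it is the same for every model.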

The antidote: build your own eval discipline

Our recommended process for any model evaluation: (1) Curate 100-200 representative examples from your actual use case. (2) Run the same prompt template across all models. (3) Score on your metric: semantic judge, rule-based, human eval, whatever fits. (4) Measure cost and latency under production-like conditions. (5) Document the methodology and make it reproducible internally.

This takes 1-3 days of engineering effort. It saves months of chasing benchmark numbers. More importantly, it produces a trustworthy answer to 'which model is best for us' — a question public benchmarks cannot answer.
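The five steps above can be sketched as a single function. This is one possible shape under stated assumptions, not a reference implementation: `run_eval`, the example record layout (`id`, `inputs`, `expected`), and the fingerprint field are all hypothetical. The fingerprint is a SHA-256 hash of the prompt template and examples, so two result files can be checked for having used the same methodology.

```python
import hashlib
import json
import time
from typing import Callable


def run_eval(model_name: str,
             model: Callable[[str], str],
             examples: list[dict],
             prompt_template: str,
             score: Callable[[str, str], float]) -> dict:
    """Run one model over the curated examples and return a reproducible record."""
    results = []
    for ex in examples:
        prompt = prompt_template.format(**ex["inputs"])  # step 2: same template for all
        start = time.perf_counter()
        output = model(prompt)
        latency = time.perf_counter() - start            # step 4: measured, not quoted
        results.append({
            "id": ex["id"],
            "score": score(output, ex["expected"]),      # step 3: your metric
            "latency_s": latency,
        })
    # Step 5: fingerprint the template and examples so runs can be compared later.
    payload = json.dumps({"template": prompt_template, "examples": examples},
                         sort_keys=True).encode("utf-8")
    return {
        "model": model_name,
        "n": len(results),
        "mean_score": sum(r["score"] for r in results) / len(results),
        "max_latency_s": max(r["latency_s"] for r in results),
        "methodology_sha256": hashlib.sha256(payload).hexdigest(),
        "results": results,
    }


# Hypothetical usage: two curated examples (step 1) and a stub model.
examples = [
    {"id": "ex1", "inputs": {"ticket": "refund request"}, "expected": "billing"},
    {"id": "ex2", "inputs": {"ticket": "password reset"}, "expected": "account"},
]
template = "Classify this support ticket: {ticket}"
exact = lambda out, exp: 1.0 if out.strip().lower() == exp else 0.0
record = run_eval("stub", lambda prompt: "billing", examples, template, exact)
print(record["mean_score"], record["methodology_sha256"][:12])
```

Writing each record to disk alongside the examples file gives you the internal reproducibility the last step asks for.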

Use vendor benchmarks as a floor filter

Not all benchmarks are worthless. They do tell you which models are in the general vicinity of each other. Use them to narrow a field of 20 candidates to a field of 3-4. Then run your own test to pick among the finalists. Benchmarks as a floor, your eval as the decision.
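A small sketch of the floor-filter idea, assuming you have one public score per candidate. The model names, scores, and threshold are invented. The function deliberately returns the survivors in alphabetical order rather than ranked, because the public number decides who gets tested, not who wins.

```python
def shortlist(public_scores: dict[str, float], floor: float) -> list[str]:
    """Return candidates whose public score clears the floor, in alphabetical order."""
    return sorted(name for name, s in public_scores.items() if s >= floor)


# Hypothetical public scores for a handful of candidates. All values are made up.
public_scores = {
    "model_a": 81.2,
    "model_b": 79.5,
    "model_c": 64.0,
    "model_d": 83.1,
    "model_e": 58.7,
}
finalists = shortlist(public_scores, floor=75.0)
print(finalists)  # these go into your own eval; the rest are dropped
```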

The vendors know this distinction even when their marketing pretends otherwise. The technical teams at major AI labs run their own task-specific evals for real decisions; they don't make bet-the-company decisions on MMLU numbers. Your team shouldn't either.

Benchmark gaming: when leaderboard numbers mislead
