eazyware
Opinion·May 12, 2025·9 min read

Benchmark gaming: when leaderboard numbers mislead

Contamination, overfitting, evaluation-leakage. How frontier benchmarks get inflated, and how to read through it.

Kushal R.
Engineering lead

Every week brings a new benchmark number: 'Our model beats GPT-4o on MMLU by 3 points.' 'State of the art on SWE-bench.' These headlines are marketing, and the underlying numbers are often gamed in specific, repeatable ways. This post covers five tactics we've observed vendors using to inflate benchmark scores, and the antidote: what your evaluation process needs to look like to see through them.

Five tactics
[Figure: the five benchmark-gaming tactics, each paired with its antidote. Takeaway: every published benchmark score is marketing first, measurement second. Trust your task-specific eval and treat published numbers as a floor filter, not a decision.]
Test contamination, cherry-picking, unrealistic settings, tuned prompts per task, and sandbagging competitors. All five show up regularly in vendor benchmark releases.

Tactic 1: Test set contamination

The benchmark's questions ended up in the training data. The model has memorized answers. The reported score reflects memorization, not capability. This is the most common and most insidious tactic because it can happen accidentally (benchmark data leaked to the training corpus) or intentionally (trained specifically to do well on a known benchmark).

Antidote: use private eval sets. Build your own benchmark from your domain and keep it private. Any public benchmark may be contaminated; your private set is your ground truth.
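One way to keep a private set honest is to check that its items have not already leaked into text a model could have trained on. The sketch below is a minimal illustration, not a method described above: it uses word n-gram overlap, a common contamination heuristic, and assumes you have collected some public documents to compare against (you will rarely have access to a vendor's actual training corpus). The function names, the n-gram length, and the threshold are all illustrative choices.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(eval_item: str, public_doc: str, n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also appear in `public_doc`."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words: nothing to compare
    return len(item_grams & ngrams(public_doc, n)) / len(item_grams)


def flag_leaked(eval_items, public_docs, n=8, threshold=0.5):
    """Return indices of eval items that overlap heavily with any document.

    `threshold` is an assumed cut-off, not an established standard.
    """
    doc_grams = [ngrams(doc, n) for doc in public_docs]
    flagged = []
    for i, item in enumerate(eval_items):
        grams = ngrams(item, n)
        if not grams:
            continue
        if any(len(grams & dg) / len(grams) >= threshold for dg in doc_grams):
            flagged.append(i)
    return flagged
```

A flagged item is only a hint that the text exists publicly. It does not prove a given model trained on it, and a clean result does not prove the opposite.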

Tactic 2: Cherry-picked sub-tests

The benchmark has 14 categories. The vendor reports the 4 they win on. The others don't make the announcement. The net message implies overall superiority while representing only partial data.

Antidote: always ask for the full suite. If the vendor shows you 4 categories out of 14, ask for the other 10. If they can't or won't produce them, assume those categories went poorly.
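Once you have the full suite, the arithmetic is simple: compare the average over the categories the vendor announced with the average over everything. The sketch below assumes you hold per-category scores in a dictionary. The function name and the returned keys are hypothetical.

```python
from statistics import mean


def summarize(scores: dict, reported: set) -> dict:
    """Compare the average of announced categories with the full-suite average.

    scores:   category name -> score, for every category in the benchmark
    reported: the subset of category names the vendor chose to announce
    """
    missing = reported - scores.keys()
    if missing:
        raise KeyError(f"reported categories not in suite: {sorted(missing)}")
    hidden = [s for name, s in scores.items() if name not in reported]
    return {
        "reported_avg": mean(scores[name] for name in reported),
        "full_avg": mean(scores.values()),
        # NaN when nothing was hidden, i.e. the vendor reported everything.
        "hidden_avg": mean(hidden) if hidden else float("nan"),
    }
```

A large gap between `reported_avg` and `full_avg` is the signature of a cherry-picked announcement.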

Tactic 3: Unrealistic test conditions

The benchmark is run in ideal conditions — tiny batch size, no latency cap, maximum-reasoning mode, best hardware. Your production workload is the opposite. The reported score bears little relationship to what you'd observe.

Antidote: ask about production settings. 'What is the p95 latency for this quality at batch size 32?' 'What's the cost per 1M tokens at this quality?' 'How does this degrade at higher concurrency?' Real answers show vendor transparency; evasion shows benchmark theater.
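If the vendor cannot answer those questions, you can measure them yourself. The following is a rough sketch, assuming `call` is your own wrapper around a model endpoint (a hypothetical function that takes a prompt and blocks until the response arrives) and that you can read total spend and token counts from your billing data. It uses a nearest-rank percentile and a thread pool to approximate concurrent load.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(values, pct: int):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(call, prompts, concurrency: int):
    """Time each request while `concurrency` requests run in parallel."""
    def timed(prompt):
        start = time.perf_counter()
        call(prompt)  # `call` is an assumed wrapper around the model API
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, prompts))


def cost_per_million_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Normalize observed spend to a per-1M-token figure."""
    return total_cost_usd * 1_000_000 / total_tokens
```

Running `measure_latencies` at several concurrency levels and comparing `percentile(latencies, 95)` across them shows how the model degrades under load, which is the number a single headline score hides.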

Tactic 4: Tuned prompts per task

The vendor uses a different, carefully tuned prompt for each benchmark. The prompts are not disclosed. Running the same model with a standard prompt produces noticeably worse results. The benchmark score reflects 'vendor's best prompt with weeks of tuning'; your results reflect 'reasonable prompt written in an afternoon.'

Antidote: require prompt disclosure, or run your own benchmarks with your own prompts. If a vendor claims quality X on benchmark Y, replicate the test with a standard prompt. In our experience, the gap between 'vendor's benchmark number' and 'your standard-prompt result' is often 5-15 points, which is a material difference.

Tactic 5: Sandbagging competitors

When a vendor publishes 'VendorA vs us' comparisons, the vendor's prompt for themselves is tuned; VendorA's prompt is deliberately basic. Or VendorA is run at default settings while the vendor uses optimal settings. The comparison looks like a head-to-head and isn't.

Antidote: never trust vendor-published comparisons. Run your own bake-off. Same prompts, same settings, same hardware conditions. The answer you get from your own test is far more trustworthy than any vendor comparison.

The antidote: build your own eval discipline

Our recommended process for any model evaluation:
(1) Curate 100-200 representative examples from your actual use case.
(2) Run the same prompt template across all models.
(3) Score on your metric: semantic judge, rule-based, human eval, whatever fits.
(4) Measure cost and latency under your production-like conditions.
(5) Document the methodology and make it reproducible internally.
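The process above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not a reference implementation: each model is represented by a hypothetical callable that takes a prompt string and returns a completion, the template text is a placeholder, and exact match stands in for whatever metric fits your task.

```python
import time
from typing import Callable

# One template shared by every model, so no candidate gets a tuned prompt.
# The wording here is a placeholder.
PROMPT_TEMPLATE = "Answer concisely.\n\nQuestion: {question}\nAnswer:"


def exact_match(output: str, expected: str) -> float:
    """Rule-based scorer; swap in a semantic judge or human labels as needed."""
    return float(output.strip().lower() == expected.strip().lower())


def evaluate(
    models: dict,
    examples: list,
    scorer: Callable[[str, str], float] = exact_match,
) -> dict:
    """Run every model on every example with the same prompt template.

    models:   name -> callable(prompt) -> completion (assumed API wrappers)
    examples: list of {"question": ..., "expected": ...} from your use case
    """
    results = {}
    for name, generate in models.items():
        scores, latencies = [], []
        for example in examples:
            prompt = PROMPT_TEMPLATE.format(question=example["question"])
            start = time.perf_counter()
            output = generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(scorer(output, example["expected"]))
        results[name] = {
            "accuracy": sum(scores) / len(scores),
            "mean_latency_s": sum(latencies) / len(latencies),
        }
    return results
```

Keeping the template, the examples, and the scorer in one place is what makes the comparison reproducible: anyone on the team can rerun `evaluate` with a new candidate and get numbers that line up with the earlier ones.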

This takes 1-3 days of engineering effort. It saves months of chasing benchmark numbers. More importantly, it produces a trustworthy answer to 'which model is best for us' — a question public benchmarks cannot answer.

Use vendor benchmarks as a floor filter

Not all benchmarks are worthless. They do tell you which models are in the general vicinity of each other. Use them to narrow a field of 20 candidates to a field of 3-4. Then run your own test to pick among the finalists. Benchmarks as a floor, your eval as the decision.
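As a sketch of the floor-filter idea, the function below drops every candidate whose published score falls under a floor and keeps the top few for your own eval. The function name, the floor value, and the shortlist size are assumptions to tune for your own situation.

```python
def shortlist(published: dict, floor: float, keep: int = 4) -> list:
    """Return up to `keep` model names whose published score meets the floor.

    published: model name -> published benchmark score
    The result is ordered from highest to lowest published score.
    """
    above = [(name, score) for name, score in published.items() if score >= floor]
    above.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in above[:keep]]
```

The published score only decides who enters the bake-off. The ranking among the finalists should come from your own eval, not from the order this function returns.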

The vendors know this distinction even when their marketing pretends otherwise. The technical teams at major AI labs run their own task-specific evals for real decisions; they don't make bet-the-company decisions on MMLU numbers. Your team shouldn't either.
