eazyware
Research·March 5, 2025·18 min read

We ran 200 LLMs through our eval suite. Here's what we learned.

Custom benchmarks on 200 open and closed LLMs across seven production tasks. Full data, surprising results.

Kushal R.
Engineering lead

Every week, a new benchmark. Another 5-point improvement on MMLU. A GPQA score that surpasses the last one. A coding benchmark where the top model hit 95%. The benchmark papers pile up faster than anyone can read them. For people building production AI, most of these numbers are noise. A few are signal. This post is the framework we use to separate the two.

[Figure: Quality vs cost, task-specific vs benchmark. Axes: task quality against cost per call ($), with ticks at 0, 0.001, 0.01, and 0.05. Labels: Haiku, Sonnet, Opus, frontier, and "your task + Sonnet + good RAG".]
Benchmark rank is a floor filter. Your task-specific measurement (shown in blue in the figure) decides the winner, not the leaderboard number.

What benchmarks measure (and don't)

Benchmarks measure specific capabilities on specific tasks. MMLU measures multiple-choice knowledge across 57 subjects. HumanEval measures Python function completion. GPQA measures graduate-level science reasoning in biology, physics, and chemistry. SWE-bench measures real-world software engineering tasks. Each is a narrow slice of what models can do.

What benchmarks don't measure: your specific use case, your specific data, your specific users. A model that scores 5 points higher on MMLU might be worse for your customer support copilot. Benchmark scores are necessary reading but insufficient decision criteria.

Benchmarks worth tracking in 2026

General capability

  • MMLU: broad knowledge. Saturating at the top but still useful to track.
  • GPQA: harder reasoning. More headroom, more discriminative in 2026.
  • MMMU: multimodal understanding. Still rapidly improving.

Code

  • SWE-bench Verified: actual open-source issues. Most realistic coding benchmark.
  • LiveCodeBench: competitive programming. Measures raw coding skill.
  • HumanEval: saturated, useful only as a sanity check.

Agentic / tool use

  • BFCL: tool use and function calling.
  • AgentBench: multi-step agentic tasks.
  • τ-bench: realistic customer service agent tasks.

RAG-relevant

  • MTEB: embedding model evaluations. Watch for domain-specific subsets.
  • BEIR: retrieval evaluations across datasets.
  • Ragas: an evaluation framework rather than a fixed leaderboard, covering end-to-end RAG evaluation. Directly applicable to production RAG.

How to use benchmarks

Use 1: Shortlist

Benchmarks narrow the candidate set. Before picking a model, filter to the top 5-10 on benchmarks relevant to your use case. This eliminates obviously worse options. Then, and this is the key, evaluate the shortlist on your own data, with your own eval suite.
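
The shortlisting step can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the `Candidate` class, the `shortlist` function, the model names, and every score below are hypothetical placeholders, and ranking by a plain mean across benchmarks is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    scores: dict[str, float]  # benchmark name -> published score (made-up values here)


def shortlist(candidates, relevant_benchmarks, top_k=5):
    """Rank candidates by mean score over the relevant benchmarks.

    Candidates missing any relevant benchmark are dropped rather than guessed at.
    """
    complete = [
        c for c in candidates
        if all(b in c.scores for b in relevant_benchmarks)
    ]
    ranked = sorted(
        complete,
        key=lambda c: sum(c.scores[b] for b in relevant_benchmarks) / len(relevant_benchmarks),
        reverse=True,
    )
    return [c.name for c in ranked[:top_k]]


# Hypothetical candidates; none of these numbers are real results.
candidates = [
    Candidate("model-a", {"swe_bench_verified": 60.0, "bfcl": 80.0}),
    Candidate("model-b", {"swe_bench_verified": 40.0, "bfcl": 90.0}),
    Candidate("model-c", {"swe_bench_verified": 30.0, "bfcl": 50.0}),
    Candidate("model-d", {"swe_bench_verified": 80.0}),  # no bfcl score, so excluded
]

print(shortlist(candidates, ["swe_bench_verified", "bfcl"], top_k=2))
```

The output of `shortlist` is only the input to the next step: the names it returns are the models you then run through your own eval suite.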

Use 2: Track progress

Benchmarks tell you when the frontier moves. If GPQA jumps from 50 to 70 in six months, something meaningful changed — worth re-evaluating your model choice. If benchmark scores hover flat, the frontier isn't moving, and your migration timeline can relax.

Use 3: Spot regressions

When a new model version ships, benchmarks flag unexpected regressions before you discover them in production. 'Gemini 3 dropped on coding benchmarks' is useful signal if you use Gemini for coding.
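
A regression check along these lines could be as simple as comparing two versions' scores and flagging large drops. The sketch below is an assumption about how one might wire it up: `flag_regressions`, the benchmark keys, and the scores are all invented for illustration, and the 5-point threshold mirrors the rule of thumb that smaller differences are usually noise.

```python
def flag_regressions(old_scores, new_scores, threshold=5.0):
    """Return {benchmark: delta} where the new version dropped by more than `threshold` points.

    Benchmarks missing from the new version are skipped, not treated as regressions.
    """
    flagged = {}
    for bench, old in old_scores.items():
        if bench not in new_scores:
            continue
        delta = new_scores[bench] - old
        if delta < -threshold:
            flagged[bench] = delta
    return flagged


# Hypothetical scores for two versions of the same model.
old = {"coding": 72.0, "tool_use": 65.0, "knowledge": 88.0}
new = {"coding": 64.0, "tool_use": 66.0, "knowledge": 86.0}

print(flag_regressions(old, new))  # only "coding" is flagged; the 2-point knowledge dip is ignored
```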

What to ignore

  • Vendor-published benchmarks on their own model. Always. Conflict of interest.
  • Benchmarks with small test sets (< 200 examples). Noisy.
  • Benchmarks suspected of being in training data. Saturated numbers, no signal.
  • Custom benchmarks that appeared for a single paper and never again.
  • Percentage-point differences under 5 on saturated benchmarks. Noise.
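
To see why small test sets and small score gaps are noisy, it helps to compute the sampling error on a reported accuracy. The sketch below uses the standard normal approximation for a binomial proportion and assumes test examples are independent; `margin_of_error` is a hypothetical helper name, and the 80% accuracy is an illustrative value, not a measured result.

```python
import math


def margin_of_error(accuracy, n_examples, z=1.96):
    """Half-width of an approximate 95% confidence interval for an accuracy
    measured on n_examples independent test items (normal approximation)."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_examples)


# Same hypothetical 80% accuracy, measured on a small and a larger test set.
for n in (200, 2000):
    print(n, round(100 * margin_of_error(0.80, n), 1), "points")
```

At 200 examples the interval is roughly plus or minus 5.5 points, so a 3-point gap between two models on such a benchmark cannot be distinguished from chance. At 2,000 examples it narrows to under 2 points.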

Build your own benchmark

The benchmark that actually matters is your eval suite against your data. It's the only benchmark specific to your use case. Build it, maintain it, run every candidate model through it. Public benchmarks inform your shortlist; your benchmark picks the winner.
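
A first version of such an eval suite does not need much machinery. The following is a minimal sketch under stated assumptions: `run_suite`, `pick_winner`, the stub models, and the two test cases are hypothetical stand-ins. In practice each model function would wrap a real API call and the cases would come from your own production data.

```python
from typing import Callable

# One eval case: an input drawn from your own data plus a pass/fail check on the output.
EvalCase = tuple[str, Callable[[str], bool]]


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose check passes for this model."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def pick_winner(models: dict[str, Callable[[str], str]], cases: list[EvalCase]):
    """Run every candidate through the same cases and return (best_name, all_results)."""
    results = {name: run_suite(fn, cases) for name, fn in models.items()}
    return max(results, key=results.get), results


# Stub models standing in for real API calls (hypothetical behavior).
def stub_a(prompt: str) -> str:
    return "Refund issued within 5 days."


def stub_b(prompt: str) -> str:
    return "Please contact support."


# Hypothetical cases; real ones would be sampled from your own traffic.
cases: list[EvalCase] = [
    ("Customer asks about refund timing", lambda out: "5 days" in out),
    ("Customer asks how to get a refund", lambda out: "refund" in out.lower()),
]

winner, results = pick_winner({"stub-a": stub_a, "stub-b": stub_b}, cases)
print(winner, results)
```

String-match checks like these are the crudest option. The same structure holds if a check is replaced by a rubric score or a human label, as long as every candidate sees the same cases.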

Closing

Benchmarks are useful when you know what they measure and don't overreach on what they imply. They're dangerous when treated as 'the model leaderboard.' Track a handful relevant to your use case, discount vendor-published numbers, and always finish with your own evals. If you're picking a model for production and want an outside view, we do that as part of most engagements.
