Open-source models have moved from 'useful for experimentation' to 'shippable for a surprising share of real workloads' over the last 18 months. Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek V2.5 all cross the quality bar for many production tasks, and smaller models like Llama 3.1 8B are now genuinely useful as workhorses for classification, extraction, and low-latency routing. This post is what we actually ship, from our client work.
The models we actually ship
Llama 3.3 70B — the workhorse
Our default open-weights model for general tasks. Quality is within striking distance of GPT-4o and Claude Sonnet on most non-reasoning benchmarks, and materially better than the model equivalents from 18 months ago. It runs on two H100s or four A100s, which makes it approachable operationally. Fine-tuning support is mature across the major frameworks.
Where it loses to closed models: nuanced instruction-following in long prompts, coding with complex tool use, and anything that benefits from explicit reasoning (see reasoning models). For 70% of production tasks, we can't tell the difference in blinded evals.
Qwen 2.5 — strong multilingual and coding
Qwen 2.5 72B is the model we reach for when multilingual quality matters. It beats Llama 3.3 meaningfully on Chinese, Japanese, Korean, and Arabic tasks in our evals. Qwen 2.5 Coder 32B is also genuinely good at code — within 2-3 points of Claude Sonnet on HumanEval-style benchmarks at significantly lower cost to self-host.
DeepSeek V2.5 / V3 — the cost killer
DeepSeek's mixture-of-experts architecture gives you frontier-ish quality at roughly 20% of the inference cost of dense 70B models. Trade-off: the serving stack is finicky and MoE routing can cause tail latency spikes. Worth the complexity if you have inference volume.
Llama 3.1 8B, Mistral Nemo — small workhorses
For classification, extraction, routing, and anything where quality needs are modest but volume is high, 7-12B models are transformative. A single A10 GPU or a T4 with tuning can serve 100+ requests/second. We use these extensively for the classifier in model routing, for structured extraction pipelines, and for on-device or edge deployments.
Llama 3.1 405B — rarely
Quality is genuine but inference cost is prohibitive for most production uses. When you need this tier, a frontier closed model usually ships faster and costs less. We've used 405B twice: once for a client with a hard on-premise constraint, once for a synthetic-data-generation pipeline where latency didn't matter.
When open beats closed
Three scenarios where open wins cleanly. (1) Privacy or data-residency requirements that rule out API calls. (2) Consistent high-volume workloads where the economics of self-hosting beat per-token pricing — typically above 100M tokens/day sustained. (3) Heavy fine-tuning for a specific domain where open-weights models let you fully retrain.
Scenarios where closed still wins: most agent and tool-use workflows (closed models are meaningfully better at complex tool orchestration), any reasoning-heavy task, anything where the ops overhead of self-hosting isn't justified by cost savings.
The operational reality
Self-hosting open models is more work than teams expect. Inference servers (vLLM, TensorRT-LLM, TGI) have sharp edges. Model updates require retesting your entire eval suite. Capacity planning for GPUs is harder than provisioning API tokens. None of this is a reason to avoid open models — just a reason to budget for it. A dedicated MLOps engineer's time is often the largest line in the 'self-host' TCO.
Our default recommendation for projects starting today: closed models via API, with a mental note that Llama or Qwen are the escape hatch if cost or privacy forces the issue. Revisit the decision every 6-12 months — the gap between open and closed narrows meaningfully each cycle.