LLM quantization — reducing the precision of model weights from 16-bit floats to 8-bit or 4-bit integers — has made self-hosting practical for teams that couldn't afford it in 2022. A 70B model in bf16 needs 140GB+ of GPU memory; quantized to 4-bit, it fits in 40GB. Quality loss is often imperceptible for production tasks. This post is the practical guide to the quantization formats in 2026 and how to pick among them.
Why quantize
Memory footprint. The weights of a 70B model take roughly 140GB at bf16, 70GB at int8, and 35GB at int4, before KV cache and activations are counted. The difference is 'needs an H100 cluster' vs 'fits on a single GPU.'
Throughput. Smaller weights mean fewer bytes read from GPU memory per forward pass. For memory-bound workloads (which most LLM inference is), this translates to faster inference.
Economics. Running a quantized model on cheaper hardware cuts inference cost substantially. 4-bit models on consumer GPUs are feasible for many workloads.
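The memory and throughput arithmetic above is simple enough to script. The sketch below counts weights only (no KV cache, activations, or quantization scales), and the bandwidth figure is a hypothetical placeholder rather than the spec of any particular GPU. The tokens-per-second number is a rough ceiling for single-stream decoding under the assumption that every generated token reads all weights once; real throughput will be lower.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def decode_tokens_per_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling for single-stream decoding, assuming each token
    requires reading every weight once from GPU memory."""
    return bandwidth_gb_s / weights_gb


N_PARAMS = 70e9
ASSUMED_BANDWIDTH_GB_S = 2000.0  # hypothetical figure, not a specific GPU

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    ceiling = decode_tokens_per_s_ceiling(ASSUMED_BANDWIDTH_GB_S, gb)
    print(f"{name}: {gb:.0f} GB weights, at most {ceiling:.1f} tokens/s per stream")
```

Swap in your own parameter count and your GPU's measured bandwidth to get a first estimate before benchmarking anything.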
Int8 — safe default
8-bit integer quantization. 2x memory reduction. Quality loss typically below 1% on most benchmarks for modern LLMs.
Tooling: LLM.int8() (Tim Dettmers, implemented in bitsandbytes) and SmoothQuant. Broadly supported.
The safe choice. If you're unsure what to pick, start with int8. Quality is essentially preserved; memory and throughput improve.
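To make the idea concrete, here is a minimal sketch of symmetric absmax quantization to int8 in plain Python. This is the simplest textbook scheme, not the LLM.int8() or SmoothQuant algorithm; those add handling for outlier values that this sketch ignores. The function names and the sample weights are made up for illustration.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # illustrative values only
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max abs error = {max_error:.4f}")
```

The rounding error per weight is at most half of one quantization step (scale / 2), which is why 8 bits usually leaves quality close to the original.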
Int4 — the sweet spot for many
4-bit quantization. 4x memory reduction. Quality loss 1-5% depending on the method and model.
Key variants: GPTQ (group-wise quantization with second-order optimization), AWQ (activation-aware weight quantization, often higher quality than GPTQ), and framework-specific INT4 formats.
Sweet spot for self-hosting. Fits large models on consumer or single-GPU setups. Quality loss is acceptable for most production tasks when paired with proper eval.
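The group-wise idea can be sketched in a few lines: split the weights into small groups and give each group its own scale, so one large value only costs precision inside its own group. The code below uses naive round-to-nearest; it is not GPTQ or AWQ, which differ in how they choose the rounded values and scales. The group size of 4 is only to keep the example readable, and the 16-bit scale is an assumption.

```python
def quantize_groupwise_int4(weights, group_size=4):
    """Round-to-nearest 4-bit quantization with one scale per group.

    Returns a list of (codes, scale) pairs; codes are integers in [-8, 7].
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((codes, scale))
    return groups


def dequantize_groups(groups):
    """Flatten the groups back into approximate float weights."""
    return [code * scale for codes, scale in groups for code in codes]


def effective_bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Storage cost once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size


# Illustrative values: one group of small weights, one group of large ones.
weights = [0.02, -0.01, 0.03, 0.015, 0.9, -0.7, 0.4, 0.1]
groups = quantize_groupwise_int4(weights, group_size=4)
restored = dequantize_groups(groups)
print(effective_bits_per_weight(4, 128))  # 4.125
```

Because of the per-group scales, a "4-bit" model stores slightly more than 4 bits per weight, so real file sizes come out a little above the nominal parameter count times 4 bits.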
Int2 and below — research territory
1-bit and 2-bit quantization (BitNet, QuIP#, AQLM). Extreme compression; research-active. Quality loss is significant but shrinking with new techniques.
Not yet standard in production. Watch for maturation over 2026. Currently reserved for experiments and edge deployments where memory is the binding constraint.
Deployment tooling
vLLM. Supports int8, int4 (GPTQ, AWQ), GGUF. For most self-hosting, vLLM is the default serving stack.
llama.cpp / GGUF. For CPU and memory-constrained deployments. The GGUF format supports many quantization levels (for example Q4_K_M or Q8_0). Each file holds one quantized variant, so switching levels means swapping files.
TensorRT-LLM. NVIDIA's optimized stack. Best performance on NVIDIA hardware. More complex to set up than vLLM.
ExLlamaV2. For single-user chat workloads. Fast on consumer GPUs.
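For the vLLM route, loading a pre-quantized checkpoint looks roughly like the sketch below. It assumes vLLM is installed, a supported GPU is present, and the checkpoint was already quantized with AWQ or GPTQ. The helper names and the model id are hypothetical, and the two-entry allow-list is a restriction of this sketch, not the full set of methods vLLM accepts.

```python
def vllm_engine_kwargs(model_id: str, method: str) -> dict:
    """Build keyword arguments for vllm.LLM for a pre-quantized checkpoint."""
    supported = {"awq", "gptq"}  # limited on purpose for this sketch
    if method not in supported:
        raise ValueError(f"expected one of {sorted(supported)}, got {method!r}")
    return {"model": model_id, "quantization": method}


def generate_once(model_id: str, method: str, prompt: str) -> str:
    """Load the model and return one greedy completion."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**vllm_engine_kwargs(model_id, method))
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


# Example call (placeholder model id, not a real checkpoint):
# generate_once("your-org/your-model-awq", "awq", "Summarize this ticket: ...")
```

Keeping the engine arguments in one small function makes it easy to run the same eval against several quantized variants by changing only the model id and method.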
Evaluating quantization quality
Don't trust generic benchmarks alone. A model's benchmark scores with int4 vs bf16 might look close, but your specific task may see larger regression.
Run your eval suite against both the unquantized and quantized versions. Common result: int8 is essentially identical; int4 is slightly worse on complex tasks, close on simple tasks.
See the eval infrastructure post. This is one of the clearest use cases for a robust eval pipeline.
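A side-by-side comparison does not need much machinery. The sketch below scores two models on the same examples and reports the accuracy drop per category, which is where the "close on simple, worse on complex" pattern shows up. Exact-match scoring, the example format, and the stand-in models are all assumptions for illustration; replace them with your real inference calls and your own scoring.

```python
from collections import defaultdict


def accuracy_by_category(model_fn, examples):
    """examples: dicts with 'prompt', 'expected', and 'category' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if model_fn(ex["prompt"]).strip() == ex["expected"]:
            correct[ex["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


def regression_report(baseline_fn, quantized_fn, examples):
    """Per-category accuracy drop (positive means the quantized model is worse)."""
    base = accuracy_by_category(baseline_fn, examples)
    quant = accuracy_by_category(quantized_fn, examples)
    return {cat: base[cat] - quant[cat] for cat in base}


# Stand-in data and models for illustration only.
examples = [
    {"prompt": "2+2", "expected": "4", "category": "simple"},
    {"prompt": "3+5", "expected": "8", "category": "simple"},
    {"prompt": "17*23", "expected": "391", "category": "complex"},
    {"prompt": "12*12", "expected": "144", "category": "complex"},
]
answers = {ex["prompt"]: ex["expected"] for ex in examples}


def baseline(prompt):
    return answers[prompt]


def quantized(prompt):
    # Pretend the quantized model slips on one harder example.
    return "390" if prompt == "17*23" else answers[prompt]


print(regression_report(baseline, quantized, examples))
# {'simple': 0.0, 'complex': 0.5}
```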
When to self-host quantized models
Cost. When API costs exceed $5-10K/month and you have someone capable of running GPU infrastructure. Below that, API providers win on convenience and total cost of ownership. Above that, self-hosted quantized models can halve costs.
Privacy or compliance requirements. Data can't leave your infrastructure. Self-hosted is mandatory; quantization makes it affordable.
Specialized domains with fine-tuned models. A model you fine-tuned yourself is generally not available from an API provider. You host it yourself and quantize for economics.
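For the cost criterion, a back-of-the-envelope comparison is worth writing down before committing. Every number in the sketch below is a hypothetical placeholder (GPU price, GPU count, and the monthly cost of the engineering time spent on operations); plug in your own quotes and salaries.

```python
def monthly_self_host_cost(gpu_hourly_usd, n_gpus, ops_monthly_usd, hours=730):
    """GPU rental for a month plus the engineering time spent running it."""
    return gpu_hourly_usd * n_gpus * hours + ops_monthly_usd


def monthly_saving(api_monthly_usd, self_host_monthly_usd):
    """Positive result means self-hosting is cheaper than the API bill."""
    return api_monthly_usd - self_host_monthly_usd


# Hypothetical inputs, not real prices.
cost = monthly_self_host_cost(gpu_hourly_usd=2.0, n_gpus=1, ops_monthly_usd=3000)
for api_bill in (2_000, 5_000, 10_000):
    saving = monthly_saving(api_bill, cost)
    verdict = "self-host" if saving > 0 else "stay on the API"
    print(f"API bill ${api_bill}: saving ${saving:.0f}/month -> {verdict}")
```

The operations line item is the one teams tend to leave out, and it is usually what decides the outcome at lower API bills.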
Gotchas
Calibration data matters. GPTQ and AWQ require calibration on representative data. Generic calibration data can hurt quality on your specific task; use your own data.
Not all models quantize equally. Some architectures suffer more quality loss at low-bit than others. Llama-family models are well-supported; less common architectures may not be.
Ongoing maintenance. New quantization methods ship regularly. Periodic re-benchmarking on your eval set catches opportunities to improve.
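On the calibration point, a small, reproducible sample of your own prompts is a reasonable starting place. The sketch below only builds that sample; how it is passed to a given GPTQ or AWQ tool varies by library and is not shown. The sample size of 128, the minimum length, and the function name are assumptions for illustration.

```python
import random


def build_calibration_set(prompts, n_samples=128, min_chars=20, seed=0):
    """Pick a reproducible, de-duplicated sample of in-domain prompts."""
    # Strip whitespace, drop very short prompts, and remove duplicates.
    unique = sorted({p.strip() for p in prompts if len(p.strip()) >= min_chars})
    if len(unique) <= n_samples:
        return unique
    # A fixed seed keeps the sample stable across re-quantization runs.
    return random.Random(seed).sample(unique, n_samples)


# Hypothetical stand-in for prompts pulled from your own logs.
logged_prompts = [f"Summarize support ticket number {i} for the on-call engineer" for i in range(500)]
calibration = build_calibration_set(logged_prompts)
print(len(calibration))
```

Keeping the seed fixed means that when you re-quantize with a newer method, any quality change comes from the method rather than from a different calibration sample.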