eazyware
Engineering·May 20, 2024·11 min read

LLM quantization: GPTQ, AWQ, GGUF, and the practical picks

Model quantization cuts VRAM 2-4x at modest quality cost. GPTQ, AWQ, GGUF, bitsandbytes — what each format fits and which to pick for production.

Kushal R.
Engineering lead

LLM quantization — reducing the precision of model weights from 16-bit floats to 8-bit or 4-bit integers — has made self-hosting practical for teams that couldn't afford it in 2022. A 70B model in bf16 needs 140GB+ of GPU memory; quantized to 4-bit, it fits in 40GB. Quality loss is often imperceptible for production tasks. This post is the practical guide to the quantization formats in 2026 and how to pick among them.

Quantization formats
LLM quantization formats

FORMAT            BITS      BEST FOR                       TOOLING
GPTQ              4-bit     GPU inference, high quality    vLLM, TGI
AWQ               4-bit     faster GPU inference           vLLM, TGI
GGUF (K-quants)   2-8 bit   CPU/Mac, llama.cpp             llama.cpp, Ollama
bitsandbytes      4/8-bit   training + QLoRA               HF transformers
FP8               8-bit     H100/H200 native               TensorRT-LLM

Quick picks: serving on GPU → AWQ · self-host on Mac → GGUF · QLoRA training → bitsandbytes
bf16 baseline, int8 (2x smaller, minimal loss), int4/GPTQ/AWQ (4x smaller, moderate loss), int2 (extreme compression, significant loss). Format choice depends on hardware and tooling.
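
The quick picks from the format table can be written down as a small lookup, which is handy if the choice lives in deployment config. This is only a sketch: the use-case keys (`gpu_serving`, `mac_self_host`, `qlora_training`) and the function name are made up for illustration, and the mapping is just the three picks listed above.

```python
# Quick picks from the format table, keyed by hypothetical use-case names.
QUICK_PICKS = {
    "gpu_serving": "AWQ",
    "mac_self_host": "GGUF",
    "qlora_training": "bitsandbytes",
}


def pick_format(use_case: str) -> str:
    """Return the quick-pick format for a use case, or raise if unknown."""
    try:
        return QUICK_PICKS[use_case]
    except KeyError:
        known = ", ".join(sorted(QUICK_PICKS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}")


for use_case in QUICK_PICKS:
    print(f"{use_case}: {pick_format(use_case)}")
```

Anything outside these three cases deserves an actual evaluation rather than a lookup.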

Why quantize

Memory footprint. The weights of a 70B model take about 140GB at bf16, 70GB at int8, and 35GB at int4, before counting KV cache and activations. The difference is 'needs an H100 cluster' vs 'fits on a single GPU.'
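
The arithmetic behind those numbers is parameters times bits per weight, divided by 8 to get bytes. A minimal sketch, assuming 1 GB = 1e9 bytes and counting weights only (KV cache, activations, and any quantization metadata come on top); the function name is ours:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"70B at {label}: {weight_memory_gb(70, bits):.0f} GB")
```

Leave headroom on top of the result when sizing a GPU, since the serving stack also needs memory for the KV cache.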

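Throughput. Smaller weights mean fewer bytes read from GPU memory per forward pass. Most LLM inference is memory-bound, especially token-by-token decoding, so this usually translates to faster inference, provided the serving stack has efficient kernels for the format.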
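
A rough way to see the effect: at batch size 1, each decoded token has to read every weight once, so memory bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below assumes that simple model and uses a made-up round bandwidth figure (`ASSUMED_BANDWIDTH_GB_S`), not a spec for any particular GPU. Real throughput will be lower and depends on kernels, batching, and KV cache traffic.

```python
def decode_ceiling_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed if weight reads were the only cost."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


# Hypothetical memory bandwidth; replace with your hardware's real figure.
ASSUMED_BANDWIDTH_GB_S = 2000.0

for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    weights_gb = 70 * bits / 8  # 70B parameters, weights only
    ceiling = decode_ceiling_tokens_per_s(weights_gb, ASSUMED_BANDWIDTH_GB_S)
    print(f"{label}: {weights_gb:.0f} GB of weights, ceiling {ceiling:.1f} tokens/s")
```

The ratio is the useful part: under this model, 4x smaller weights raise the ceiling 4x, whatever the bandwidth is.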

Economics. Running a quantized model on cheaper hardware cuts inference cost substantially. 4-bit models on consumer GPUs are feasible for many workloads.

Int8 — safe default

8-bit integer quantization. 2x memory reduction. Quality loss typically below 1% on most benchmarks for modern LLMs.

Tooling: LLM.int8() (Tim Dettmers), SmoothQuant, bitsandbytes. Broadly supported.

The safe choice. If you're unsure what to pick, start with int8. Quality is essentially preserved; memory and throughput improve.
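
To make the idea concrete, here is a minimal sketch of symmetric absmax quantization to int8 in plain Python. It is a textbook baseline for illustration, not the LLM.int8() or SmoothQuant algorithm (both add outlier handling on top), and the function names are ours.

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(codes, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"scale={scale:.6f}", f"worst error={worst:.6f}")
```

With 255 usable levels the rounding error per weight is at most half the scale, which is why int8 tends to be so forgiving.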

Int4 — the sweet spot for many

4-bit quantization. 4x memory reduction. Quality loss 1-5% depending on the method and model.

Key variants: GPTQ (group-wise quantization with second-order optimization), AWQ (activation-aware quantization, often higher quality than GPTQ), and framework-specific int4 schemes such as bitsandbytes NF4.

Sweet spot for self-hosting. Fits large models on consumer or single-GPU setups. Quality loss is acceptable for most production tasks when paired with a proper eval.
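
The group-wise part of these methods can be sketched with plain round-to-nearest quantization: split the weights into groups and give each group its own scale and offset. This is deliberately the naive baseline, not GPTQ or AWQ, which choose the quantized values more carefully using calibration data. The group size of 128 and the 16-bit scale and offset are assumptions for illustration.

```python
def quantize_group_int4(group: list[float]) -> tuple[list[int], float, float]:
    """Round-to-nearest 4-bit codes (0..15) with a per-group scale and offset."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def quantize_int4(weights: list[float], group_size: int = 128):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group_int4(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def dequantize_int4(groups) -> list[float]:
    """Rebuild approximate weights from per-group codes, scales, and offsets."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


def effective_bits_per_weight(group_size: int, scale_bits: int = 16,
                              offset_bits: int = 16) -> float:
    """4 bits per code plus the per-group metadata spread over the group."""
    return 4 + (scale_bits + offset_bits) / group_size


weights = [i / 100.0 - 1.0 for i in range(256)]
restored = dequantize_int4(quantize_int4(weights))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst error={worst:.4f}, bits/weight={effective_bits_per_weight(128)}")
```

Smaller groups track the weights more closely but cost more metadata, which is why real 4-bit checkpoints land slightly above the ideal 4x reduction.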

Int2 and below — research territory

1-bit and 2-bit quantization (BitNet, QuIP#, AQLM). Extreme compression; research-active. Quality loss is significant but shrinking with new techniques.

Not yet standard in production. Watch for maturation over 2026. Currently reserved for experiments and edge deployments where memory is the binding constraint.

Deployment tooling

vLLM. Supports int8, int4 (GPTQ, AWQ), GGUF. For most self-hosting, vLLM is the default serving stack.

llama.cpp / GGUF. For CPU and memory-constrained deployments. The GGUF format covers many quantization levels, roughly 2 to 8 bits. Each file carries one quantization scheme, so switching levels means swapping in a different file of the same model.

TensorRT-LLM. NVIDIA's optimized stack. Best performance on NVIDIA hardware. More complex to set up than vLLM.

ExLlamaV2. For single-user chat workloads. Fast on consumer GPUs.

Evaluating quantization quality

Don't trust generic benchmarks alone. A model's benchmark scores with int4 vs bf16 might look close, but your specific task may see a larger regression.

Run your eval suite against both the unquantized and quantized versions. Common result: int8 is essentially identical; int4 is slightly worse on complex tasks, close on simple tasks.
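
The comparison itself can be very small. A minimal sketch, assuming you already have a per-task score between 0 and 1 for each model version: flag any task where the quantized model drops by more than a tolerance. The task names, the scores, and the 0.01 tolerance below are placeholders for illustration, not measurements.

```python
def find_regressions(baseline: dict[str, float], quantized: dict[str, float],
                     max_drop: float = 0.01) -> dict[str, float]:
    """Return tasks whose score fell by more than max_drop, with the drop."""
    regressions = {}
    for task, base_score in baseline.items():
        if task not in quantized:
            continue  # only compare tasks that were run on both versions
        drop = base_score - quantized[task]
        if drop > max_drop:
            regressions[task] = drop
    return regressions


# Placeholder scores for illustration only.
baseline_scores = {"simple_extraction": 0.90, "multi_step_reasoning": 0.80}
quantized_scores = {"simple_extraction": 0.895, "multi_step_reasoning": 0.75}

for task, drop in find_regressions(baseline_scores, quantized_scores).items():
    print(f"{task}: dropped {drop:.3f}")
```

Set the tolerance per task if some outputs matter more than others; a single global threshold is just the simplest starting point.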

See our eval infrastructure post. This is one of the clearest use cases for a robust eval pipeline.

When to self-host quantized models

When API costs exceed $5-10K/month and you have someone capable of running GPU infrastructure. Below that, API providers win on convenience and total cost of ownership. Above that, self-hosted quantized models can halve costs.
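
The break-even check is simple arithmetic: GPU rental plus the cost of the people running it, compared with the API bill. A minimal sketch with entirely hypothetical prices (the hourly GPU rate, the ops cost, and the API bills below are placeholders, not quotes), assuming 730 hours in a month:

```python
def self_host_monthly_cost(gpu_hourly_usd: float, gpu_count: int,
                           ops_monthly_usd: float,
                           hours_per_month: float = 730.0) -> float:
    """GPU rental for a full month plus the engineering time to run it."""
    return gpu_hourly_usd * gpu_count * hours_per_month + ops_monthly_usd


def monthly_savings(api_monthly_usd: float, self_host_usd: float) -> float:
    """Positive means self-hosting is cheaper than the API bill."""
    return api_monthly_usd - self_host_usd


# Hypothetical inputs; plug in your own quotes and salaries.
cost = self_host_monthly_cost(gpu_hourly_usd=2.0, gpu_count=1, ops_monthly_usd=3000.0)
for api_bill in (3000.0, 10000.0):
    print(f"API ${api_bill:,.0f}/mo vs self-host ${cost:,.0f}/mo: "
          f"savings ${monthly_savings(api_bill, cost):,.0f}")
```

The ops term is the one teams tend to underestimate, and it is the main reason small API bills are not worth replacing.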

Privacy or compliance requirements. Data can't leave your infrastructure. Self-hosted is mandatory; quantization makes it affordable.

Specialized domains with fine-tuned models. Your own fine-tuned open-weight models are rarely available as hosted APIs. You host them; quantize for economics.

Gotchas

Calibration data matters. GPTQ and AWQ require calibration on representative data. Generic calibration data can hurt quality on your specific task; use your own data where you can.
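
Building the calibration set from your own traffic can start as simply as this: deduplicate, drop very short texts, and take a fixed-seed random sample so the run is reproducible. The sample size of 128 and the 32-character minimum are assumptions for illustration; check what your quantization tool expects. Passing the texts to GPTQ or AWQ tooling is out of scope here.

```python
import random


def sample_calibration_set(texts: list[str], n: int = 128, seed: int = 0,
                           min_chars: int = 32) -> list[str]:
    """Pick up to n distinct, non-trivial texts with a reproducible shuffle."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(t.strip() for t in texts))
    usable = [t for t in unique if len(t) >= min_chars]
    if len(usable) <= n:
        return usable
    return random.Random(seed).sample(usable, n)


# Stand-in for real production prompts.
example_texts = [
    f"support ticket {i}: customer cannot reset their password" for i in range(500)
]
calibration = sample_calibration_set(example_texts)
print(len(calibration), "calibration texts, first:", calibration[0])
```

If your traffic spans several distinct task types, sampling each type separately keeps a rare but important one from being left out.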

Not all models quantize equally. Some architectures suffer more quality loss at low-bit than others. Llama-family models are well-supported; less common architectures may not be.

Ongoing maintenance. New quantization methods ship regularly. Periodic re-benchmarking on your eval set catches opportunities to improve.
