Vector storage is one of the largest line items for RAG systems at scale. A hundred million 1536-dim float32 vectors is ~600GB. Quantization can cut that to 150GB (int8) or 40GB (product quantization) with minimal recall loss. For teams past the prototype stage, quantization is not optional: it's the difference between infrastructure you can afford and infrastructure you can't. This post is a practical guide to the four main quantization approaches and when each is worth adopting.
Why quantization matters at scale
Storage cost scales linearly with both vector count and bytes per dimension. RAM for in-memory indexes (HNSW especially) scales the same way. With 1536-dim float32 vectors, 10M vectors take about 60GB, which fits on a single machine. 100M take about 600GB, which requires horizontal sharding. 1B take about 6TB, which requires serious distributed systems work.
Quantization doesn't change the linear scaling, but it shrinks the constant enough to move you down a tier: int8 keeps your 100M vectors on a single machine, and product quantization keeps 1B vectors affordable. That changes the operational complexity of your system.
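The arithmetic behind these figures is simple enough to keep in a scratch script. The sketch below is ours for illustration (the function name `raw_vector_gb` is made up), and it counts only the raw vector payload in decimal gigabytes, assuming 1536 dimensions; graph links, metadata, and replication come on top.

```python
def raw_vector_gb(num_vectors: int, dims: int, bits_per_dim: float) -> float:
    """Raw vector payload in decimal GB; index overhead (e.g. HNSW links) is excluded."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9


DIMS = 1536  # dimensionality assumed throughout this post's figures
FORMATS = {"float32": 32, "float16": 16, "int8": 8, "binary": 1}

for n in (10_000_000, 100_000_000, 1_000_000_000):
    row = ", ".join(
        f"{name}={raw_vector_gb(n, DIMS, bits):,.1f}GB" for name, bits in FORMATS.items()
    )
    print(f"{n:>13,} vectors: {row}")
```

Product quantization is left out of the table because its size depends on the subvector count rather than on a fixed number of bits per dimension.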
Float16 — the free lunch
Half precision cuts storage in half with essentially zero recall loss (well under 0.5% in our benchmarks). Almost every modern vector database supports float16 natively. If your system is still on float32, switch to float16 today.
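The only caveat: some embedding models emit values near the edges of the float16 range (the largest finite value is 65504, and precision degrades for magnitudes below about 6e-5). This is very rare in practice, but worth an eval on production data before rollout.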
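A quick way to check whether your embeddings sit comfortably inside the float16 range is to scan a sample and round-trip each value. This is a minimal sketch using only the standard library's `struct` half-precision format; `float16_report` is a hypothetical helper, not part of any vector database, and on real data you would more likely do the same thing with an array library.

```python
import struct

FLOAT16_MAX = 65504.0            # largest finite float16 value
FLOAT16_MIN_NORMAL = 2.0 ** -14  # below this, float16 falls back to subnormals


def float16_roundtrip(x: float) -> float:
    """Encode x as IEEE 754 half precision and decode it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def float16_report(vectors):
    """Count out-of-range values and measure the worst round-trip error."""
    overflow = subnormal = 0
    max_abs_error = 0.0
    for vec in vectors:
        for x in vec:
            if abs(x) > FLOAT16_MAX:
                overflow += 1  # would become inf (or fail to encode)
                continue
            if 0.0 < abs(x) < FLOAT16_MIN_NORMAL:
                subnormal += 1  # representable, but with reduced precision
            max_abs_error = max(max_abs_error, abs(x - float16_roundtrip(x)))
    return {"overflow": overflow, "subnormal": subnormal, "max_abs_error": max_abs_error}


sample = [[0.5, -0.25, 1e-6], [70000.0, 0.1]]  # toy values, not real embeddings
print(float16_report(sample))
```

A non-zero overflow count is a reason to stop and investigate; a handful of subnormal values usually is not, but the recall eval should have the final word.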
Int8 — the sweet spot
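Scalar quantization to 8-bit integers: 4x storage reduction versus float32, 1-3% recall loss for most embedding models. Works well with HNSW and IVF indexes. Support is broad but not uniform: Qdrant and Weaviate ship scalar quantization modes, while other stores differ in which compressed types they expose, so check the current documentation for yours.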
Our default for RAG systems past the 10M-vector threshold. The recall loss is typically invisible in downstream task quality (we measure the actual task — question answering, search relevance — not just raw recall).
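To make the mechanics concrete, here is a minimal sketch of symmetric scalar quantization with a single global scale. That choice is an assumption made for brevity: production implementations often use per-dimension ranges or clip at a quantile instead, and the function names here are ours.

```python
from array import array


def fit_scale(vectors) -> float:
    """Symmetric scale: the largest absolute value in the sample maps to 127."""
    max_abs = max(abs(x) for vec in vectors for x in vec)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize_int8(vec, scale: float) -> array:
    """Round each value to the nearest step and clip to [-127, 127]."""
    return array("b", (max(-127, min(127, round(x / scale))) for x in vec))


def dequantize_int8(codes, scale: float) -> list:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


sample = [[0.5, -1.0, 0.25], [0.1, 0.9, -0.3]]  # toy values, not real embeddings
scale = fit_scale(sample)
codes = quantize_int8(sample[0], scale)
restored = dequantize_int8(codes, scale)
worst = max(abs(a - b) for a, b in zip(sample[0], restored))
print(f"bytes per value: {codes.itemsize}, worst reconstruction error: {worst:.5f}")
```

The per-value error is bounded by half a quantization step, which is why a single outlier dimension that inflates the scale hurts every other dimension too.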
Product quantization (PQ)
Split each vector into subvectors, run k-means on each subspace independently to build a codebook, and store only the ID of the nearest centroid for each subvector. Massive compression (16-64x typical) with higher recall loss (3-8% typical).
Best for: 100M+ vector systems where storage is the binding constraint. FAISS's IVF-PQ is the canonical implementation. It is more complex to tune than int8: the subvector count and codebook size both trade compression against recall.
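The sketch below shows the encode and decode path of product quantization in plain Python, with a deliberately naive k-means. It is a toy for building intuition, not how FAISS implements it, and every name in it (`train_pq`, `encode`, `decode`, `m`, `k`) is ours. It assumes the dimension divides evenly by the subvector count `m`.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Naive Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def split(vec, m):
    """Cut vec into m equal-length subvectors (len(vec) must be divisible by m)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def train_pq(vectors, m, k):
    """One codebook of k centroids per subspace."""
    return [kmeans([split(v, m)[j] for v in vectors], k, seed=j) for j in range(m)]


def encode(vec, codebooks):
    """Replace each subvector with the ID of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(subs, codebooks)]


def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out


rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]  # toy data
books = train_pq(data, m=4, k=16)
codes = encode(data[0], books)
print(codes, round(sq_dist(data[0], decode(codes, books)), 4))
```

The two tuning knobs map directly onto compression. With codebooks of 256 centroids each code fits in one byte, so a 1536-dim float32 vector (6144 bytes) shrinks to `m` bytes: 384 subvectors gives 16x, 96 subvectors gives 64x, which is the range quoted above. Fewer subvectors compress harder but make each centroid stand in for more dimensions.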
Pairs well with a rerank step: use PQ for approximate retrieval, fetch full float vectors for top candidates, do exact similarity on those. Combined cost is low, quality is close to full-precision.
Binary embeddings
Quantize to 1 bit per dimension (the sign of each value). That is 32x compression relative to float32. Recall loss is significant (5-15% typical).
Generally too lossy for direct retrieval. But: excellent for a first-pass filter in a rerank-heavy pipeline. Binary retrieval to 1000 candidates, int8 or float rerank to final 10. Total cost and latency are often better than a pure int8 pipeline at the extremes of scale.
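The two-stage pattern can be sketched in a few lines. This is an illustrative brute-force version, assuming unit-normalized vectors and a dot-product rerank; the names `binarize` and `two_stage_search` are ours, and a real system would scan packed bit arrays with hardware popcount rather than Python integers.

```python
def binarize(vec) -> int:
    """Pack sign bits into an int: bit i is set when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, shortlist=1000, top_k=10):
    """Binary first pass to a shortlist, then exact rerank on full vectors."""
    q_code = binarize(query)
    # Stage 1: cheap Hamming-distance scan over the 1-bit codes.
    candidates = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    candidates = candidates[:shortlist]
    # Stage 2: exact dot product, computed only for the shortlist.
    ranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:top_k]


vectors = [[0.6, 0.8, 0.0], [-0.6, 0.8, 0.0], [0.0, -1.0, 0.0]]  # toy unit vectors
codes = [binarize(v) for v in vectors]
print(two_stage_search([0.8, 0.6, 0.0], vectors, codes, shortlist=2, top_k=1))
```

The shortlist size is the knob to tune: it has to be large enough that the true top results survive the lossy first pass, and small enough that the rerank stays cheap.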
Picking the right quantization
Rule of thumb we apply: Float16 always. Int8 once storage or RAM becomes a noticeable cost (past roughly 10M vectors for us). PQ when you're at 100M+ vectors. Binary only as part of a multi-stage retrieval pipeline. See the vector index tuning post for related considerations.
Model-specific behavior matters. OpenAI text-embedding-3 quantizes cleanly to int8; some older models show more recall loss. Always eval on your actual data and task before committing.
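Written as code, the rule of thumb looks like the sketch below. The function and its return labels are purely illustrative; the 10M and 100M thresholds are the ones used in this post and are starting points, not hard limits. An eval on your own data overrides whatever this returns.

```python
def pick_quantization(num_vectors: int, multi_stage: bool = False) -> str:
    """Encode this post's rule of thumb; thresholds are starting points only."""
    if multi_stage:
        # Binary is only recommended as the first pass of a rerank pipeline.
        return "binary first pass + int8/float rerank"
    if num_vectors >= 100_000_000:
        return "product quantization"
    if num_vectors >= 10_000_000:
        return "int8"
    return "float16"  # the baseline for everything smaller


for n in (1_000_000, 50_000_000, 500_000_000):
    print(f"{n:>11,} vectors -> {pick_quantization(n)}")
```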
Rollout pattern
Run the quantized index in shadow mode alongside your current index. For N days, serve from the current index while sending the same queries to both, and compare retrieval quality on sampled real queries. Migrate once you've confirmed the recall delta is within acceptable bounds for your downstream task. See the shadow testing post.
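The comparison itself can be as simple as recall@k of the quantized results against the current index's results. The sketch below assumes two hypothetical callables, `search_baseline` and `search_quantized`, that each return a ranked list of document IDs for a query; the 0.97 threshold is a placeholder to replace with whatever your downstream task tolerates.

```python
def recall_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline's top-k that the candidate's top-k also returns."""
    base = set(baseline_ids[:k])
    if not base:
        return 1.0  # nothing to recover, so nothing was missed
    return len(base & set(candidate_ids[:k])) / len(base)


def shadow_report(queries, search_baseline, search_quantized, k=10, threshold=0.97):
    """Compare both indexes on sampled queries and summarize the recall delta."""
    recalls = [recall_at_k(search_baseline(q), search_quantized(q), k) for q in queries]
    mean_recall = sum(recalls) / len(recalls)
    return {
        "mean_recall": mean_recall,
        "worst_query_recall": min(recalls),
        "passes": mean_recall >= threshold,
    }


# Toy stand-ins for the two indexes; real ones would call your vector store.
current = {"q1": [1, 2, 3, 4], "q2": [5, 6, 7, 8]}
shadow = {"q1": [1, 2, 4, 9], "q2": [5, 6, 7, 8]}
print(shadow_report(["q1", "q2"], current.get, shadow.get, k=4, threshold=0.8))
```

Tracking the worst query alongside the mean matters because a healthy average can hide a small class of queries that the quantized index handles badly.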