eazyware
Engineering·November 4, 2024·10 min read

Embedding quantization: cutting vector storage 4-16x

Int8, binary, and product quantization for embeddings. The quality-storage tradeoffs and when each technique is worth the implementation effort.

Kushal R.
Engineering lead

Vector storage is one of the largest line items for RAG systems at scale. A hundred million 1536-dim float32 vectors come to roughly 600GB. Quantization can cut that to about 150GB (int8) or 40GB (product quantization) with limited recall loss. For teams past the prototype stage, quantization is not optional: it decides whether the infrastructure bill is affordable. This post is a practical guide to the four main quantization approaches and when each is worth adopting.

Storage vs quality

| Method | Bytes / dim | Storage reduction | Recall loss |
|---|---|---|---|
| Float32 (baseline) | 4.0 | 1.0x (reference) | 0% |
| Float16 | 2.0 | 2x | < 0.5% |
| Int8 | 1.0 | 4x | 1-3% |
| Product quant (PQ) | 0.25 | 16x | 3-8% |
| Binary (1-bit) | 0.125 | 32x | 5-15% |

Int8 is the sweet spot for most RAG; PQ shines at > 100M vectors; binary only for rerank-top-k patterns.

Why quantization matters at scale

Storage cost scales linearly with vector size. RAM for in-memory indexes (HNSW especially) scales the same way. At 10M 1536-dim vectors, float32 takes about 60GB, which fits on a single machine. At 100M it is about 600GB, which typically requires horizontal sharding. At 1B it is about 6TB, which means serious distributed systems work.
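The arithmetic behind those figures is easy to reproduce. The sketch below uses the bytes-per-dimension values quoted earlier in this post; the function name `raw_storage_gb` is ours, and the result covers raw vector payload only, not index overhead such as HNSW graph links.

```python
# Bytes per dimension for each representation discussed in this post.
BYTES_PER_DIM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "pq": 0.25,       # assumes the 16x PQ configuration (one byte per 4 dims)
    "binary": 0.125,  # one bit per dimension
}


def raw_storage_gb(num_vectors: int, dims: int, method: str) -> float:
    """Raw vector payload in decimal GB, excluding any index overhead."""
    return num_vectors * dims * BYTES_PER_DIM[method] / 1e9


if __name__ == "__main__":
    for n in (10_000_000, 100_000_000, 1_000_000_000):
        sizes = {m: round(raw_storage_gb(n, 1536, m), 1) for m in BYTES_PER_DIM}
        print(f"{n:>13,} vectors: {sizes}")
```

Swap in your own dimension count and vector count to see where each method moves you relative to a single machine's RAM.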

Quantization does not change the linear scaling, but it shrinks the constant. Int8 brings 100M vectors down to about 150GB, which can still fit on a single large-memory machine. Product quantization keeps 1B vectors affordable. This changes the operational complexity of your system.

Float16 — the free lunch

Half precision cuts storage in half with essentially zero recall loss (well under 0.5% in our benchmarks). Almost every modern vector database supports float16 natively. If your system is still on float32, switch to float16 today.

The only hesitation: some embedding models emit values at the edges of the float16 range. Very rare in practice but worth a production eval before rollout.
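A quick pre-rollout check along those lines might look like the following. This is a minimal NumPy sketch, not a substitute for a retrieval eval: `float16_report` is a hypothetical helper that counts values outside the float16 range and measures the worst round-trip error on a sample of your embeddings.

```python
import numpy as np


def float16_report(vectors: np.ndarray) -> dict:
    """Summarize what is lost when float32 embeddings are cast to float16."""
    v32 = np.asarray(vectors, dtype=np.float32)
    info = np.finfo(np.float16)
    abs_v = np.abs(v32)

    # Values larger than the float16 maximum (65504) become inf after the cast.
    overflow = int(np.count_nonzero(abs_v > float(info.max)))
    # Nonzero values below the smallest normal float16 lose precision or flush to zero.
    tiny = int(np.count_nonzero((abs_v > 0) & (abs_v < float(info.tiny))))

    with np.errstate(over="ignore"):
        v16 = v32.astype(np.float16)
    max_abs_error = float(np.max(np.abs(v16.astype(np.float32) - v32)))

    return {"overflow": overflow, "below_normal_range": tiny, "max_abs_error": max_abs_error}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal((1000, 1536)).astype(np.float32)
    sample /= np.linalg.norm(sample, axis=1, keepdims=True)
    print(float16_report(sample))
```

If `overflow` is nonzero on real data, the model is emitting values float16 cannot represent and the cast needs a rescale first.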

Int8 — the sweet spot

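Scalar quantization to 8-bit integers: 4x storage reduction, 1-3% recall loss for most embedding models. Works well with HNSW and IVF indexes. Support is common but not universal: Qdrant and Weaviate ship scalar quantization, while pgvector's reduced-precision types are halfvec (float16) and bit (binary), so check your database's documentation for what the version you run offers.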

Our default for RAG systems past the 10M-vector threshold. The recall loss is typically invisible in downstream task quality (we measure the actual task — question answering, search relevance — not just raw recall).
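For readers who have not seen scalar quantization spelled out, here is a minimal NumPy sketch. It assumes one symmetric scale shared by all dimensions, fitted on a sample; production systems often fit per-dimension ranges or clip at percentiles instead, and the function names are ours rather than any database's API.

```python
import numpy as np


def fit_scale(sample: np.ndarray) -> float:
    """Pick one symmetric scale so the largest sampled magnitude maps to 127."""
    max_abs = float(np.max(np.abs(sample)))
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize_int8(vectors: np.ndarray, scale: float) -> np.ndarray:
    """Round each value to the nearest step and clip into the int8 range."""
    q = np.round(np.asarray(vectors, dtype=np.float32) / scale)
    return np.clip(q, -127, 127).astype(np.int8)


def dequantize_int8(codes: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float values."""
    return codes.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((1000, 1536)).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    scale = fit_scale(vecs)
    codes = quantize_int8(vecs, scale)
    error = np.max(np.abs(dequantize_int8(codes, scale) - vecs))
    print("bytes ratio:", codes.nbytes / vecs.nbytes)
    print("worst-case error:", float(error), "step size:", scale)
```

The worst-case reconstruction error is about half a quantization step, which is why a few outlier dimensions with large magnitudes can hurt: they widen the step for every other value.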

Product quantization (PQ)

Split each vector into subvectors; cluster each subspace independently; store cluster IDs. Massive compression (16-64x typical) with higher recall loss (3-8% typical).

Best for: 100M+ vector systems where storage is the binding constraint. FAISS's IVF-PQ is the canonical implementation. Slightly more complex to tune than int8 — picking the subvector count and codebook size matters.

Pairs well with a rerank step: use PQ for approximate retrieval, fetch full float vectors for top candidates, do exact similarity on those. Combined cost is low, quality is close to full-precision.
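To make the mechanics concrete, here is a toy NumPy sketch of PQ with an exact rerank. It is for illustration only: the k-means is deliberately naive, all function names are ours, and a real deployment would use a library such as FAISS. The vector dimension must be divisible by the subvector count `m`, and `k` is capped at 256 so each code fits in one byte.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Naive Lloyd's k-means; returns centroids of shape (k, dims)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


def train_pq(x: np.ndarray, m: int, k: int = 256) -> np.ndarray:
    """Train one codebook per subspace; result has shape (m, k, dims // m)."""
    subspaces = np.split(x, m, axis=1)
    return np.stack([kmeans(s, k, seed=i) for i, s in enumerate(subspaces)])


def encode_pq(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Replace each subvector with the ID of its nearest centroid."""
    m = codebooks.shape[0]
    codes = np.empty((len(x), m), dtype=np.uint8)
    for i, s in enumerate(np.split(x, m, axis=1)):
        dists = ((s[:, None, :] - codebooks[i][None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes


def search_pq_rerank(query, codes, codebooks, full_vectors, shortlist=50, top_k=10):
    """Approximate inner-product search on PQ codes, then exact rerank."""
    m = codebooks.shape[0]
    q_subs = np.split(query, m)
    # table[i, c] = inner product of query subvector i with centroid c.
    table = np.stack([codebooks[i] @ q_subs[i] for i in range(m)])
    approx = table[np.arange(m), codes].sum(axis=1)
    candidates = np.argsort(-approx)[:shortlist]
    # Fetch full-precision vectors only for the shortlist.
    exact = full_vectors[candidates] @ query
    return candidates[np.argsort(-exact)[:top_k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.standard_normal((2000, 64)).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    books = train_pq(db, m=16, k=64)
    codes = encode_pq(db, books)
    print("bytes per vector:", db[0].nbytes, "->", codes[0].nbytes)
    print(search_pq_rerank(db[0], codes, books, db))
```

The two tuning knobs mentioned earlier show up directly as `m` and `k`. With one byte per 4 dimensions (`m = dims / 4`, `k = 256`) you get the 0.25 bytes per dimension, 16x figure; fewer, wider subvectors compress harder at the cost of recall.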

Binary embeddings

Quantize to 1 bit per dimension (the sign). 32x compression. Recall loss is significant (5-15% typical).

Generally too lossy for direct retrieval. But: excellent for a first-pass filter in a rerank-heavy pipeline. Binary retrieval to 1000 candidates, int8 or float rerank to final 10. Total cost and latency are often better than a pure int8 pipeline at the extremes of scale.
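A two-stage pipeline of that shape can be sketched in a few lines of NumPy. Assumptions: vectors are binarized by sign, the first pass ranks by Hamming distance, and the rerank uses full float vectors with inner product. The function names and the byte-wise popcount table are ours.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(np.asarray(vectors) > 0, axis=-1)


def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed database row."""
    differing = np.bitwise_xor(db_bits, query_bits)
    return POPCOUNT[differing].sum(axis=1, dtype=np.int64)


def two_stage_search(query, db_bits, db_full, shortlist=1000, top_k=10):
    """Binary first pass to a shortlist, then exact rerank on float vectors."""
    dist = hamming(binarize(query), db_bits)
    shortlist = min(shortlist, len(dist))
    # argpartition finds the `shortlist` closest rows without a full sort.
    candidates = np.argpartition(dist, shortlist - 1)[:shortlist]
    exact = db_full[candidates] @ query
    return candidates[np.argsort(-exact)[:top_k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.standard_normal((20_000, 256)).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    bits = binarize(db)
    print("bytes per vector:", db[0].nbytes, "->", bits[0].nbytes)
    print(two_stage_search(db[42], bits, db, shortlist=1000, top_k=10))
```

The shortlist size is the main quality lever: a larger shortlist recovers more of the recall lost to binarization, at the cost of fetching more full-precision vectors for the rerank.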

Picking the right quantization

Rule of thumb we apply: Float16 always. Int8 once storage or RAM becomes noticeable. PQ when you're at 100M+ vectors. Binary only as part of a multi-stage retrieval pipeline. See our vector index tuning post for related considerations.
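Written as code, the rule of thumb is a few comparisons. The thresholds below are the ones this post uses (int8 past 10M vectors, PQ at 100M+); treat them as starting points rather than hard limits, and note that `recommend` is an illustrative helper, not part of any library.

```python
INT8_THRESHOLD = 10_000_000   # this post's default switch point for int8
PQ_THRESHOLD = 100_000_000    # this post's switch point for product quantization


def recommend(num_vectors: int, has_rerank_stage: bool = False) -> list:
    """Return the suggested primary format, plus binary if a rerank stage exists."""
    if num_vectors >= PQ_THRESHOLD:
        primary = "pq"
    elif num_vectors >= INT8_THRESHOLD:
        primary = "int8"
    else:
        primary = "float16"  # float16 is the floor: never stay on float32

    options = [primary]
    if has_rerank_stage:
        # Binary is only suggested as a first-pass filter, never on its own.
        options.append("binary (first-pass filter only)")
    return options


if __name__ == "__main__":
    for n in (1_000_000, 50_000_000, 500_000_000):
        print(n, recommend(n), recommend(n, has_rerank_stage=True))
```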

Model-specific behavior matters. OpenAI text-embedding-3 quantizes cleanly to int8; some older models show more recall loss. Always eval on your actual data and task before committing.

Rollout pattern

Run the quantized index in shadow mode alongside your current index. For N days, keep serving from the current index while querying both, and compare retrieval quality on sampled real queries. Migrate once you've confirmed the recall delta is within acceptable bounds for your downstream task. See our shadow testing post.
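One simple way to score the comparison is top-k overlap between the two indexes on each sampled query. The sketch below assumes you have already logged, per query, the ranked document IDs from the current index and from the quantized one; the function names are ours, and overlap is only a proxy for the downstream task metric you ultimately care about.

```python
def overlap_at_k(baseline_ids, candidate_ids, k=10):
    """Fraction of the baseline top-k that the quantized index also returned."""
    base = list(baseline_ids)[:k]
    if not base:
        return 1.0  # nothing to recover, so nothing was lost
    cand = set(list(candidate_ids)[:k])
    return sum(1 for doc_id in base if doc_id in cand) / len(base)


def shadow_report(pairs, k=10):
    """Aggregate overlap over (baseline_ids, candidate_ids) pairs from sampled queries."""
    scores = [overlap_at_k(base, cand, k) for base, cand in pairs]
    if not scores:
        return {"queries": 0, "mean_overlap": None, "worst_overlap": None}
    return {
        "queries": len(scores),
        "mean_overlap": sum(scores) / len(scores),
        "worst_overlap": min(scores),
    }


if __name__ == "__main__":
    logged = [
        (["a", "b", "c", "d"], ["a", "b", "c", "d"]),
        (["a", "b", "c", "d"], ["a", "c", "x", "b"]),
    ]
    print(shadow_report(logged, k=4))
```

Tracking the worst-case query alongside the mean helps catch query types that degrade badly even when the average looks fine.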
