Vector index tuning is one of those topics every RAG team hits a year or two into production, usually when recall starts to matter more or latency starts to hurt. The right parameters for HNSW or IVF can differ by an order of magnitude depending on whether you're optimizing for recall, latency, memory, or build time. This post is the practical tuning guide: what to tune, in what order, and how to iterate toward a working configuration.

HNSW vs IVF

HNSW is the default for most cases under 100M vectors. IVF combines with PQ for extreme scale. Tuning follows a bisection process on the recall dial.

HNSW — the default

Hierarchical Navigable Small World (HNSW) graphs are the default or a supported index type in many vector databases (Qdrant, Weaviate, Pinecone, pgvector). They offer fast queries with high recall at moderate memory cost. Three parameters matter: M, ef_construction, and ef_search.

M (typically 16-32) controls graph degree, the number of links each node keeps. Higher M means better recall but more memory and slower builds. ef_construction (typically 100-200) controls how hard the index works during construction to find good neighbors. ef_search (typically 64-128) controls how deep the query walks; it is the primary recall/latency dial you'll tune. Shipped defaults vary by library, so check yours rather than assuming these ranges.
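To make the memory cost of M concrete, here is a rough back-of-envelope sketch. It assumes float32 vectors and an hnswlib-style layout where each node keeps up to 2*M neighbor ids of 4 bytes each on the bottom layer. Upper layers and per-node overhead are ignored, so treat the result as a lower bound. The function name and layout assumptions are illustrative, not taken from any particular database.

```python
def estimate_hnsw_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Rough lower bound on HNSW index size in bytes.

    Assumptions (illustrative): float32 vectors, 4-byte neighbor ids,
    up to 2*m links per node on the bottom layer, upper layers ignored.
    """
    vector_bytes = dim * 4      # raw float32 vector
    link_bytes = 2 * m * 4      # bottom-layer neighbor list
    return num_vectors * (vector_bytes + link_bytes)


if __name__ == "__main__":
    # Compare a few values of M for 10M vectors of dimension 768.
    for m in (16, 32, 64):
        gib = estimate_hnsw_bytes(10_000_000, 768, m) / 2**30
        print(f"M={m}: ~{gib:.1f} GiB")
```

For high-dimensional embeddings the raw vectors dominate, so under these assumptions doubling M adds far less than double the memory. The link overhead matters more for low-dimensional or quantized vectors.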

IVF — at extreme scale

Inverted File (IVF) indexes partition the vector space into clusters; queries search only the most relevant clusters. nlist controls the number of clusters (rule of thumb: start around sqrt(N), where N is the total number of vectors). nprobe controls how many clusters to search per query; it is the recall dial.
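As a small illustration of the rule of thumb, the sketch below computes a starting nlist and the approximate share of the dataset a query scans for a given nprobe. It assumes roughly balanced clusters, which real data rarely gives you exactly, and both function names are made up for this example.

```python
import math


def suggest_nlist(num_vectors: int) -> int:
    """Rule-of-thumb starting point: nlist ~ sqrt(N)."""
    return max(1, round(math.sqrt(num_vectors)))


def fraction_scanned(nprobe: int, nlist: int) -> float:
    """Approximate share of vectors scanned per query.

    Assumes clusters are roughly the same size.
    """
    return min(nprobe, nlist) / nlist


if __name__ == "__main__":
    n = 1_000_000
    nlist = suggest_nlist(n)
    for nprobe in (1, 8, 32):
        share = fraction_scanned(nprobe, nlist)
        print(f"nlist={nlist}, nprobe={nprobe}: ~{share:.1%} of vectors scanned")
```

The scanned fraction is a useful first proxy for query cost: raising nprobe raises recall and cost together.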

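IVF alone is rarely best; it's almost always combined with PQ (IVF-PQ) for massive compression. IVF-HNSW is another powerful combination: in FAISS, an HNSW graph over the centroids acts as the coarse quantizer, so finding the nearest clusters stays fast even when nlist is very large, while the IVF lists hold the vectors themselves. These combinations are where FAISS shines.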

The tuning process

Step 1: define target recall@k and latency budget. Typical targets: recall@10 = 0.95, p95 latency under 50ms. Without these targets, tuning is aimless.
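Both targets need a concrete definition before they can be measured. The sketch below shows one common way to compute them: recall@k as the overlap between the index's top-k and an exact (brute-force) top-k, and p95 latency with the nearest-rank method. The function names are illustrative, and the ground-truth ids are assumed to come from an exact search you run offline.

```python
import math
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbors found in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)


def mean_recall_at_k(all_retrieved, all_truth, k: int) -> float:
    """Average recall@k over a query set."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Usage: collect per-query result ids and latencies from a representative query set, then compare `mean_recall_at_k(...)` and `percentile(latencies_ms, 95)` against the targets.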

Step 2: start with library defaults or common starting values, for example M=32 and ef_construction=200 for HNSW. Measure baseline recall and latency on a representative query set.

Step 3: bisect the query-time recall dial (ef_search for HNSW, nprobe for IVF). Binary search to find the smallest value that hits target recall. This is usually 80% of the tuning work.
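The bisection in Step 3 can be sketched as a plain binary search over the dial. This is a minimal sketch that assumes recall is non-decreasing in the setting, which holds roughly (not strictly) for ef_search and nprobe. `measure_recall` is a hypothetical callback you would implement by running your eval queries at that setting.

```python
from typing import Callable, Optional


def smallest_setting_for_recall(
    measure_recall: Callable[[int], float],
    target: float,
    low: int,
    high: int,
) -> Optional[int]:
    """Binary-search the smallest value in [low, high] with recall >= target.

    Returns None if even `high` misses the target, which suggests the
    build-time parameters need to change instead.
    """
    if measure_recall(high) < target:
        return None
    while low < high:
        mid = (low + high) // 2
        if measure_recall(mid) >= target:
            high = mid       # mid is good enough; try smaller values
        else:
            low = mid + 1    # mid misses the target; go larger
    return low


if __name__ == "__main__":
    # Stand-in for a real measurement: recall jumps once ef_search reaches 87.
    fake_recall = lambda ef: 0.96 if ef >= 87 else 0.90
    print(smallest_setting_for_recall(fake_recall, 0.95, 10, 512))
```

Each probe costs one pass over the eval set, so a range of 10 to 512 needs about ten measurements. Because measured recall is noisy, it is worth adding a small margin to whatever value the search returns.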

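Step 4: if the latency budget is blown, revisit build-time parameters, rebuild the index, and re-tune the query-time dial. For HNSW, reduce M: fewer links per node means fewer distance computations per hop and a smaller index. For IVF, revisit nlist: it sets the cluster size, and the number of vectors scanned per query scales roughly with nprobe times the average cluster size, so what matters is the nlist/nprobe combination rather than a smaller nlist on its own.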

Step 5: if recall is still below target, increase M or ef_construction and rebuild. This is slow (long rebuilds) so leave it for last.

Filtered queries — the hidden cost

Metadata filters (tenant_id, tags, date ranges) interact non-trivially with vector indexes. Post-filtering (search, then filter) can miss relevant results: when most of the top-k candidates fail the filter, fewer than k survive. Pre-filtering (filter, then search) can be slow if the filtered subset is large, since it often means scanning that subset directly. Integrated filtering, which most modern DBs support, is usually best but has edge cases.
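The post-filtering failure mode is easy to reproduce on toy data. The sketch below uses an exact nearest-neighbor scan as a stand-in for an ANN index; the data, field names, and function names are all made up for illustration.

```python
import math


def top_k(query, items, k):
    """Exact nearest neighbors by Euclidean distance (stand-in for an ANN index)."""
    return sorted(items, key=lambda item: math.dist(query, item["vec"]))[:k]


def post_filter(query, items, k, predicate):
    """Search first, then drop results that fail the filter."""
    return [item for item in top_k(query, items, k) if predicate(item)]


def pre_filter(query, items, k, predicate):
    """Filter first, then search only the matching subset."""
    return top_k(query, [item for item in items if predicate(item)], k)


items = [
    {"id": 0, "vec": (0.0, 0.0), "tenant": "a"},
    {"id": 1, "vec": (0.1, 0.0), "tenant": "a"},
    {"id": 2, "vec": (0.2, 0.0), "tenant": "b"},
    {"id": 3, "vec": (0.9, 0.0), "tenant": "b"},
    {"id": 4, "vec": (1.0, 0.0), "tenant": "b"},
]


def is_tenant_b(item):
    return item["tenant"] == "b"


query = (0.0, 0.0)
post_ids = [item["id"] for item in post_filter(query, items, 2, is_tenant_b)]
pre_ids = [item["id"] for item in pre_filter(query, items, 2, is_tenant_b)]
print("post-filter:", post_ids)  # both nearest neighbors belong to tenant "a"
print("pre-filter:", pre_ids)
```

Here the two nearest vectors both belong to tenant "a", so post-filtering for tenant "b" returns nothing, while pre-filtering returns that tenant's two closest vectors. Over-fetching (searching for more than k before filtering) softens the problem but does not remove it for highly selective filters.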

Benchmark your actual queries with actual filters. Unfiltered benchmarks lie about production performance. See the hybrid search post for related patterns.

Rebuild cadence and fragmentation

HNSW indexes degrade with heavy deletes — fragmented graphs, residual nodes that skew search. Plan for periodic rebuilds (monthly for high-churn indexes, quarterly for stable ones).

IVF indexes lose recall when data distribution shifts more than 20% from the training set used to compute centroids. Monitor centroid quality; retrain when drift exceeds threshold.

What to monitor in production

Track four things: recall@k on a periodic eval set (sampled queries with known-good ground truth), query latency percentiles (p50, p95, p99), index size, and rebuild time. Any of these regressing by more than 15% from baseline is an alert worth investigating. See the observability post.
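A baseline comparison along these lines can be sketched in a few lines. The metric names and the `HIGHER_IS_BETTER` set are assumptions for this example; the only value taken from the text is the 15% threshold.

```python
# Metrics where a drop is the bad direction; everything else is "lower is better".
HIGHER_IS_BETTER = {"recall_at_10"}


def find_regressions(baseline: dict, current: dict, threshold: float = 0.15) -> dict:
    """Return metrics that moved in the bad direction by more than `threshold`.

    Values in the result are the relative regression, e.g. 0.25 for 25% worse.
    """
    flagged = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        worse_by = -change if name in HIGHER_IS_BETTER else change
        if worse_by > threshold:
            flagged[name] = worse_by
    return flagged


if __name__ == "__main__":
    baseline = {"recall_at_10": 0.95, "p95_ms": 40.0, "index_gb": 30.0, "rebuild_min": 90.0}
    current = {"recall_at_10": 0.94, "p95_ms": 50.0, "index_gb": 31.0, "rebuild_min": 95.0}
    print(find_regressions(baseline, current))  # only p95_ms is flagged
```

The threshold is a parameter so individual metrics can be checked with a tighter bound if a single relative cutoff turns out to be too coarse.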
