Indexing Strategies, Vector Databases, and Retrieval Systems for RAG

1. Overview¶

Indexing defines how embeddings are organised for fast similarity search at scale. The right indexing strategy depends on corpus size, latency requirements, update frequency, and memory constraints. If the retriever is the bottleneck for RAG quality, the index is what makes the retriever fast enough to use in production.

2. Flat Index (Exact Search)¶

Computes similarity against every vector in the database. Recall is perfect by definition — no approximation.

Pros: Exact results; simple implementation; deterministic.
Cons: Linear search time — O(N) per query; unusable in production for large corpora.
Use when: Small datasets (up to ~100k vectors); offline ground-truth benchmarking; evaluating other indexes.

3 Approximate Nearest Neighbor (ANN) Indexes¶

ANN indexes trade a small accuracy loss for large performance gains. The two dominant approaches are HNSW and IVF.

HNSW (Hierarchical Navigable Small World)¶

Builds a multi-layer graph where each node connects to similar vectors. Search starts at higher (coarser) layers and progressively refines through lower (finer) layers. The "navigable small world" property ensures short paths between any two nodes in the graph.

How it works:

During indexing, each new vector is inserted into multiple layers, with connections to its nearest neighbours at each layer.
During search, the algorithm enters at the top layer (fewest nodes), greedily navigates to the nearest centroid, then descends layer by layer for increasing refinement.


Pros	Very high recall at low latency; supports dynamic insertion without full rebuild; tunable recall/speed tradeoff
Cons	High memory overhead from graph edges (typically 2–8x raw vector storage); slow build on very large datasets
Best for	Latency-sensitive RAG; medium-to-large corpora; frequently updated data

Key parameters:

M — number of connections per node. Higher M → better recall, more memory.
ef_construction — search width during index build. Higher → better index quality, slower build.
ef_search — search width at query time. Higher → better recall, slower queries.

IVF (Inverted File Index)¶

Clusters all vectors into n_list centroid clusters at build time. At query time, only the n_probe closest cluster centroids are searched.

Analogy to classical inverted index: Instead of term → documents, IVF uses centroid ID → vectors assigned to that centroid. Only vectors in the probed clusters are evaluated.


Pros	Lower memory than HNSW; faster to build; disk-backed search feasible
Cons	Lower recall than HNSW if relevant vectors fall outside probed clusters; sensitive to clustering quality
Best for	Very large datasets; cost-constrained or memory-constrained systems

Key parameters:

n_list — number of clusters. More clusters → higher precision but longer build time.
n_probe — number of clusters searched at query time. More probes → higher recall, slower queries.

Product Quantization (PQ)¶

Compresses high-dimensional vectors into compact codes by splitting each vector into sub-vectors and quantising each sub-vector independently using a trained codebook.

Pros: Massive memory reduction — enables storage of billions of vectors; lower I/O cost.
Cons: Lossy compression — recall drops due to quantisation errors; harder to debug.
Typically combined with: IVF+PQ for extreme-scale search (e.g., web-scale retrieval).

ANN Index Comparison¶

Index	Recall	Query Speed	Memory	Update Support	Best For
Flat (exact)	Perfect	Slow (linear)	Low	Easy	Ground truth, small datasets
HNSW	Very high	Very fast	High	Easy (dynamic)	Production RAG, latency-sensitive
IVF	High	Fast	Medium	Requires rebuild	Large scale, memory-constrained
IVF+PQ	Moderate	Very fast	Very low	Requires rebuild	Billion-scale search
Sparse (BM25)	High (lexical)	Very fast	Low	Easy	Keyword search, hybrid RAG

4. Sparse Indexes¶

Sparse indexes use term-based inverted indexes mapping term → posting list of documents. Standard infrastructure for BM25 and SPLADE. Implemented in Elasticsearch, OpenSearch, and Lucene.

Excellent for: Lexical retrieval; exact keyword matches; rare terms and identifiers.
Cannot do: Semantic similarity; paraphrase matching.

5. Hybrid Indexing¶

Hybrid systems maintain both a dense vector index and a sparse inverted index. Retrieval runs in parallel across both, and results are merged (typically with Reciprocal Rank Fusion).

Pros: Improved recall and precision; robust to diverse query types; production-proven.
Cons: Increased system complexity; higher latency (two retrieval paths); requires score fusion tuning.

6. Vector Databases¶

Vector databases manage embedding storage, indexing, ANN search, metadata filtering, and scaling in a unified system. Key selection criteria: index type support, metadata filtering capabilities, update model, latency guarantees, and operational overhead.

Database	Key Strength	Consideration
FAISS (Meta)	Extremely flexible, high-performance, research standard	Not a full DB — needs extra engineering for production
Milvus	Distributed, scalable, multiple index types	High operational complexity
Qdrant	Strong metadata filtering, RAG-optimised, simple to operate	Less ecosystem than FAISS
Pinecone	Fully managed, zero ops overhead, consistent performance	Limited internal control; cost scales quickly
Weaviate	Strong hybrid search (dense + BM25 built-in)	More complex query interface
pgvector	Postgres extension — no new infrastructure needed	Lower performance at large scale

7. Metadata Filtering¶

Metadata filtering restricts retrieval to relevant subsets before (pre-filtering) or after (post-filtering) vector search.

Pre-filtering (filter first, then search the smaller set):

Faster — ANN search runs on a smaller index.
Risk: over-filtering can hurt recall if filters are too strict.

Post-filtering (search first, then discard irrelevant results):

Better recall — the full index is searched.
Wastes compute on candidates that will be filtered out.

Common metadata filters:

Document type or source
Timestamp / version (retrieve only recent documents)
Author or department
Access permissions / tenant ID (multi-tenant RAG)
Content type (prose vs. table vs. code)

Metadata filtering is one of the highest-ROI improvements in a vanilla RAG system — it narrows the search space without changing the embedding model or retraining anything.

8. Multi-Tenant RAG and Access Control¶

In enterprise systems, different users should only retrieve from their permitted document subset. Two main approaches:

Namespace / collection isolation: Each tenant's documents live in a separate index namespace. Cleanest isolation but higher infrastructure cost.
Metadata-based filtering: All documents share one index; retrieval filters on a tenant_id metadata field. More efficient but relies on the vector DB correctly enforcing filters.