Retrieval Methods

1. Overview

Retrieval quality sets the ceiling for RAG quality: if the right document is not retrieved, the LLM has nothing to ground its answer in and cannot recover. This section covers the main retrieval algorithms, from classical to neural, and how they compare.

2. TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is the foundational lexical retrieval method. It scores documents by combining two signals.

Term Frequency (TF)

How often a term appears in the document. Higher frequency → higher score.

TF(t, d) = count(t in d) / total terms in d

Variants like logarithmic TF reduce the impact of very frequent terms.
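A minimal sketch of the formula above next to one common logarithmic variant (1 + ln of the raw count). The function names and the whitespace tokenisation are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter


def term_frequency(term: str, doc_tokens: list[str]) -> float:
    """Length-normalised TF: count(t in d) / total terms in d."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def log_term_frequency(term: str, doc_tokens: list[str]) -> float:
    """Logarithmic variant: 1 + ln(count) when the term occurs, else 0."""
    count = Counter(doc_tokens)[term]
    return 1.0 + math.log(count) if count > 0 else 0.0


doc = "deep learning deep learning deep learning tutorial".split()
print(term_frequency("deep", doc))      # 3 occurrences out of 7 tokens
print(log_term_frequency("deep", doc))  # 1 + ln(3)
```

With the logarithmic variant, tripling the count raises the weight from 1 to roughly 2.1 rather than to 3, which is the dampening effect described above.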

Inverse Document Frequency (IDF)

How rare a term is across the corpus. Rare terms score higher; common words ("the", "is") score near zero.

IDF(t) = log(N / (1 + df(t)))

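Where N is the total number of documents and df(t) is the number containing term t. The 1 in the denominator avoids division by zero for terms that appear in no document; as a side effect, a term that appears in every document gets a score slightly below zero.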

TF-IDF Score

TF-IDF(t, d) = TF(t, d) × IDF(t)

A high score indicates a term that is both locally frequent and globally distinctive.
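The three formulas above can be sketched in a few lines of Python. The toy corpus and the function names are hypothetical and exist only to show how a rare term and a ubiquitous term end up with very different scores.

```python
import math


def tf(term, doc):
    """TF(t, d) = count(t in d) / total terms in d."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term, corpus):
    """IDF(t) = log(N / (1 + df(t)))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus, tokenised by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "the quantum computer".split(),
]
doc = corpus[3]
for term in ("the", "quantum"):
    print(term, round(tf_idf(term, doc, corpus), 3))
```

Both terms occur once in the last document, yet "quantum" scores clearly higher because it appears in only one document, while "the" appears in all four and lands slightly below zero under this IDF definition.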

Strengths

Simple, no training required, interpretable, fast.
Works well for exact and partial lexical matches.

Weaknesses

No semantic understanding — requires exact word overlap.
Assumes term independence (bag-of-words).
Linear TF with no saturation — keyword stuffing inflates scores.
Ignores word order and syntax.

3. BM25 (Best Matching 25)

BM25 is the go-to lexical retrieval baseline and a direct improvement over TF-IDF. It is probabilistically motivated and addresses two of TF-IDF's main weaknesses: it saturates term frequency and it normalises for document length.

BM25 Formula

BM25(d, q) = Σ_{t ∈ q} IDF(t) × [tf(t,d) × (k₁ + 1)] / [tf(t,d) + k₁ × (1 - b + b × |d|/avgdl)]

Where the sum runs over the query terms t, and:

tf(t, d): raw count of term t in document d
|d|: length of document d, in terms
avgdl: average document length in the corpus
k₁: controls term frequency saturation (typically 1.2–2.0)
b: controls length normalisation strength (typically 0.75)

Key Improvements Over TF-IDF

Term Frequency Saturation: BM25 assumes diminishing returns for repeated terms. The first few occurrences matter most; additional repetitions contribute progressively less. This prevents keyword stuffing from dominating rankings.
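To make the saturation concrete, the sketch below evaluates only the term-frequency part of the BM25 formula with length normalisation switched off (b = 0). The function name is an illustrative assumption.

```python
def bm25_tf_saturation(tf: float, k1: float = 1.2) -> float:
    """TF part of BM25 with b = 0: tf * (k1 + 1) / (tf + k1)."""
    return tf * (k1 + 1) / (tf + k1)


# Each extra occurrence adds less; the value approaches k1 + 1 but never reaches it.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(bm25_tf_saturation(tf), 3))
```

Going from 1 to 2 occurrences adds far more than going from 20 to 100, and no amount of repetition pushes the component past k₁ + 1.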

Document Length Normalisation: Longer documents are explicitly penalised via the b parameter. Setting b=0 disables normalisation; b=1 applies full normalisation. TF-IDF only applies weak or implicit normalisation.
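A small sketch of how b changes the weight of the same term count in a short, an average, and a long document. The document lengths and the function name are made up for illustration.

```python
def bm25_term_weight(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """TF part of the BM25 formula for one term (IDF left out)."""
    length_norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * length_norm)


avgdl = 100
for b in (0.0, 0.75, 1.0):
    weights = [round(bm25_term_weight(2, n, avgdl, b=b), 3) for n in (50, 100, 400)]
    print(f"b={b}: short/average/long = {weights}")
```

With b = 0 all three documents get the same weight; as b grows, the short document is boosted and the long one is penalised, while a document of exactly average length is unaffected.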

Modified IDF:

IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))

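This form is better behaved for very rare terms than the basic IDF formula, but it turns negative for terms that occur in more than half of the documents. Many implementations therefore add 1 inside the logarithm to keep it non-negative.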
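The sketch below compares the IDF above with a commonly used variant that adds 1 inside the logarithm (the form used, for example, by Lucene's BM25 similarity). The function names and the corpus size are illustrative assumptions.

```python
import math


def bm25_idf(n_docs: int, df: int) -> float:
    """IDF as written above: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25_idf_nonneg(n_docs: int, df: int) -> float:
    """Variant with 1 added inside the log, so the value never drops below zero."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


n_docs = 1000
for df in (5, 500, 900):
    print(df, round(bm25_idf(n_docs, df), 3), round(bm25_idf_nonneg(n_docs, df), 3))
```

For a term found in 900 of 1000 documents the first form is negative, which would make a match lower the score; the second form stays small but positive.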

BM25 vs. TF-IDF

Aspect	TF-IDF	BM25
Term frequency	Linear: repeats always increase score	Saturating: each repeat adds less, bounded by k₁ + 1
Length normalisation	Weak or implicit	Explicit, tunable via b
Tuning parameters	None	k₁ (saturation) and b (length norm)
Ranking quality	Good baseline	Usually better; closer to human relevance judgments
Requires training	No	No

Example: Why BM25 Ranks Better

Corpus:

D1: "deep learning deep learning deep learning tutorial"
D2: "deep learning tutorial"
D3: "deep learning introduction overview"

Query: "deep learning tutorial"

TF-IDF (using raw term counts, without length normalisation) ranks: D1 > D2 > D3. D1 wins purely due to repetition of "deep learning".
BM25 ranks: D2 > D1 > D3. D1's repeated terms saturate and its extra length is penalised, so the concise, directly on-topic D2 comes out ahead.

BM25 aligns with human intuition: repeating the same phrase doesn't make a document more relevant.
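The two rankings can be reproduced with a short script. It rests on stated assumptions: TF-IDF uses raw term counts, and because the corpus has only three documents, both scorers use IDF variants that stay positive (a smoothed IDF for TF-IDF, and the BM25 IDF with 1 added inside the logarithm). k₁ = 1.2 and b = 0.75 are typical values, not values given in the example.

```python
import math
from collections import Counter

docs = {
    "D1": "deep learning deep learning deep learning tutorial".split(),
    "D2": "deep learning tutorial".split(),
    "D3": "deep learning introduction overview".split(),
}
query = "deep learning tutorial".split()
N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N


def df(term):
    """Number of documents that contain the term."""
    return sum(1 for d in docs.values() if term in d)


def tfidf_score(doc):
    # Assumption: raw-count TF and a smoothed IDF that stays positive.
    counts = Counter(doc)
    return sum(counts[t] * (math.log((1 + N) / (1 + df(t))) + 1) for t in query)


def bm25_score(doc, k1=1.2, b=0.75):
    # Assumption: IDF with 1 added inside the log so it cannot go negative.
    counts = Counter(doc)
    norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for t in query:
        idf = math.log(1 + (N - df(t) + 0.5) / (df(t) + 0.5))
        score += idf * counts[t] * (k1 + 1) / (counts[t] + k1 * norm)
    return score


def ranking(score_fn):
    return sorted(docs, key=lambda name: score_fn(docs[name]), reverse=True)


print("TF-IDF:", ranking(tfidf_score))
print("BM25:  ", ranking(bm25_score))
```

Under these assumptions the script ranks D1 first for TF-IDF and D2 first for BM25. With a length-normalised TF or a different IDF the TF-IDF ordering can change, which is why the variant matters when comparing the two.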

When to Use BM25

Exact keyword matches matter (names, IDs, error codes, product codes).
No training budget or labelled data.
Fast, no-GPU, interpretable baseline.
As the sparse component of a hybrid retrieval system.

4. SPLADE

SPLADE (Sparse Lexical and Expansion Model) is a neural sparse retrieval model that bridges classical lexical retrieval (BM25) and dense semantic retrieval. It uses a pretrained transformer to produce sparse vector representations, and critically, it performs learned term expansion.

Core Idea

BM25 relies on exact term overlap. SPLADE trains a transformer to assign weights to vocabulary terms — including terms not present in the input — based on semantic relevance. Despite using a neural model, the output is a sparse vector compatible with standard inverted-index infrastructure.

How SPLADE Works (Step by Step)

1. Encoding: Input text is passed through a masked language model (e.g., BERT).
2. Vocabulary scoring: For each token position, the model produces a score for every term in the vocabulary. The model answers: "Which vocabulary terms are relevant to this text, even if not explicitly present?"
3. Non-linearity and pooling: Raw scores are transformed using ReLU (removes negatives) and log scaling (dampens large activations). Max pooling across token positions then produces a single vector with one weight per vocabulary term.
4. Result: One score per vocabulary term; most are zero. Only the most relevant terms remain active.
5. Inverted index: Non-zero terms are stored in a standard inverted index, the same infrastructure BM25 uses.

Example expansion:

Input: "car repair"
Expanded active terms: {car: 2.1, repair: 1.9, automobile: 1.4, mechanic: 1.1, engine: 0.8}

This allows matching a document about "automobile maintenance" even though neither query word appears in it.
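A minimal sketch of why expansion helps at match time: sparse vectors are scored with a dot product over shared terms. The query weights come from the example above; the document weights and the function name are hypothetical.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product over the terms the two sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


query_lexical = {"car": 1.0, "repair": 1.0}  # exact query words only
query_expanded = {"car": 2.1, "repair": 1.9, "automobile": 1.4,
                  "mechanic": 1.1, "engine": 0.8}

# Hypothetical term weights for a document about "automobile maintenance".
doc = {"automobile": 1.8, "maintenance": 1.6, "service": 0.7}

print(sparse_dot(query_lexical, doc))   # no shared terms
print(sparse_dot(query_expanded, doc))  # matches through "automobile"
```

The unexpanded query shares no term with the document and scores zero, while the expanded query matches through the learned term "automobile".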

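SPLADE Scoring Function

Score(t) = max over positions i of log(1 + ReLU(logit_{i,t}))

where logit_{i,t} is the model's score for vocabulary term t at token position i.

ReLU enforces non-negativity
Log scaling dampens large activations
Max pooling keeps, for each term, its strongest activation across positions; sparsity itself comes mainly from a sparsity regulariser applied during training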
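The scoring function can be sketched in plain Python. The four-term vocabulary and the logits are hand-picked for illustration; a real model scores tens of thousands of vocabulary terms per position and would be run through a deep learning framework.

```python
import math


def splade_pool(logits: list[list[float]]) -> list[float]:
    """logits[i][t] is the model's score for vocabulary term t at position i.

    Returns one weight per vocabulary term: max over positions of
    log(1 + ReLU(logit)).
    """
    vocab_size = len(logits[0])
    return [
        max(math.log1p(max(0.0, row[t])) for row in logits)
        for t in range(vocab_size)
    ]


vocab = ["car", "repair", "automobile", "banana"]
logits = [
    [7.0, -1.0, 3.0, -4.0],  # hypothetical scores at the position of "car"
    [0.5, 5.5, -2.0, -3.0],  # hypothetical scores at the position of "repair"
]
weights = splade_pool(logits)
sparse = {term: round(w, 3) for term, w in zip(vocab, weights) if w > 0}
print(sparse)
```

Terms whose logits are negative at every position ("banana" here) end up with weight zero and drop out of the sparse vector, while "automobile" survives even though it is not an input token.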

SPLADE vs. BM25

Aspect	BM25	SPLADE
Term weighting	Hand-crafted formula (TF × IDF)	Learned by transformer
Semantic expansion	None — exact words only	Yes — expands to related terms
Index compatibility	Inverted index	Inverted index (same infrastructure)
Interpretability	High	Moderate
Requires training	No	Yes
Retrieval quality	Strong baseline	Stronger; approaches dense retrieval

Think of SPLADE as "BM25 where the term weights and expansions are learned by a language model rather than hand-crafted."

When to Use SPLADE

Want BM25 efficiency with semantic expansion.
Existing inverted-index infrastructure that cannot be replaced.
Queries use different vocabulary than documents.
As the sparse component in hybrid retrieval alongside dense models.

5. Dense Retrieval (Bi-Encoders)

Dense retrieval uses neural embedding models to encode queries and documents into vectors; relevance is scored by dot product or cosine similarity. Full coverage (training objectives, domain adaptation, model selection) is in Embedding.

When to use: Natural language queries; paraphrase matching; semantic similarity matters.

Limitation: Weak on exact keyword matching, rare terms, and identifiers — combine with BM25/SPLADE for production systems.
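A minimal sketch of the scoring step, assuming the embeddings have already been computed. The three-dimensional vectors and document names are hypothetical stand-ins; real embedding models produce vectors with hundreds of dimensions, and production systems use an approximate nearest-neighbour index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings standing in for the output of a bi-encoder.
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "doc_car_repair": [0.8, 0.2, 0.4],
    "doc_cooking": [0.1, 0.9, 0.2],
}

ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

Documents are ranked purely by vector proximity, so no word overlap with the query is needed; this is also why exact identifiers can be missed.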

6. Retrieval Method Comparison

Method	Semantic Matching	Exact Match	Training Needed	Infrastructure	Best Use Case
TF-IDF	None	Good	No	Inverted index	Simple baseline
BM25	None	Strong	No	Inverted index	Keyword-heavy queries; default lexical baseline
SPLADE	Via expansion	Strong	Yes	Inverted index	Neural sparse; efficient + semantic
Dense bi-encoder	Strong	Weak	Yes	Vector DB + ANN	Natural language queries; semantic similarity
Hybrid (dense + sparse)	Strong	Strong	Yes (dense part)	Both	Production RAG; best overall quality
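The table does not say how a hybrid system merges its sparse and dense results. One widely used option, shown here purely as an illustration and not as the method this section prescribes, is reciprocal rank fusion (RRF), which combines ranked lists using only rank positions. The document IDs are hypothetical, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


sparse_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25 or SPLADE
dense_ranking = ["doc_b", "doc_d", "doc_a"]   # e.g. from a bi-encoder
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because only ranks are used, the sparse and dense scores do not need to be on the same scale; documents that both retrievers place near the top rise to the front of the fused list.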