Retrieval Methods
1. Overview
Retrieval is often the limiting factor in RAG quality: if the right document is not retrieved, the LLM has nothing to ground its answer on and cannot recover. This section covers the main retrieval algorithms, from classical lexical methods to neural ones, and how they compare.
2. TF-IDF
TF-IDF (Term Frequency–Inverse Document Frequency) is the foundational lexical retrieval method. It scores documents by combining two signals.
Term Frequency (TF)
How often a term appears in the document. Higher frequency → higher score.
TF(t, d) = count(t in d) / total terms in d
Variants like logarithmic TF reduce the impact of very frequent terms.
Inverse Document Frequency (IDF)
How rare a term is across the corpus. Rare terms score higher; common words ("the", "is") score near zero.
IDF(t) = log(N / (1 + df(t)))
Where N is the total number of documents and df(t) is the number containing term t.
TF-IDF Score
TF-IDF(t, d) = TF(t, d) × IDF(t)
A high score indicates a term that is both locally frequent and globally distinctive.
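The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenisation and a made-up toy corpus; the function names `tf`, `idf`, and `tf_idf` are illustrative and not taken from any library.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency with the (1 + df) smoothing shown above.
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + df))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (illustrative data, not from the text above).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "the fish swam in the pond".split(),
]

score_mat = tf_idf("mat", corpus[0], corpus)  # rare term: positive score
score_the = tf_idf("the", corpus[0], corpus)  # in every document: lowest score
print(score_mat, score_the)
```

Note that with the (1 + df) smoothing, a term that occurs in every document gets a slightly negative IDF, so "the" scores below zero here even though it is the most frequent term in the document.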
Strengths
- Simple, no training required, interpretable, fast.
- Works well for exact and partial lexical matches.
Weaknesses
- No semantic understanding: requires exact word overlap.
- Assumes term independence (bag-of-words).
- Linear TF with no saturation: keyword stuffing inflates scores.
- Ignores word order and syntax.
3. BM25 (Best Matching 25)
BM25 is the go-to lexical retrieval baseline and a direct improvement over TF-IDF. It is probabilistically motivated and addresses TF-IDF's two main weaknesses: the lack of term frequency saturation and the lack of explicit document length normalisation.
BM25 Formula
BM25(d, q) = Σ over query terms t in q of IDF(t) × [tf(t,d) × (k₁ + 1)] / [tf(t,d) + k₁ × (1 - b + b × |d|/avgdl)]
Where:
- tf(t, d): term frequency of t in document d
- |d|: length of document d
- avgdl: average document length in the corpus
- k₁: controls term frequency saturation (typically 1.2–2.0)
- b: controls length normalisation strength (typically 0.75)
Key Improvements Over TF-IDF
Term Frequency Saturation: BM25 assumes diminishing returns for repeated terms. The first few occurrences matter most; additional repetitions contribute progressively less. This prevents keyword stuffing from dominating rankings.
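Document Length Normalisation: Documents longer than the corpus average are explicitly penalised, and shorter ones boosted, with the strength set by the b parameter. Setting b=0 disables normalisation; b=1 applies full normalisation. TF-IDF only applies weak or implicit normalisation.
Modified IDF:
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))
This form becomes negative for terms that appear in more than half of the documents, so many implementations add 1 inside the logarithm to keep weights non-negative.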
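The saturation and length-normalisation behaviour described above can be checked numerically. The sketch below isolates the term-frequency part of the BM25 formula (everything except IDF); the function name `bm25_tf_component` and the document lengths are illustrative assumptions.

```python
def bm25_tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    # The bracketed term-frequency part of the BM25 formula, without IDF.
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return tf * (k1 + 1) / (tf + norm)

# Saturation: each extra occurrence adds less than the previous one
# (document of average length, so only tf varies).
gains = [
    bm25_tf_component(tf + 1, 100, 100) - bm25_tf_component(tf, 100, 100)
    for tf in range(5)
]

# Length normalisation: same tf, but the longer document scores lower.
short_doc = bm25_tf_component(3, doc_len=50, avgdl=100)
long_doc = bm25_tf_component(3, doc_len=200, avgdl=100)

# With b=0, document length has no effect on the score.
no_norm_short = bm25_tf_component(3, 50, 100, b=0.0)
no_norm_long = bm25_tf_component(3, 200, 100, b=0.0)

print(gains)
print(short_doc, long_doc)
```

However large tf becomes, this component stays below k₁ + 1, which is what caps the benefit of keyword stuffing.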
BM25 vs. TF-IDF
| Aspect | TF-IDF | BM25 |
|---|---|---|
| Term frequency | Linear: repeats always increase score | Saturating: diminishing returns, bounded by k₁ + 1 per term |
| Length normalisation | Weak or implicit | Explicit, tunable via b |
| Tuning parameters | None | k₁ (saturation) and b (length norm) |
| Ranking quality | Good baseline | Usually better; closer to human judgment |
| Requires training | No | No |
Example: Why BM25 Ranks Better
Corpus:
- D1: "deep learning deep learning deep learning tutorial"
- D2: "deep learning tutorial"
- D3: "deep learning introduction overview"
Query: "deep learning tutorial"
- TF-IDF ranks (using raw term counts for TF): D1 > D2 > D3. D1 wins purely due to repetition of "deep learning".
- BM25 ranks (with typical settings such as k₁ = 1.2, b = 0.75 and a non-negative IDF): D2 > D1 > D3. D1's score saturates and its above-average length is penalised, while D2's shorter length gives it a boost. D2 is more concise and directly on-topic.
BM25 aligns with human intuition: repeating the same phrase doesn't make a document more relevant.
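The rankings in this example can be reproduced with a short script. It is a sketch under stated assumptions: whitespace tokenisation, k₁ = 1.2 and b = 0.75, raw term counts for the TF-IDF side, and the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)) for both scorers, because every query term occurs in most of this three-document corpus. Using the same IDF on both sides isolates the effect of saturation and length normalisation.

```python
import math
from collections import Counter

docs = {
    "D1": "deep learning deep learning deep learning tutorial".split(),
    "D2": "deep learning tutorial".split(),
    "D3": "deep learning introduction overview".split(),
}
query = "deep learning tutorial".split()

N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N

def idf(term):
    # Assumption: non-negative "1 +" IDF variant (see lead-in).
    df = sum(1 for d in docs.values() if term in d)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def raw_tf_idf(doc):
    # Linear TF: every repetition adds the full IDF weight again.
    counts = Counter(doc)
    return sum(counts[t] * idf(t) for t in query)

def bm25(doc, k1=1.2, b=0.75):
    # Saturating TF with document length normalisation.
    counts = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return sum(idf(t) * counts[t] * (k1 + 1) / (counts[t] + norm) for t in query)

def ranking(score_fn):
    return sorted(docs, key=lambda name: score_fn(docs[name]), reverse=True)

tfidf_ranking = ranking(raw_tf_idf)
bm25_ranking = ranking(bm25)
print(tfidf_ranking)  # ['D1', 'D2', 'D3']
print(bm25_ranking)   # ['D2', 'D1', 'D3']
```

On a corpus this small the outcome is sensitive to the IDF variant and parameter choices, so treat it as an illustration of the mechanism rather than a benchmark.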
When to Use BM25
- Exact keyword matches matter (names, IDs, error codes, product codes).
- No training budget or labelled data.
- Fast, no-GPU, interpretable baseline.
- As the sparse component of a hybrid retrieval system.
4. SPLADE
SPLADE (Sparse Lexical and Expansion Model) is a neural sparse retrieval model that bridges classical lexical retrieval (BM25) and dense semantic retrieval. It uses a pretrained transformer to produce sparse vector representations, and critically, it performs learned term expansion.
Core Idea
BM25 relies on exact term overlap. SPLADE trains a transformer to assign weights to vocabulary terms — including terms not present in the input — based on semantic relevance. Despite using a neural model, the output is a sparse vector compatible with standard inverted-index infrastructure.
How SPLADE Works (Step by Step)
- Input text is passed through a masked language model (e.g., BERT).
- Vocabulary scoring: For each token position, the model produces a score for every term in the vocabulary. The model answers: "Which vocabulary terms are relevant to this text, even if not explicitly present?"
- Non-linearity and sparsification: Raw scores are transformed using ReLU (removes negatives) + log scaling (controls large activations). Max pooling across token positions produces a single sparse vector.
- Result: One score per vocabulary term; most are zero. Only the most relevant terms remain active.
- Inverted index: Non-zero terms are stored in a standard inverted index, the same infrastructure BM25 uses.
Example expansion:
- Input: "car repair"
- Expanded active terms: {car: 2.1, repair: 1.9, automobile: 1.4, mechanic: 1.1, engine: 0.8}
This allows matching a document about "automobile maintenance" even though neither query word appears in it.
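To make the matching concrete, the sketch below stores sparse document vectors in a small inverted index and scores the expanded query by summing products of weights over shared terms (a sparse dot product). The query weights are the ones from the example above; the document vectors, document IDs, and the `search` helper are hypothetical and not produced by a real SPLADE model.

```python
from collections import defaultdict

# Expanded query from the example above.
query_vec = {"car": 2.1, "repair": 1.9, "automobile": 1.4,
             "mechanic": 1.1, "engine": 0.8}

# Hypothetical sparse document vectors (illustrative weights only).
doc_vecs = {
    "doc_auto": {"automobile": 1.8, "maintenance": 1.6, "vehicle": 0.9},
    "doc_cooking": {"recipe": 2.0, "oven": 1.2},
}

# Inverted index: term -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, vec in doc_vecs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def search(query_vec, index):
    # Only postings of terms active in the query are visited.
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

results = search(query_vec, index)
print(results)
```

Here "doc_auto" is retrieved through the expanded term "automobile" even though it contains neither "car" nor "repair", while "doc_cooking" shares no active term and is never scored.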
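SPLADE Scoring Function
Score(t) = max over token positions i of log(1 + ReLU(logit_{i,t}))
Here logit_{i,t} is the masked-language-model score for vocabulary term t at position i.
- ReLU enforces non-negativity
- Log scaling controls large activations
- Max pooling encourages sparse activation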
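The scoring function can be sketched in plain Python. The logits below are invented for a four-term toy vocabulary; in a real system they would come from the masked language model's output layer, and the name `splade_pool` is illustrative.

```python
import math

def splade_pool(logits):
    """Apply log(1 + ReLU(x)) and max-pool over token positions.

    logits: one row per token position, each row holding one score
    per vocabulary term.
    """
    vocab_size = len(logits[0])
    return [
        max(math.log1p(max(0.0, row[j])) for row in logits)
        for j in range(vocab_size)
    ]

vocab = ["car", "repair", "automobile", "banana"]
logits = [
    [3.0, -1.0, 1.5, -2.0],  # token position 0 (made-up values)
    [0.5, 2.5, -0.3, -4.0],  # token position 1 (made-up values)
]

weights = splade_pool(logits)

# Keep only non-zero entries: this is the sparse vector that gets indexed.
sparse = {term: round(w, 3) for term, w in zip(vocab, weights) if w > 0}
print(sparse)
```

"banana" has negative logits at every position, so ReLU maps it to zero and it drops out of the sparse vector.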
SPLADE vs. BM25
| Aspect | BM25 | SPLADE |
|---|---|---|
| Term weighting | Hand-crafted formula (TF × IDF) | Learned by transformer |
| Semantic expansion | None — exact words only | Yes — expands to related terms |
| Index compatibility | Inverted index | Inverted index (same infrastructure) |
| Interpretability | High | Moderate |
| Requires training | No | Yes |
| Retrieval quality | Strong baseline | Stronger; approaches dense retrieval |
Think of SPLADE as "BM25 where the term weights and expansions are learned by a language model rather than hand-crafted."
When to Use SPLADE
- Want BM25 efficiency with semantic expansion.
- Existing inverted-index infrastructure that cannot be replaced.
- Queries use different vocabulary than documents.
- As the sparse component in hybrid retrieval alongside dense models.
5. Dense Retrieval (Bi-Encoders)
Dense retrieval uses neural embedding models to encode queries and documents into vectors; relevance is scored by dot product or cosine similarity between them. Full coverage (training objectives, domain adaptation, model selection) is in Embedding.
When to use: Natural language queries; paraphrase matching; semantic similarity matters.
Limitation: Weak on exact keyword matching, rare terms, and identifiers — combine with BM25/SPLADE for production systems.
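The scoring step of a bi-encoder can be illustrated as follows. This is only a sketch of the similarity computation: the three-dimensional vectors are made up, standing in for the output of an embedding model, and a production system would use an approximate nearest-neighbour index rather than scoring every document.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product of the two vectors divided by their norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (real models produce hundreds of dimensions).
query_emb = [0.9, 0.1, 0.3]
doc_embs = {
    "doc_a": [0.8, 0.2, 0.4],
    "doc_b": [0.1, 0.9, 0.2],
}

# Brute-force ranking: highest cosine similarity first.
ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Because the score depends only on vector geometry, two texts with no words in common can still match, which is also why exact identifiers and rare terms are not guaranteed to.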
6. Retrieval Method Comparison
| Method | Semantic Matching | Exact Match | Training Needed | Infrastructure | Best Use Case |
|---|---|---|---|---|---|
| TF-IDF | None | Good | No | Inverted index | Simple baseline |
| BM25 | None | Strong | No | Inverted index | Keyword-heavy queries; default lexical baseline |
| SPLADE | Via expansion | Strong | Yes | Inverted index | Neural sparse; efficient + semantic |
| Dense bi-encoder | Strong | Weak | Yes | Vector DB + ANN | Natural language queries; semantic similarity |
| Hybrid (dense + sparse) | Strong | Strong | Yes (dense part) | Both | Production RAG; best overall quality |