BM25 Reference

BM25 is covered in depth in Retrieval Methods. This page is a compact math reference.

Scoring Formula

\[ \text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} \]

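\[ \text{IDF}(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5} \]

Here tf(t,d) is the number of times term t occurs in document d, |d| is the length of d in tokens, avgdl is the average document length over the collection, N is the number of documents in the collection, and df(t) is the number of documents that contain t.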
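
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the toy corpus, the pre-tokenised list-of-strings input and the default k1 = 1.2 are all assumptions made for the example.

```python
import math
from collections import Counter


def idf(term, docs):
    """Classic BM25 IDF as written above; negative when df > N/2."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenised document against a tokenised query."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    # Length normalisation term: equals 1 when len(doc) == avgdl.
    length_norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        f = tf[term]  # 0 for terms absent from the document
        score += idf(term, docs) * f * (k1 + 1) / (f + k1 * length_norm)
    return score


# Hypothetical toy corpus, already tokenised.
corpus = [
    ["sparse", "retrieval", "with", "bm25"],
    ["dense", "retrieval", "with", "embeddings"],
    ["cooking", "pasta", "at", "home"],
    ["gardening", "tips", "for", "spring"],
    ["bm25", "bm25", "ranking", "function"],
]
query = ["bm25", "ranking"]
scores = [bm25(query, doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(f"{score:.4f}  {' '.join(doc)}")
```

Documents that contain none of the query terms score exactly zero, because every summand has tf(t,d) = 0 in the numerator.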

Parameters

Parameter	Typical Value	Controls
k₁	1.2–2.0	TF saturation speed: higher = slower saturation; at k₁ = 0 only the presence of a term counts
b	0.75	Length normalisation strength: 0 = off, 1 = full
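
To get a feel for the two parameters, it can help to evaluate the per-term factor on its own, without the IDF weight. The sketch below does that; `tf_component` and the document lengths (100 and 200 tokens) are hypothetical choices made for illustration.

```python
def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Per-term BM25 factor, without the IDF weight."""
    length_norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Effect of k1 for a document of average length (length term is 1).
for k1 in (0.5, 1.2, 2.0):
    row = [round(tf_component(tf, 100, 100, k1=k1), 3) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {row}  (upper bound {k1 + 1})")

# Effect of b for a document twice the average length, with tf = 3.
for b in (0.0, 0.5, 0.75, 1.0):
    print(f"b={b}: {tf_component(3, 200, 100, b=b):.3f}")
```

With a larger k₁ the factor approaches its ceiling of k₁ + 1 more slowly, and with a larger b the same term count is worth less in a longer-than-average document.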

Key Properties

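TF saturation: each extra occurrence of a term adds less than the one before, and the per-term factor never exceeds k₁ + 1, which limits the payoff from keyword stuffing
Length normalisation: documents longer than avgdl are scored down and shorter ones are scored up, with the strength set by b
Modified IDF: the 0.5 terms smooth the weight for very rare and very common terms compared with the plain log(N/df) of vanilla TF-IDF; as written, the weight turns negative once a term appears in more than half the documents, so some implementations clamp it or use log(1 + ...) instead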
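
These properties can be read off the formulas directly. As a sketch, write K for the length-dependent part of the denominator (K is shorthand introduced here, not notation used elsewhere on this page):

\[ K = k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right), \qquad \frac{tf \cdot (k_1 + 1)}{tf + K} = (k_1 + 1)\left(1 - \frac{K}{tf + K}\right) \xrightarrow{\; tf \to \infty \;} k_1 + 1 \]

\[ |d| = \text{avgdl} \;\Rightarrow\; K = k_1, \qquad b = 0 \;\Rightarrow\; K = k_1 \text{ for every } |d| \]

\[ \text{IDF}(t) < 0 \iff N - df(t) + 0.5 < df(t) + 0.5 \iff df(t) > \frac{N}{2} \]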

Worked Example

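Query: "deep learning tutorial"

The ranks below are illustrative. "TF-IDF" means raw term count times IDF with no length normalisation, and all three query terms are assumed to carry a positive IDF weight.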

Doc	Content	TF-IDF rank	BM25 rank	Why
D1	"deep learning deep learning deep learning tutorial"	1st	2nd	TF saturates; long doc penalised
D2	"deep learning tutorial"	2nd	1st	Concise; all terms present
D3	"deep learning introduction overview"	3rd	3rd	Missing "tutorial"
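
The table can be checked numerically with a short script. Several assumptions go into this sketch: the three documents are treated as the entire collection, tokens are split on whitespace, k1 = 1.2 and b = 0.75, and the IDF uses the non-negative log(1 + ...) variant rather than the formula on this page, because with only three documents "deep" and "learning" occur in more than half of them and would otherwise get negative weights. The TF-IDF baseline is raw term count times that same IDF, which is one choice among many.

```python
import math
from collections import Counter

docs = {
    "D1": "deep learning deep learning deep learning tutorial".split(),
    "D2": "deep learning tutorial".split(),
    "D3": "deep learning introduction overview".split(),
}
query = "deep learning tutorial".split()

N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N


def idf_nonneg(term):
    # Assumption: log(1 + ...) variant, NOT the IDF formula on this page,
    # so that terms present in most documents keep a positive weight.
    df = sum(1 for d in docs.values() if term in d)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))


def bm25(doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    norm = 1 - b + b * len(doc) / avgdl
    return sum(idf_nonneg(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * norm) for t in query)


def raw_tfidf(doc):
    # Unnormalised baseline: raw term count times the same IDF weight.
    tf = Counter(doc)
    return sum(idf_nonneg(t) * tf[t] for t in query)


bm25_rank = sorted(docs, key=lambda name: bm25(docs[name]), reverse=True)
tfidf_rank = sorted(docs, key=lambda name: raw_tfidf(docs[name]), reverse=True)
print("BM25:  ", bm25_rank)
print("TF-IDF:", tfidf_rank)
```

Under these assumptions the script puts D2 ahead of D1 for BM25 and D1 ahead of D2 for the raw TF-IDF baseline, with D3 last in both, which matches the rank columns above.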