BM25 Reference
BM25 is covered in depth in Retrieval Methods. This page is a compact math reference.
Scoring Formula
\[
\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}
\]
\[
\text{IDF}(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5}
\]
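Here \(tf(t,d)\) is the number of times term \(t\) occurs in document \(d\), \(df(t)\) is the number of documents containing \(t\), \(N\) is the corpus size, \(|d|\) is the document length in tokens, and avgdl is the mean document length. The two formulas can be transcribed almost literally. The following is a minimal sketch, not a production scorer: the function names `idf` and `bm25_score`, the pre-tokenised list-of-lists corpus, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf(term, corpus):
    """IDF as written above: log((N - df + 0.5) / (df + 0.5))."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25_score(doc, query, corpus, k1=1.2, b=0.75):
    """Sum the per-term BM25 contributions of `query` for one tokenised `doc`."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    # Length normalisation factor: 1 - b + b * |d| / avgdl
    norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        freq = tf[term]
        if freq == 0:
            continue  # an absent term contributes nothing
        score += idf(term, corpus) * freq * (k1 + 1) / (freq + k1 * norm)
    return score


# Hypothetical toy corpus, already tokenised.
corpus = [
    ["deep", "learning", "tutorial"],
    ["cooking", "recipes"],
    ["travel", "guide"],
    ["deep", "sea", "fishing"],
    ["garden", "tips"],
]
query = ["deep", "learning"]
scores = [bm25_score(doc, query, corpus) for doc in corpus]
print(scores)
```

Note that this IDF turns negative once a term appears in more than half of the documents (\(df(t) > N/2\)), which is easy to hit in very small corpora.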
Parameters
| Parameter | Typical Value | Controls |
|---|---|---|
| k₁ | 1.2–2.0 | TF saturation speed: higher = slower saturation |
| b | 0.75 | Length normalisation strength: 0 = off, 1 = full |
Key Properties
- TF saturation: repeated terms give diminishing returns (prevents keyword stuffing)
- Length normalisation: long documents are penalised proportionally to b
- Modified IDF: more robust for very rare or very common terms than vanilla TF-IDF
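The first two properties can be seen by isolating the term-frequency factor of the scoring formula. The sketch below assumes a helper named `tf_component` (not part of any library) and made-up document lengths; it only evaluates the fraction from the formula for a few inputs.

```python
def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """The TF fraction of BM25: tf * (k1 + 1) / (tf + k1 * norm)."""
    norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * norm)


# TF saturation at average length (norm = 1): growth flattens towards k1 + 1 = 2.2.
saturation = [round(tf_component(tf, 100, 100), 3) for tf in (1, 2, 5, 10)]
print(saturation)  # [1.0, 1.375, 1.774, 1.964]

# Length normalisation: same tf, document twice the average length.
long_doc = tf_component(1, 200, 100)          # penalised with b = 0.75
long_doc_b0 = tf_component(1, 200, 100, b=0)  # b = 0 switches the penalty off
print(round(long_doc, 3), round(long_doc_b0, 3))  # 0.71 1.0
```

Going from one to ten occurrences less than doubles the factor, and it never exceeds \(k_1 + 1\), which is why repeating a term has limited effect.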
Worked Example
Query: "deep learning tutorial"
| Doc | Content | TF-IDF rank | BM25 rank | Why |
|---|---|---|---|---|
| D1 | "deep learning deep learning deep learning tutorial" | 1st | 2nd | TF saturates; long doc penalised |
| D2 | "deep learning tutorial" | 2nd | 1st | Concise; all terms present |
| D3 | "deep learning introduction overview" | 3rd | 3rd | Missing "tutorial" |
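The rankings in the table can be reproduced with a short script, under two assumptions that are not stated on this page. First, with only three documents the IDF formula above is negative for every query term (each appears in more than half of the corpus), so the sketch uses the non-negative variant \(\log\left(1 + \frac{N - df(t) + 0.5}{df(t) + 0.5}\right)\), the form used by Lucene. Second, "TF-IDF" is taken to mean raw term count times that same IDF. The parameters are \(k_1 = 1.2\) and \(b = 0.75\), and tokenisation is a plain whitespace split.

```python
import math
from collections import Counter

docs = {
    "D1": "deep learning deep learning deep learning tutorial".split(),
    "D2": "deep learning tutorial".split(),
    "D3": "deep learning introduction overview".split(),
}
query = "deep learning tutorial".split()

N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N


def idf(term):
    # ASSUMPTION: non-negative IDF variant, log(1 + (N - df + 0.5) / (df + 0.5)),
    # because the plain formula is negative for every term in this tiny corpus.
    df = sum(1 for d in docs.values() if term in d)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))


def tfidf(doc):
    # ASSUMPTION: "vanilla TF-IDF" = raw term count * IDF, summed over query terms.
    tf = Counter(doc)
    return sum(tf[t] * idf(t) for t in query)


def bm25(doc, k1=1.2, b=0.75):
    tf = Counter(doc)
    norm = 1 - b + b * len(doc) / avgdl
    return sum(idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * norm) for t in query)


def ranking(score):
    """Document names sorted from highest to lowest score."""
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)


tfidf_rank = ranking(tfidf)
bm25_rank = ranking(bm25)
print("TF-IDF:", tfidf_rank)  # ['D1', 'D2', 'D3']
print("BM25:  ", bm25_rank)   # ['D2', 'D1', 'D3']
```

Under these assumptions D1 wins on raw counts, but BM25 caps the benefit of the repeated "deep learning" and penalises D1 for being longer than average, so D2 moves to first place.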