BM25 Reference

BM25 is covered in depth in Retrieval Methods. This page is a compact math reference.

Scoring Formula

\[ \text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} \]
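\[ \text{IDF}(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5} \]

Here tf(t,d) is the number of times term t occurs in document d, |d| is the length of d in tokens, avgdl is the mean document length over the collection, N is the number of documents, and df(t) is the number of documents that contain t.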
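
The two formulas translate almost line for line into code. The following is a minimal sketch rather than a reference implementation: the function names `idf` and `bm25`, the whitespace tokenisation, and the five-document corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF exactly as written above; negative when df(t) > N / 2."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((n - df + 0.5) / (df + 0.5))


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenised document against a tokenised query."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    # Length normalisation factor: 1 for an average-length document.
    norm = 1 - b + b * len(doc) / avgdl
    score = 0.0
    for t in query:
        f = tf[t]  # 0 for absent terms, so they contribute nothing
        score += idf(t, docs) * f * (k1 + 1) / (f + k1 * norm)
    return score


# Hypothetical corpus, whitespace-tokenised; not taken from this page.
corpus = [s.split() for s in [
    "sparse retrieval with bm25",
    "dense retrieval with embeddings",
    "cooking pasta at home",
    "gardening tips for spring",
    "bm25 bm25 bm25 ranking",
]]
query = "bm25 ranking".split()
scores = [bm25(query, d, corpus) for d in corpus]
print(scores)
```

Both query terms occur in fewer than half of these documents, so every IDF value here is positive and the last document, which contains both terms, scores highest.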

Parameters

Parameter  Typical value  Controls
k₁         1.2–2.0        TF saturation speed; higher = slower saturation
b          0.75           Length normalisation strength; 0 = off, 1 = full
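
To see what the two parameters do numerically, the TF fraction can be evaluated on its own, without the IDF factor. This is an illustrative sketch: `tf_component` and the document lengths used are hypothetical, chosen only to make the effect visible.

```python
def tf_component(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """The fraction inside the BM25 sum, without the IDF factor."""
    norm = 1 - b + b * dl / avgdl
    return tf * (k1 + 1) / (tf + k1 * norm)


# Saturation: share of the ceiling (k1 + 1) reached at tf = 5
# for a document of average length.
for k1 in (1.2, 2.0):
    share = tf_component(5, k1=k1) / (k1 + 1)
    print(f"k1={k1}: {share:.2f} of ceiling at tf=5")

# Length normalisation: same tf, document twice the average length.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: {tf_component(3, b=b, dl=200):.3f}")
```

A larger k₁ leaves the score further from its ceiling at the same term count, and a larger b lowers the score of the longer-than-average document; with b = 0 the document length has no effect at all.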

Key Properties

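  • TF saturation: each additional occurrence of a term adds less than the previous one, and a term's contribution approaches a fixed ceiling (blunts keyword stuffing)
  • Length normalisation: documents longer than avgdl are penalised and shorter ones boosted, with strength set by b
  • Modified IDF: the 0.5 terms keep IDF finite even when df(t) = 0, where plain log(N/df) is undefined. As written, the value is negative for terms that occur in more than half the documents; some implementations avoid this by adding 1 inside the logarithm or flooring the value at zero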
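
Two of these properties can be read directly off the formulas. As a short worked derivation, using only the symbols defined in the scoring formula: letting the term frequency grow without bound shows the ceiling on the TF fraction, and comparing numerator and denominator inside the logarithm shows when the IDF changes sign.

\[ \lim_{tf(t,d) \to \infty} \frac{tf(t,d) \cdot (k_1 + 1)}{tf(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} = k_1 + 1 \]
\[ \text{IDF}(t) < 0 \iff N - df(t) + 0.5 < df(t) + 0.5 \iff df(t) > \frac{N}{2} \]

So no term can contribute more than IDF(t) · (k₁ + 1) to a document's score, however often it is repeated.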

Worked Example

Query: "deep learning tutorial"

Doc  Content                                               TF-IDF rank  BM25 rank  Why
D1   "deep learning deep learning deep learning tutorial"  1st          2nd        TF saturates; long doc penalised
D2   "deep learning tutorial"                              2nd          1st        Concise; all terms present
D3   "deep learning introduction overview"                 3rd          3rd        Missing "tutorial"
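
The BM25 column of the table can be checked with a few lines of code. This sketch rests on two assumptions the table does not state: k₁ = 1.2 and b = 0.75, and an IDF with 1 added inside the logarithm. The second matters because every query term occurs in more than half of these three documents, so the IDF exactly as given in the Scoring Formula section would be negative for all of them.

```python
import math
from collections import Counter

docs = {
    "D1": "deep learning deep learning deep learning tutorial".split(),
    "D2": "deep learning tutorial".split(),
    "D3": "deep learning introduction overview".split(),
}
query = "deep learning tutorial".split()
k1, b = 1.2, 0.75  # assumed values; the table does not state them

n = len(docs)
avgdl = sum(len(d) for d in docs.values()) / n


def idf(term):
    # Assumption: "+1 inside the log" variant, so the weight stays
    # positive even though df(t) > N / 2 for every query term here.
    df = sum(1 for d in docs.values() if term in d)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def bm25(doc):
    tf = Counter(doc)
    norm = 1 - b + b * len(doc) / avgdl
    return sum(idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * norm) for t in query)


scores = {name: bm25(doc) for name, doc in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

Under these assumptions the order comes out as D2, D1, D3, matching the BM25 rank column. With the unmodified IDF every query term receives a negative weight in this tiny corpus and the ordering no longer matches the table, so the example is best read as illustrating behaviour in a realistically sized collection.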