TF-IDF Reference

TF-IDF (Term Frequency–Inverse Document Frequency) is the precursor to BM25. It is rarely used directly in production RAG — know it as the baseline BM25 improves upon.

Formula

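TF(t, d)      = count(t in d) / total terms in d
IDF(t)        = log(N / (1 + df(t)))
TF-IDF(t, d)  = TF(t, d) × IDF(t)

where N is the number of documents in the corpus and df(t) is the number of documents that contain t. The 1 in the denominator avoids division by zero for terms that appear in no document.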
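
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the whitespace tokenisation, the natural-log base, the toy corpus, and the function names tf, idf and tf_idf are all assumptions made for the example.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """count(t in d) / total terms in d, for one tokenised document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(N / (1 + df(t))); natural log is an assumption here."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / (1 + df))

def tf_idf(term, doc_tokens, corpus):
    """TF(t, d) x IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (hypothetical), tokenised by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
    "the weather is sunny today".split(),
]

print(tf_idf("cat", corpus[0], corpus))  # 1/6 * log(4/3)
print(tf_idf("the", corpus[0], corpus))  # idf is log(4/4), so the score is 0
```

With this corpus, "cat" appears in 2 of 4 documents and scores about 0.048 in the first document, while "the" appears in 3 of 4 and scores exactly 0, which shows how IDF discounts common terms.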

Why BM25 Replaced It

Weakness | BM25 Fix
TF grows linearly, so keyword stuffing raises the score | Saturating TF, controlled by the k₁ parameter
Length normalisation is crude or absent: raw-count TF favours longer docs, and dividing by document length cannot be tuned | Explicit, tunable length normalisation via the b parameter
IDF formula unstable for very rare/common terms | Modified IDF: log((N − df + 0.5)/(df + 0.5))
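
To make the saturation and length-normalisation fixes concrete, here is a small sketch of the per-term BM25 components. It uses the standard BM25 term-frequency form; the defaults k1=1.2 and b=0.75 are commonly used values assumed for illustration, and the function names bm25_tf and bm25_idf are hypothetical.

```python
import math

def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating, length-normalised TF component of BM25.

    k1 controls how quickly repeated occurrences stop adding score;
    b controls how strongly document length is penalised (0 = not at all).
    """
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + k1 * norm)

def bm25_idf(n_docs, df):
    """Modified IDF from the table: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

# Saturation: for an average-length document the TF component
# climbs towards k1 + 1 instead of growing without bound.
for freq in (1, 2, 5, 20, 100):
    print(freq, round(bm25_tf(freq, doc_len=100, avg_doc_len=100), 3))

# Length normalisation: the same term count scores lower in a longer doc.
print(bm25_tf(5, doc_len=200, avg_doc_len=100))
print(bm25_tf(5, doc_len=100, avg_doc_len=100))
```

The full BM25 score for a query is the sum over query terms of bm25_idf times bm25_tf. Repeating a keyword 100 times barely beats repeating it 20 times, which is the behaviour that blunts keyword stuffing.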

For RAG interviews: know TF-IDF as context for why BM25 exists, not as a standalone topic. See BM25 and Retrieval Methods.