
Tree of Thoughts

1. Overview

Tree of Thoughts (ToT) is a prompting framework introduced by Yao et al. (NeurIPS 2023) that frames problem-solving as a search over a tree of intermediate reasoning steps ("thoughts"). Unlike Chain-of-Thought, which commits to a single linear path, ToT generates multiple candidate thoughts at each step, evaluates them, and uses a search algorithm (BFS or DFS) to explore the most promising branches — backtracking when a path looks unpromising.


2. Motivation: Where CoT Falls Short

CoT works well when each step has an obvious next move. It fails on problems that require:

  • Lookahead — the correct step now depends on what it enables later
  • Exploration — multiple plausible next moves with very different outcomes
  • Backtracking — recognising a path is wrong and trying a different one

Paper benchmark: Game of 24 (use four numbers and arithmetic to reach 24). CoT with GPT-4 solved only 4% of problems. ToT solved 74%.


3. How ToT Works

Problem
  ├── Thought A  →  eval: promising  →  expand
  │     ├── Thought A1  →  eval: good  →  solution
  │     └── Thought A2  →  eval: dead end  →  backtrack
  ├── Thought B  →  eval: weak  →  prune
  └── Thought C  →  eval: promising  →  expand
        └── ...

3.1 Thought Generation

Two strategies from the paper:

  • Sampling — generate k thoughts independently in parallel (temperature > 0). Use when thoughts are short and diverse (e.g. a single sentence).
  • Proposing — generate thoughts sequentially in a single completion, using a "propose prompt". Use when thoughts are longer or need to be distinct from each other.
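The two strategies can be sketched in a few lines. Here `llm` is a hypothetical callable (prompt in, completion string out) standing in for any real model client; the function names are illustrative, not from the paper's code.

```python
def generate_by_sampling(llm, prompt, k=5):
    """Sample k thoughts independently (assumes llm samples with temperature > 0)."""
    return [llm(prompt) for _ in range(k)]

def generate_by_proposing(llm, propose_prompt):
    """Ask for several distinct thoughts in one completion, one per line."""
    completion = llm(propose_prompt)
    return [line.strip() for line in completion.splitlines() if line.strip()]
```

Sampling relies on temperature for diversity; proposing puts all candidates in one context so the model can make them deliberately different.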

3.2 Thought Evaluation

The LLM evaluates each thought to guide the search — this replaces the hand-crafted heuristic in classical search:

  • Value: prompt the LLM to score each thought independently (e.g. 1–10, or sure/likely/impossible)
  • Vote: prompt the LLM to compare thoughts and pick the most promising one
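Both evaluation modes reduce to one extra LLM call per step. A minimal sketch, again with a hypothetical `llm` callable and the paper's sure/likely/impossible labels mapped to numeric scores (the mapping values are an assumption for illustration):

```python
# Numeric scores for the paper's three-way labels (illustrative values).
SCORE = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def value(llm, thought):
    """Score one thought independently (the 'value' strategy)."""
    reply = llm(f"Can this lead to a solution? Answer sure/likely/impossible.\n{thought}")
    return SCORE.get(reply.strip().lower(), 0.0)

def vote(llm, thoughts):
    """Ask the LLM to pick the most promising thought by index (the 'vote' strategy)."""
    listing = "\n".join(f"{i}: {t}" for i, t in enumerate(thoughts))
    reply = llm(f"Which thought is most promising? Reply with its index only.\n{listing}")
    return thoughts[int(reply.strip())]
```

Value scales linearly with the number of thoughts; vote uses one call per step but needs all candidates to fit in context.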

3.3 Search Algorithm

  • BFS: expand all thoughts at depth d before going to d+1; keeps the top-k states at each level; better for problems needing global comparison
  • DFS: explore one branch fully before backtracking; more memory-efficient but risks getting stuck

Key hyperparameters: tree depth, branching factor (number of thoughts per step), and number of states kept at each BFS level.
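The BFS variant with all three hyperparameters can be written as a short loop. In this sketch, `generate` and `evaluate` are toy stand-ins for the LLM calls: they grow bit-strings and score the count of 1s, purely to show the control flow of expand, score, and prune.

```python
def tot_bfs(root, generate, evaluate, depth=3, keep=2):
    """BFS-style Tree of Thoughts: expand every frontier state,
    score all children, and keep only the top-`keep` at each level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for state in frontier for t in generate(state)]
        scored = sorted(candidates, key=evaluate, reverse=True)
        frontier = scored[:keep]          # prune to the best states
    return frontier[0] if frontier else None

# Toy problem: build the 3-bit string with the most 1s.
generate = lambda s: [s + "0", s + "1"]   # branching factor 2
evaluate = lambda s: s.count("1")
print(tot_bfs("", generate, evaluate, depth=3, keep=2))  # -> 111
```

Swapping the toy lambdas for LLM-backed generation and evaluation (as in sections 3.1 and 3.2) gives the method used on Game of 24.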


3.4 Worked Example — Game of 24 (numbers: 4, 9, 10, 13)

Game of 24: combine four numbers with +, −, ×, ÷ to make 24.
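Game-of-24 states are small enough to check exactly by brute force. The checker below is the ground truth that the LLM's sure/likely/impossible labels approximate; it is a sketch added here for illustration, not part of the ToT method itself.

```python
from itertools import permutations

def can_reach(numbers, target=24, eps=1e-6):
    """True if the numbers can be combined with +, -, *, / to hit target."""
    if len(numbers) == 1:
        return abs(numbers[0] - target) < eps
    for a, b in permutations(numbers, 2):      # ordered pairs cover a-b and b-a
        rest = list(numbers)
        rest.remove(a)
        rest.remove(b)
        results = [a + b, a - b, a * b]
        if abs(b) > eps:                       # guard division by zero
            results.append(a / b)
        if any(can_reach(rest + [r], target, eps) for r in results):
            return True
    return False

# Matches the evaluations below:
# can_reach([4, 4, 10])  -> True   (10 - 4 = 6, then 4 * 6 = 24)
# can_reach([4, 10, 22]) -> False
```

In a real ToT run this exact check is unavailable (or the point would be moot); the LLM evaluator is a cheap, fallible substitute for it.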

Step 1 — Generate candidate first moves (branching factor k=5):

The model proposes several ways to combine two of the four numbers:

Candidate 1: 13 - 9 = 4   → remaining: 4, 4, 10
Candidate 2: 10 - 4 = 6   → remaining: 6, 9, 13
Candidate 3: 9 + 4 = 13   → remaining: 10, 13, 13
Candidate 4: 13 + 9 = 22  → remaining: 4, 10, 22
Candidate 5: 4 × 9 = 36   → remaining: 10, 13, 36

Step 2 — Evaluate each candidate:

The LLM scores each state: "Given these remaining numbers, how likely is it that they can be combined to reach 24? Answer sure / likely / impossible."

13 - 9 = 4  → remaining 4, 4, 10  → "sure"   ✓ keep
10 - 4 = 6  → remaining 6, 9, 13  → "likely" ✓ keep
9  + 4 = 13 → remaining 10,13,13  → "likely" ✓ keep
13 + 9 = 22 → remaining 4, 10, 22 → "impossible" ✗ prune
4  × 9 = 36 → remaining 10,13,36  → "impossible" ✗ prune

Step 3 — Expand the kept states (BFS moves to depth 2):

For state [4, 4, 10], the model proposes next moves:

4 × 4 = 16  → remaining: 10, 16 → eval: "impossible" ✗ prune (no combination of 10 and 16 gives 24)
10 - 4 = 6  → remaining: 4, 6   → eval: "sure" ✓ keep (4 × 6 = 24)
4 + 4 = 8   → remaining: 8, 10  → eval: "impossible" ✗ prune
10 + 4 = 14 → remaining: 4, 14  → eval: "impossible" ✗ prune
10 × 4 = 40 → remaining: 4, 40  → eval: "impossible" ✗ prune

Step 4 — Backtracking in DFS:

If the model is using DFS and reaches a dead end (e.g. remaining numbers [6, 6] with no way to make 24), it backs up to the parent node and tries the next unexplored branch — exactly like a human scratching out a dead-end calculation and trying a different first step.
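The backtracking mechanics are easiest to see in code. This sketch runs DFS over Game-of-24 states, where each state pairs a number with the expression that produced it; a dead end returns None and control falls through to the next sibling branch, exactly the "scratch it out and try again" behaviour described above. (It searches exhaustively rather than calling an LLM; it illustrates the search skeleton, not the paper's implementation.)

```python
def solve_24(states, eps=1e-6):
    """DFS with backtracking. `states` is a list of (value, expression) pairs;
    returns an expression string that evaluates to 24, or None."""
    if len(states) == 1:
        val, expr = states[0]
        return expr if abs(val - 24) < eps else None   # dead end -> backtrack
    for i in range(len(states)):
        for j in range(len(states)):
            if i == j:
                continue
            (a, ea), (b, eb) = states[i], states[j]
            rest = [states[k] for k in range(len(states)) if k not in (i, j)]
            ops = [(a + b, f"({ea}+{eb})"),
                   (a - b, f"({ea}-{eb})"),
                   (a * b, f"({ea}*{eb})")]
            if abs(b) > eps:
                ops.append((a / b, f"({ea}/{eb})"))
            for val, expr in ops:
                found = solve_24(rest + [(val, expr)], eps)
                if found:
                    return found        # first full solution propagates up
    return None                         # every branch failed: backtrack

print(solve_24([(4, "4"), (9, "9"), (10, "10"), (13, "13")]))
```

An LLM-guided DFS would replace the exhaustive loops with proposed thoughts and skip branches the evaluator labels "impossible".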

Final solution found: 13 − 9 = 4 → 10 − 4 = 6 → 4 × 6 = 24 ✓, i.e. (13 − 9) × (10 − 4) = 24.


4. Results from the Paper

Task              CoT (best)   ToT        Notes
Game of 24        4%           74%        GPT-4; ToT used BFS with k=5
Mini Crosswords   16%          60%        5×5 crossword; DFS with backtracking
Creative Writing  ~6.9/10      ~7.6/10    Human coherence rating; smaller gap

The crossword result illustrates backtracking: the model fills a word, discovers it conflicts with a crossing word, and backs up to try a different fill — something a linear CoT cannot do.


5. Cost vs. Benefit

ToT is significantly more expensive than CoT:

  • Game of 24 required ~$1 of API calls per problem (GPT-4 pricing at time of paper)
  • Branching factor k=5 and depth=3 means up to 5³ = 125 LLM calls per problem vs. 1 for CoT

When ToT is worth it: problems with discrete, verifiable intermediate states where backtracking has clear value (puzzles, planning, formal reasoning). For open-ended generation or tasks where CoT already works well, the cost is rarely justified.
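The call-count arithmetic above can be made concrete. A rough sketch (counting generation calls only; evaluation adds a comparable number) shows why BFS pruning matters: without it the count is k^d, while keeping only b states per level caps growth at roughly b·k calls per level.

```python
def bfs_llm_calls(k, d, keep=None):
    """Rough upper bound on generation calls for a ToT run with
    branching factor k and depth d. With keep=None there is no
    pruning (worst case k**d); otherwise BFS keeps `keep` states."""
    if keep is None:
        return k ** d
    calls, frontier = 0, 1
    for _ in range(d):
        calls += frontier * k                  # expand every kept state
        frontier = min(frontier * k, keep)     # prune to the top-`keep`
    return calls

print(bfs_llm_calls(5, 3))          # -> 125 (the unpruned 5^3 figure above)
print(bfs_llm_calls(5, 3, keep=5))  # -> 55  (pruning cuts the bound by more than half)
```

Either way, the gap to CoT's single call is large, which is why the cost-benefit question matters.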


6. Extensions

Graph of Thoughts (Besta et al., 2024) — generalises ToT from a tree to an arbitrary directed graph. Thoughts from different branches can be merged (aggregated), allowing the model to combine partial solutions. Improves performance on sorting and set operations where results from parallel branches need to be combined.
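A toy illustration of what aggregation buys: two parallel branches each sort half of a list (partial solutions), and an aggregation node merges them. The merged thought has two parents, which a strict tree cannot express. (This is a sketch of the idea on GoT's sorting task, not code from the GoT paper.)

```python
def merge(left, right):
    """Aggregate two sorted partial solutions into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

data = [9, 4, 13, 10, 6, 1]
branch_a = sorted(data[:3])   # one branch's partial solution
branch_b = sorted(data[3:])   # a parallel branch's partial solution
print(merge(branch_a, branch_b))  # -> [1, 4, 6, 9, 10, 13]
```

In GoT, each branch's work and the merge itself would be LLM thoughts; the graph structure is what lets the merge node read from both branches at once.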


Technique         Structure        Evaluation       Backtracking   Cost
CoT               Linear chain     None             No             1×
Self-Consistency  Parallel chains  Majority vote    No             k×
ToT               Tree             LLM value/vote   Yes            up to k^d ×
GoT               Graph            LLM value        Yes            higher

Source: Yao et al. (2023) — "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023. [arXiv:2305.10601]