Beam Search - Interview Prep Guide¶
1. Overview¶
Beam search maintains K candidate sequences (beams) at each decoding step and ultimately returns the candidate with the highest cumulative probability. It interpolates between greedy decoding (K=1) and exhaustive search over all |V|^T possible sequences, which is intractable in practice.
Key insight: Explore multiple promising paths simultaneously to avoid getting stuck in locally optimal but globally suboptimal sequences.
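This insight is easiest to see on a tiny two-step example (the tokens and probabilities below are invented for illustration). Greedy commits to the locally best first token; scoring whole sequences, as beam search does, finds a better total. With only four possible sequences, a width-2 beam here amounts to exhaustive search:

```python
import math

# P(token1) and P(token2 | token1) for a toy two-step vocabulary
step1 = {"nice": 0.5, "dog": 0.4}
step2 = {"nice": {"woman": 0.4, "house": 0.3},
         "dog": {"and": 0.9, "runs": 0.05}}

def seq_logprob(t1, t2):
    # Cumulative log-probability of the two-token sequence
    return math.log(step1[t1]) + math.log(step2[t1][t2])

# Greedy: pick the best token at each step independently
g1 = max(step1, key=step1.get)            # "nice" (0.5 beats 0.4)
g2 = max(step2[g1], key=step2[g1].get)    # "woman"

# Sequence-level search: score every full path
best = max(((t1, t2) for t1 in step1 for t2 in step2[t1]),
           key=lambda p: seq_logprob(*p))
print(best)   # ('dog', 'and'): 0.4 * 0.9 = 0.36 beats greedy's 0.5 * 0.4 = 0.20
```

Greedy's first choice ("nice", 0.5) looks better than "dog" (0.4), but "dog" leads to a much higher-probability continuation, so the sequence-level search wins.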
2. How It Works¶
Algorithm¶
```python
import numpy as np

def beam_search(model, prompt, eos_id, beam_width=5, max_length=50):
    # Each beam is a (token_sequence, cumulative_log_prob) pair
    beams = [(list(prompt), 0.0)]
    for _ in range(max_length):
        candidates = []
        # Expand each beam
        for seq, score in beams:
            if seq[-1] == eos_id:  # Completed beams carry over unchanged
                candidates.append((seq, score))
                continue
            logits = model(seq)
            # Stable log-softmax: work in log space to avoid underflow
            shifted = logits - np.max(logits)
            log_probs = shifted - np.log(np.sum(np.exp(shifted)))
            # Consider every vocabulary token as a continuation
            for token_id, lp in enumerate(log_probs):
                candidates.append((seq + [token_id], score + lp))
        # Keep the K highest-scoring candidates
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        # Stop once every beam has emitted EOS
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]  # Best sequence by cumulative log-probability
```
Core Concepts¶
Beam width (K):
- K=1: Greedy decoding
- K=5-10: Typical for translation
- K=50+: Exhaustive (expensive)
Cumulative log probability:
- Use log probabilities to avoid numerical underflow
- Score = log(P₁) + log(P₂) + ... + log(Pₙ)
- Equivalent to log(P₁ × P₂ × ... × Pₙ)
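A quick sketch of why log space matters: multiplying even a moderate number of probabilities underflows float64, while summing their logs stays finite.

```python
import math

probs = [0.3] * 1000         # 1000 tokens, each with probability 0.3
product = 1.0
for p in probs:
    product *= p             # underflows: smaller than float64 can represent
log_sum = sum(math.log(p) for p in probs)   # 1000 * log(0.3), perfectly finite
print(product, log_sum)      # 0.0 vs roughly -1204
```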
3. Example Walkthrough¶
Prompt: "The cat sat on the"
Beam width: 3
Step 1: First Token¶
Model probabilities:
| Token | Probability | Log Prob |
|---|---|---|
| mat | 0.40 | -0.92 |
| floor | 0.25 | -1.39 |
| sofa | 0.15 | -1.90 |
| bed | 0.10 | -2.30 |
Top 3 beams:
- "The cat sat on the mat" → score: -0.92
- "The cat sat on the floor" → score: -1.39
- "The cat sat on the sofa" → score: -1.90
Step 2: Second Token¶
Expand each beam:
Beam 1: "mat" + next token
- "mat ." → -0.92 + (-0.51) = -1.43
- "mat and" → -0.92 + (-1.61) = -2.53
Beam 2: "floor" + next token
- "floor ." → -1.39 + (-0.36) = -1.75
- "floor while" → -1.39 + (-1.20) = -2.59
Beam 3: "sofa" + next token
- "sofa ." → -1.90 + (-0.41) = -2.31
- "sofa when" → -1.90 + (-0.51) = -2.41
New top 3 beams:
- "The cat sat on the mat ." → -1.43
- "The cat sat on the floor ." → -1.75
- "The cat sat on the sofa ." → -2.31
Final output: "The cat sat on the mat."
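The pruning step above can be verified in a few lines (scores copied from the walkthrough):

```python
# Candidate scores from Step 2: parent beam score + next-token log-prob
candidates = {
    "mat .":       -0.92 + -0.51,
    "mat and":     -0.92 + -1.61,
    "floor .":     -1.39 + -0.36,
    "floor while": -1.39 + -1.20,
    "sofa .":      -1.90 + -0.41,
    "sofa when":   -1.90 + -0.51,
}
# Keep the top 3 by cumulative log-probability
top3 = sorted(candidates, key=candidates.get, reverse=True)[:3]
print(top3)   # ['mat .', 'floor .', 'sofa .']
```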
4. Key Characteristics¶
Explores Multiple Paths¶
Unlike greedy, beam search maintains K hypotheses:
- Can recover from locally suboptimal choices
- Considers alternative continuations
- Better global optimization
Length Bias Problem¶
Longer sequences accumulate more negative log probabilities:
Seq 1 (length 5): -0.5 + -0.6 + -0.4 + -0.5 + -0.3 = -2.3
Seq 2 (length 3): -0.5 + -0.6 + -0.4 = -1.5
Seq 2 has higher score despite being incomplete!
Solution: Length normalization
```python
normalized_score = score / length**alpha
# alpha = 0.6-0.8 typical
# alpha = 0: no normalization
# alpha = 1: full normalization
```
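As a quick check of the formula, applying it to the two sequences above (with full normalization, α = 1, the complete 5-token sequence now wins):

```python
def normalized(score, length, alpha):
    # Length-normalized beam score: score / length^alpha
    return score / (length ** alpha)

short = normalized(-1.5, 3, alpha=1.0)   # -0.50
long = normalized(-2.3, 5, alpha=1.0)    # -0.46
print(long > short)                      # True: longer sequence now preferred
```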
5. Common Problems¶
Problem 1: Reduced Diversity¶
All beams often converge to similar outputs:
Prompt: "I think that"
Beam 1: "I think that we should focus on..."
Beam 2: "I think that we need to consider..."
Beam 3: "I think that this is important..."
All beams start with safe, high-probability tokens → similar continuations.
Why: Beam search favors safe, high-probability paths over diverse, creative paths.
Problem 2: Generic Outputs¶
In open-ended generation:
Prompt: "Tell me a story about"
Greedy: "a boy who lived in"
Beam (K=5): "a young boy who lived in a small town"
Beam search produces grammatically perfect but boring text.
Problem 3: Computational Cost¶
- Memory: O(K × T) for the K beams themselves; the naive expansion above also materializes O(K × V) candidates per step
- Time: O(K × V) per step (vs O(V) for greedy)
- K=10 → 10× slower than greedy
6. Length Normalization¶
Without Normalization¶
```python
# Sorting by raw cumulative log-probability favors shorter sequences
beams = sorted(candidates, key=lambda x: x[1], reverse=True)
```
With Normalization¶
```python
def length_penalty(length, alpha=0.6):
    # Google NMT (Wu et al., 2016) length penalty
    return ((5 + length) / 6) ** alpha

def normalized_score(candidate, alpha=0.6):
    seq, score = candidate
    return score / length_penalty(len(seq), alpha)

beams = sorted(candidates, key=normalized_score, reverse=True)[:beam_width]
```
Effect:
- Encourages longer, more complete sequences
- Prevents premature termination
- Essential for translation and summarization
7. When to Use Beam Search¶
✅ Good Use Cases¶
Structured tasks with clear objectives:
- Machine translation
- Automatic speech recognition (ASR)
- Image captioning
- Summarization
- Question answering (extractive)
When correctness matters more than creativity:
- Technical documentation generation
- Code comment generation
- Medical report generation
❌ Poor Use Cases¶
Creative or conversational tasks:
- Story writing
- Dialogue systems
- Chatbots
- Poetry generation
Tasks requiring diversity:
- Brainstorming
- Multiple solution generation
- Creative writing
8. Variants and Improvements¶
Diverse Beam Search¶
Force beams to be dissimilar using diversity penalty:
```python
# Pseudocode: per-group score adjustment
diversity_penalty = 0.5
for beam_group in groups:
    # Penalize tokens already chosen by earlier groups at this step
    adjusted_score = score - diversity_penalty * overlap_count
```
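A self-contained toy of the penalty (token scores invented for illustration): two beam groups pick from the same distribution, and the penalty pushes the second group off the first group's choice.

```python
log_probs = {"the": -0.5, "a": -0.7, "one": -1.2}
diversity_penalty = 0.5
chosen = []
for _ in range(2):   # two beam groups, each of width 1
    # Subtract a penalty for tokens earlier groups already picked
    adjusted = {t: lp - diversity_penalty * chosen.count(t)
                for t, lp in log_probs.items()}
    chosen.append(max(adjusted, key=adjusted.get))
print(chosen)   # ['the', 'a'] rather than ['the', 'the']
```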
Constrained Beam Search¶
Force inclusion of specific tokens/phrases:
```python
# E.g., must include "climate change" in summary
constraints = ["climate", "change"]
# Only keep beams that satisfy the constraints
```
Stochastic Beam Search¶
Add randomness to beam selection for more diversity.
9. Interview Questions¶
Q1: What is beam search and how does it differ from greedy decoding?¶
Answer: Beam search maintains K candidate sequences (beams) instead of just one. At each step, it expands all beams, scores all possible continuations, and keeps the top K. This allows exploration of multiple paths and can recover from locally suboptimal choices, unlike greedy which commits to a single path.
Q2: Why use log probabilities instead of regular probabilities?¶
Answer: Two reasons:
- Numerical stability: Multiplying many small probabilities (0.3 × 0.4 × 0.2...) quickly underflows to zero in floating point
- Computational efficiency: Log transforms products to sums: log(P₁ × P₂) = log(P₁) + log(P₂), which is more stable and efficient
Q3: What is the length bias problem in beam search?¶
Answer: Longer sequences accumulate more negative log probabilities, making them score lower than shorter sequences even if they're more complete. For example, a 10-token sequence might score -8.5 while a 5-token incomplete sequence scores -3.2. Length normalization (dividing by sequence length raised to α) addresses this by favoring complete sentences.
Q4: Why does beam search produce less diverse outputs than sampling?¶
Answer: Beam search is deterministic and risk-averse—it keeps the K highest-probability sequences. This means all beams tend to follow safe, high-probability paths, leading to similar outputs. Rare but creative continuations are discarded early. Sampling methods can explore lower-probability tokens, leading to more diversity.
Q5: What's the computational complexity of beam search?¶
Answer:
- Time per step: O(K × V) where K=beam width, V=vocab size
- Greedy: O(V), so beam is K× slower
- Memory: O(K × T) to store K beams of length T
- Total: For T steps, O(K × V × T) time
Typical K=5-10 for translation, but this 5-10× slowdown is significant.
Q6: How do you choose the optimal beam width K?¶
Answer: Trade-off between quality and speed:
- K=1: Greedy (fast, low quality)
- K=5-10: Standard for translation (good balance)
- K=50+: Diminishing returns, very slow
Empirically, quality plateaus around K=10 for most tasks. Beyond that, you get marginal gains for significant computational cost.
Q7: Can beam search guarantee finding the optimal sequence?¶
Answer: No. Beam search is a heuristic that prunes the search space. It only explores the top K paths at each step, potentially discarding paths that could lead to the globally optimal sequence later. Exhaustive search over all |V|^T possible sequences is computationally infeasible, so beam search is a practical approximation.
Q8: What is diverse beam search and when is it useful?¶
Answer: Diverse beam search forces different beam groups to explore different areas of the search space by penalizing similarity. It's useful when you need multiple distinct outputs (e.g., generating K different translations or summaries). Standard beam search often produces K very similar sequences, which isn't helpful for diversity.
Q9: When would you use beam search over sampling methods like top-p?¶
Answer: Use beam search for:
- Objective quality metrics (BLEU, ROUGE) that correlate with likelihood
- Structured outputs (translation, ASR) with one correct answer
- Deterministic requirements (reproducibility)
Use sampling for:
- Creative tasks requiring diversity
- Conversational AI needing personality
- Open-ended generation where many good answers exist
Q10: How does beam search handle the EOS token?¶
Answer: When a beam generates EOS (end-of-sequence), it's marked as complete:
- Complete beams stop expanding
- They remain in the candidate pool with their final score
- Active beams continue generating
- Search terminates when all K beams complete or max_length reached
- Return the complete beam with highest normalized score
Some implementations use length normalization to fairly compare complete sequences of different lengths.
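A common implementation detail (a sketch, not tied to any particular library): keep a separate `finished` pool so completed beams stop competing for the K active slots.

```python
def prune(candidates, finished, beam_width, eos_id):
    # Move completed candidates to `finished`; keep top-K of the rest active
    active = []
    for seq, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if seq[-1] == eos_id:
            finished.append((seq, score))
        elif len(active) < beam_width:
            active.append((seq, score))
    return active, finished

active, finished = prune(
    [([5, 2, 0], -1.2), ([5, 3], -0.5), ([5, 4], -2.0)],
    finished=[], beam_width=1, eos_id=0)
print(active, finished)   # [([5, 3], -0.5)] [([5, 2, 0], -1.2)]
```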
10. Key Takeaways for Interviews¶
- Definition: Maintains K candidate sequences, keeps top-K by cumulative probability
- Beam width K: Trade-off between quality (higher K) and speed (lower K)
- Length normalization: Essential to prevent bias toward shorter sequences
- Pros: Better than greedy, explores alternatives, good for structured tasks
- Cons: Computationally expensive (K× slower), low diversity, generic outputs
- Best for: Translation, ASR, captioning, summarization
- Worst for: Creative writing, dialogue, brainstorming
- Complexity: O(K × V × T) time, O(K × T) space