Greedy Decoding - Interview Prep Guide¶

1. Overview¶

Greedy decoding selects the token with the highest probability at each step during autoregressive text generation. It's the simplest and fastest decoding strategy but lacks diversity.

Key insight: Local optimization (best token at each step) doesn't guarantee global optimality (best overall sequence).

2. How It Works¶

Algorithm¶

def greedy_decode(model, prompt, max_length):
    tokens = prompt
    for _ in range(max_length):
        logits = model(tokens)  # Shape: (vocab_size,)
        next_token = argmax(logits)  # Pick highest probability
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

Example¶

Given probability distribution after "The cat sat on the":

Token	Probability
mat	0.40
floor	0.25
sofa	0.15
bed	0.10

Greedy choice: mat (highest probability)

Output: "The cat sat on the mat"

3. Key Characteristics¶

Deterministic¶

Same input → always same output
No randomness involved
Fully reproducible

Myopic (Short-sighted)¶

Each decision is locally optimal but may lead to suboptimal sequences:

Step 1: "I think" → model assigns:
  - "that" (0.6)
  - "the" (0.4)

Greedy picks: "that"

Step 2: "I think that" → model assigns:
  - "is" (0.3)
  - "maybe" (0.25)

But if we had picked "the" at step 1:
  "I think the" → "best" (0.7)

Final: "I think that is..." (score: 0.6 × 0.3 = 0.18)
Better: "I think the best..." (score: 0.4 × 0.7 = 0.28)

Greedy can't recover from early suboptimal choices.

4. Common Problems¶

Problem 1: Repetition Loops¶

Prompt: "To be or not to"
Output: "be or not to be or not to be or not to be..."

Why: If "be" has slightly higher probability than alternatives at each step, greedy gets stuck.

Problem 2: Generic Outputs¶

Prompt: "Write a creative story about"
Greedy: "a boy who lived in a small town and went to school..."

Always picks safe, high-probability continuations → boring text.

Problem 3: Early Mistakes Propagate¶

Prompt: "The capital of France is"
If model assigns:
  - "Paris" (0.45)
  - "Lyon" (0.46)  ← greedy picks this (wrong!)

Output: "Lyon, which is known for..."

Once the wrong token is chosen, subsequent tokens try to justify it.

5. When to Use Greedy Decoding¶

✅ Good Use Cases¶

Deterministic tasks:

Math problem solving
Code generation (when exact output matters)
Factual Q&A with clear answers
Translation of standardized text

Debugging:

Baseline comparisons
Reproducible testing
Fastest inference for quick experiments

Confident models:

When model probability distributions are very peaked
Single obvious correct answer

❌ Poor Use Cases¶

Creative tasks:

Story writing
Poetry generation
Brainstorming

Conversational AI:

Chatbots
Dialogue systems
Personality-driven responses

Long-form generation:

Articles, essays
Open-ended content

6. Practical Considerations¶

Computational Complexity¶

Time: O(T × V) where T = sequence length, V = vocab size
Space: O(1) for decoding state (minimal memory)
Fastest among all decoding strategies

Modifications¶

Greedy + repetition penalty:

def greedy_with_penalty(logits, previous_tokens, penalty=1.2):
    for token in previous_tokens:
        logits[token] /= penalty  # Reduce probability of repeated tokens
    return argmax(logits)

Greedy + length normalization: Useful when comparing sequences of different lengths (not during decoding itself).

7. Interview Questions¶

Q1: What is greedy decoding and how does it work?¶

Answer: Greedy decoding selects the token with the highest probability at each generation step. It's deterministic and fast but locally optimal—it picks the best next token without considering whether this leads to the best overall sequence.

Q2: Why is greedy decoding called "greedy"?¶

Answer: It's "greedy" because it makes the locally optimal choice at each step (highest probability token) without considering future consequences. Like the greedy algorithm paradigm, it optimizes immediate reward rather than global optimality.

Q3: What's the main disadvantage of greedy decoding?¶

Answer: Lack of diversity and repetition. Greedy often produces repetitive, generic text because it can't explore alternative paths. Once it makes a suboptimal choice, it can't backtrack, leading to error propagation and repetitive loops.

Q4: How does greedy decoding differ from beam search?¶

Answer:

Greedy: Keeps only 1 hypothesis (best token at each step)
Beam search: Keeps K hypotheses (top-K sequences), explores multiple paths

Beam search can recover from locally suboptimal choices by considering alternative sequences. Greedy cannot.

Q5: Can greedy decoding produce the optimal sequence?¶

Answer: Not necessarily. Greedy finds a locally optimal sequence but not necessarily globally optimal. The highest-probability sequence might require choosing a lower-probability token early on that leads to much higher probabilities later.

Q6: Why does greedy decoding cause repetition?¶

Answer: If a token or phrase has slightly higher probability than alternatives, greedy will keep selecting it. For example, in "to be or not to be or not to be...", if "be" consistently has a small probability advantage, the model gets stuck in a loop with no mechanism to escape.

Q7: How can you reduce repetition in greedy decoding?¶

Answer: Common techniques:

Repetition penalty: Divide logits of previously generated tokens by penalty factor (e.g., 1.2)
N-gram blocking: Prevent repeating same N-grams
Switch to sampling: Use temperature/top-p for diversity
Beam search: Explore multiple paths to avoid local minima

Q8: What's the computational cost of greedy decoding?¶

Answer: Very efficient:

Per step: O(V) to find argmax over vocabulary
Total: O(T × V) for T tokens
Memory: O(1) for decoding state
Fastest decoding strategy, no beam/sample storage overhead

Q9: When would you prefer greedy over beam search?¶

Answer:

Latency-critical applications (greedy is much faster)
Deterministic outputs required (reproducibility)
Single clear answer (factual Q&A, simple math)
Resource-constrained environments (minimal memory)

Beam search is overkill when diversity isn't needed and the model is confident.

Q10: How does greedy decoding interact with temperature?¶

Answer: Greedy decoding ignores temperature because it always picks argmax. Temperature only affects sampling-based methods. You can think of greedy as temperature → 0 (infinitely peaked distribution where highest probability → 1.0, others → 0).

8. Comparison with Other Methods¶

Aspect	Greedy	Beam Search	Sampling
Speed	Fastest	Slow (K× cost)	Fast
Diversity	None	Low	High
Determinism	Yes	Yes	No
Repetition	High risk	Medium risk	Low risk
Quality	Variable	Higher	Variable
Memory	Minimal	O(K × T)	Minimal

9. Code Example¶

import torch

def greedy_decode(model, input_ids, max_length=50):
    """
    Simple greedy decoding implementation.

    Args:
        model: Language model with forward() method
        input_ids: Starting token IDs (tensor)
        max_length: Maximum sequence length

    Returns:
        Generated token IDs
    """
    generated = input_ids.clone()

    for _ in range(max_length):
        # Get logits for next token
        with torch.no_grad():
            outputs = model(generated)
            logits = outputs.logits[:, -1, :]  # Last token logits

        # Greedy selection
        next_token = torch.argmax(logits, dim=-1, keepdim=True)

        # Append to sequence
        generated = torch.cat([generated, next_token], dim=1)

        # Check for EOS
        if next_token.item() == model.config.eos_token_id:
            break

    return generated

# Usage
# output = greedy_decode(model, prompt_tokens, max_length=100)

10. Key Takeaways for Interviews¶

Definition: Always picks the highest probability token at each step
Pros: Fast, deterministic, simple, reproducible
Cons: No diversity, repetitive, locally optimal only
Main problem: Repetition loops and generic outputs
Best for: Deterministic tasks, debugging, factual Q&A
Worst for: Creative writing, conversations, long-form generation
Complexity: O(T × V) time, O(1) space for decoding