Speculative Decoding - Interview Prep Guide¶
1. Overview¶
Speculative decoding reduces inference latency in autoregressive LLMs by using a small draft model to propose multiple tokens that a large target model verifies in parallel. It produces exact same output distribution as standard decoding—not an approximation.
Key insight: Instead of 1 token per expensive model call, verify K tokens in one call by having a cheap model guess ahead.
Typical speedup: 2-3x latency reduction with no quality loss
2. The Problem with Standard Decoding¶
Sequential Bottleneck¶
Autoregressive generation is inherently sequential:
Standard decoding loop:
for _ in range(max_length):
logits = large_model(tokens) # Expensive!
next_token = sample(logits[-1]) # Only use last position
tokens.append(next_token) # Generate 1 token
Problems:
- Generate 1 token per forward pass
- Model sits mostly idle (memory-bound, not compute-bound)
- Latency grows linearly with output length
- KV cache helps computation but doesn't break sequential dependency
Example: For 100 tokens, need 100 expensive forward passes of the large model.
3. How Speculative Decoding Works¶
Core Idea¶
Separate token proposal from verification:
- Draft model (small, fast): Proposes K tokens autoregressively
- Target model (large, expensive): Verifies all K tokens in one parallel forward pass
- Accept/reject: Use rejection sampling to maintain correct distribution
Visual Example¶
Prompt: "The capital of France is"
Draft model proposes: "Paris , which is" (K=4 tokens)
↓
Target model verifies all 4 tokens in one pass
↓
Accept? [Paris: ✓] [,: ✓] [which: ✓] [is: ✗]
↓
Output: "The capital of France is Paris , which"
Continue from "which" (3 tokens in 1 target pass instead of 3!)
4. Step-by-Step Algorithm¶
Setup¶
- Draft model \(q\): Small, fast (e.g., 1B params)
- Target model \(p\): Large, accurate (e.g., 70B params)
- Draft length \(K\): Number of tokens to propose (typically 4-8)
Step 1: Draft Proposes K Tokens¶
Draft model generates K tokens autoregressively:
Important: Sample tokens (don't use greedy) and record \(q(y_i \mid x, y_{<i})\) for each.
# Draft phase (cheap, K sequential calls to small model)
draft_tokens = []
draft_probs = []
for i in range(K):
logits = draft_model(prompt + draft_tokens)
probs = softmax(logits[-1])
token = sample(probs) # Sample, not greedy!
draft_tokens.append(token)
draft_probs.append(probs[token]) # Record q(y_i)
Step 2: Target Verifies All Tokens¶
Target model runs once on the entire sequence:
This produces probabilities for all draft positions:
# Verification phase (1 parallel call to large model)
full_sequence = prompt + draft_tokens
target_logits = target_model(full_sequence) # Shape: [seq_len, vocab_size]
# Extract logits for draft positions only
target_probs = []
for i in range(K):
pos = len(prompt) + i - 1 # Position before token i
probs = softmax(target_logits[pos])
target_probs.append(probs[draft_tokens[i]]) # p(y_i)
Key: Transformer naturally computes logits for all positions—speculative decoding just uses them.
Step 3: Accept or Reject Each Token¶
Use rejection sampling to maintain correct distribution:
accepted = []
for i in range(K):
# Acceptance probability
acceptance_prob = min(1.0, target_probs[i] / draft_probs[i])
# Random test
if random.random() < acceptance_prob:
accepted.append(draft_tokens[i])
else:
# First rejection: stop here
break
Stop at first rejection (can't accept token 3 if token 2 was rejected).
Step 4: Handle Rejection¶
If token \(j\) is rejected:
- Discard \(y_j, y_{j+1}, \dots, y_K\)
- Sample replacement from target model at position \(j\):
if len(accepted) < K: # Some token was rejected
# Sample next token from target model
rejection_pos = len(accepted)
next_token = sample(target_logits[rejection_pos])
accepted.append(next_token)
# Continue with accepted tokens
prompt = prompt + accepted
If all K tokens accepted, continue with next draft batch.
5. Why It's Faster¶
Speedup Analysis¶
Standard decoding for N tokens:
- N forward passes of target model
- Cost: \(N \times C_{\text{target}}\)
Speculative decoding:
- Draft proposes K tokens: \(K \times C_{\text{draft}}\) (cheap)
- Target verifies in 1 pass: \(1 \times C_{\text{target}}\)
- If acceptance rate = \(\alpha\), generate \(\alpha \times K\) tokens per cycle
Expected tokens per target call: $$ E[\text{tokens}] = 1 + \sum_{i=1}^{K} \alpha^i \approx \frac{1 - \alpha^{K+1}}{1 - \alpha} $$
Example:
- \(K=4\), \(\alpha=0.7\) → ~2.4 tokens per target call
- 2.4× speedup (vs 1 token per call in standard decoding)
Why Parallel Verification Works¶
Transformers compute logits for all positions in one pass:
Input: ["The", "cat", "sat"]
Output: [logits_0, logits_1, logits_2]
↓ ↓ ↓
P(·|"The") P(·|"The cat") P(·|"The cat sat")
Standard decoding only uses logits_2 (last position).
Speculative decoding uses logits_0, logits_1, logits_2 to verify multiple tokens.
This is not new capability—it's how Transformers always work during training.
6. Why It's Exact (Not Approximate)¶
Rejection Sampling Proof¶
The acceptance rule: $$ \alpha_i = \min\left(1, \frac{p(y_i)}{q(y_i)}\right) $$
is standard rejection sampling:
- If \(p(y_i) \geq q(y_i)\): always accept (draft underestimated)
- If \(p(y_i) < q(y_i)\): accept with probability \(p/q\) (draft overestimated)
Result: Accepted tokens have distribution \(p\), not \(q\).
Mathematical guarantee: Output distribution is identical to sampling from target model directly.
When Rejection Happens¶
Draft assigns: q("Paris") = 0.8
Target assigns: p("Paris") = 0.9
→ Accept always (α = min(1, 0.9/0.8) = 1.0)
Draft assigns: q("London") = 0.6
Target assigns: p("London") = 0.3
→ Accept 50% of time (α = 0.3/0.6 = 0.5)
Rejection corrects for draft model errors while maintaining exact target distribution.
7. Practical Considerations¶
Draft Model Selection¶
Options:
- Smaller version of target (e.g., 1B vs 70B parameters)
- Quantized target model (INT8 vs FP16)
- Distilled model trained to match target
- Early-exit from target (use intermediate layers)
Requirements:
- Fast enough (≥10× faster than target)
- Similar enough to target (high acceptance rate)
Draft Length K¶
Trade-off:
- Small K (2-4): Lower overhead, guaranteed speedup
- Large K (8-16): Higher potential speedup, but more likely to reject
Optimal K depends on:
- Acceptance rate (higher α → larger K beneficial)
- Draft/target speed ratio
- Memory constraints
Typical: K=4-5 works well in practice
Acceptance Rate¶
Factors affecting α:
- Draft-target model similarity
- Task complexity (simple text → higher α)
- Prompt context (more context → better draft predictions)
Typical rates:
- Good draft: α=0.7-0.9
- Poor draft: α=0.3-0.5
Below α≈0.3: Speculative decoding may be slower than standard (overhead dominates).
8. Interaction with KV Cache¶
KV Cache in Both Models¶
Draft model:
- Maintains its own KV cache
- Generates K tokens autoregressively (K cache updates)
Target model:
- Computes KV cache for entire draft window in one pass
- Accepted tokens' KV states are kept
- Rejected tokens' KV states are discarded
Memory Considerations¶
Standard decoding: 1 model's KV cache
Speculative decoding: 2 models' KV cache (draft + target)
Memory cost: Minimal—draft model is small, so its cache is negligible.
9. Interview Questions¶
Q1: What is speculative decoding and why is it faster?¶
Answer: Speculative decoding uses a small draft model to propose K tokens cheaply, then a large target model verifies all K tokens in one parallel forward pass. Since Transformers compute logits for all positions naturally, we can check multiple draft tokens together. If most are accepted (high acceptance rate), we generate multiple tokens per expensive target model call instead of 1, reducing latency by 2-3× while maintaining exact output distribution.
Q2: Does speculative decoding produce exact or approximate results?¶
Answer: Exact. It uses rejection sampling to correct for draft model errors. The acceptance probability α = min(1, p(y)/q(y)) ensures accepted tokens have the exact target distribution p, not the draft distribution q. The output is statistically identical to standard decoding with the target model—it's a latency optimization, not an approximation.
Q3: Why can't you just run the target model in parallel for K tokens directly?¶
Answer: Autoregressive models have a causal dependency—token t depends on all tokens before it (t-1, t-2, ...). You cannot predict token t and token t+1 independently because t+1 needs t as input. Speculative decoding works around this by having the draft model make guesses (which may be wrong), then the target model verifies them all at once (which is valid because verification only requires a single forward pass).
Q4: How does the target model verify K tokens in one pass?¶
Answer: Transformers naturally compute logits for all token positions in a forward pass, not just the last one. Given input [x, y₁, y₂, y₃], the model outputs logits for positions 0, 1, 2, 3. Standard decoding only uses the last position. Speculative decoding uses all positions to compute p(y₁|x), p(y₂|x,y₁), p(y₃|x,y₁,y₂) and compare them with the draft probabilities. This is how Transformers work during training—speculative decoding just reuses this during inference.
Q5: What happens when a draft token is rejected?¶
Answer: Stop immediately at the first rejection:
- Discard the rejected token and all subsequent draft tokens
- Sample a replacement token from the target model at that position
- Restart speculation from the new sequence
For example, if tokens 1,2 are accepted but token 3 is rejected, keep 1,2, sample a new token 3 from target, discard draft tokens 4,5,...,K. This maintains causality and correctness.
Q6: What determines the acceptance rate?¶
Answer: How well the draft model matches the target model's distribution:
- High similarity (e.g., quantized version) → α=0.8-0.9 → 2-3× speedup
- Moderate similarity (e.g., smaller architecture) → α=0.5-0.7 → 1.5-2× speedup
- Low similarity → α<0.3 → overhead dominates, possibly slower
Also affected by task (simple text has higher α) and context (more prompt context helps draft predict better).
Q7: How do you choose the draft length K?¶
Answer: Trade-off between potential speedup and overhead:
- Higher acceptance rate α → can use larger K (more tokens likely accepted)
- Lower α → use smaller K (avoid wasting computation on rejections)
- Typical: K=4-5 works well across scenarios
Formula: Expected tokens per cycle ≈ (1-α^(K+1))/(1-α). Diminishing returns beyond K=5-8 for most acceptance rates.
Q8: Does speculative decoding reduce total FLOPs?¶
Answer: No, it increases total FLOPs slightly (due to draft model overhead and potential rejections). The speedup comes from reducing latency—fewer sequential calls to the expensive target model. It's memory-bound optimization: better GPU utilization by verifying multiple tokens in parallel rather than one at a time. Wall-clock time decreases even though total compute increases.
Q9: Can you use greedy decoding for the draft model?¶
Answer: In speculative decoding, using greedy decoding for the draft model is not inherently wrong — the rejection sampling step mathematically guarantees the target model's distribution is preserved regardless of how the draft generates tokens. The real concern is efficiency: when the target model samples stochastically (temperature > 0), a greedy draft diverges more from the target's distribution, leading to lower acceptance rates and reduced speedup. If the target itself uses greedy decoding, then using greedy for the draft is perfectly fine and can work very well. So the draft model's decoding strategy is essentially a performance tuning choice — greedy hurts acceptance rates only when the target is sampling, not a correctness issue.¶
Q10: How does speculative decoding interact with KV cache?¶
Answer: Both models use KV cache:
- Draft model: Maintains its own cache, generates K tokens autoregressively
- Target model: Computes cache for entire draft window in one pass; keeps cache for accepted tokens, discards for rejected ones
Memory overhead is minimal since the draft model is small. KV cache improves efficiency (avoids recomputing attention) but doesn't change the algorithm—it's an orthogonal optimization.
10. When to Use Speculative Decoding¶
✅ Good Use Cases¶
- Latency-critical serving (chatbots, real-time applications)
- Long-form generation (more tokens → more opportunities for speedup)
- Good draft model available (smaller version, quantized, distilled)
- Inference-bound workloads (not training)
❌ Less Beneficial¶
- Batch inference (already parallelized across sequences)
- Very short outputs (overhead not amortized)
- No suitable draft model (acceptance rate too low)
- Memory-constrained systems (need to load 2 models)
11. Variants and Extensions¶
Medusa / Multi-Head Speculation¶
- Add multiple prediction heads to draft model
- Each head predicts different future positions in parallel
- Tree-based verification (multiple candidate paths)
Lookahead Decoding¶
- Uses n-gram matching and Jacobi iterations
- Doesn't require separate draft model
- Lower speedup but no extra model needed
Staged Speculative Decoding¶
- Multiple draft models of increasing size
- First draft proposes, second refines, target verifies
- Can achieve higher acceptance rates
12. Key Takeaways for Interviews¶
- Main idea: Draft model proposes K tokens, target verifies all in one pass
- Speedup source: Multiple tokens per expensive model call (not reduced FLOPs)
- Exactness: Rejection sampling ensures identical distribution to target model
- Acceptance rate: Critical metric—need α>0.5 for practical speedup
- Parallel verification: Uses existing Transformer capability (logits for all positions)
- Draft model: Must be 10×+ faster, doesn't need to be very accurate
- Typical gain: 2-3× latency reduction in production systems
- Memory: Minimal overhead (draft model is small)