Sampling Methods¶

1. Overview¶

Sampling methods introduce controlled randomness into text generation by probabilistically selecting tokens rather than deterministically choosing the highest-probability token. This enables diverse, creative outputs while maintaining coherence.

Key insight: The best sequence isn't always the highest-probability one—controlled randomness can produce more natural, interesting text.

Three main techniques:

Temperature sampling - Controls randomness via scaling
Top-k sampling - Samples from top K tokens
Top-p sampling (nucleus) - Samples from smallest set with cumulative probability ≥ p

2. Temperature Sampling¶

2.1 How It Works¶

Temperature scales logits before applying softmax, controlling distribution "sharpness":

\[\text{P}_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}\]

Where:

\(z_i\) = logit for token i
\(T\) = temperature (T > 0):
- \(T = 1\) → original distribution
- \(T < 1\) → sharper (more deterministic)
- \(T > 1\) → flatter (more random)
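
One way to see why \(T\) controls sharpness is to divide two of the probabilities above: the normalizer cancels, so the ratio between any two tokens depends only on their logit gap and \(T\). A short derivation, writing \(p_i\) for the \(T = 1\) probabilities (notation introduced here for illustration):

\[\frac{P_i}{P_j} = \frac{e^{z_i/T}}{e^{z_j/T}} = e^{(z_i - z_j)/T} = \left(\frac{p_i}{p_j}\right)^{1/T}\]

For \(T < 1\) the exponent \(1/T\) exceeds 1, so any ratio above 1 grows and the leading token pulls further ahead. For \(T > 1\) every ratio shrinks toward 1 and the distribution flattens.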

2.2 Example¶

Original probabilities after "The cat sat on the":

Token	Prob (T=1)	Prob (T=0.5)	Prob (T=2.0)
mat	0.40	0.62	0.27
floor	0.25	0.24	0.21
sofa	0.15	0.09	0.16
bed	0.10	0.04	0.13
roof	0.05	0.01	0.09
moon	0.03	0.00	0.07
pizza	0.02	0.00	0.06

T=0.5 (Low): Distribution sharpens → "mat" dominates → near-greedy behavior
T=2.0 (High): Distribution flattens → "moon", "pizza" become viable → more random
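
The rescaled columns can be recomputed directly from the T=1 probabilities, since log-probabilities equal the logits up to a constant that softmax ignores. The following is a minimal sketch in plain Python; the helper name `apply_temperature` is made up for this example.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution as if its logits had been divided by T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # log(p) recovers the logits up to an additive constant.
    scaled = [math.log(p) / temperature for p in probs]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

tokens = ["mat", "floor", "sofa", "bed", "roof", "moon", "pizza"]
base = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]

for t in (0.5, 1.0, 2.0):
    row = apply_temperature(base, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, row)))
```

Running it prints one row per temperature, which makes it easy to try other values of T on the same context.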

2.3 Extreme Cases¶

T → 0:

Distribution becomes one-hot (probability → 1.0 for argmax)
Equivalent to greedy decoding
Zero randomness

T → ∞:

Uniform distribution (all tokens equally likely)
Maximum randomness
Often produces gibberish
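
Both limits can be checked numerically. The sketch below uses made-up logits and reports the largest probability and the Shannon entropy at a very low, a neutral, and a very high temperature; the helper names are assumptions for illustration.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats: 0 for one-hot, log(n) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration

for t in (0.01, 1.0, 100.0):
    probs = softmax_t(logits, t)
    print(f"T={t}: max prob={max(probs):.3f}, entropy={entropy(probs):.3f}")

print(f"entropy of a uniform distribution: {math.log(len(logits)):.3f}")
```

Note that T = 0 itself cannot be plugged into the formula because it divides by zero; many implementations special-case it as plain argmax.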

Typical values:

T=0.7: Focused, coherent (factual Q&A)
T=1.0: Balanced (default)
T=1.2-1.5: Creative, diverse (story writing)

2.4 Code Example¶

import torch
import torch.nn.functional as F

def temperature_sampling(logits, temperature=1.0):
    """
    Sample token with temperature scaling.

    Args:
        logits: Raw model outputs [vocab_size]
        temperature: Scaling factor (T > 0)

    Returns:
        Sampled token ID
    """
    # Scale logits
    scaled_logits = logits / temperature

    # Convert to probabilities
    probs = F.softmax(scaled_logits, dim=-1)

    # Sample
    token_id = torch.multinomial(probs, num_samples=1)
    return token_id.item()

# Usage
# next_token = temperature_sampling(model_logits, temperature=1.2)

3. Top-k Sampling¶

3.1 How It Works¶

Top-k restricts sampling to the k most probable tokens:

Sort tokens by probability (descending)
Keep only top-k tokens
Set all other probabilities to zero
Renormalize the top-k probabilities
Sample from renormalized distribution
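
The steps above can be sketched in plain Python on a small dictionary of token probabilities. The dictionary input format and the function names are assumptions made for this example, not part of any library.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize.

    probs: dict mapping token -> probability (hypothetical input format).
    Returns a new dict containing only the kept tokens.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort by probability (descending) and keep the first k entries.
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Dropped tokens are simply absent, i.e. they have probability zero.
    total = sum(p for _, p in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {tok: p / total for tok, p in kept}

def sample(dist, rng=random):
    """Draw one token according to the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

probs = {"mat": 0.40, "floor": 0.25, "sofa": 0.15, "bed": 0.10,
         "roof": 0.05, "moon": 0.03, "pizza": 0.02}
filtered = top_k_filter(probs, k=3)
print(filtered)
print(sample(filtered))
```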

3.2 Example¶

Probabilities after "The cat sat on the":

Token	Probability
mat	0.40
floor	0.25
sofa	0.15
bed	0.10
roof	0.05
moon	0.03
pizza	0.02

Top-k with k=3:

Kept tokens:

mat: 0.40
floor: 0.25
sofa: 0.15

Renormalized:

mat: 0.40/0.80 = 0.50
floor: 0.25/0.80 = 0.31
sofa: 0.15/0.80 = 0.19

Removed: bed, roof, moon, pizza

Output: One of {mat, floor, sofa} with renormalized probabilities

3.3 Key Limitation: Fixed K¶

Top-k doesn't adapt to distribution shape:

Case 1: Confident model

Probabilities: [0.85, 0.07, 0.03, 0.02, 0.02, 0.01]
k=5 → keeps 5 tokens even though model is very confident
Too permissive: four low-probability tokens stay eligible even though the model clearly favors one

Case 2: Uncertain model

Probabilities: [0.20, 0.20, 0.20, 0.20, 0.20]
k=3 → keeps only 3 tokens, excludes equally valid options
Too restrictive: removes valid choices

Problem: Same k value for different distribution shapes.
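
To put numbers on the two cases, the sketch below measures how much probability mass a fixed k keeps and how much of it goes to tokens other than the top one. The helper `top_k_mass` is hypothetical and exists only for this illustration.

```python
def top_k_mass(probs, k):
    """Return (mass kept by top-k, mass of kept tokens other than the top one)."""
    kept = sorted(probs, reverse=True)[:k]
    return sum(kept), sum(kept[1:])

confident = [0.85, 0.07, 0.03, 0.02, 0.02, 0.01]
uncertain = [0.20, 0.20, 0.20, 0.20, 0.20]

kept, runners_up = top_k_mass(confident, k=5)
print(f"confident, k=5: keeps {kept:.2f} of the mass, "
      f"{runners_up / kept:.0%} of samples are not the top token")

kept, _ = top_k_mass(uncertain, k=3)
print(f"uncertain, k=3: keeps {kept:.2f} of the mass, discards {1 - kept:.2f}")
```

In the uncertain case, k=3 discards 0.40 of the probability mass even though the discarded tokens are exactly as likely as the kept ones.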

3.4 Typical Values¶

k=10-20: Conservative, relatively safe outputs
k=40-50: More diverse, creative
k=100+: Very random, may include low-quality tokens

3.5 Code Example¶

import torch
import torch.nn.functional as F

def top_k_sampling(logits, k=50, temperature=1.0):
    """
    Sample from top-k tokens.

    Args:
        logits: Raw model outputs [vocab_size]
        k: Number of top tokens to consider
        temperature: Optional temperature scaling

    Returns:
        Sampled token ID
    """
    # Apply temperature
    scaled_logits = logits / temperature

    # Get top-k (clamp k so it never exceeds the vocabulary size)
    k = min(k, scaled_logits.size(-1))
    top_k_logits, top_k_indices = torch.topk(scaled_logits, k)

    # Softmax over top-k
    probs = F.softmax(top_k_logits, dim=-1)

    # Sample from top-k
    sampled_idx = torch.multinomial(probs, num_samples=1)

    # Map back to original vocabulary
    token_id = top_k_indices[sampled_idx]
    return token_id.item()

# Usage
# next_token = top_k_sampling(model_logits, k=40, temperature=0.9)

4. Top-p Sampling (Nucleus Sampling)¶

4.1 How It Works¶

Top-p selects the smallest set of tokens whose cumulative probability ≥ p:

Sort tokens by probability (descending)
Compute cumulative probability
Keep tokens until cumulative ≥ p
Renormalize and sample

Key advantage: Adaptive—number of tokens varies based on distribution.
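
As a rough illustration of these steps, here is a plain-Python sketch that works on a dictionary of token probabilities. The input format, the function name, and the small `eps` tolerance (which guards against floating-point round-off when the cumulative sum lands exactly on p) are all choices made for this example.

```python
import random

def top_p_filter(probs, p, eps=1e-12):
    """Keep the smallest high-probability prefix whose mass reaches p.

    probs: dict mapping token -> probability (hypothetical input format).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Sort tokens by probability, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p - eps:  # threshold reached: stop adding tokens
            break
    # Renormalize the kept probabilities so they sum to 1.
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"mat": 0.40, "floor": 0.25, "sofa": 0.15, "bed": 0.10,
         "roof": 0.05, "moon": 0.03, "pizza": 0.02}
nucleus = top_p_filter(probs, p=0.9)
print(list(nucleus))

# Sample one token from the renormalized nucleus.
tokens = list(nucleus)
print(random.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0])
```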

4.2 Example¶

Probabilities after "The cat sat on the":

Token	Probability	Cumulative
mat	0.40	0.40
floor	0.25	0.65
sofa	0.15	0.80
bed	0.10	0.90
roof	0.05	0.95
moon	0.03	0.98
pizza	0.02	1.00

Top-p with p=0.9:

Keep: mat, floor, sofa, bed (cumulative = 0.90)
Remove: roof, moon, pizza
Effective k = 4

4.3 Adaptive Behavior¶

Case 1: Confident model

Probabilities: [0.85, 0.07, 0.03, 0.02, 0.02, 0.01]
p=0.9 → keeps 2 tokens [0.85, 0.07]
Effective k = 2 (adaptive reduction)

Case 2: Uncertain model

Probabilities: [0.20, 0.20, 0.20, 0.20, 0.20]
p=0.9 → keeps 5 tokens
Effective k = 5 (adaptive expansion)

Advantage over top-k: Automatically adjusts to model confidence.
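
The two cases can be reproduced with a few lines of Python. This is a sketch under the same definition as above (smallest prefix with cumulative probability ≥ p); `nucleus_size` is a hypothetical helper, and `eps` is a tolerance chosen here to absorb floating-point round-off.

```python
def nucleus_size(probs, p, eps=1e-12):
    """Number of tokens in the top-p nucleus of a list of probabilities."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)

confident = [0.85, 0.07, 0.03, 0.02, 0.02, 0.01]
uncertain = [0.20, 0.20, 0.20, 0.20, 0.20]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(f"{name}: top-p (p=0.9) keeps {nucleus_size(dist, 0.9)} tokens, "
          f"top-k (k=3) keeps {min(3, len(dist))}")
```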

4.4 Typical Values¶

p=0.9: Common choice for most applications
p=0.95: More diverse, creative outputs
p=0.75-0.85: More focused, conservative

4.5 Code Example¶

import torch
import torch.nn.functional as F

def top_p_sampling(logits, p=0.9, temperature=1.0):
    """
    Nucleus sampling (top-p).

    Args:
        logits: Raw model outputs [vocab_size]
        p: Cumulative probability threshold (0 < p ≤ 1)
        temperature: Optional temperature scaling

    Returns:
        Sampled token ID
    """
    # Apply temperature
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Sort probabilities
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mass that precedes each token in sorted order (exclusive cumulative sum).
    # A token is removed once the tokens before it already reach p, so the
    # kept set is the smallest prefix with cumulative probability >= p.
    sorted_indices_to_remove = (cumulative_probs - sorted_probs) >= p

    # Mask out tokens to remove
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )
    filtered_logits = scaled_logits.clone()
    filtered_logits[indices_to_remove] = float('-inf')

    # Sample from filtered distribution
    filtered_probs = F.softmax(filtered_logits, dim=-1)
    token_id = torch.multinomial(filtered_probs, num_samples=1)
    return token_id.item()

# Usage
# next_token = top_p_sampling(model_logits, p=0.9, temperature=1.0)

5. Combining Sampling Methods¶

In practice, sampling methods are often combined:

Common Combination: Temperature + Top-p¶

def sample_token(logits, temperature=0.9, top_p=0.9):
    # 1. Apply temperature first
    scaled_logits = logits / temperature

    # 2. Then apply top-p filtering
    token = top_p_sampling(scaled_logits, p=top_p)
    return token

Why combine:

Temperature controls overall randomness
Top-p prevents sampling from the very long tail
Together: controlled creativity with safety
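
Because temperature is applied before the top-p cutoff, it also changes how many tokens survive that cutoff. The sketch below illustrates this on the "The cat sat on the" probabilities used earlier; both helper functions are hypothetical and written only for this example.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution as if its logits had been divided by T."""
    scaled = [math.log(p) / temperature for p in probs]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def nucleus_size(probs, p, eps=1e-12):
    """Number of tokens in the top-p nucleus (eps absorbs round-off)."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p - eps:
            return count
    return len(probs)

base = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]

for t in (0.5, 1.0, 2.0):
    size = nucleus_size(apply_temperature(base, t), p=0.9)
    print(f"T={t}: top-p (p=0.9) keeps {size} of {len(base)} tokens")
```

Lower temperatures shrink the nucleus and higher ones widen it, so the two parameters are best tuned together rather than one at a time.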

Other Combinations¶

Temperature + Top-k:

# Used in some older systems
token = top_k_sampling(logits, k=40, temperature=0.8)

Top-k + Top-p:

# Apply top-k first as a hard cutoff
# Then apply top-p for adaptive filtering
# Less common, top-p usually sufficient
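
A possible shape for this combination, sketched in plain Python on a dictionary of token probabilities. One design choice is assumed here: p is measured against the mass that survives the top-k cut, which is what results from running the two filters one after the other. The function name and input format are made up for illustration.

```python
def top_k_top_p_filter(probs, k, p, eps=1e-12):
    """Apply a top-k cap, then a top-p cutoff, and renormalize.

    probs: dict mapping token -> probability (hypothetical input format).
    """
    # Hard cap: keep at most k tokens, most likely first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Assumption: p is measured against the mass kept by top-k.
    k_total = sum(prob for _, prob in ranked)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob / k_total
        if cumulative >= p - eps:  # adaptive cutoff inside the top-k set
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"mat": 0.40, "floor": 0.25, "sofa": 0.15, "bed": 0.10,
         "roof": 0.05, "moon": 0.03, "pizza": 0.02}
print(list(top_k_top_p_filter(probs, k=5, p=0.9)))
print(list(top_k_top_p_filter(probs, k=2, p=0.9)))
```

With a small k the cap decides the outcome; with a large k the top-p cutoff does.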

6. When to Use Each Method¶

Temperature Sampling¶

✅ Use when:

Need simple randomness control
Working with other filtering methods (top-k, top-p)
Want smooth transition between deterministic and random

❌ Avoid when:

Using it alone, especially at high T (nothing filters the long tail)
Need adaptive behavior

Top-k Sampling¶

✅ Use when:

Need simple, predictable diversity control
Fixed computational budget (always k tokens)
Legacy systems (older standard)

❌ Avoid when:

Distribution shape varies significantly
Need adaptive behavior
Modern systems (top-p preferred)

Top-p Sampling¶

✅ Use when:

Need adaptive diversity
Model confidence varies
General-purpose text generation
Conversational AI, creative writing

❌ Avoid when:

Need strict determinism
Computational constraints (slightly more complex)

7. Interview Questions¶

Q1: What problem do sampling methods solve?¶

Answer: Greedy decoding and beam search tend to produce repetitive, generic text because they always favor high-probability tokens. Sampling methods introduce controlled randomness, which yields more diverse, creative outputs and helps the model avoid getting stuck in repetitive loops. They balance coherence with variety.

Q2: How does temperature affect the probability distribution?¶

Answer: Temperature scales logits before softmax. Low temperature (T<1) sharpens the distribution—high-probability tokens become more dominant (near-greedy). High temperature (T>1) flattens the distribution—low-probability tokens become more likely (more random). At T→0, it becomes greedy; at T→∞, it becomes uniform.

Q3: What's the main limitation of top-k sampling?¶

Answer: Fixed k doesn't adapt to distribution shape. When the model is very confident, k=50 still admits dozens of unlikely tokens into the candidate set. When the model is uncertain and many options are valid, k=50 might exclude good alternatives. Top-k treats all distributions the same, ignoring the model's confidence level.

Q4: How is top-p better than top-k?¶

Answer: Top-p is adaptive. It automatically adjusts the number of candidate tokens based on distribution shape:

Confident model → keeps fewer tokens (smaller effective k)
Uncertain model → keeps more tokens (larger effective k)

This makes top-p more robust across different contexts: a single value of p behaves sensibly where a single value of k would be too loose in one context and too tight in another.

Q5: Can you use temperature=0.5 with top-p=0.9 together?¶

Answer: Yes, and this is common in practice:

Apply temperature=0.5 first → sharpen distribution (reduce randomness)
Apply top-p=0.9 → filter out low-probability tail
Sample from filtered distribution

Temperature controls overall randomness, while top-p cuts off the unlikely tail that produces gibberish. Together they provide controlled creativity.

Q6: Why not just use temperature alone?¶

Answer: Temperature alone doesn't filter out low-probability tokens—it just reduces their probability. Even with low temperature, there's still a tiny chance of sampling nonsense tokens from the very long tail. Top-p/top-k provide a hard cutoff, ensuring we never sample from clearly bad options.

Q7: What happens with very high temperature (T=5)?¶

Answer: The distribution becomes much flatter and approaches uniform: token probabilities converge regardless of the model's original confidence. Sampling from it tends to produce incoherent gibberish:

Input: "The capital of France is"
Output: "banana quantum seventh pencil"

High temperature destroys the model's learned knowledge. Typical max: T=1.5-2.0.

Q8: How do you choose between p=0.9 vs p=0.95?¶

Answer:

p=0.9: More focused, coherent (default for most applications)
p=0.95: More diverse, creative (for storytelling, brainstorming)

Trade-off: Higher p → more diversity but higher risk of incoherence. Start with 0.9; increase for creativity, decrease for safety. The right value depends on the task.

Q9: What's the computational cost of top-p vs top-k?¶

Answer:

Top-k: O(V log k) — partial sort for top-k elements
Top-p: O(V log V) — full sort to compute cumulative probabilities

Top-p is slightly more expensive, but the difference is negligible compared to model inference cost. The adaptive benefits of top-p outweigh the small computational overhead.
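
The difference between the two costs can be seen in plain Python: a bounded heap finds the top k without ordering the rest, while top-p needs the full ordering to accumulate probabilities. This is only an illustration with a made-up probability list; real implementations operate on tensors.

```python
import heapq

def top_k_partial(probs, k):
    """Indices of the k largest entries via a bounded heap: O(V log k)."""
    return heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)

def top_k_full_sort(probs, k):
    """Same result via a full sort: O(V log V), the sort top-p relies on."""
    return sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]

probs = [0.05, 0.40, 0.02, 0.25, 0.10, 0.15, 0.03]  # made-up values
print(top_k_partial(probs, 3))
print(top_k_full_sort(probs, 3))
```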

Q10: Why do modern LLMs (GPT-4, Claude) prefer top-p over top-k?¶

Answer: Adaptivity and robustness. Top-p automatically adjusts to:

Different contexts (formal vs casual)
Varying model confidence
Different domains (technical vs creative)

This makes it more reliable across diverse use cases with less manual tuning. A good k differs from scenario to scenario, while top-p with p=0.9 is a reasonable starting point for most of them.

8. Comparison Table¶

Method	Randomness Control	Adaptive	Prevents Tail Sampling	Typical Usage	Complexity
Temperature	Continuous (T)	No	No	+ top-p/top-k	O(V)
Top-k	Discrete (k)	No	Yes	Legacy, simple control	O(V log k)
Top-p	Continuous (p)	Yes	Yes	Modern LLMs, general use	O(V log V)
Greedy	None	N/A	N/A	Debugging, deterministic	O(V)
Beam	None	No	N/A	Translation, ASR	O(K×V)

9. Key Takeaways for Interviews¶

Temperature: Controls randomness by scaling logits; T<1 sharper, T>1 flatter
Top-k: Fixed number of candidate tokens; simple but not adaptive
Top-p: Adaptive cumulative probability threshold; modern standard
Combination: Temperature + top-p together is a common, robust default
Top-p advantage: Automatically adjusts to model confidence
Common values: temperature=0.8-1.0, top-p=0.9
Use case: Sampling for creative/diverse tasks; greedy/beam for deterministic tasks