Batching Strategies
1. Why Batching Matters¶
GPU Utilization Problem:
- Single request uses <10% of GPU compute capacity
- Memory bandwidth underutilized
- Expensive hardware sitting idle
Batching Solution:
- Process multiple requests simultaneously
- Amortize memory access costs
- Achieve 5-10x throughput improvement
Key Constraint: GPU memory limits batch size (primarily KV cache)
2. Static Batching¶
Mechanism¶
- Accumulate N requests
- Process entire batch together
- Wait for ALL sequences to complete before accepting new requests
Characteristics¶
Batch: [Req1: 100 tokens, Req2: 20 tokens, Req3: 80 tokens]
All requests are held until Req1 (the longest) completes
Req2's and Req3's slots sit idle after those sequences finish
- Throughput: Improved vs. single-request serving
- Latency: High variance (tail latency problem)
- GPU Utilization: Poor (bubbles after short sequences finish)
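The bubble can be made concrete with a toy simulation (a sketch, not a serving implementation; the request lengths are the ones from the example above):

```python
# Toy model of static batching: the batch runs for as many iterations as the
# LONGEST request needs, and shorter requests leave their slot idle afterwards.

def static_batch(requests):
    """requests: list of (request_id, tokens_to_generate) pairs."""
    iterations = max(n for _, n in requests)    # batch blocks on the longest request
    busy = sum(n for _, n in requests)          # useful slot-iterations
    utilization = busy / (iterations * len(requests))
    return iterations, utilization

# The batch from the example: Req1=100, Req2=20, Req3=80 tokens
its, util = static_batch([("r1", 100), ("r2", 20), ("r3", 80)])
# → 100 iterations at only ~67% slot utilization
```

Req2's slot is idle for 80 of the 100 iterations; that idle span is exactly the bubble continuous batching removes.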
When to Use¶
- Offline batch processing
- Known sequence lengths
- Latency not critical
3. Dynamic Batching¶
Mechanism¶
- Accumulate requests up to max_batch_size OR max_delay
- Whichever comes first triggers batch execution
- Still waits for full batch completion before next batch
Key Parameters¶
max_batch_size = 32 # Maximum requests per batch
max_delay_ms = 100 # Maximum wait time
Improvement over Static:
- Balances latency and throughput
- Reduces wait time for batch formation
Still Limited:
- Batch-level scheduling (not iteration-level)
- GPU idle time after short sequences complete
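The trigger logic can be sketched as follows (an illustrative class, not any framework's API; `now` is passed in explicitly to keep the example deterministic):

```python
from collections import deque

# Sketch of the dynamic-batching trigger: fire when max_batch_size requests
# are queued OR the oldest waiting request has waited max_delay_ms,
# whichever comes first.

class DynamicBatcher:
    def __init__(self, max_batch_size=32, max_delay_ms=100):
        self.max_batch_size = max_batch_size
        self.max_delay = max_delay_ms / 1000.0
        self.queue = deque()                      # (request, arrival_time)

    def submit(self, request, now):
        self.queue.append((request, now))

    def maybe_form_batch(self, now):
        """Return a batch if either trigger fired, else None."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch_size
        timed_out = now - self.queue[0][1] >= self.max_delay
        if not (full or timed_out):
            return None
        n = min(self.max_batch_size, len(self.queue))
        return [self.queue.popleft()[0] for _ in range(n)]

b = DynamicBatcher(max_batch_size=2, max_delay_ms=100)
b.submit("a", now=0.00)
b.submit("b", now=0.06)
batch = b.maybe_form_batch(now=0.06)   # size trigger fires → ["a", "b"]
```

A lone request would instead wait until the 100 ms timeout fires, which is how the `max_delay_ms` knob bounds queueing latency under light traffic.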
When to Use¶
- Services with moderate traffic
- Simpler implementation than continuous batching
- Good for non-LLM models (CV, audio)
4. Continuous Batching (Iteration-Level Scheduling)¶
Core Innovation¶
Iteration-Level Scheduling: Schedule at each decode step, not batch level
Mechanism¶
Initial Batch: [Req1, Req2, Req3, Req4]
Iteration 1: All generate token 1
Iteration 2: All generate token 2
Iteration 3: Req2 completes → Req5 joins → [Req1, Req3, Req4, Req5]
Iteration 4: Req3 completes → Req6 joins → [Req1, Req4, Req5, Req6]
...
Key Benefit: No GPU idle time, slots filled immediately
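A minimal simulation of the mechanism above (request lengths and the one-token "decode step" are stand-ins for real model calls):

```python
from collections import deque

# Toy simulation of iteration-level scheduling: after every decode step,
# finished sequences free their slots and waiting requests fill them
# immediately. Requests are (id, tokens_remaining).

def continuous_batching(initial, waiting, max_slots=4):
    active = list(initial)
    waiting = deque(waiting)
    trace = []                                   # batch composition per iteration
    while active:
        trace.append([rid for rid, _ in active])
        # one decode step: every active sequence emits one token
        active = [(rid, rem - 1) for rid, rem in active]
        # evict finished sequences, then backfill from the queue
        active = [(rid, rem) for rid, rem in active if rem > 0]
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
    return trace

trace = continuous_batching([("r1", 3), ("r2", 1)], [("r3", 2)], max_slots=2)
# Every iteration runs a full batch of 2: r2 leaves and r3 joins in one step.
```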
Implementation Details¶
Sequence Completion Detection:
- Monitor for EOS tokens
- Max length reached
- User cancellation
Slot Management:
- Free KV cache blocks immediately
- Add waiting request to batch
- Update attention masks
Memory Efficiency:
- Works best with PagedAttention (vLLM)
- Non-contiguous memory allocation
- Independent per-sequence management
Performance Impact¶
- Throughput: +20-30% vs dynamic batching
- Latency: Lower average, more consistent tail latency
- GPU Utilization: 80-95% (vs 50-70% for static)
Trade-offs¶
- Complex implementation
- Variable batch size per iteration
- Requires sophisticated memory management
5. Chunked Prefill¶
Problem¶
Long prompts block decode:
Prompt: 10,000 tokens of prefill ≈ the work of ~50 decode iterations, done in one blocking pass
Short prompts: waiting in queue
Decode operations: starved
Solution¶
Break prefill into chunks, interleave with decode
Mechanism¶
Iteration 1: Chunk 1 of long prompt (512 tokens)
Iteration 2: Decode for ongoing requests
Iteration 3: Chunk 2 of long prompt (512 tokens)
Iteration 4: Decode for ongoing requests
...
Parameters:
- max_prefill_tokens: Tokens to prefill per iteration
- Balances prefill throughput vs decode latency
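The interleaving can be sketched as a schedule generator (a hypothetical helper; real schedulers also weigh how many decode requests are in flight):

```python
# Sketch of chunked prefill for one long prompt: alternate one prefill chunk
# of at most max_prefill_tokens with one decode step for ongoing requests.

def chunk_schedule(prompt_len, max_prefill_tokens=512):
    plan, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(max_prefill_tokens, remaining)
        plan.append(("prefill", chunk))
        remaining -= chunk
        plan.append(("decode", 1))               # ongoing requests each emit a token
    return plan

plan = chunk_schedule(1200, max_prefill_tokens=512)
# → [('prefill', 512), ('decode', 1), ('prefill', 512), ('decode', 1),
#    ('prefill', 176), ('decode', 1)]
```

Smaller chunks mean decode steps come around more often (better decode latency) at the cost of more scheduling iterations for the prompt.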
Benefits¶
- Prevents large prompts from causing latency spikes
- Better tail latency for decode operations
- More predictable service times
Implementation Challenges¶
- Partial KV cache management
- Attention mask complexity
- Scheduling overhead
6. Speculative Batching¶
Concept¶
Combine speculative decoding with batching
Mechanism¶
- Draft Phase: Small model predicts k tokens for all requests in batch
- Verification Phase: Large model verifies all k tokens in single pass
- Accept/Reject: Keep correct tokens, retry from first error
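For one sequence, the accept/reject rule can be sketched with greedy (argmax) verification; this is a simplification, since sampling-based serving uses probabilistic acceptance:

```python
# `draft` holds the k proposed tokens; `target` holds the target model's
# greedy choice at each of the same k positions (both available after the
# single verification pass). Keep draft tokens until the first disagreement,
# then substitute the target's token and stop.

def verify(draft, target):
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)                   # draft agreed: keep the token
        else:
            accepted.append(t)                   # first mismatch: take target's token
            break
    return accepted

verify([5, 9, 2, 7], [5, 9, 4, 1])               # → [5, 9, 4]
```

In a batch, each sequence runs this rule independently, which is why acceptance lengths diverge across the batch and create the synchronization challenge below.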
Batch-Level Optimization¶
Without Speculation: 5 target-model iterations for 5 tokens
With Speculation (k=5): 1 target-model verification pass (if all draft tokens accepted)
Effective Speedup: typically 2-3x across the batch
Challenges¶
- Variable acceptance rates across requests
- Synchronization points
- Draft model overhead
7. Prefix Caching in Batching¶
Problem¶
Repeated prefixes waste computation
User1: [System Prompt] + User Query 1
User2: [System Prompt] + User Query 2
User3: [System Prompt] + User Query 3
Solution¶
Share KV cache blocks for common prefixes
Automatic Prefix Caching (APC)¶
- Hash prompt prefixes
- Reuse physical memory blocks
- Multiple requests point to same KV cache
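A sketch of the hashing idea (block size and table layout are illustrative, not vLLM's actual data structures):

```python
import hashlib

# Split the prompt into fixed-size blocks, key each block by the hash of the
# FULL prefix up to its end, and allocate a physical block only for unseen
# keys — so identical prefixes map to the same memory.

BLOCK = 4                                        # tokens per KV block

def cache_blocks(tokens, table):
    """Return (block keys, number of newly allocated blocks)."""
    keys, allocated = [], 0
    for i in range(0, len(tokens), BLOCK):
        prefix = tuple(tokens[: i + BLOCK])      # key depends on the whole prefix
        key = hashlib.sha256(repr(prefix).encode()).hexdigest()
        if key not in table:
            table[key] = prefix                  # "allocate" a physical block
            allocated += 1
        keys.append(key)
    return keys, allocated

table = {}
sys_prompt = list(range(8))                      # shared 8-token system prompt
_, new1 = cache_blocks(sys_prompt + [101, 102, 103, 104], table)
_, new2 = cache_blocks(sys_prompt + [201, 202, 203, 204], table)
# new1 == 3 (all blocks fresh); new2 == 1 (both system-prompt blocks reused)
```

Hashing the whole prefix, not just the block's own tokens, is what guarantees a cache hit implies identical preceding context.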
Batch-Level Benefits¶
Without APC:
3 requests × 1000-token system prompt = 3000 tokens in KV cache
With APC:
3 requests share 1000-token KV cache = 1000 tokens total
Effective batch size: up to 3x larger (when the shared prefix dominates KV cache memory)
Use Cases¶
- Multi-turn chat (common history)
- RAG (shared context documents)
- Agent frameworks (repeated tool descriptions)
8. Mixed Batch Scheduling¶
Challenge¶
Different request types have different characteristics:
- Prefill: Compute-bound, parallel
- Decode: Memory-bound, sequential
Naive Approach¶
Separate batches for prefill and decode → Underutilization
Optimized Approach¶
Mixed batches with resource allocation:
GPU Resources:
80% compute → Prefill operations
20% memory bandwidth → Decode operations
Single iteration:
Process prefill chunks (compute-heavy)
Process decode steps (memory-heavy)
Overlap when possible
SplitFuse (DeepSpeed-FastGen)¶
Dynamic algorithm that:
- Monitors compute vs memory utilization
- Adjusts prefill chunk size
- Balances prefill/decode in each iteration
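The core idea (fill a fixed per-iteration token budget with decode steps first, then give the remainder to prefill) can be sketched as follows; the budget value and signature are illustrative, not DeepSpeed-FastGen's API:

```python
# Mixed-batch composition under a per-iteration token budget, in the spirit
# of SplitFuse: admit all pending decode steps (one token each), then size
# the prefill chunk to whatever budget is left.

def compose_iteration(decode_reqs, prefill_remaining, token_budget=2048):
    decode_tokens = min(len(decode_reqs), token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk

compose_iteration(decode_reqs=list(range(48)), prefill_remaining=10_000)
# → (48, 2000): 48 decode tokens, and a 2000-token prefill chunk fills the rest
```

Keeping the total token count per iteration roughly constant is what makes per-iteration latency, and hence decode latency, predictable.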
9. Batch Size Selection¶
Factors¶
Memory Constraint:
Total GPU Memory = Model Weights + KV Cache + Activations
KV Cache = batch_size × seq_len × kv_memory_per_token
Optimal Batch Size:
- Too small → GPU underutilized
- Too large → OOM, increased latency
Heuristics¶
For Decode:
optimal_batch_size ≈ (GPU_memory − model_size) / (max_seq_len × kv_per_token)
For Prefill:
Limited by compute, not memory
Larger batches better (up to memory limit)
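A back-of-envelope version of the decode heuristic (all numbers are illustrative; for an MHA model, KV bytes per token = 2 for K and V × n_layers × hidden_size × bytes per element):

```python
# Estimate how many concurrent decode requests fit after reserving memory
# for the model weights, assuming worst-case context length per request.

def max_decode_batch(gpu_mem_gb, model_gb, max_seq_len, kv_bytes_per_token):
    free = (gpu_mem_gb - model_gb) * 1024**3          # bytes left for KV cache
    per_request = max_seq_len * kv_bytes_per_token    # worst-case KV per request
    return int(free // per_request)

# e.g. 80 GB GPU, 14 GB fp16 7B model, 4096 max context, 512 KiB KV per token
max_decode_batch(80, 14, 4096, 512 * 1024)            # → 33 concurrent requests
```

Real servers are less conservative: with paged allocation they budget for actual, not worst-case, sequence lengths, so practical batch sizes run higher.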
Adaptive Batching¶
Monitor queue depth and adjust:
- High queue → Increase batch size (if memory allows)
- Low latency requirement → Decrease batch size
- Balance throughput vs latency SLO
10. Multi-Query Batching (MQA/GQA Context)¶
Relevance to Batching¶
Multi-Query Attention (MQA):
- Fewer KV heads → Smaller KV cache
- Enables larger batch sizes
Grouped-Query Attention (GQA):
- Middle ground between MHA and MQA
- Llama 3, Mistral use GQA
Impact¶
Llama 2 (MHA): 32 heads for K, 32 for V
Llama 3 (GQA): 8 KV heads, 32 Q heads
KV Cache Reduction: 4x
Batch Size Increase: ~4x at same memory
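The 4x figure follows directly from the per-token KV formula; fp16 (2 bytes per element), 32 layers, and head_dim 128 are assumed Llama-sized values:

```python
# Per-token KV cache = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(32, 32, 128)   # 32 KV heads → 524,288 bytes (512 KiB)
gqa = kv_bytes_per_token(32, 8, 128)    # 8 KV heads  → 131,072 bytes (128 KiB)
# mha // gqa == 4: the cache shrinks 4x, so ~4x more sequences fit per batch
```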
11. Interview Q&A¶
Q: Why is continuous batching better than dynamic batching?
A: Dynamic batching schedules at batch level, causing GPU idle time when short sequences complete. Continuous batching schedules at iteration level, immediately filling freed slots with new requests. This eliminates bubbles and improves throughput by 20-30%.
Q: What's the trade-off between batch size and latency?
A: Larger batches increase throughput but also increase latency per request (more compute per iteration, longer queue wait times). Optimal batch size depends on SLO requirements and whether you're optimizing for throughput or latency.
Q: How does chunked prefill prevent latency spikes?
A: Without chunking, a 10k token prefill blocks the GPU for many iterations, starving decode operations. Chunked prefill breaks it into smaller pieces (e.g., 512 tokens) and interleaves with decode, keeping decode latency predictable.
Q: Why does prefix caching improve effective batch size?
A: Common prefixes (system prompts, RAG contexts) are stored once in memory and shared across requests. If 10 requests share a 1k token prefix, you save 9k tokens of KV cache, allowing 9 more requests to fit in the same memory.
Q: When would you use static batching instead of continuous batching?
A: Offline batch processing with known sequence lengths, where latency doesn't matter and implementation simplicity is valued. For production online serving, continuous batching is almost always better.
Q: How does speculative decoding interact with batching?
A: Draft model generates k tokens for entire batch, then target model verifies all in parallel. Benefit multiplies across batch. Challenge is handling variable acceptance rates—some sequences may accept all k tokens while others reject after first token.
Q: What determines the optimal prefill chunk size?
A: Balance between prefill throughput and decode latency. Smaller chunks (256-512 tokens) minimize decode latency impact but reduce prefill efficiency. Larger chunks (1024-2048) maximize prefill throughput but can cause latency spikes. Typically tuned based on workload mix.
Q: Why does continuous batching require PagedAttention or similar?
A: Variable batch sizes per iteration need dynamic memory allocation. Traditional contiguous allocation can't handle requests joining/leaving mid-batch efficiently. Paged memory allows independent per-sequence management without fragmentation.