Self-Consistency
1. Overview
Self-consistency is a prompting technique that improves the reliability and accuracy of large language models (LLMs) by generating multiple reasoning paths and selecting the most consistent answer through majority voting.
Key Idea: Instead of relying on a single greedy decode path, generate diverse reasoning paths and marginalize out the reasoning process to arrive at the most consistent final answer.
2. How It Works
Basic Algorithm
1. Generate Multiple Samples: Use temperature-based sampling (T > 0) to generate N diverse reasoning paths for the same prompt
2. Extract Answers: Parse the final answer from each reasoning path
3. Majority Vote: Select the answer that appears most frequently across all samples
Example
Question: "If there are 3 cars in the parking lot and 2 more arrive, how many cars are there?"
Sample 1: "Initially 3 cars. 2 more arrive. 3 + 2 = 5 cars" → Answer: 5
Sample 2: "We start with 3. Adding 2 gives us 3 + 2 = 5" → Answer: 5
Sample 3: "3 cars plus 2 cars equals 5 cars total" → Answer: 5
Final Answer: 5 (unanimous)
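The vote-counting step can be sketched with Python's `collections.Counter`. The "6" below is a hypothetical faulty fourth sample, added to show how voting resolves disagreement:

```python
from collections import Counter

# Answers parsed from sampled reasoning paths; the "6" is a
# hypothetical faulty path, not from the example above
answers = ["5", "5", "5", "6"]

votes = Counter(answers)
best, count = votes.most_common(1)[0]
print(best, count / len(answers))  # 5 0.75 -- the vote share doubles as a confidence score
```

The vote share of the winning answer is a cheap built-in confidence estimate, which Section 4 highlights as one of the technique's advantages.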
3. Technical Details
Mathematical Formulation
Given a prompt x, greedy decoding selects the single most likely reasoning path r and answer a:

    (r, a) = argmax_{r, a} p(r, a | x)

Self-consistency instead marginalizes over reasoning paths, selecting the answer with the greatest total probability mass:

    a* = argmax_a Σ_{r ∈ R(a)} p(r, a | x)

Where:
- r = reasoning path
- a = answer
- R(a) = set of reasoning paths leading to answer a
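A small numerical sketch shows why marginalizing can change the outcome. The joint probabilities below are invented for illustration: the single most likely path gives one answer, but another answer carries more total probability mass.

```python
# Hypothetical joint probabilities p(r, a | x), invented for illustration:
# (reasoning_path, answer, probability)
paths = [
    ("r1", "5", 0.30),  # the single most likely path
    ("r2", "6", 0.28),
    ("r3", "6", 0.27),
    ("r4", "5", 0.15),
]

# Greedy decoding follows only the single best path
greedy_answer = max(paths, key=lambda t: t[2])[1]

# Self-consistency sums probability mass over R(a) for each answer a
mass = {}
for _, a, p in paths:
    mass[a] = mass.get(a, 0.0) + p
sc_answer = max(mass, key=mass.get)

print(greedy_answer, sc_answer)  # 5 6 -- marginalizing flips the answer here
```

Here answer "6" wins with mass 0.28 + 0.27 = 0.55, even though no single path supporting it is as likely as r1.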
Implementation Parameters
- Temperature: 0.5 - 1.0 (enables diverse sampling)
- Number of Samples (N): 5-40 (typically 10-20 for good balance)
- Sampling Method: Temperature sampling or nucleus sampling (top-p)
- Answer Extraction: Regex patterns or structured output parsing
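Of the sampling methods listed, nucleus (top-p) sampling can be sketched over a toy next-token distribution. The distribution and token strings here are invented for illustration:

```python
import random

def nucleus_sample(probs, top_p=0.9):
    """Sample from the smallest set of tokens whose cumulative
    probability reaches top_p (nucleus / top-p sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]

# Toy next-token distribution; "banana" falls outside the 0.9 nucleus
probs = {"5": 0.6, "six": 0.25, "4": 0.1, "banana": 0.05}
print(nucleus_sample(probs, top_p=0.9))
```

Truncating the low-probability tail is what keeps sampled reasoning paths diverse without admitting implausible tokens, which is why top-p pairs well with the moderate temperatures recommended above.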
4. Advantages
- Improved Accuracy: 10-20% boost on arithmetic and commonsense reasoning tasks
- Uncertainty Estimation: Vote distribution indicates model confidence
- No Fine-tuning Required: Works with pre-trained models out-of-the-box
- Robust to Prompt Variations: Averages out biases from specific phrasings
5. Limitations
- Computational Cost: N times more expensive than single inference
- Latency: Parallel sampling helps, but end-to-end latency still exceeds a single greedy decode
- Answer Extraction: Requires reliable parsing of diverse outputs
- Not Universal: Most effective for tasks with discrete, verifiable answers
6. Recent Developments (2023-2025)
Universal Self-Consistency (USC)
- Extends to open-ended generation tasks
- Uses semantic similarity instead of exact match
- Clusters similar responses and selects representative answer
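The select-a-representative idea can be sketched with a toy similarity measure. Token-overlap (Jaccard) similarity stands in for the semantic similarity USC actually relies on (e.g. embeddings or an LLM judge); the responses below are invented for illustration:

```python
import re

def similarity(a, b):
    """Toy token-overlap (Jaccard) similarity -- a stand-in for
    the semantic similarity a real USC implementation would use."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb)

def usc_select(responses):
    """Pick the response most similar to all others: a rough
    'most representative' choice, akin to a cluster medoid."""
    scores = [
        sum(similarity(r, other) for j, other in enumerate(responses) if j != i)
        for i, r in enumerate(responses)
    ]
    return responses[scores.index(max(scores))]

responses = [
    "The capital of Australia is Canberra.",
    "Canberra is the capital of Australia.",
    "Sydney is the capital of Australia.",
]
print(usc_select(responses))  # The capital of Australia is Canberra.
```

Because open-ended outputs rarely match exactly, selecting the response closest to all others replaces the exact-match majority vote.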
Self-Consistency with Chain-of-Thought (CoT)
- Combined with CoT prompting for complex reasoning
- Standard practice in modern LLM applications
- Implemented in frameworks like LangChain, DSPy
Weighted Voting Schemes
- Confidence-weighted voting using log probabilities
- Quality-based weighting using separate verifier models
- Adaptive sample sizes based on initial agreement
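Confidence-weighted voting can be sketched as follows. The answers and sequence log probabilities are hypothetical; the point is that one high-confidence path can outvote several low-confidence ones:

```python
import math
from collections import Counter, defaultdict

# Hypothetical (answer, sequence_logprob) pairs from sampled paths
samples = [("42", -3.0), ("42", -3.2), ("42", -3.1), ("41", -1.0)]

# Plain majority vote ignores how probable each path was
majority = Counter(a for a, _ in samples).most_common(1)[0][0]

# Confidence-weighted vote: weight each answer by exp(logprob) = p(path)
weights = defaultdict(float)
for answer, logprob in samples:
    weights[answer] += math.exp(logprob)
weighted = max(weights, key=weights.get)

print(majority, weighted)  # 42 41 -- one confident path outweighs three weak ones
```

Whether that behavior is desirable depends on how well calibrated the model's sequence probabilities are, which is why verifier-based weighting is listed as an alternative.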
Integration with Tool Use
- Self-consistency over tool-augmented reasoning paths
- Multiple execution paths with external APIs/calculators
- Verification through diverse computational approaches
7. Implementation Example
import re
from collections import Counter


def self_consistency(prompt, model, n_samples=10, temperature=0.7):
    """Implement self-consistency for LLM reasoning."""
    # Generate diverse reasoning paths
    responses = []
    for _ in range(n_samples):
        response = model.generate(
            prompt=prompt,
            temperature=temperature,
            max_tokens=256,
        )
        responses.append(response)

    # Extract answers
    answers = [extract_answer(r) for r in responses]

    # Majority vote
    vote_counts = Counter(answers)
    most_common_answer = vote_counts.most_common(1)[0][0]
    confidence = vote_counts[most_common_answer] / n_samples
    return most_common_answer, confidence


def extract_answer(response):
    """Extract the final answer from a reasoning path."""
    # Pattern matching for common answer formats
    patterns = [
        r"(?:answer is|answer:|final answer:)\s*([^\n]+)",
        r"#### ([^\n]+)",  # Common in math problems
        r"\n\n([^\n]+)$",  # Fall back to the last line
    ]
    for pattern in patterns:
        match = re.search(pattern, response, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return response.strip()
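A usage sketch with a stub model shows the end-to-end flow. `StubModel` and its canned outputs are invented for the demo, and the pipeline is a condensed copy of the implementation above so the snippet runs standalone; real usage would pass an actual LLM client exposing the assumed `generate()` interface:

```python
import re
from collections import Counter


class StubModel:
    """Hypothetical stand-in for an LLM client exposing the
    generate() interface assumed by self_consistency()."""
    def __init__(self):
        # Canned outputs: four correct paths and one faulty one
        self._outputs = iter(["5", "5", "4", "5", "5"])

    def generate(self, prompt, temperature, max_tokens):
        return f"Reasoning about: {prompt}\nThe answer is {next(self._outputs)}"


def self_consistency(prompt, model, n_samples=5, temperature=0.7):
    # Condensed copy of the implementation above, for a standalone demo
    responses = [
        model.generate(prompt=prompt, temperature=temperature, max_tokens=256)
        for _ in range(n_samples)
    ]
    answers = [
        re.search(r"answer is\s*(\S+)", r, re.IGNORECASE).group(1)
        for r in responses
    ]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples


answer, confidence = self_consistency("3 cars plus 2 more?", StubModel())
print(answer, confidence)  # 5 0.8
```

The single faulty path is outvoted 4-to-1, and the returned confidence of 0.8 reflects the disagreement.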