Self-Consistency
1. Overview
Self-consistency is a prompting technique that improves the reliability and accuracy of large language models (LLMs) by generating multiple reasoning paths and selecting the most consistent answer through majority voting.
Key Idea: Instead of relying on a single greedy decode path, generate diverse reasoning paths and marginalize out the reasoning process to arrive at the most consistent final answer.
2. How It Works
Basic Algorithm
1. Generate Multiple Samples: Use temperature-based sampling (T > 0) to generate N diverse reasoning paths for the same prompt
2. Extract Answers: Parse the final answer from each reasoning path
3. Majority Vote: Select the answer that appears most frequently across all samples
Example
Question: "If there are 3 cars in the parking lot and 2 more arrive, how many cars are there?"
Sample 1: "Initially 3 cars. 2 more arrive. 3 + 2 = 5 cars" → Answer: 5
Sample 2: "We start with 3. Adding 2 gives us 3 + 2 = 5" → Answer: 5
Sample 3: "3 cars plus 2 cars equals 5 cars total" → Answer: 5
Final Answer: 5 (unanimous)
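The vote-counting step can be sketched with Python's `collections.Counter`. The "6" below is a hypothetical faulty fourth sample, added to show how voting resolves disagreement:

```python
from collections import Counter

# Answers parsed from sampled reasoning paths; the "6" is a
# hypothetical faulty path, not from the example above
answers = ["5", "5", "5", "6"]

votes = Counter(answers)
best, count = votes.most_common(1)[0]
print(best, count / len(answers))  # 5 0.75 -- the vote share doubles as a confidence score
```

The vote share of the winning answer is a cheap built-in confidence estimate, which Section 4 highlights as one of the technique's advantages.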
3. Technical Details
Mathematical Formulation
Given a prompt x, greedy decoding selects the single most likely reasoning path r and answer a:

    (r, a) = argmax_{r, a} p(r, a | x)

Self-consistency instead marginalizes over reasoning paths, selecting the answer with the greatest total probability mass:

    a* = argmax_a Σ_{r ∈ R(a)} p(r, a | x)

Where:
- r = reasoning path
- a = answer
- R(a) = set of reasoning paths leading to answer a
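A small numerical sketch shows why marginalizing can change the outcome. The joint probabilities below are invented for illustration: the single most likely path gives one answer, but another answer carries more total probability mass.

```python
# Hypothetical joint probabilities p(r, a | x), invented for illustration:
# (reasoning_path, answer, probability)
paths = [
    ("r1", "5", 0.30),  # the single most likely path
    ("r2", "6", 0.28),
    ("r3", "6", 0.27),
    ("r4", "5", 0.15),
]

# Greedy decoding follows only the single best path
greedy_answer = max(paths, key=lambda t: t[2])[1]

# Self-consistency sums probability mass over R(a) for each answer a
mass = {}
for _, a, p in paths:
    mass[a] = mass.get(a, 0.0) + p
sc_answer = max(mass, key=mass.get)

print(greedy_answer, sc_answer)  # 5 6 -- marginalizing flips the answer here
```

Here answer "6" wins with mass 0.28 + 0.27 = 0.55, even though no single path supporting it is as likely as r1.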
Implementation Parameters
- Temperature: 0.5 - 1.0 (enables diverse sampling)
- Number of Samples (N): 5-40 (typically 10-20 for good balance)
- Sampling Method: Temperature sampling or nucleus sampling (top-p)
- Answer Extraction: Regex patterns or structured output parsing
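Of the sampling methods listed, nucleus (top-p) sampling can be sketched over a toy next-token distribution. The distribution and token strings here are invented for illustration:

```python
import random

def nucleus_sample(probs, top_p=0.9):
    """Sample from the smallest set of tokens whose cumulative
    probability reaches top_p (nucleus / top-p sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]

# Toy next-token distribution; "banana" falls outside the 0.9 nucleus
probs = {"5": 0.6, "six": 0.25, "4": 0.1, "banana": 0.05}
print(nucleus_sample(probs, top_p=0.9))
```

Truncating the low-probability tail is what keeps sampled reasoning paths diverse without admitting implausible tokens, which is why top-p pairs well with the moderate temperatures recommended above.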
4. Advantages
- Improved Accuracy: 10-20% boost on arithmetic and commonsense reasoning tasks
- Uncertainty Estimation: Vote distribution indicates model confidence
- No Fine-tuning Required: Works with pre-trained models out-of-the-box
- Robust to Prompt Variations: Averages out biases from specific phrasings
5. Limitations
- Computational Cost: N times more expensive than single inference
- Latency: Parallel sampling helps, but end-to-end latency still exceeds a single greedy decode
- Answer Extraction: Requires reliable parsing of diverse outputs
- Not Universal: Most effective for tasks with discrete, verifiable answers
6. Recent Developments (2023-2025)
Universal Self-Consistency (USC)
- Extends to open-ended generation tasks
- Uses semantic similarity instead of exact match
- Clusters similar responses and selects representative answer
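The select-a-representative idea can be sketched with a toy similarity measure. Token-overlap (Jaccard) similarity stands in for the semantic similarity USC actually relies on (e.g. embeddings or an LLM judge); the responses below are invented for illustration:

```python
import re

def similarity(a, b):
    """Toy token-overlap (Jaccard) similarity -- a stand-in for
    the semantic similarity a real USC implementation would use."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb)

def usc_select(responses):
    """Pick the response most similar to all others: a rough
    'most representative' choice, akin to a cluster medoid."""
    scores = [
        sum(similarity(r, other) for j, other in enumerate(responses) if j != i)
        for i, r in enumerate(responses)
    ]
    return responses[scores.index(max(scores))]

responses = [
    "The capital of Australia is Canberra.",
    "Canberra is the capital of Australia.",
    "Sydney is the capital of Australia.",
]
print(usc_select(responses))  # The capital of Australia is Canberra.
```

Because open-ended outputs rarely match exactly, selecting the response closest to all others replaces the exact-match majority vote.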
Self-Consistency with Chain-of-Thought (CoT)
- Combined with CoT prompting for complex reasoning
- Standard practice in modern LLM applications
- Implemented in frameworks like LangChain, DSPy
Weighted Voting Schemes
- Confidence-weighted voting using log probabilities
- Quality-based weighting using separate verifier models
- Adaptive sample sizes based on initial agreement
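Confidence-weighted voting can be sketched as follows. The answers and sequence log probabilities are hypothetical; the point is that one high-confidence path can outvote several low-confidence ones:

```python
import math
from collections import Counter, defaultdict

# Hypothetical (answer, sequence_logprob) pairs from sampled paths
samples = [("42", -3.0), ("42", -3.2), ("42", -3.1), ("41", -1.0)]

# Plain majority vote ignores how probable each path was
majority = Counter(a for a, _ in samples).most_common(1)[0][0]

# Confidence-weighted vote: weight each answer by exp(logprob) = p(path)
weights = defaultdict(float)
for answer, logprob in samples:
    weights[answer] += math.exp(logprob)
weighted = max(weights, key=weights.get)

print(majority, weighted)  # 42 41 -- one confident path outweighs three weak ones
```

Whether that behavior is desirable depends on how well calibrated the model's sequence probabilities are, which is why verifier-based weighting is listed as an alternative.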
Integration with Tool Use
- Self-consistency over tool-augmented reasoning paths
- Multiple execution paths with external APIs/calculators
- Verification through diverse computational approaches
7. Implementation Example
import re
from collections import Counter


def self_consistency(prompt, model, n_samples=10, temperature=0.7):
    """Implement self-consistency for LLM reasoning."""
    # Generate diverse reasoning paths
    responses = []
    for _ in range(n_samples):
        response = model.generate(
            prompt=prompt,
            temperature=temperature,
            max_tokens=256,
        )
        responses.append(response)

    # Extract answers
    answers = [extract_answer(r) for r in responses]

    # Majority vote
    vote_counts = Counter(answers)
    most_common_answer = vote_counts.most_common(1)[0][0]
    confidence = vote_counts[most_common_answer] / n_samples
    return most_common_answer, confidence


def extract_answer(response):
    """Extract the final answer from a reasoning path."""
    # Pattern matching for common answer formats
    patterns = [
        r"(?:answer is|answer:|final answer:)\s*([^\n]+)",
        r"#### ([^\n]+)",  # Common in math problems
        r"\n\n([^\n]+)$",  # Fall back to the last line
    ]
    for pattern in patterns:
        match = re.search(pattern, response, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return response.strip()
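A usage sketch with a stub model shows the end-to-end flow. `StubModel` and its canned outputs are invented for the demo, and the pipeline is a condensed copy of the implementation above so the snippet runs standalone; real usage would pass an actual LLM client exposing the assumed `generate()` interface:

```python
import re
from collections import Counter


class StubModel:
    """Hypothetical stand-in for an LLM client exposing the
    generate() interface assumed by self_consistency()."""
    def __init__(self):
        # Canned outputs: four correct paths and one faulty one
        self._outputs = iter(["5", "5", "4", "5", "5"])

    def generate(self, prompt, temperature, max_tokens):
        return f"Reasoning about: {prompt}\nThe answer is {next(self._outputs)}"


def self_consistency(prompt, model, n_samples=5, temperature=0.7):
    # Condensed copy of the implementation above, for a standalone demo
    responses = [
        model.generate(prompt=prompt, temperature=temperature, max_tokens=256)
        for _ in range(n_samples)
    ]
    answers = [
        re.search(r"answer is\s*(\S+)", r, re.IGNORECASE).group(1)
        for r in responses
    ]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples


answer, confidence = self_consistency("3 cars plus 2 more?", StubModel())
print(answer, confidence)  # 5 0.8
```

The single faulty path is outvoted 4-to-1, and the returned confidence of 0.8 reflects the disagreement.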