Self-Consistency

Self-Consistency is an advanced prompting technique designed to improve the reliability and accuracy of language model outputs, especially for complex reasoning tasks.

Core Concept 🎯

Instead of relying on a single response, generate multiple answers and pick the most frequent one!

The same prompt is sent to the model several times with sampling enabled (a nonzero temperature, often with different random seeds), and the answer that appears most often across the outputs is selected as the final result.

This approach was introduced by Wang et al. (2022) to address the variability and occasional errors in chain-of-thought (CoT) reasoning, particularly for arithmetic and commonsense reasoning tasks.

Use When

The task has a well-defined final answer that can be reached through several different reasoning paths.
You want to improve reliability by aggregating outputs.
The model’s responses vary significantly between runs.

Pattern

Prompt the model with the same question multiple times (using temperature sampling or different seeds).
Collect all generated answers.
Select the most frequent or consistent answer as the final output.
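
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `self_consistency` and `fake_sampler` are hypothetical names, and `fake_sampler` replays canned answers in place of a real model call with temperature sampling.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def self_consistency(sample_fn: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Call sample_fn n_samples times and return the most frequent answer."""
    # Steps 1-2: send the same prompt repeatedly and collect the answers.
    answers: List[str] = [sample_fn(prompt) for _ in range(n_samples)]
    # Step 3: majority vote; on a tie, the answer seen first wins.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a sampled model call (assumption: a real
# implementation would query a model with temperature > 0 here).
_canned = cycle(["67", "67", "35"])


def fake_sampler(prompt: str) -> str:
    return next(_canned)


final_answer = self_consistency(fake_sampler, "How old is my sister?", n_samples=3)
print(final_answer)
```

In practice `sample_fn` would also need to reduce each raw response to a comparable final answer before voting, since two differently worded responses count as different strings.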

Examples

Example 1: Arithmetic Reasoning (from Wang et al., 2022)

Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?

Model outputs (sampled multiple times):

- When I was 6, my sister was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.
- When I was 6, my sister was half my age, so she was 3. Now that I am 70, she is 70 - 3 = 67. The answer is 67.
- When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35.

Final answer (by majority):

67
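
The vote over the three sampled outputs above can be reproduced with a short script. It is a sketch under one assumption: every output ends with the phrase "The answer is <number>.", which is how the final answer is extracted here. The names `ANSWER_RE` and `extract_answer` are illustrative.

```python
import re
from collections import Counter

# The three sampled outputs from the example above.
outputs = [
    "When I was 6, my sister was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.",
    "When I was 6, my sister was half my age, so she was 3. Now that I am 70, she is 70 - 3 = 67. The answer is 67.",
    "When I was 6, my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35.",
]

# Assumed convention: the final answer follows the phrase "The answer is".
ANSWER_RE = re.compile(r"The answer is (-?\d+(?:\.\d+)?)")


def extract_answer(text):
    """Return the last stated numeric answer as a string, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


# Outputs without a recognizable answer are skipped rather than counted.
answers = [a for a in map(extract_answer, outputs) if a is not None]
majority_answer, count = Counter(answers).most_common(1)[0]
print(majority_answer, count)  # prints: 67 2
```

Voting on the extracted answer rather than the full text matters: the first two outputs are worded differently but agree on 67.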

Example 2: Sentiment Classification (new)

Classify the sentiment of the following review: "The product exceeded my expectations."

Model outputs (sampled multiple times):

- Positive
- Positive
- Positive
- Neutral

Final answer (by majority):

Positive
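
For classification labels, the same vote can also report how strongly the samples agree. The sketch below assumes labels should be compared case-insensitively; `vote_with_agreement` is a hypothetical helper name.

```python
from collections import Counter


def vote_with_agreement(labels):
    """Return (majority label, share of samples that chose it)."""
    # Normalize so that "Positive" and " positive" count as the same vote.
    normalized = [label.strip().lower() for label in labels]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


label, agreement = vote_with_agreement(["Positive", "Positive", "Positive", "Neutral"])
print(label, agreement)  # prints: positive 0.75
```

A low agreement share is a hint that the samples are split and the majority label deserves less trust.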

Benefits

Reliability: Reduces the impact of random errors or outlier responses.
Accuracy: Aggregates multiple reasoning paths to find the most likely answer.
Robustness: Especially useful for tasks where several different reasoning paths can lead to the same correct answer.

Pitfalls

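Increases computational cost: sampling N responses costs roughly N times as much as a single run.
May not resolve ambiguity: if the samples are scattered with no clear majority, or the model repeats the same biased answer in most samples, the vote is no more reliable than a single response.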
Not always necessary for simple or deterministic tasks.