Ask a large language model to solve a math word problem directly, and it will often get it wrong. But add five magic words to your prompt -- "Let's think step by step" -- and accuracy can improve dramatically. This is the essence of chain-of-thought (CoT) prompting, one of the most influential discoveries in prompt engineering. By encouraging LLMs to show their reasoning process, we unlock capabilities that are hidden when the model is asked to jump straight to an answer.
What Is Chain-of-Thought Prompting?
Chain-of-thought prompting is a technique that encourages LLMs to generate intermediate reasoning steps before producing a final answer. Instead of directly mapping an input to an output, the model breaks the problem down into smaller steps and works through them sequentially.
The original CoT paper by Wei et al. (2022) demonstrated this with few-shot examples. By including examples that showed step-by-step reasoning in the prompt, the model learned to apply the same approach to new problems. The improvement was striking: on the GSM8K math benchmark, chain-of-thought prompting roughly tripled the accuracy of PaLM 540B, from about 18% to 57%.
"Chain-of-thought prompting allows large language models to decompose multi-step problems into intermediate steps, enabling complex reasoning that was previously out of reach." -- Wei et al., 2022
Why Does It Work?
The effectiveness of CoT prompting can be understood through several lenses.
Computation Allocation
When a model generates intermediate steps, it is effectively allocating more computation to the problem. Each token generated gives the model additional "thinking time" -- the internal representations at each step serve as a working memory that carries information forward. This is fundamentally different from trying to compute the answer in a single forward pass through the network.
Error Decomposition
Complex problems require multiple reasoning steps. If the model must produce the final answer directly, it needs to get every step right simultaneously. With CoT, each step can be relatively simple, and errors in individual steps are more likely to be caught and corrected in subsequent steps.
Pattern Activation
The intermediate text generated during CoT activates relevant patterns and knowledge in the model's parameters. Articulating the problem structure in natural language makes it easier for the model's next-token prediction to follow a logical path to the correct answer.
Key Takeaway
Chain-of-thought prompting works by giving the model more computation time, decomposing complex problems into manageable steps, and activating relevant knowledge through intermediate text generation.
Variants of Chain-of-Thought
Zero-Shot CoT
The simplest form of CoT requires no examples at all. Simply appending "Let's think step by step" to the prompt triggers step-by-step reasoning in most modern LLMs. This zero-shot approach, discovered by Kojima et al. (2022), is remarkably effective despite its simplicity. It works because instruction-tuned models have been trained on data that includes step-by-step reasoning.
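In practice, zero-shot CoT is just a prompt transformation. The sketch below shows one way to build such a prompt and pull a final answer out of the model's reasoning text; the model call itself is omitted, and `extract_final_answer` is a hypothetical last-number heuristic, not part of any published recipe.

```python
import re

STEP_TRIGGER = "Let's think step by step."

def build_zero_shot_cot_prompt(question):
    """Append the zero-shot CoT trigger phrase to a question."""
    return f"Q: {question}\nA: {STEP_TRIGGER}"

def extract_final_answer(completion):
    """Take the last number in a reasoning chain as the final answer.

    A crude heuristic; a stricter approach is to instruct the model to
    end with a fixed phrase like 'The answer is N.' and parse that."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

# The prompt would be sent to any LLM API; here we only show construction.
prompt = build_zero_shot_cot_prompt(
    "A farmer has 12 cows and buys 3 more. How many cows does he have?")
```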
Few-Shot CoT
The original approach includes a few examples of problems solved with step-by-step reasoning in the prompt. The model learns from these demonstrations and applies the same reasoning pattern to the new problem. Few-shot CoT generally outperforms zero-shot, particularly on more complex problems, but requires the effort of crafting good examples.
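A few-shot CoT prompt is simply the demonstrations concatenated ahead of the new question, each showing reasoning before its answer. A minimal sketch, using one hypothetical worked example (the classic tennis-ball problem from the Wei et al. paper is used here as the demonstration):

```python
# Worked demonstrations; swap in examples matching your problem type.
EXAMPLES = [
    {
        "question": ("Roger has 5 tennis balls. He buys 2 cans of 3 "
                     "tennis balls each. How many balls does he have now?"),
        "reasoning": ("Roger starts with 5 balls. 2 cans of 3 balls is "
                      "6 balls. 5 + 6 = 11."),
        "answer": "11",
    },
]

def build_few_shot_cot_prompt(question, examples=EXAMPLES):
    """Prepend worked step-by-step examples to a new question."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} "
        f"The answer is {ex['answer']}."
        for ex in examples
    ]
    parts.append(f"Q: {question}\nA:")  # model continues from here
    return "\n\n".join(parts)
```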
Self-Consistency
Self-consistency generates multiple chain-of-thought reasoning paths for the same problem and takes the majority vote on the final answer. Different reasoning paths may make different errors, but the correct answer tends to appear more frequently. This approach can significantly improve accuracy at the cost of increased compute.
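The voting logic is straightforward. In this sketch, `sample_fn` is an assumed callable that runs one sampled (temperature > 0) chain of thought and returns its final answer; only the aggregation is shown.

```python
from collections import Counter

def self_consistent_answer(sample_fn, question, n_paths=5):
    """Sample several independent reasoning paths and majority-vote.

    `sample_fn(question)` is assumed to return the final answer from one
    independently sampled chain of thought."""
    answers = [sample_fn(question) for _ in range(n_paths)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

With a deterministic model (temperature 0) every path is identical and voting adds nothing, so self-consistency only makes sense with sampling enabled.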
Tree of Thoughts
Tree of Thoughts (Yao et al., 2023) extends CoT by allowing the model to explore multiple reasoning branches at each step, evaluate how promising each one is, and backtrack when a branch seems unproductive. This is analogous to how humans solve complex problems -- considering multiple approaches, evaluating partial solutions, and focusing on the most promising direction.
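The branch-evaluate-prune control flow can be sketched as a small beam search. In real Tree-of-Thoughts systems an LLM plays both roles below; here `expand` (propose next thoughts) and `score` (rate a partial solution) are assumed callables, and the toy only demonstrates the search skeleton.

```python
import heapq

def tree_of_thought_search(root, expand, score, beam_width=2, depth=3):
    """Toy beam search over partial reasoning states.

    `expand(state)` proposes candidate next thoughts; `score(state)`
    rates how promising a partial solution is. Real implementations use
    an LLM for both; this only shows branch / evaluate / prune."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Keep only the most promising branches; the rest are abandoned,
        # which is the search-level analogue of backtracking.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return max(frontier, key=score)
```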
Reasoning Models (o1, o3, DeepSeek-R1)
The most recent evolution goes beyond prompting to models specifically trained for extended reasoning. OpenAI's o1 and o3 models and DeepSeek's R1 model are trained with reinforcement learning to generate long chains of thought, including self-correction and exploration. These models can spend significantly more tokens on reasoning, sometimes generating thousands of tokens of internal thought before producing an answer.
When to Use Chain-of-Thought
CoT is not always helpful. Here are guidelines for when to use it:
- Use CoT for: Math problems, logic puzzles, multi-step reasoning, complex analysis, planning tasks, and any problem that requires combining multiple pieces of information.
- Skip CoT for: Simple factual recall, classification tasks, creative writing, and tasks where intermediate reasoning does not help and only adds latency and token cost.
- Consider model scale: In the original studies, CoT gains emerged only in large models (on the order of 100B parameters). Smaller models tended to produce plausible-sounding but incorrect reasoning chains.
Practical Tips for Effective CoT
- Be specific in your instructions. Instead of just "think step by step," specify the type of reasoning you want: "First identify the relevant variables, then set up the equation, then solve."
- Provide high-quality examples. For few-shot CoT, the quality of your examples matters enormously. Include examples that demonstrate the type of reasoning needed for your specific problem type.
- Use self-consistency for critical tasks. When accuracy is paramount, generate multiple reasoning paths and take the majority answer. Three to five paths is often sufficient.
- Verify the reasoning, not just the answer. A correct answer derived from flawed reasoning is unreliable. Check the intermediate steps to ensure the reasoning is sound.
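Part of the last tip can even be automated: simple arithmetic claims inside a reasoning chain can be checked mechanically. This is a sketch of the idea, not a full verifier -- it only catches statements of the form "a op b = c" and ignores everything else.

```python
import re

STEP_RE = re.compile(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+)")

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def check_arithmetic_steps(chain):
    """Return the arithmetic claims in a reasoning chain that don't hold.

    Only simple integer 'a op b = c' statements are checked."""
    bad = []
    for a, op, b, c in STEP_RE.findall(chain):
        if OPS[op](int(a), int(b)) != int(c):
            bad.append(f"{a} {op} {b} = {c}")
    return bad
```

A chain like "5 + 6 = 11, so 11 * 2 = 23" would be flagged on its second step even though the first is fine, illustrating why checking intermediate steps matters.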
Key Takeaway
Chain-of-thought prompting is one of the most reliable techniques for improving LLM reasoning. It requires no model changes, takes only a prompt change to implement, and can dramatically improve performance on complex tasks -- at the modest cost of extra output tokens and latency.
