Chain-of-Thought Prompting

Motivation

Large language models generate text token by token in a single left-to-right pass. For tasks requiring multiple inferential steps — arithmetic, logical deduction, commonsense reasoning, multi-step planning — generating the final answer directly often fails: the model has no opportunity to work through intermediate steps that would constrain and inform the conclusion.

Chain-of-thought (CoT) prompting (Wei et al. 2022) addresses this by eliciting intermediate reasoning steps as part of the generated output. Instead of producing an answer directly, the model produces a reasoning chain followed by the answer. Both the chain and the answer are generated autoregressively in a single left-to-right decoding pass — no architectural change is needed.

Few-Shot Chain of Thought

The simplest form: include demonstrations in the prompt that pair each question with a reasoning chain and a final answer.

Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
   How many does he have now?
A: Roger starts with 5 balls. 2 cans × 3 balls = 6. Total: 5 + 6 = 11.
   The answer is 11.

Q: The cafeteria had 23 apples. They used 20 for lunch, then bought 6.
   How many do they have?
A:

The model generalizes the demonstrated reasoning pattern to the new question. On multi-step arithmetic and symbolic reasoning benchmarks, CoT prompting dramatically improves accuracy over standard few-shot prompting. The gain grows with model scale and is most pronounced when problems require more than one or two inferential steps.
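The demonstration format above can be assembled programmatically. A minimal sketch — the `Demo` structure and the exact template are illustrative assumptions, not prescribed by the paper:

```python
from dataclasses import dataclass

@dataclass
class Demo:
    question: str
    chain: str   # intermediate reasoning steps
    answer: str  # final answer, restated at the end of the chain

def build_cot_prompt(demos: list[Demo], question: str) -> str:
    """Pair each demonstration with its reasoning chain and answer,
    then append the new question with an open 'A:' for the model
    to complete."""
    parts = [
        f"Q: {d.question}\nA: {d.chain} The answer is {d.answer}."
        for d in demos
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

demos = [Demo(
    question="Roger has 5 tennis balls. He buys 2 more cans of 3 balls "
             "each. How many does he have now?",
    chain="Roger starts with 5 balls. 2 cans × 3 balls = 6. "
          "Total: 5 + 6 = 11.",
    answer="11",
)]
prompt = build_cot_prompt(
    demos,
    "The cafeteria had 23 apples. They used 20 for lunch, then bought 6. "
    "How many do they have?",
)
```

The resulting string ends with an open `A:`, so the model's continuation is the reasoning chain and answer for the new question.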

Zero-Shot Chain of Thought

Appending “Let’s think step by step.” to the prompt elicits reasoning chains without any demonstrations, substantially improving accuracy on the same tasks. This suggests the model has internalized reasoning patterns during pretraining and simply needs to be cued to apply them.

Why Chain of Thought Helps

Several mechanisms plausibly contribute:

  • Additional computation budget. Each generated token triggers one bounded forward pass. A long reasoning chain spends many such passes — effectively more computation — before the model commits to an answer.
  • Intermediate constraints. Each step constrains what comes next. If the chain establishes “total = 11”, the model is unlikely to conclude “the answer is 5”.
  • Training distribution match. Pretraining on text that includes step-by-step solutions, mathematical proofs, and worked examples teaches the model to generate and follow reasoning chains.

Faithfulness

A concern with CoT is faithfulness: does the chain reflect the computation actually used to reach the answer, or is it a post-hoc rationalization generated to look plausible? Evidence suggests CoT chains are often unfaithful — the model's final answer can stay the same even when the chain is edited to contain errors, and the model can reach the same answer via materially different chains. This limits the degree to which CoT chains can be used to audit or verify model reasoning.

Extensions

  • Self-consistency. Sample multiple reasoning chains from the same prompt, then take the majority vote over final answers. Reduces variance and consistently outperforms single-chain CoT.
  • Least-to-most prompting. Decompose the problem into subproblems, solve them in order, feeding earlier answers as context. Improves generalization to problem instances harder than the demonstrations.
  • Program-of-thought / tool use. Generate code instead of natural-language chains; execute it externally. This grounds computation in an exact executor, eliminating arithmetic errors.
  • Process reward models. Train a model to score each reasoning step individually rather than just the final answer, then use it to guide chain generation. See post-training RL.
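Of these, self-consistency is the simplest to implement: sample several chains at nonzero temperature, extract each final answer, and return the most common one. A sketch in which the sampled answers are hard-coded for illustration (in practice each would come from one sampled model call):

```python
from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote over the final answers of independently sampled
    reasoning chains; ties break toward the first-seen answer."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Five sampled chains might yield these final answers; the single
# arithmetic slip ("8") is outvoted by the majority.
answers = ["9", "9", "8", "9", "9"]
print(self_consistent_answer(answers))  # → 9
```

Note that the vote is over final answers only — two chains that reason differently but conclude "9" count as agreement, which is what makes the method robust to individual reasoning errors.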

References

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.