Prompting

Motivation

A pretrained large language model represents an enormous amount of world knowledge compressed into its weights. Prompting is the practice of steering model behavior through the text placed in the input — the prompt — rather than through gradient updates to the parameters. Prompting is fast (no training required), flexible (any task expressible in text), and often surprisingly effective.

Zero-Shot Prompting

A zero-shot prompt describes the task in natural language without providing any examples. The model infers the correct behavior from the description alone.

Classify the sentiment of the following review as Positive or Negative.

Review: "The movie was slow but ultimately rewarding."
Sentiment:

Zero-shot prompting works because pretraining on diverse text gives the model implicit knowledge of many tasks. Quality depends strongly on how clearly the task is described and how well it matches patterns in the pretraining distribution.
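
As a concrete sketch, the zero-shot prompt above can be assembled and sent programmatically. Here complete is a hypothetical stand-in for whatever text-completion API is available, not a real library call; the prompt template is the only substantive part.

def complete(prompt: str) -> str:  # hypothetical model call; swap in a real API client
    raise NotImplementedError

def classify_zero_shot(review: str) -> str:
    prompt = (
        "Classify the sentiment of the following review as Positive or Negative.\n\n"
        f'Review: "{review}"\n'
        "Sentiment:"
    )
    # Take the first word of the completion as the predicted label.
    return complete(prompt).strip().split()[0]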

Few-Shot Prompting

A few-shot prompt includes several input-output examples (demonstrations) before the query. The model generalizes from the examples without updating its weights — this is in-context learning (Brown et al. 2020):

Review: "Absolutely loved it." → Positive
Review: "Terrible waste of time." → Negative
Review: "The movie was slow but ultimately rewarding." →

Few-shot prompting typically outperforms zero-shot prompting, especially for specialized or unusual tasks, because the examples resolve ambiguity about the output format and the task definition.
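
In code, the demonstrations are simply concatenated ahead of the query. A minimal sketch, again assuming a hypothetical complete call; the demonstrations are the ones from the example above.

def complete(prompt: str) -> str:  # hypothetical model call
    raise NotImplementedError

DEMOS = [
    ("Absolutely loved it.", "Positive"),
    ("Terrible waste of time.", "Negative"),
]

def few_shot_prompt(query: str) -> str:
    # One "Review → Label" line per demonstration, then the unanswered query.
    lines = [f'Review: "{text}" → {label}' for text, label in DEMOS]
    lines.append(f'Review: "{query}" →')
    return "\n".join(lines)

def classify_few_shot(query: str) -> str:
    return complete(few_shot_prompt(query)).strip().split()[0]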

Instruction Following

Modern large language models are fine-tuned to follow explicit natural-language instructions. Instruction-tuned models respond more reliably to task descriptions and exhibit less sensitivity to exact prompt wording than base pretrained models.

For instruction-tuned models, a typical prompt consists of three parts (sketched in message form after this list):

  1. A system prompt specifying the model’s role or constraints.
  2. A user message describing the task.
  3. Optionally, few-shot examples or background context.
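
The same three parts can be laid out in a chat-message format. The role/content schema below follows a common chat-API convention, but the exact field names and structure vary by provider.

messages = [
    # System prompt: the model's role and constraints.
    {"role": "system",
     "content": "You are a careful sentiment classifier. Answer with one word."},
    # Optional few-shot examples, supplied as prior turns.
    {"role": "user", "content": 'Review: "Absolutely loved it."'},
    {"role": "assistant", "content": "Positive"},
    # User message with the actual task.
    {"role": "user",
     "content": 'Review: "The movie was slow but ultimately rewarding."'},
]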

Prompt Sensitivity

A consistent finding is that prompts are brittle: small changes to wording, example order, or format can substantially change model outputs. Sources of sensitivity:

  • Surface form. “Classify as positive or negative” vs. “Is this review good or bad?” may yield different results even for the same underlying task.
  • Example order. In few-shot settings, the order of demonstrations affects accuracy, sometimes by large margins.
  • Label verbalization. Whether labels are presented as “Positive/Negative”, “1/0”, or “Yes/No” affects performance.

This brittleness means that effective prompt engineering is partly empirical.
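
One practical consequence is that no single wording should be trusted. A minimal sketch that measures accuracy across paraphrased templates, assuming a hypothetical complete call and a small labeled validation set; the two templates are illustrative.

def complete(prompt: str) -> str:  # hypothetical model call
    raise NotImplementedError

# Two surface forms of the same task.
TEMPLATES = [
    'Classify the sentiment of this review as Positive or Negative.\nReview: "{x}"\nSentiment:',
    'Is the following review good or bad? Answer Positive or Negative.\nReview: "{x}"\nAnswer:',
]

def accuracy(template: str, data: list[tuple[str, str]]) -> float:
    hits = sum(complete(template.format(x=x)).strip().startswith(y)
               for x, y in data)
    return hits / len(data)

# The spread of accuracies across TEMPLATES indicates how brittle the setup is.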

Chain-of-Thought Prompting

When tasks require multi-step reasoning, providing demonstrations that include reasoning chains — intermediate steps before the final answer — substantially improves accuracy (Wei et al. 2022). This is covered in the chain-of-thought article.
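
For example, a single chain-of-thought demonstration replaces a bare answer with the intermediate steps (adapted from Wei et al. 2022):

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more, how many apples do they have?
A: The cafeteria started with 23 apples. They used 20, leaving 23 − 20 = 3. They bought 6 more, so 3 + 6 = 9. The answer is 9.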

Prompt Engineering in Practice

Effective prompts are developed iteratively (a minimal evaluation loop is sketched after these steps):

  1. Start with a clear task description and a small set of examples.
  2. Evaluate on a validation set.
  3. Identify failure modes and refine the description or examples.
  4. Test robustness to wording variations.
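
A sketch of steps 2 and 3, assuming a hypothetical complete call: score a candidate template on labeled validation data and collect the failures so they can be inspected by hand.

def complete(prompt: str) -> str:  # hypothetical model call
    raise NotImplementedError

def evaluate(template: str, data: list[tuple[str, str]]):
    """Return (accuracy, failures); the failures drive the next refinement."""
    failures = []
    for x, y in data:
        pred = complete(template.format(x=x)).strip()
        if not pred.startswith(y):
            failures.append((x, y, pred))
    return 1 - len(failures) / len(data), failures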

Automated prompt optimization — searching for a prompt that maximizes validation accuracy — is an active research area, with approaches ranging from gradient-based methods (when model internals are accessible) to using a separate language model to propose and refine prompts.
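
A minimal sketch of the second family of approaches: ask a model to paraphrase the current best instruction and keep whichever variant scores highest on validation data. Here complete is again a hypothetical model call, and score is any function mapping a template and data to validation accuracy (e.g. the first element returned by evaluate above).

def complete(prompt: str) -> str:  # hypothetical model call
    raise NotImplementedError

def propose_variants(instruction: str, k: int = 4) -> list[str]:
    # Use the model itself to paraphrase the instruction, one variant per line.
    meta = (f"Rewrite the following task instruction in {k} different ways, "
            f"one per line, preserving its meaning:\n{instruction}")
    return [v for v in complete(meta).strip().splitlines() if v]

def optimize(instruction: str, data, score, rounds: int = 3) -> str:
    best, best_score = instruction, score(instruction, data)
    for _ in range(rounds):
        for cand in propose_variants(best):
            s = score(cand, data)
            if s > best_score:
                best, best_score = cand, s
    return best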

References

Brown, Tom B., Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems (NeurIPS), 1877–1901. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.