AI Evaluation
Motivation
Every empirical claim about an AI system rests on an evaluation: a choice of task, metric, and test protocol. Without reliable evaluation there is no way to know whether a system works, to compare competing approaches, or to measure progress over time. Yet designing a good evaluation is itself a hard problem — the metric must capture the real goal, the test data must reflect the real distribution, and the protocol must resist gaming. Getting any of these wrong can make a weak system look strong or a strong system look weak.
Why Evaluation Is Hard
Evaluating an AI system requires defining what “good” means — and this is rarely simple. The definition must be precise enough to measure, broad enough to capture genuine capability, and robust enough that scoring well means actually solving the problem. These requirements are frequently in tension.
The Turing Test
The most famous evaluation criterion for general AI is the Turing Test (Turing 1950). A human judge conducts text conversations with both a human and a machine, without knowing which is which. If the judge cannot reliably distinguish the machine from the human, the machine passes.
The test has historical importance but well-known limitations:
- It tests mimicry more than understanding. A system can pass by exploiting social conventions without representing the world.
- It is anthropocentric: intelligent behavior is defined relative to human behavior, not relative to correct task performance.
- Different judges produce different results; the test is not reproducible.
- A system optimized to pass the Turing Test may be quite different from a system that is genuinely useful.
Benchmark Evaluation
The dominant evaluation paradigm uses held-out test sets with ground-truth labels:
- Define a task (e.g., image classification, reading comprehension, machine translation).
- Collect or construct a dataset of (input, correct output) pairs.
- Measure performance — accuracy, F1, BLEU score, etc. — on examples the system never saw during training.
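A minimal sketch of this protocol, using scikit-learn and its bundled digits dataset as a stand-in task (the dataset, model, and metric here are purely illustrative, not part of any standard benchmark):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in for a real benchmark: inputs paired with ground-truth labels.
X, y = load_digits(return_X_y=True)

# Hold out a test set the system never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train only on the training split, then score on the held-out examples.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The same three steps apply whether the system is a linear classifier or a large language model; only the dataset, model, and metric change.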
Strengths: reproducible, cheap to run, comparable across systems.
Weaknesses:
- Benchmark saturation. Systems eventually match or exceed human performance on a benchmark without matching human capability on the underlying task. ImageNet accuracy reached human-level in 2015; general image understanding remained far harder.
- Distribution shift. Test examples may not reflect real deployment conditions. A model achieving 95% on the test set may fail in production.
- Goodhart’s Law. Once a benchmark becomes a widely used target, the incentive to “teach to the test” creates systems that score well while solving the problem only superficially.
- Data contamination. Large language models trained on internet text may have seen benchmark examples during pretraining, making test-set performance an overestimate.
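Contamination is hard to rule out completely, but a common first-pass heuristic is to search for long verbatim word n-gram overlaps between the training corpus and the test items. A rough sketch, assuming both fit in memory as lists of strings (the 13-gram threshold is one commonly cited choice, not a standard):

```python
# Coarse contamination check: flag test items sharing a long word n-gram with
# the training text. A heuristic sketch, not a complete methodology.
def word_ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(train_texts, test_items, n=13):
    train_grams = set()
    for doc in train_texts:
        train_grams |= word_ngrams(doc, n)
    return [item for item in test_items if word_ngrams(item, n) & train_grams]

# Any flagged item likely appeared verbatim in the training data and should be
# excluded, or at least reported, when interpreting the benchmark score.
```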
Task-Specific Metrics
Different tasks use different metrics:
| Task | Common metric |
|---|---|
| Classification | Accuracy, precision/recall/F1, AUC |
| Regression | MSE, MAE |
| Machine translation | BLEU, chrF, COMET |
| Text generation | Perplexity, human preference ratings |
| Structured prediction | Exact match, span F1 |
| Reinforcement learning | Cumulative reward, win rate |
Automated metrics are cheap but imperfect proxies. Human evaluation — having people rate or compare outputs — is more valid but expensive and slow. Best practice is to use both.
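To illustrate how cheap the automated side is, here is a sketch computing a few of the classification metrics from the table with scikit-learn (the labels and predictions below are toy values):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth labels and system predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"accuracy {accuracy_score(y_true, y_pred):.2f}, "
      f"precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
```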
Evaluation Design Principles
Good evaluations share several properties:
- Construct validity. The metric measures the underlying capability of interest, not a correlated proxy.
- Reliability. Repeated evaluations of the same system yield similar scores; a simple check is sketched after this list.
- Difficulty calibration. The task discriminates among systems — not so hard that all systems fail, not so easy that all pass.
- Adversarial robustness. Changing wording, order, or surface form should not drastically alter scores for a truly capable system.
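Reliability is the easiest of these to quantify: resampling the per-example scores gives a confidence interval on the headline number, so that small gaps between systems are not over-read. A sketch using a plain bootstrap (NumPy only; the per-example scores are simulated):

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for a mean benchmark score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Simulated 0/1 correctness for 500 test items, purely for illustration.
per_item = np.random.default_rng(1).integers(0, 2, size=500)
mean, (lo, hi) = bootstrap_ci(per_item)
print(f"accuracy {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```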
Modern practice increasingly combines automatic metrics, human evaluation, and capability elicitation — deliberately searching for inputs that reveal a system’s failure modes before deployment.
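One concrete form of such elicitation, tied to the adversarial-robustness criterion above, is to re-score the same items under harmless surface changes, for instance shuffling the order of multiple-choice options, and see how much the score moves. A sketch, where `ask` stands for a hypothetical wrapper around the system under test:

```python
import itertools
import statistics

def option_order_sensitivity(ask, question, options, correct_idx):
    """Score one multiple-choice item under every ordering of its options.

    `ask(question, options)` is a hypothetical callable returning the index of
    the option the system picks from the list it was shown.
    """
    hits = []
    for perm in itertools.permutations(range(len(options))):
        shown = [options[i] for i in perm]
        choice = ask(question, shown)
        hits.append(1.0 if perm[choice] == correct_idx else 0.0)
    # The mean is order-robust accuracy on this item; the spread reveals
    # sensitivity to a change that should not matter to a capable system.
    return statistics.mean(hits), statistics.pstdev(hits)
```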