Evaluation Methods and AI as a Judge - AI Engineering: Building Applications with Foundation Models

Key Principle

Foundation model evaluation requires a taxonomy of methods because no single method covers the full quality surface. This section covers four.

(1) Perplexity measures prediction uncertainty as the exponential of cross entropy. It is useful for data contamination detection and deduplication, but breaks as a comparative metric once post-training is applied, because SFT and RLHF narrow the model's distribution and increase perplexity on general-corpus text even as task performance improves.

(2) Functional correctness is "the ultimate metric for evaluating the performance of any application, as it measures whether your application does what it's intended to do" (Chapter 3). It executes outputs against specifications rather than comparing surface form.

(3) AI-as-judge is fast, cheap, and reference-free; GPT-4 agrees with human raters 85% of the time on MT-Bench, exceeding human-human agreement of 81% (Zheng et al., 2023, Chapter 3). Three AI judge biases are documented: self-bias (GPT-4 favors its own outputs by +10% win rate; Claude-v1 by +25%), position bias (favors the first answer in a list), and verbosity bias (prefers longer responses even when they are factually wrong). Judge drift, meaning a change to the model or prompt mid-project, mutates the metric without changing the application, making score trends uninterpretable.

(4) Comparative/pairwise evaluation uses rating algorithms (Bradley-Terry, Elo) to rank models from head-to-head outcomes; it "will never get saturated as long as newer, stronger models are introduced" (Chapter 3), making it the structural solution to benchmark saturation.

Why This Matters

Evaluation is harder for foundation models than for traditional ML because the space of valid responses is unbounded and there is no fixed label to compare against. Four compounding factors amplify this: open-ended outputs, black-box API access with no logit introspection, higher model intelligence raising the bar for evaluators, and benchmark saturation (GLUE saturated within one year of release). Classical ML evaluation assumed a closed label set; foundation model evaluation must cope with the absence of one. The stakes are not merely academic: real failures include a suicide that followed a chatbot's encouragement, lawyers submitting AI-hallucinated citations to a court, and an airline held liable for its chatbot's misinformation (Chapter 3).

If you use perplexity to compare a base model against its aligned (post-trained) variant, you will observe higher perplexity for the aligned model and incorrectly conclude it is worse. The mechanism is the opposite of what that conclusion implies: post-training improves task performance by narrowing the model's distribution toward preferred patterns, which increases its measured surprise on general-corpus text. If you change your AI judge mid-project — by upgrading the underlying model or revising the prompt — you produce a different judge, and any score change gets misattributed to application quality rather than metric drift. Teams must version-lock both model and prompt to maintain a valid longitudinal signal.

Good Examples

Functional correctness for code evaluation. Instead of computing a BLEU score between a generated function and a reference solution, execute the generated code against a test suite. The pass@k metric (Chen et al., OpenAI, 2021) generates k samples per problem and reports the fraction of problems for which at least one sample passes all tests. On HumanEval, BLEU scores for correct and incorrect solutions are too similar to tell them apart; executing the code against tests does detect the difference (Chapter 3).
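The execute-and-test idea can be sketched in a few lines of Python. This is a minimal illustration, not a production harness: `passes_tests`, the toy `add` problem, and the test cases are hypothetical, and `exec` on generated code is only acceptable here because the samples are hard-coded. The `pass_at_k` function follows the unbiased estimator from Chen et al. (2021), which assumes n samples were drawn per problem, c of them passed, and n is at least k.

```python
from math import comb


def passes_tests(source: str, func_name: str, cases) -> bool:
    """Run one generated sample against (args, expected) test cases."""
    namespace = {}
    try:
        # Unsafe for untrusted code; a real harness would sandbox this step.
        exec(source, namespace)
        candidate = namespace[func_name]
        return all(candidate(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two hypothetical generated samples for the same problem: one right, one wrong.
samples = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
cases = [((1, 2), 3), ((0, 0), 0)]

num_correct = sum(passes_tests(s, "add", cases) for s in samples)
print(num_correct, pass_at_k(n=len(samples), c=num_correct, k=1))
```

The benchmark-level score is the mean of `pass_at_k` over all problems. The two samples differ by a single character, which is the kind of pair a surface-overlap metric struggles to separate and a test suite separates at once.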

Designing a reliable AI judge prompt. Use discrete classification rather than continuous scoring (e.g., a 1–5 scale rather than 0.0–1.0). Include few-shot examples in the prompt: adding examples raised GPT-4 judge consistency from 65% to 77.5% (Zheng et al., 2023, Chapter 3). Version-lock the model identifier and the full prompt text. For pairwise comparisons, run each pair in both orderings and average results to cancel position bias.
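Running each pair in both orderings can be sketched as below. The judge is modeled as a plain callable so the example stays self-contained; its signature and the verdict labels `"first"`, `"second"`, and `"tie"` are assumptions for illustration, not any particular library's API.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> verdict,
# where verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]

POINTS_FOR_FIRST = {"first": 1.0, "tie": 0.5, "second": 0.0}


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Return answer A's score in [0, 1], averaged over both presentation orders."""
    a_shown_first = POINTS_FOR_FIRST[judge(question, answer_a, answer_b)]
    # With A shown second, a "second" verdict is a win for A, so flip the score.
    a_shown_second = 1.0 - POINTS_FOR_FIRST[judge(question, answer_b, answer_a)]
    return (a_shown_first + a_shown_second) / 2


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: it always picks the first answer."""
    return "first"


print(debiased_preference(always_first, "q", "answer A", "answer B"))
```

A judge driven only by position scores every pair at 0.5 under this scheme, so its bias cancels instead of leaking into the result. A score near 0.5 can therefore mean either a genuine tie or an inconsistent judge; logging both raw verdicts keeps the two cases distinguishable.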

Pairwise ranking to avoid benchmark saturation. Rather than scoring models against a fixed benchmark that will eventually be saturated, present model outputs side-by-side and record win/loss outcomes. Apply Bradley-Terry or TrueSkill to derive a leaderboard. LMSYS Chatbot Arena collected 244,000 such comparisons across 57 models (January 2024, Chapter 3). New models can always be added; the ranking never hits a ceiling.
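As a rough sketch of how win/loss records become a ranking, the following fits Bradley-Terry strengths with a simple iterative (minorization-maximization) update in pure Python. The function name, the `(winner, loser)` input format, and the fixed iteration count are assumptions for illustration; ties are not handled, and this is not the procedure any specific leaderboard is known to use.

```python
from collections import defaultdict


def bradley_terry(outcomes, iterations: int = 200) -> dict:
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Under the model, P(i beats j) = strength[i] / (strength[i] + strength[j]).
    Returned strengths are normalized to sum to 1.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in outcomes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    if not models:
        return {}

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = games.get(frozenset((i, j)), 0.0) if j != i else 0.0
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical head-to-head record: model A wins 8 of 10 comparisons against B.
record = [("A", "B")] * 8 + [("B", "A")] * 2
ranking = sorted(bradley_terry(record).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Adding a new model only requires appending its comparisons and refitting, which is why this style of ranking has no fixed ceiling. A model with no wins converges to zero strength, so real leaderboards typically add regularization or confidence intervals.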

Counterpoints

Using BLEU or ROUGE on generative tasks. These n-gram overlap metrics assume a narrow reference distribution. For code, BLEU scores on correct and incorrect solutions are indistinguishable (Chapter 3). For open-ended generation where surface-form variation is the norm, BLEU penalizes valid paraphrases. Using BLEU as the primary metric for generative tasks produces scores that can be maximized while functional quality collapses.

Naively comparing perplexity of base vs. aligned models. Post-training (SFT, RLHF) typically increases perplexity on general-corpus text while improving task performance. Ranking model variants by perplexity after post-training conflates two incompatible objectives and produces misleading conclusions — the aligned model will appear worse by this metric even when it is strictly better for the application (Chapter 3).
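For reference, the quantity being compared is just the exponential of the average negative log-likelihood per token. A minimal sketch, assuming per-token natural-log probabilities are already available from the model (the function name and the toy inputs are hypothetical):

```python
import math


def perplexity(token_logprobs) -> float:
    """Perplexity = exp(cross entropy), with cross entropy as the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(cross_entropy)


# A model that spreads probability uniformly over 4 choices at every step
# is exactly as uncertain as a 4-way guess.
uniform_over_four = [math.log(0.25)] * 4
print(perplexity(uniform_over_four))
```

Because the number depends only on how much probability the model assigns to the evaluation text, a post-trained model that has shifted probability mass toward preferred response styles scores worse on general-corpus text without being worse at the task.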

Changing the AI judge mid-project without recalibrating baselines. An AI judge is a system comprising model + prompt + sampling parameters. Any change to any component produces a different judge (Chapter 3). Teams that upgrade their judge model or revise the judge prompt without re-scoring historical baselines will observe spurious score trends — improvements or regressions that reflect the metric change, not application quality. All prior scores become incomparable to new scores.
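One way to make judge changes visible is to fingerprint the full judge specification and store that fingerprint with every score. The sketch below is an assumption about how a team might do this, not a method from the source: `JudgeSpec`, `comparable`, the field names, and the model identifier are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeSpec:
    """Everything that defines a judge: model, prompt, and sampling parameters."""
    model_id: str
    prompt_template: str
    temperature: float = 0.0

    def fingerprint(self) -> str:
        # Serialize deterministically so identical specs always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(score_a: dict, score_b: dict) -> bool:
    """Two scores belong on the same trend line only if the same judge produced them."""
    return score_a["judge"] == score_b["judge"]


# Hypothetical judge versions: only the prompt wording differs.
v1 = JudgeSpec("example-judge-model-001", "Rate the answer from 1 to 5.")
v2 = JudgeSpec("example-judge-model-001", "Rate the answer from 1 to 5. Be strict.")

baseline = {"value": 3.8, "judge": v1.fingerprint()}
latest = {"value": 4.1, "judge": v2.fingerprint()}
print(comparable(baseline, latest))
```

When the fingerprints differ, the options are to re-score the historical baseline with the new judge or to start a new trend line; plotting both scores on one axis is the error the paragraph above describes.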

Key Quotes

"Without a way to quality control AI outputs, the risk of AI might outweigh its benefits for many applications." (Chapter 3)

"Functional correctness is the ultimate metric for evaluating the performance of any application, as it measures whether your application does what it's intended to do." (Chapter 3)

"An AI judge is not just a model — it's a system that includes both a model and a prompt. Altering the model, the prompt, or the model's sampling parameters results in a different judge." (Chapter 3)

"Comparative evaluations will never get saturated as long as newer, stronger models are introduced." (Chapter 3)

Evaluation Method Selection

The tier to use is determined by what can be verified, not by preference (Chapter 3):

Condition	Method
Output is executable or testable	Functional correctness
High-quality reference data exists	Lexical or semantic similarity
No reference; quality is multidimensional	AI as a judge (versioned spec required)
Relative ranking across model variants	Comparative / pairwise

Reference data is a liability, not an asset, unless actively maintained. Adept's Fuyu model was penalized for valid captions absent from reference data; WMT 2023 found many bad reference translations, making reference-free metrics competitive. Cross-tool metric comparisons (MLflow vs. Ragas vs. LlamaIndex "faithfulness") are meaningless without verifying the underlying judge specification — each uses different prompts and incompatible scoring scales (Chapter 3).

Rules of Thumb

Never use perplexity to compare a base model against a post-trained variant; post-training increases perplexity even when it improves task performance.
Use functional correctness (execute and test) whenever outputs are executable; surface-form metrics (BLEU, ROUGE) are insufficient for code and structured outputs.
Version-lock both the model and the prompt of any AI judge; changing either mid-project makes historical scores incomparable.
Mitigate position bias by running pairwise comparisons in both orderings; mitigate verbosity bias by controlling for response length in judge prompts.
When benchmarks saturate, switch to comparative/pairwise evaluation — it scales with model capability and never hits a ceiling.
Do not cross-compare "faithfulness" or other named metrics across evaluation frameworks without inspecting the underlying judge prompt and scoring scale.

Related References

Evaluation-Driven Development - Operationalizing these methods into a system
Foundation Model Internals - Why post-training breaks perplexity as a metric