Evaluation-Driven Development - AI Engineering: Building Applications with Foundation Models

Key Principle

Evaluation-driven development means defining what "good" looks like before writing application code. It is analogous to test-driven development, where acceptance criteria are made explicit before implementation so that development is directed rather than exploratory. Evaluation criteria defined early become the scoring rubrics and guidelines later used to annotate finetuning data, so the early investment pays off twice (Chapter 4).
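One way to make "good" explicit before any application code exists is to write the criteria down as data plus checks, much like a failing test in test-driven development. The sketch below is illustrative only: the `Criterion` class, the two example criteria, and the 50-word limit are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One evaluation criterion (hypothetical structure)."""
    name: str
    description: str  # wording that could later be reused as an annotation guideline
    check: Callable[[str], bool]


# Hypothetical criteria for a short-summary feature, written before the feature itself.
CRITERIA = [
    Criterion("non_empty", "The response contains text.",
              lambda out: bool(out.strip())),
    Criterion("max_50_words", "The response is at most 50 words.",
              lambda out: len(out.split()) <= 50),
]


def score(output: str) -> dict[str, bool]:
    """Return a pass/fail result for every criterion."""
    return {c.name: c.check(output) for c in CRITERIA}


def passes(output: str) -> bool:
    """An output is 'good' only if it satisfies every criterion."""
    return all(score(output).values())
```

Because each criterion carries a plain-language description, the same list can double as the starting point for annotation guidelines.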

Model selection follows a four-bucket framework: (1) domain-specific capability — can the model do the task at all; (2) generation capability — is the output high-quality and safe; (3) instruction-following capability — does the model respect format, length, style, and content constraints; (4) cost and latency — is the model fast and cheap enough. All four buckets must be tested because a model can pass domain-capability tests and still be useless if it cannot follow format instructions (Chapter 4).
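The "all four buckets must pass" rule can be sketched as a simple gate. This is a minimal illustration, assuming each bucket has already been reduced to a pass/fail result by some upstream evaluation; the `BucketResults` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BucketResults:
    """Pass/fail outcome per bucket for one candidate model (hypothetical)."""
    domain_capability: bool
    generation_quality: bool
    instruction_following: bool
    cost_and_latency: bool


def failed_buckets(results: BucketResults) -> list[str]:
    """List the buckets the candidate failed, in declaration order."""
    return [name for name, ok in vars(results).items() if not ok]


def is_viable(results: BucketResults) -> bool:
    """A candidate is viable only if no bucket failed."""
    return not failed_buckets(results)
```

A candidate that passes domain capability but fails instruction following is reported as not viable, with the failing bucket named explicitly.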

When selecting models, hard attributes (license terms, model size, training data, privacy policy, on-device requirement) cannot be changed through adaptation and must be filtered on first. Soft attributes (accuracy, toxicity, factual consistency) can be improved and are optimized afterward. Latency is conditionally hard or soft: soft when you control the deployment stack, hard when you use a hosted API you cannot modify (Chapter 4). Data contamination, where benchmark questions appear in a model's training data, inflates public benchmark scores through memorization rather than capability, which makes those scores unreliable for direct model selection (Chapter 4).

Why This Matters

The "lamppost problem" in AI evaluation describes a failure mode where teams restrict themselves to easily evaluable problems and thereby risk excluding the highest-value applications. The correct discipline is to work at making hard-to-measure things measurable rather than avoid them. The inverse failure, deploying without any evaluation, is equally dangerous: a deployed system that cannot be evaluated is worse than no system at all, because it costs money to maintain while nobody can tell whether it works, and taking it down may cost even more (Chapter 4). Evaluation-driven development addresses both failures by forcing the measurement problem to be confronted before code is written, not after.

The ordering of hard-before-soft attribute filtering matters because confusing a hard attribute for a soft one wastes optimization resources, while confusing a soft attribute for a hard one forecloses improvement opportunities entirely. If a candidate model's license terms prohibit commercial use, no amount of prompt tuning or finetuning resolves that — the candidate must be eliminated before any experimentation begins. Soft attributes like accuracy and toxicity, by contrast, are precisely the attributes that can be improved through adaptation, so they should be evaluated only after the hard-attribute filter has narrowed the candidate pool (Chapter 4).

Good Examples

Applying the four-bucket framework to a legal document assistant: First test domain capability — can the model perform legal reasoning and cite relevant statutes correctly? Then generation quality — is the output factually consistent and free of hallucinated case references? Then instruction-following — does it respect requested output format (e.g., numbered clauses, maximum length)? Finally cost and latency — does response time meet the workflow's threshold? Failing to test all four buckets risks deploying a model that passes domain tests but silently fails on format compliance (Chapter 4).
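The instruction-following bucket in this example lends itself to an automatic check. The sketch below assumes the requested format is "every non-empty line is a consecutively numbered clause (1., 2., ...)" plus a word limit; that exact format, the function name `follows_format`, and the `max_words` parameter are assumptions made for illustration.

```python
import re


def follows_format(output: str, max_words: int) -> bool:
    """Check numbered-clause format and a maximum word count (hypothetical rules)."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return False
    # Each non-empty line must start with the next clause number: "1. ", "2. ", ...
    for expected, line in enumerate(lines, start=1):
        match = re.match(r"(\d+)\.\s+\S", line)
        if match is None or int(match.group(1)) != expected:
            return False
    # Whitespace-separated tokens serve as a rough word count.
    return len(output.split()) <= max_words
```

A check like this catches the silent failure described above: an answer can be legally sound and still be rejected because it ignores the requested structure or length.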

Filtering hard attributes before running a private evaluation pipeline: A team building an on-device mobile assistant immediately eliminates all hosted-API models (hard attribute: on-device requirement) before any benchmark comparison. Among remaining open-weight candidates, they eliminate those with licenses prohibiting commercial use. Only then do they run benchmark screening and private evaluation on the reduced candidate pool — saving the cost of evaluating models that could never be deployed (Chapter 4).
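The filtering order in this example can be sketched as follows. The `Candidate` fields and the model names used in the usage note are hypothetical; a real pipeline would read these attributes from license texts and deployment constraints rather than from hand-set booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate model described only by its hard attributes (hypothetical)."""
    name: str
    runs_on_device: bool
    commercial_use_allowed: bool


def filter_hard_attributes(
    candidates: list[Candidate],
) -> tuple[list[Candidate], dict[str, str]]:
    """Split candidates into survivors and eliminated models with a reason."""
    kept: list[Candidate] = []
    eliminated: dict[str, str] = {}
    for c in candidates:
        if not c.runs_on_device:
            eliminated[c.name] = "cannot run on device"
        elif not c.commercial_use_allowed:
            eliminated[c.name] = "license prohibits commercial use"
        else:
            kept.append(c)
    return kept, eliminated
```

For instance, given a hosted-API model, an open-weight model with a non-commercial license, and an open-weight model with a permissive license, only the last one would reach benchmark screening and private evaluation.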

Detecting data contamination via perplexity anomaly: Brown et al. (2020) found that 13 benchmarks had at least 40% data overlap with GPT-3 training data. Rylan Schaeffer's 2023 satirical paper "Pretraining on the Test Set Is All You Need" illustrated the mechanism: a 1M-parameter model trained exclusively on benchmark data achieved near-perfect scores, outperforming far larger models. A perplexity anomaly, meaning unexpectedly low perplexity on benchmark examples relative to similar held-out text, is a signal that the benchmark may have been in the model's training data (Chapter 4).
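The perplexity comparison can be sketched as below. It assumes per-token natural-log probabilities are already available for each example (how they are obtained depends on the model and is not shown), and the 0.5 ratio threshold is an arbitrary illustrative value, not one taken from the source.

```python
import math
from statistics import mean


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of one text: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-mean(token_logprobs))


def looks_contaminated(
    benchmark_logprobs: list[list[float]],
    reference_logprobs: list[list[float]],
    ratio_threshold: float = 0.5,  # hypothetical cutoff; tune on your own data
) -> bool:
    """Flag a benchmark whose mean perplexity is far below that of similar held-out text."""
    bench_ppl = mean(perplexity(lp) for lp in benchmark_logprobs)
    ref_ppl = mean(perplexity(lp) for lp in reference_logprobs)
    return bench_ppl / ref_ppl < ratio_threshold
```

A flag from a check like this is only a signal worth investigating, not proof of contamination: benchmark text can also be genuinely easier to predict than the reference text chosen for comparison.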

Counterpoints

Evaluating soft attributes before filtering hard attributes wastes evaluation budget and can produce a misleading recommendation. If a team identifies the most accurate model through extensive private evaluation and then discovers its license prohibits their use case, the evaluation work is entirely lost. Hard attributes are the correct first filter precisely because they are binary and cheap to check (Chapter 4).

Building evaluation after the product invites the lamppost problem: optimizing what is easy to measure rather than what matters. Teams that define evaluation criteria post hoc tend to retrofit metrics to the outputs they already have, rather than measuring whether the system achieves the application's actual goals. The causal chain should run the other way: clear evaluation criteria drive measurable iteration loops, which in turn drive reliable production systems (Chapter 4).

Using a single benchmark score instead of the four-bucket framework conflates distinct capability dimensions. A model that scores well on domain-capability benchmarks may still fail to follow format instructions or meet latency requirements. Additionally, benchmark correlation analysis (Galambosi, January 2024) shows that WinoGrande, MMLU, and ARC-C correlate at 0.89–0.90, meaning a composite score dominated by those benchmarks over-weights a single reasoning dimension and under-weights orthogonal capabilities like truthfulness, which has only moderate correlation (0.42–0.55) with reasoning benchmarks (Chapter 4).
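The correlation analysis described here can be reproduced in outline with a plain Pearson coefficient computed across models. The sketch below is not the cited analysis: the function names and the 0.85 cutoff are hypothetical, and it expects one score list per benchmark with models in the same order in every list.

```python
import math
from itertools import combinations


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(
    scores: dict[str, list[float]],
    threshold: float = 0.85,  # hypothetical cutoff for "measures nearly the same thing"
) -> list[tuple[str, str]]:
    """Return benchmark pairs whose scores across models correlate at or above threshold."""
    return [
        (a, b)
        for a, b in combinations(scores, 2)
        if pearson(scores[a], scores[b]) >= threshold
    ]
```

Pairs returned by `redundant_pairs` are candidates for down-weighting in a composite score, so that one reasoning dimension does not dominate the ranking.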

Key Quotes

"The applications most reliably in production — recommender systems, fraud detection, code generation, classification — share one property: obvious, measurable evaluation criteria." (Chapter 4)

"Hard attributes should eliminate candidates before any experimentation begins." (Chapter 4)

"These same guidelines become annotation guidelines for finetuning data in Ch. 8, making early investment doubly valuable." (Chapter 4)

"Public benchmark comparisons are unreliable for model selection. They are useful only for coarse filtering — screening a large candidate pool down to a small one before private evaluation." (Chapter 4)

Rules of Thumb

Define evaluation criteria before writing application code; treat undefined "good" as a blocking requirement, not a later task.
Filter hard attributes (license, size, privacy, on-device) first — any candidate that fails a hard attribute is eliminated before benchmarking begins.
Use the four-bucket framework as a checklist; a model must pass all four buckets, not just the domain-capability bucket.
Treat public benchmark scores as a coarse filter only; run a private evaluation pipeline on your actual data before making a final selection.
Invest in evaluation guidelines early — the same rubrics that define "good" for evaluation will become annotation guidelines for finetuning, compounding their value.

Related References

Evaluation Methods and AI as a Judge - The evaluation methods used in each bucket
Dataset Engineering and Data Quality - Evaluation guidelines = annotation guidelines
The Three-Axis Model and AI Engineering Discipline - Evaluation is the entry point to all three axes