AI Engineering: Building Applications with Foundation Models

Three-Axis Quality Model: Response quality = f(instructions, context, model). Optimize in that order: each step is roughly 10× more expensive than the one before it.
Evaluation first: Define evaluation criteria before writing any application code. Evaluation guidelines become finetuning annotation guidelines — early investment is doubly leveraged.
Failure-type routing: Information failures (wrong facts) → RAG. Behavior failures (wrong form/style/format) → finetuning. Misrouting wastes both cost and quality.
Data is the moat: In a world of converging model architectures, proprietary user feedback data — not model quality — is the primary long-term competitive differentiator.
Goodput over throughput: Optimize for the number of requests per second that satisfy their SLOs, not for raw throughput or GPU utilization.
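
The difference between throughput and goodput can be made concrete with a minimal sketch. It assumes a single end-to-end latency SLO per request; the names RequestRecord, latency_s, and max_latency_s are hypothetical and chosen only for illustration. A real deployment would likely track several SLOs at once.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """One completed request (hypothetical record shape)."""
    latency_s: float  # end-to-end latency in seconds


def throughput(records: Sequence[RequestRecord], window_s: float) -> float:
    """All completed requests per second, regardless of latency."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return len(records) / window_s


def goodput(records: Sequence[RequestRecord], window_s: float,
            max_latency_s: float) -> float:
    """Requests per second that met the latency SLO (assumed single SLO)."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    within_slo = sum(1 for r in records if r.latency_s <= max_latency_s)
    return within_slo / window_s


# Four requests observed in a 2-second window, with a 1-second latency SLO.
observed = [RequestRecord(0.2), RequestRecord(0.5),
            RequestRecord(1.5), RequestRecord(3.0)]
print(throughput(observed, window_s=2.0))                  # 2.0
print(goodput(observed, window_s=2.0, max_latency_s=1.0))  # 1.0
```

Here the system completes 2 requests per second, but only 1 request per second meets the SLO, so tuning that raises the first number without raising the second does not help users.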

Situation	Do This	Avoid This
Output quality is poor	Exhaust prompt engineering first	Jump to finetuning
Model gives wrong facts	Add RAG (information failure)	Finetune to add knowledge
Model uses wrong style/format	Finetune (behavior failure)	Add more RAG chunks
Evaluating model quality	Use functional correctness	Use BLEU/ROUGE on generative tasks
Comparing base vs. aligned models	Use task-specific metrics	Compare perplexity (breaks on aligned models)
Selecting a model	Filter hard attributes first	Evaluate soft attributes before filtering
Building agent systems	Scope autonomy to measured reliability	Grant write-actions before measuring accuracy
Optimizing inference	Define SLOs, then optimize goodput	Maximize GPU utilization
Synthetic training data	Mix with a guaranteed minimum share of real data	Train recursively on synthetic only
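
The "Selecting a model" row can be sketched as a two-stage procedure: discard candidates that fail a hard requirement, then rank the survivors by a measured score. The attributes used here (license_ok, supports_on_prem, eval_score) are assumptions chosen for illustration, not a list taken from the book; substitute whatever constraints are non-negotiable for your application.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A model under consideration (hypothetical fields)."""
    name: str
    license_ok: bool        # hard attribute: license permits our use
    supports_on_prem: bool  # hard attribute: can run on our own hardware
    eval_score: float       # soft attribute: score on our own evaluation set


def select_model(candidates: Sequence[Candidate],
                 require_on_prem: bool) -> Optional[Candidate]:
    """Filter on hard attributes first, then pick the best soft score."""
    shortlisted = [
        c for c in candidates
        if c.license_ok and (c.supports_on_prem or not require_on_prem)
    ]
    if not shortlisted:
        return None  # no candidate satisfies the hard requirements
    return max(shortlisted, key=lambda c: c.eval_score)


candidates = [
    Candidate("model-a", license_ok=False, supports_on_prem=True, eval_score=0.95),
    Candidate("model-b", license_ok=True, supports_on_prem=True, eval_score=0.80),
    Candidate("model-c", license_ok=True, supports_on_prem=False, eval_score=0.90),
]
best = select_model(candidates, require_on_prem=True)
print(best.name if best else "no viable model")  # model-b
```

Filtering first means evaluation effort is spent only on models that could actually be deployed: model-a has the highest score but is never a real option.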

"The three-axis model tells you not just what to do, but in what order to do it — and why skipping axes is expensive." — Chip Huyen, Preface

Overview