AI Engineering Implementation Playbook - AI Engineering: Building Applications with Foundation Models

Key Principle

Build in cost order: exhaust cheaper axes before advancing to more expensive ones. The sequence is not arbitrary; each phase is roughly an order of magnitude more expensive than the previous one.

Phase 0: Define evaluation criteria first (before any code)
Write your evaluation suite before writing application logic.
Trigger to advance: you have measurable pass/fail criteria and a baseline.
What NOT to do: write application code before you can measure success.
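
As a minimal sketch of what "measurable pass/fail criteria and a baseline" could look like, the snippet below defines evaluation cases as functional checks on the output and computes a pass rate. The names `EvalCase`, `pass_rate`, and `baseline_generate`, and the refund-policy cases themselves, are hypothetical illustrations and not taken from the book; `baseline_generate` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One pass/fail criterion: an input prompt plus a check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def pass_rate(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Return the fraction of cases whose generated output passes its check."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    passed = sum(1 for case in cases if case.check(generate(case.prompt)))
    return passed / len(cases)


# Hypothetical criteria for an illustrative refund-policy assistant.
CASES = [
    EvalCase(
        name="mentions_return_window",
        prompt="How long do I have to return an item?",
        check=lambda out: "30 days" in out,
    ),
    EvalCase(
        name="is_concise",
        prompt="Summarize the refund policy.",
        check=lambda out: len(out.split()) <= 50,
    ),
]


def baseline_generate(prompt: str) -> str:
    """Stand-in for the system under test; replace with a real model call."""
    return "You can return items within 30 days of delivery."


# Record the baseline before any application logic is written.
baseline = pass_rate(CASES, baseline_generate)
```

Because each check tests whether the output does its job rather than whether it matches a reference string, the same cases can be rerun unchanged after every prompt, context, or model change.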

Phase 1: Prompt engineering - exhaust this axis first
Adjust instructions, few-shot examples, chain-of-thought, and system prompts.
Trigger to advance: you have systematically tested prompt variants and hit a quality ceiling.
Failure modes to check: under-specified instructions, missing chain-of-thought for complex tasks, no few-shot examples.
What NOT to do: abandon prompting after one or two attempts.
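
One way to make "hit a quality ceiling" checkable is to track the evaluation score of each prompt variant and stop only when recent variants no longer improve on the earlier best. The sketch below is an assumption about how such a gate might be written; the function name `hit_ceiling` and the `patience` and `min_gain` thresholds are hypothetical and should be tuned to the noise in your own evaluation suite.

```python
def hit_ceiling(scores: list[float], patience: int = 3, min_gain: float = 0.01) -> bool:
    """Return True when the last `patience` variants failed to beat the
    earlier best score by at least `min_gain`.

    `scores` holds one evaluation score per prompt variant, in the order tried.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(scores) <= patience:
        # Too few variants tried to claim the axis is exhausted.
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_gain


# Six variants tried; the last three add less than one point over the earlier best.
plateaued = hit_ceiling([0.60, 0.72, 0.80, 0.80, 0.805, 0.80])
# Only three variants tried; not enough evidence to advance.
too_early = hit_ceiling([0.60, 0.72, 0.80])
```

A gate like this guards against both failure modes named above: it refuses to advance after one or two attempts, and it gives an explicit signal once further prompt work has stopped paying off.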

Phase 2: Context (RAG/agents) - add when prompt engineering is exhausted
Add retrieval or agent steps when the model's knowledge or reasoning span is the limiting factor.
Route by failure type: information gaps → RAG; multi-step reasoning over external state → agents.
Trigger to advance: context construction cannot close the quality gap.
Failure modes: wrong chunking strategy, missing hybrid search, agent autonomy scoped too broadly for measured per-step reliability.
What NOT to do: add RAG to fix behavior/format failures that belong in Phase 1.

Phase 3: Model adaptation (finetuning) - only after product shows promise
Finetune to change form, style, or specialized behavior, not to inject missing facts.
Trigger to advance: prompting and context cannot achieve the required output style or domain behavior.
Failure modes: finetuning on a context gap (teaches the model to confabulate), too little evaluation data to derive annotation guidelines from.
What NOT to do: finetune before the product has validated demand.

Phase 4: Infrastructure optimization (inference, production architecture)
Add routing, caching, guardrails, and inference optimization incrementally. The five-step production architecture (context enhancement → guardrails → router/gateway → caching → agents) is additive.
Trigger to advance each step: measured production need.
What NOT to do: build full five-layer architecture before validating product-market fit.

Phase 5: Data flywheel - start collecting from day one, train when sufficient
Collect implicit feedback (edit pairs, regeneration requests, engagement time) from the first user interaction. The compounding advantage begins only when collection starts.
What NOT to do: defer feedback instrumentation until you have "enough users."
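
Feedback instrumentation can start small. The sketch below appends implicit-feedback events as JSON lines and later extracts edit pairs (model output, user-edited text) from them. The `FeedbackEvent` schema, its field names, and the helpers `log_event` and `edit_pairs` are hypothetical; the in-memory `io.StringIO` sink stands in for whatever file or queue a real system would write to.

```python
import io
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Iterable, Optional, TextIO

ALLOWED_KINDS = {"edit", "regenerate", "engagement"}


@dataclass(frozen=True)
class FeedbackEvent:
    """One implicit-feedback signal tied to a single model response."""
    response_id: str
    kind: str                                  # one of ALLOWED_KINDS
    model_output: str
    user_edit: Optional[str] = None            # text after the user edited it
    engagement_seconds: Optional[float] = None
    timestamp: float = field(default_factory=time.time)


def log_event(event: FeedbackEvent, sink: TextIO) -> None:
    """Append one event to the sink as a single JSON line."""
    if event.kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown feedback kind: {event.kind!r}")
    sink.write(json.dumps(asdict(event)) + "\n")


def edit_pairs(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Extract (model_output, user_edit) pairs from logged JSON lines."""
    pairs = []
    for line in lines:
        record = json.loads(line)
        if record["kind"] == "edit" and record["user_edit"] is not None:
            pairs.append((record["model_output"], record["user_edit"]))
    return pairs


# Usage: log two events, then recover the single edit pair.
sink = io.StringIO()
log_event(FeedbackEvent("r1", "edit", "Hi there", user_edit="Hello"), sink)
log_event(FeedbackEvent("r2", "regenerate", "Draft A"), sink)
pairs = edit_pairs(sink.getvalue().splitlines())
```

Logging raw events rather than derived training examples keeps the early data usable even if the eventual training format is not yet known.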

Why This Matters

Each phase transition carries a cost jump of roughly an order of magnitude in engineering effort, compute, and iteration time. Prompt changes cost almost nothing and take effect immediately; RAG requires a retrieval pipeline and index maintenance; finetuning requires a labeled dataset, training runs, and re-evaluation. Advancing prematurely to finetuning on a problem that prompt engineering would solve wastes resources and often makes the problem worse: training a model on a context gap teaches it to confabulate rather than admit ignorance.

The two symmetrical failure modes are equally costly. Engineers who skip ahead burn budget on infrastructure before validating that the product is worth building. Engineers who stop too early, accepting a quality ceiling after minimal prompting effort, leave a large fraction of achievable quality on the table at near-zero marginal cost. The cost-ordering principle keeps both failure modes in check by making the decision gate explicit: you must exhaust the current phase before the next one is justified.

Good Examples

Diagnosing an output quality problem and routing to the right fix: A model produces correct information but in the wrong format and tone. The failure is behavioral, not informational. The correct route is Phase 1 (prompt engineering: add output format instructions and few-shot examples of the target style) or Phase 3 (finetuning for persistent style/format adaptation), not Phase 2 (RAG adds no value when knowledge is not the gap).

Deciding between RAG and finetuning: A customer support model gives outdated product pricing. Pricing is factual and changes frequently, so this is a knowledge gap and RAG is the correct tool. A model that answers in the wrong persona or fails to follow brand voice guidelines shows a behavioral/form failure; if prompting cannot fix it, finetuning is the correct tool. The routing rule is: information-based failures → RAG; behavior/format/style failures → finetuning.
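
The routing rule can be written down as a small decision function. This is only a sketch: the `Failure` enum, the `route` function, and the `prompting_exhausted` flag are hypothetical names, and the multi-step branch reflects the Phase 2 routing (multi-step reasoning over external state → agents) rather than anything in this example.

```python
from enum import Enum


class Failure(Enum):
    INFORMATION = "information"  # missing, outdated, or private facts
    MULTI_STEP = "multi_step"    # multi-step reasoning over external state
    BEHAVIOR = "behavior"        # wrong format, tone, persona, or style


def route(failure: Failure, prompting_exhausted: bool) -> str:
    """Pick the cheapest technique that addresses the observed failure type."""
    if not prompting_exhausted:
        # Cost ordering: exhaust the prompt axis before anything else.
        return "prompt engineering"
    if failure is Failure.INFORMATION:
        return "RAG"
    if failure is Failure.MULTI_STEP:
        return "agents"
    return "finetuning"


# Outdated pricing is an information failure, so it routes to retrieval.
pricing_fix = route(Failure.INFORMATION, prompting_exhausted=True)
# A persistent brand-voice failure routes to finetuning.
voice_fix = route(Failure.BEHAVIOR, prompting_exhausted=True)
```

Making the failure type an explicit argument forces the diagnosis to happen before the tool is chosen, which is the point of routing by failure type rather than by perceived need.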

Scoping agent autonomy based on measured error rate: An agent pipeline has 10 sequential tool-call steps. If per-step reliability is measured at 95% and step failures are independent, end-to-end task success is 0.95^10 ≈ 0.60, or about 60%. Before extending the agent's autonomy or adding steps, improve the measured per-step reliability or insert human-in-the-loop checkpoints to catch compounding failures.
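
The arithmetic above generalizes to any step count and reliability target. The sketch below assumes every step has the same success probability and that failures are independent; the function names are illustrative, not from the book.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all sequential steps succeed, assuming independence."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def max_steps(per_step: float, target: float) -> int:
    """Largest number of sequential steps that still meets the target success rate."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    while per_step ** (steps + 1) >= target:
        steps += 1
    return steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed for `steps` sequential steps to meet the target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target ** (1.0 / steps)


ten_step_success = end_to_end_success(0.95, 10)   # about 0.60
autonomy_budget = max_steps(0.95, 0.90)           # 2 steps before a checkpoint
needed = required_per_step(0.90, 10)              # about 0.9895
```

Under these assumptions, a 95%-reliable step supports only two unsupervised steps at a 90% end-to-end target, which gives a concrete spacing for human-in-the-loop checkpoints.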

Counterpoints

Jumping to finetuning before exhausting prompting: This is the most common execution failure. Finetuning is perceived as "more powerful," but it is also ~100x more expensive to iterate on. Most format, style, and instruction-following failures are solvable at the prompt level. Finetuning prematurely on a context gap specifically trains the model to hallucinate rather than retrieve.

Building five-layer production architecture before validating product-market fit: The five-step production architecture (context enhancement, guardrails, routing, caching, agents) is designed to be adopted incrementally. Building all layers before measuring production traffic patterns wastes engineering time and couples optimization decisions to assumptions that production data would quickly invalidate.

Starting data flywheel collection too late: The data flywheel's compounding advantage (user feedback → training data → model improvement → more users) requires a head start. Teams that defer instrumentation until they have "enough users" discover that their earliest and most exploratory user interactions — the highest-signal data — were never captured. Collection infrastructure should be in place before the first external user.

Key Quotes

"Exhaust prompt engineering before RAG, RAG before finetuning. Each step is an order of magnitude more expensive. Skipping to finetuning over a context gap trains the model to confabulate." — Preface / Top 10 Actionable Takeaways

"Define evaluation criteria before writing application code. The same evaluation artifacts become finetuning annotation guidelines — early investment is doubly leveraged." — Ch. 4 (Evaluation-Driven Development)

"Route by failure type, not by perceived need: information-based failures → RAG; behavior/format/style failures → finetuning. Misrouting wastes both cost and quality." — Ch. 7

"Design the data flywheel from day one: collect implicit feedback even before you have enough to train on. The compounding advantage begins only when collection starts." — Ch. 10

Rules of Thumb

Never advance to the next phase until you have evidence the current phase is exhausted, not just attempted once.
Filter on hard attributes (license, context window, size) before spending effort on soft attribute optimization.
Use functional correctness as your evaluation standard — not lexical similarity — so quality gates reflect actual task success.
Treat each agent step as a probability multiplication point; scope autonomy to match your measured per-step reliability.
Optimize for goodput (requests per second satisfying SLO), not raw throughput — SLO violations do not count as useful work.
Keep your evaluation judge (model + prompt) frozen across longitudinal comparisons; judge drift changes the metric, not just the score.
Collect implicit feedback (edit pairs, regeneration clicks, engagement time) from the first day of user interaction; the flywheel only compounds from when collection starts.
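
The goodput rule above can be made concrete with a short calculation. The sketch below assumes the SLO is a single end-to-end latency bound; real SLOs often combine several metrics, and the `CompletedRequest` type, the field name `latency_s`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedRequest:
    """A finished request and its end-to-end latency in seconds."""
    latency_s: float


def throughput(requests: list[CompletedRequest], window_s: float) -> float:
    """All completed requests per second over the measurement window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return len(requests) / window_s


def goodput(requests: list[CompletedRequest], window_s: float, slo_latency_s: float) -> float:
    """Completed requests per second that met the latency SLO."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    within_slo = sum(1 for r in requests if r.latency_s <= slo_latency_s)
    return within_slo / window_s


# Ten requests completed in a 2-second window; the SLO is 1.0 s end to end.
latencies = [0.2, 0.4, 0.9, 1.5, 0.3, 2.0, 0.5, 0.8, 1.2, 0.1]
completed = [CompletedRequest(latency) for latency in latencies]
raw_rps = throughput(completed, window_s=2.0)                     # 5.0
good_rps = goodput(completed, window_s=2.0, slo_latency_s=1.0)    # 3.5
```

Here three of the ten requests miss the SLO, so the system serves 5.0 requests per second but only 3.5 of them count as useful work.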

Related References

The Three-Axis Model and AI Engineering Discipline - Three-axis model underlying the build order
Evaluation-Driven Development - Phase 0 in detail
Prompt Engineering and In-Context Learning - Phase 1 in detail
RAG, Agents, and Context Construction - Phase 2 in detail
Finetuning, LoRA, and Model Merging - Phase 3 in detail
Production Architecture and the Data Flywheel - Phases 4-5 in detail