Rules of Thumb and Heuristics

Key Principle

This file collects the book's key heuristics organized by the type of decision they address. Rather than re-reading chapters, use this as a fast lookup when you face a concrete engineering decision. Each heuristic is derived directly from the frameworks and takeaways in "AI Engineering" by Chip Huyen (O'Reilly, 2025).

Technique Selection

Choose between prompt engineering, RAG, and finetuning using the three-axis cost order and failure-type routing:

Cost-order rule: Exhaust prompt engineering before adding RAG; exhaust RAG before finetuning. Each step is an order of magnitude more expensive than the previous one.
Form vs. facts routing: Route information-based failures (wrong facts, missing knowledge) to RAG. Route behavior, format, and style failures to finetuning. Misrouting wastes both cost and quality. (Ch. 7)
Context gap warning: Skipping to finetuning over a context gap trains the model to confabulate rather than retrieve. (Ch. 7)
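When to add agents: Add agents only when per-step reliability is high enough for the number of steps involved, because step errors compound. At 95% per-step reliability, a 10-step task succeeds only about 60% of the time (0.95^10 ≈ 0.60, assuming steps fail independently). Scope autonomy to match measured reliability. (Ch. 6)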
When to use hybrid search: Combine term-based and embedding-based retrieval by default — they cover complementary failure modes. (Ch. 6)
When to optimize inference: Define per-request latency SLOs before choosing a batching or parallelism strategy. Optimization choices are only meaningful relative to an SLO target. (Ch. 9)
Memory tier selection: Map memory to the three-axis model — in-context memory for immediate instructions, external storage for context construction, in-weights memory for persistent behavior. (Ch. 6)
Structured output: Use constrained decoding or schema enforcement when downstream systems require machine-readable output. Do not rely on instruction-following alone. (Ch. 2)
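
The compounding arithmetic behind the agent heuristic above can be checked directly. The sketch below assumes every step fails independently and shares a single reliability figure, which is a simplification. The function names are illustrative and do not come from the book.

```python
def task_success_probability(per_step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps


def max_steps_for_target(per_step_reliability: float, target: float) -> int:
    """Largest step count whose end-to-end success rate still meets `target`."""
    if not 0.0 < per_step_reliability < 1.0:
        raise ValueError("per_step_reliability must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    probability = 1.0
    # Keep adding steps while one more step would still meet the target.
    while probability * per_step_reliability >= target:
        probability *= per_step_reliability
        steps += 1
    return steps


if __name__ == "__main__":
    print(f"{task_success_probability(0.95, 10):.3f}")  # 0.599
    print(max_steps_for_target(0.95, 0.90))             # 2
```

Read in the other direction, the second function gives a rough ceiling on task length for a given reliability target, which is one way to make "scope autonomy to match measured reliability" concrete.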

Evaluation

Define and instrument evaluation before writing application code:

Evaluation before code: Define evaluation criteria before writing application code (Evaluation-Driven Development — the AI analog of TDD). Early investment is doubly leveraged because evaluation artifacts become finetuning annotation guidelines. (Ch. 4)
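Functional correctness as ultimate metric: Evaluate whether the output performs its intended function, not whether it looks correct. For generated code, a text-overlap score such as BLEU can rate non-working code highly; executing the code against test cases measures what matters. (Ch. 3)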
Hard attributes first: Filter candidate models on hard attributes (license compatibility, context window, latency envelope) before optimizing soft attributes (accuracy, toxicity). No prompt engineering fixes a licensing incompatibility. (Ch. 4)
AI-as-judge is a system: An AI judge is the combination of model plus prompt — not just a model. Treat judge selection and prompt as locked configuration. (Ch. 3)
Judge stability rule: Never change the judge mid-project. Judge drift changes the metric and corrupts longitudinal comparisons. (Ch. 3)
Comparative evaluation never saturates: When absolute scoring stalls, switch to pairwise ranking — it continues to discriminate as quality improves. (Ch. 3)
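
One way to enforce the two judge heuristics above is to pin the judge as a single immutable configuration and stamp every score with its fingerprint. This is a minimal sketch, not the book's method: `JudgeConfig`, `ScoreRecord`, and `assert_comparable` are hypothetical names, and the model name and prompts in the usage lines are placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """A judge is the model plus the prompt; both are pinned together."""
    model: str
    prompt_template: str

    def fingerprint(self) -> str:
        # Any change to the model or the prompt yields a different fingerprint.
        payload = self.model + "\x00" + self.prompt_template
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ScoreRecord:
    example_id: str
    score: float
    judge_fingerprint: str


def assert_comparable(records: list[ScoreRecord]) -> None:
    """Raise if the scores were produced by more than one judge configuration."""
    fingerprints = {record.judge_fingerprint for record in records}
    if len(fingerprints) > 1:
        raise ValueError(
            f"scores come from {len(fingerprints)} different judges: {sorted(fingerprints)}"
        )


if __name__ == "__main__":
    judge = JudgeConfig("placeholder-judge-model", "Rate the answer from 1 to 5.")
    edited = JudgeConfig("placeholder-judge-model", "Rate the answer from 1 to 10.")
    print(judge.fingerprint() == edited.fingerprint())  # False
```

Calling `assert_comparable` before plotting a longitudinal trend turns silent judge drift into an explicit error.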

Data and Training

Treat data quality as the binding constraint on model quality:

Annotation guidelines equal evaluation guidelines: The criteria you write for human evaluators (Ch. 4) are the same criteria annotators need (Ch. 8). Define them once and reuse them for both. (Ch. 4, Ch. 8)
Synthetic data mixing rule: Mix synthetic data with real data rather than replacing it. Use real data as a floor; use synthetic data to amplify known-good patterns. (Ch. 8)
Model collapse threshold: Recursive training on synthetic data amplifies distribution skew. Prevent collapse by maintaining a real-data anchor in every training run. (Ch. 8)
Data flywheel timing: The compounding advantage of the data flywheel begins only when feedback collection starts. Design collection (implicit signals: edit pairs, regeneration requests, engagement time) from day one, even before you have enough data to train on. (Ch. 8, Ch. 10)
Quality × coverage × quantity triage: Quality problems cannot be fixed by adding more data. Fix quality first, then coverage gaps, then scale quantity. (Ch. 8)
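
The mixing and real-data-anchor heuristics above can be sketched as a cap on how much synthetic data enters a training set. The function name and the `min_real_fraction` parameter are assumptions made for illustration; the source gives no specific ratio, so any value passed in is a placeholder.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def mix_with_real_floor(
    real: Sequence[T],
    synthetic: Sequence[T],
    min_real_fraction: float,
    seed: int = 0,
) -> list[T]:
    """Keep all real examples and add synthetic ones only up to the floor."""
    if not real:
        raise ValueError("a real-data anchor is required")
    if not 0.0 < min_real_fraction <= 1.0:
        raise ValueError("min_real_fraction must be in (0, 1]")

    n_real = len(real)
    # Largest synthetic count that keeps real / (real + synthetic) >= floor.
    max_synthetic = int(n_real * (1.0 - min_real_fraction) / min_real_fraction)
    # Guard against floating-point rounding pushing the count one too high.
    while max_synthetic > 0 and n_real / (n_real + max_synthetic) < min_real_fraction:
        max_synthetic -= 1

    rng = random.Random(seed)
    sampled = rng.sample(list(synthetic), min(max_synthetic, len(synthetic)))
    mixed = list(real) + sampled
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real = [("real", i) for i in range(10)]
    synthetic = [("synthetic", i) for i in range(100)]
    mixed = mix_with_real_floor(real, synthetic, min_real_fraction=0.5)
    print(len(mixed))  # 20
```

Applying the same cap on every training run, rather than only the first, is what keeps the real-data anchor in place across recursive rounds.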

Production and Architecture

Build incrementally and instrument for feedback from the start:

Five-step architecture ordering: Add production components in this sequence — context enhancement → guardrails → router/gateway → caching → agents. Each step is independently deployable and testable. (Ch. 10)
Goodput over throughput: Optimize for the number of requests per second that satisfy the SLO (goodput), not raw throughput. Maximizing utilization at the cost of SLO violations reduces the useful work actually delivered. (Ch. 9)
Agent scope rule: Agent autonomy must match measured per-step reliability. Expand scope only after reliability is validated; use human-in-the-loop checkpoints for high-stakes intermediate steps. (Ch. 6, Ch. 10)
Feedback loop design: Collect implicit user feedback (edit pairs, regeneration requests) before you collect explicit ratings. Implicit signals scale further and reflect actual behavior. Guard against degenerate feedback loops where model predictions corrupt the training signal. (Ch. 10)
Observability as production evaluation: At production scale, the observability stack (metrics, logs, traces, drift detection) is the continuation of evaluation-driven development. Instrument for the same criteria defined in Ch. 4. (Ch. 10)
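
The goodput heuristic above reduces to counting only the requests that meet the SLO. The sketch below assumes the SLO is expressed as limits on time to first token (TTFT) and time per output token (TPOT); the class names and the threshold values in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestTiming:
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


@dataclass(frozen=True)
class LatencySLO:
    max_ttft_ms: float
    max_tpot_ms: float

    def is_met(self, timing: RequestTiming) -> bool:
        return timing.ttft_ms <= self.max_ttft_ms and timing.tpot_ms <= self.max_tpot_ms


def throughput_and_goodput(
    timings: Sequence[RequestTiming], slo: LatencySLO, window_seconds: float
) -> tuple[float, float]:
    """Return (requests/s completed, requests/s completed within the SLO)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    within_slo = sum(1 for timing in timings if slo.is_met(timing))
    return len(timings) / window_seconds, within_slo / window_seconds


if __name__ == "__main__":
    slo = LatencySLO(max_ttft_ms=200.0, max_tpot_ms=100.0)  # placeholder targets
    timings = [
        RequestTiming(150.0, 80.0),
        RequestTiming(300.0, 80.0),   # misses the TTFT limit
        RequestTiming(120.0, 140.0),  # misses the TPOT limit
        RequestTiming(180.0, 90.0),
    ]
    print(throughput_and_goodput(timings, slo, window_seconds=2.0))  # (2.0, 1.0)
```

In this toy window, half of the completed requests miss the SLO, so goodput is half of throughput even though the server looks fully busy.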

Why This Matters

The heuristics in this file are durable precisely because they are tied to structural constraints — cost curves, error compounding, distribution shift — rather than to specific models or APIs. Chip Huyen invokes Lindy's Law in the Preface to make this point explicit: techniques that have survived multiple generations of model releases tend to reflect real underlying tradeoffs, not temporary workarounds. In a field where models change quarterly, decision rules grounded in cost ordering, failure-type routing, and feedback loop design provide stable leverage regardless of which model is current.

Key Quotes

"Exhaust prompt engineering before RAG, RAG before finetuning. Each step is an order of magnitude more expensive." — Preface / Core Thesis
"Finetuning is for form, RAG is for facts." — Ch. 7
"At 95% per-step reliability, a 10-step task succeeds 60% of the time." — Ch. 6
"The compounding advantage begins only when collection starts." — Ch. 10