AI Engineering: Building Applications with Foundation Models · 1 of 13
AI Software Development CRITICAL

The Three-Axis Model and AI Engineering Discipline

ai-engineering three-axis-model foundation-models evaluation discipline

Key Principle

Response quality is a function of three variables: (1) Instructions — what the model is told to do (prompt engineering); (2) Context — information available to the model at inference time (RAG, agents); (3) The Model — the underlying parameters. These axes must be optimized in order of cost and reversibility: instructions are cheapest and fastest to iterate, context requires infrastructure but no model changes, and model adaptation (finetuning) is the most expensive and least reversible. (Preface)

AI engineering is a discipline that emerged from a double movement: foundation models simultaneously increased the range of possible tasks and removed infrastructure barriers via model-as-a-service. The resulting workflow inversion — build product first, invest in data and model only if the product validates — is what structurally distinguishes AI engineering from traditional ML engineering, which commits to expensive data collection before validating the product hypothesis. (Chapter 1)

The capability-application inflection point explains why this shift is qualitative, not incremental: below a reliability threshold, AI can augment a workflow but not replace human review; above it, entirely new interaction modalities become feasible. "I thought a small increase in model quality metrics might result in a modest increase in applications. Instead, it resulted in an explosion of new possibilities." (Preface)

Lindy's Law provides a practical filter for technique durability: a technique that has been important for several years is more likely to remain important than one that emerged six months ago. The failure mode is investing in tool-specific knowledge (particular prompt frameworks, vector database APIs, agent orchestration libraries) that decays when the tool is superseded, as happened to much of the TensorFlow-specific expertise built around 2017. (Preface)

Why This Matters

The three-axis model solves the misdiagnosis problem: when output quality is poor, teams frequently jump to finetuning because it feels like the most powerful lever. But the actual problem is usually an instruction gap (the model has not been told what good looks like for this task) or a context gap (the model lacks the information needed to respond correctly). Finetuning over a context gap trains the model to confabulate rather than to reason from supplied information. Finetuning over an instruction gap teaches a style instead of fixing a specification. Both waste resources and obscure the true source of quality failures.

Inverting or skipping the axes also produces a false ceiling effect. Teams exhaust their finetuning budget, see diminishing returns, and conclude the task is unsolvable, when a structured prompt and a retrieval step would have resolved it in a day. The cost ordering is also a diagnostic method: each cheaper axis, when properly applied, shows whether the remaining gap is structural (requiring a different or adapted model) or still tractable with the next cheapest intervention. Without this discipline, engineers cannot distinguish a model limitation from an engineering limitation.
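The cost ordering can be read as a simple triage loop. The sketch below is one possible way to encode it, not a method from the book: the function name `next_axis`, the `history` structure, and the `min_gain` plateau threshold of 0.02 are all assumptions chosen for illustration. It recommends the cheapest axis that has not yet plateaued, and flags the gap as possibly structural only after all three axes have stalled.

```python
from typing import Dict, List, Optional

# Axes in order of increasing cost and decreasing reversibility.
AXES_BY_COST = ["instructions", "context", "model"]


def next_axis(
    history: Dict[str, List[float]],
    target: float,
    min_gain: float = 0.02,  # assumed plateau threshold, tune per project
) -> Optional[str]:
    """Suggest which axis to work on next.

    history maps an axis name to the eval pass rates (0.0 to 1.0) recorded
    after each iteration on that axis. Returns None once the target is met,
    or "reassess" when every axis has plateaued below the target.
    """
    all_scores = [s for scores in history.values() for s in scores]
    if all_scores and max(all_scores) >= target:
        return None

    for axis in AXES_BY_COST:
        scores = history.get(axis, [])
        if len(scores) < 2:
            return axis  # too few iterations to call this axis exhausted
        if scores[-1] - scores[-2] >= min_gain:
            return axis  # the last iteration still helped
    return "reassess"  # all three axes plateaued: the gap may be structural


# Prompt iterations have stalled at 0.71, so the next cheapest axis is suggested.
history = {"instructions": [0.55, 0.70, 0.71]}
print(next_axis(history, target=0.90))  # context
```

The point of the ordering is visible in the loop: finetuning is only suggested after both cheaper axes have been given a fair number of iterations and stopped improving.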

Good Examples

Correct axis ordering — context gap caught before finetuning: A team building a customer-support assistant finds the model giving outdated policy answers. Rather than finetuning on correct answers (which would need to be refreshed every policy update), they add a RAG layer that retrieves current policy documents at inference time. The instruction axis is also updated to tell the model to cite the retrieved passage. Finetuning is never needed.
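A minimal sketch of that fix follows, under loud assumptions: the policy texts, the document IDs, and the word-overlap retriever are invented for illustration, and a real system would likely use a proper search index or embedding-based retrieval. What it shows is the shape of the change: current policy text is fetched at inference time (context axis) and the prompt tells the model to cite it (instruction axis), with no change to the model.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical policy snippets; in practice these would come from a document store
# that is updated whenever a policy changes.
POLICIES: Dict[str, str] = {
    "refunds-2024": "A refund is available within 30 days of purchase.",
    "shipping-2024": "Standard shipping takes 5 business days.",
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: Dict[str, str], k: int = 1) -> List[Tuple[str, str]]:
    """Toy retriever: rank documents by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(docs.items(), key=lambda item: len(q & tokenize(item[1])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: List[Tuple[str, str]]) -> str:
    """Combine the citation instruction with the retrieved passages."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the policy passages below. "
        "Cite the passage ID in square brackets. "
        "If the passages do not cover the question, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


question = "Can I get a refund 20 days after purchase?"
passages = retrieve(question, POLICIES)
print(build_prompt(question, passages))
```

When a policy changes, only the entry in the document store needs updating; the prompt and the model stay as they are.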

Workflow inversion applied: A team hypothesizes an AI-powered code review tool. Instead of collecting training data and training a model, they build a prompt-based prototype in a week. User testing reveals the actual need is explanation of existing bugs, not detection of new ones. The product pivot costs one prompt rewrite. Under traditional ML workflow, the team would have collected code/review pairs for months before discovering the hypothesis was wrong.

Lindy's Law applied to technique selection: An engineer must decide whether to build expertise in a specific agent orchestration library or in the underlying pattern of tool-calling and state management. Because the pattern has been stable across multiple library generations, it passes the Lindy filter; the specific library's API does not. Investment goes into fundamentals, which transfer when the library is superseded.

Counterpoints

Premature finetuning: Jumping to finetuning before exhausting prompt and context optimization is the most common and expensive antipattern. It trains on symptoms rather than causes, creates a model tightly coupled to a snapshot of requirements, and makes subsequent quality failures harder to diagnose because the finetuned model no longer behaves as the base model's documentation describes.

Treating AI engineering as entirely new: The surface artifacts of AI engineering — APIs, JSON outputs, prompt templates — can mislead practitioners into thinking there is nothing to carry forward from ML engineering. "The familiarity and ease of use of many AI engineering techniques can mislead people into thinking there is nothing new to AI engineering. But while many principles for building AI applications remain the same, the scale and improved capabilities of AI models introduce opportunities and challenges that require new solutions." (Preface) Discarding ML fundamentals — evaluation rigor, data quality discipline, production monitoring — leaves teams without the tools needed for the hard 20–40% of quality that prompting alone cannot close.

Tool-specific knowledge as a substitute for fundamentals: Building expertise in specific prompt frameworks, vector database APIs, or agent libraries rather than in the underlying principles of instruction design, retrieval, and model adaptation creates brittle expertise. Scribd observed a two-orders-of-magnitude drop in AI costs in a single year (Chapter 1); the specific tools behind those costs will continue to change. The principles behind the three axes are stable; their implementations are not.

Key Quotes

  • "I thought a small increase in model quality metrics might result in a modest increase in applications. Instead, it resulted in an explosion of new possibilities." (Preface)
  • "The familiarity and ease of use of many AI engineering techniques can mislead people into thinking there is nothing new to AI engineering. But while many principles for building AI applications remain the same, the scale and improved capabilities of AI models introduce opportunities and challenges that require new solutions." (Preface)
  • "With traditional ML engineering, you usually start with gathering data and training a model. Building the product comes last. However, with AI models readily available today, it's possible to start with building the product first." (Chapter 1)
  • "In AI, there are generally three types of competitive advantages: technology, data, and distribution... With foundation models, the core technologies of most companies will be similar." (Chapter 1)

Rules of Thumb

  • Exhaust instruction optimization before adding retrieval infrastructure; exhaust context optimization before finetuning.
  • If a quality failure cannot be reproduced with a fixed prompt and fixed context, it is not a finetuning problem — it is an evaluation problem.
  • The workflow inversion means your eval set cannot fully exist before you have users; design for iterative eval expansion, not a fixed benchmark.
  • Apply Lindy's Law before investing in technique mastery: if a technique has not survived at least two generations of tooling churn, treat it as provisional.
  • AI criticality (critical vs. complementary, proactive vs. reactive) sets the minimum required reliability floor and therefore the minimum evaluation investment — determine this before choosing an approach.
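The second rule above (reproduce the failure with a fixed prompt and fixed context before considering finetuning) can be checked mechanically. The sketch below is an illustration under assumptions: `generate` stands in for whatever model call is in use, `passes` for whatever check defines success, and the 0.8 cutoff for "reproducible" is an arbitrary choice, not a figure from the book.

```python
from itertools import cycle
from typing import Callable


def failure_rate(
    generate: Callable[[str], str],  # placeholder for the real model call
    prompt: str,                     # fixed prompt with fixed context already included
    passes: Callable[[str], bool],   # placeholder for the real quality check
    runs: int = 10,
) -> float:
    """Run the same prompt repeatedly and return the fraction of failing outputs."""
    failures = sum(1 for _ in range(runs) if not passes(generate(prompt)))
    return failures / runs


def classify_failure(rate: float, reproducible_at: float = 0.8) -> str:
    """Label a failure by how consistently it reproduces (cutoff is an assumption)."""
    if rate == 0.0:
        return "not reproduced"
    if rate >= reproducible_at:
        return "reproducible"
    return "intermittent"


# Stub model that fails on one call in four, standing in for a real endpoint.
outputs = cycle(["ok", "bad", "ok", "ok"])
rate = failure_rate(lambda prompt: next(outputs), "fixed prompt", lambda out: out == "ok", runs=8)
print(rate, classify_failure(rate))  # 0.25 intermittent
```

Under this reading, only a "reproducible" failure is a candidate for work on the three axes; an "intermittent" or "not reproduced" result suggests tightening the evaluation setup first.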

Related References