RAG, Agents, and Context Construction - AI Engineering: Building Applications with Foundation Models

Key Principle

Chapter 6 operationalizes Axis 2 (context) of the three-axis quality model. RAG is the systematic practice of constructing context for foundation models — and the structural analogy is exact: "Context construction for foundation models is equivalent to feature engineering for classical ML models. They serve the same purpose: giving the model the necessary information to process an input." (Chapter 6) Just as model accuracy in classical ML is bounded by feature quality regardless of algorithm choice, response quality in foundation models is bounded by context quality regardless of model capability. Retriever quality is therefore the binding constraint on the full RAG pipeline — a generator cannot recover from a retriever that returned the wrong documents.

Hybrid search (term-based + embedding-based) is structurally necessary, not optional, because the two retrieval types have complementary failure modes. Term-based retrieval fails on semantic mismatch, where the query and the relevant document share meaning but not vocabulary; embedding-based retrieval fails on exact identifiers (error codes, model numbers). A system that uses only one type inherits that type's failure cases with nothing to cover them. Hybrid search resolves this by combining both: either sequentially (sparse narrowing → semantic reranking) or in parallel via Reciprocal Rank Fusion (Score(D) = Σᵢ 1/(k + rᵢ(D)), where rᵢ(D) is the rank of document D in retriever i's list and k is a smoothing constant).
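
The Reciprocal Rank Fusion formula above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the function name, the document IDs, and the tie-breaking rule are all assumptions made for the example. A document missing from one retriever's list simply contributes nothing for that retriever.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant from the formula Score(D) = sum_i 1 / (k + r_i(D)).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic
    # (the tie-break is an assumption, not part of the formula).
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a term-based and an embedding-based retriever.
sparse = ["doc_err99", "doc_ports", "doc_sockets"]
dense = ["doc_sockets", "doc_err99", "doc_network"]
fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Because only ranks enter the score, the two retrievers never need comparable relevance scores, which is the main reason to fuse this way.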

The three-tier memory system maps directly onto the book's three axes: in-weights memory (Axis 3, finetuning), in-context memory (Axis 1, prompt engineering), and external memory (Axis 2, RAG). Every information failure has a specific locus and a specific remedy. The agent planning frameworks — ReAct (interleaved reasoning and action) and Reflexion (evaluator + self-reflection for within-task learning) — extend RAG's retrieve-then-generate pattern into a loop, but introduce compound error: at 95% per-step accuracy, accuracy over 100 steps drops to 0.6%.

Chunking — how documents are split before indexing — is a non-obvious failure mode. Poor chunking produces retrieval failures invisible at the evaluation stage: the right document is retrieved, but the relevant passage is split across two chunks, neither containing the complete information. There is no universal optimal chunk size; the tradeoff is between diversity (smaller chunks) and coherence (larger chunks), and overlap size must be determined empirically. Contextual retrieval (Anthropic) mitigates this by prepending 50–100 token AI-generated context to each chunk before indexing, recovering document-level context lost when chunks are embedded in isolation.
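
A rough sketch of fixed-size chunking with overlap, followed by a contextual-retrieval style prefix. Everything here is illustrative: whitespace-split words stand in for real tokens, and `context_for` is a hypothetical callable standing in for the model call that would generate the per-chunk context.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

def contextualize(chunks, context_for):
    """Prepend a short document-level context string to each chunk before indexing.

    context_for is a hypothetical function (chunk_text -> context string);
    in practice it would wrap a model call.
    """
    return [f"{context_for(' '.join(c))}\n\n{' '.join(c)}" for c in chunks]

words = "Error 99 means the address is not available so rebind the socket".split()
chunks = chunk_with_overlap(words, chunk_size=6, overlap=2)
indexed = contextualize(chunks, lambda text: "From the socket troubleshooting guide.")
```

The overlap means a passage that straddles a boundary appears whole in at least one chunk, provided it is no longer than the overlap; suitable values for `chunk_size` and `overlap` still have to be found empirically.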

Why This Matters

RAG should be chosen over finetuning when the failure is informational (the model lacks current, proprietary, or session-specific facts) rather than behavioral. Finetuning addresses in-weights knowledge and style; RAG addresses external retrieval. Finetuning to inject frequently changing facts incurs training cost on every update and still leaves the model unaware of facts that arrive after the last training run. RAG addresses this at the right locus. The operational threshold: if the knowledge base is under ~200,000 tokens (~500 pages), including it in-prompt is viable; above that, RAG is necessary. RAG will also remain relevant despite longer context windows because context requirements expand with available data, and attention quality degrades over longer contexts.

Hybrid search is structurally necessary because no single retrieval method covers all query types. Pure embedding-based retrieval fails on precise technical identifiers that become obscured during embedding; pure term-based retrieval fails on natural-language semantic queries. The compound error rate has direct implications for agent scope decisions: at 95% per-step accuracy, a 100-step task succeeds only 0.95^100 ≈ 0.6% of the time, so granting write-action autonomy to a 100-step agent over irreversible operations is indefensible. Scope decisions must be calibrated to measured per-step accuracy: the fewer the steps and the higher the per-step accuracy, the more autonomy is justified.

Good Examples

Hybrid search implementation: Build two retrieval paths in parallel — a TF-IDF/BM25 sparse retriever for keyword and exact-match queries, and an ANN-based dense retriever for semantic queries. Combine rankings using Reciprocal Rank Fusion with k ≈ 60, which avoids the need for calibrated relevance scores across retrievers. This covers both natural-language queries (where semantic retrieval excels) and technical queries with error codes or product identifiers (where term-based retrieval excels).

ReAct vs. Reflexion choice: Use ReAct when the task requires step-by-step self-correction during execution — the Thought → Act → Observation loop enables recovery from intermediate errors before they compound. Use Reflexion when the task benefits from learning across multiple attempts within a session — the evaluator scores outcomes and the self-reflection module diagnoses failure causes, allowing the agent to propose a revised trajectory. Note that Reflexion's reflection tokens and example storage consume context space, reducing room for task content.
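
The Thought → Act → Observation loop can be sketched as follows. This is a toy under stated assumptions, not the ReAct authors' code: `policy` is a hypothetical stand-in for the model call, `scripted_policy` replaces it with fixed behavior so the example runs offline, and the `multiply` tool and `finish` action name are invented for illustration.

```python
def react_loop(task, policy, tools, max_steps=5):
    """Run a Thought -> Act -> Observation loop until the policy finishes.

    policy: callable(transcript) -> (thought, action_name, action_input).
    tools: dict mapping action names to callables taking one string.
    Returns (answer, transcript); answer is None if max_steps is exhausted.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        transcript.append(f"Act: {action}[{arg}]")
        # The observation is fed back so the next thought can react to it.
        transcript.append(f"Observation: {observation}")
    return None, transcript

def scripted_policy(transcript):
    """Hypothetical stand-in for a model: call the tool once, then finish."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return ("I should compute the product.", "multiply", "12,7")
    return ("The tool returned the product.", "finish", observations[-1][len(prefix):])

def multiply(arg):
    a, b = arg.split(",")
    return str(int(a) * int(b))

answer, transcript = react_loop("What is 12 * 7?", scripted_policy, {"multiply": multiply})
```

The `max_steps` cap matters in practice: each extra step is another chance for an error, so the loop length is part of the reliability budget.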

Scoping write-actions to match measured reliability: Before granting an agent permission to send emails or execute SQL writes, measure per-step accuracy and compute the compound accuracy for the intended task length. At 95% per-step accuracy over 10 steps, compound accuracy is ~60% — acceptable for low-stakes tasks but not for irreversible operations. Gate write-action autonomy on demonstrated compound accuracy, and require human confirmation for irreversible actions until reliability is demonstrably higher.
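
The arithmetic behind this gate is simple enough to compute directly. The sketch below assumes every step must succeed and that step errors are independent, which is the same simplification the 60% and 0.6% figures rely on; the function names and the 0.9 target are illustrative choices, not values from the book.

```python
def compound_accuracy(per_step, steps):
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step ** steps

def max_steps_for(per_step, target):
    """Largest step count whose compound accuracy stays at or above target."""
    if not 0 < per_step < 1 or not 0 < target <= 1:
        raise ValueError("need 0 < per_step < 1 and 0 < target <= 1")
    steps, acc = 0, 1.0
    while acc * per_step >= target:
        acc *= per_step
        steps += 1
    return steps

print(round(compound_accuracy(0.95, 10), 2))    # 0.6
print(round(compound_accuracy(0.95, 100), 3))   # 0.006
print(max_steps_for(0.95, 0.9))                 # 2
```

Read the last line as a scoping rule: at 95% per-step accuracy, an agent that must clear a 90% end-to-end bar can be trusted with only two unsupervised steps.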

Counterpoints

Using only embedding-based retrieval: Embedding-based retrieval obscures exact identifiers — error codes like EADDRNOTAVAIL (99), product model numbers, precise version strings — after the embedding transformation. A system relying solely on semantic retrieval will fail systematically on technical queries requiring exact matches. Hybrid search is the structural fix; dropping term-based retrieval to simplify the system introduces a predictable failure mode.

Granting write-action permissions before measuring per-step accuracy: Write actions (SQL writes, email sends, bank transfers) are categorically different from read-only tools because they are often irreversible. Granting write-action autonomy without measuring compound accuracy assumes the agent is reliable. "Just as you shouldn't give an intern the authority to delete your production database, you shouldn't allow an unreliable AI to initiate bank transfers." (Chapter 6) The compound error math makes this concrete: at 95% per-step accuracy, 0.6% accuracy at 100 steps means near-certain failure.

Treating RAG as solving both information and behavior failures: RAG addresses the external memory locus — facts the model does not have. It does not address behavioral failures (the model knows the facts but formats, tones, or reasons incorrectly). Using RAG to fix behavioral problems wastes retrieval infrastructure and leaves the actual problem unaddressed. Behavioral failures belong to prompt engineering (Axis 1) or finetuning (Axis 3).

Key Quotes

"Context construction for foundation models is equivalent to feature engineering for classical ML models. They serve the same purpose: giving the model the necessary information to process an input." (Chapter 6)
"If the model's accuracy is 95% per step, over 10 steps, the accuracy will drop to 60%, and over 100 steps, the accuracy will be only 0.6%." (Chapter 6)
"No matter how long a model's context length is, there will be applications that require context longer than that. After all, the amount of available data only grows over time." (Chapter 6)
"Just as you shouldn't give an intern the authority to delete your production database, you shouldn't allow an unreliable AI to initiate bank transfers." (Chapter 6)

Rules of Thumb

Retriever quality bounds generator quality — retrieval method selection is an architectural decision, not an implementation detail.
Use hybrid search by default; pure embedding-based retrieval will fail on exact-match technical queries, and pure term-based will fail on semantic queries.
Before granting write-action autonomy, compute compound accuracy for the intended task length (per-step accuracy ^ number of steps).
Match the memory tier to the failure locus: external retrieval for missing facts, in-context for session-specific data, finetuning for universal behavioral patterns.
Validate plans before executing them — generating and checking a plan before running it prevents burning tool call budgets on fundamentally flawed execution paths.
Ablate tools empirically: remove tools that don't improve performance, identify misused tools via call distribution analysis — more tools increase planning complexity and hallucination surface area.
Avoid FIFO context window management for long agent sessions; the first message often contains task constraints that govern the entire session and must be preserved.
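
As an illustration of the last rule above, here is one possible alternative to FIFO trimming: pin the first message and fill the remaining budget with the most recent messages. It is a sketch under assumptions, not a prescribed method; the function name is hypothetical and whitespace word counts stand in for a real tokenizer.

```python
def trim_history(messages, budget, count_tokens=lambda m: len(m.split())):
    """Keep the first message plus as many of the newest messages as fit.

    The first message is always kept, even if it alone exceeds the budget,
    because it is assumed to carry the session's task constraints.
    """
    if not messages:
        return []
    first, rest = messages[0], messages[1:]
    remaining = budget - count_tokens(first)
    kept = []
    # Walk backwards from the newest message and stop at the first that
    # does not fit, so the kept tail stays contiguous.
    for msg in reversed(rest):
        cost = count_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return [first] + kept[::-1]

history = [
    "never delete production data",
    "step one done",
    "step two done",
    "step three done now",
]
trimmed = trim_history(history, budget=11)
```

With a plain FIFO policy the same budget would have dropped the constraint in the first message; here the oldest intermediate step is dropped instead.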

Related References

Finetuning, LoRA, and Model Merging - RAG vs. finetuning decision (form vs. facts)
The Three-Axis Model and AI Engineering Discipline - RAG as Axis 2 of the three-axis model
Production Architecture and the Data Flywheel - Agents in five-step production architecture