Prompt Engineering and In-Context Learning - AI Engineering: Building Applications with Foundation Models

Key Principle

Prompt engineering is Axis 1 of the three-axis quality model (instructions → context → model). It is the lowest-cost, highest-leverage intervention before RAG or finetuning, and most teams hit model limitations before prompt limitations (Chapter 5).

In-context learning (ICL) works because models trained on next-token prediction have seen billions of demonstrations before tasks — tutorials, worked examples, Stack Overflow Q&A. Adding examples activates behavioral patterns already in the weights; it does not teach new information. This explains why stronger models need fewer shots (Microsoft 2023: GPT-4 shows only limited few-shot improvement on general tasks) and why domain-specific APIs with novel syntax still benefit from few-shot — the model never saw those demonstrations (Chapter 5).

Chain-of-thought (CoT) has a causal mechanism: intermediate reasoning tokens constrain the probability distribution over the answer token. The model conditions its answer on correct-looking reasoning rather than pattern-matching directly from input to output. This is why CoT reduces hallucinations — hallucinations are plausible-sounding tokens generated without sufficient contextual constraint; CoT adds constraint tokens before the high-risk answer region (Chapter 5).

System vs. user prompts are concatenated into a single string via the model's chat template and processed identically. The system prompt has no architectural privilege. Any priority advantage comes from positional primacy (leading tokens receive stronger attention) or from instruction-hierarchy finetuning (Wallace et al., 2024), which achieved up to 63% robustness improvement but requires explicit finetuning — it is not the default (Chapter 5).

Context length and "lost in the middle": Liu et al. (2023) showed models perform worst when target information is in the middle of long prompts. A larger context window does not mean equal attention across all tokens. Structural placement of critical instructions at the beginning or end is a real engineering decision (Chapter 5).

Why This Matters

Prompt engineering is the right starting point because it requires no infrastructure changes, no training data, and no deployment pipeline — only iteration on text. Starting with RAG or finetuning before exhausting prompt engineering wastes engineering time on heavier interventions that may not be necessary. The GoDaddy case (2024) illustrates this: a 1,500+ token prompt was decomposed, reducing both cost and improving performance — a pure prompt-level fix (Chapter 5).

CoT's hallucination-reduction mechanism matters because it reframes the problem: hallucinations are not random model failures but a consequence of insufficient contextual constraint during generation. Each reasoning token generated before the answer narrows the distribution over the answer. This means CoT is not a stylistic preference — it is an architectural intervention within autoregressive generation, and skipping it on reasoning-heavy tasks leaves that constraint mechanism unused.

Good Examples

Few-shot prompt for domain-specific syntax (Ibis dataframes)

# Convert pandas to Ibis syntax

pandas:  df.groupby("region")["sales"].sum()
ibis:    t.group_by("region").agg(t.sales.sum())

pandas:  df[df["revenue"] > 1000]
ibis:    t.filter(t.revenue > 1000)

pandas:  df.sort_values("date", ascending=False).head(10)
ibis:

This works because training data underrepresented Ibis syntax; the demonstrations supply what the model never saw.

Zero-shot CoT prompt for a reasoning task

A warehouse receives 240 units on Monday and ships 35% on Tuesday.
On Wednesday it receives 80 more units. How many units remain?

Think through this step by step before giving your final answer.

The "think step by step" instruction forces intermediate tokens that constrain the arithmetic answer.

Prompt decomposition to address the "lost in the middle" problem Instead of a single 1,500-token prompt containing instructions, persona, constraints, examples, and the user query, decompose into:

System prompt: concise role + top-priority constraints (placed first)
Retrieved context: only the most relevant chunks (not all available)
User query: at the end (recency bias works in your favor) This mirrors the GoDaddy approach that reduced cost and improved performance (Chapter 5).

Counterpoints

Treating ICL as magic without understanding the mechanism. Engineers add shots mechanically to improve performance without recognizing that cost scales linearly with shots and that stronger models may not benefit at all. Microsoft (2023) found GPT-4 shows only limited improvement from few-shot on general tasks. Empirical determination per model per task is required (Chapter 5).

Relying on system prompts for security without instruction-hierarchy finetuning. A developer assuming system prompt instructions are enforced by architecture will discover that without finetuning, those restrictions are just tokens at position 0 — they have no mechanical priority over user input. Wallace et al. (2024) required explicit finetuning to achieve the 63% robustness improvement; that is not the default model behavior (Chapter 5).

Ignoring "lost in the middle" when stuffing long prompts. Engineers who assume a 2M-token context window means equal attention to all 2M tokens will find that critical instructions buried in the middle are effectively ignored. Window size is not retrieval quality. The RULER benchmark (Hsieh et al., 2024) measures this gap directly (Chapter 5).

Key Quotes

"Models trained on next-token prediction have seen billions of examples of humans demonstrating tasks before performing them... When you add examples to a prompt, you are activating behavioral patterns the model already learned; you are not teaching it new information." (Chapter 5)

"The system prompt has no architectural privilege — any performance advantage comes from positional priority... [and] post-training instruction hierarchy." (Chapter 5)

"CoT reduces hallucinations... hallucinations are often plausible-sounding tokens generated without sufficient contextual constraint. CoT adds constraint tokens before the high-risk answer region." (Chapter 5)

"Write your system prompt assuming that it will one day become public. Proprietary prompts are... 'more of a liability than a competitive advantage' — they require maintenance with every model update." (Chapter 5)

Rules of Thumb

Exhaust prompt engineering before reaching for RAG or finetuning; the cost difference is an order of magnitude.
Add few-shot examples only when the model lacks training exposure to the format or domain — verify empirically whether shots help for your model.
Use CoT for any task where the answer depends on multi-step reasoning; the latency cost is real but so is the hallucination reduction.
Place the highest-priority instructions at the beginning or end of the context window — never bury them in the middle.
Do not rely on system prompt position for security enforcement; pair restrictions with instruction-hierarchy finetuning or treat them as soft guidance only.

Related References

Prompt Attacks and Defense-in-Depth - Security aspects of prompting
The Three-Axis Model and AI Engineering Discipline - Prompt engineering as Axis 1
RAG, Agents, and Context Construction - Axis 2 when prompting is exhausted