Rules of Thumb - Designing Large Language Model Applications

Key Principle

Collected heuristics and decision rules from across the book for quick reference during LLM application development. Each rule is grounded in a specific causal mechanism explained in the relevant chapter.

Why This Matters

Production LLM development involves hundreds of small decisions. Evidence-based heuristics prevent common mistakes and speed up those decisions without re-deriving the reasoning each time.

Good Examples

Data & Preparation

Data quality > model selection > prompt engineering — in order of impact (Ch. 2, 5)
Check tokenization first when debugging unexpected model behavior on specific inputs (Ch. 3)
Budget for 2-4x more tokens in non-English languages due to tokenizer inequality (Ch. 3)
Audit for contamination before trusting any benchmark result (Ch. 2)
Document parsing is the real bottleneck — invest here before model tuning (Ch. 11)
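
The token-budget rule above can be turned into a quick planning check. The following is a minimal sketch, assuming you already have an English token count from your own tokenizer; the function names and the 8,000-token limit in the usage lines are illustrative assumptions, and only the 2-4x multiplier range comes from the rule itself.

```python
import math


def non_english_token_budget(
    english_tokens: int,
    multiplier_range: tuple[float, float] = (2.0, 4.0),
) -> tuple[int, int]:
    """Return (low, high) token estimates for the same content in a non-English language.

    The default 2-4x range is the heuristic from the list above, not a measured value.
    Measure with your real tokenizer before committing to a budget.
    """
    if english_tokens < 0:
        raise ValueError("english_tokens must be non-negative")
    low, high = multiplier_range
    return math.ceil(english_tokens * low), math.ceil(english_tokens * high)


def fits_context(english_tokens: int, context_limit: int) -> bool:
    """Conservative check: does the worst-case estimate still fit the limit?"""
    _, high = non_english_token_budget(english_tokens)
    return high <= context_limit


low, high = non_english_token_budget(1_500)
print(low, high)                    # 3000 6000
print(fits_context(1_500, 8_000))   # True
```

A check like this is only a first-pass estimate; actual ratios vary by language and tokenizer.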

Model Selection & Evaluation

Build internal benchmarks on your task distribution (Ch. 5)
Prefer open source when you need logit access for debugging or confidence estimation (Ch. 5)
Benchmark both base and instruct variants — instruction tuning can regress specific capabilities (Ch. 5)
Don't trust LLM-as-evaluator — "We have no idea what evaluation criteria it uses" (Ch. 5)
8K tokens is the practical context window ceiling — performance degrades beyond this (Ch. 5)
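
An internal benchmark does not need heavy tooling to be useful. The sketch below assumes each model is exposed as a plain function from prompt to answer and that exact match is an acceptable metric for the task; `base_model`, `instruct_model`, and the three examples are hypothetical stand-ins, not real models or data.

```python
from typing import Callable

Example = tuple[str, str]  # (input text, expected answer)


def exact_match_accuracy(model: Callable[[str], str], examples: list[Example]) -> float:
    """Fraction of examples where the normalized model output equals the expected answer."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return correct / len(examples)


# Hypothetical stand-ins for a base and an instruction-tuned variant.
def base_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def instruct_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris ", "3*3": "9"}.get(prompt, "")


# In practice, draw these from your own task distribution.
examples = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

scores = {
    "base": exact_match_accuracy(base_model, examples),
    "instruct": exact_match_accuracy(instruct_model, examples),
}
print(scores)
```

Running both variants through the same harness makes regressions on specific capabilities visible as per-variant score differences.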

Generation & Output

Use constrained decoding (not prompting) for structured output requirements (Ch. 5)
Generate n > 1 completions and post-process for high-reliability tasks (Ch. 1)
CoT helps reasoning but can hurt knowledge tasks and significantly increases cost (Ch. 1)
Place critical context at beginning or end of prompts, never the middle (Ch. 5, 12)
Version-control prompts alongside model versions to detect prompt drift (Ch. 1)
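
One simple way to post-process n > 1 completions is a majority vote over normalized answers. This is a minimal sketch of that idea, not necessarily the post-processing the book recommends; `fake_generate` is a hypothetical stand-in for a sampled model call, and the canned outputs are invented for illustration.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def sample_and_vote(generate: Callable[[str], str], prompt: str, n: int = 5) -> tuple[str, float]:
    """Sample n completions and return (most common answer, share of samples agreeing)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [generate(prompt).strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Hypothetical generator that replays canned outputs instead of calling a model.
_canned = cycle(["42", "42 ", "41", "42", "40"])


def fake_generate(prompt: str) -> str:
    return next(_canned)


answer, agreement = sample_and_vote(fake_generate, "What is 6 * 7?", n=5)
print(answer, agreement)  # 42 0.6
```

The agreement share can serve as a rough external signal: a low value suggests the task needs a stronger model or human review.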

Retrieval & RAG

Start with hybrid search (BM25 + embeddings) — often sufficient (Ch. 12)
Use hard negatives for embedding fine-tuning; random negatives teach nothing (Ch. 11)
RAG is bottlenecked by retrieval quality, not generation quality (Ch. 12)
For models with ≤7B parameters, almost always prefer RAG over parametric memory (Ch. 12)
RAG outperforms fine-tuning for knowledge-intensive tasks, but they're complementary (Ch. 12)
Long-context ≠ no retrieval — "akin to storing all files in RAM instead of disk" (Ch. 12)
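
Hybrid search needs some way to merge the BM25 ranking with the embedding ranking. The sketch below uses reciprocal rank fusion, which is one common choice and an assumption here rather than the method the book prescribes. The document IDs are invented, and `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a BM25 retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_c", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_c']
```

Rank-based fusion avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.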

System Architecture

Default to explicit interaction paradigm — avoid autonomous agents for mission-critical tasks (Ch. 10)
Start with the smallest model that could work; escalate via cascades (Ch. 13)
Never trust LLM self-assessed confidence — use external signals (Ch. 13)
Use cascades for high-stakes tasks (fallback safety net) and routers for latency-sensitive tasks (Ch. 13)
KISS principle for agents — complexity increases failure surface area (Ch. 10)
Every component must justify its latency cost with measurable improvement (Ch. 10)
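
The cascade and external-signal rules above combine naturally: try the smallest model first and escalate only when an external check rejects its answer. The following is a rough sketch under stated assumptions; `small_model`, `large_model`, and the JSON check with a `total` field are hypothetical, and a real verifier would be task-specific.

```python
import json
from typing import Callable, Optional

Model = Callable[[str], str]
Verifier = Callable[[str, str], bool]


def run_cascade(
    prompt: str,
    models: list[tuple[str, Model]],
    verify: Verifier,
) -> tuple[Optional[str], Optional[str]]:
    """Try models in order (smallest first); return the first (name, answer) that verifies.

    Returns (None, None) if every model fails the external check.
    """
    for name, model in models:
        answer = model(prompt)
        if verify(prompt, answer):
            return name, answer
    return None, None


def has_total_field(prompt: str, answer: str) -> bool:
    """External signal: the answer must be a JSON object containing a 'total' key."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "total" in parsed


# Hypothetical stand-ins for a cheap model and a more capable fallback.
def small_model(prompt: str) -> str:
    return "The total is 12."


def large_model(prompt: str) -> str:
    return '{"total": 12}'


used, result = run_cascade(
    "Extract the invoice total as JSON.",
    [("small", small_model), ("large", large_model)],
    has_total_field,
)
print(used, result)  # large {"total": 12}
```

Note that the escalation decision never asks the model how confident it is; it relies only on a check the application controls.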

Inference & Cost

K-V caching is essential for any multi-turn or RAG application (Ch. 9)
1,000 high-quality examples can create a strong distillation set (Ch. 9)
Consider speculative decoding when latency is the binding constraint (Ch. 9)
Prompts aren't magical incantations — manage expectations about prompt engineering (Ch. 1)

Counterpoints

These heuristics are defaults, not absolutes. Each has documented exceptions in the referenced chapters.
The LLM landscape evolves rapidly — heuristics about specific model sizes or capabilities may date faster than architectural principles.

Key Quotes

"It is very important to manage one's expectations about the effectiveness of prompt engineering. Prompts aren't magical incantations that unlock hidden LLM capabilities." — Suhas Pai, Chapter 1

"The fine-grained choice of LLM usually isn't the most important criteria determining the success of your task." — Suhas Pai, Chapter 5

Related References

Implementation Playbook - Structured decision framework
The Prototype-to-Production Gap - The thesis these rules operationalize