Designing Large Language Model Applications

The Core Framework

The central problem is the prototype-to-production gap: demos are easy to build, but production systems require a holistic understanding of every LLM ingredient
In order of impact: data quality > model selection > prompt engineering
LLM limitations (hallucination, reasoning failures, bias) are design constraints, not bugs — engineer around them
Production deployment is a systems engineering problem: routers, cascades, guardrails, verifiers, orchestration
Default to explicit (pre-programmed) interaction paradigms over autonomous agents for anything mission-critical
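
One of the systems components named above, the LLM cascade, can be sketched in a few lines. This is a minimal illustration and not a method taken from the book: the `Tier` class, the `cascade` function, the stand-in models, and the confidence thresholds are all hypothetical, and it assumes each model can report a confidence score in [0, 1], which in practice has to come from a verifier, token log-probabilities, or a similar signal.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    model: Model
    threshold: float  # minimum confidence needed to accept this tier's answer


def cascade(prompt: str, tiers: List[Tier]) -> Tuple[str, str]:
    """Try tiers from cheapest to most expensive; return (tier_name, answer)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    answer = ""
    for tier in tiers:
        answer, confidence = tier.model(prompt)
        if confidence >= tier.threshold:
            return tier.name, answer
    # No tier was confident enough: fall back to the last (largest) tier's answer.
    return tiers[-1].name, answer


# Hypothetical stand-in models, used only to make the sketch runnable.
def small_model(prompt: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short prompts.
    return "short answer", 0.9 if len(prompt) < 20 else 0.3


def large_model(prompt: str) -> Tuple[str, float]:
    return "detailed answer", 0.95


TIERS = [
    Tier("small", small_model, threshold=0.8),
    Tier("large", large_model, threshold=0.0),  # last tier always accepts
]
```

Calling `cascade("2 + 2?", TIERS)` stays on the small tier, while a long prompt escalates to the large one. The design choice worth noting is that the escalation rule lives in ordinary code, so it can be logged, tested, and tuned independently of any model.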

Quick Lookup

Situation	Do This	Avoid This
Starting a new LLM project	Clean data, check tokenization, build internal benchmarks	Chasing leaderboard rankings or picking the biggest model
Model behaving oddly on specific inputs	Check tokenization of those inputs first	Debugging at prompt or architecture level only
Need structured output	Use constrained decoding (Jsonformer, LMQL)	Hoping the model complies via prompting alone
Building a knowledge-grounded app	Implement RAG with hybrid search (BM25 + embeddings)	Relying on parametric memory or dumping everything into long context
Need high reliability	Generate n > 1 completions, use self-consistency voting	Single generation with no verification
Optimizing cost at scale	Use LLM cascades (smallest model first, escalate on low confidence)	Defaulting to the largest model for all requests
Building an agent	Use explicit (pre-programmed) tool orchestration	Autonomous agents for mission-critical applications
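
The "Need high reliability" row can be illustrated with a small sketch of self-consistency voting: sample n completions, normalize them, and keep the majority answer. The function names, the normalization rule, and the canned sampler below are assumptions made for illustration; a real system would call an LLM with sampling enabled and would usually extract the final answer from each completion before voting.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def self_consistency(
    generate: Callable[[str], str], prompt: str, n: int = 5
) -> Tuple[str, float]:
    """Sample n completions and return (majority_answer, agreement_ratio)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [normalize(generate(prompt)) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def make_fake_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a sampled LLM call: cycles through canned outputs."""
    stream = cycle(list(outputs))

    def generate(prompt: str) -> str:
        return next(stream)

    return generate


sampler = make_fake_sampler(["42", " 42 ", "41", "42", "Forty-two"])
answer, agreement = self_consistency(sampler, "What is 6 * 7?", n=5)
```

Here three of the five normalized samples agree on "42", so the agreement ratio is 0.6. A low ratio is a useful signal in its own right: it can trigger escalation to a larger model or a human review instead of returning the majority answer.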

The Key Insight

"Plenty of software frameworks have emerged that enable rapid prototype development of LLM applications. However, advancing from prototypes to production-grade applications is a road much less traveled, and is still a very challenging task." — Suhas Pai, Preface
