Designing Large Language Model Applications

Inference Optimization Taxonomy

inference kv-cache distillation quantization speculative-decoding

Key Principle

Inference optimization makes LLM deployment economically and physically viable. Techniques are organized by the bottleneck they attack: compute (K-V caching, distillation), decoding speed (speculative decoding), and storage (quantization). Each involves a distinct resource tradeoff, and the right combination depends on your deployment constraints.

Why This Matters

Without inference optimization, production LLM deployment is often prohibitively expensive or slow. A model that works in a demo at 5 requests/minute can fail at 5,000 requests/minute. These techniques are systems engineering responses to architectural limitations: caching and speculative decoding change how a model is served, while distillation and quantization shrink the model being served.

Good Examples

  • K-V Caching: Stores the attention keys and values of tokens that have already been processed, so each generation step computes them only for the new token. The same cache can be reused across requests that share a static prompt prefix (system instructions, manuals, RAG context). Without it, every step reprocesses the full context, so inference costs scale quadratically with conversation length. Caching also makes massive few-shot prompting affordable where it would otherwise be prohibitively expensive: "a lightweight alternative to fine-tuning" (Chapter 9).
  • Knowledge Distillation: A smaller student model is trained to replicate a larger teacher's behavior. Zhou et al. show that even 1,000 very high-quality examples can make a strong distillation set; data quality dominates quantity (Chapter 9). Five techniques, ordered by access requirements: unsupervised generation, data augmentation, intermediate representations, teacher feedback, self-teaching.
  • Speculative Decoding: A cheap draft model generates multiple candidate tokens; the expensive main model verifies them in a single forward pass (verification is parallelizable; generation is not). Variants: DistillSpec, self-speculative (subset of layers), REST (retrieval of common phrases) (Chapter 9).
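
The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a minimal greedy version under stated assumptions, not the method of any particular library: a "model" is just a function returning the next token, and the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical. A real system would score all drafted positions in one batched forward pass, and sampling-based variants replace the equality check with a probabilistic acceptance rule.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token sequence to its
# greedy next token. Real models return logits; this keeps the sketch small.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding.

    Returns the generated sequence and the number of verification rounds.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens one at a time (cheap, sequential).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. The target model checks every drafted position. A real system
        #    scores all positions in one batched forward pass; this sketch
        #    loops instead and counts the loop as a single round.
        rounds += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if expected != draft[i]:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def greedy_decode(target_next: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except that it guesses wrong right after a 2.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 9 if tokens[-1] == 2 else (tokens[-1] + 1) % 10
```

Calling `speculative_decode(toy_target, toy_draft, [0], max_new_tokens=8)` returns the same tokens as `greedy_decode(toy_target, [0], 8)`, but needs fewer verification rounds than the baseline needs target calls. The saving depends entirely on how often the draft agrees with the target.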

Counterpoints

  • K-V cache tradeoff: Compute savings come at the cost of storage. Caches grow linearly with sequence length and must eventually be evicted (e.g., the five-minute default TTL of Claude's prompt cache), creating a three-way tradeoff among latency, storage, and staleness (Chapter 9).
  • Distillation risks: Every form of distillation risks capability degradation or catastrophic forgetting. Whether you have white-box or black-box access to the teacher constrains which techniques are available (Chapter 9).
  • Quantization precision: BF16 preserves FP32's exponent range (important for training stability) but requires newer GPUs. Activation quantization is harder than weight quantization because activation distributions aren't known before inference (Chapter 9).
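
To make the weight-versus-activation point concrete, here is a minimal sketch of symmetric absmax int8 quantization in plain Python. It is an illustration under stated assumptions, not the scheme of any specific library: the function names are hypothetical, one scale is used for the whole tensor, and real implementations work on tensors and often use per-channel scales. The scale is computed from the largest weight, which works because weights are fixed before inference; activations offer no such known range in advance.

```python
import random
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 codes and the stored scale."""
    return [v * scale for v in quantized]


# Hypothetical demo data: small Gaussian weights, as a stand-in for one layer.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four (FP32) or two (BF16), at the price of a rounding error of at most half the scale. A single outlier weight inflates the scale for the whole tensor, which is one reason lower bit widths such as int4 call for careful quality validation.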

Key Quotes

"Caching can also enable adding a lot of few-shot examples in the prompt. This can sometimes be a lightweight alternative to fine-tuning." — Suhas Pai, Chapter 9

Rules of Thumb

  • Use K-V caching for any multi-turn or RAG application — it's essential, not optional
  • For distillation, invest in data quality over data quantity (1,000 high-quality examples can suffice)
  • Consider speculative decoding when latency is the binding constraint
  • Quantize weights to BF16/int8 as a default; go to int4 only with careful quality validation
  • Keep weak-to-strong generalization in mind: a smaller teacher can sometimes improve a larger student by eliciting the student's latent knowledge
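
The first rule rests on the fact that K-V caching changes the cost of generation, not its output. The sketch below illustrates this with a single attention head in plain Python; it is a simplified illustration, not a transformer implementation. The helper names, the random weights, and the use of fixed input vectors in place of real token embeddings are all assumptions made for the example.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def project(x: Vector, matrix: Matrix) -> Vector:
    """Compute x @ matrix, where matrix is a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, matrix)) for j in range(len(matrix[0]))]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(len(values[0]))]


def decode_without_cache(inputs, w_q, w_k, w_v) -> Tuple[List[Vector], int]:
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(inputs)):
        keys = [project(x, w_k) for x in inputs[: t + 1]]
        values = [project(x, w_v) for x in inputs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(inputs[t], w_q), keys, values))
    return outputs, projections


def decode_with_cache(inputs, w_q, w_k, w_v) -> Tuple[List[Vector], int]:
    """Compute each token's key and value once and append them to a cache."""
    outputs, projections = [], 0
    key_cache: List[Vector] = []
    value_cache: List[Vector] = []
    for x in inputs:
        key_cache.append(project(x, w_k))
        value_cache.append(project(x, w_v))
        projections += 2
        outputs.append(attend(project(x, w_q), key_cache, value_cache))
    return outputs, projections


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical demo data: 16 "tokens" with 4-dimensional embeddings.
random.seed(0)
DIM, STEPS = 4, 16
w_q, w_k, w_v = (random_matrix(DIM, DIM) for _ in range(3))
inputs = random_matrix(STEPS, DIM)
```

Both functions return the same attention outputs for `inputs`, but the uncached version performs n(n+1) key and value projections over n steps while the cached version performs 2n. The cached lists are also what must be stored and eventually evicted.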

Related References