Inference Optimization and Production Metrics - AI Engineering: Building Applications with Foundation Models

Key Principle

Production inference optimization navigates the Cost/Quality/Latency trilemma under the constraint that inference can account for up to 90% of total ML costs in deployed systems (Chapter 9). Two fundamentally different bottlenecks define the problem: prefill (processing all input tokens in parallel) is compute-bound, while decode (sequential one-token-at-a-time generation) is memory-bandwidth-bound because the entire model's weight matrices must traverse from HBM to compute units for each token generated. Applying the wrong optimization to the wrong phase wastes engineering effort entirely.

Goodput — requests per second that satisfy SLOs — is the correct optimization target, not raw throughput or GPU utilization. A service completing 100 requests/minute but satisfying SLOs on only 30 of them has goodput of 30 requests/minute (Chapter 9).
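
As a minimal sketch of the definition above, goodput can be computed by counting only the requests that meet every SLO. The record fields, the SLO thresholds, and the function name are illustrative assumptions, not values from Chapter 9.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestRecord:
    ttft_ms: float  # time to first token (hypothetical field name)
    tpot_ms: float  # time per output token (hypothetical field name)


def goodput(records: Iterable[RequestRecord], window_s: float,
            ttft_slo_ms: float, tpot_slo_ms: float) -> float:
    """Requests per second that met both SLOs during the measurement window."""
    ok = sum(1 for r in records
             if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms)
    return ok / window_s


# 100 requests completed in one minute; only 30 fall within the assumed SLOs.
records = ([RequestRecord(ttft_ms=150, tpot_ms=80)] * 30
           + [RequestRecord(ttft_ms=900, tpot_ms=80)] * 70)
per_second = goodput(records, window_s=60, ttft_slo_ms=200, tpot_slo_ms=100)
print(f"goodput: {per_second * 60:.0f} requests/minute")
```

Raw throughput for the same window would be 100 requests/minute; the gap between the two numbers is the work that was done but did not serve users acceptably.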

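KV cache stores the attention key/value vectors computed for previous tokens so they are not recomputed at every step, reducing attention cost from O(n²) to O(n) per generated token, but at substantial memory cost. Size formula: 2 × B × S × L × H × M, where B is batch size, S is sequence length, L is the number of transformer layers, H is the model dimension, M is the bytes per cached value, and the leading 2 covers keys and values. At scale this dominates: a 500B+ parameter model at batch 512, context 2048 produces a KV cache of 3TB, 3× the model's own weights (Chapter 9).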
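
The size formula can be evaluated directly. The sketch below assumes standard multi-head attention (grouped-query or multi-query attention would shrink the result), and the configuration used in the example (40 layers, model dimension 5120, batch 32, context 2048, 2 bytes per value) is an illustrative assumption for a 13B-class model, not a figure taken from the text above.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   model_dim: int, bytes_per_value: int) -> int:
    """KV cache size: 2 x B x S x L x H x M.

    The leading 2 accounts for storing both a key and a value vector
    per token, per layer.
    """
    return 2 * batch_size * seq_len * num_layers * model_dim * bytes_per_value


# Hypothetical configuration, chosen only to illustrate the arithmetic.
size = kv_cache_bytes(batch_size=32, seq_len=2048, num_layers=40,
                      model_dim=5120, bytes_per_value=2)
print(f"KV cache: {size / 1e9:.1f} GB")
```

Because the result scales linearly in both batch size and sequence length, doubling either one doubles the cache.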

Speculative decoding uses a smaller draft model to generate K candidate tokens, then the target model verifies them in a single parallel pass (verification has the same parallelizable profile as prefill). Because decode is memory-bandwidth-bound, the FLOPs needed for verification would otherwise sit idle, so verification adds little extra latency. Rejected candidates are replaced by the target model's own token, so output quality is unchanged.
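
The draft-then-verify loop can be sketched with toy models. This is the greedy-verification variant only (sampled decoding uses a rejection-sampling acceptance rule instead), and `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names invented for illustration. The per-position target calls stand in for what a real system would compute in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Stand-in for the large model's greedy next token (hypothetical)."""
    return (31 * sum(context) + len(context)) % 50


def toy_draft(context: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target most of the time."""
    guess = toy_target(context)
    return guess if len(context) % 3 else (guess + 1) % 50


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n
    while len(tokens) < goal:
        # 1. Draft model proposes k tokens sequentially (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Target checks every position; in a real system this is one pass.
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            # Stop at the first mismatch, or after the bonus (k+1)-th token.
            if i == k or proposal[i] != expected:
                break
        tokens.extend(accepted)
    return tokens[:goal]


prompt = [3, 1, 4]
fast = speculative_decode(toy_target, toy_draft, prompt, n=20)
slow = greedy_decode(toy_target, prompt, n=20)
print(fast == slow)
```

Every emitted token is the target model's own choice given an already-verified context, which is why the result matches plain greedy decoding; the draft only determines how many tokens are gained per round.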

Continuous batching (in-flight batching) allows completed requests to return immediately and vacated slots to be filled with new requests, rather than holding all slots until the longest response completes. This prevents short responses from queuing behind long ones, which is the primary failure mode of static batching.
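
The difference between the two schedulers shows up even in a toy simulation. The sketch below assumes all requests are queued at time zero, each output token costs one decode step, and step time does not depend on batch occupancy; these are simplifying assumptions, and the function names are hypothetical.

```python
import heapq
from typing import List


def static_batching(lengths: List[int], slots: int) -> List[int]:
    """Completion step per request when a batch returns only after its longest member."""
    done: List[int] = []
    clock = 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)               # the batch is held until the longest finishes
        done.extend([clock] * len(batch))
    return done


def continuous_batching(lengths: List[int], slots: int) -> List[int]:
    """Completion step per request when a freed slot is refilled immediately."""
    free_at = [0] * slots                 # min-heap of times at which each slot frees up
    heapq.heapify(free_at)
    done: List[int] = []
    for n in lengths:
        start = heapq.heappop(free_at)    # earliest available slot
        finish = start + n
        heapq.heappush(free_at, finish)
        done.append(finish)
    return done


lengths = [2, 8, 2, 8, 2, 2]              # output lengths in tokens (illustrative)
static_done = static_batching(lengths, slots=2)
cont_done = continuous_batching(lengths, slots=2)
print("static:    ", static_done, "total steps:", max(static_done))
print("continuous:", cont_done, "total steps:", max(cont_done))
```

In this example the first 2-token request completes at step 8 under static batching but at step 2 under continuous batching, and the whole queue drains in 12 steps instead of 18.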

Why This Matters

Goodput is the correct metric because it ties infrastructure decisions directly to user experience. GPU utilization can be at 100% while most requests miss their latency SLOs — the model is busy but the service is failing. Raw throughput similarly ignores whether the throughput is being delivered within acceptable bounds. Goodput forces the engineering question to be: "how many users are getting acceptable responses per second?" rather than "how hard is my hardware working?" This reframing changes which optimizations are worth pursuing: it is often better to sacrifice maximum throughput to reduce tail latency than to maximize average requests/second.

That decode is memory-bandwidth-bound rather than compute-bound has deep architectural implications. It means adding FLOP/s alone (more arithmetic throughput without more memory bandwidth) yields little improvement for the decode phase. The bottleneck is how fast model weights and cached keys/values can be read from memory, and this is a direct consequence of autoregressive generation: each new token requires another full pass over the weights. This is why quantization (reducing bytes per parameter) and KV cache optimization (reducing bytes transferred per token) are so effective: they directly attack the binding constraint. It also explains why continuous batching improves throughput: each read of the weights is shared across every active request in the batch, and slots stay filled with useful work instead of sitting idle while long responses finish.
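
A back-of-the-envelope ceiling makes the bandwidth argument concrete: at batch size 1, every generated token requires reading all weights once, so tokens/s cannot exceed bandwidth divided by model size in bytes. The sketch below ignores KV cache reads and all other overheads, and the 7B model and 2 TB/s bandwidth are illustrative inputs; treat the result as an upper bound, not a prediction.

```python
def max_decode_tokens_per_s(param_count: float, bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when weight reads are the only cost."""
    return bandwidth_bytes_per_s / (param_count * bytes_per_param)


PARAMS = 7e9          # illustrative 7B-parameter model
BANDWIDTH = 2e12      # illustrative 2 TB/s of memory bandwidth

ceilings = {}
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    ceilings[label] = max_decode_tokens_per_s(PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{label}: at most {ceilings[label]:.0f} tokens/s")
```

Halving the bytes per parameter doubles the ceiling, while adding FLOP/s does not appear in the formula at all.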

Good Examples

Diagnosing bottleneck type with MFU/MBU: Before choosing an optimization, compute MBU = (param_count × bytes/param × tokens/s) / theoretical bandwidth and MFU = observed throughput / theoretical throughput at peak FLOP/s. A 7B-parameter FP16 model running at 100 tokens/s on an A100-80GB moves 7B × 2 bytes × 100 = 1.4 TB/s, for an MBU of ~70% (1.4 TB/s / 2 TB/s). High MBU confirms the workload is memory-bandwidth-bound; applying compute-bound optimizations will not help. High MFU with low MBU identifies a compute bottleneck. This determines whether speculative decoding (requires idle FLOPs), batching (increases MBU), or architecture changes are the productive path.
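
The two ratios are simple enough to compute in a few lines. The sketch below reproduces the 7B FP16 example; the function names are hypothetical, and the MFU inputs at the end are made-up numbers used only to show the call.

```python
def mbu(param_count: float, bytes_per_param: float, tokens_per_s: float,
        peak_bandwidth_bytes_per_s: float) -> float:
    """Model Bandwidth Utilization: achieved memory traffic / theoretical bandwidth."""
    achieved = param_count * bytes_per_param * tokens_per_s
    return achieved / peak_bandwidth_bytes_per_s


def mfu(observed_tokens_per_s: float, peak_tokens_per_s: float) -> float:
    """Model FLOP/s Utilization: observed throughput / throughput at peak FLOP/s."""
    return observed_tokens_per_s / peak_tokens_per_s


# 7B parameters, FP16 (2 bytes each), 100 tokens/s, 2 TB/s peak bandwidth.
example_mbu = mbu(7e9, 2, 100, 2e12)
print(f"MBU: {example_mbu:.0%}")

# Hypothetical MFU inputs, for illustration only.
example_mfu = mfu(observed_tokens_per_s=50, peak_tokens_per_s=200)
print(f"MFU: {example_mfu:.0%}")
```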

Setting per-request SLOs before choosing batching strategy: TTFT (Time to First Token) and TPOT (Time Per Output Token) are tradeable. If the application streams responses and users perceive start time as responsiveness, TTFT is the primary SLO. If the application returns complete responses (agentic, batch processing), total latency and throughput goodput matter more. LinkedIn found it possible to double or triple throughput via batching when willing to sacrifice TTFT and TPOT — but this trade-off is only acceptable if the SLO allows it. Defining SLOs before optimizing prevents optimizing the wrong metric.

Prompt caching for repeated system prompts: If 1 million daily API calls each include a 1,000-token system prompt, the service processes ~1 billion redundant tokens daily without prefix caching. Prefix caching reuses KV vectors for repeated prompt prefixes; Anthropic reports up to 90% cost reduction and up to 75% latency reduction for long repeated contexts (Chapter 9). This is a high-leverage, no-retraining optimization for any system with a fixed system prompt.
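
The saving can be illustrated by counting how many tokens actually go through prefill. The class below is a toy counter, not a serving engine: it stores a placeholder where a real system would keep the prefix's KV vectors, and the class and attribute names are hypothetical.

```python
from typing import Dict, List, Tuple


class PrefixCachingCounter:
    """Counts prefill tokens when identical system prompts are cached."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, ...], str] = {}
        self.prefill_tokens = 0

    def process(self, system_prompt: List[int], user_tokens: List[int]) -> None:
        key = tuple(system_prompt)
        if key not in self._cache:
            # First sighting: pay full prefill for the prefix, then remember it.
            self.prefill_tokens += len(system_prompt)
            self._cache[key] = "kv-placeholder"  # real systems store KV vectors here
        # The request-specific suffix always has to be processed.
        self.prefill_tokens += len(user_tokens)


system_prompt = list(range(1000))             # 1,000-token shared system prompt
requests = [[7] * 10, [8] * 10, [9] * 10]     # three 10-token user messages

counter = PrefixCachingCounter()
for user_tokens in requests:
    counter.process(system_prompt, user_tokens)

without_cache = sum(len(system_prompt) + len(u) for u in requests)
print("without cache:", without_cache, "tokens")
print("with cache:   ", counter.prefill_tokens, "tokens")
```

The cached prefix is paid for once, so the gap between the two counts grows with every additional request that shares the prompt.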

Counterpoints

Optimizing for GPU utilization instead of goodput: GPU utilization measures the percentage of time the GPU is active, not whether it is doing useful work within SLO bounds. A system can show 100% GPU utilization while delivering most responses outside latency targets. This antipattern leads to over-batching — accumulating requests to maximize hardware use — which increases TTFT beyond acceptable limits and destroys goodput for latency-sensitive workloads.

Using static batching in a high-variance request environment: Static batching holds all batch slots until the longest response completes. When response lengths vary widely (e.g., a mix of 100-token and 4,000-token responses), short responses queue behind long ones, wasting both latency and throughput. The model may be fully utilized while goodput collapses. Continuous batching eliminates this by immediately returning completed responses and filling vacated slots — it is the correct default for production LLM serving.

Applying compute-bound optimizations to a bandwidth-bound workload: Most current LLM inference is memory-bandwidth-bound at the decode phase. Optimizations that only raise arithmetic throughput, such as higher-FLOP/s hardware without more memory bandwidth or kernels tuned for peak FLOP/s, provide little improvement, and added parallelism that does not reduce the bytes each device must read can add latency from communication overhead. The diagnostic step (MFU/MBU measurement) must precede optimization selection.

Key Quotes

"A service completing 100 requests/minute but satisfying SLOs on only 30 of them has goodput of 30 requests/minute — the only number that matters for user experience." (Chapter 9)
"Inference cost can exceed training cost in deployed systems, accounting for up to 90% of total ML costs." (Chapter 9)
"A single output token has approximately the same latency impact as 100 input tokens." (Chapter 9)
"KV cache for a 500B+ parameter model at batch 512, context 2048 equals 3TB — 3× the model's own weights." (Chapter 9)

Rules of Thumb

Measure MFU and MBU before choosing an optimization; applying the wrong solution to the wrong bottleneck yields zero improvement.
Goodput (requests/second satisfying SLOs) is the production metric; raw throughput and GPU utilization are proxies that can mislead.
TPOT improvements below ~120 ms/token (about 8 tokens/second) produce little user-perceived benefit in streaming mode; reducing TTFT remains noticeable.
Use tensor parallelism to hit latency SLOs on large models; use replica parallelism to scale throughput once latency is acceptable.
Prefix caching is high-leverage and requires no retraining — implement it whenever a system prompt is repeated across requests.

Related References

Production Architecture and the Data Flywheel - SLO targets defined at system level
Foundation Model Internals - KV cache arises from attention mechanism
Finetuning, LoRA, and Model Merging - Quantization reduces both training and inference memory