Mastering Claude Code: Real-World Projects, Prompts, and Workflows for AI-Powered Development

Cost and Latency Optimization — Tokens as a Financial Control

Key Principle

Tokens are the unit of both cost and latency, so prompt structure is a financial control, not just a quality one. Total cost = (input tokens × input rate) + (output tokens × output rate), scaling with model capability and task complexity. Without this per-token mental model teams hit "unpleasant surprises," because cost is invisible at the moment of spending and accrues silently across multi-turn sessions and CI loops (Chapter 12).
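As a worked instance of the cost formula, the sketch below plugs in the per-1K rates that these notes attribute to the book's "late 2025 estimates". The rate table, the function name estimate_cost, and the model keys are illustrative assumptions, not a real pricing API; confirm live pricing before using any of the numbers.

```python
# Rates are the book's "late 2025 estimates" in USD per 1K tokens (input, output).
# They are illustrative only and are likely to differ from current pricing.
RATES_PER_1K = {
    "haiku": (0.0008, 0.004),
    "sonnet": (0.003, 0.015),
    "opus": (0.010, 0.050),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return (input tokens x input rate) + (output tokens x output rate) in USD."""
    input_rate, output_rate = RATES_PER_1K[model]
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate


# The 2k-in / 3k-out Sonnet call used as the book's reference example.
cost = estimate_cost("sonnet", input_tokens=2000, output_tokens=3000)
print(f"${cost:.3f}")  # $0.051
```

Output tokens dominate this example (about $0.045 of the $0.051), which is why trimming requested output can matter as much as trimming context.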

Why This Matters

The same context-richness lever that improves quality also drives cost. Spend compounds quietly: at CI and multi-agent scale, 6-cent tasks add up to real money, and silent compounding is the failure mode. Latency follows the same logic: it is driven by token-processing time, network latency, concurrency and rate limits, and redundant round-trips. The discipline is to make spend a planned investment rather than a reactive expense, and to measure everything, because "you can't manage what you don't measure" (Chapter 12).

Good Examples

  • Information Density: intent per token, not short prompts. Token efficiency means conveying intent in the fewest words necessary, not minimizing word count for its own sake. A verbose FastAPI prompt restructured into Task:/Requirements: bullets dropped from 175 to 105 tokens (40%) with identical output, which indicates the cut tokens carried no signal. Conversational filler ("please," "make sure") costs tokens and adds no model signal (Chapter 12).
  • Cost-aware workflow / AI Budgeting. Estimate cost before execution, compare to a per-task threshold, and on breach either downshift the model (Sonnet→Haiku) or trim context — then log the decision. The CI pattern communicates the budget decision to the pipeline via exit code (e.g., sys.exit(42) = "switch") (Chapter 12).
  • Latency optimization set. Each lever targets one factor: prompt-hash caching (SHA-256 → return stored response) kills redundant calls (up to ~80% repeat-latency cut); batching collapses N round-trips into 1; async overlaps network waits; context summarization shrinks token-processing time; connection reuse drops handshake overhead. Critical complication: naive parallelism triggers throttling, so concurrency must be rate-aware (Chapter 12).
  • Modular / system prompting. Define a fixed SYSTEM_PROMPT once and inject only the variable task parts — cuts recurring context 30–50% over long sessions. Same lever as caching: stop resending what hasn't changed (Chapter 12).
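The prompt-hash caching lever from the list above can be sketched as follows. This is a minimal in-memory version under stated assumptions: cached_call, prompt_key, and fake_model are hypothetical names, the model call is a stand-in rather than a real client, and a production cache would also need expiry and persistence.

```python
import hashlib
from typing import Callable, Dict

_cache: Dict[str, str] = {}  # prompt hash -> stored response


def prompt_key(model: str, prompt: str) -> str:
    """SHA-256 over model + prompt, so the same prompt on another model gets its own key."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


def cached_call(model: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Return the stored response for a repeated prompt; call the model only on a miss."""
    key = prompt_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]


# Hypothetical stand-in for a real API call; it records each time it is invoked.
calls = []


def fake_model(model: str, prompt: str) -> str:
    calls.append(prompt)
    return f"response to: {prompt}"


first = cached_call("sonnet", "Summarize README.md", fake_model)
second = cached_call("sonnet", "Summarize README.md", fake_model)
print(len(calls))  # 1
```

Only exact repeats hit the cache, because any change to the prompt text changes the hash. That is also why a fixed system prompt helps: stable text produces stable keys.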

Counterpoints

  • Equating short with efficient. A terse but vague prompt forces re-runs; brevity that strips intent raises total cost. Optimize intent-per-token, not length (Chapter 12).
  • Flying blind on spend. Without usage/performance/audit metrics you cannot budget, tune, or comply — "you can't manage what you don't measure" (Chapter 12).
  • Parallelizing without rate awareness. Naive concurrency triggers HTTP 429 throttling; concurrency is itself one of the latency factors and must be controlled (Chapter 12).
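One way to make concurrency rate-aware is to cap in-flight requests with a semaphore and back off when throttled. The sketch below assumes a hypothetical RateLimited exception standing in for an HTTP 429 response and a fake_send coroutine in place of a real client; the concurrency cap, retry count, and delays are arbitrary illustrative values.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the API."""


async def call_with_limits(task_id, send, semaphore, max_retries=3, base_delay=0.01):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        for attempt in range(max_retries + 1):
            try:
                return await send(task_id)
            except RateLimited:
                if attempt == max_retries:
                    raise
                # Exponential backoff with jitter; the slot stays held so a
                # throttled task does not free capacity for more traffic.
                await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


async def run_all(task_ids, send, max_concurrency=2):
    semaphore = asyncio.Semaphore(max_concurrency)
    jobs = [call_with_limits(t, send, semaphore) for t in task_ids]
    return await asyncio.gather(*jobs)


# Fake client: tracks peak concurrency and throttles task 0 exactly once.
stats = {"in_flight": 0, "peak": 0, "throttled_once": False}


async def fake_send(task_id):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.001)
        if task_id == 0 and not stats["throttled_once"]:
            stats["throttled_once"] = True
            raise RateLimited()
        return task_id * 2
    finally:
        stats["in_flight"] -= 1


results = asyncio.run(run_all(range(5), fake_send, max_concurrency=2))
print(results)  # [0, 2, 4, 6, 8]
```

The overlap of network waits is preserved, but the peak number of simultaneous requests never exceeds max_concurrency, and a throttled request retries instead of failing the batch.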

Key Quotes

"Token efficiency isn't about writing short prompts — it's about writing smart ones... information density: the ability to convey your intent to Claude clearly, with the fewest words necessary." — Kilian Voss, Chapter 12

"You can't manage what you don't measure." — Kilian Voss, Chapter 12

Rules of Thumb

  • AI Budgeting loop: estimate → compare to threshold → trim context or switch model → log the decision.
  • AI Observability — three metric families map to three needs: usage metrics (tokens/cost) → budgeting; performance metrics (latency/errors/retries) → tuning; audit logs (prompts/model/user) → compliance. Per-request JSON logging is the substrate governance later depends on.
  • Cache by prompt hash to eliminate repeat calls; batch to collapse round-trips; reuse connections; summarize context to cut token-processing time.
  • Define the system prompt once and inject only variable parts to cut recurring context 30–50%.
  • Pricing, model, and context-window numbers in the book are the author's "late 2025 estimates" and likely do not match the real product (e.g., ~4 chars/token; Haiku $0.0008/$0.004, Sonnet $0.003/$0.015, Opus $0.010/$0.050 per 1K in/out; ~200k context; a 2k-in/3k-out Sonnet call ≈ $0.051). Always confirm against live pricing.
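The budgeting loop and the per-request JSON logging from the rules above can be sketched together. Everything here is an assumption for illustration: the function names, the log fields, the threshold, and the fallback model are hypothetical, and the rates are the book's estimates rather than live pricing. Exit code 42 follows the "switch" example cited from Chapter 12.

```python
import json

SWITCH_MODEL_EXIT_CODE = 42  # the chapter's example exit code meaning "switch"

# Book estimates in USD per 1K tokens (input, output); confirm against live pricing.
RATES_PER_1K = {"haiku": (0.0008, 0.004), "sonnet": (0.003, 0.015)}


def estimate_cost(model, input_tokens, output_tokens):
    input_rate, output_rate = RATES_PER_1K[model]
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate


def budget_decision(model, input_tokens, output_tokens, threshold_usd, fallback="haiku"):
    """Estimate, compare to the threshold, pick an action, and log the decision."""
    estimate = estimate_cost(model, input_tokens, output_tokens)
    if estimate <= threshold_usd:
        action, chosen = "proceed", model
    else:
        action, chosen = "switch", fallback
    record = {
        "requested_model": model,
        "chosen_model": chosen,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "estimated_usd": round(estimate, 6),
        "threshold_usd": threshold_usd,
        "action": action,
    }
    print(json.dumps(record))  # one JSON line per decision, for later auditing
    return record


record = budget_decision("sonnet", 2000, 3000, threshold_usd=0.02)
exit_code = SWITCH_MODEL_EXIT_CODE if record["action"] == "switch" else 0
# In a CI step, the script would end with sys.exit(exit_code) so the
# pipeline can branch on the budget decision.
```

This sketch only downshifts the model; trimming context is the other remedy named in the loop and would slot in as a second branch before the switch.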

Related References