Selecting and Evaluating LLMs - Designing Large Language Model Applications

Key Principle

Model selection is overrated relative to data quality, benchmark evaluation is broken, and the real differentiator of open source models is not cost but access to logits — which unlocks debugging, interpretability, and constrained generation. "Having full availability of all the token probabilities (logits) is a superpower" (Chapter 5).

Why This Matters

Teams select models based on leaderboard rankings, then discover the model fails on their actual task distribution. Benchmarks fail for compounding reasons: contamination, inconsistent methodology, provider optimization, and human evaluation biases. Without understanding these failure modes, model selection becomes cargo-cult engineering — choosing by reputation rather than evidence.

Good Examples

Logit access enables: (1) debugging why specific tokens were chosen, (2) constrained decoding (Jsonformer, LMQL, Guidance), (3) calibration and confidence estimation, and (4) self-consistency via multiple sampled generations with majority voting. Strictly, majority voting needs only repeated sampling; token probabilities are what the first three depend on. Proprietary APIs typically hide or restrict logit access (Chapter 5).
Constrained decoding solves the "excessively chatty" LLM problem by restricting the token space at each generation step: the LLM generates content tokens, but the tool enforces structural tokens (braces, keys, delimiters) deterministically. This is more reliable than prompting alone because it operates on the decoding process rather than hoping the model complies (Chapter 5).
Self-consistency prompting: Generate multiple completions with chain-of-thought (CoT) prompting and take a majority vote over the final answers. This works because correct reasoning paths are more likely to converge on the same answer than incorrect ones (Chapter 5).
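
The constrained decoding idea above can be sketched as a mask over the logits at one generation step. This is a toy illustration only, not how Jsonformer, LMQL, or Guidance are implemented: the vocabulary, the logit values, and the function names are all made up for the example.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_greedy_step(logits, allowed_ids):
    """Return the id of the highest-scoring token among allowed_ids.

    logits: raw scores, one per vocabulary id.
    allowed_ids: token ids the output format permits at this step.
    """
    if not allowed_ids:
        raise ValueError("allowed_ids must not be empty")
    # Disallowed tokens get -inf, so they can never be selected.
    masked = [s if i in allowed_ids else -math.inf for i, s in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical toy vocabulary and made-up logits: the model "wants" to be chatty.
VOCAB = ["Sure", "{", "}", '"name"', ":", '"Ada"']
logits = [3.2, 1.1, -0.5, 0.3, 0.0, 0.9]

# Unconstrained greedy decoding picks the chatty token.
unconstrained = VOCAB[max(range(len(logits)), key=logits.__getitem__)]

# A JSON object must start with "{", so only that id is allowed at step 0.
constrained = VOCAB[constrained_greedy_step(logits, allowed_ids={1})]

# The same logits also give a confidence signal for the unconstrained choice.
probs = softmax(logits)
print(unconstrained, constrained, round(probs[0], 3))
```

The mask is applied before token selection, so the structural token is guaranteed regardless of what the model would otherwise prefer. The content tokens (here, the value after the colon) would still be chosen by the model from whatever set the format allows at that step.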

Counterpoints

Instruction tuning side effects: Instruction tuning shifts the model's weights toward instruction-following behavior, which can override capabilities acquired during pre-training. Chung et al. showed that FLAN instruction tuning on data without chain-of-thought (CoT) examples worsened CoT reasoning. It is a redistribution of capability, not a pure upgrade (Chapter 5).
Context window degradation: Performance degrades as context length increases, with roughly 8K tokens as a practical tipping point. LLMs also tend to lose information placed in the middle of the context window (Liu et al.), so teams that assume a longer context is automatically better are mistaken (Chapter 5).
LLM-as-evaluator pitfalls: "Do not trust evaluations performed by GPT-4 or any other LLM. We have no idea what evaluation criteria it uses nor do we have a deeper understanding of its biases" (Chapter 5). Human evaluation (Elo ratings) also carries biases: preferences for longer text, overlooking factuality, and sensitivity to presentation order.
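
One way to check the context window counterpoint on your own model is a small position probe: place the same fact at the start, middle, and end of filler text and see where retrieval fails. The sketch below is a minimal harness under stated assumptions. `ask_model` is a hypothetical callable `(context, question) -> str` that you would wire to your model, and `forgetful_stub` is a fake model that only exists to make the example runnable.

```python
def build_probe(fact, fillers, position):
    """Return a context with `fact` placed at the start, middle, or end of `fillers`."""
    if position == "start":
        idx = 0
    elif position == "middle":
        idx = len(fillers) // 2
    elif position == "end":
        idx = len(fillers)
    else:
        raise ValueError(f"unknown position: {position!r}")
    return "\n\n".join(fillers[:idx] + [fact] + fillers[idx:])

def recall_by_position(ask_model, fact, question, expected, fillers):
    """Ask the same question with the fact at each position; record whether it was recalled."""
    results = {}
    for position in ("start", "middle", "end"):
        context = build_probe(fact, fillers, position)
        answer = ask_model(context, question)
        results[position] = expected.lower() in answer.lower()
    return results

def forgetful_stub(context, question):
    """Fake model (assumption for the demo): only 'reads' the first and last paragraph."""
    paragraphs = context.split("\n\n")
    visible = paragraphs[0] + " " + paragraphs[-1]
    return "4417" if "4417" in visible else "unknown"

fillers = [f"Filler paragraph {i}." for i in range(6)]
results = recall_by_position(
    forgetful_stub,
    fact="The access code is 4417.",
    question="What is the access code?",
    expected="4417",
    fillers=fillers,
)
print(results)
```

With a real model, a single fact is far too small a sample. Repeating the probe over many facts and context lengths drawn from your own task distribution gives a per-position accuracy that is more informative than an advertised context size.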

Key Quotes

"Having full availability of all the token probabilities (logits) is a superpower, as we will see throughout the book." — Suhas Pai, Chapter 5

"Evaluating LLMs is probably the most challenging task in the LLM space at present. Current methods of benchmarking are broken, easily gamed, and hard to interpret." — Suhas Pai, Chapter 5

"The fine-grained choice of LLM usually isn't the most important criteria determining the success of your task." — Suhas Pai, Chapter 5

Rules of Thumb

Build internal benchmarks on your task distribution rather than trusting leaderboards
Prefer open source when you need to diagnose or constrain outputs (logit access)
Benchmark both base and instruction-tuned variants on your specific tasks
Use constrained decoding rather than prompting alone when the output must follow a fixed structure
For high-reliability tasks, use self-consistency (n > 1 generations + majority vote)
Place critical information at the beginning or end of context, never the middle
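
The self-consistency rule above (n > 1 generations plus a majority vote) can be sketched as follows. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical callable that returns one sampled completion per call, and the "Answer:" marker is an assumed output format that you would enforce through your prompt.

```python
from collections import Counter

def extract_answer(completion, marker="Answer:"):
    """Return the text after the last marker, or None if the marker is missing."""
    if marker not in completion:
        return None
    return completion.rsplit(marker, 1)[1].strip()

def self_consistent_answer(sample_fn, prompt, n=5):
    """Sample n completions and return (majority answer, share of all n samples)."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]
    answers = [a for a in answers if a is not None]  # drop unparseable samples
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n

# Canned completions stand in for sampled model outputs (made up for the demo).
canned = iter([
    "17 + 25 = 42. Answer: 42",
    "17 + 25 = 32. Answer: 32",
    "20 + 22 = 42. Answer: 42",
    "I am not sure.",
    "Adding the two numbers gives 42. Answer: 42",
])

answer, agreement = self_consistent_answer(
    lambda prompt: next(canned), "What is 17 + 25?", n=5
)
print(answer, agreement)
```

The agreement share doubles as a rough confidence signal: a low share suggests the reasoning paths disagree and the answer may deserve a fallback or human review.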

Related References

The Prototype-to-Production Gap - Why model selection is secondary to systems engineering
Pre-Training Data: The Most Important Ingredient - Contamination as a benchmark failure mode
Multi-LLM System Architecture - Logit access enables cascade confidence signals