Evaluating LLM Applications - Prompt Engineering for LLMs

Key Principle

Build the evaluation framework before any other application code. LLM outputs are non-deterministic and sensitive to small prompt changes — without a pre-existing evaluation harness, every change is a coin flip. The SOMA framework (Specific questions, Ordinal scales, Multi-Aspect coverage) makes LLM-as-judge evaluations reliable. LLM assessment is inherently relative, not absolute: use it to compare versions, not to assign standalone quality scores.

Why This Matters

Because LLMs are text completion engines that mimic training patterns, there is no principled way to predict how a prompt change will affect output quality. Evaluation is the only substitute for the predictability that deterministic code provides. Without it, development devolves into subjective "feels better to me" judgments, and teams either ship regressions unknowingly or avoid changes out of fear.

"The very first bit of code we wrote was the evaluation, and it's only thanks to this that we were able to move so fast and so successfully with the rest." (Chapter 10)

Good Examples

Three offline evaluation approaches form a hierarchy:

Gold standard matching: Compare against known-good answers. Use when reference solutions exist.
Functional testing: Check executable properties (code compiles, tool calls parse). Use when outputs have verifiable behavior.
LLM assessment: Model as judge. Use for subjective quality dimensions.

Evaluate the first decision point with a real chance of going wrong, not the final surface form. For a smart home responding to "I'm chilly," check whether it activates heating, not the exact temperature value. (Chapter 10)
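The first two approaches can be sketched as plain functions. This is a minimal illustration under stated assumptions, not the book's implementation: the function names, the normalization rule, and the tool-call dictionary shape ("device", "action") are all hypothetical.

```python
import ast


def gold_standard_match(output: str, reference: str) -> bool:
    """Gold standard matching: compare after normalizing case and whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(output) == normalize(reference)


def python_code_parses(output: str) -> bool:
    """Functional test: does the generated Python source at least parse?"""
    try:
        ast.parse(output)
        return True
    except SyntaxError:
        return False


def heating_activated(tool_call: dict) -> bool:
    """Check the first decision point (is heating switched on?),
    deliberately ignoring surface details such as the target temperature.
    The dictionary keys are an assumed, illustrative schema."""
    return tool_call.get("device") == "heating" and tool_call.get("action") == "on"
```

Note that heating_activated never reads the temperature field, so two responses that both turn the heating on pass regardless of the exact setting they choose.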

SOMA constrains LLM-as-judge evaluation. Ask Specific, verifiable questions rather than "Is this good?"; use Ordinal 1-5 scales with an explicit description for each level; cover Multiple Aspects so the judge does not fixate on a different dimension for each example. Present the rubric before the example: because attention is unidirectional, the model cannot retroactively apply criteria it has not yet read. Frame the task as grading a third party's work, never the model's own output. (Chapter 10)
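One way to assemble a judge prompt along these lines is sketched below. The Aspect class, build_judge_prompt, and the "student" framing are illustrative assumptions rather than anything prescribed by the book; the point is only the ordering (rubric first, example second) and the third-party framing.

```python
from dataclasses import dataclass


@dataclass
class Aspect:
    name: str               # one dimension of quality (Multi-Aspect)
    question: str           # a specific, verifiable question (Specific)
    levels: dict[int, str]  # description per score level (Ordinal)


def build_judge_prompt(aspects: list[Aspect], task: str, answer: str) -> str:
    # Third-party framing: the judge grades someone else's work.
    parts = ["You are grading a submission written by a student.", ""]
    # The rubric comes first so the model has read the criteria
    # before it reads the text being graded.
    parts.append("Rubric:")
    for aspect in aspects:
        parts.append(f"- {aspect.name}: {aspect.question}")
        for score in sorted(aspect.levels):
            parts.append(f"    {score} = {aspect.levels[score]}")
    parts += ["", "Task given to the student:", task]
    parts += ["", "Student submission:", answer, ""]
    parts.append("Give one integer score per rubric aspect.")
    return "\n".join(parts)
```

A caller would pass the same list of aspects for every example in the evaluation set, so each example is scored on identical dimensions.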

Five categories of online metrics: (1) direct feedback, (2) functional correctness, (3) user acceptance, (4) achieved impact, (5) incidental metrics. Acceptance metrics correlated more strongly with productivity gains than more sophisticated impact measurements did. Thumbs-down is more reliable than thumbs-up. (Chapter 10)
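As a rough illustration of an acceptance metric, the sketch below computes an acceptance rate from a list of logged events. The event schema (a "type" field with the values "shown" and "accepted") is a hypothetical one chosen for this example; a real application would substitute whatever its telemetry records.

```python
from typing import Optional


def acceptance_rate(events: list[dict]) -> Optional[float]:
    """Fraction of shown suggestions that the user accepted.

    Returns None when nothing was shown, so that "no data" is not
    confused with "0% accepted".
    """
    shown = sum(1 for event in events if event["type"] == "shown")
    accepted = sum(1 for event in events if event["type"] == "accepted")
    if shown == 0:
        return None
    return accepted / shown
```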

Counterpoints

LLM assessment is relative, not absolute. "Even though the questions to the LLM are often formulated as absolute quality questions (e.g., 'Is this correct?'), an LLM assessment a priori serves only as a relative quality judgement." (Chapter 10) The inconsistencies cancel statistically when comparing versions, but standalone scores are meaningless.
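In practice this means scoring two versions on the same examples and looking only at the differences. The helper below is a minimal sketch with hypothetical names; it assumes each list holds one judge score per example, in the same example order.

```python
from statistics import mean


def compare_versions(scores_a: list[int], scores_b: list[int]) -> dict:
    """Paired comparison of judge scores for versions A and B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same examples")
    # Per-example difference; positive means B scored higher.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return {
        "mean_diff": mean(diffs),
        "b_wins": sum(d > 0 for d in diffs),
        "a_wins": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

The individual scores are never reported on their own; only the difference between versions is treated as meaningful.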

Break Goldilocks questions into two. "Was it just right?" should become "Was it enough?" and "Was it too much?" — the model handles unidirectional scales more reliably than bidirectional ones. (Chapter 10)
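A small sketch of the split, assuming the judge returns one 1-5 score per question. The question wording, the scale descriptions, and the thresholds are illustrative assumptions, not values from the book.

```python
# Two unidirectional questions replace the single "Was it just right?".
QUESTIONS = {
    "enough": "Does the answer contain enough detail? 1 = far too little, 5 = fully sufficient",
    "too_much": "Does the answer contain unnecessary detail? 1 = none, 5 = far too much",
}


def length_problems(enough: int, too_much: int) -> list[str]:
    """Combine the two scores; both problems can occur at once,
    for example a verbose answer that still omits the key fact."""
    problems = []
    if enough <= 2:      # assumed threshold
        problems.append("too little")
    if too_much >= 4:    # assumed threshold
        problems.append("too much")
    return problems
```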

Delayed feedback beats immediate feedback. Post-hoc satisfaction reflects actual value better than in-the-moment appreciation. Conversation time, by contrast, is an ambiguous signal: a short conversation could mean instant success or an immediate rage-quit. (Chapter 10)

Key Quotes

"The very first bit of code we wrote was the evaluation, and it's only thanks to this that we were able to move so fast and so successfully with the rest." — Berryman & Ziegler, Chapter 10

"Even though the questions to the LLM are often formulated as absolute quality questions (e.g., 'Is this correct?'), an LLM assessment a priori serves only as a relative quality judgement." — Berryman & Ziegler, Chapter 10

Rules of Thumb

Write evaluation code before any prompt code — it's the foundation for all subsequent engineering
Use SOMA for LLM-as-judge: specific questions, ordinal scales, multiple aspects
Present the rubric before the example being evaluated — exploit unidirectional attention
Use LLM assessment for relative comparisons between versions, never for absolute quality scores
Start with acceptance metrics; use functional correctness and latency as guardrails
Always record latency and token consumption — cheap to collect, critical for catching regressions
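The last rule can be implemented with a thin wrapper around whatever function calls the model. The sketch below is hedged accordingly: timed_call is a hypothetical helper, and it assumes the wrapped callable returns a dictionary with "text", "prompt_tokens", and "completion_tokens" keys, which will differ between client libraries.

```python
import time
from typing import Callable


def timed_call(llm: Callable[[str], dict], prompt: str, log: list[dict]) -> str:
    """Call the model and append latency and token counts to `log`."""
    start = time.perf_counter()
    response = llm(prompt)  # assumed to return text plus token counts
    latency_s = time.perf_counter() - start
    log.append({
        "latency_s": latency_s,
        "prompt_tokens": response["prompt_tokens"],
        "completion_tokens": response["completion_tokens"],
    })
    return response["text"]
```

Because the log is an ordinary list of dictionaries, it can be written out alongside evaluation scores so that a quality gain that comes with a latency or cost regression is visible in the same report.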

Related References

LLMs as Text Completion Engines - Non-determinism of text completion makes evaluation necessary
Designing LLM Applications - Evaluation is the capstone of the feedforward pass pipeline
Taming Model Output - Logprobs feed evaluation metrics