Taming Model Output - Prompt Engineering for LLMs

Key Principle

The model's raw output is a probability distribution over tokens, and there are powerful techniques for exploiting that distribution beyond just reading the top-1 completion text. Logprobs reveal model confidence, enable classification calibration, and detect prompt anomalies. Fine-tuning (especially LoRA) is a continuation of prompt engineering — it bakes static instructions and few-shot examples into model parameters rather than consuming context window tokens.

Why This Matters

Without logprobs, applications treat all completions as equally reliable. Without understanding preamble types, engineers either suppress valuable reasoning or tolerate wasteful fluff. Without a principled approach to model selection and a clear picture of what fine-tuning can do, teams either over-invest in prompting when fine-tuning would be more effective, or attempt fine-tuning expecting capabilities the model lacks.

Good Examples

Three types of preamble call for different strategies. (1) Structural boilerplate is waste; move it into the prompt as a transition. (2) Reasoning is valuable; longer reasoning preambles produce more correct answers. (3) Fluff is RLHF-induced verbosity; banish it by requesting the answer first. The key: preamble length correlates with quality only when the preamble contains reasoning. (Chapter 7)

Logprobs as quality signal. "Logprobs are like the model's tone of voice, and you can use them to see how confident it is in its answer." Average the token probabilities, not the logprobs themselves: (exp(logprob_1) + ... + exp(logprob_n)) / n. This formula, from GitHub Copilot development, is more predictive of overall quality than the average of the raw logprobs. (Chapter 7)
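The averaging formula above can be sketched in a few lines of Python. This is a minimal illustration, not the Copilot implementation: the function names and the example logprobs are made up for the sketch, and the input is assumed to be a plain list of per-token logprobs already extracted from an API response.

```python
import math

def mean_token_probability(logprobs):
    """Average of exp(logprob) over the tokens of one completion."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return sum(math.exp(lp) for lp in logprobs) / len(logprobs)

def mean_logprob(logprobs):
    """Plain average of the raw logprobs, shown only for comparison."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return sum(logprobs) / len(logprobs)

# Hypothetical completion: three confident tokens and one very unlikely token.
example = [math.log(0.9), math.log(0.9), math.log(0.9), math.log(0.001)]

prob_score = mean_token_probability(example)    # (0.9 + 0.9 + 0.9 + 0.001) / 4
logprob_score = math.exp(mean_logprob(example)) # geometric mean of the probabilities
```

In this toy input the probability average stays moderately high, while the logprob average (a geometric mean once exponentiated) is dragged down sharply by the single unlikely token. Which behavior is preferable depends on the application; the notes above report that the probability average worked better in practice.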

Critical points via echo. Using logprobs with the "echo" parameter to score prompt tokens reveals surprises: tokens with negative double-digit logprobs indicate typos, anomalies, or high-information passages. "complution" (typo for "completion") produced a logprob below -13. (Chapter 7)
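A minimal sketch of scanning echoed prompt tokens for such critical points follows. It assumes the prompt has already been scored and converted to a list of (token, logprob) pairs; the function name, the threshold default, and every value in the sample data are illustrative, not taken from a real API response. Some APIs report no logprob for the very first token, so None entries are skipped.

```python
def find_surprising_tokens(scored_tokens, threshold=-10.0):
    """Return (index, token, logprob) for every token at or below the threshold.

    scored_tokens: list of (token, logprob) pairs; logprob may be None.
    threshold: -10.0 matches the "negative double-digit" rule of thumb.
    """
    return [
        (i, token, logprob)
        for i, (token, logprob) in enumerate(scored_tokens)
        if logprob is not None and logprob <= threshold
    ]

# Hypothetical scored prompt; the numbers are invented for illustration.
scored_prompt = [
    ("The", None),            # first token has no preceding context to score it
    (" model", -2.1),
    (" returns", -1.3),
    (" a", -0.4),
    (" complution", -13.2),   # typo: flagged as a critical point
]

flagged = find_surprising_tokens(scored_prompt)
```

A flagged token is not necessarily an error: as noted above, high-information passages also score low, so the output is a list of places to inspect rather than a list of mistakes.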

Counterpoints

Classification token conflation. When "North America" and "Northeast Asia" both start with the token "North," that first token receives the combined probability of both options. The model may therefore commit to "North..." even when "Europe," at 44%, is more probable than either of them individually. Ensure each classification option starts with a unique token. (Chapter 7)
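The conflation effect can be reproduced with a small Python sketch. Only the 44% for "Europe" comes from the text above; the 30% and 26% for the other two labels are invented so that the arithmetic works out, and the first-token mapping is a hand-written stand-in for a real tokenizer, which should be used in practice.

```python
from collections import defaultdict

# Assumed first tokens; a real check should query the model's actual tokenizer.
FIRST_TOKEN = {
    "Europe": "Europe",
    "North America": "North",
    "Northeast Asia": "North",
}

def first_token_mass(label_probs, first_token=FIRST_TOKEN):
    """Sum label probabilities by the token each label starts with."""
    mass = defaultdict(float)
    for label, prob in label_probs.items():
        mass[first_token[label]] += prob
    return dict(mass)

def shared_first_tokens(labels, first_token=FIRST_TOKEN):
    """Return {token: [labels]} for first tokens shared by several labels."""
    groups = defaultdict(list)
    for label in labels:
        groups[first_token[label]].append(label)
    return {tok: group for tok, group in groups.items() if len(group) > 1}

# Hypothetical label probabilities (only the 0.44 appears in the source).
label_probs = {"Europe": 0.44, "North America": 0.30, "Northeast Asia": 0.26}

mass = first_token_mass(label_probs)
winner = max(mass, key=mass.get)             # token chosen at the first position
collisions = shared_first_tokens(label_probs)
```

Here "Europe" is the most probable single label, yet the token "North" wins the first position because it carries the mass of two labels. Running a check like shared_first_tokens over the label set before deployment is one way to catch the problem early.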

Fine-tuning competes with prompting. LoRA teaches the model which existing capabilities to activate, not new capabilities. But if the prompt resembles original training data rather than fine-tuning data, "the model may 'forget' its fine-tuning — the two training paths compete for attention." (Chapter 7)

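Longer text naturally has lower logprobs. Every additional token multiplies in another probability below 1, and the same idea can be expressed in many ways: choosing between "for example" and "for instance" halves the probability of each phrasing without any decrease in quality. Any interpretation of raw logprob averages must account for this. (Chapter 7)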
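Both effects are simple arithmetic, sketched below in Python. The per-token probability of 0.9 and the phrase probability of 0.8 are arbitrary illustrative values, and the helper name is hypothetical.

```python
import math

def sequence_logprob(token_probs):
    """Total logprob of a sequence: the sum of the per-token log probabilities."""
    return sum(math.log(p) for p in token_probs)

# Effect 1: length. Same per-token confidence, different lengths.
short_text = [0.9] * 5
long_text = [0.9] * 20

short_total = sequence_logprob(short_text)   # 5 * log(0.9)
long_total = sequence_logprob(long_text)     # 20 * log(0.9), four times as negative

# Effect 2: paraphrase. If a phrase would get probability 0.8 on its own but the
# model is indifferent between two equally good wordings, each wording gets 0.4.
single_wording = math.log(0.8)
split_wording = math.log(0.8 / 2)
penalty = single_wording - split_wording     # equals log(2), regardless of quality
```

The longer text scores far lower in total even though every token is equally confident, and the paraphrase split costs a fixed log(2) that says nothing about the quality of the answer.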

Key Quotes

"Logprobs are like the model's tone of voice, and you can use them to see how confident it is in its answer — and that's a strong indicator of answer quality." — Berryman & Ziegler, Chapter 7

"LoRA doesn't really teach a model new tricks. Rather, LoRA teaches the model which of the tricks that it's already capable of performing it should expect to use, and in which way." — Berryman & Ziegler, Chapter 7

"Fine-tuning is a continuation of prompt engineering by other means." — Berryman & Ziegler, Chapter 7

Rules of Thumb

Use logprobs to assess completion quality — don't treat all outputs as equally reliable
Average probabilities (exp of logprobs), not raw logprobs, for quality prediction
Ensure classification options start with unique tokens to avoid token conflation
Use stop sequences to halt generation when the answer is complete — saves time and compute
Prototype with slightly larger models than you can afford; optimize model size later
Don't bake model choice into code; use abstraction layers, because the landscape changes weekly
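The last two rules of thumb (stop sequences and abstraction layers) can be sketched together. Everything below is hypothetical: the CompletionBackend interface, the CannedBackend stand-in, and the registry are illustrative names, not any vendor's API. Note also that a real stop sequence halts generation on the server, which is where the time and compute savings come from; the client-side truncation here only mimics the resulting text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol, Sequence

class CompletionBackend(Protocol):
    """Minimal interface the application codes against (hypothetical)."""
    def complete(self, prompt: str, stop: Sequence[str] = ()) -> str: ...

def truncate_at_stop(text: str, stop: Sequence[str]) -> str:
    """Cut text at the earliest occurrence of any non-empty stop sequence."""
    cut = len(text)
    for seq in stop:
        if not seq:
            continue  # an empty stop string would match at position 0
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

@dataclass
class CannedBackend:
    """Stand-in backend returning a fixed string; a real one would call a model."""
    canned: str

    def complete(self, prompt: str, stop: Sequence[str] = ()) -> str:
        return truncate_at_stop(self.canned, stop)

# Model choice lives in one registry (or config file), not in application code.
BACKENDS: Dict[str, Callable[[], CompletionBackend]] = {
    "canned-demo": lambda: CannedBackend("Paris\n\nQ: What is the capital of Spain?"),
}

def get_backend(name: str) -> CompletionBackend:
    return BACKENDS[name]()

backend = get_backend("canned-demo")
answer = backend.complete("Q: What is the capital of France?\nA:", stop=["\n"])
```

Swapping models then means adding a registry entry and changing one name, while the stop sequence keeps the completion from running on into a second, unwanted question.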

Related References

LLMs as Text Completion Engines - Logprobs expose the probability distribution underlying text completion
Evaluating LLM Applications - Logprobs feed into evaluation pipelines
What Goes Into the Prompt - Fine-tuning replaces static prompt content with model parameters