Prompt Engineering for LLMs

How LLMs Process Information

Prompt Engineering for LLMs, by John Berryman and Albert Ziegler
transformer tokenization autoregressive attention minibrains temperature

Key Principle

Understanding the transformer's architectural constraints (tokenization, unidirectional attention, autoregressive generation, and fixed context windows) is what makes prompt engineering systematic rather than guesswork. Each constraint directly dictates a prompt design decision. The authors' "minibrains" model captures the key idea: one minibrain sits at each token position, processes its token through the layers, and attends only to positions to its left.
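The left-only attention pattern can be pictured as a lower-triangular mask. The following is a minimal sketch, not the authors' code: the function name causal_mask and the example tokens are made up for illustration, and it assumes the usual convention that a position can also see itself.

```python
def causal_mask(n_positions: int) -> list:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: a position sees itself and everything to its left,
    which is the usual convention for causal (unidirectional) attention.
    """
    return [[j <= i for j in range(n_positions)] for i in range(n_positions)]


# Hypothetical word-level "tokens"; real tokenizers split text differently.
tokens = ["The", "quick", "fox", "Count", "words"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can see {visible}")
```

In this toy layout the position holding "fox" never sees "Count", which is the mechanical reason a late instruction cannot influence how earlier text was processed.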

Why This Matters

Without understanding these constraints, engineers make five predictable mistakes: (1) placing instructions after the content they apply to (violates unidirectional attention), (2) asking the model to count letters or reverse strings (violates token-level processing), (3) expecting the model to reconsider its opening once it's committed (violates autoregressive generation), (4) stuffing context without regard to position (ignores attention patterns), and (5) setting high temperature for factual tasks (compounds errors through self-reinforcement).

Good Examples

Order matters because attention is unidirectional. A word-counting request placed after the text to count fails because the minibrains processing the text didn't know counting was the goal. Instructions must precede the content they apply to. (Chapter 2)
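One way to make the ordering rule hard to violate is to assemble prompts through a small helper. This is a sketch under assumptions: build_prompt and the "Text:" delimiter are hypothetical choices for illustration, not something the book prescribes.

```python
def build_prompt(instruction: str, content: str) -> str:
    """Assemble a prompt with the instruction ahead of the content it governs."""
    # The "Text:" label is an arbitrary delimiter chosen for this sketch.
    return f"{instruction.strip()}\n\nText:\n{content.strip()}\n"


prompt = build_prompt(
    instruction="Count the words in the text below.",
    content="The quick brown fox jumps over the lazy dog.",
)
print(prompt)
```

Because the instruction comes first, every position that processes the text already has the counting goal in view.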

Chain-of-thought exploits autoregressive generation. Insights computed in the higher layers at one position can reach the lower layers at later positions only through emitted tokens. Generating reasoning tokens before the answer therefore gives the answer tokens access to those insights. "How could I know what I'm thinking before I've heard what I'm saying." (Chapter 2)
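In practice this usually means asking for the reasoning first and parsing the answer from the end of the completion. The sketch below assumes a made-up "Answer:" line convention; the names COT_SUFFIX, with_reasoning_first, and extract_answer are illustrative only, and the sample completion is hand-written rather than real model output.

```python
import re
from typing import Optional

# Hypothetical instruction that asks for reasoning before the final answer.
COT_SUFFIX = (
    "Work through the problem step by step, then give the result "
    "on a final line that starts with 'Answer:'."
)


def with_reasoning_first(question: str) -> str:
    """Append the reasoning-first instruction to a question."""
    return f"{question.strip()}\n\n{COT_SUFFIX}"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hand-written stand-in for a model completion.
sample_completion = "17 + 25: first 17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42"
print(extract_answer(sample_completion))
```

Taking the last match rather than the first lets the reasoning mention intermediate "answers" without confusing the parser.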

Tokenization makes character-level tasks unreliable. "strange new worlds" = 4 tokens; "STRANGE NEW WORLDS" = 6 tokens; "gone" = 1 token but "GONE" = 2 tokens ([G][ONE]). The model sees tokens, not individual characters, so it cannot dependably perform subtoken manipulation such as counting letters or reversing strings. Offload these to code. (Chapter 2)
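Offloading is usually a few lines of ordinary string handling. A minimal sketch, with hypothetical helper names; note that slicing reverses code points, so text with combining characters or emoji sequences would need more care.

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter."""
    return text.lower().count(letter.lower())


def reverse_text(text: str) -> str:
    """Reverse a string by code point (not by grapheme cluster)."""
    return text[::-1]


print(count_letter("strange new worlds", "s"))
print(reverse_text("gone"))
```

The application calls these helpers itself, or exposes them as tools, and passes only the result to the model.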

Counterpoints

Temperature errors compound. Temperature affects only the final sampling step, not the internal computation. But the model recognizes its own temperature-induced errors as a pattern and tries to continue them, causing compounding degradation. A temperature above 1 is almost never useful. Suggested regimes: 0 for correctness-critical tasks; 0.1-0.4 for slight variation; 0.5-0.7 for generating alternatives. (Chapter 2)
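The effect of temperature on the sampling step can be sketched as a softmax over scaled logits. This is an illustrative toy, not any particular provider's implementation: the logits are invented, the function names are hypothetical, and treating temperature 0 as greedy selection is a common convention assumed here.

```python
import math
import random


def token_probabilities(logits: dict, temperature: float) -> dict:
    """Softmax over logits divided by a positive temperature."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def pick_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Sample one token; temperature 0 is treated as greedy (assumed convention)."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    probs = token_probabilities(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Invented logits for a next token after "The capital of France is".
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0}
p_low = token_probabilities(logits, 0.2)
p_high = token_probabilities(logits, 1.5)
print(round(p_low["Paris"], 3), round(p_high["Paris"], 3))
```

Raising the temperature moves probability mass toward the less likely tokens, and each unlikely token that gets emitted becomes part of the context the next step conditions on.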

Repetition traps are structural, not bugs. Autoregressive generation is self-reinforcing: continuing a pattern is always more likely than breaking it. The model cannot step back and decide it has listed enough items. This must be handled through stop sequences or post-processing. (Chapter 2)
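Many completion APIs accept stop sequences directly; the same idea can also be applied after the fact. The sketch below shows both a client-side truncation and a simple duplicate-line filter. The helper names and sample strings are hypothetical.

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-blank line, ignoring case and padding."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key in seen:
            continue  # the model looped back to something it already said
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


# Hand-written stand-ins for looping model output.
raw = "1. apples\n2. pears\n\n## Next section"
print(truncate_at_stop(raw, ["\n\n", "##"]))
print(drop_repeated_lines("a\nb\na\nb\nc"))
```

Neither helper asks the model to notice that it is repeating itself; the limit is imposed from outside, which is the point of a structural fix.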

Context windows create hard ceilings. The transformer's pure-attention architecture introduced a fixed context window. "Since the transformer is the direct progenitor of the GPT models, this is a limitation that we have been pushing back against ever since." (Chapter 1) This limitation is what motivates the context management techniques in the book.
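A fixed window turns context selection into a budgeting problem. The following is a generic sketch rather than a technique taken from the book: approx_token_count is a crude whitespace stand-in for a real tokenizer, and fit_to_budget assumes the snippets arrive already sorted by priority.

```python
def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words.

    Assumption for illustration only; real token counts will differ.
    """
    return len(text.split())


def fit_to_budget(snippets: list, budget: int) -> list:
    """Keep snippets in priority order, skipping any that would exceed the budget."""
    kept = []
    used = 0
    for snippet in snippets:
        cost = approx_token_count(snippet)
        if used + cost > budget:
            continue  # too large for the remaining space; try the next one
        kept.append(snippet)
        used += cost
    return kept


# Hypothetical snippets, highest priority first.
snippets = ["a b c", "d e f g", "h"]
print(fit_to_budget(snippets, budget=4))
```

Swapping in the model's actual tokenizer for approx_token_count is the first change a real system would need.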

Key Quotes

"Reminiscent of the saying, 'How could I know what I'm thinking before I've heard what I'm saying,' this principle forms the basis of chain-of-thought prompting." — Berryman & Ziegler, Chapter 2

"The model can't google or edit, so it just guesses. Nor will the raw LLM express any doubt, add a disclaimer that it was just guessing, or show any other trace of evidence that the information is merely a guess rather than actual knowledge—because after all, the model always guesses." — Berryman & Ziegler, Chapter 2

Rules of Thumb

  • Place instructions before the content they apply to — the model reads left to right with no ability to re-read
  • Never ask the model to manipulate individual characters — offload to code
  • Use temperature 0 for factual/deterministic tasks; increase it only for creative generation
  • When the model repeats itself, the fix is structural (stop sequences, post-processing), not prompt-level
