Foundation Model Internals - AI Engineering: Building Applications with Foundation Models

Key Principle

Foundation models are built through a sequence of design decisions (training data, architecture, compute allocation, and post-training alignment), each of which creates behavioral constraints that application engineers inherit. The Chinchilla scaling law (DeepMind, 2022) found that compute-optimal training needs approximately 20 training tokens per parameter, with model size and token count scaled up in equal proportion as the compute budget grows. However, the production-relevant extension (Sardana et al., 2023) shows that once expected inference demand is counted, the cost-optimal model is smaller than the Chinchilla optimum and trained on more tokens, because inference cost accrues with every request served over the model's lifetime. The Llama authors made the same tradeoff, intentionally training models smaller than the Chinchilla optimum on more tokens.
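The arithmetic behind this tradeoff can be sketched in a few lines. The sketch assumes two common approximations that are not stated in the text: training costs about 6 FLOPs per parameter per token, and inference about 2 FLOPs per parameter per token. The function names and the 35B / 2.8T-token configuration are hypothetical, chosen only so that the two training budgets match.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20.0):
    """Training tokens suggested by the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params, n_train_tokens):
    """Assumed approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_train_tokens


def inference_flops(n_params, n_served_tokens):
    """Assumed approximation: about 2 FLOPs per parameter per token processed."""
    return 2.0 * n_params * n_served_tokens


def lifetime_flops(n_params, n_train_tokens, n_served_tokens):
    """Training cost is paid once; inference cost grows with tokens served."""
    return (training_flops(n_params, n_train_tokens)
            + inference_flops(n_params, n_served_tokens))


if __name__ == "__main__":
    # 70B parameters at the Chinchilla ratio, versus a hypothetical model
    # with half the parameters trained on twice the tokens.
    big = {"params": 70e9, "tokens": chinchilla_optimal_tokens(70e9)}
    small = {"params": 35e9, "tokens": 2.8e12}
    for served in (0.0, 1e12, 1e13):
        for name, m in (("70B", big), ("35B", small)):
            total = lifetime_flops(m["params"], m["tokens"], served)
            print(f"{name} served={served:.0e} lifetime_flops={total:.3e}")
```

Under these assumptions both configurations cost the same to train, but the smaller one is half as expensive per served token, so its lifetime cost pulls ahead as traffic grows. Whether it reaches comparable quality is an empirical question this arithmetic does not answer.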

Post-training (SFT followed by RLHF or DPO) consumes only a small fraction of total training compute (about 2% in the InstructGPT case) yet fundamentally changes model behavior. Critically, it does not add new knowledge. It redistributes probability mass toward useful outputs, "unlock[ing] capabilities that the pre-trained model already has but are hard for users to access via prompting alone." (Chapter 2)

Sampling is the mechanism behind both creative capability and failure modes. Temperature divides logits before softmax — high values flatten the distribution (creative, inconsistent), low values sharpen it (predictable, rigid). Top-p nucleus sampling selects from the minimum token set covering cumulative probability p, adapting dynamically to context uncertainty. Because outputs are sampled from a probability distribution rather than deterministically selected, "anything with a non-zero probability, no matter how far-fetched or wrong, can be generated by AI." (Chapter 2)
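The two sampling steps described above can be sketched with the standard library alone. This is a minimal illustration, not any particular inference engine's implementation; the function names and the example logits are assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p.

    Returns {token_index: renormalized_probability}.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index after temperature scaling and nucleus filtering."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # made-up values for three candidate tokens
    for t in (0.5, 1.0, 2.0):
        rounded = [round(q, 3) for q in softmax_with_temperature(example_logits, t)]
        print(f"temperature={t}: {rounded}")
```

Note how the nucleus size adapts: when one token dominates, the loop stops after a single entry, and when the distribution is flat, many tokens are kept.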

Hallucination has two distinct causal roots. The self-delusion hypothesis (Ortega et al., DeepMind 2021) holds that the model cannot distinguish tokens it generated from tokens provided in context — a generated wrong token becomes conditioning context, snowballing into coherent but false elaborations. The mismatched internal knowledge hypothesis (Leo Gao; John Schulman, OpenAI 2023) holds that SFT trains the model to mimic labelers whose knowledge the model lacks, directly teaching it to produce confident-sounding responses on topics where it has no ground truth. (Chapter 2)
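The snowballing described by the self-delusion hypothesis can be illustrated with a deliberately tiny toy. Everything here is invented for illustration: `toy_next_token` is a hypothetical scripted stand-in for a language model, not a real one. The point is only that the generation loop appends sampled tokens to the same context the model conditions on, with no marker for where each token came from.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a model: conditions on the whole context.

    It continues whichever city already appears in the context, and has no way
    to tell whether that city came from the prompt or from its own output.
    """
    script = {
        "Canberra": ["Canberra", ",", "seat", "of", "parliament", "."],
        "Sydney": ["Sydney", ",", "home", "of", "the", "opera", "house", "."],
    }
    for city, continuation in script.items():
        if city in context:
            offset = len(context) - context.index(city)
            return continuation[offset] if offset < len(continuation) else None
    return "Canberra"  # the toy's most likely first token when no city is present


def generate(prompt_tokens, unlucky_first_token=None):
    """Autoregressive loop: every generated token becomes conditioning context."""
    context = list(prompt_tokens)
    if unlucky_first_token is not None:
        # Simulates sampling a low-probability (wrong) token at the first step.
        context.append(unlucky_first_token)
    while True:
        token = toy_next_token(context)
        if token is None:
            break
        context.append(token)
    return context[len(prompt_tokens):]


if __name__ == "__main__":
    prompt = ["The", "capital", "of", "Australia", "is"]
    print(" ".join(generate(prompt)))
    print(" ".join(generate(prompt, unlucky_first_token="Sydney")))
```

In the second call a single wrong token is enough: every later step treats it as given and elaborates on it coherently.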

Why This Matters

Application builders who treat models as black boxes lose the ability to make principled engineering decisions. Sampling parameters are not arbitrary dials: temperature and top-p directly govern the tradeoff between creative capability and output consistency, and choosing them well requires knowing that stochastic sampling is the same mechanism that produces both fluent generation and hallucination. Engineers who don't understand this treat inconsistency and hallucination as bugs to debug rather than structural properties to manage. Response caching, fixed seeds, and lower temperature make outputs more repeatable for use cases that need determinism. They reduce run-to-run variation but do not by themselves make an answer correct.
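Response caching is the simplest of these mitigations to sketch. The class below is a minimal in-memory example under stated assumptions: `generate_fn` is a hypothetical callable standing in for whatever model client the application uses, and the cache key covers the prompt plus all sampling parameters so that a changed setting never returns a stale answer.

```python
import hashlib
import json


class ResponseCache:
    """Return the stored response for a repeated (prompt, parameters) request."""

    def __init__(self, generate_fn):
        # generate_fn(prompt, **params) -> str is a hypothetical model call.
        self._generate_fn = generate_fn
        self._store = {}

    def _key(self, prompt, params):
        # sort_keys makes the key independent of keyword-argument order.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt, **params):
        key = self._key(prompt, params)
        if key not in self._store:
            self._store[key] = self._generate_fn(prompt, **params)
        return self._store[key]
```

Caching only guarantees that the same question gets the same answer. If the first sampled answer was wrong, the cache repeats it faithfully.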

Post-training mechanics determine which evaluation metrics are valid and which are misleading. Because SFT and RLHF typically increase perplexity while improving task performance, perplexity comparisons between aligned and base models are not informative about usefulness. Engineers who read lower perplexity as higher usefulness will make incorrect model selection decisions and underestimate the value of aligned models. Similarly, understanding that post-training does not inject new knowledge, and only reshapes access to existing knowledge, clarifies why finetuning on a weak base model cannot compensate for domain gaps in pre-training data.

Good Examples

Choosing sampling parameters by task type. A customer support assistant generating scripted responses should use low temperature (0.2–0.4) and conservative top-p to maximize consistency and reduce off-script outputs. A creative writing assistant benefits from higher temperature (0.8–1.0) because the flatter distribution produces the stylistic variation users want. Both settings act through the same mechanism, and neither is universally better.
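One way to keep these choices explicit is a small preset table keyed by task. The temperatures below fall inside the ranges given above; the top-p values, the task names, and the `SamplingConfig` type are assumptions made for this sketch and should be tuned against real evaluations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float


# Hypothetical presets: temperatures follow the ranges in the text,
# top_p values are illustrative assumptions.
SAMPLING_PRESETS = {
    "customer_support": SamplingConfig(temperature=0.3, top_p=0.8),
    "creative_writing": SamplingConfig(temperature=0.9, top_p=0.95),
}


def sampling_for(task):
    """Look up the preset for a task, failing loudly instead of guessing a default."""
    try:
        return SAMPLING_PRESETS[task]
    except KeyError:
        raise ValueError(
            f"no sampling preset for task {task!r}; choose one explicitly"
        ) from None
```

Raising on an unknown task is a deliberate choice here: a silent default would hide the fact that nobody decided what tradeoff that task needs.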

Avoiding misleading perplexity comparisons. A team evaluating whether to use a base model vs. an RLHF-aligned variant should not use perplexity as the comparison metric. The aligned model will typically score worse (higher perplexity) because post-training shifts probability mass away from statistically likely but unhelpful completions toward lower-probability but accurate and safe responses. Task-specific benchmarks or human preference evaluations are the appropriate comparison method.

Applying structured output techniques appropriately. Because self-delusion causes the model to treat its own generated tokens as conditioning context, constraining output format (JSON schema, grammar-based decoding, explicit stop tokens) reduces the surface area for snowballing errors: the model cannot elaborate into a field the schema does not permit. Format constraints do not verify the values inside permitted fields, so a schema-valid output can still be factually wrong. This is a targeted mitigation for the self-delusion mechanism, not a general-purpose quality improvement.
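A minimal sketch of the idea, using post-hoc validation rather than true constrained decoding (which requires control over the decoding loop). The schema, field names, and function name are hypothetical; the check rejects any output that adds fields the schema does not list.

```python
import json

# Hypothetical schema for a support-ticket classifier: field name -> expected type.
TICKET_SCHEMA = {"category": str, "priority": int, "summary": str}


def parse_structured_output(raw_text, schema):
    """Parse model output as JSON and reject anything outside the schema."""
    data = json.loads(raw_text)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = sorted(set(data) - set(schema))
    missing = sorted(set(schema) - set(data))
    if extra or missing:
        raise ValueError(f"unexpected fields {extra}, missing fields {missing}")
    for field, expected_type in schema.items():
        value = data[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) and expected_type is not bool:
            raise ValueError(f"field {field!r} has the wrong type")
        if not isinstance(value, expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    return data
```

A rejected output can be retried or routed to a fallback. The validator checks shape only: a well-formed `summary` can still contain a fabricated claim.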

Counterpoints

Using perplexity to compare aligned vs. base models. Perplexity is the exponentiated average next-token prediction loss on a reference text. Post-training typically raises this loss as a side effect of optimizing for helpfulness and safety instead of next-token prediction. Comparing perplexity across models with different training objectives produces a number with no valid interpretation for task suitability or user preference.
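For reference, perplexity can be computed from per-token log-probabilities as the exponential of the average negative log-likelihood. The sketch assumes natural-log probabilities; the function name and the example numbers are made up.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the tokens of a reference text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every reference token is
    # "as uncertain as" a uniform choice among 4 options.
    uniform_quarter = [math.log(0.25)] * 4
    print(perplexity(uniform_quarter))
```

The number only says how well a model predicts the reference text token by token. It says nothing about whether the model's own responses are helpful, which is why it cannot rank an aligned model against its base model.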

Believing hallucination is fixable purely by scaling. The mismatched internal knowledge hypothesis implies hallucination can be introduced by the post-training process itself: labelers teach the model to sound confident on topics it doesn't know. Scaling the base model does not fix this. It requires reward functions that explicitly penalize fabrication and training data that includes "I don't know" demonstrations. RLHF worsened measured hallucination in InstructGPT experiments even when overall preference scores improved, which suggests that the preference signal did not reliably penalize hallucination.
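A small worked example shows why the penalty matters. The reward values here are hypothetical and chosen only to make the arithmetic visible; this is not a description of any real reward model. If a wrong answer costs more than abstaining, answering only pays off above a break-even confidence of penalty / (reward + penalty).

```python
def expected_reward_if_answering(p_correct, reward_correct=1.0, penalty_wrong=2.0):
    """Expected reward of answering when the answer is right with probability p_correct."""
    return p_correct * reward_correct - (1.0 - p_correct) * penalty_wrong


def should_abstain(p_correct, reward_correct=1.0, penalty_wrong=2.0, reward_abstain=0.0):
    """True when saying "I don't know" has the higher expected reward."""
    answering = expected_reward_if_answering(p_correct, reward_correct, penalty_wrong)
    return answering < reward_abstain


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7, 0.9):
        print(p, "penalized:", should_abstain(p),
              "unpenalized:", should_abstain(p, penalty_wrong=0.0))
```

With the example penalty of 2.0 the break-even confidence is 2/3. With no penalty for wrong answers, guessing is never worse than abstaining, which is the incentive the paragraph above warns about.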

Treating temperature as a quality dial (lower = better). Temperature controls the creativity-consistency tradeoff, not output quality. Setting temperature near zero produces deterministic, locally-optimal token choices that can be rigid, repetitive, and brittle on open-ended tasks. The correct framing is: low temperature for tasks requiring consistency and factual grounding; higher temperature for tasks requiring stylistic range and generative diversity.

Key Quotes

"Post-training unlocks the capabilities that the pre-trained model already has but are hard for users to access via prompting alone." (Chapter 2)

"Sampling is perhaps one of the most underrated concepts in AI. Not only does sampling explain many seemingly baffling AI behaviors, including hallucinations and inconsistencies, but choosing the right sampling strategy can also significantly boost a model's performance with relatively little effort." (Chapter 2)

"Foundation models are aggregations of the opinions of the masses... Anything with a non-zero probability, no matter how far-fetched or wrong, can be generated by AI." (Chapter 2)

"The self-delusion hypothesis focuses on how self-supervision causes hallucinations, whereas the mismatched internal knowledge hypothesis focuses on how supervision causes hallucinations." (Chapter 2)

Rules of Thumb

Use ~20 tokens per parameter as the Chinchilla-optimal training ratio, but prefer smaller, well-trained models for production deployments where inference cost compounds over time.
Post-training changes behavior, not knowledge — if the base model lacks domain coverage, finetuning cannot compensate.
Never compare aligned and base model perplexity as a quality signal; use task-specific evaluation instead.
Set temperature based on the creativity-consistency tradeoff required by the task, not based on a general quality preference.
Treat hallucination as having two distinct root causes (self-delusion and mismatched knowledge) that require different mitigations — no single fix addresses both.

Related References

Evaluation Methods and AI as a Judge - Post-training explains why perplexity breaks as an evaluation metric
Prompt Attacks and Defense-in-Depth - Self-delusion mechanism helps explain prompt injection
Finetuning, LoRA, and Model Merging - Post-training mechanics underlie finetuning decisions