From Completion to Chat — RLHF and Alignment

Key Principle

RLHF transforms a document completion engine into an aligned chat assistant through a four-model pipeline. Each step exists because the previous one cannot teach a specific capability. The critical insight: honesty cannot be taught by example (SFT) — it requires the model to generate its own completions and learn that fabrication has consequences. Despite this transformation, chat is still document completion: the documents are now ChatML transcripts.

Why This Matters

Understanding the RLHF pipeline explains three things engineers encounter daily: (1) why the system message is the primary behavioral control surface, (2) why models produce fluff (polite preambles, caveats) that must be managed, and (3) why there's an alignment tax — RLHF can decrease raw intelligence on certain tasks while optimizing for helpfulness and safety.

Good Examples

The four-model pipeline and why each step is necessary:

Base model (~499B tokens): Completes documents but cannot follow instructions.
SFT model (~13,000 handcrafted conversations): Learns to follow instructions and chat, but cannot learn honesty — labelers don't know the model's internal knowledge state, so they may write confident answers about things the model doesn't know, inadvertently teaching fabrication. (Chapter 3)
Reward model (~33,000 human-ranked completions): Scores completion quality numerically.
RLHF model: Fine-tuned from SFT using reward scores. Honesty emerges because the model generates its own completions, which are ranked by humans — inconsistencies are scored as "bad." (Chapter 3)

ChatML provides structural security. Roles (system, user, assistant, tool) are annotated with special tokens that cannot be produced by user input through the API, providing a structural defense against prompt injection. (Chapter 3)

Counterpoints

Never inject user content into the system message. The model is conditioned to strictly obey system message content. Injecting user input there completely circumvents prompt injection protections. (Chapter 3)

The alignment tax is real. RLHF optimizes for helpfulness, honesty, and harmlessness, sometimes at the cost of raw task performance. Mitigated by mixing original training data back during RLHF training. (Chapter 3)

Chat is still completion. "At their core, LLMs are all just document completion engines. With the introduction of chat, this was still true — it's just that the documents are now ChatML transcripts." (Chapter 3) Engineers who forget this apply chat-specific intuitions that break down at the edges.

Key Quotes

"Honesty, it turns out, can't be taught by examples and rote repetition — it takes a bit of introspection." — Berryman & Ziegler, Chapter 3

"At their core, LLMs are all just document completion engines. With the introduction of chat, this was still true — it's just that the documents are now ChatML transcripts." — Berryman & Ziegler, Chapter 3

Rules of Thumb

The system message is your primary behavioral control surface — use it for instructions, rules, and persona
Never put user-provided content in the system message
Expect and plan for fluff in RLHF-trained models — parse it out or suppress it with formatting instructions
Account for the alignment tax: RLHF models may perform worse on narrow tasks than base models

Related References

LLMs as Text Completion Engines - Chat is still document completion
Taming Model Output - Managing fluff and completion structure
Assembling and Structuring the Prompt - System message placement in prompt structure