Key Principle
The Transformer architecture has been remarkably stable since 2017 — most gains have come from data and training methodology, not architectural innovation. Understanding self-attention, positional encoding, and learning objectives is essential for reasoning about model capabilities and limitations. "The Transformer implementation in current models isn't too different from the original version, despite hundreds of papers proposing modifications" (Chapter 4).
Why This Matters
Without understanding the architecture, engineers cannot reason about context window limitations, inference costs (quadratic attention), or why certain model types are better for generation vs. classification. The architecture determines how efficiently the model uses what it sees, but the leverage for improvement is elsewhere — primarily in data quality and systems engineering.
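The quadratic cost mentioned above can be illustrated with simple arithmetic. The sketch below counts only the query-key scores in full (unoptimized) self-attention; the function name and the `num_heads` and `num_layers` parameters are hypothetical and used for illustration only.

```python
def attention_score_entries(seq_len, num_heads=1, num_layers=1):
    """Count query-key scores for one full forward pass of plain self-attention.

    Every position attends to every position, so each head in each layer
    fills a seq_len x seq_len score matrix.
    """
    return seq_len * seq_len * num_heads * num_layers


# Doubling the context length quadruples the number of scores.
short_context = attention_score_entries(2_000)
long_context = attention_score_entries(4_000)
ratio = long_context / short_context
```

Under these assumptions, context length is the only term that enters squared, which is why longer context windows are disproportionately expensive.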
Good Examples
- Self-attention: Each token's representation becomes a weighted combination of every token in the sequence, itself included. Query-key dot products produce attention scores; softmax normalizes them into weights; the weights form a weighted sum of the value vectors. Multi-head attention runs several attention heads in parallel to capture different relationship types. This removed the fixed-size-vector bottleneck of LSTM encoder-decoder models: "You can't cram the meaning of a whole sentence into a single vector!" (Ray Mooney, ACL 2014; Chapter 4).
- Mixture of Experts (MoE): Each Transformer block holds multiple feedforward networks (experts); a gating function routes each input token to a small subset of them (typically k=2 of 8). The total parameter count is large, but only the selected experts run, so active compute per token stays small. MoE thus decouples model capacity from per-token compute cost, which matters for production deployment (Chapter 4).
- Chess LLM experiment: A model trained on PGN notation learned all chess rules including castling and checkmate. The same model trained on English move descriptions failed. "Language design is an important skill to acquire" (Chapter 4).
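The self-attention steps described in the first example above (dot products, softmax, weighted sum of values) can be sketched in plain Python. This is a minimal single-head illustration, not a production implementation: it assumes the queries, keys, and values are passed in directly (a real model derives them from learned projections), and the toy `tokens` vectors are made up.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Query-key dot products, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        if causal:
            # Decoder-style mask: position i cannot attend to later positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output is the attention-weighted sum of the value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three 2-dimensional token vectors used as Q, K, and V alike
# (an assumption that stands in for learned projection matrices).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens, tokens, tokens)
causal_out, causal_weights = self_attention(tokens, tokens, tokens, causal=True)
```

The `weights` result has one row per token and one column per token, which is the n-by-n matrix behind attention's quadratic cost. With `causal=True`, the first token can only attend to itself, as in a decoder-only model.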
Counterpoints
- Positional encoding tradeoffs: Absolute encoding limits generalization beyond training lengths. ALiBi applies linear distance bias to attention scores. RoPE rotates query/key vectors by position-dependent angles, enabling relative position awareness. NoPE removes explicit encoding entirely, relying on causal attention masks — surprisingly competitive but still experimental (Chapter 4).
- Learning objective mismatches: FLM (full language modeling, i.e. next-token prediction) provides a learning signal on every token, which makes it sample-efficient and the best fit for generation. MLM (masked language modeling, which masks ~15% of tokens) produces bidirectional representations well suited to classification. Using a decoder-only (FLM) model for pure classification leaves performance on the table; encoder-only models remain excellent for classification and embeddings despite being unfashionable (Chapter 4).
- Upcycling risk: Converting dense models to MoE by copying feedforward layers (Komatsuzaki et al.) is faster than training from scratch but may inherit the dense model's limitations (Chapter 4).
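The RoPE and ALiBi descriptions in the positional encoding counterpoint above can be made concrete with a small sketch. It assumes the commonly used RoPE frequency schedule with base 10000 and treats the ALiBi slope as a free parameter; the function names and toy vectors are hypothetical.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE pairs up dimensions, so the length must be even"
    out = []
    for i in range(0, d, 2):
        # Lower-index pairs rotate faster; higher-index pairs rotate slower.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def alibi_bias(query_pos, key_pos, slope):
    """ALiBi-style penalty added to an attention score, linear in distance."""
    return -slope * (query_pos - key_pos)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (made up for illustration).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at two different absolute positions.
score_near_start = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_further_on = dot(rope_rotate(q, 10), rope_rotate(k, 8))
```

Because both vectors are rotated, the two scores agree up to floating-point error: the dot product depends only on the offset between positions, which is the relative position awareness the text refers to. ALiBi reaches a similar goal without touching the vectors, by subtracting a distance-proportional penalty from the score.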
Key Quotes
"You can't cram the meaning of a whole sentence into a single vector!" — Ray Mooney, ACL 2014 (cited in Chapter 4)
"Language design is an important skill to acquire." — Suhas Pai, Chapter 4
Rules of Thumb
- Self-attention processes all positions of a sequence in parallel during training, with no recurrent step-by-step dependency; this is a large part of why Transformers scale
- MoE is a systems engineering solution: capacity without proportional compute cost
- Choose learning objectives to match your task: FLM for generation, MLM/encoder for classification and embeddings
- Architecture is stable; gains come from data, training methodology, and systems design
- For domain-specific learning, structured domain languages outperform natural language descriptions
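The MoE rule of thumb above (capacity without proportional compute cost) can be sketched as top-k routing. This is a toy illustration under stated assumptions: the experts are simple scaling functions rather than feedforward networks, the gate logits are fixed made-up numbers rather than the output of a learned gating layer, and renormalizing with softmax over the selected experts is one common choice, not the only one.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, w in routes:
        y = experts[idx](x)  # The other experts are never evaluated.
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]


# Eight hypothetical experts; expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
# Hypothetical gate logits for a single token.
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.0, -0.3, 0.0, 0.2]

out, used = moe_layer([1.0, 2.0], experts, gate_logits, k=2)
```

Here all eight experts contribute parameters, but only the two listed in `used` run for this token, so per-token compute tracks k rather than the total number of experts.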
Related References
- Inference Optimization Taxonomy - KV caching avoids recomputing keys and values for earlier tokens during decoding, which reduces the cost of attention at generation time
- Selecting and Evaluating LLMs - Model flavors and learning objectives for selection
- Embeddings, Document Parsing, and Semantic Search - Encoder models for embedding generation