Key Principle
Pre-training data is the most consequential and underappreciated LLM ingredient. Many downstream failure modes — hallucination, bias, privacy leaks, inflated benchmarks — trace causally back to decisions made during data preparation. "Data preprocessing is the most unglamorous and underappreciated part of the LLM training pipeline, yet perhaps the most important" (Chapter 2).
Why This Matters
Data quality determines the ceiling of what any model can do. No architecture, prompting technique, or fine-tuning method can recover information that was lost or corrupted during data preparation. The grounding problem means that text-only training inherently underspecifies the world — only ~12% of information in human text is explicitly stated (Chapter 2). This is a root cause of hallucination, and it cannot be solved by scale alone.
Teams that over-invest in model selection while under-investing in data curation are optimizing the wrong variable. The prototype-to-production gap often originates here.
Good Examples
- Scaling laws (Hoffmann et al., Ch. 1) showed empirically that, for compute-optimal training, parameters and training tokens should scale in roughly equal proportion. Earlier models like GPT-3 were substantially undertrained because they prioritized parameters over data.
- Pythia experiment (Ch. 2): Replacing masculine pronouns with feminine ones for only the last 7% of training tokens had a measurable de-biasing effect, demonstrating that small, targeted data interventions can shift model behavior.
- McCoy et al.'s "Embers of Autoregression" (Ch. 2): LLMs perform better at base-10 than base-9 addition, and at alphabetical than reverse-alphabetical sorting — performance reflects training data frequency, not general capability.
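To make the scaling-law example above concrete, here is a minimal sketch of the tokens-per-parameter arithmetic. The figures are assumptions taken from the published GPT-3 and Chinchilla papers (roughly 175B parameters and 300B tokens for GPT-3, 70B parameters and 1.4T tokens for Chinchilla), not from the chapters cited here, and the "about 20 tokens per parameter" target is a commonly quoted heuristic rather than an exact law.

```python
# Figures are approximate and come from the published GPT-3 and Chinchilla
# papers, not from the chapters cited in these notes.
MODELS = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9, "tokens": 1.4e12},
}

# Commonly quoted rule of thumb derived from Hoffmann et al.; a heuristic,
# not an exact law.
HEURISTIC_TOKENS_PER_PARAM = 20.0


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return tokens / params


def summarize(models: dict) -> dict:
    """Map each model name to its tokens-per-parameter ratio."""
    return {
        name: tokens_per_parameter(spec["params"], spec["tokens"])
        for name, spec in models.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(MODELS).items():
        share = ratio / HEURISTIC_TOKENS_PER_PARAM
        print(f"{name}: {ratio:.1f} tokens per parameter "
              f"({share:.0%} of the heuristic)")
```

Under these assumed figures, GPT-3 saw fewer than 2 tokens per parameter against roughly 20 for Chinchilla, which is the sense in which it was undertrained.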
Counterpoints
- Training set contamination: Pre-training data containing benchmark test data inflates evaluation scores. OpenAI used 13-gram overlap detection for GPT-3, but rephrased or translated benchmark data evades detection (Ch. 2). Model selection based on contaminated benchmarks leads to deploying models that underperform in production.
- Bias amplification: LLMs amplify biases beyond training data base rates. Keyword-based toxic content filtering creates a second bias vector by disproportionately removing text by/about minority communities — the model appears debiased but the filtering itself introduced different bias (Ch. 2).
- Privacy risks: Larger models memorize more easily (Carlini et al.). Deduplicating the training data reduces verbatim generation of training text roughly tenfold, but even information that occurs only once can be memorized (Ch. 2).
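The 13-gram overlap check mentioned in the contamination bullet can be sketched as follows. This is an illustrative approximation, not OpenAI's actual procedure: the whitespace tokenization, the lowercasing, and the function names `ngrams`, `build_index`, and `flag_contaminated` are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Return the set of word-level n-grams in lowercased, whitespace-split text.

    Texts shorter than n words yield an empty set and can never be flagged.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int = 13) -> Set[NGram]:
    """Collect every n-gram that appears anywhere in the training corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index


def flag_contaminated(eval_items: List[str], index: Set[NGram],
                      n: int = 13) -> List[int]:
    """Return positions of evaluation items sharing at least one n-gram with the corpus."""
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & index]


if __name__ == "__main__":
    # Toy strings, purely illustrative.
    corpus = ["a b c d e f g h i j k l m n o"]
    eval_items = [
        "x a b c d e f g h i j k l m y",   # shares one 13-gram with the corpus
        "p q r s t u v w x y z a b c d",   # no shared 13-gram
    ]
    print(flag_contaminated(eval_items, build_index(corpus)))
```

Because the match is exact, a rephrased or translated copy of a benchmark item shares no 13-gram with the original and passes unflagged, which is the limitation noted above.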
Key Quotes
"Data preprocessing is the most unglamorous and underappreciated part of the LLM training pipeline, yet perhaps the most important." — Suhas Pai, Chapter 2
"The fine-grained choice of LLM usually isn't the most important criteria determining the success of your task, and you are better off spending that bandwidth working on cleaning and understanding your data." — Suhas Pai, Chapter 5
Rules of Thumb
- Invest in data quality before model sophistication; it usually yields the higher ROI
- Check tokenization of your domain terms before selecting a model
- Build internal benchmarks rather than trusting leaderboard rankings
- Audit pre-training data for contamination against your evaluation sets
- For bias mitigation, examine both the data and the filtering process
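One way to examine the filtering process itself, per the last rule above, is to compare how often a keyword filter removes documents from each source or community. The sketch below is a hypothetical audit: the group labels, the placeholder blocklist, and the function names `keyword_filter` and `removal_rate_by_group` are assumptions for illustration, and the toy documents are not real data.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def keyword_filter(text: str, blocklist: Set[str]) -> bool:
    """Return True if the document would be removed by the keyword filter.

    Uses a plain whitespace split, so punctuation attached to a word
    prevents a match; a real filter would normalize text more carefully.
    """
    return bool(set(text.lower().split()) & blocklist)


def removal_rate_by_group(docs: Iterable[Tuple[str, str]],
                          blocklist: Set[str]) -> Dict[str, float]:
    """Return the fraction of documents removed for each (group, text) source."""
    totals: Counter = Counter()
    removed: Counter = Counter()
    for group, text in docs:
        totals[group] += 1
        if keyword_filter(text, blocklist):
            removed[group] += 1
    return {group: removed[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Hypothetical toy inputs; "flaggedword" stands in for a blocklist entry.
    blocklist = {"flaggedword"}
    docs = [
        ("community_a", "a post that uses flaggedword in a reclaimed sense"),
        ("community_a", "an ordinary post"),
        ("community_b", "an ordinary post"),
        ("community_b", "another ordinary post"),
    ]
    for group, rate in removal_rate_by_group(docs, blocklist).items():
        print(f"{group}: {rate:.0%} removed")
```

A large gap in removal rates between groups does not prove the filter is biased, but it is a signal that the filter, and not only the data, deserves review.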
Related References
- The Prototype-to-Production Gap - How data quality problems open the gap between prototype and production
- Tokenization and Its Hidden Failure Modes - How tokenizer training data causes downstream failures
- Selecting and Evaluating LLMs - Benchmark unreliability traced to contamination
- Retrieval-Augmented Generation Pipeline - RAG as the engineering solution to the grounding problem