Dataset Engineering and Data Quality - AI Engineering: Building Applications with Foundation Models

Key Principle

Data quality caps model capability multiplicatively, not additively. The three axes of the Quality × Coverage × Quantity triad interact, so a zero on any one axis collapses the product entirely. Quantity without coverage produces brittleness outside the training distribution; coverage without quality floods the gradient signal with noise; quality without quantity overfits to the curated slice. Zhou et al. (2023) confirmed this empirically: only a dataset that was both high-quality and diverse produced the best performance on a 7B model, and neither axis alone sufficed.

Annotation guidelines and evaluation guidelines are the same artifact applied in opposite directions. Evaluation guidelines specify what constitutes a good response; annotation guidelines specify what constitutes a good training example. Treating them as separate documents creates two partially-overlapping specifications, producing inconsistency between what the model is optimized for and what it is evaluated against.

Four synthesis techniques address distinct failure modes: (1) model distillation compresses capability from a larger model into a smaller one, but style imitation is not capability transfer (Gudibande et al., 2023); (2) instruction generation (Self-Instruct / Alpaca) addresses quantity scarcity; (3) reverse instruction generates only the instruction from existing high-quality content, leaving the response (the hallucination-prone component) as human-verified source material; (4) self-play and iterative bootstrapping let a weak model generate instructions, get finetuned on them, and then repeat the cycle with the improved model.

Model collapse (Shumailov et al., 2023) operates through distribution skew amplification: AI generators over-represent probable events and under-represent rare ones, and each recursive training iteration amplifies this skew. Repetition in training data has a quantified cost of its own: repeating just 0.1% of training tokens 100 times degraded an 800M-parameter model to the effective performance of a 400M-parameter model, a halving of effective model size from duplication alone, even though 90% of tokens remained unique (Hernandez et al., 2022). Collapse is avoidable by mixing synthetic with real data; it is an endpoint of pure synthetic recursion, not of synthetic data per se (Gerstgrasser et al., 2024).
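
The skew-amplification mechanism can be illustrated with a toy simulation. This is a minimal sketch, not the experimental setup of any cited paper: a "model" is reduced to a categorical distribution, "training" is re-estimating that distribution from a finite sample of the previous generation, and the names recursive_training and real_fraction are hypothetical.

```python
import random
from collections import Counter


def resample_distribution(probs, n_samples, rng):
    """Draw n_samples from probs and return the empirical distribution."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {c: counts.get(c, 0) / n_samples for c in categories}


def recursive_training(real, generations, n_samples, real_fraction, seed=0):
    """Toy stand-in for recursive training on generated data.

    Each generation is estimated from samples of the previous generation,
    then blended with the real distribution using weight real_fraction
    (an assumed knob; 0.0 means pure synthetic recursion).
    """
    rng = random.Random(seed)
    current = dict(real)
    history = [current]
    for _ in range(generations):
        empirical = resample_distribution(current, n_samples, rng)
        current = {
            c: (1 - real_fraction) * empirical[c] + real_fraction * real[c]
            for c in real
        }
        history.append(current)
    return history


real = {"common": 0.90, "uncommon": 0.099, "rare": 0.001}
pure = recursive_training(real, generations=30, n_samples=200, real_fraction=0.0)
mixed = recursive_training(real, generations=30, n_samples=200, real_fraction=0.2)
print("pure recursion, final:", pure[-1])
print("with real mixing, final:", mixed[-1])
```

Under pure recursion, a category that draws zero samples in one generation has zero probability in every later generation, so the set of represented events can only shrink. With any positive real_fraction, every real category keeps a nonzero share, which is the toy analogue of mixing real data into the training set.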

The data processing pipeline sequence is Inspect → Deduplicate → Clean → Format. Ordering is determined by operation cost: if cleaning is expensive, deduplicate first, so that you never pay to clean a document that will be eliminated as a duplicate. Databricks found that removing extraneous Markdown/HTML tokens improved model accuracy by 20% and reduced input token length by 60% at the same time. (Chapter 8)
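
The deduplicate-then-clean ordering can be sketched as follows. This is an illustrative sketch under simple assumptions: duplicates are detected by hashing case- and whitespace-normalized text (exact-match only), and the regex-based clean function is a cheap stand-in for whatever expensive cleaning step a real pipeline would use. All function names are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of case- and whitespace-normalized text for exact-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Keep the first occurrence of each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


TAG_PATTERN = re.compile(r"<[^>]+>")


def clean(doc):
    """Stand-in for an expensive cleaner: strip HTML-like tags, collapse whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", doc).split())


def process(docs):
    """Deduplicate first so the cleaner only runs on surviving documents."""
    return [clean(doc) for doc in deduplicate(docs)]


docs = ["<p>Hello   world</p>", "<P>hello world</P>", "<b>Other</b> doc"]
print(process(docs))
```

One trade-off of this ordering: deduplicating on raw text misses documents that differ only in markup. If that case matters, a cheap normalization inside fingerprint is usually a better fix than moving the expensive cleaner ahead of deduplication.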

Why This Matters

Model quality is capped multiplicatively by the data because the three axes (quality, coverage, and quantity) interact rather than sum. Fixing only one axis while neglecting the others produces failures that look like a quality problem but are actually a distribution mismatch. A finetuning effort that acquires more data without ensuring coverage gains nothing on out-of-distribution inputs; one that curates quality without sufficient diversity overfits to the curated slice. The ceiling is set by the weakest axis, and failing to diagnose which axis is weak, instead treating all three as independent, is one of the most common sources of wasted compute in finetuning projects.

The compounding risk of model collapse for teams that rely heavily on synthetic data is not speculative; it is quantified and mechanistic. Each generation of synthetic training data amplifies the distribution skew introduced by the previous generator. Rare events are gradually forgotten, effective representational capacity shrinks, and the degradation is invisible until performance on tail cases collapses. Teams that use synthetic data without real-data mixing are running an experiment with a known endpoint.

Annotation and evaluation guidelines must be the same document because the behavioral specification that defines a good output is logically identical in both contexts; maintaining two documents creates drift between what a model learns and what it is measured against, undermining the entire evaluation signal.

Good Examples

Applying the Quality × Coverage × Quantity diagnostic: A team observes a finetuned model performing well on common query types but failing on edge cases. Rather than defaulting to "add more data," they check the three axes: quality is high (human-reviewed), quantity is sufficient (10K examples), but coverage is narrow (only high-frequency query types represented). The fix is targeted: add diverse examples across rare query types, not more of the same.
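
The coverage check in this diagnostic can be sketched as a per-category share report. This is a minimal sketch that assumes each training example carries a "category" label and that the team has an explicit list of query types it expects to handle; the field name, the function name, and the 5% threshold are all hypothetical.

```python
from collections import Counter


def coverage_report(examples, expected_categories, min_share=0.05):
    """Return each expected category's share of the dataset, plus the
    categories whose share falls below min_share (an assumed threshold)."""
    counts = Counter(ex["category"] for ex in examples)
    total = len(examples)
    shares = {
        c: (counts.get(c, 0) / total if total else 0.0)
        for c in expected_categories
    }
    underrepresented = [c for c in expected_categories if shares[c] < min_share]
    return shares, underrepresented


examples = (
    [{"category": "billing"}] * 90
    + [{"category": "refund"}] * 9
    + [{"category": "fraud"}] * 1
)
shares, gaps = coverage_report(examples, ["billing", "refund", "fraud", "legal"])
print(shares)
print("needs more examples:", gaps)
```

Categories that appear in the expected list but not in the data show up with a share of zero, which is the narrow-coverage case described above.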

Using reverse instruction to avoid hallucinated training data: Instead of generating both instruction and response synthetically, a team starts from an existing corpus of verified technical documentation. They use a model to generate instructions that the documentation answers, then use the original documentation text as the response. The AI-generated component (the instruction) is where errors are less costly; the response — the component where hallucination causes training harm — is human-verified source material. This is the pattern behind Köksal et al. (2023) and Li et al. (2023).
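
The data flow of reverse instruction can be sketched as below. This is an illustrative sketch, not the method of either cited paper: generate_instruction is a hypothetical callable standing in for a model call, and placeholder_generator is a trivial rule-based substitute so the example runs without any model.

```python
def build_reverse_instruction_pairs(documents, generate_instruction):
    """Pair each verified document with a generated instruction.

    Only the instruction is synthetic; the response is the original
    document text, copied through unchanged.
    """
    pairs = []
    for doc in documents:
        instruction = generate_instruction(doc).strip()
        if not instruction:
            continue  # skip documents the generator could not handle
        pairs.append({"instruction": instruction, "response": doc})
    return pairs


def placeholder_generator(doc):
    """Hypothetical stand-in for a model call: build a prompt from the first sentence."""
    first_sentence = doc.split(".")[0].strip()
    if not first_sentence:
        return ""
    return f"Explain the following topic: {first_sentence}"


docs = [
    "Gradient clipping bounds the norm of the update. It stabilizes training.",
    "",
]
for pair in build_reverse_instruction_pairs(docs, placeholder_generator):
    print(pair["instruction"])
```

Because the response field is never produced by the generator, a poor instruction costs one weak training pair rather than a hallucinated answer.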

Maintaining synthetic-to-real data mixing ratios to prevent collapse: NVIDIA's Nemotron-4-340B used 98% synthetic data during alignment and exceeded its teacher model's performance, while still mixing in real data and applying functional verification at each stage. The mixing ratio and verification step are the variables that separate safe synthetic data use from collapse-bound recursion (Gerstgrasser et al., 2024).

Counterpoints

Treating quality, coverage, and quantity as independent axes is the most common antipattern in dataset engineering. Teams optimize one axis — usually quality, because it is the most legible — while neglecting the others. The result is a model that performs well on the curated evaluation set and poorly everywhere else. The interaction term is the insight; the checklist is not.

Using synthetic data without real-data mixing introduces collapse risk that compounds with each training iteration. The mechanism is distribution skew amplification: even a small bias in the generator's output distribution becomes a larger bias in the next generation's training distribution. Collapse is not a low-probability failure mode — it is the deterministic endpoint of pure synthetic recursion. The escape hatch (real-data mixing with verification) must be built in from the start, not added as a remediation after degradation is observed.

Creating separate annotation and evaluation guideline documents is an organizational antipattern that introduces specification drift. The investment required to define annotation guidelines precisely is large; that investment should pay off twice — once for training data annotation and once for evaluation quality gates. Teams that maintain two documents eventually optimize for different behavioral specifications in training and evaluation, making it impossible to interpret evaluation results accurately.

Key Quotes

"The best ML team in the world with infinite compute can't help you finetune a good model if you don't have data." (Chapter 8)
"Data will mostly just be toil, tears, and sweat." (Chapter 8)
Greg Brockman (OpenAI co-founder): "Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning." (Chapter 8)

Rules of Thumb

Before scaling data collection, validate the performance gain curve at 25%/50%/100% of current data — a plateau at 50% means more data yields diminishing returns; a steep slope means data is the bottleneck.
Deduplicate before cleaning when cleaning is expensive; filter before deduplicating when deduplication is expensive. Never pay to process data that will be eliminated by a later step.
Any synthetic data pipeline that lacks real-data mixing is on a path toward model collapse; the mixing ratio and verification step are not optional hygiene, they are capacity preservation.
Annotation guidelines and evaluation guidelines should be the same document — the behavioral specification that defines a good output is identical in both contexts.
At small data scales (fewer than ~100 examples), base model capability dominates; at large scales (hundreds of thousands of examples), model differences converge. Invest in a strong base model when data is scarce.
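
The gain-curve check from the first rule can be sketched as follows. This is a rough sketch under stated assumptions: scores holds one evaluation metric (higher is better) per fraction of the dataset used, each step roughly doubles the data, and the 0.01 plateau threshold is a hypothetical value that would need tuning to the metric's noise level.

```python
def diagnose_gain_curve(scores, plateau_threshold=0.01):
    """Classify the most recent data-scaling step.

    scores maps fraction of data used -> evaluation metric.
    Returns the per-step gains and a verdict: "plateau" if the last
    step gained less than plateau_threshold, else "data-limited".
    """
    fractions = sorted(scores)
    if len(fractions) < 2:
        raise ValueError("need scores for at least two data fractions")
    gains = [
        (hi, scores[hi] - scores[lo])
        for lo, hi in zip(fractions, fractions[1:])
    ]
    last_gain = gains[-1][1]
    verdict = "plateau" if last_gain < plateau_threshold else "data-limited"
    return gains, verdict


gains, verdict = diagnose_gain_curve({0.25: 0.60, 0.50: 0.70, 1.00: 0.705})
print(gains, verdict)
```

A "plateau" verdict suggests looking at coverage or quality before collecting more of the same data; a "data-limited" verdict suggests that further collection is still paying off.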

Related References

Evaluation-Driven Development - Evaluation guidelines = annotation guidelines
Finetuning, LoRA, and Model Merging - Finetuning needs this data
Production Architecture and the Data Flywheel - Data flywheel closes the loop