Embeddings, Document Parsing, and Semantic Search - Designing Large Language Model Applications

Key Principle

Embeddings are the infrastructure layer beneath RAG. The most underappreciated and most impactful step in the entire pipeline is document parsing — "the bane of NLP projects" and "the foundation on which high-quality products are built" (Chapter 11). Poor parsing and chunking create failure modes that no amount of model sophistication can fix.

Why This Matters

Embedding vectors have fixed dimensionality regardless of input length: "There is no such thing as infinite compression!" (Chapter 11). The longer the input, the smaller the share of its information the embedding can retain. This compression constraint, combined with the inherent limitations of semantic similarity, means that the quality of what you embed (parsing, chunking) and how you train embeddings (hard negatives) together set the retrieval ceiling before the RAG pipeline even begins.

Good Examples

Hard negatives: Random negatives are trivially distinguishable; the model learns nothing from them. Hard negatives (examples that are topically similar to the anchor but less relevant) force the model to learn fine-grained semantic boundaries. Construct them by retrieving the top-k matches for an anchor, excluding known positives, and keeping only candidates whose relevance score falls within a chosen range (Chapter 11).
Matryoshka Representation Learning (MRL): The training loss is computed over several truncated prefixes of the embedding, so earlier dimensions encode the most important information. A truncated embedding can retain 98.37% of full-size performance at 8.3% of the original dimensions, a large storage saving at 100M+ vector scale (Chapter 11).
Chunking progression: sliding window → metadata-aware (paragraph/section boundaries) → layout-aware (CV-extracted structure; tools: Textractor, Unstructured, LayoutLMv3) → semantic (topic-based clustering) → late chunking (Jina AI: encode full document, then pool over segments) (Chapter 11).
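
The hard-negative construction described above (retrieve top-k, exclude positives, filter by score range) can be sketched as follows. This is a minimal illustration on toy 2-D vectors, not the book's implementation: the function name mine_hard_negatives, the parameters min_score and max_score, and the threshold values are all hypothetical. The upper bound reflects one common reason for using a score range, namely screening out candidates so similar that they may be unlabeled positives; that rationale is an assumption here.

```python
import numpy as np

def mine_hard_negatives(anchor_vec, corpus_vecs, positive_ids, k=5,
                        min_score=0.3, max_score=0.95):
    """Return corpus indices usable as hard negatives for one anchor.

    Hypothetical helper: names and default thresholds are illustrative.
    """
    # Normalize so the dot product equals cosine similarity.
    anchor = anchor_vec / np.linalg.norm(anchor_vec)
    corpus = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = corpus @ anchor

    # Step 1: retrieve the k most similar corpus items.
    top_k = np.argsort(-scores)[:k]

    # Step 2: drop known positives.
    # Step 3: keep only candidates inside the score band.
    return [int(i) for i in top_k
            if int(i) not in positive_ids
            and min_score <= scores[i] <= max_score]

# Toy data: index 0 is the labeled positive, 1 and 2 are topically close,
# 3 and 4 are unrelated (easy negatives).
anchor = np.array([1.0, 0.0])
corpus = np.array([
    [1.0, 0.0],    # cosine 1.0 (positive)
    [0.8, 0.6],    # cosine 0.8
    [0.6, 0.8],    # cosine 0.6
    [0.0, 1.0],    # cosine 0.0
    [-1.0, 0.0],   # cosine -1.0
])

hard_negs = mine_hard_negatives(anchor, corpus, positive_ids={0}, k=4)
print(hard_negs)
```

In practice the scores would come from a retriever or cross-encoder rather than raw cosine similarity on toy vectors, and the band would need tuning per dataset.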

Counterpoints

Semantic similarity failures: Cosine similarity is a single scalar that collapses multiple dimensions of similarity (topical, factual, stylistic). "Mr. Pomorenko confirmed he is not retiring" and "Mr. Pomorenko announced his retirement yesterday" receive a similarity of 0.7870, a high score for two sentences that state opposite facts. Negation is largely invisible to embedding models (Chapter 11).
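CLS token quality varies: Whether CLS/first-token embeddings are useful depends on pre-training. BERT's next-sentence prediction objective trains the CLS token to carry sequence-level information; RoBERTa drops that objective, so its first-token embedding gets no such enrichment (Chapter 11).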
Unsolved chunking problems: Scope boundaries (when does a rule on page 5 apply to content on page 84?), document structure detection, and long-range dependencies remain major RAG failure causes (Chapter 11).
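
To make the CLS point above concrete, here is a small sketch contrasting first-token pooling with mask-aware mean pooling over token embeddings. Mean pooling is brought in here as a widely used alternative and is an assumption of this sketch, not something the chapter summary prescribes; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def cls_pool(token_embeddings):
    # Use the first ([CLS]) token's vector as the sequence embedding.
    return token_embeddings[0]

def mean_pool(token_embeddings, attention_mask):
    # Average only over real tokens; padding positions have mask 0.
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Toy output of an encoder: 4 token positions, 3 dimensions.
# The last position is padding and must not affect the mean.
token_embeddings = np.array([
    [1.0, 0.0, 0.0],   # [CLS]
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
    [9.0, 9.0, 9.0],   # padding
])
attention_mask = np.array([1, 1, 1, 0])

cls_vec = cls_pool(token_embeddings)
mean_vec = mean_pool(token_embeddings, attention_mask)
print(cls_vec, mean_vec)
```

Which pooling works better depends on how the encoder was pre-trained and fine-tuned, so it is worth checking the model's documentation rather than assuming either choice.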

Key Quotes

"There is no such thing as infinite compression! Embedding sizes are fixed, so the longer your input, the less information can be encoded in its embedding." — Suhas Pai, Chapter 11

"Effective document parsing...is the bane of NLP projects. A large proportion of failure modes in RAG can be attributed to poor document parsing...Ignore this at your own peril!" — Suhas Pai, Chapter 11

Rules of Thumb

Invest in document parsing before model tuning — it determines the retrieval ceiling
Use hard negatives for embedding fine-tuning; random negatives teach nothing
Consider MRL for large-scale deployments — truncate embeddings for 10-30x storage savings
Normalize embeddings to unit length so that the dot product equals cosine similarity and is cheaper to compute
100M 768-dimensional float32 vectors require roughly 307 GB of raw storage (100M × 768 × 4 bytes), so storage is a real constraint
Binary quantization (1-bit) can surprisingly outperform int8 for some embedding models
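
The storage and normalization rules above can be checked with a few lines of arithmetic. This is a rough sketch: index_size_gb and truncate are hypothetical helper names, sizes are raw vector storage in decimal gigabytes with no index overhead, and the 64-dimension truncation is chosen only because 64 is 8.3% of 768. Truncation like this is only meaningful for a model trained with MRL.

```python
import numpy as np

def index_size_gb(num_vectors, dims, bits_per_dim):
    """Raw vector storage in decimal GB (ignores index overhead)."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9

N, D = 100_000_000, 768
float32_gb = index_size_gb(N, D, 32)   # about 307.2 GB
int8_gb = index_size_gb(N, D, 8)       # about 76.8 GB
binary_gb = index_size_gb(N, D, 1)     # about 9.6 GB
mrl_gb = index_size_gb(N, 64, 32)      # 64 float32 dims: about 25.6 GB

# Unit-length vectors: dot product and cosine similarity coincide.
rng = np.random.default_rng(0)
a, b = rng.normal(size=D), rng.normal(size=D)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

def truncate(vec, dims):
    # Keep the leading dimensions, then re-normalize, since a prefix
    # of a unit vector is generally no longer unit length.
    prefix = vec[:dims]
    return prefix / np.linalg.norm(prefix)

short = truncate(a_unit, 64)
print(float32_gb, int8_gb, binary_gb, mrl_gb)
```

Going from 768 to 64 float32 dimensions is a 12x reduction, which sits inside the 10-30x range quoted above.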

Related References

Retrieval-Augmented Generation Pipeline - Embeddings feed directly into the RAG retrieve step
Tokenization and Its Hidden Failure Modes - Tokenization affects what embeddings can capture
Pre-Training Data: The Most Important Ingredient - Document parsing as the data quality step for retrieval