Designing Large Language Model Applications

Embeddings, Document Parsing, and Semantic Search

embeddings hard-negatives chunking document-parsing matryoshka

Key Principle

Embeddings are the infrastructure layer beneath RAG. The most underappreciated and most impactful step in the entire pipeline is document parsing — "the bane of NLP projects" and "the foundation on which high-quality products are built" (Chapter 11). Poor parsing and chunking create failure modes that no amount of model sophistication can fix.

Why This Matters

Embedding vectors have fixed dimensionality regardless of input length: "There is no such thing as infinite compression!" (Chapter 11). The longer the input, the more content competes for the same number of dimensions, and the less of it the embedding can preserve. This compression constraint, combined with the inherent limitations of semantic similarity, means that the quality of what you embed (parsing, chunking) and how you train the embedding model (hard negatives) set the retrieval ceiling before the rest of the RAG pipeline even runs.

Good Examples

  • Hard negatives: Random negatives are trivially distinguishable from positives, so the model learns little from them. Hard negatives, examples that are topically similar to the anchor but less relevant, force the model to learn fine-grained semantic boundaries. To construct them, retrieve the top-k matches for each anchor, exclude known positives, and keep candidates whose relevance score falls within a chosen range (Chapter 11).
  • Matryoshka Representation Learning (MRL): The training loss is computed over several truncated prefixes of the embedding, so earlier dimensions encode the most important information. A truncated embedding can retain 98.37% of performance at 8.3% of the original dimensions, a large storage saving at 100M+ vector scale (Chapter 11).
  • Chunking progression: sliding window → metadata-aware (paragraph/section boundaries) → layout-aware (CV-extracted structure; tools: Textractor, Unstructured, LayoutLMv3) → semantic (topic-based clustering) → late chunking (Jina AI: encode full document, then pool over segments) (Chapter 11).
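
The hard-negative construction described above (retrieve top-k, exclude positives, filter by score range) can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the corpus layout, the function name mine_hard_negatives, the toy vectors, and the 0.3 to 0.9 score band are all assumptions chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_hard_negatives(anchor, corpus, positive_ids, top_k=5,
                        min_score=0.3, max_score=0.9):
    """Return ids of hard negatives for one anchor embedding.

    corpus: dict mapping passage id -> embedding (hypothetical layout).
    positive_ids: ids already labeled as relevant to the anchor.
    min_score / max_score: illustrative thresholds, not values from the source.
    """
    # Step 1: retrieve the top-k most similar passages for the anchor.
    scored = sorted(
        ((cosine(anchor, vec), pid) for pid, vec in corpus.items()),
        reverse=True,
    )[:top_k]
    # Step 2: drop known positives.
    # Step 3: keep only scores inside the band. Below it the negative is
    # too easy; above it the passage may be an unlabeled positive.
    return [pid for score, pid in scored
            if pid not in positive_ids and min_score <= score <= max_score]

# Toy 3-dimensional embeddings, hand-written for illustration only.
anchor = [1.0, 0.0, 0.0]
corpus = {
    "pos": [1.0, 0.05, 0.0],       # labeled positive
    "near_dup": [1.0, 0.1, 0.0],   # too similar: likely an unlabeled positive
    "hard": [0.7, 0.7, 0.0],       # on-topic but less relevant
    "hard2": [0.5, 0.0, 0.85],     # on-topic but less relevant
    "easy": [0.0, 0.0, 1.0],       # unrelated: an easy negative
}
hard_negatives = mine_hard_negatives(anchor, corpus, {"pos"}, top_k=4)
print(hard_negatives)
```

In practice the retrieval step would run against a vector index rather than a full scan, and the score band would be tuned per dataset.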

Counterpoints

  • Semantic similarity failures: Cosine similarity is a single scalar that collapses several kinds of similarity (topical, factual, stylistic) into one number. "Mr. Pomorenko confirmed he is not retiring" and "Mr. Pomorenko announced his retirement yesterday" score 0.7870, a high similarity despite opposite factual content. Negation is largely invisible to embedding models (Chapter 11).
  • CLS token quality varies: Whether the CLS (first-token) embedding is useful depends on pre-training. BERT's next-sentence prediction objective gives the CLS token a sentence-level training signal; RoBERTa drops that objective, so its CLS token gets no comparable signal (Chapter 11).
  • Unsolved chunking problems: Scope boundaries (when does a rule on page 5 apply to content on page 84?), document structure detection, and long-range dependencies remain major RAG failure causes (Chapter 11).
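
To see why a single scalar can hide a factual contradiction, consider a toy sketch. The vectors below are hand-made, not output from any embedding model: seven components stand in for shared topical content and one hypothetical "polarity" component flips sign between the two sentences.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Seven shared "topic" components plus one "polarity" component.
# These numbers are invented to illustrate the effect.
not_retiring = [1.0] * 7 + [1.0]
retiring = [1.0] * 7 + [-1.0]

score = cosine(not_retiring, retiring)
print(round(score, 2))  # 0.75
```

One fully inverted component out of eight still leaves a score of 0.75, because the shared topical components dominate the dot product.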

Key Quotes

"There is no such thing as infinite compression! Embedding sizes are fixed, so the longer your input, the less information can be encoded in its embedding." — Suhas Pai, Chapter 11

"Effective document parsing...is the bane of NLP projects. A large proportion of failure modes in RAG can be attributed to poor document parsing...Ignore this at your own peril!" — Suhas Pai, Chapter 11

Rules of Thumb

  • Invest in document parsing before model tuning — it determines the retrieval ceiling
  • Use hard negatives for embedding fine-tuning; random negatives teach nothing
  • Consider MRL for large-scale deployments — truncate embeddings for 10-30x storage savings
  • Normalize embeddings to unit length to make dot product equivalent to cosine similarity (and faster)
  • 100M 768-dimensional float32 vectors require ~300 GB (100M × 768 × 4 bytes ≈ 307 GB), so storage is a real constraint
  • Binary quantization (1-bit) can surprisingly outperform int8 for some embedding models
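
The arithmetic behind several of these rules can be checked with a short sketch. It assumes raw vector storage only (no index overhead) and decimal gigabytes; the helper names and the choice of 64 dimensions as the truncation point (64 / 768 ≈ 8.3%) are illustrative assumptions, not details from the source.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components (MRL-style) and rescale to unit length."""
    return normalize(vec[:dims])

def storage_gb(num_vectors, dims, bits_per_dim):
    """Raw vector storage in decimal GB, ignoring index overhead."""
    return num_vectors * dims * bits_per_dim / 8 / 1e9

# After normalization, the dot product equals cosine similarity.
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
dot_of_unit = dot(normalize(a), normalize(b))
print(math.isclose(dot_of_unit, cosine(a, b)))  # True

n, d = 100_000_000, 768
print(storage_gb(n, d, 32))   # float32: 307.2
print(storage_gb(n, d, 8))    # int8: 76.8
print(storage_gb(n, d, 1))    # binary (1-bit): 9.6
print(storage_gb(n, 64, 32))  # float32 truncated to 64 dims: 25.6
```

Truncating to 64 of 768 dimensions gives a 12x reduction, and 1-bit quantization gives 32x relative to float32; whether retrieval quality holds up at those sizes depends on the embedding model and has to be measured.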

Related References