Encoding Formats and Schema Evolution - Designing Data-Intensive Applications

Key Principle

Every system that persists or transmits data must translate between in-memory representations and byte sequences. The encoding format chosen locks in trade-offs across compatibility, compactness, security, and organizational coupling. Rolling upgrades require both backward compatibility (new code reads old data) and forward compatibility (old code reads new data). Without both directions, every schema change forces big-bang deployments with downtime. Schema evolution reframes what seems like a database problem into a distributed systems coordination problem. (Chapter 4)
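The two directions of compatibility can be illustrated with a minimal sketch. It assumes records are already decoded into `{tag: value}` dictionaries; the schemas, field names, and the `decode` helper are hypothetical and do not reflect any real library's API.

```python
import copy

# Hypothetical schemas: field tag -> (field name, default value).
SCHEMA_V1 = {1: ("user_name", ""), 2: ("favorite_number", 0)}
SCHEMA_V2 = {**SCHEMA_V1, 3: ("interests", [])}  # v2 adds an optional field


def decode(record, reader_schema):
    """Decode a {tag: value} record using the reader's own schema."""
    result = {}
    for tag, (name, default) in reader_schema.items():
        # A tag missing from the record falls back to a copy of the default.
        result[name] = record[tag] if tag in record else copy.deepcopy(default)
    # Tags in the record that the reader's schema does not list are ignored.
    return result


old_record = {1: "Martin", 2: 1337}                      # written by v1 code
new_record = {1: "Martin", 2: 1337, 3: ["daydreaming"]}  # written by v2 code

backward = decode(old_record, SCHEMA_V2)  # new code reads old data
forward = decode(new_record, SCHEMA_V1)   # old code reads new data

print(backward)
print(forward)
```

Because the added field has a default and unknown tags are skipped, both reads succeed, which is what lets old and new code run side by side during a rolling upgrade.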

Why This Matters

"The five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then." (Chapter 4) Database contents accumulate values written under many historical schema versions. Code is updated frequently; data persists indefinitely. The encoding format determines whether schema changes require downtime, coordinated deployments, or neither. Schema-based binary formats (Protobuf, Thrift, Avro) dissolve the false dichotomy between "strict schemas" and "schemaless" by providing both flexibility and safety: "Schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide... while also providing better guarantees about your data and better tooling." (Chapter 4)

Good Examples

Field Tags Enable Evolution: Protobuf and Thrift assign numeric tags to fields and encode tags — not names — into the binary stream. This enables compact encoding (Thrift CompactProtocol 34 bytes vs. JSON 81 bytes), safe field renaming (names never appear in encoded data), and schema evolution (old code skips unrecognized tags, new code fills defaults for missing optional fields). "You cannot change a field's tag, since that would make all existing encoded data invalid." (Chapter 4)
Avro's Writer/Reader Schema Resolution: Avro's binary encoding carries no field tags or type markers; it simply concatenates values in the order the writer's schema declares them, which makes it the most compact of the encodings compared (32 bytes). Decoding therefore requires the exact writer's schema, and Avro resolves differences between the writer's and the reader's schema by matching fields by name, not position. This name-based resolution cleanly supports dynamically generated schemas (e.g., auto-generating from relational DB schemas without manual tag management). Because the schema is required for decoding, a schema registry doubles as documentation that is guaranteed to stay up to date. (Chapter 4)
Three Modes of Dataflow: Each mode imposes different compatibility requirements. Through databases: backward compatibility, because a later version of the code reads what an earlier version wrote ("sending a message to your future self"), and forward compatibility, because during a rolling upgrade older code may read values written by newer code. Through services (REST/RPC): backward compatibility on requests and forward compatibility on responses, assuming servers are upgraded before clients. Through async message passing: decoupled producers and consumers with broker-mediated delivery. (Chapter 4)
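Name-based resolution between a writer's and a reader's schema can be sketched as follows. This is a simplified illustration, not the Avro library: the `resolve` function, the `NO_DEFAULT` sentinel, and the field names are assumptions, and real Avro resolution also covers type promotion, aliases, and unions.

```python
NO_DEFAULT = object()  # sentinel: the reader's field declares no default


def resolve(writer_fields, values, reader_fields):
    """Map positionally encoded values onto the reader's schema by name.

    writer_fields: field names in the order the writer encoded them
    values:        decoded values, aligned with writer_fields
    reader_fields: (name, default) pairs from the reader's schema
    """
    written = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_fields:
        if name in written:
            record[name] = written[name]       # present in both schemas
        elif default is not NO_DEFAULT:
            record[name] = default             # reader-only field with default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Writer-only fields are never looked up, so they are dropped.
    return record


writer_fields = ["userName", "favoriteNumber", "interests"]
values = ["Martin", 1337, ["daydreaming", "hacking"]]
reader_fields = [
    ("userID", None),            # not written; filled from its default
    ("userName", NO_DEFAULT),
    ("favoriteNumber", None),
]

record = resolve(writer_fields, values, reader_fields)
print(record)
```

The field order in the two schemas differs and neither schema is a subset of the other, yet the read succeeds because matching is done purely by name.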

Counterpoints

Language-specific serialization is a trap: Java's Serializable, Ruby's Marshal, and similar mechanisms tie the data to one language, open remote code execution attack surfaces (decoding can instantiate arbitrary classes), treat versioning as an afterthought, and are often inefficient. "It's generally a bad idea to use your language's built-in encoding for anything other than very transient purposes." (Chapter 4)
Binary JSON offers poor ROI: MessagePack achieves only an 18% size reduction over JSON (66 vs. 81 bytes) while sacrificing human-readability. The compact schema-based encodings (Thrift CompactProtocol at 34 bytes, Avro at 32 bytes) achieve a 58-60% reduction. Half-measures lose readability without meaningful compactness gain. Yet JSON/XML persist for inter-organizational interchange because "the difficulty of getting different organizations to agree on anything outweighs most other concerns." (Chapter 4)
The hidden field-dropping danger: During rolling upgrades, old code may read a record written by new code, update it, and write it back, silently dropping the fields it does not recognize. The encoding formats themselves can carry unknown fields through; the loss happens when the record is decoded into application model objects that have no place for those fields and is later re-encoded from them. Applications must explicitly preserve unknown fields during read-modify-write cycles. (Chapter 4)
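The field-dropping hazard and one way to avoid it can be sketched as below. The class names, the `bump` helper, and the dictionary-based record are hypothetical; real code would rely on whatever unknown-field support its encoding library provides.

```python
KNOWN_FIELDS = ("user_name", "favorite_number")  # all that the old code knows


class LossyUser:
    """Model object with no place for fields it does not recognize."""

    def __init__(self, record):
        self.user_name = record["user_name"]
        self.favorite_number = record["favorite_number"]

    def to_record(self):
        return {"user_name": self.user_name,
                "favorite_number": self.favorite_number}


class PreservingUser(LossyUser):
    """Same model, but it carries unrecognized fields along untouched."""

    def __init__(self, record):
        super().__init__(record)
        self._unknown = {k: v for k, v in record.items()
                         if k not in KNOWN_FIELDS}

    def to_record(self):
        return {**self._unknown, **super().to_record()}


def bump(user_cls, record):
    """Read-modify-write cycle as performed by the old code."""
    user = user_cls(record)
    user.favorite_number += 1
    return user.to_record()


# Record written by newer code that already knows about "interests".
stored = {"user_name": "Martin", "favorite_number": 1337,
          "interests": ["hacking"]}

lossy = bump(LossyUser, stored)
kept = bump(PreservingUser, stored)
print(lossy)
print(kept)
```

Both versions apply the same update; only the second writes back a record that still contains the field added by the newer code.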

Key Quotes

"Every field you add after the initial deployment of the schema must be optional or have a default value." — Martin Kleppmann, Chapter 4

"There's no point trying to make a remote service look too much like a local object in your programming language, because it's a fundamentally different thing." — Martin Kleppmann, Chapter 4

"The five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then." — Martin Kleppmann, Chapter 4

Rules of Thumb

New fields must always be optional or have defaults; if a new field were required, new code would fail when reading records written before the field existed
Removed fields must never have their tag numbers reused
Prefer schema-based binary formats (Protobuf, Thrift, Avro) over binary-JSON half-measures or language-native serialization
Data outlives code: design encoding for the reader you will be in five years
Datatype widening (32-bit to 64-bit) is backward compatible but not forward compatible: new code can read old 32-bit values, but old code reading a new 64-bit value into a 32-bit variable truncates it
RPC location transparency is a lie: network calls differ from local calls in failure modes, latency, and retry semantics
For cross-organizational APIs, maintain backward compatibility indefinitely — you cannot force client upgrades
Archival snapshots can be encoded consistently with the latest schema, and are a good opportunity to use analytics-friendly column-oriented formats
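The widening rule above can be made concrete with a small sketch. Python integers are unbounded, so the hypothetical `as_int32` helper mimics what happens when a value is stored in a signed 32-bit variable with C-style wraparound; whether a given language or library truncates silently or raises an error is an assumption that varies in practice.

```python
def as_int32(value):
    """Keep only the low 32 bits, interpreted as a signed integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


written_by_old = 1337        # fits in 32 bits
written_by_new = 2**32 + 5   # needs more than 32 bits

# New code holds the value in a 64-bit variable: any 32-bit value fits.
read_by_new = written_by_old

# Old code still holds the value in a 32-bit variable: high bits are lost.
read_by_old = as_int32(written_by_new)

print(read_by_new, read_by_old)
```

The old reader ends up with a small, plausible-looking number rather than an error, which is why this kind of change is easy to miss in testing.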

Related References

Reliability, Scalability, Maintainability — The Three Pillars - encoding formats enable the rolling upgrades that reliability requires
Data Models — Relational, Document, and Graph - schema-on-write vs. schema-on-read maps to encoding evolution strategies
Storage Engines — LSM-Trees, B-Trees, and Analytical Storage - encoding format choice interacts with storage engine serialization overhead