DDIA Ch.4 — Encoding & Evolution

Chapter 4 closes Part I of DDIA by tackling a problem that every long-lived system faces: applications change, and when code changes, the shape of the data it produces changes too. New features add fields, refactors rename them, and old features get removed. Meanwhile the data already written to disk and in flight over the network doesn't magically update itself. This chapter is about how data is encoded (turned into bytes), how those encodings can evolve without breaking running systems, and the three ways data flows between processes — through databases, through services, and through message brokers.

⚡ Quick Takeaways

Two directions of compatibility — backward: new code can read old data (usually easy); forward: old code can read new data (harder — it must ignore fields it doesn't understand).
Rolling upgrades force both at once — during a staged deploy, old and new versions run side by side, so the format must be readable in both directions simultaneously.
Language-specific serialization is a trap — Java Serializable, Python pickle, and friends bring lock-in, security holes, poor versioning, and bloat. Avoid for anything persisted or shared.
JSON/XML/CSV are ubiquitous but sloppy — ambiguous numbers, no binary strings, optional schemas. Great for openness, weak for precision and size.
Schema-based binary formats win at scale — Protobuf and Thrift use numbered field tags; Avro matches a writer's schema against a reader's schema. Both encode evolution rules explicitly.
"Data outlives code" — your encoding choice decides how painful it is to evolve a system that's already running in production.

tldr

Encoding turns in-memory objects into bytes; decoding reverses it. Because old and new code coexist during rolling upgrades, formats must be both backward- and forward-compatible. Schema-based binary formats (Protobuf, Thrift, Avro) make evolution safe and compact. Data flows three ways — databases (writer and reader separated by time), services (REST/RPC, separated by the network), and message brokers (asynchronous, decoupled) — and every one of them is an encoding boundary.

Why Encoding Matters

Programs work with data in (at least) two different representations. In memory, data lives in objects, structs, lists, arrays, hash tables, and trees — structures optimized for efficient access and manipulation by the CPU, full of pointers. To write data to a file or send it over the network, you must translate it into a self-contained sequence of bytes — there are no pointers in a byte stream that another process can follow. The translation from the in-memory representation to a byte sequence is called encoding (also serialization or marshalling), and the reverse is decoding (parsing, deserialization, unmarshalling).

Because this happens constantly — every database write, every API call, every message — the choice of encoding has outsized effects on efficiency, and, crucially, on how easily the system can change over time.

Rolling Upgrades and Coexisting Versions

Large applications are rarely deployed all at once. Server-side systems use a rolling upgrade (staged rollout): a few nodes get the new version, you check nothing is broken, then you continue until all nodes are updated. Client-side apps are at the mercy of users who may not update for weeks. The consequence is unavoidable: old and new versions of the code, and old and new data formats, all coexist in the system at the same time.

For the system to keep running smoothly, compatibility has to hold in both directions:

Backward compatibility — newer code can read data that was written by older code. This is usually easy: you know the old format, so you write the new code to handle it.
Forward compatibility — older code can read data written by newer code. This is harder, because old code must ignore additions made by a version it knows nothing about, rather than choking on them.

key distinction

Backward compatibility looks back in time: new code reading old data. Forward compatibility looks forward: old code reading data from the future. The tricky one is forward compatibility — it requires the format and the code to gracefully skip unknown fields. Schema-based formats build this in; ad-hoc parsing usually doesn't.

Language-Specific Formats

Many languages ship built-in serialization: Java has java.io.Serializable, Python has pickle, Ruby has Marshal. They're tempting because they let you save and restore in-memory objects with minimal code. But Kleppmann is blunt about why they're a bad choice for anything beyond throwaway use:

Language lock-in — the encoding is tied to one programming language, making it very hard for another system (perhaps written in a different language) to read your data. You've coupled your data format to a language choice that may not survive a decade.
Security — to restore objects, the decoder must instantiate arbitrary classes. This is a notorious source of remote-code-execution vulnerabilities: an attacker who can get your app to deserialize hostile bytes can often run arbitrary code.
Versioning is an afterthought — because these libraries aim for quick convenience, forward and backward compatibility are usually neglected.
Efficiency — Java's built-in serialization is famous for being both bloated and slow.

The verdict: language-specific formats are fine for transient, same-process, same-version use, and a liability for anything you persist or send across a boundary.

Textual Formats: JSON, XML, CSV

The standardized, language-independent encodings most developers reach for are JSON, XML, and CSV. They're human-readable, supported everywhere, and great for openness. They also have real flaws that bite at scale:

Number ambiguity. XML and CSV can't distinguish a number from a string of digits without a schema. JSON distinguishes strings and numbers but not integers from floats, and has no notion of precision. Numbers larger than 2^53 lose precision in IEEE 754 doubles — which is exactly why Twitter returns tweet IDs as both a number and a string, to survive JavaScript parsers.
No binary strings. JSON and XML have no native binary type. The workaround is Base64-encoding the data — which works but inflates the size by ~33% and is a hack.
Optional, awkward schemas. XML and JSON have schema languages (XSD, JSON Schema) that are powerful but complicated, and many tools skip them entirely. Without a schema, the application code must hardcode the interpretation of the data.
CSV has no schema at all and is famously vague about escaping, quoting, and what a row even means.

Despite all this, textual formats remain an excellent default for many purposes — for public APIs and human-facing data, ubiquity and tooling usually outweigh the inefficiency. The problems matter most when you're encoding huge volumes or need precise, compact, evolvable data internally.

Binary Encoding

For data used only inside your organization, you can choose a format that's far more compact and faster to parse. A first step is a binary encoding of JSON — formats like MessagePack, BSON, and others. These shrink the data somewhat, but because they don't have a schema, they must still include all the object's field names in the encoded bytes. That's the key inefficiency the next family of formats removes.

The insight: if both the writer and the reader agree on a schema ahead of time, the field names never have to travel with the data. You can replace each field name with a compact numeric tag, and you gain type information for free. This is the foundation of Thrift, Protocol Buffers, and Avro.

Thrift and Protocol Buffers

Apache Thrift (originally from Facebook) and Protocol Buffers (Protobuf, from Google) are closely related binary encoding libraries. Both require the data to be described by a schema written in an interface definition language (IDL), and both ship a code-generation tool that produces classes in many languages from that schema.

person.proto — Protocol Buffers schema

message Person {
  required string user_name       = 1;   // tag 1
  optional int64  favorite_number = 2;   // tag 2 — safe to add later
  repeated string interests       = 3;   // 0..n values
}

Field Tags and the Encoding

The crucial detail is that each field has a numeric tag (the = 1, = 2, = 3). In the encoded bytes, the tag — not the field name — identifies the field, along with the field's type and value. Field names exist only in the schema, so they cost nothing at runtime. This is what makes the encoding compact.

Schema Evolution Rules

Because tags carry the meaning, the rules for changing a schema safely fall out naturally:

You can add a new field as long as you give it a new tag number. Old code reading new data simply ignores tags it doesn't recognize (forward compatibility). New code reading old data sees the field is missing and uses a default or null (backward compatibility) — which is why new fields must be optional or have a default, never required.
You can never change a field's tag number, because that would invalidate every byte stream already written.
You can remove a field only if it was optional, and you must never reuse its tag number afterward.
Changing a field's datatype is possible in limited cases but risks losing precision or truncating values (e.g. 64-bit to 32-bit).

Thrift and Protobuf differ in details — Thrift offers several encoding flavors (BinaryProtocol, CompactProtocol) and a richer set of container types — but the field-tag mechanism and the evolution rules are essentially the same.

Avro

Apache Avro (born from the Hadoop ecosystem) takes a different approach. It also uses a schema, but the encoded bytes contain nothing but values — no tag numbers, no field names, no type annotations. That makes Avro encodings the most compact of the three, but it raises an obvious question: how does the reader know what the bytes mean?

person.avsc — Avro schema (JSON form)

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests",      "type": {"type": "array", "items": "string"}}
  ]
}

Writer's Schema vs Reader's Schema

Avro's answer is the key idea of the chapter. When data is encoded, it's encoded with the writer's schema — whatever version the producing code had. When data is decoded, the reader expects a reader's schema — whatever version the consuming code has. These two schemas need not be identical; they only need to be compatible. The Avro library resolves the differences by looking at both schemas side by side:

Fields are matched by name, not by position or tag, so the order can differ.
A field in the writer's schema but not the reader's is ignored.
A field in the reader's schema but not the writer's is filled in with the reader's default value.

This is why every field you add or remove must have a default — that default is exactly what lets schema resolution paper over the version gap, giving both backward and forward compatibility.

How the Reader Learns the Writer's Schema

The reader needs the writer's schema to decode. Avro handles this differently per context:

Large files with many records (the Hadoop case) — the writer's schema is included once at the start of the file, amortized over millions of records.
A database of records, each possibly written with a different schema — store a version number with each record and keep a registry of schema versions (this is exactly what Confluent's Schema Registry does for Kafka).
Sending records over a network connection — the two processes negotiate the schema version on connection setup (an RPC handshake).

why avro shines

Because Avro has no tag numbers and matches fields by name, it's ideal when schemas are generated dynamically — for example, auto-derived from a relational database's columns. If the DB schema changes, you just generate a new Avro schema; there are no tag numbers to assign by hand and no risk of accidentally reusing one. That property is a big reason Avro is popular in Hadoop, Kafka, and data-pipeline tooling.

The Merits of Schemas

Stepping back, the schema-based binary formats share a set of advantages that explain why they dominate at scale, even though they're less convenient than dumping JSON:

Compactness — they can omit field names from the encoded data entirely.
Documentation that can't drift — the schema is a precise, always-up-to-date description of the data, and code is generated from it.
Compatibility checking — keeping a database of schema versions lets you check forward and backward compatibility of a change before you deploy it.
Code generation — for statically typed languages, generated classes give compile-time type checking and IDE autocompletion.

Aspect	Textual (JSON/XML)	Schema binary (Protobuf/Thrift/Avro)
Readability	Human-readable	Opaque bytes (need schema)
Size	Verbose; field names repeat	Compact; names dropped
Schema	Optional, often skipped	Required and enforced
Field identity	By name in the data	By tag (PB/Thrift) or schema (Avro)
Evolution	Ad-hoc, error-prone	Explicit rules, checkable
Best for	Open/public APIs, debugging	High-volume internal data

Modes of Dataflow

The second half of the chapter zooms out: whenever you send data to another process that doesn't share your memory, you encode it. There are three main modes of dataflow, and each is a place where compatibility matters.

Dataflow Through Databases

With a database, the process that writes encodes the data and the process that reads decodes it. Those two processes may be the same application at different times — so in a sense you are sending a message to your future self. Backward compatibility is clearly needed (future code must read what past code wrote). But forward compatibility matters too, and in a sticky way: during a rolling upgrade, new code may write a record with a new field, and then old code may read that record, modify it, and write it back. If the old code doesn't understand the new field, the danger is that it drops the field it didn't recognize — silently losing data. The fix is for the format and the code to preserve unknown fields on a round trip.

Kleppmann's slogan for this section is "data outlives code." You might deploy a new version of your code in minutes, but the data in your database can be years old. Rewriting (migrating) every old record to a new schema is expensive, so most databases instead allow simple schema changes — adding a column with a null default, for example — and decode old rows on the fly. LinkedIn's document store Espresso uses Avro precisely to get these evolution properties.

Dataflow Through Services: REST and RPC

When processes communicate over a network, the common arrangement is clients and servers: servers expose an API, clients call it. The web works this way (browsers and web servers), and server-side applications are increasingly decomposed into smaller services that call each other — service-oriented or microservices architecture. A key goal is that services can be deployed and evolved independently, which means old and new versions of clients and servers must interoperate — the same compatibility problem again.

Two broad philosophies for web services:

REST — a design philosophy built on the principles of HTTP: resources identified by URLs, manipulated with the standard verbs (GET, POST, PUT, DELETE), simple formats, and cacheability. RESTful APIs are simple and dominant for public APIs.
SOAP — an XML-based protocol that deliberately avoids using HTTP features, described by a verbose machine-readable contract (WSDL) with heavy tooling and code generation. Now largely out of favor outside legacy enterprise systems.

The Problems with RPC

Remote Procedure Call (RPC) frameworks try to make a network request look like calling a local function in your own process (this is called location transparency). Kleppmann argues this abstraction is fundamentally flawed, because a network request differs from a local call in ways you can't paper over:

A local call is predictable — it succeeds or fails based only on your code. A network request can fail for reasons outside your control: the request or response may be lost, or the remote machine may be slow or unavailable.
A local call returns a result, throws an exception, or never returns (infinite loop / crash). A network request has another outcome: it may time out with no answer at all, and you simply don't know whether it succeeded.
Retrying a failed request is dangerous if the request actually got through but only the response was lost — you'd execute the action twice unless it's idempotent.
Latency is wildly variable, and every parameter must be encoded into bytes — easy for primitives, fraught for large objects or pointers.

interview-grade point

This is the chapter's connection to the "fallacies of distributed computing." Pretending the network is reliable, fast, and homogeneous — the lie that location-transparent RPC tells — is precisely what causes systems to behave badly in production. A good answer names the difference: a remote call can time out with an unknown outcome, which a local call never does, and that uncertainty drives the need for idempotency, retries, and timeouts.

Modern RPC frameworks are more honest about this: gRPC (built on Protobuf), Thrift, Finagle, and Avro RPC expose the asynchronous nature with futures/promises and streams, and add service discovery. RPC is still a fine fit for requests between services owned by the same organization, typically within one datacenter. For API evolution, services have it easier than databases: you can often update all the servers before the clients, so it's reasonable to maintain compatibility only across a few versions, signalling the version in the URL or an HTTP header.

Message-Passing Dataflow

The third mode sits between RPC and databases: asynchronous message passing via a message broker (RabbitMQ, ActiveMQ, Kafka, NATS, and others). A sender (producer) posts a message to a named queue or topic; the broker stores it and delivers it to one or more consumers. Like RPC, the message goes to another process with low latency; like a database, it goes through an intermediary that holds the data temporarily. The advantages over direct RPC:

Buffering — the broker acts as a buffer if the recipient is unavailable or overloaded, improving reliability.
Redelivery — it can automatically redeliver messages to a process that crashed, preventing message loss.
Decoupling — the sender doesn't need to know the recipient's IP address or port, which is especially useful in a cloud where instances come and go.
One-to-many — one message can be delivered to several consumers (publish/subscribe).
Sender independence — the sender simply publishes and forgets; it usually doesn't expect a reply.

Because messages are just byte sequences with some metadata, all the same encoding and compatibility concerns apply — and the asynchronous, decoupled nature actually makes forward/backward compatibility more important, since producers and consumers are deployed and upgraded entirely independently.

Distributed Actor Frameworks

A related model is the actor model: concurrency is expressed as actors — independent entities that hold local state and communicate only by sending each other asynchronous messages, sidestepping shared-memory threading problems. In a distributed actor framework (Akka, Microsoft Orleans, Erlang OTP), this message-passing model is extended transparently across nodes; because messages are already the unit of communication, the same framework scales an application from one machine to many. The catch returns one last time: when you do a rolling upgrade of an actor-based application, you must still ensure messages encoded by one version can be decoded by another.

takeaway

Encoding is where the abstract idea of "evolving a system" meets concrete bytes. Pick a format that makes compatibility explicit (schema-based binary for high-volume internal data; JSON for open APIs), and remember that databases, services, and message brokers are all encoding boundaries where old and new code meet. Get this right and you can change a running system fearlessly; get it wrong and every deploy risks silently corrupting or dropping data.

🎯 interview hot-takes

Backward vs forward compatibility? Backward = new code reads old data (easy). Forward = old code reads new data (hard; it must ignore unknown fields). Rolling upgrades need both at once.
Why are field tags so important in Protobuf/Thrift? The numeric tag, not the name, identifies a field in the bytes — that's what keeps it compact and what makes the evolution rules (add optional fields with new tags, never reuse a tag) safe.
What's special about Avro? It matches a writer's schema against a reader's schema by field name, with defaults filling gaps — no tag numbers — which is perfect for dynamically generated schemas like Kafka + Schema Registry.
Why is location-transparent RPC criticized? A network call can time out with an unknown result, unlike a local call; pretending otherwise ignores partial failure and forces you to design for idempotency, retries, and timeouts.