Chapter 5 opens Part II of DDIA, "Distributed Data." Replication means keeping a copy of the same data on multiple machines connected via a network. We do it for three reasons: to keep data geographically close to users (lower latency), to keep the system working even when some parts fail (availability), and to scale out the number of machines serving read queries (throughput). If the data never changed, replication would be trivial — copy it once and you're done. The entire difficulty, and the whole chapter, is about handling changes to replicated data. Kleppmann frames everything around three algorithms: single-leader, multi-leader, and leaderless replication.
- Three approaches — single-leader (one node accepts writes), multi-leader (several do), and leaderless (clients write to many nodes directly). Almost every distributed datastore is a variation on these.
- Sync vs async is a durability/availability trade — synchronous replication guarantees an up-to-date copy but blocks if a follower is down; asynchronous is fast but can lose recently-acknowledged writes on failover.
- Failover is dangerous — promoting a new leader risks lost writes, split brain (two leaders), and bad timeout tuning. Automatic failover is genuinely hard to get right.
- Replication lag breaks intuitive guarantees — you need read-after-write, monotonic reads, and consistent-prefix reads to paper over an asynchronous follower that has fallen behind.
- Multi-leader's core problem is write conflicts — the same record edited on two leaders must be reconciled (LWW, CRDTs, custom merge).
- Leaderless uses quorums — if w + r > n, read and write sets overlap, so a read sees the latest write. Read repair and anti-entropy heal stale replicas.
Single-leader replication is simple and the default (most relational DBs): writes go to a leader, which streams a change log to followers. Multi-leader adds write availability across datacenters at the cost of conflict resolution. Leaderless (Dynamo-style) drops the leader entirely and relies on quorums plus repair. The recurring theme is the tension between consistency and availability when the network is unreliable and replication is asynchronous.
Single-Leader Replication
The most common approach, used by PostgreSQL, MySQL, SQL Server, MongoDB, and many others, is leader-based replication (also called active/passive or master/slave). One replica is designated the leader. All writes must go to the leader, which first writes the new data to its own local storage. The other replicas are followers: whenever the leader writes, it also sends the change to every follower as part of a replication log or change stream. Each follower applies the changes in the same order the leader processed them. Reads can be served by either the leader or any follower, but writes are leader-only.
Synchronous vs Asynchronous Replication
A crucial configuration knob is whether replication happens synchronously or asynchronously:
- Synchronous — the leader waits until a follower confirms it received the write before reporting success to the client. The follower is guaranteed to have an up-to-date copy, but if it doesn't respond (crash, network fault), the write can't proceed — the leader must block all writes.
- Asynchronous — the leader sends the change and does not wait. Writes are fast and the leader keeps working even if followers fall behind, but if the leader fails before a write propagates, that write is lost even though it was confirmed to the client.
Making every follower synchronous is impractical — a single slow node would halt the whole system. A common compromise is semi-synchronous: one follower is synchronous, and if it gets too slow, another follower is made synchronous in its place. This guarantees at least two nodes have the latest data. In practice, many leader-based systems run fully asynchronous, accepting the risk of losing recent writes in exchange for performance and availability.
Setting Up New Followers
To add a follower without downtime, you can't just copy files while writes are happening. The standard procedure: take a consistent snapshot of the leader's database (most DBs support this without locking), copy it to the new follower, then have the follower connect to the leader and request all changes that happened since the snapshot — identified by an exact position in the replication log (PostgreSQL calls it a log sequence number, MySQL a binlog coordinate). Once the follower has caught up, it processes changes live.
Handling Node Outages
Any node can go down, and a goal of replication is to keep the system running through individual failures. How you recover depends on which node failed.
Follower Failure: Catch-Up Recovery
This is the easy case. Each follower keeps a log of the changes it has received. After a crash or a network blip, it knows the last transaction it processed; it reconnects to the leader and requests every change since then, applies them, and is back in sync. No human intervention needed.
Leader Failure: Failover
This is the hard case. If the leader fails, one of the followers must be promoted to be the new leader, clients must be reconfigured to send writes to it, and the other followers must start consuming changes from it. This process — called failover — can be manual or automatic, and it is riddled with pitfalls:
- Lost writes. With asynchronous replication, the promoted follower may not have received the leader's most recent writes. If the old leader rejoins, those writes conflict with new ones. The usual resolution is to discard the old leader's unreplicated writes — which violates clients' durability expectations.
- Dangerous interactions with other systems. Kleppmann's cautionary tale: at GitHub, an out-of-date MySQL follower was promoted; it had reused primary keys (autoincrement) that the old leader had already assigned, and those keys were also used in Redis, causing private data to leak to the wrong users.
- Split brain. Two nodes may both believe they're the leader and both accept writes, with no way to reconcile — data is corrupted or lost. Some systems shut one down, but a careless mechanism can shut down both.
- Timeout tuning. How long before you declare a leader dead? Too long means a longer outage; too short means unnecessary failovers that make things worse, especially under load spikes.
There are no easy answers to any of these failover problems — which is exactly why many operations teams prefer to perform failover manually, even though that means a longer outage. The deep reason these issues are so thorny is the unreliability of networks, clocks, and the difficulty of agreeing on a single truth — the subjects of Chapters 8 and 9.
Implementation of Replication Logs
How does the leader actually communicate changes to followers? There are several methods, each with trade-offs.
- Statement-based — the leader ships the literal write statements (e.g. each
INSERT/UPDATE). Compact, but breaks on nondeterminism:NOW(),RAND(), autoincrement columns, and statements with side effects produce different results on each replica. Largely abandoned for this reason. - Write-ahead log (WAL) shipping — replicate the exact byte-level log the storage engine already writes. Very precise, but tightly couples replication to the storage engine's internal format, so the leader and followers usually must run the same database version — making zero-downtime upgrades hard.
- Logical (row-based) log — a separate log describing changes at the row level (which row, which columns, new values), decoupled from the storage engine internals. This decoupling allows version mismatches and lets external systems consume the stream — the basis of change data capture (CDC).
- Trigger-based — application-level replication using database triggers/stored procedures. Most flexible (you can transform or filter data) but has more overhead and is more bug-prone.
Problems with Replication Lag
Reading from asynchronous followers lets you scale reads cheaply, but a follower may lag behind the leader. If you read from a lagging follower, you may see stale data — the system is eventually consistent, but "eventually" is deliberately vague and could be seconds or minutes under load. Three specific anomalies, and the guarantees that fix them, come up constantly in interviews:
Read-After-Write Consistency
A user submits a change, then reloads — and if their read hits a follower that hasn't received the write yet, their own update appears to have vanished. Read-after-write (or read-your-writes) consistency guarantees a user always sees their own writes. Techniques: read things the user may have modified from the leader; track the timestamp of the user's last write and only read from a follower that's caught up to it.
Monotonic Reads
If a user makes several reads from different replicas, they might see a value, then a moment later see an older value — time appearing to move backwards. Monotonic reads guarantees that once you've read newer data, you won't later read older data. A simple implementation: each user always reads from the same replica (e.g. chosen by a hash of their user ID).
Consistent Prefix Reads
If a sequence of writes happens in a certain order, anyone reading them should see them in that same order. Violations are jarring — you might see an answer to a question before the question itself. Consistent prefix reads guarantees causally-related writes are read in order. This is especially a problem in partitioned (sharded) databases where different partitions replicate at different speeds.
"Eventual consistency" is a weak guarantee that pushes a lot of complexity onto application developers. Rather than hoping the lag stays small and being surprised when it doesn't, it's better to think in terms of these specific guarantees — and to recognize that providing stronger guarantees is exactly what transactions and consensus (later chapters) are for.
Multi-Leader Replication
Single-leader has one obvious downside: all writes funnel through one node. Multi-leader replication allows more than one node to accept writes; each leader simultaneously acts as a follower to the other leaders. The main use cases:
- Multi-datacenter operation — a leader in each datacenter, so writes are local (low latency) and each DC keeps working if another, or the link between them, fails.
- Clients with offline operation — e.g. a calendar app on your phone is effectively a leader (it accepts writes offline) that syncs with the server when reconnected.
- Collaborative editing — real-time tools like Google Docs are a form of multi-leader replication.
Write Conflicts
The big problem multi-leader introduces is write conflicts: the same data can be modified on two different leaders concurrently, and the conflict isn't detected until the changes are asynchronously merged. Approaches:
- Conflict avoidance — route all writes for a given record to the same leader, sidestepping conflicts. Breaks down when that leader fails or the user moves.
- Converge to a consistent state — every replica must arrive at the same final value. Options include last-write-wins (LWW) (pick the write with the highest timestamp — simple but loses data), giving each write a unique ID and picking the highest, or recording the conflict and resolving it later.
- Custom resolution logic — let the application code merge conflicts, on write or on read.
- Automatic data structures — CRDTs (conflict-free replicated data types) and operational transformation are designed to merge concurrent edits sensibly.
Replication Topologies
With more than two leaders, you must decide the path writes take between them. All-to-all (every leader sends to every other) is the most robust but can suffer causality problems if some links are faster than others. Circular and star topologies reduce traffic but introduce a single point of failure and can break if a node goes down. To prevent infinite loops, each write is tagged with the identifiers of the nodes it has already passed through.
Leaderless Replication
The third approach abandons the concept of a leader entirely: any replica can accept writes directly from clients. This style was popularized by Amazon's Dynamo and is used by Cassandra, Riak, and Voldemort (often called Dynamo-style). The client (or a coordinator node acting on its behalf) sends each write to several replicas and reads from several replicas in parallel.
Quorum Reads and Writes
To know that a read sees the latest write despite some nodes being down, leaderless systems use quorums. With n replicas, a write must be confirmed by w nodes and a read must query r nodes. As long as w + r > n, the set of nodes written and the set read are guaranteed to overlap by at least one node — so a read will see at least one up-to-date copy.
n = 3 replicas
w + r > n guarantees the write set and read set overlap
example: w = 2, r = 2, n = 3
2 + 2 = 4 > 3 ✓ every read sees the latest write
tune for reads: w = 3, r = 1 (fast reads, slow/fragile writes)
tune for writes: w = 1, r = 3 (fast writes, must read widely)
Keeping Replicas in Sync
When a down node comes back, it has missed writes. Two mechanisms heal this:
- Read repair — when a client reads from several nodes and notices one has a stale value, it writes the newer value back to that node. Works well for frequently-read data.
- Anti-entropy — a background process continuously looks for differences between replicas and copies missing data. Unlike WAL-based replication, it doesn't preserve write order and may have significant delay.
Limitations of Quorums and Detecting Concurrent Writes
Quorums are not airtight. Sloppy quorums (with hinted handoff) let writes go to other reachable nodes when the "home" nodes are unavailable, increasing availability but weakening the overlap guarantee. And because multiple clients can write concurrently to a leaderless store, conflicts arise just like in multi-leader. Detecting them requires reasoning about the "happens-before" relationship: two operations are concurrent if neither knew about the other. Systems use version numbers per key, or version vectors across replicas, to tell whether one write supersedes another or whether they're truly concurrent (in which case the conflicting values, called siblings, must be merged by the application).
| Dimension | Single-leader | Multi-leader | Leaderless |
|---|---|---|---|
| Who accepts writes | One node | Several nodes | Any replica |
| Write conflicts | None (serialized) | Yes — must resolve | Yes — must resolve |
| Handling node loss | Failover (hard) | Other leaders continue | Quorum tolerates loss |
| Consistency | Strong-ish (if read leader) | Eventual | Eventual (tunable quorum) |
| Examples | PostgreSQL, MySQL | Multi-DC, CouchDB | Cassandra, Riak, Dynamo |
Replication is the foundation of every distributed datastore, and the three strategies are a spectrum of trade-offs between write availability and the cost of resolving conflicts. Single-leader avoids conflicts but bottlenecks and fails awkwardly; multi-leader and leaderless buy availability by forcing you to reconcile concurrent writes. Knowing where a given database sits — and what it does about lag and conflicts — is the heart of a strong distributed-systems interview answer.
Sync vs async replication? Sync guarantees a durable up-to-date copy but blocks on a slow follower; async is fast but can lose acknowledged writes on failover. Semi-sync (one sync follower) is the common compromise.
Why is failover hard? Lost async writes, split brain (two leaders), bad timeout tuning, and dangerous interactions with external systems — which is why many teams do it manually.
What does w + r > n give you? The read and write quorums overlap by at least one node, so a read is guaranteed to see the most recent write — the core leaderless consistency knob.
How do you detect concurrent writes? Track the happens-before relation with version numbers or version vectors; writes where neither saw the other are concurrent siblings that need merging.