DDIA Ch.9 — Consistency & Consensus

Chapter 9 is the climax of Part II. Having catalogued everything that can go wrong (Chapter 8), Kleppmann now builds the positive tools: general-purpose abstractions with strong guarantees that let applications ignore some of the chaos beneath them. The two big ideas are linearizability (the strongest single-object consistency model) and consensus (getting nodes to agree on something despite faults). The chapter's punchline is that a surprising number of problems — leader election, uniqueness constraints, atomic commit, lock services — are all equivalent to consensus, so solving consensus once (or outsourcing it to ZooKeeper) solves them all.

⚡ Quick Takeaways

Linearizability makes a replicated system look like a single copy with atomic operations — a recency guarantee: once a write completes, every later read sees it.
It's not free — the CAP theorem says that during a network partition you must choose between linearizable (and unavailable on one side) or available (and not linearizable). It's also slow.
Causality is a weaker, cheaper order — causal consistency is the strongest model that survives partitions, captured with version vectors or Lamport timestamps.
Total order broadcast (deliver the same messages to all nodes in the same order) is equivalent to consensus and underlies state-machine replication.
Two-phase commit gives atomic commit across nodes but blocks if the coordinator dies (the in-doubt problem) — a single point of failure.
Consensus algorithms (Paxos, Raft, Zab) use majority quorums to agree safely; ZooKeeper/etcd package this so you don't implement it yourself.

tldr

Linearizability is the gold-standard "behaves like one machine" guarantee, but CAP and latency make it costly, so weaker models (causal consistency) are often the right call. When you genuinely need agreement — who is the leader, did the transaction commit, is this name unique — you need consensus, which is equivalent to total order broadcast. Don't roll your own; use a consensus system like ZooKeeper or etcd.

Linearizability

The idea behind linearizability (also called strong consistency or atomic consistency) is to make a system appear as if there is only one copy of the data, and as if all operations on it are atomic. Even though data is replicated across many nodes, clients should never observe the replication: the moment one client's write completes, every subsequent read — from any client, on any replica — must return that new value (or a newer one). It's fundamentally a recency guarantee. A classic illustration: two people looking at a sports score on their phones; if one sees the final result, the other must not still see "in progress."

Linearizability vs Serializability

These names sound alike and are constantly confused, but they're different guarantees:

Aspect	Linearizability	Serializability
About	Recency of single-object reads/writes	Isolation of multi-object transactions
Guarantee	Looks like one copy, real-time order	Txns equivalent to some serial order
Concerns	Replication / freshness	Concurrency / interleaving
Together	Strict serializability = both at once (e.g. two-phase locking, Spanner)

When You Need It

Several things break without linearizability:

Locking and leader election — a lock must be linearizable so that only one node believes it holds the lock; this is why systems use ZooKeeper/etcd for coordination.
Uniqueness constraints — guaranteeing a username or email is unique requires a linearizable register (otherwise two nodes both accept the same name).
Cross-channel timing dependencies — e.g. a web server writes an image to storage, then sends a message to a resizer over a separate channel; without linearizability the resizer may read an old or missing version.

How It's Implemented, and the CAP Cost

Which replication methods are linearizable? Single-leader replication can be (if you read only from the leader, and it really is the leader). Consensus algorithms are. Multi-leader is not. Leaderless (Dynamo-style) is generally not linearizable even with strict quorums, because clock skew and the timing of read repair break the recency guarantee. The deep reason linearizability is hard is captured by the CAP theorem: when a network partition occurs, a system must choose — remain consistent (linearizable) by refusing requests on the side that can't reach a quorum (sacrificing availability), or remain available by serving possibly-stale data (sacrificing linearizability). CAP is often stated too broadly; it only concerns one fault (partitions) and one model (linearizability). But the practical lesson holds: linearizability requires network round-trips, so it's slow, and many systems deliberately give it up for performance and availability.

Ordering and Causality

Ordering keeps recurring because it's deeply tied to causality. Causality imposes an order on events: a question must come before its answer; a row must be created before it's updated. Causality defines a partial order — some events are ordered (causally related), others are incomparable (concurrent). Linearizability, by contrast, imposes a total order: every operation is in one timeline. Linearizability implies causality, but it's stronger (and costlier) than necessary.

Causal consistency is the strongest consistency model that does not require waiting on the network and can remain available during a partition — making it very attractive. To track causality you can use version vectors. Lamport timestamps give a total order that is consistent with causality (each node keeps a counter, attaches it to messages, and bumps it to the max it has seen) — but a Lamport timestamp alone can't tell you when an order is finalized (whether some other operation with a lower number is still in flight).

Total Order Broadcast

What single-leader systems really need is a way to decide on a total order of operations that's fixed and known. Total order broadcast (atomic broadcast) is the formal version: a protocol for delivering messages to all nodes reliably and in the same order. It's the basis of state machine replication — if every replica applies the same operations in the same order, they all end up in the same state. Total order broadcast is, it turns out, equivalent to consensus, and it's exactly what systems like ZooKeeper and etcd provide internally.

Distributed Transactions and Consensus

Consensus — getting several nodes to agree on something — is one of the most important and most subtle problems in distributed computing. It sounds simple but is fraught, especially because of the failures in Chapter 8.

Two-Phase Commit (2PC)

The standard algorithm for atomic commit across multiple nodes — ensuring either all nodes commit a transaction or all abort. A coordinator drives two phases: first it asks every participant "can you commit?" (prepare); each replies yes only if it's certain it can. If all vote yes, the coordinator sends commit; if any votes no, it sends abort. The promise a participant makes when it votes "yes" is irrevocable — which creates 2PC's fatal weakness.

2PC and the in-doubt problem

phase 1 (prepare):  coordinator → participants: "can you commit?"
                    participants → coordinator: "yes" (now they MUST obey)

phase 2 (commit):   coordinator → participants: "commit!"

  ✗ coordinator CRASHES after phase 1, before phase 2
    participants are stuck "in doubt" — they promised to commit,
    can't unilaterally decide, and must hold locks until the
    coordinator recovers.  → blocking; coordinator is a SPOF.

If the coordinator crashes after participants have voted yes but before it sends the decision, those participants are in doubt: they can't commit or abort on their own, so they block, holding locks, until the coordinator comes back. This makes the coordinator a single point of failure. (The XA standard implements 2PC across heterogeneous systems and suffers exactly these operational headaches.)

Fault-Tolerant Consensus

Consensus algorithms do better than 2PC by tolerating coordinator failure. A consensus algorithm must satisfy four properties: uniform agreement (no two nodes decide differently), integrity (no node decides twice), validity (a decided value was proposed by some node), and termination (every non-crashed node eventually decides — the fault-tolerance property). The well-known algorithms — Paxos, Raft, Zab, Viewstamped Replication — actually implement total order broadcast (deciding a sequence of values). They rely on majority quorums and an epoch/ballot number: each round of leadership has a unique increasing number, and a leader must collect votes from a quorum before deciding, which guarantees that any two quorums overlap and that an old leader can't override a newer one.

FLP, briefly

The famous FLP impossibility result proves that no algorithm can always reach consensus in a fully asynchronous model if even one node may crash. The escape hatch in practice: real systems are allowed to use timeouts (and sometimes randomization) to make progress, sidestepping the theoretical impossibility while still guaranteeing safety. Consensus has a cost too — it needs a majority to make progress and typically a leader, so it's sensitive to network delays.

Membership and Coordination Services

You rarely implement consensus yourself. Instead, you use a coordination service like ZooKeeper or etcd, which run a consensus algorithm internally and expose a small set of powerful primitives: linearizable atomic operations (e.g. compare-and-set for locks), total ordering of operations (ZooKeeper's zxid serves as a fencing token), failure detection (sessions and heartbeats), and change notifications (watches). With these you build leader election, partition/assignment management, service discovery, and distributed locks. The pattern is to outsource the hard consensus work to one of these battle-tested services and keep it off the critical path of your own application logic.

takeaway

The big realization of this chapter is the web of equivalences: linearizable compare-and-set, total order broadcast, atomic commit, leader election, and consensus are all reducible to one another. So you don't need to solve each separately — solve consensus once. And since correct consensus is extremely hard to implement, the engineering answer is almost always to lean on ZooKeeper, etcd, or a database that has consensus built in, rather than rolling your own.

🎯 interview hot-takes

Linearizability vs serializability? Linearizability = recency for single objects (looks like one copy); serializability = isolation for multi-object transactions. Strict serializability is both.
What does CAP actually say? During a network partition, choose consistency (linearizable, reject some requests) or availability (serve stale data). It's narrow — one fault, one model — but the trade is real.
Why is 2PC fragile? If the coordinator crashes after the prepare phase, participants are in doubt — they can't decide alone and block holding locks. The coordinator is a single point of failure.
Why use ZooKeeper instead of writing consensus? Consensus (Paxos/Raft) is famously hard to get right; ZooKeeper/etcd package linearizable ops, total ordering, and failure detection so you build leader election and locks on top.