Replication (Chapter 5) keeps copies of the same data on several nodes. But if a dataset is too big or a query load too heavy for any single node, you need to break the data into pieces called partitions (also known as shards). Chapter 6 is about how to split data sensibly so that each piece can live on a different node, letting you scale storage and throughput roughly linearly with the number of machines. Partitioning is almost always combined with replication: each partition is itself replicated across several nodes for fault tolerance.
- Partitioning = scalability — spread data and query load across many nodes so no single machine is the bottleneck. Combined with replication for fault tolerance.
- Key-range partitioning keeps keys sorted, enabling efficient range scans, but risks hot spots when access is skewed (e.g. a timestamp key sends all of today's writes to one partition).
- Hash partitioning spreads load evenly but destroys key ordering, so range queries must hit every partition.
- Hot spots can survive hashing — a single celebrity key still lands on one partition; the app must split it (e.g. random suffix).
- Secondary indexes don't map cleanly to partitions — local (by document) indexes make writes cheap but reads scatter/gather; global (by term) indexes make reads cheap but writes touch many partitions.
- Never use hash mod N for rebalancing — changing N moves almost everything. Use a fixed large number of partitions, or dynamic split/merge instead.
Choose a partitioning scheme (key-range for range scans, hash for even load) and watch for skew and hot spots. Secondary indexes force a trade-off between cheap writes (local/document-partitioned) and cheap reads (global/term-partitioned). Rebalancing should move as little data as possible — fixed partition counts or dynamic splitting, never hash mod N. Finally, something must route each request to the right partition: a routing tier, a partition-aware client, or a coordination service like ZooKeeper.
Partitioning and Replication, Together
Partitioning and replication are orthogonal and usually combined. The data is split into partitions, and each partition is stored on multiple nodes per the replication scheme of Chapter 5. A node may store more than one partition; in a leader-follower setup, each node can be the leader for some partitions and a follower for others. Keeping the two ideas separate in your head is essential — partitioning is about splitting, replication is about copying.
Partitioning of Key-Value Data
The goal is to spread the data and the query load evenly. If the split is unfair, some partitions get more data or queries than others — this is called skew, and a partition with disproportionately high load is a hot spot. A hot spot defeats the purpose of partitioning: one overloaded node bottlenecks the whole system while others sit idle. The simplest way to avoid skew would be to assign records to nodes randomly, but then you'd have no way to find a particular record without querying every node. So we need a scheme that's both even and lookup-friendly. Two main approaches.
Partitioning by Key Range
Assign a continuous range of keys (from some minimum to some maximum) to each partition, like the volumes of a paper encyclopedia (A–B, C–D, …). If you know the boundaries, you know which partition holds a key. The boundaries need not be evenly spaced — they're chosen to balance the data. The big advantage is that keys stay sorted within each partition, so range scans are efficient. The big risk is hot spots from a skewed access pattern: if the key is a timestamp, then every write for the current day goes to the same partition, overloading it while historical partitions are idle. Used by HBase and Bigtable. A common mitigation is to prefix the key with something else (e.g. a sensor name) so writes spread out.
Partitioning by Hash of Key
Apply a hash function to each key and assign hash ranges to partitions. A good hash function turns skewed input into a uniform distribution, so data and load spread evenly. The cost: you lose the ability to do efficient range queries, because adjacent keys are now scattered across all partitions — a range scan must query every partition. Used by Cassandra and MongoDB (with hash-based sharding). Cassandra offers a compromise with compound primary keys: the first column is hashed to choose the partition, and the remaining columns are used to sort data within that partition — so you get even distribution across partitions but efficient range scans within one.
Hot Spots Even with Hashing
Hashing reduces but doesn't eliminate hot spots. A single heavily-accessed key — a celebrity's user ID receiving millions of interactions — still hashes to a single partition, and no hash function helps because it's one key. Today most systems can't compensate automatically; it's left to the application. A typical trick is to add a random two-digit suffix to the hot key, splitting its writes across, say, 100 keys on different partitions — at the cost of having to read and combine all 100 on reads.
Key-range vs hash is fundamentally a trade between range-query efficiency and even load distribution. Range partitioning gives you cheap scans but invites hot spots on sequential keys; hash partitioning gives you even load but makes range scans expensive. Cassandra's compound key is the clever middle ground.
Partitioning and Secondary Indexes
Everything so far assumed a key-value model where you look up records by primary key. Real systems also have secondary indexes (find all records where color = red). Secondary indexes don't map neatly onto partitions, and there are two ways to handle them.
By Document (Local Index)
Each partition maintains its own secondary index covering only the documents in that partition — a local index. Writes are simple: you only touch the one partition holding the document. But reads are expensive: since the matching documents could be in any partition, a query on the secondary index must be sent to all partitions and the results combined. This scatter/gather approach makes read latency hostage to the slowest partition and is prone to tail-latency amplification. Most databases (MongoDB, Cassandra, Elasticsearch, Riak) use document-partitioned indexes.
By Term (Global Index)
Construct a global index that covers all partitions, but partition the index itself — by the term being looked up. For example, all records with color = red are listed in one index partition regardless of where the documents live. Reads are now efficient: a query goes to just the partition holding that term. But writes are slower and more complex: a single document write may affect multiple terms, and thus multiple index partitions, often requiring a distributed transaction. In practice global secondary indexes are usually updated asynchronously, so the index may briefly lag the data.
| Aspect | Local (by document) | Global (by term) |
|---|---|---|
| Index scope | One partition's own docs | All partitions, split by term |
| Writes | Cheap — one partition | Costly — many partitions |
| Reads | Scatter/gather all partitions | Hit one partition |
| Freshness | Synchronous | Often async (lags) |
| Used by | MongoDB, Cassandra, ES | DynamoDB global indexes |
Rebalancing Partitions
Over time things change: query throughput grows (add CPUs), the dataset grows (add disks), or a machine fails (others must take over). All of these require moving data and load from one node to another — rebalancing. A good rebalancing scheme has three properties: load ends up fairly shared, the database keeps serving reads and writes during the move, and only the necessary data is moved (to limit network and disk I/O).
The hash mod N Trap
The tempting naive scheme — assign key to hash(key) mod N nodes — is exactly the wrong choice, because changing N reshuffles almost every key.
key 123456, hash = 123456
N = 10 : 123456 mod 10 = 6 → node 6
N = 11 : 123456 mod 11 = 3 → node 3 (moved!)
N = 12 : 123456 mod 12 = 0 → node 0 (moved again!)
adding ONE node remaps almost every key → huge data movement
Better Strategies
- Fixed number of partitions. Create many more partitions than nodes up front (say 1,000 partitions for 10 nodes) and assign several to each node. When a node is added, it steals whole partitions from existing nodes until balance is restored; only entire partitions move, not individual keys. Used by Riak, Elasticsearch, Couchbase. The catch: you must pick the partition count wisely at the start, since it's fixed.
- Dynamic partitioning. Partitions split when they grow past a size threshold and merge when they shrink, like a B-tree — so the number of partitions adapts to the data volume. Used by HBase and RethinkDB. Avoids guessing the count in advance.
- Proportional to nodes. Have a fixed number of partitions per node; adding a node splits some existing partitions to take over. Used by Cassandra.
Fully automatic rebalancing combined with automatic failure detection is dangerous: if a node is merely slow (not dead), automatically moving its load elsewhere adds load to an already-stressed system and can cascade into a wider outage. Many systems keep a human in the loop to approve rebalancing — slower, but far safer.
Request Routing
Once data is spread across nodes and partitions move around, a client needs to know which node to contact for a given key — an instance of the general service discovery problem. There are three broad approaches:
- Contact any node (e.g. via a round-robin load balancer); if that node doesn't own the key, it forwards the request to the right node and relays the reply.
- Send all requests to a routing tier that knows the partition assignment and acts as a partition-aware load balancer.
- Make clients partition-aware so they connect directly to the owning node.
The hard part is that whoever makes the routing decision must have up-to-date knowledge of partition assignments, which change during rebalancing. Many systems delegate this to a separate coordination service like ZooKeeper, which authoritatively tracks the cluster's partition-to-node mapping and notifies routers of changes. (Cassandra and Riak instead use a gossip protocol so nodes share state among themselves, avoiding the external dependency.) For analytic workloads, massively parallel processing (MPP) query engines take this further, splitting a single complex query into stages that run across many partitions in parallel.
Partitioning is how databases scale beyond one machine, and the recurring enemy is skew: an uneven split that creates a hot spot and wastes your cluster. Pick the scheme that matches your queries (range vs hash), handle secondary indexes deliberately (cheap writes vs cheap reads), rebalance by moving whole partitions rather than rehashing the world, and have a robust story for routing requests as partitions move.
Key-range vs hash partitioning? Range keeps keys sorted (cheap scans) but risks hot spots on sequential keys; hash spreads load evenly but kills range queries. Cassandra's compound key gets both.
Local vs global secondary index? Local (by document) = cheap writes, scatter/gather reads; global (by term) = cheap reads, expensive multi-partition writes (usually async).
Why not hash mod N? Changing N remaps almost every key. Use a fixed large partition count or dynamic split/merge so only whole partitions move.
How are requests routed? Via a routing tier, partition-aware clients, or any-node forwarding — usually backed by a coordination service (ZooKeeper) or gossip to track assignments.