Sharding is a technique for horizontally partitioning a database across multiple independent nodes, each called a shard, so that the dataset and the write load are distributed rather than concentrated on a single machine. When a single database server can no longer handle the volume of data or query throughput — and vertical scaling (bigger CPU, more RAM) has hit its ceiling — sharding is the next step. But sharding is not a free lunch: it trades a simple, strongly-consistent single-node model for a distributed system that makes cross-shard queries, transactions, and schema changes dramatically harder. Understanding when and how to shard — and which of the three core strategies to use — is one of the most consequential architectural decisions you will make.

This guide covers the full sharding decision space: horizontal vs vertical partitioning, all three sharding strategies in depth, the consistent hashing algorithm and why virtual nodes matter, shard key selection criteria, hotspot management, resharding with minimal disruption, cross-shard query handling, distributed transactions, and how sharding relates to replication. By the end you will have the vocabulary and mental models to make confident sharding decisions — and to articulate the tradeoffs clearly in a system design interview.

⚡ Quick Takeaways
  • Hash-based sharding gives even distribution but breaks range queries and requires full data migration when node count changes — unless you use consistent hashing.
  • Range-based sharding enables efficient range scans but risks write hotspots on sequential keys like timestamps or auto-increment IDs.
  • Directory-based sharding offers maximum flexibility at the cost of a high-availability lookup service on the critical path.
  • Consistent hashing minimizes data movement to ~k/n keys when adding one node; virtual nodes smooth out uneven distribution.
  • Shard key design is the most consequential decision — high cardinality, even write distribution, aligned with query patterns.
  • Cross-shard joins must be done in the application layer (fetch + merge in memory) — avoid them by designing the shard key around your access patterns.
  • Sharding and replication are orthogonal — each shard should itself be replicated for HA; you need both, not one or the other.
tldr

Sharding splits data across multiple database nodes by a shard key. Hash-based sharding gives even distribution; range-based sharding enables efficient range scans; directory-based sharding offers maximum flexibility. Consistent hashing minimizes data movement when resharding. The cost is real complexity: cross-shard queries, joins, and transactions require significant engineering effort. Shard only after exhausting vertical scaling, read replicas, and caching.

Horizontal database sharding across multiple nodes
Horizontal database sharding across multiple nodes

Before You Shard: The Scaling Ladder

Sharding is an expensive operational commitment. Before reaching for it, exhaust the options on the scaling ladder in order — each step adds less complexity than sharding:

  1. Vertical scaling — give the primary more CPU, RAM, and faster storage. Straightforward, but bounded by the largest commercially available instance type. A modern cloud instance can carry 192 vCPUs and 3 TB of RAM, handling hundreds of thousands of queries per second — most workloads never need to go beyond this.
  2. Read replicas — route read traffic to replicas; concentrate writes on the primary. Typically a 5–10× effective read capacity multiplier for read-heavy workloads (90:1 or higher read:write ratios). The primary still bottlenecks on writes.
  3. Caching — a Redis cache absorbing 95% of read traffic means your database only sees 5% of the full load. The effective capacity gain is enormous and requires no data architecture change.
  4. Functional decomposition (vertical partitioning) — split different business domains into separate databases, each running on its own server. The orders database, the catalog database, and the user database each scale independently without any single-table row-level splitting.
  5. Sharding (horizontal partitioning) — when a single table is too large or too hot for one server even after the above steps, split its rows across multiple shards.
rule of thumb

A single PostgreSQL primary with replicas and a Redis cache can comfortably handle 100,000 reads/sec and 10,000 writes/sec. At 1,000,000 reads/sec or 100,000 writes/sec, you start to need sharding on the write path. Most startups never reach that scale. Shard when the pain is real, not when the architecture diagram looks cooler.

Vertical vs. Horizontal Partitioning

The word "partitioning" means different things depending on whether you split along the column axis or the row axis. Understanding the distinction matters because they solve different problems and are often used together.

DimensionVertical PartitioningHorizontal Partitioning (Sharding)
What is splitColumns — different tables/columns on different servers (functional decomposition)Rows — the same table spread across multiple nodes
Scale limitBounded by the number of domain boundaries you can drawScales horizontally with node count — theoretically unbounded
Complexity introducedCross-service API calls instead of joins; service boundary designCross-shard queries, distributed transactions, resharding ops
Use caseMicroservice decomposition, separating read-heavy from write-heavy columnsA single table exceeding the capacity of one server (users, events, messages)
ExamplesUser service DB, Order service DB, Product service DBUsers 1–10M on shard 1, Users 10M–20M on shard 2, ...

Vertical partitioning (functional decomposition) is essentially what microservices do: each service owns its database, and the services communicate over APIs rather than via SQL joins. This is often the right first move because it eliminates cross-domain joins, allows independent scaling, and maps cleanly to team ownership — without any of the row-level complexity of horizontal sharding.

Horizontal sharding kicks in when a single service's table grows beyond what one server handles. The canonical examples: a social platform with 2 billion users, a payment system processing 100 million transactions per day, a messaging system storing 1 trillion messages. No vertical decomposition resolves these — the table itself must be split across machines.

The Three Sharding Strategies

Hash-Based Sharding

Apply a hash function to the shard key and compute shard_id = hash(key) % num_shards to determine which shard holds a given row. The hash function distributes values uniformly regardless of the key's natural ordering, so data and write load are spread evenly across all shards.

python
def get_shard(user_id: int, num_shards: int) -> int:
    # consistent hash across shards — uniform distribution
    return hash(user_id) % num_shards

# user_id=1234567 with 8 shards → shard 3
# all data for user 1234567 lives on shard 3
# range query "users WHERE id BETWEEN 1M AND 2M" → must query ALL 8 shards

Range-Based Sharding

Partition rows into contiguous ranges of the shard key — e.g., user IDs 1–1,000,000 on shard 1, 1,000,001–2,000,000 on shard 2. The shard boundaries are explicitly defined in a routing table that maps each range to a shard.

range sharding hotspot mitigation

Avoid using timestamps or sequential auto-increment IDs as the range shard key for active write workloads. Instead, use a compound key that combines a hash prefix with the timestamp: shard = crc32(user_id) % 16 determines the initial shard, and within that shard data is stored in time order. This gives range-query efficiency within a shard while distributing writes across all shards.

Directory-Based Sharding

A centralized lookup table (the "directory") explicitly maps each key (or key range) to its target shard. Every database request first queries the directory to find the correct shard, then queries that shard directly. The directory itself must be highly available and extremely low-latency since it is on the critical path of every database operation.

StrategyDistributionRange QueriesResharding CostFlexibility
Hash-basedUniformScatter-gather all shardsHigh (remaps ~all keys)Low (fixed algorithm)
Range-basedUneven (hotspot risk)Single or few shardsMedium (split ranges)Medium (ranges adjustable)
Directory-basedConfigurableDepends on mappingLow (update mapping)High (arbitrary placement)
Consistent hashingUniform with vnodesScatter-gatherLow (~k/n keys move)Medium

Consistent Hashing: The Algorithm in Depth

Standard modulo hash sharding has a catastrophic resharding problem: adding one node to an 8-node cluster changes the modulus from 8 to 9, invalidating 7/8 ≈ 88% of all key-to-shard mappings. Every one of those keys must be migrated to its new shard. This is prohibitively expensive for a live production system. Consistent hashing solves this elegantly.

The ring abstraction

Consistent hashing maps both nodes and keys onto a logical ring of integers (typically the full MD5 or SHA-1 hash space: 0 to 2128–1). Each node is placed at a position on the ring determined by hashing its identifier. Each key is placed at a position determined by hashing the key itself. A key is owned by the nearest node clockwise from its position.

When a node is added, it is placed at a new position on the ring and claims only the key range between itself and its counterclockwise neighbor. All other key assignments are undisturbed. When a node is removed, its key range passes to its clockwise successor. In both cases, only ~k/n keys move — a fraction of the total dataset proportional to the newly added or removed capacity.

python
import hashlib
from sortedcontainers import SortedDict

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.ring = SortedDict()          # sorted by hash position
        self.vnodes = virtual_nodes

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#vn{i}")
            self.ring[h] = node               # 150 virtual nodes per physical node

    def remove_node(self, node: str):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#vn{i}")
            del self.ring[h]

    def get_node(self, key: str) -> str:
        if not self.ring:
            raise ValueError("Ring is empty")
        h = self._hash(key)
        idx = self.ring.bisect_right(h) % len(self.ring)
        return self.ring.peekitem(idx)[1]   # clockwise nearest node

Virtual nodes: why 150?

With a small number of physical nodes (say, 3), the probability that the hash function places them evenly around the ring is low — you might end up with one node owning 60% of the key space and another owning 15%. Virtual nodes solve this by assigning each physical node multiple positions on the ring (150 is the Cassandra default). With 150 virtual nodes per physical node, the law of large numbers kicks in and load balances to within a few percentage points of ideal. The tradeoff is memory — the ring table grows by a factor of the virtual node count, but even at 1000 virtual nodes with 100 physical nodes, the ring table is only 100,000 entries, which is trivially small.

Virtual nodes also simplify heterogeneous clusters: a node with twice the memory and CPU capacity can be assigned 300 virtual nodes instead of 150, causing it to own ~twice the key space and bear twice the traffic — without any special-casing in the routing logic.

Shard Key Design: The Most Consequential Decision

The shard key is the column (or combination of columns) that determines which shard holds a given row. A bad shard key creates hotspots, forces scatter-gather on common queries, or breaks application-level join patterns. Once data is sharded, changing the shard key is extremely disruptive — it requires a full table reshuffle across shards. Get this right before you write the first row.

Properties of a good shard key

Real-world shard key examples

Hotspots and Resharding

Hotspots: when a shard becomes a bottleneck

A hotspot occurs when a disproportionate fraction of reads or writes routes to a single shard. This can happen for three reasons:

Resharding: adding capacity without a full migration

As data grows, individual shards fill up and performance degrades. Resharding — splitting existing shards or adding new ones — is one of the most operationally intensive tasks in a sharded system. The techniques, roughly from least to most disruptive:

  1. Read replica promotion — if one shard is write-hot but read-heavy, promote a replica to a co-primary that handles reads independently. This buys time but does not actually split the data.
  2. Shard splitting — divide one overloaded shard into two. For range sharding: update the boundary in the routing table and migrate the upper half of rows to the new shard. For consistent hashing: add a new node to the ring; only the key range between the new node and its predecessor migrates.
  3. Online schema change tooling — when resharding requires schema changes across all shards, use gh-ost (GitHub's Online Schema Change) or pt-online-schema-change to make changes without blocking production writes.
  4. Double-write + backfill pattern — write to both the old and new shard layout simultaneously while a background job copies existing data from old to new. Once the backfill is complete and verified, cut over reads to the new layout and stop double-writes. This allows zero-downtime resharding at the cost of temporary write amplification.
flow
# Online resharding: double-write + backfill pattern

Phase 1 — Dual-write (no downtime):
  All writes → old shard AND new shard layout
  Reads still served from old shard

Phase 2 — Backfill (background):
  Copy all existing rows from old → new layout
  Track progress by primary key range
  Verify checksums per batch

Phase 3 — Cutover (brief read pause or zero-downtime):
  Verify new shard is fully consistent with old
  Switch reads to new shard layout
  Stop dual-writes to old layout

Phase 4 — Cleanup:
  Drop old shard data after validation period

Cross-Shard Queries and Transactions

Cross-shard queries: the scatter-gather problem

Any query that does not include the shard key in its WHERE clause must be broadcast to all shards (scatter), and the results merged in the application layer (gather). This is called a scatter-gather fan-out, and it is the most common performance problem in sharded systems:

Mitigations:

Cross-shard transactions: the hardest problem

ACID transactions spanning multiple shards are one of the hardest problems in distributed systems. The two standard approaches, and their tradeoffs:

design insight

The cleanest way to avoid cross-shard transactions is to design your sharding so that every critical transaction touches only one shard. This usually means co-locating related data that participates in the same transaction. For an e-commerce platform, sharding both orders and cart data by user_id means the checkout transaction (which reads the cart and writes an order) is single-shard for a given user.

Sharding vs. Replication: Orthogonal Concerns

Sharding and replication are frequently confused but solve entirely different problems:

DimensionShardingReplication
What it solvesWrite throughput and dataset size beyond a single nodeRead throughput, high availability, and fault tolerance
Data distributionEach row exists on exactly one shard (no duplication)Each row exists on multiple replicas (full duplication)
Node failure impactOne shard lost = that subset of data unavailablePrimary lost = replica promotes; no data loss
Read scalabilityN shards × 1 primary each = N parallel write+read primariesFan reads to replicas; primary handles writes only
Are they exclusive?No — combine them. Each shard should itself be replicated: a 4-shard cluster with 3 replicas each = 12 total nodes.

A production sharded system always combines both: sharding for write scalability and dataset partitioning, replication for read scalability and high availability. Without replication, losing a single shard node makes an entire partition of your data unavailable. Without sharding, a single replicated primary becomes a write bottleneck as data grows.

Schema Changes Across Shards

DDL operations (adding a column, adding an index, changing a data type) must be applied to every shard — often 8, 16, or 64 identical schema mutations that each need to run without locking production traffic. The operational discipline required:

Real-World Sharding Architectures

Vitess (YouTube / PlanetScale)

Vitess is a sharding middleware layer built around MySQL that adds connection pooling, query routing, and horizontal sharding on top of standard MySQL. It was originally built by YouTube to handle billions of rows of video metadata. Vitess introduces a concept called VShards (logical shards) that can be remapped to physical MySQL instances without application code changes. The routing layer handles scatter-gather, cross-shard aggregation, and shard-aware SQL rewriting automatically. PlanetScale offers Vitess as a managed service.

Cassandra's built-in sharding

Cassandra is a distributed database that uses consistent hashing with virtual nodes internally — sharding is built into the storage engine, not bolted on at the application layer. Data is partitioned by a partition key, and rows with the same partition key are co-located (these are called wide rows or clustering columns). A well-designed Cassandra data model aligns the partition key with the most common access pattern and keeps partition sizes manageable (under ~100 MB). Cassandra's strengths — linear write scalability, no master node, tunable consistency — make it well-suited to high-write, known-access-pattern workloads like messaging systems, IoT time series, and activity feeds.

DynamoDB's transparent sharding

DynamoDB shards automatically based on your provisioned or on-demand capacity settings — the shard topology is invisible to the application. You design a partition key (hash key) and an optional sort key (range key), and DynamoDB routes requests to the correct internal shard. The constraint is that a single partition key cannot exceed 10 GB of storage or 3,000 RCUs / 1,000 WCUs — violating this creates a hot partition that throttles requests.

takeaway

Shard only when you have exhausted vertical scaling, read replicas, and caching — the added operational complexity is real and permanent. When you do shard, invest heavily in shard key design: high cardinality, even write distribution, and alignment with your most common query pattern. Use consistent hashing with virtual nodes to minimize data movement during topology changes. Accept that cross-shard queries and transactions are expensive and design your data model to make them rare. And remember: each shard needs its own replication — sharding and replication are orthogonal, complementary tools.

🎯 interview hot-takes

Hash vs range sharding — when to pick each? Hash for even write distribution across many keys (user_id, device_id); range when you need efficient time-window or contiguous key scans and can tolerate hotspot risk (or you route writes through a hash prefix to distribute them).
Why is consistent hashing better than modulo sharding? Adding a node with modulo hashing remaps ~(N-1)/N of all keys (nearly everything); consistent hashing moves only ~k/n keys to the new node, minimizing data migration and allowing live resharding with minimal disruption.
What makes a bad shard key? Low cardinality (boolean, enum), monotonically increasing values (timestamps, auto-increment) with hash sharding, or keys misaligned with query patterns — all create hotspots or force expensive scatter-gather fan-outs.
How do you handle a cross-shard transaction? Prefer co-locating related transactional data on the same shard by design. When unavoidable, use the Saga pattern (eventual consistency with compensating actions) over 2PC (which creates blocking and coordinator bottlenecks).
What is the difference between sharding and replication? Sharding distributes different rows across nodes for write scalability; replication copies the same rows to multiple nodes for read scalability and HA. Production systems need both — each shard is itself replicated.

← previous
Caching