An index is the single biggest lever on database performance, and "why is this query slow?" almost always comes down to indexing. The idea is simple — keep an auxiliary structure that lets you find rows without scanning the whole table — but the engineering has depth: the two dominant structures (B-tree and LSM-tree) make opposite trade-offs between read and write speed, and how you compose indexes (clustered, secondary, composite, covering) decides whether a query flies or crawls. This is the practical companion to our DDIA notes on storage & retrieval.
- An index trades write speed and space for read speed — it turns an O(n) table scan into an O(log n) lookup, but every write must also update the index.
- B-trees are read-optimized, sorted, balanced, updated in place — the default for relational databases (PostgreSQL, MySQL/InnoDB).
- LSM-trees are write-optimized — buffer writes in memory, flush to immutable sorted files, compact in the background (Cassandra, RocksDB).
- Clustered vs secondary — a clustered index is the table sorted by key; a secondary index is a separate structure pointing back to rows.
- Composite indexes obey the leftmost-prefix rule — an index on (a, b, c) helps queries on a, a+b, a+b+c — not on b alone.
- A covering index answers a query from the index alone, skipping the table lookup entirely.
- Don't over-index — each index slows writes and costs space; useless indexes are pure overhead.
Indexes make reads fast by avoiding full scans, at the cost of slower writes and more storage. B-trees (sorted, in-place, read-optimized) power most relational databases; LSM-trees (in-memory buffer + immutable SSTables + compaction, write-optimized) power many NoSQL stores. Beyond the structure, design matters: clustered vs secondary, composite indexes (leftmost-prefix rule), and covering indexes (answer from the index alone). Index the columns your queries filter/sort on — and no more.
Why Index at All
Without an index, finding rows matching WHERE email = 'x' means a full table scan — read every row, O(n). At a thousand rows that's fine; at a hundred million it's catastrophic. An index is a sorted (or otherwise organized) auxiliary structure that lets the database jump to matching rows in O(log n). The cost is real and worth stating clearly: the index consumes storage, and every INSERT/UPDATE/DELETE must update every index on the table. So indexing is a deliberate trade — faster reads for slower writes and more space.
B-Tree Indexes
The B-tree (specifically B+tree) is the workhorse of relational databases. It's a balanced tree of fixed-size pages (typically 4–16 KB) where keys are kept sorted; internal pages route you down to leaf pages holding the actual keys/pointers. Because it's balanced, lookups, insertions, and range scans are all O(log n), and the sorted leaves make range queries and ordered reads efficient.
[ 30 | 70 ] ◀ root (routing keys)
/ | \
[10|20] [40|50|60] [80|90] ◀ internal
/ | \ ... ...
leaves: sorted keys → row pointers, linked for range scans
lookup 50: root(→mid) → internal(→50) → leaf = O(log n), few page reads
B-trees update in place: changing a value rewrites the page that holds it. This makes reads fast and predictable, but writes incur random I/O (find and modify the right page) and a write-ahead log is needed for crash safety. B-trees are what you get by default in PostgreSQL, MySQL/InnoDB, SQL Server, and Oracle.
LSM-Tree Indexes
The Log-Structured Merge-tree optimizes for writes. Instead of updating pages in place, it buffers writes in an in-memory sorted structure (the memtable); when that fills, it's flushed to disk as an immutable sorted file (an SSTable). Writes are therefore fast sequential appends — no random I/O. Over time many SSTables accumulate, so a background process compacts them — merging sorted files and discarding overwritten/deleted keys.
writes → memtable (in-memory, sorted)
│ full? flush (sequential write)
▼
SSTable_1 SSTable_2 SSTable_3 ... (immutable, sorted)
└──────── compaction ────────┘ → merged SSTable
(drops overwritten/deleted keys)
read: check memtable, then newest→oldest SSTables
(Bloom filters skip SSTables that can't contain the key)
The downside: a read may have to check the memtable and several SSTables, so LSM reads can be slower (mitigated by Bloom filters that quickly rule out SSTables lacking a key — the same trick used in our URL shortener). LSM-trees power Cassandra, RocksDB, LevelDB, and the storage engines behind many NoSQL systems.
B-Tree vs LSM-Tree
| Aspect | B-Tree | LSM-Tree |
|---|---|---|
| Optimized for | Reads | Writes |
| Writes | In-place (random I/O) | Append + flush (sequential) |
| Reads | Fast, predictable | May check several SSTables |
| Amplification | Write amplification (page rewrites) | Read & compaction amplification |
| Used by | PostgreSQL, MySQL, Oracle | Cassandra, RocksDB, LevelDB |
Clustered vs Secondary Indexes
A clustered index determines the physical order of the table's rows on disk — the table is the index, sorted by the clustering key (usually the primary key). There's only one per table, and looking up by that key returns the full row immediately. A secondary (non-clustered) index is a separate structure whose leaves point back to the row (by primary key or row location); using it requires a second hop to fetch the full row. The implication: looking up by the clustered key is the cheapest possible read, while a secondary-index lookup costs an extra step.
Composite Indexes and the Leftmost-Prefix Rule
An index can span multiple columns — a composite index on (country, city, name). The key insight is that it's sorted by those columns in order, so it only helps queries that use a leftmost prefix of the columns:
WHERE country=? ✓ uses index
WHERE country=? AND city=? ✓ uses index
WHERE country=? AND city=? AND name=? ✓ fully uses index
WHERE city=? ✗ skips the leftmost col → no use
WHERE name=? ✗ no use
column ORDER matters: put the most selective / most-filtered column
considerations first, and match your query patterns.
This is one of the most common real-world indexing mistakes: a composite index in the wrong column order, or a query that filters on a non-leading column, silently falls back to a scan.
Covering Indexes
If an index contains all the columns a query needs (both the filter columns and the selected columns), the database can answer the query from the index alone — no trip to the table. This is a covering index, and it's a big win for hot queries: SELECT name WHERE country=? served by an index on (country, name) never touches the table rows. The trade-off is a wider index (more storage, slower writes), so reserve covering indexes for your highest-traffic queries.
When NOT to Index
Indexes aren't free, and over-indexing is a real anti-pattern:
- Write-heavy tables — every index multiplies write cost; a logging table with ten indexes will crawl.
- Low-cardinality columns — indexing a boolean or a status with three values rarely helps; the database may scan anyway.
- Small tables — a full scan of a few hundred rows is faster than index overhead.
- Unused indexes — pure cost (space + write slowdown) with no benefit; audit and drop them.
The way to know whether an index is used is the query planner: run EXPLAIN (or EXPLAIN ANALYZE) and look for an index scan vs a sequential/full scan. A surprise full scan on a big table is the classic "missing or unusable index" signal — often a non-leftmost composite column, a function applied to the column, or a type mismatch defeating the index.
Indexing is a trade: reads get O(log n) at the cost of write speed and space. Know your storage engine (B-tree = read-optimized in-place; LSM = write-optimized append+compact), and design indexes around your actual query patterns — clustered key for the cheapest reads, composite indexes obeying the leftmost-prefix rule, covering indexes for hot paths — while resisting the urge to index everything.
What does an index cost? Storage plus write slowdown — every write updates every index. It buys O(log n) reads instead of an O(n) scan.
B-tree vs LSM-tree? B-tree = sorted, in-place, read-optimized (relational DBs); LSM = memtable + immutable SSTables + compaction, write-optimized (Cassandra, RocksDB).
Clustered vs secondary index? Clustered defines the table's physical order (the table is the index); secondary is a separate structure that points back to the row (extra hop).
Leftmost-prefix rule? A composite index on (a,b,c) serves queries on a, a+b, a+b+c — not b or c alone. Column order must match query patterns.
What's a covering index? One that contains every column a query needs, so it's answered from the index without reading the table.