Database Indexing — B-Tree vs LSM-Tree

An index is the single biggest lever on database performance, and "why is this query slow?" almost always comes down to indexing. The idea is simple — keep an auxiliary structure that lets you find rows without scanning the whole table — but the engineering has depth: the two dominant structures (B-tree and LSM-tree) make opposite trade-offs between read and write speed, and how you compose indexes (clustered, secondary, composite, covering) decides whether a query flies or crawls. This is the practical companion to our DDIA notes on storage & retrieval.

⚡ Quick Takeaways

An index trades write speed and space for read speed — it turns an O(n) table scan into an O(log n) lookup, but every write must also update the index.
B-trees are read-optimized, sorted, balanced, updated in place — the default for relational databases (PostgreSQL, MySQL/InnoDB).
LSM-trees are write-optimized — buffer writes in memory, flush to immutable sorted files, compact in the background (Cassandra, RocksDB).
Clustered vs secondary — a clustered index is the table sorted by key; a secondary index is a separate structure pointing back to rows.
Composite indexes obey the leftmost-prefix rule — an index on (a, b, c) helps queries on a, a+b, a+b+c — not on b alone.
A covering index answers a query from the index alone, skipping the table lookup entirely.
Don't over-index — each index slows writes and costs space; useless indexes are pure overhead.

tldr

Indexes make reads fast by avoiding full scans, at the cost of slower writes and more storage. B-trees (sorted, in-place, read-optimized) power most relational databases; LSM-trees (in-memory buffer + immutable SSTables + compaction, write-optimized) power many NoSQL stores. Beyond the structure, design matters: clustered vs secondary, composite indexes (leftmost-prefix rule), and covering indexes (answer from the index alone). Index the columns your queries filter/sort on — and no more.

Why Index at All

Without an index, finding rows matching WHERE email = 'x' means a full table scan — read every row, O(n). At a thousand rows that's fine; at a hundred million it's catastrophic. An index is a sorted (or otherwise organized) auxiliary structure that lets the database jump to matching rows in O(log n). The cost is real and worth stating clearly: the index consumes storage, and every INSERT/UPDATE/DELETE must update every index on the table. So indexing is a deliberate trade — faster reads for slower writes and more space.

B-Tree Indexes

The B-tree (specifically B+tree) is the workhorse of relational databases. It's a balanced tree of fixed-size pages (typically 4–16 KB) where keys are kept sorted; internal pages route you down to leaf pages holding the actual keys/pointers. Because it's balanced, lookups, insertions, and range scans are all O(log n), and the sorted leaves make range queries and ordered reads efficient.

B+tree: balanced, sorted, range-friendly

                 [ 30 | 70 ]            ◀ root (routing keys)
                /     |      \
        [10|20]   [40|50|60]   [80|90]   ◀ internal
        /  |  \    ...           ...
   leaves: sorted keys → row pointers, linked for range scans

  lookup 50: root(→mid) → internal(→50) → leaf   = O(log n), few page reads

B-trees update in place: changing a value rewrites the page that holds it. This makes reads fast and predictable, but writes incur random I/O (find and modify the right page) and a write-ahead log is needed for crash safety. B-trees are what you get by default in PostgreSQL, MySQL/InnoDB, SQL Server, and Oracle.

LSM-Tree Indexes

The Log-Structured Merge-tree optimizes for writes. Instead of updating pages in place, it buffers writes in an in-memory sorted structure (the memtable); when that fills, it's flushed to disk as an immutable sorted file (an SSTable). Writes are therefore fast sequential appends — no random I/O. Over time many SSTables accumulate, so a background process compacts them — merging sorted files and discarding overwritten/deleted keys.

LSM-tree: buffer → flush → compact

writes → memtable (in-memory, sorted)
              │ full? flush (sequential write)
              ▼
         SSTable_1  SSTable_2  SSTable_3 ...   (immutable, sorted)
              └──────── compaction ────────┘ → merged SSTable
                                              (drops overwritten/deleted keys)

  read: check memtable, then newest→oldest SSTables
        (Bloom filters skip SSTables that can't contain the key)

The downside: a read may have to check the memtable and several SSTables, so LSM reads can be slower (mitigated by Bloom filters that quickly rule out SSTables lacking a key — the same trick used in our URL shortener). LSM-trees power Cassandra, RocksDB, LevelDB, and the storage engines behind many NoSQL systems.

B-Tree vs LSM-Tree

Aspect	B-Tree	LSM-Tree
Optimized for	Reads	Writes
Writes	In-place (random I/O)	Append + flush (sequential)
Reads	Fast, predictable	May check several SSTables
Amplification	Write amplification (page rewrites)	Read & compaction amplification
Used by	PostgreSQL, MySQL, Oracle	Cassandra, RocksDB, LevelDB

Clustered vs Secondary Indexes

A clustered index determines the physical order of the table's rows on disk — the table is the index, sorted by the clustering key (usually the primary key). There's only one per table, and looking up by that key returns the full row immediately. A secondary (non-clustered) index is a separate structure whose leaves point back to the row (by primary key or row location); using it requires a second hop to fetch the full row. The implication: looking up by the clustered key is the cheapest possible read, while a secondary-index lookup costs an extra step.

Composite Indexes and the Leftmost-Prefix Rule

An index can span multiple columns — a composite index on (country, city, name). The key insight is that it's sorted by those columns in order, so it only helps queries that use a leftmost prefix of the columns:

leftmost-prefix rule — index (country, city, name)

WHERE country=?                         ✓ uses index
WHERE country=? AND city=?              ✓ uses index
WHERE country=? AND city=? AND name=?   ✓ fully uses index
WHERE city=?                            ✗ skips the leftmost col → no use
WHERE name=?                            ✗ no use

  column ORDER matters: put the most selective / most-filtered column
  considerations first, and match your query patterns.

This is one of the most common real-world indexing mistakes: a composite index in the wrong column order, or a query that filters on a non-leading column, silently falls back to a scan.

Covering Indexes

If an index contains all the columns a query needs (both the filter columns and the selected columns), the database can answer the query from the index alone — no trip to the table. This is a covering index, and it's a big win for hot queries: SELECT name WHERE country=? served by an index on (country, name) never touches the table rows. The trade-off is a wider index (more storage, slower writes), so reserve covering indexes for your highest-traffic queries.

When NOT to Index

Indexes aren't free, and over-indexing is a real anti-pattern:

Write-heavy tables — every index multiplies write cost; a logging table with ten indexes will crawl.
Low-cardinality columns — indexing a boolean or a status with three values rarely helps; the database may scan anyway.
Small tables — a full scan of a few hundred rows is faster than index overhead.
Unused indexes — pure cost (space + write slowdown) with no benefit; audit and drop them.

reading the plan

The way to know whether an index is used is the query planner: run EXPLAIN (or EXPLAIN ANALYZE) and look for an index scan vs a sequential/full scan. A surprise full scan on a big table is the classic "missing or unusable index" signal — often a non-leftmost composite column, a function applied to the column, or a type mismatch defeating the index.

takeaway

Indexing is a trade: reads get O(log n) at the cost of write speed and space. Know your storage engine (B-tree = read-optimized in-place; LSM = write-optimized append+compact), and design indexes around your actual query patterns — clustered key for the cheapest reads, composite indexes obeying the leftmost-prefix rule, covering indexes for hot paths — while resisting the urge to index everything.

🎯 interview hot-takes

What does an index cost? Storage plus write slowdown — every write updates every index. It buys O(log n) reads instead of an O(n) scan.
B-tree vs LSM-tree? B-tree = sorted, in-place, read-optimized (relational DBs); LSM = memtable + immutable SSTables + compaction, write-optimized (Cassandra, RocksDB).
Clustered vs secondary index? Clustered defines the table's physical order (the table is the index); secondary is a separate structure that points back to the row (extra hop).
Leftmost-prefix rule? A composite index on (a,b,c) serves queries on a, a+b, a+b+c — not b or c alone. Column order must match query patterns.
What's a covering index? One that contains every column a query needs, so it's answered from the index without reading the table.