Autocomplete (typeahead) is the dropdown of suggestions that appears as you type into a search box. It feels trivial, but it's deceptively demanding: a suggestion must come back in tens of milliseconds because the request fires on every keystroke, the suggestions must be the most relevant for a given prefix out of billions of possible queries, and the whole thing runs at enormous QPS. The design hinges on a single idea — precompute the top-k suggestions for every prefix so that serving is a near-instant lookup rather than a search.

⚡ Quick Takeaways
  • Use a trie (prefix tree) — walking a prefix is O(length of prefix), independent of how many phrases exist.
  • Precompute top-k at each node — store the best k completions at every trie node so a query is just "walk to the node, return its cached list." No per-request ranking.
  • Rank by popularity — suggestions are ordered by historical query frequency (often time-decayed so trending terms surface).
  • Reads and writes are decoupled — the trie is built offline from query logs; the serving path is read-only against an immutable snapshot.
  • Latency is the headline — target <50 ms; client-side debouncing, caching of hot prefixes, and edge/CDN delivery all help.
  • Shard the trie by prefix and replicate; keep it in memory for speed.
tldr

Store known phrases in a trie and precompute the top-k completions at every node, so serving a prefix is an O(prefix-length) walk plus returning a cached list — no ranking at request time. Build/update the trie offline from aggregated query logs (a batch or streaming pipeline), and serve a read-only in-memory snapshot, sharded by prefix and replicated. Debounce on the client, cache hot prefixes, and deliver from the edge to hit the sub-50 ms latency target.

trie with precomputed top-k
            (root)
              │ c
              ▼
             [c]  top: cat(90), car(70), care(40)
              │ a
              ▼
             [ca] top: cat(90), car(70), care(40)
            ╱      ╲
         t ▼        ▼ r
       [cat]       [car] top: car(70), care(40), cargo(20)
       top:cat        │ e
                      ▼
                    [care] top: care(40), career(25)

  query "car" → jump to node [car] → return its cached top-k. done.

Step 1 — Clarify Requirements

Functional: given a prefix, return the top k (say 5–10) most relevant complete suggestions; suggestions reflect what's popular and evolve over time. Non-functional: extremely low latency (<50 ms, since it fires per keystroke), very high read QPS, high availability, and scalability to a huge phrase corpus. Scope decisions to state: prefix match only (not fuzzy/typo correction, which is a bigger NLP problem — though worth mentioning), suggestions ranked by global popularity (personalization optional), and a small fixed k. We can tolerate suggestions being a little stale (rebuilt periodically) — autocomplete doesn't need to reflect the last second of activity.

Step 2 — Capacity Estimation

Suppose 5 billion searches/day. Because each search is several keystrokes and each keystroke can trigger a suggest call, raw autocomplete QPS could be ~10× the search QPS — easily hundreds of thousands of QPS, with peaks higher. That read volume, multiplied by a sub-50 ms latency target, is what forces precomputation and in-memory serving. The phrase corpus (distinct queries worth suggesting) might be tens to hundreds of millions of strings; a trie over them, with top-k lists at nodes, is large but shardable across memory on a fleet of servers.

Step 3 — API Design

core API
GET /suggest?prefix=car&limit=5
      → {suggestions: ["car", "care", "career", "cargo", "carpet"]}

# reads only. updates happen out-of-band via the data pipeline,
# not through this endpoint.

Step 4 — The Trie

A trie (prefix tree) stores strings character by character: each path from the root spells a prefix, and shared prefixes share nodes. Looking up a prefix is just walking down the corresponding characters — O(length of the prefix), completely independent of how many phrases the trie contains. That property is why it's the canonical autocomplete structure. The naive version stores complete words at leaves and, to answer a prefix query, traverses the entire subtree under the prefix node to gather and rank candidates — which is too slow at request time. The fix is Step 5.

Step 5 — Precompute Top-k at Each Node

Instead of searching the subtree on every request, cache the answer at the node: each trie node stores the top-k completions of its prefix, already ranked. Now a query is "walk to the prefix node, return its stored list" — essentially O(prefix length) with no ranking or subtree traversal at request time. The cost is moved to build time (computing those lists) and storage (k entries per node), which is exactly the right trade for a read-dominated system: do the expensive work once, offline, and make every one of the billions of reads cheap. This is the classic move of pushing work from the read path to the write/build path.

Step 6 — Ranking

"Top" suggestions are ranked primarily by popularity — how often each complete query has been searched historically. Refinements:

Step 7 — Building and Updating the Trie (the Data Pipeline)

The trie isn't built live; it's produced by an offline pipeline from query logs:

offline build pipeline
search logs ─▶ aggregate counts per query (MapReduce / stream)
            ─▶ filter (offensive, rare) + apply time-decay weights
            ─▶ build trie, compute top-k at every node
            ─▶ publish immutable snapshot ─▶ load into serving fleet

Aggregation is a textbook batch (or streaming) job. The output is an immutable snapshot; serving nodes load a new snapshot periodically (e.g. hourly/daily) and atomically swap to it. This read/write decoupling is what keeps the serving path simple, fast, and lock-free.

Step 8 — Serving and Sharding

The serving tier holds the trie in memory for speed and is horizontally scaled. Since the whole trie may not fit on one machine, shard by prefix: e.g. all prefixes starting with "a"–"m" on one shard group, "n"–"z" on another (or a finer split to balance load — common prefixes like "th" get more traffic). A lightweight aggregator/router sends each request to the shard owning that prefix. Each shard is replicated for availability and to spread read load. Because the data is read-only between snapshot swaps, replicas are trivially consistent.

Step 9 — Caching and Client-Side Tricks

Several layers cut latency and load:

Step 10 — Real-time vs Batch Updates

How fresh should suggestions be? A pure batch rebuild (hourly/daily) is simple and sufficient for most terms, but misses trending spikes (breaking news). The trade-off:

Update modelFreshnessCost / complexity
Batch rebuildHours — misses trendsSimple; the default for the bulk corpus
Streaming top-kSeconds–minutesCatches trending terms; more pipeline complexity

A practical hybrid: a large, slowly-rebuilt base trie plus a small real-time layer that tracks trending terms via a streaming counter and merges them into results. Mention this when the interviewer pushes on "what about breaking news?"

Step 11 — Key Tradeoffs

takeaway

Autocomplete is the cleanest example of trading build-time work for read-time speed: a trie plus precomputed top-k turns a "search and rank billions of phrases" problem into an O(prefix-length) lookup of a cached list. Keep the serving path read-only against an immutable, sharded, in-memory snapshot, build it offline from query logs, and lean on debouncing + edge caching to hit the latency budget.

🎯 interview hot-takes

Why a trie? Prefix lookup is O(length of prefix), independent of corpus size, and shared prefixes share nodes — ideal for "complete this prefix."
Why precompute top-k at each node? It moves ranking off the hot path: a query becomes "walk to the node, return its cached list," meeting the sub-50 ms budget at huge QPS.
How are suggestions ranked and kept fresh? By time-decayed query popularity, rebuilt offline from logs; a small streaming layer can capture trending terms.
How do you scale serving? Keep the trie in memory, shard by prefix, replicate each shard, and swap in immutable snapshots — reads stay lock-free.
What cuts client load? Debounce keystrokes (~200 ms) and cache already-typed prefixes; edge-cache hot short prefixes.

← previous
Design Pastebin