Design Nearby / Yelp (Proximity Service)

A proximity service answers one deceptively hard question: "what are the places near me?" Given a user's latitude/longitude and a radius, return nearby businesses ranked by distance and relevance — fast, at scale, for hundreds of millions of points of interest. This is the core of Yelp, Google Maps' "nearby," and the matching layer behind ride-sharing. The whole problem reduces to one thing a normal database can't do efficiently: geospatial indexing — turning two-dimensional coordinates into something you can index and query by region.

⚡ Quick Takeaways

A plain B-tree index can't do 2D range search efficiently — querying "within radius" by lat AND lng scans far too much. You need a spatial index.
Geohash interleaves lat/lng bits into a short string so that nearby points share a common prefix — turning proximity into a prefix lookup on a normal index.
Quadtree recursively splits space into 4 cells until each holds ≤ N points, adapting cell size to density (small cells downtown, big cells in the countryside).
The boundary problem — a place just across a cell border is "near" but in a different cell, so you must also query the 8 neighboring cells.
Read-heavy and nearly static — businesses rarely move, so cache aggressively; rebuild/precompute indexes offline.
Two-step query — use the index to fetch candidate places in nearby cells, then filter by exact distance and rank.

tldr

The crux is geospatial indexing. Geohash encodes a coordinate as a prefix-shareable string so "nearby" becomes a prefix query on an ordinary index; quadtrees subdivide space adaptively by density. Either way, a search fetches candidates from the user's cell plus its 8 neighbors (handling the boundary problem), then filters by exact distance and ranks. Because the data is read-heavy and almost static, cache hot regions hard and serve from replicas sharded by region.

geohash grid + 8 neighbors (boundary problem)

  ┌──────┬──────┬──────┐   query the user's cell (★) AND its
  │ 9q8yt│ 9q8yw│ 9q8yx│   8 neighbors, because a place just
  ├──────┼──────┼──────┤   across a border is "near" but lands
  │ 9q8ys│ 9q8y★│ 9q8yr│   in a different cell.
  ├──────┼──────┼──────┤
  │ 9q8ye│ 9q8yg│ 9q8yu│   nearby points share a geohash prefix:
  └──────┴──────┴──────┘   "9q8y…" → cheap prefix range scan

Step 1 — Clarify Requirements

Functional: given a location and radius (or a "show me restaurants near here" with a default radius), return nearby places; support adding/updating a business; optionally filter by category. Non-functional: low latency, very read-heavy, high availability, scale to ~hundreds of millions of places and high search QPS. A key simplifying observation: the data is largely static — businesses don't move and are added/edited rarely — so we can precompute and cache aggressively, and eventual consistency of updates is fine. (Contrast with ride-share, where the "places" are moving drivers updating every few seconds — that needs a live in-memory index instead.)

Step 2 — Capacity Estimation

Say 200M places and 100M daily searches (~1–2K searches/sec average, much higher at peak, e.g. lunchtime in dense cities). Each place record is small (id, name, lat/lng, category, rating) — a few hundred bytes — so the dataset is tens of GB, easily cached. The workload is overwhelmingly reads against rarely-changing data, which steers the entire design toward indexing + caching rather than write throughput.

Step 3 — API Design

core API

GET /search?lat=37.78&lng=-122.41&radius=2km&category=cafe
      → [{id, name, distance, rating}, ...]   # sorted
POST /places   {name, lat, lng, category, ...}   # add/update a business

Step 4 — Why a Normal Index Fails

The instinct is to store lat and lng columns and query WHERE lat BETWEEN ... AND lng BETWEEN .... The problem: a B-tree index is one-dimensional. An index on lat finds the right latitude band but includes every place at that latitude across the whole planet; you then filter by longitude in memory (or vice versa). For a global dataset that's a massive scan per query. We need an index that preserves 2D locality — points close on the map should be close in the index. Two standard structures do this.

Step 5 — Geohash

Geohash recursively divides the world into a grid and interleaves the bits of latitude and longitude, encoding the result as a short base-32 string. Two properties make it powerful: a longer geohash = a smaller, more precise cell (each extra character ~ refines the box), and nearby locations share a common prefix (e.g. two cafes on the same block might both start with 9q8yyk). So you store a geohash column, index it normally, and "find places near X" becomes "find places whose geohash starts with X's prefix" — a cheap prefix/range scan on an ordinary index. You pick the prefix length to match the search radius (longer prefix for a tight radius).

Step 6 — Quadtree

A quadtree takes an adaptive approach: start with the whole map as one node; if a cell contains more than a threshold of points, split it into four equal quadrants; recurse. The result is that dense areas (downtown) have many tiny cells while sparse areas (rural) have a few large ones — cell granularity follows data density automatically, so each leaf holds a bounded number of points. A search descends to the leaf containing the query point and collects nearby leaves. Quadtrees are typically held in memory and rebuilt periodically; they give very even result-set sizes regardless of density.

Aspect	Geohash	Quadtree
Cell size	Fixed grid per precision level	Adapts to point density
Storage	A string column — fits any DB index	In-memory tree structure
Dense areas	Cells can hold very many points	Auto-subdivides → bounded per cell
Simplicity	Very simple; prefix queries	More complex to build/maintain
Updates	Just recompute one string	May trigger node split/merge

Geohash is the simpler, more common interview answer (it rides on any existing database index); quadtrees shine when density varies wildly and you want even result sizes. Redis's geospatial commands and PostGIS use related techniques under the hood.

Step 7 — Data Model and Storage

schema (geohash approach)

places (
   place_id PK, name, category, rating,
   lat, lng,
   geohash        # indexed; e.g. "9q8yyk8ytpxr"
)
INDEX on geohash   # prefix range scans = nearby lookup

You can store several geohash precisions (or just query with a variable prefix length) to match different radii. The places table is the source of truth; the geohash column is the spatial index riding on a standard B-tree.

Step 8 — The Read Path and the Boundary Problem

A search runs in two phases. Phase 1 (index): compute the geohash of the query point at the precision matching the radius, then fetch candidate places whose geohash shares that prefix. Phase 2 (refine): compute the exact great-circle distance from the query to each candidate, drop those outside the radius, and sort. The catch is the boundary problem: a place 50 meters away might sit just across a cell border and thus have a different geohash prefix, so it'd be missed. The fix is to also query the 8 neighboring cells around the query cell, union the candidates, then refine. (Quadtrees have the analogous issue and similarly check adjacent leaves.)

Step 9 — Caching and Read Scaling

Because the data barely changes and reads dominate, caching is the main scaling lever. Cache results (or candidate sets) for popular cells — dense, frequently-searched areas like city centers — in Redis with a generous TTL, since a new restaurant appearing is not time-critical. Front read traffic with replicas, and let the mostly-static dataset live largely in memory. Most searches in busy areas should be served from cache without touching the primary store.

Step 10 — Sharding and Scaling Writes

Reads scale with replicas and cache; for storage and write distribution, shard by region (e.g. by geohash prefix, so a shard owns a contiguous area). This keeps a search's candidate cells on one shard most of the time. Watch for hotspots: a geohash-prefix shard covering Manhattan handles vastly more places and queries than one covering open ocean, so balance shards by load, not just by area (finer splits where it's dense). Writes (new/edited businesses) are rare and can update the index asynchronously.

Step 11 — Ranking

"Nearby" isn't only about distance. Final ordering typically blends distance, rating/popularity, category relevance, and sometimes sponsored placement. Because the candidate set after the spatial filter is small (places within the radius), this ranking is cheap to compute per request, or partially precomputed (e.g. a place's base popularity score) and combined with live distance.

Step 12 — Key Tradeoffs

Geohash vs quadtree. Geohash is dead-simple and rides on any DB index but has fixed cells that overflow in dense areas; quadtrees adapt to density and bound result sizes but are an in-memory structure that's more work to maintain.
Precision vs result count. A longer geohash prefix (smaller cell) returns fewer, tighter candidates but may need more neighbor cells for a given radius; shorter prefixes return more candidates to filter.
Static data → cache hard. The read-heavy, rarely-changing nature makes aggressive caching and replication the dominant scaling strategy, with eventual consistency on updates being acceptable.
Static places vs moving objects. This design assumes near-static points; tracking moving objects (drivers) needs a frequently-updated in-memory index instead — a different problem.

takeaway

Proximity search is fundamentally a geospatial-indexing problem: map 2D coordinates onto something a database can index by region. Geohash (prefix-shareable strings) and quadtrees (density-adaptive cells) are the two standard answers; both require checking neighboring cells to handle boundaries, then refining by exact distance. Since the data is read-heavy and almost static, the rest is caching and region-sharding.

🎯 interview hot-takes

Why can't you just index lat and lng? A B-tree is 1D; a 2D range query degenerates to scanning a whole latitude (or longitude) band. You need a spatial index that preserves 2D locality.
How does geohash work? It interleaves lat/lng bits into a string so nearby points share a prefix and a longer string means a smaller cell — turning proximity into a prefix range scan.
Geohash vs quadtree? Geohash = fixed grid, trivially stored as an indexed string; quadtree = density-adaptive cells held in memory, bounding points per cell.
What's the boundary problem? A nearby place across a cell border has a different cell, so you must query the 8 neighboring cells and then filter by exact distance.
Why is caching so effective here? Businesses rarely move, so the data is nearly static and read-heavy — popular regions can be cached with long TTLs.