A proximity service answers one deceptively hard question: "what are the places near me?" Given a user's latitude/longitude and a radius, return nearby businesses ranked by distance and relevance — fast, at scale, for hundreds of millions of points of interest. This is the core of Yelp, Google Maps' "nearby," and the matching layer behind ride-sharing. The whole problem reduces to one thing a normal database can't do efficiently: geospatial indexing — turning two-dimensional coordinates into something you can index and query by region.

⚡ Quick Takeaways
  • A plain B-tree index can't do 2D range search efficiently — querying "within radius" by lat AND lng scans far too much. You need a spatial index.
  • Geohash interleaves lat/lng bits into a short string so that nearby points share a common prefix — turning proximity into a prefix lookup on a normal index.
  • Quadtree recursively splits space into 4 cells until each holds ≤ N points, adapting cell size to density (small cells downtown, big cells in the countryside).
  • The boundary problem — a place just across a cell border is "near" but in a different cell, so you must also query the 8 neighboring cells.
  • Read-heavy and nearly static — businesses rarely move, so cache aggressively; rebuild/precompute indexes offline.
  • Two-step query — use the index to fetch candidate places in nearby cells, then filter by exact distance and rank.
tldr

The crux is geospatial indexing. Geohash encodes a coordinate as a prefix-shareable string so "nearby" becomes a prefix query on an ordinary index; quadtrees subdivide space adaptively by density. Either way, a search fetches candidates from the user's cell plus its 8 neighbors (handling the boundary problem), then filters by exact distance and ranks. Because the data is read-heavy and almost static, cache hot regions hard and serve from replicas sharded by region.

geohash grid + 8 neighbors (boundary problem)
  ┌──────┬──────┬──────┐   query the user's cell (★) AND its
  │ 9q8yt│ 9q8yw│ 9q8yx│   8 neighbors, because a place just
  ├──────┼──────┼──────┤   across a border is "near" but lands
  │ 9q8ys│ 9q8y★│ 9q8yr│   in a different cell.
  ├──────┼──────┼──────┤
  │ 9q8ye│ 9q8yg│ 9q8yu│   nearby points share a geohash prefix:
  └──────┴──────┴──────┘   "9q8y…" → cheap prefix range scan

Step 1 — Clarify Requirements

Functional: given a location and radius (or a "show me restaurants near here" with a default radius), return nearby places; support adding/updating a business; optionally filter by category. Non-functional: low latency, very read-heavy, high availability, scale to ~hundreds of millions of places and high search QPS. A key simplifying observation: the data is largely static — businesses don't move and are added/edited rarely — so we can precompute and cache aggressively, and eventual consistency of updates is fine. (Contrast with ride-share, where the "places" are moving drivers updating every few seconds — that needs a live in-memory index instead.)

Step 2 — Capacity Estimation

Say 200M places and 100M daily searches (~1–2K searches/sec average, much higher at peak, e.g. lunchtime in dense cities). Each place record is small (id, name, lat/lng, category, rating) — a few hundred bytes — so the dataset is tens of GB, easily cached. The workload is overwhelmingly reads against rarely-changing data, which steers the entire design toward indexing + caching rather than write throughput.

Step 3 — API Design

core API
GET /search?lat=37.78&lng=-122.41&radius=2km&category=cafe
      → [{id, name, distance, rating}, ...]   # sorted
POST /places   {name, lat, lng, category, ...}   # add/update a business

Step 4 — Why a Normal Index Fails

The instinct is to store lat and lng columns and query WHERE lat BETWEEN ... AND lng BETWEEN .... The problem: a B-tree index is one-dimensional. An index on lat finds the right latitude band but includes every place at that latitude across the whole planet; you then filter by longitude in memory (or vice versa). For a global dataset that's a massive scan per query. We need an index that preserves 2D locality — points close on the map should be close in the index. Two standard structures do this.

Step 5 — Geohash

Geohash recursively divides the world into a grid and interleaves the bits of latitude and longitude, encoding the result as a short base-32 string. Two properties make it powerful: a longer geohash = a smaller, more precise cell (each extra character ~ refines the box), and nearby locations share a common prefix (e.g. two cafes on the same block might both start with 9q8yyk). So you store a geohash column, index it normally, and "find places near X" becomes "find places whose geohash starts with X's prefix" — a cheap prefix/range scan on an ordinary index. You pick the prefix length to match the search radius (longer prefix for a tight radius).

Step 6 — Quadtree

A quadtree takes an adaptive approach: start with the whole map as one node; if a cell contains more than a threshold of points, split it into four equal quadrants; recurse. The result is that dense areas (downtown) have many tiny cells while sparse areas (rural) have a few large ones — cell granularity follows data density automatically, so each leaf holds a bounded number of points. A search descends to the leaf containing the query point and collects nearby leaves. Quadtrees are typically held in memory and rebuilt periodically; they give very even result-set sizes regardless of density.

AspectGeohashQuadtree
Cell sizeFixed grid per precision levelAdapts to point density
StorageA string column — fits any DB indexIn-memory tree structure
Dense areasCells can hold very many pointsAuto-subdivides → bounded per cell
SimplicityVery simple; prefix queriesMore complex to build/maintain
UpdatesJust recompute one stringMay trigger node split/merge

Geohash is the simpler, more common interview answer (it rides on any existing database index); quadtrees shine when density varies wildly and you want even result sizes. Redis's geospatial commands and PostGIS use related techniques under the hood.

Step 7 — Data Model and Storage

schema (geohash approach)
places (
   place_id PK, name, category, rating,
   lat, lng,
   geohash        # indexed; e.g. "9q8yyk8ytpxr"
)
INDEX on geohash   # prefix range scans = nearby lookup

You can store several geohash precisions (or just query with a variable prefix length) to match different radii. The places table is the source of truth; the geohash column is the spatial index riding on a standard B-tree.

Step 8 — The Read Path and the Boundary Problem

A search runs in two phases. Phase 1 (index): compute the geohash of the query point at the precision matching the radius, then fetch candidate places whose geohash shares that prefix. Phase 2 (refine): compute the exact great-circle distance from the query to each candidate, drop those outside the radius, and sort. The catch is the boundary problem: a place 50 meters away might sit just across a cell border and thus have a different geohash prefix, so it'd be missed. The fix is to also query the 8 neighboring cells around the query cell, union the candidates, then refine. (Quadtrees have the analogous issue and similarly check adjacent leaves.)

Step 9 — Caching and Read Scaling

Because the data barely changes and reads dominate, caching is the main scaling lever. Cache results (or candidate sets) for popular cells — dense, frequently-searched areas like city centers — in Redis with a generous TTL, since a new restaurant appearing is not time-critical. Front read traffic with replicas, and let the mostly-static dataset live largely in memory. Most searches in busy areas should be served from cache without touching the primary store.

Step 10 — Sharding and Scaling Writes

Reads scale with replicas and cache; for storage and write distribution, shard by region (e.g. by geohash prefix, so a shard owns a contiguous area). This keeps a search's candidate cells on one shard most of the time. Watch for hotspots: a geohash-prefix shard covering Manhattan handles vastly more places and queries than one covering open ocean, so balance shards by load, not just by area (finer splits where it's dense). Writes (new/edited businesses) are rare and can update the index asynchronously.

Step 11 — Ranking

"Nearby" isn't only about distance. Final ordering typically blends distance, rating/popularity, category relevance, and sometimes sponsored placement. Because the candidate set after the spatial filter is small (places within the radius), this ranking is cheap to compute per request, or partially precomputed (e.g. a place's base popularity score) and combined with live distance.

Step 12 — Key Tradeoffs

takeaway

Proximity search is fundamentally a geospatial-indexing problem: map 2D coordinates onto something a database can index by region. Geohash (prefix-shareable strings) and quadtrees (density-adaptive cells) are the two standard answers; both require checking neighboring cells to handle boundaries, then refining by exact distance. Since the data is read-heavy and almost static, the rest is caching and region-sharding.

🎯 interview hot-takes

Why can't you just index lat and lng? A B-tree is 1D; a 2D range query degenerates to scanning a whole latitude (or longitude) band. You need a spatial index that preserves 2D locality.
How does geohash work? It interleaves lat/lng bits into a string so nearby points share a prefix and a longer string means a smaller cell — turning proximity into a prefix range scan.
Geohash vs quadtree? Geohash = fixed grid, trivially stored as an indexed string; quadtree = density-adaptive cells held in memory, bounding points per cell.
What's the boundary problem? A nearby place across a cell border has a different cell, so you must query the 8 neighboring cells and then filter by exact distance.
Why is caching so effective here? Businesses rarely move, so the data is nearly static and read-heavy — popular regions can be cached with long TTLs.

← previous
Design Search Autocomplete (Typeahead)