Design Google Drive / Dropbox — System Design

Q: How do you handle two offline edits?

Optimistic compare-and-set on base_version; the stale committer's file is saved as a "conflicted copy" — never silently overwrite.

A file storage and sync service looks trivial from the outside: drop a file in a folder and it appears on every device. Underneath, it's one of the richer system design problems, because the requirements pull against each other. You must store petabytes durably (losing a user's file is unforgivable), sync changes across many devices within seconds, avoid re-uploading a 2 GB video when one byte changes, deduplicate identical content across millions of users without leaking their data to each other, resolve conflicts when two offline devices edit the same file, and serve downloads fast worldwide. The key architectural moves are chunking files into blocks, content-addressed deduplication, and splitting the metadata plane from the block-storage plane — and almost every interesting decision flows from those three.

⚡ Quick Takeaways

Split files into fixed-size chunks (e.g. 4 MB) — enables resumable uploads, parallel transfer, partial sync (only changed chunks move), and deduplication.
Content-addressed storage — a chunk's ID is the hash of its bytes, so identical chunks are stored once (block-level dedup), saving huge amounts of storage and bandwidth.
Separate metadata from blocks — a metadata service (file tree, versions, chunk lists) is small, transactional, and consistency-sensitive; a block service (S3-style blob store) is huge and append-only.
Sync = diff + delta upload + notify — the client hashes local chunks, uploads only new ones, commits a new version to metadata, and a notification service tells other devices to pull.
Never silently lose an edit — concurrent offline edits are resolved by versioning and "conflicted copy" files, not last-writer-wins.
Durability is the top non-functional — replicate blocks across AZs/regions (and erasure-code cold data) for ~11 nines, like S3.
Metadata is the scaling bottleneck — shard the file tree and sharing graph by user/namespace; the blob store scales almost linearly.

tldr

Chunk files into ~4 MB blocks and address each block by the hash of its content, so identical blocks are stored once. Keep two planes: a metadata service (file tree, versions, chunk lists — needs transactions and strong-ish consistency) and a block service backed by S3 (cheap, durable, append-only). Sync by hashing local chunks, uploading only the new ones, and committing a new version; a notification service pushes "pull now" to other devices. Resolve concurrent edits with versioning and conflicted copies. Replicate blocks for durability and serve downloads via CDN.

high-level architecture

                         ┌──────────────────┐
   ┌───────────┐  meta   │ Metadata Service │   ┌───────────────┐
   │  Client   │◀───────▶│ (tree, versions, │──▶│  Metadata DB  │
   │  watcher  │         │  chunk lists)    │   │ (sharded SQL) │
   │  + sync   │         └──────────────────┘   └───────────────┘
   │  engine   │         ┌──────────────────┐   ┌───────────────┐
   └────┬──────┘ chunks  │  Block Service   │──▶│  Blob store   │──▶ CDN
        │───────────────▶│ (upload/fetch)   │   │ (S3, chunks)  │
        │ subscribe      └──────────────────┘   └───────────────┘
        │                ┌──────────────────┐
        └───────────────▶│   Notification   │  long-poll / websocket:
                         │     Service      │  "your namespace changed"
                         └──────────────────┘

Step 1 — Clarify Requirements

Pin down scope before drawing boxes. Functional requirements: upload and download files; automatically sync a local folder across a user's devices; share files/folders with other users; keep version history; and work offline, syncing on reconnect. Non-functional: durability is paramount (never lose data — target ~11 nines), high availability, scale to hundreds of millions of users and exabytes of data, efficient use of bandwidth (don't re-send unchanged data), and low sync latency (changes propagate in seconds). It's worth explicitly scoping out real-time collaborative editing inside a document (that's a Google Docs / operational-transform problem); here a file is an opaque blob that one writer changes at a time.

Step 2 — Capacity Estimation

Ballpark numbers justify the architecture. Assume 500M registered users, 100M daily active, average 10 GB stored per user.

Storage: 500M × 10 GB = 5 EB raw — before dedup and before replication. Dedup may cut this 30–50%; replication (3×) multiplies it back up. The blob store must scale to many exabytes.
Write throughput: if each DAU uploads ~10 files/day at 1 MB average, that's ~1B file ops/day ≈ ~12K writes/sec average, with peaks several times higher.
Read:write: downloads and syncs dominate; expect a read-heavy ratio. Most reads are cheap metadata polls plus chunk fetches served from CDN.
Metadata QPS: clients poll for changes frequently; the metadata service sees far more QPS than the block service, but each request is tiny.

The headline takeaway: blocks are big and cheap-to-scale; metadata is small but high-QPS and consistency-sensitive. That asymmetry is exactly why we split them.

Step 3 — API Design

The API is chunk-aware rather than whole-file. A client first negotiates which chunks the server already has, uploads only the missing bytes, then commits a new file version referencing the chunk hashes.

core API (auth token implied)

POST /files/upload-urls   {chunk_hashes:[...]}  # which are missing?
        → {missing:[hash→presigned PUT url]}     # dedup happens here
PUT  <presigned-url>       <chunk bytes>          # client → blob store direct
POST /files/commit        {path, chunk_hashes:[...], base_version}
        → {version}                              # new file version
GET  /files/changes?cursor=X                     # delta since last sync
        → {changes:[...], next_cursor}
GET  /files/download?path&version                # → chunk list + CDN urls
POST /shares              {path, user, role}      # share a folder

Note the upload flow: the client asks "which of these chunk hashes do you already have?" and only uploads the missing ones, directly to the blob store via presigned URLs (keeping bulk bytes off the application servers). The commit call is the only one that mutates metadata.

Step 4 — Chunking: Why Split Files into Blocks

Instead of treating a file as one opaque blob, the client splits it into fixed-size chunks (Dropbox uses 4 MB). This one decision unlocks most of the system's efficiency:

Partial sync. Change one paragraph in a 2 GB file and only the affected 4 MB chunk(s) need to be re-uploaded — not the whole file.
Resumable uploads. A dropped connection resumes at the chunk boundary; you re-send one chunk, not gigabytes.
Parallel transfer. Chunks upload/download concurrently, saturating bandwidth.
Deduplication. Chunks are the unit of dedup (next step).

Fixed-size chunking is simple but has a weakness: inserting a byte at the start of a file shifts every subsequent chunk boundary, busting dedup. Content-defined (variable-size) chunking with a rolling hash sets boundaries based on content, so an insertion only affects nearby chunks — better dedup at higher CPU cost. Fixed-size is the usual interview default; mention the rolling-hash refinement.

Step 5 — Block Storage and Deduplication

Each chunk is stored in a content-addressed blob store: its identifier is the cryptographic hash of its bytes (e.g. SHA-256). Two consequences fall out immediately. First, the same content always maps to the same ID, so if a chunk already exists you simply increment a reference count instead of storing it again — block-level deduplication. Second, you can verify integrity by re-hashing on read. Dedup is enormous in practice: shared documents, common app files, and re-saved versions mean a large fraction of chunks are duplicates.

security gotcha

Cross-user (global) dedup leaks information. If dedup spans all users, an attacker can probe whether a known file already exists (their upload returns instantly = someone else has it), revealing that another user stores that exact content. Many services therefore scope dedup per-user or per-namespace, or add a secret salt, trading some storage savings for privacy. Always raise this trade-off.

Dedup scope	Storage savings	Risk
Global (all users)	Maximum	Existence/confirmation attacks across users
Per-namespace	High within shared folders	Low — bounded to collaborators
Per-user	Moderate (versions, dupes)	None cross-user

Step 6 — Metadata Service and Data Model

The metadata plane is the brain: it records the folder tree, every file's version history, and the ordered list of chunk hashes that reconstitute each version. It's small per record but must be transactional (a commit updates several rows atomically) and reasonably strongly consistent (a client must not see a half-written file).

metadata schema (sharded by namespace_id)

namespaces (ns_id, owner_id, type)              # a user's root or a shared folder
files      (file_id, ns_id, path, is_dir, latest_version, deleted)
versions   (file_id, version, chunk_hashes[], size, modified_by, ts)
chunks     (chunk_hash PK, blob_url, size, refcount)   # content-addressed
devices    (device_id, user_id, last_cursor)    # where each device left off
shares     (ns_id, user_id, role)               # viewer / editor / owner

A file version is just an ordered list of chunk hashes — reconstructing the file means fetching those chunks and concatenating. The namespace is the unit of sharding and sharing: a user's private root is one namespace, and each shared folder is its own namespace that can be attached to multiple users. Sharding metadata by ns_id keeps a user's (or shared folder's) data co-located, so listing a directory or computing a delta is a single-shard operation. A relational store (with sharding) is a fine choice because of the transactional commit; a NoSQL store works too if you handle the multi-row update carefully.

Step 7 — Core Architecture: Two Planes

Putting Steps 4–6 together, the system cleanly separates into two data planes that scale independently:

Metadata plane — the metadata service + sharded metadata DB. High QPS, tiny payloads, transactional, consistency-sensitive. This is where most engineering effort and most scaling pain live.
Block plane — the block service + blob store (S3) + CDN. Exabyte-scale, large payloads, append-only and immutable (a chunk never changes — its hash defines it), trivially cacheable. Clients transfer bulk bytes directly to/from the blob store via presigned URLs, so application servers never proxy gigabytes.

Because chunks are immutable and content-addressed, the block plane needs no complex consistency — it's a giant, durable hash map. All the ordering, versioning, and "what does this folder contain right now" logic lives in the metadata plane.

Step 8 — The Sync Algorithm

A client-side watcher observes the local folder. When a file changes, the sync engine runs roughly this loop:

client sync engine (upload path)

on local_file_change(file):
    chunks  = split(file, 4MB)
    hashes  = [sha256(c) for c in chunks]
    missing = api.upload_urls(hashes)        # server: which are new?
    for h in missing:
        put(missing[h], chunk_for(h))        # upload only new chunks → blob store
    api.commit(file.path, hashes, base_version=file.known_version)

The download path is the mirror image: the client calls changes?cursor=, learns which files have new versions, fetches the chunk hashes it's missing locally, downloads those chunks from the CDN, and reassembles the file. Crucially, the client maintains a cursor (a position in the per-namespace change log) so it only ever pulls deltas, never the whole tree.

Step 9 — Notifying Other Devices

How does a second device learn that a file changed? Polling the metadata service every few seconds works and is simple, but wastes requests and adds latency. The better design is a dedicated notification service that each online device subscribes to; when a namespace changes, it pushes a lightweight "you have changes, pull now" signal, and the device then does a normal changes?cursor= pull.

Mechanism	Latency	Cost / notes
Periodic polling	Seconds–minutes	Simple; wasteful at idle; doesn't scale to instant sync
Long polling	~1–2 s	Good middle ground; connection held until change or timeout
WebSocket / push	Sub-second	Lowest latency; many idle persistent connections to manage

A common hybrid: keep a lightweight long-poll/WebSocket channel only for the "ping" (no payload), and let the existing metadata API serve the actual delta. That keeps the notification tier cheap (it pushes tiny signals) while reusing the battle-tested change-pull path for data.

Step 10 — Conflict Resolution

Two devices are offline; both edit report.docx; both reconnect and try to commit. The metadata commit includes base_version — the version the client started from. The server accepts a commit only if base_version equals the current latest version (an optimistic compare-and-set). The second committer's base_version is now stale, so the server rejects the fast path and a conflict is declared.

conflict via version compare-and-set

device A: commit(base_version = 5)  → accepted, latest = 6
device B: commit(base_version = 5)  → REJECTED (latest is now 6)

resolution: do NOT overwrite. Save B's content as a new file:
   "report (conflicted copy — Bob's MacBook 2023-04-08).docx"
both versions are preserved → no silent data loss; user decides.

Because we treat a file as an opaque blob (no in-document merge), the safe, well-understood resolution is to keep both: the winning version stays, and the loser is written as a conflicted copy that syncs to everyone. This is exactly how Dropbox behaves, and it's the right interview answer — never silently discard a user's edit. (In-document automatic merging is the domain of CRDTs / operational transform, which is a Google Docs problem, not a file-sync one.)

Step 11 — Sharing and Permissions

Sharing is modeled with the namespace: a shared folder is a namespace that is mounted into multiple users' trees, with a row in shares granting each collaborator a role (viewer/editor/owner). When any member commits a change to that namespace, the notification service fans out to all members' devices. Permission checks happen in the metadata service on every read/write. The scaling concern is the sharing/permission lookup: deeply nested shared folders and large collaborator lists make "can this user access this path?" non-trivial, so permissions are usually cached and evaluated at the namespace level rather than per file.

Step 12 — Downloads, Caching, and the CDN

Downloads are bulk reads of immutable chunks — a perfect fit for a CDN. Since a chunk is content-addressed and never changes, it's infinitely cacheable: the URL (keyed by chunk hash) can be cached at edge locations worldwide with no invalidation logic. The client gets (presigned, expiring) CDN URLs from the block service and pulls chunks from the nearest edge. Hot chunks (popular shared files, app installers) are served almost entirely from cache, sparing the origin blob store. This is why separating immutable blocks from mutable metadata pays off again: only the metadata ever needs invalidation.

Step 13 — Scaling, Durability, and Fault Tolerance

Durability is the headline requirement, so the block store replicates each chunk across multiple availability zones (and often regions); colder data is moved to erasure-coded storage that achieves the same durability with far less overhead than 3× replication. Managed object stores (S3) already give ~11 nines, which is why most designs lean on one rather than rolling their own. For the metadata plane: shard by ns_id and replicate each shard (leader + followers) for availability; the per-namespace change log gives a clean ordering for the cursor-based sync. Application and notification tiers are stateless and horizontally scaled behind load balancers. Because chunks are immutable and verified by hash, background scrubbing can detect and repair silent corruption (bit rot) by re-replicating from a good copy.

Step 14 — Key Tradeoffs

Chunk size. Smaller chunks → better dedup and finer-grained partial sync, but more metadata (more hashes per file) and more request overhead. ~4 MB is a common balance.
Fixed vs content-defined chunking. Fixed is simple but insertions ruin dedup; rolling-hash boundaries fix that at higher CPU cost.
Dedup scope. Global maximizes savings but enables cross-user existence attacks; per-namespace/per-user is safer. A real privacy/cost trade.
Metadata consistency. Strong consistency makes sync correctness easy but costs latency and limits scale; the per-namespace sharding keeps strong consistency affordable within a namespace while the system as a whole scales out.
Notification mechanism. WebSockets give sub-second sync but cost many idle connections; long-polling is a cheaper near-real-time compromise.

takeaway

The whole design rests on three moves: chunk files into blocks, address blocks by content hash (giving dedup and trivial CDN caching for free), and split the small-but-consistency-critical metadata plane from the huge-but-immutable block plane. Get those right and sync, dedup, resumability, and durability all follow naturally; the remaining hard parts are conflict handling (never lose an edit) and scaling the metadata/sharing layer.

🎯 interview hot-takes

Why chunk files? Partial sync (only changed chunks move), resumable + parallel transfer, and deduplication — all impossible if a file is one opaque blob.
Why content-addressed storage? A chunk's ID is the hash of its bytes, so identical chunks are stored once (dedup) and downloads are infinitely CDN-cacheable since chunks are immutable.
Why split metadata from blocks? Metadata is tiny, high-QPS, transactional, consistency-sensitive; blocks are huge, immutable, append-only. Different stores, different scaling, only metadata needs invalidation.
How do you handle two offline edits? Optimistic compare-and-set on base_version; the stale committer's file is saved as a "conflicted copy" — never silently overwrite.
What's the risk of global dedup? Existence/confirmation attacks: instant upload reveals someone else stores that exact content. Scope dedup per-namespace/user to mitigate.