Design Pastebin — System Design

Pastebin lets a user paste a block of text and get back a short URL that anyone can open to read it. It's a close cousin of the URL shortener, with one important difference: instead of storing a tiny target URL, you store a potentially large blob of content. That shifts the design toward the question of where the bytes live, and — because pastes are read far more than they're written — toward aggressive caching. It's a great "medium" interview problem that rewards clean separation of metadata from content.

⚡ Quick Takeaways

Generate a short base-62 key per paste (random or from a counter), just like a URL shortener; check for collisions only if random.
Store content in a blob store, not the DB row — keep a small metadata record (key → blob location, expiry, flags) and put the (possibly large) text in object storage.
Read-heavy by ~10:1+ — cache hot pastes in Redis and serve via CDN; most reads should never touch the database or blob store.
Pastes are write-once, immutable — which makes them trivially cacheable (no invalidation) and CDN-friendly.
Expiration is a first-class feature — support TTLs; purge with a mix of lazy deletion on read and a background sweeper.
Abuse is the real operational risk — rate-limit creation, scan for malware/spam, and support unlisted/private pastes.

tldr

Mint a short base-62 key per paste. Keep a small metadata row (key, blob pointer, expiry, visibility) in a database and store the actual text in an object store (S3) — only inline tiny pastes. Because pastes are immutable and read-heavy, cache hot ones in Redis and front everything with a CDN. Expire via TTL with lazy + scheduled cleanup. Rate-limit and scan on the write path to fight abuse.

high-level architecture

                         ┌──────────────┐   ┌───────────────┐
              write/read │  App / API   │──▶│  Metadata DB  │  key→blob,
   ┌──────────┐ ────────▶│  servers     │   │  (KV / SQL)   │  expiry, flags
   │  Client  │          └──────┬───────┘   └───────────────┘
   └────┬─────┘                 │           ┌───────────────┐
        │  reads (hot)          ├──────────▶│  Blob store   │  paste content
        │                       │           │  (S3)         │
        ▼                  ┌────▼────┐      └───────────────┘
   ┌─────────┐   miss      │  Cache  │   hot pastes (Redis)
   │   CDN   │◀───────────▶│ (Redis) │
   └─────────┘             └─────────┘

Step 1 — Clarify Requirements

Functional: create a paste from a block of text and receive a unique URL; read a paste by its URL; optional custom alias; optional expiration (e.g. 10 min, 1 day, never); visibility (public / unlisted).
Non-functional: very read-heavy, low read latency, high availability, durability (don't lose pastes before they expire), and horizontal scalability.

Constrain paste size up front (say up to a few MB of text, with a hard cap) because it drives the storage decision. Pastes are write-once and immutable: you create one and never edit it (an edit is a new paste), which simplifies caching enormously.

Step 2 — Capacity Estimation

Assume 1M new pastes/day (~12 writes/sec average) and a 10:1 read:write ratio (~120 reads/sec average, far higher at peak for viral pastes). Average paste 10 KB → ~10 GB/day of new content, ~3.6 TB/year before expiry reclaims space. The numbers are modest for writes but the read path must handle sharp spikes when a paste goes viral — which is exactly what the cache and CDN are for. Key space: a 7-character base-62 key gives 62⁷ ≈ 3.5 trillion combinations, plenty.
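The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constants are the assumptions stated in this step, and the variable names are illustrative.

```python
# Back-of-envelope capacity math using the assumptions from this step.
PASTES_PER_DAY = 1_000_000
READ_WRITE_RATIO = 10
AVG_PASTE_BYTES = 10_000      # 10 KB, decimal units
SECONDS_PER_DAY = 86_400

writes_per_sec = PASTES_PER_DAY / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * READ_WRITE_RATIO
gb_per_day = PASTES_PER_DAY * AVG_PASTE_BYTES / 1e9
tb_per_year = gb_per_day * 365 / 1e3
key_space = 62 ** 7           # 7-character base-62 keys

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec:  {reads_per_sec:.1f}")
print(f"new content: {gb_per_day:.1f} GB/day, {tb_per_year:.2f} TB/year")
print(f"key space:  {key_space:,}")
```

These are averages; peak read traffic for a viral paste can be far above them, which is why the estimate feeds the caching design rather than the database sizing.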

Step 3 — API Design

core API

POST /pastes   {content, expiry?, custom_alias?, visibility?}
        → {key, url}
GET  /pastes/{key}        → {content, created_at, expiry}
DELETE /pastes/{key}      → ok            # owner only

The create endpoint is rate-limited and authenticated (or captcha-gated) to curb abuse; the read endpoint is public and heavily cached.

Step 4 — Key Generation

Each paste needs a short, URL-safe key. Two standard approaches, identical to the URL shortener:

Approach	How	Trade-off
Random base-62	Generate 7 random chars, check DB for collision, retry	Unguessable (good for unlisted); needs a collision check + retry
Counter / ID gen	Distributed counter (e.g. range-allocated or Snowflake) → base-62 encode	No collisions by construction; but sequential keys are guessable/enumerable

For a paste service, random keys are usually preferred because "unlisted" pastes rely on the URL being unguessable; sequential IDs would let anyone enumerate every paste. A common refinement is a pre-generated key pool: a background service mints unused keys into a table so the write path just pops one (no live collision-retry loop on the hot path).
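Both approaches from the table can be sketched in a few lines. This is a minimal illustration, not a production key service: the in-memory `existing` set stands in for the database uniqueness check, and the function names are hypothetical.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_letters  # 62 URL-safe characters
KEY_LENGTH = 7

def random_key(length: int = KEY_LENGTH) -> str:
    """Random approach: draw each character from a CSPRNG so keys are hard to guess."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def mint_unique_key(existing: set, max_attempts: int = 5) -> str:
    """Collision check + retry; `existing` is a stand-in for a DB lookup."""
    for _ in range(max_attempts):
        key = random_key()
        if key not in existing:
            existing.add(key)
            return key
    raise RuntimeError("could not mint a unique key")

def base62_encode(n: int) -> str:
    """Counter approach: encode a non-negative integer ID as base-62."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

Using `secrets` rather than `random` matters for the unlisted case, since the key is effectively the access token. A pre-generated pool would run `mint_unique_key` in the background and hand out the results.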

Step 5 — Storage: Separate Metadata from Content

The defining decision. Don't stuff multi-KB/MB text into a relational row: it bloats the database, slows scans and backups, and fills the database's memory cache with bytes that are only ever fetched whole by key. Instead split:

Metadata — a small record per paste (key, pointer to the blob, size, created/expiry timestamps, visibility, owner). Lives in a database (a key-value store or sharded SQL), which is fast to look up by key.
Content — the actual text, stored in an object store (S3) addressed by the key (or a content hash). Object stores are cheap, durable, and built for large blobs.

optimization

For tiny pastes (a few hundred bytes), the extra round-trip to the blob store can take longer than the metadata lookup that precedes it. A common refinement is to inline small content directly in the metadata row and only spill to the blob store above a size threshold, which gets the best of both worlds.
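A size-threshold split might look like the sketch below. The 512-byte cutoff, the `blob://` URL scheme, and the dict standing in for the object store are all assumptions made for illustration; a real threshold should come from measurement.

```python
from dataclasses import dataclass
from typing import Optional

INLINE_THRESHOLD_BYTES = 512  # assumed cutoff, not a recommendation

@dataclass
class PasteRecord:
    key: str
    size: int
    inline_text: Optional[str] = None   # set for small pastes
    blob_url: Optional[str] = None      # set for pastes stored in the blob store

def store_content(key: str, text: str, blob_store: dict) -> PasteRecord:
    """Inline small content; otherwise write it to the blob store and keep a pointer."""
    data = text.encode("utf-8")
    if len(data) <= INLINE_THRESHOLD_BYTES:
        return PasteRecord(key=key, size=len(data), inline_text=text)
    blob_url = f"blob://pastes/{key}"   # hypothetical pointer format
    blob_store[blob_url] = data         # stand-in for an object-store PUT
    return PasteRecord(key=key, size=len(data), blob_url=blob_url)
```

The threshold is measured in encoded bytes rather than characters, since that is what the row and the network actually carry.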

Step 6 — Data Model

metadata schema

pastes (
   key         PK,          # base-62 short key
   blob_url,                # pointer into S3 (null if inlined)
   inline_text,             # small pastes stored here directly
   size, visibility, owner_id,
   created_at, expires_at   # indexed for cleanup
)

Shard by key (hash partitioning) so lookups stay single-shard and the table scales horizontally. An index on expires_at supports the cleanup sweep.
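Hash partitioning by key can be as simple as the following sketch. The shard count and function name are assumptions; the point is that the mapping must be stable across processes, which rules out Python's built-in `hash()` for strings because it is randomized per process.

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a paste key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo remaps most keys when `num_shards` changes, so a system that expects to reshard often would likely use a fixed number of logical partitions or consistent hashing instead.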

Step 7 — Read and Write Paths

Write: validate + rate-limit → obtain a key (from the pool) → store content (inline if small, else PUT to blob store) → insert the metadata row → return the URL. Read: look up the key in the cache; on a hit, return immediately; on a miss, read metadata, fetch content (inline or from blob store), populate the cache, and return. Because content is immutable, a cached entry never goes stale from an edit. It only has to be dropped when the paste expires or is explicitly deleted (by its owner or a takedown).
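The read path is a standard cache-aside lookup. In the sketch below, plain dicts stand in for Redis, the metadata database, and the blob store, and the record fields follow the metadata schema from Step 6; the function name and the `(text, expires_at)` cache layout are assumptions.

```python
import time
from typing import Optional

def read_paste(key: str, cache: dict, metadata_db: dict, blob_store: dict,
               now: Optional[float] = None) -> Optional[str]:
    """Return the paste text, or None if it is missing or expired."""
    now = time.time() if now is None else now

    cached = cache.get(key)
    if cached is not None:
        text, expires_at = cached
        if expires_at is None or now < expires_at:
            return text                  # cache hit
        cache.pop(key, None)             # cached entry outlived the paste
        return None

    meta = metadata_db.get(key)          # cache miss: go to the origin
    if meta is None:
        return None
    expires_at = meta.get("expires_at")
    if expires_at is not None and now >= expires_at:
        return None

    text = meta.get("inline_text")
    if text is None:                     # large paste: follow the blob pointer
        text = blob_store[meta["blob_url"]].decode("utf-8")

    cache[key] = (text, expires_at)      # populate for later readers
    return text
```

Storing `expires_at` next to the cached text lets a hit be validated without touching the database. With Redis, the same effect is usually achieved by setting the entry's TTL to the paste's remaining lifetime.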

Step 8 — Caching and CDN

This is the heart of a read-heavy service. Put a Redis cache in front of the metadata + blob lookups, keyed by paste key, so popular pastes are served from memory. Because pastes are immutable, also front the read path with a CDN: a paste's rendered/raw content can be cached at edge locations with a long TTL (bounded by the paste's own expiry), so a viral paste is served almost entirely from the edge and never melts the origin. Use an eviction policy like LRU so the cache holds the currently-hot set.
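Bounding the edge TTL by the paste's own expiry comes down to computing a `Cache-Control` header per response. The sketch below is one way to do it; the one-day ceiling and the function name are assumptions, and private pastes are kept out of shared caches entirely.

```python
from typing import Optional

def cache_control(expires_at: Optional[float], now: float,
                  visibility: str = "public",
                  max_edge_ttl: int = 86_400) -> str:
    """Build a Cache-Control value whose max-age never outlives the paste."""
    if visibility == "private":
        return "private, no-store"       # never cache at the CDN
    remaining = max_edge_ttl if expires_at is None else int(expires_at - now)
    ttl = max(0, min(max_edge_ttl, remaining))
    if ttl == 0:
        return "no-store"                # already expired or about to be
    return f"public, max-age={ttl}, immutable"
```

The ceiling is what limits how long a deleted paste can linger at the edge, so it is a policy choice as much as a performance one.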

Step 9 — Expiration and Cleanup

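Expired pastes must stop being served and eventually be reclaimed. Three complementary mechanisms apply (same playbook as the URL shortener). The first two enforce each paste's exact expiry; object-store lifecycle rules are coarser (in S3 they match by prefix or tag and act at day granularity), so treat them as a storage-reclamation backstop: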

Lazy deletion — on read, if expires_at has passed, treat it as gone (404) and optionally delete it then. Cheap, but expired-yet-unread pastes linger.
Scheduled sweeper — a background job periodically scans the expires_at index and bulk-deletes expired metadata + their blobs, reclaiming storage.
Object-store TTL — let the blob store auto-expire objects via lifecycle rules, so content cleanup is handled by the storage layer.
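Lazy deletion and the sweeper can share one deletion routine. The following is a simplified sketch: dicts stand in for the metadata store, blob store, and cache, the function names are hypothetical, and the full scan in `sweep` is a placeholder for a range scan over the `expires_at` index.

```python
from typing import Optional

def is_expired(meta: dict, now: float) -> bool:
    expires_at = meta.get("expires_at")      # None means "never expires"
    return expires_at is not None and now >= expires_at

def delete_paste(key: str, metadata_db: dict, blob_store: dict, cache: dict) -> None:
    """Remove the metadata row, its blob (if any), and any cached copy."""
    meta = metadata_db.pop(key, None)
    if meta and meta.get("blob_url"):
        blob_store.pop(meta["blob_url"], None)
    cache.pop(key, None)

def get_if_live(key: str, metadata_db: dict, blob_store: dict, cache: dict,
                now: float) -> Optional[dict]:
    """Lazy deletion: an expired paste is reported as missing and removed on read."""
    meta = metadata_db.get(key)
    if meta is None:
        return None
    if is_expired(meta, now):
        delete_paste(key, metadata_db, blob_store, cache)
        return None
    return meta

def sweep(metadata_db: dict, blob_store: dict, cache: dict, now: float,
          batch_size: int = 1000) -> int:
    """Background sweeper: delete up to batch_size expired pastes per run."""
    expired = [k for k, m in metadata_db.items() if is_expired(m, now)][:batch_size]
    for key in expired:
        delete_paste(key, metadata_db, blob_store, cache)
    return len(expired)
```

Batching keeps each sweep short so it does not compete with foreground traffic; the job simply runs again until it returns zero.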

Step 10 — Scaling and Fault Tolerance

The app/API tier is stateless and scales horizontally behind a load balancer. The metadata store is sharded by key and replicated for availability; the blob store (S3) is designed for 11 nines of durability and effectively unlimited scale. The cache tier scales by sharding (consistent hashing) across Redis nodes. Since each layer scales independently and the content layer is immutable, capacity grows roughly linearly with the nodes you add. The main capacity concern is read spikes, which the cache and CDN largely absorb.
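Consistent hashing for the cache tier can be illustrated with a small hash ring. This is a teaching sketch rather than what any particular Redis client does internally; the class name, node names, and virtual-node count are assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed at `vnodes` points to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters is that removing a node only remaps the keys that node owned, so losing one cache node does not invalidate the rest of the hot set.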

Step 11 — Security and Abuse Prevention

A public "paste anything" endpoint is catnip for abuse, so this deserves explicit attention:

Rate limiting on the create endpoint (per IP / per user) to stop spam and storage-exhaustion attacks.
Content scanning for malware, phishing, and spam; honor takedown requests.
Visibility controls — unlisted pastes rely on unguessable random keys; truly private pastes need auth and access checks.
Size caps and input validation to prevent oversized uploads.
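Rate limiting and the size cap can both be enforced before any storage work happens. The sketch below uses a token bucket, which is one common choice rather than the only one; the limits, class names, and in-memory bucket map are assumptions, and a multi-server deployment would keep the counters in a shared store.

```python
MAX_PASTE_BYTES = 2 * 1024 * 1024   # assumed hard cap on paste size

class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class CreateLimiter:
    """One bucket per client (IP address or user ID)."""

    def __init__(self, capacity: float = 5, refill_per_sec: float = 0.1):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets = {}

    def check(self, client_id: str, content: str, now: float) -> bool:
        """Reject oversized pastes first, then apply the per-client rate limit."""
        if len(content.encode("utf-8")) > MAX_PASTE_BYTES:
            return False
        if client_id not in self._buckets:
            self._buckets[client_id] = TokenBucket(self.capacity, self.refill_per_sec, now)
        return self._buckets[client_id].allow(now)
```

Checking the size before consuming a token means an oversized upload is rejected without costing the client part of its quota.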

Step 12 — Key Tradeoffs

Random vs sequential keys. Random keys are unguessable (needed for unlisted pastes) but require collision handling; sequential keys avoid collisions but are enumerable. Paste services lean random.
Inline vs blob storage. Inlining small pastes saves a round-trip; blob storage keeps the database lean for large content. A size threshold gets both.
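Cache/CDN TTL. Longer edge caching means fewer origin hits but coarser control: a deleted or taken-down paste can keep being served from the edge until its TTL lapses or it is explicitly purged. Bound the TTL by paste expiry and decide how much delete latency is acceptable.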
Consistency. Pastes are immutable and write-once, so eventual consistency of the cache is harmless — a freshly created paste just needs its first read to hit the origin.

takeaway

Pastebin is a lesson in two ideas: separate small metadata from large content (DB for the pointer, object store for the bytes), and exploit immutability to cache hard. Once those click, the rest — key generation, expiration, abuse handling — is standard. The read path's spike-tolerance via cache + CDN is what turns a simple CRUD app into a scalable service.

🎯 interview hot-takes

How is Pastebin different from a URL shortener? It stores large content, not a tiny target URL — so the key decision is metadata-vs-blob storage, and caching matters far more.
Where does the paste content live? In an object store (S3), with only a small pointer + metadata row in the database; tiny pastes may be inlined to skip the round-trip.
Why is caching so effective here? Pastes are immutable and read-heavy, so cached/CDN entries never need invalidation — perfect for absorbing viral read spikes.
Random or sequential keys? Random, because unlisted pastes depend on unguessable URLs; sequential IDs would let anyone enumerate all pastes.
How do pastes expire? TTL with lazy deletion on read plus a background sweeper (and object-store lifecycle rules) to reclaim storage.