A notification system looks trivial until you remember that one product event — a new like, a shipment update, a fraud alert — may need to reach tens of millions of users within seconds, across four different channels (push, SMS, email, in-app), through external providers you do not control, while respecting each user's opt-outs and quiet hours, never sending the same message twice, and surviving the inevitable APNs or FCM hiccup. The hard parts are not the "send an email" call; they are the fan-out from one event to millions of deliveries, the at-least-once-plus-idempotent delivery semantics, the third-party failure handling, and the preference and rate-limiting logic that keeps you from spamming users. Get those wrong and you either drop critical alerts or wake up your entire user base at 3 a.m. with a retry storm.

⚡ Quick Takeaways
  • Decouple trigger from delivery — a producer writes one event; a message queue absorbs the spike; fan-out workers expand it into per-(user, channel) jobs.
  • One worker pool per channel — push, SMS, email, and in-app each have their own queue, provider adapter, rate limits, and retry policy.
  • At-least-once + idempotent consumers — exactly-once across APNs/Twilio/SES is impossible; dedup on notification_id in Redis instead.
  • Preferences checked in the fan-out path — per-category opt-ins, quiet hours in the user's timezone, and frequency caps are applied before any job is enqueued.
  • Retries with backoff + DLQ — transient 5xx/429 retried with exponential backoff and jitter; permanent failures (invalid token) pruned; exhausted jobs go to a dead-letter queue.
  • Template service — content lives in versioned, localized templates rendered at send time; producers pass IDs and variables, not raw strings.
  • Observability is first-class — every notification carries an ID traced from trigger → queue → provider → delivery receipt, with funnel metrics per channel.
tldr

Split the system into a notification API, a fan-out/preference layer, per-channel queues and workers, and provider adapters (APNs, FCM, Twilio, SES). Use a message queue to decouple the burst trigger from steady delivery. Make delivery at-least-once and consumers idempotent via a Redis dedup key. Apply user preferences, quiet hours, and frequency caps before enqueuing. Retry transient provider errors with backoff and route exhausted jobs to a DLQ. Render content from versioned templates and trace every notification end to end.

Step 1 — Clarify Requirements

A notification system is a classic interview prompt because the naive answer ("call SendGrid in a loop") collapses the moment you introduce scale, multiple channels, and reliability. Pin down scope first.

Functional requirements

  • Send notifications across four channels: mobile push (iOS/Android), SMS, email, and in-app (a feed/inbox the client polls or receives over a websocket).
  • Support triggered (transactional, event-driven: OTP, order shipped) and scheduled/broadcast (marketing campaigns to a segment) notifications.
  • Respect user preferences: per-category channel opt-in/out, quiet hours, frequency caps, and global unsubscribe.
  • Render content from templates with per-locale variants and runtime variables.
  • Provide delivery status: queued, sent, delivered, failed, opened (where the provider reports it).

Non-functional requirements

  • Scale — handle bursty fan-out: a single broadcast may produce 50M+ deliveries; steady state is millions of transactional sends per hour.
  • Low trigger latency — the API that accepts a notification request returns in <100 ms; actual delivery is async.
  • Reliability — critical notifications (OTP, security) must not be silently dropped; target at-least-once delivery with idempotency.
  • No duplicates — a user should never receive the same logical notification twice, even when the pipeline retries.
  • Extensibility — adding a new channel (e.g., WhatsApp) should not require rewriting the core.
interview tip

Separate transactional from promotional early. They have opposite priorities: transactional notifications bypass marketing preferences, demand at-least-once delivery, and run on a low-latency lane; promotional notifications are heavily rate-limited, fully opt-out-respecting, and can tolerate seconds-to-minutes of delay. A single pipeline that treats them identically is a red flag.

Step 2 — Capacity Estimation

Sizing tells you where the bottleneck is — it is almost never the database; it is the queue throughput and the provider rate limits.

Traffic

  • Assume 100M DAU and an average of 5 notifications/user/day → 500M notifications/day ≈ ~5,800/sec average.
  • Peak is spiky: a marketing broadcast to 50M users fires in minutes. If we want it drained in 10 minutes, that is 50M / 600 s ≈ ~83,000 deliveries/sec sustained — the design must absorb 10–20× the average as a burst.
  • Channel mix is uneven: push dominates (≈70%), in-app (≈20%), email (≈8%), SMS (≈2%, but the most expensive and rate-limited).

Storage

  • Notification log record: ~300 bytes (ids, channel, status, timestamps, template_id). 500M/day × 300 B ≈ ~150 GB/day — this goes to a horizontally scalable store with TTL (e.g., 30–90 days), not a single SQL box.
  • Device tokens: 100M users × ~2 devices × ~100 B ≈ ~20 GB — small, hot, kept in a fast store and cache.
  • Preferences: 100M users × ~200 B ≈ ~20 GB — read on every fan-out, so cached aggressively in Redis.
note

The number that should drive the design is not storage; it is the 83K/sec burst and the fact that providers throttle you. APNs, FCM, Twilio, and SES all impose rate limits. Your queue exists precisely so that an 83K/sec burst is reshaped into a steady stream the providers will accept, instead of getting you globally throttled or blacklisted.

Step 3 — API Design

The public surface is small. The producer describes what to send and to whom; the system decides how and when.

HTTP
# Send a notification (transactional)
POST /api/v1/notifications
     Authorization: Bearer <service-token>
     Idempotency-Key: "order-9182-shipped"   // dedup at the edge
     { "user_id": "u_123",
       "template_id": "order_shipped_v3",
       "channels": ["push", "email"],   // optional; else use prefs
       "variables": { "order_id": "9182", "eta": "Jun 30" },
       "priority": "transactional" }
     → 202 { "notification_id": "ntf_8f2a", "status": "accepted" }

# Broadcast to a segment (async, returns a job handle)
POST /api/v1/broadcasts
     { "segment_id": "seg_us_active", "template_id": "promo_summer",
       "priority": "promotional", "send_at": "2026-07-01T16:00:00Z" }
     → 202 { "broadcast_id": "bc_55", "status": "scheduled" }

# Query delivery status
GET /api/v1/notifications/{id}
     → { "status": "delivered", "channel": "push",
         "attempts": 1, "delivered_at": "..." }

# Update user preferences
PUT /api/v1/users/{id}/preferences
     { "marketing": { "push": false, "email": true },
       "quiet_hours": { "start": "22:00", "end": "08:00", "tz": "America/New_York" } }

Two design points. First, the Idempotency-Key header lets the caller make the trigger safe to retry — if the order service crashes and re-POSTs "order-9182-shipped", the API returns the same notification_id instead of sending twice. Second, the endpoint returns 202 Accepted, not 200: delivery is asynchronous, so the API's only job is to validate, dedup, persist the request, and enqueue. Everything expensive happens downstream.

Step 4 — High-Level Architecture

The system is a pipeline of decoupled stages connected by queues. The key principle: the trigger path is cheap and fast; the delivery path is parallel and absorbs failure.

flow
Producers (order svc, social svc, risk svc, campaign tool)
   │  POST /notifications  |  POST /broadcasts
   ▼
[ Notification API ] ── validate, edge-dedup (Idempotency-Key), persist
   │  publish 1 event
   ▼
[ Kafka: "notification-events" ]   ← absorbs the burst
   │
   ▼
[ Fan-out / Preference workers ]
   ├── expand audience (1 event → N users)
   ├── load preferences + device tokens (Redis)
   ├── drop opted-out / quiet-hours / over-cap
   └── emit 1 job per (user, channel)
   ▼
[ Per-channel queues ]   push | sms | email | in-app
   │             │         │        │
   ▼             ▼         ▼        ▼
[ Push wkr ] [ SMS wkr ] [ Email ] [ In-app wkr ]
   │  APNs/FCM   │ Twilio   │ SES     │ writes to inbox + websocket push
   ▼             ▼          ▼         ▼
[ Provider adapters ] ── retry/backoff, rate-limit, DLQ on exhaustion
   │  delivery receipts / webhooks
   ▼
[ Status store + Observability ]  (per-notification trace, funnel metrics)

Why a queue between every stage? Because each stage has a different and independent failure mode and throughput ceiling. The fan-out worker can expand an audience far faster than Twilio will accept SMS; the message queue in front of each channel is the shock absorber that lets the fast stages run flat-out while the slow, rate-limited providers drain at their own pace. Kafka also gives you replayability: if the email worker has a bug, you fix it and reprocess from the offset without losing events.

Step 5 — Channel Adapters: APNs, FCM, SMS, Email, In-App

Each channel is fundamentally different. The shared abstraction is a channel adapter interfacesend(job) → {delivered | retryable_error | permanent_error} — but the implementations diverge sharply.

ChannelProviderTransportKey constraints
iOS pushAPNsHTTP/2 persistent connection, token-based (JWT) authPer-connection multiplexing; 410 = token expired → prune
Android pushFCMHTTP/2 REST, OAuth service accountBatch up to 500; UNREGISTERED → prune token
SMSTwilio / SNSHTTPS RESTExpensive; strict per-number rate limits; carrier filtering
EmailSES / SendGridHTTPS / SMTPReputation/bounce management; DKIM/SPF; async webhooks
In-appinternalDB write + websocketNo third party; delivered if user online, else stored in inbox

Push: APNs and FCM specifics

Push is the highest-volume channel and has the most operational subtlety. APNs uses a long-lived HTTP/2 connection — you authenticate once with a provider JWT (valid ~1 hour, refreshed) and multiplex thousands of sends over the same connection rather than reconnecting per message. FCM lets you batch up to 500 messages per call. Both return a per-token result: a success, a retryable error (provider 5xx, connection reset), or a permanent error (410 Gone / BadDeviceToken / UNREGISTERED). Permanent errors are a token-management signal, not a retry signal — the worker emits a "prune this token" event so a future send never wastes a slot on a dead device.

In-app: online vs offline

In-app is the one channel you fully control. The worker writes the notification to the user's inbox table (so it survives offline), then — if the user has a live websocket connection — pushes it in real time. If they are offline, the inbox row is enough; the client fetches unread items on next open. This dual write (durable inbox + best-effort live push) is what makes in-app feel instant without losing notifications for offline users.

extensibility

The adapter interface is the seam that keeps the core stable. Adding WhatsApp or a web-push channel means writing one new adapter and one new queue — the fan-out, preference, dedup, retry, and observability layers are untouched. Resist the temptation to special-case channel logic in the fan-out worker; push it into the adapter.

Step 6 — The Fan-out Problem

Fan-out is the heart of the system: turning one trigger event into N per-(user, channel) delivery jobs. Two flavors, with very different cost profiles.

Triggered fan-out (1 → few)

A transactional event targets one user (or a small set). Fan-out is cheap: load the user's preferences and device tokens, decide channels, emit a handful of jobs. This path is latency-sensitive and runs on a dedicated high-priority lane.

Broadcast fan-out (1 → millions)

A campaign targets a segment of 50M users. Doing this in one worker is impossible — it would run for hours and any crash loses progress. The pattern is hierarchical fan-out with checkpointing:

  • The broadcast is split into chunks (e.g., 10K users each) by paging the segment with a cursor. Each chunk becomes its own message.
  • Many fan-out workers consume chunks in parallel; each expands its 10K users into per-channel jobs.
  • Progress is checkpointed per chunk, so a worker crash only re-does one 10K chunk, not the whole 50M broadcast — and because consumers are idempotent, re-doing a chunk does not double-send.

This is structurally the same fan-out-on-write vs fan-out-on-read tension you see in a news feed: precompute deliveries up front (write amplification, fast at send) versus compute at read time. For notifications the answer is almost always fan-out-on-write — a notification only matters if it is pushed, so there is no "read time" to defer to.

scalability note

Keep the broadcast and transactional lanes physically separate — different topics, different worker pools, different priorities. A 50M marketing blast must never delay a 2FA code. Priority queues alone are not enough under sustained load; isolating the pools guarantees the critical lane has reserved capacity.

Step 7 — Template Service and Personalization

Producers must not send raw strings. If the order service hard-codes "Your order shipped!", you cannot localize it, A/B test it, or fix a typo without a deploy across every producer. A template service centralizes content.

  • Versioned templatesorder_shipped_v3 is an immutable, reviewed artifact. Producers reference a template ID plus a variable map; rendering happens at send time.
  • Per-locale variants — each template has localized bodies; the renderer picks based on the user's locale, falling back to a default.
  • Per-channel rendering — the same logical notification renders differently per channel: a 40-char push title, a full HTML email, a 160-char SMS. The template bundles all channel renditions.
  • Safe interpolation — variables are escaped per channel (HTML-escape for email, no injection for SMS) to prevent content-injection bugs.
YAML
id: order_shipped_v3
category: transactional
locales:
  en:
    push:  { title: "Shipped!", body: "Order {{order_id}} arrives {{eta}}" }
    email: { subject: "Your order is on the way", html: "templates/shipped.html" }
    sms:   { body: "Order {{order_id}} ships, ETA {{eta}}. Track: {{url}}" }
  es:
    push:  { title: "¡Enviado!", body: "Pedido {{order_id}} llega {{eta}}" }

Step 8 — User Preferences and Opt-out

Preferences are checked in the fan-out path, before a job is enqueued — never send first and filter later. The preference service answers, for a given (user, category, channel): is this allowed right now?

  • Per-category, per-channel opt-in — a user might allow security emails but disable marketing push. Stored as a matrix of (category × channel) booleans.
  • Quiet hours — no non-critical notifications during the user's sleep window, evaluated in their timezone. A notification generated at 03:00 local is either dropped (promotional) or deferred to the morning (digest).
  • Frequency caps — "at most 3 marketing pushes per day." Enforced with a Redis counter per (user, category, day) that the fan-out worker increments.
  • Global unsubscribe + legal — an unsubscribed user is hard-filtered. Email must honor one-click unsubscribe (RFC 8058); SMS must honor STOP.
critical override

Transactional and safety-critical notifications (OTP, login-from-new-device, fraud alerts) bypass marketing preferences and quiet hours. The preference check must branch on category: promotional is filtered hard; critical is delivered regardless of opt-out (a user cannot opt out of their own 2FA code). Encoding this distinction wrong is both a product failure and, for some categories, a security failure.

Step 9 — Rate Limiting and Deduplication

Two distinct concerns that both protect the user and the providers.

Rate limiting

There are three levels, and a good answer names all three. Per-user frequency caps (product-level: don't annoy the user). Per-provider limits (operational: APNs/Twilio/SES will throttle or ban you if you exceed their quotas) enforced with a token bucket in front of each adapter so the worker pool self-throttles to the provider's ceiling. Per-tenant limits if you are a platform serving multiple internal teams, so one team's runaway campaign cannot starve another. This is the same machinery covered in depth in designing a rate limiter — a distributed token bucket or sliding window backed by Redis.

Deduplication

Dedup is what makes at-least-once delivery tolerable. The same logical notification can arrive at the consumer multiple times — the producer retried the trigger, Kafka redelivered after a rebalance, or a broadcast chunk was reprocessed after a crash. The consumer must collapse these to a single send:

Python
def deliver(job):
    # dedup key is deterministic across retries
    key = f"dedup:{job.notification_id}:{job.user_id}:{job.channel}"
    # SETNX returns False if the key already exists → already sent
    first_time = redis.set(key, 1, nx=True, ex=86400)
    if not first_time:
        metrics.incr("dedup.suppressed")   # dropped a duplicate
        return
    result = adapter.send(job)        # call APNs / Twilio / SES
    if result.permanent_error:
        redis.delete(key)             # let a corrected retry through
        prune_token(job)

The dedup key is notification_id + user_id + channel so the same notification on push and email both go out, but a redelivered push job is suppressed. The TTL (24 h) bounds memory while covering any realistic retry window. The subtle bug to avoid: setting the dedup key after a successful send means a crash between send and set re-sends; setting it before and never clearing it on a permanent error means a fixable failure is wrongly suppressed forever. The ordering above — claim the key first, release it only on permanent error — handles both.

Step 10 — Retries, DLQ, and Delivery Guarantees

Providers fail constantly at scale: transient 5xx, connection resets, 429 throttling. The retry policy must distinguish failure classes and never retry forever.

OutcomeExamplesAction
Success2xx, accepted by providerMark sent; await delivery receipt webhook
Retryable5xx, 429, timeout, connection resetExponential backoff + jitter, cap ~5 attempts, then DLQ
Permanent410 Gone, invalid token, hard bounce, STOPNo retry; prune token / suppress address; record

Exponential backoff with jitter (1s, 2s, 4s… plus randomization) prevents a provider blip from turning into a synchronized retry storm where every worker retries in lockstep and re-overwhelms the recovering provider. After the attempt cap, the job moves to a dead-letter queue — a separate topic where exhausted jobs land for inspection, alerting, and optional manual or automated replay once the root cause is fixed. The DLQ is what turns "we silently dropped 200K OTPs during the APNs outage" into "200K jobs are parked in the DLQ and replayed automatically when APNs recovered."

Why at-least-once, not exactly-once

Exactly-once delivery to an external provider is unachievable: after you call Twilio and it sends the SMS, the acknowledgement back to you can be lost. You cannot tell "sent but ack lost" from "not sent," so a safe system retries — which means the SMS could go out twice. The honest, standard answer is at-least-once on the wire plus idempotent consumers: you accept that the pipeline may attempt a send more than once, and you make duplicate attempts harmless via the dedup key from Step 9. This collapses to effectively-once from the user's perspective without claiming a guarantee the providers cannot give you.

delivery semantics

State it crisply: at-least-once delivery + idempotent consumer = effectively-once user experience. "Exactly-once" is a property of your internal processing (Kafka offsets, dedup key), never of the third-party hop. Claiming exactly-once end-to-end through APNs is a classic interview mistake.

Step 11 — Data Model

Three stores, each chosen for its access pattern: device tokens and preferences are hot key-value reads; the notification log is a high-volume append.

SQL
-- Device tokens (hot KV; cache in Redis)
CREATE TABLE device_tokens (
  user_id    BIGINT,
  platform   VARCHAR(10),       -- ios | android | web
  token      VARCHAR(255),
  last_seen  TIMESTAMP,
  PRIMARY KEY (user_id, platform, token)
);

-- Preferences (read on every fan-out; cache aggressively)
CREATE TABLE preferences (
  user_id    BIGINT      PRIMARY KEY,
  matrix     JSONB,                 -- {category: {channel: bool}}
  quiet_start TIME, quiet_end TIME, tz VARCHAR(40),
  unsubscribed BOOLEAN  DEFAULT FALSE
);

-- Notification log (high-volume append; NoSQL / Cassandra, TTL 90d)
-- partition by user_id, clustered by created_at desc
-- { notification_id, user_id, channel, template_id, status,
--   attempts, created_at, sent_at, delivered_at, error_code }

The notification log is append-heavy and queried by "this user's recent notifications" — a perfect fit for a wide-column NoSQL store partitioned by user_id with a TTL, not a single relational table that would need constant sharding and purging. Device tokens and preferences are small and read on the hot fan-out path, so they live in a fast store fronted by Redis with a short TTL; a token write (login on a new device) invalidates the cache entry.

Step 12 — Scaling and Fault Tolerance

Every stage scales horizontally and degrades independently.

Scaling

  • Workers are stateless — fan-out and channel workers are Kafka consumer groups; add consumers to add throughput, with partitions sized for peak parallelism. Partition by user_id to preserve per-user ordering where it matters (e.g., "shipped" before "delivered").
  • Channel pools scale independently — push needs hundreds of workers; SMS needs few because Twilio rate-limits anyway. Autoscale each pool on its own queue depth.
  • Provider connection pooling — APNs HTTP/2 connections and FCM batches are reused across sends; a connection pool per worker amortizes TLS/handshake cost.

Fault tolerance

  • Provider outage — retries + DLQ absorb it; the affected channel backs up in its queue while others flow. When the provider recovers, drain the backlog and replay the DLQ. Critical: the queue must have enough retention to hold a multi-hour outage's worth of jobs.
  • Worker crash — Kafka redelivers uncommitted messages to another consumer; idempotent dedup makes the redelivery harmless.
  • Redis (dedup/cache) failure — degrade gracefully: on a dedup-store outage you can fail open (risk a few duplicates) for non-critical channels rather than halt delivery; preferences fall back to the source DB at higher latency.
  • Poison messages — a malformed job that always throws is caught by the attempt cap and routed to the DLQ instead of blocking the partition forever.

Step 13 — Observability

You cannot operate a notification system you cannot see into. "Did the user get the OTP?" must be answerable in seconds. Observability is built around the per-notification trace and the delivery funnel.

  • End-to-end trace — every notification carries an ID stamped at the API and propagated through the event, fan-out, queue, worker, and provider call, so you can reconstruct its full timeline: accepted → enqueued → sent → provider-acked → delivery-receipt.
  • Funnel metrics per channel — accepted, suppressed (prefs/dedup), sent, delivered, failed, opened. A sudden drop in delivered/sent ratio for push is an APNs problem; a spike in suppressed is a preference or dedup anomaly.
  • Delivery receipts / webhooks — providers report final status asynchronously (SES bounces, APNs feedback, Twilio status callbacks). Ingesting these closes the loop from "sent" to "actually delivered/bounced" and feeds token pruning and bounce suppression.
  • Alerting — alert on DLQ growth rate, per-provider error rate, and queue lag, not just absolute volume. A growing DLQ is the earliest signal of a provider or token problem.
note

Delivery receipts are the difference between "we sent it" and "they got it." A push can be accepted by APNs and still never reach a phone (device off, token silently dead). Without ingesting receipts and feedback, your dashboards lie — they show 100% "sent" while real delivery quietly degrades.

Step 14 — Key Tradeoffs

DecisionChoiceTrade-off accepted
Delivery semanticsAt-least-once + idempotentPossible duplicate attempts, suppressed by dedup key; exactly-once is impossible across providers
Trigger vs deliveryFully decoupled via queueAdded end-to-end latency (sub-second to seconds) for huge throughput and isolation
Fan-out timingFan-out on writeWrite amplification on broadcast; no read-time path exists for notifications
Channel couplingOne queue + adapter per channelMore moving parts; gains independent scaling, retry, and rate limits
Preference checksIn fan-out, before enqueueExtra read on the hot path (cached) vs. wasting provider quota on filtered sends
Notification logWide-column NoSQL + TTLNo ad-hoc joins; optimized for per-user append and time-range reads
takeaway

A notification system is an exercise in controlled fan-out and graceful failure. The spine is a queue that decouples a bursty trigger from rate-limited providers; the reliability comes from at-least-once delivery made safe by idempotent, dedup-keyed consumers; the product correctness comes from checking preferences before you enqueue and treating transactional notifications as a separate, preference-bypassing lane. Everything else — templates, retries, DLQ, observability — exists to make those three things operate at scale without spamming users or dropping the one OTP that mattered.

🎯 interview hot-takes

How do you fan out a notification to millions of users without blocking? Decouple the trigger from delivery with a message queue. The trigger service writes one event; a fan-out worker expands the audience and enqueues one job per (user, channel) onto Kafka. Thousands of channel workers consume in parallel and call APNs/FCM/SMS/email providers. The triggering API call returns in milliseconds because it only writes a single event, not millions of sends.
At-least-once or exactly-once delivery — which do you pick? At-least-once. Exactly-once across third-party providers (APNs, Twilio, SES) is impossible because their acknowledgements can be lost after they have already delivered. So you accept at-least-once on the wire and make the consumer idempotent: a dedup key (notification_id + user_id + channel) in Redis with a TTL drops a duplicate event before it ever reaches the provider.
How do you handle a third-party provider (APNs/FCM) outage or throttling? Retry with exponential backoff and jitter for transient 5xx/429, capped at a few attempts; after the cap, route the job to a dead-letter queue for replay. Honor each provider's rate limit with a token bucket so you never get globally throttled. Permanent errors (invalid token, 410 Gone) are not retried — the token is marked dead and pruned.
How do you respect user preferences and quiet hours at scale? A preference service is consulted in the fan-out path before any job is enqueued. It stores per-user, per-category channel opt-ins, quiet-hours windows in the user's timezone, and frequency caps, cached in Redis for single-digit-millisecond reads. Marketing categories are checked hard; critical transactional notifications (security alerts, OTP) bypass non-essential preferences.
How do you prevent sending the same notification twice? Idempotency at two layers: the producer attaches a deterministic notification_id (a hash of event + user + template); the consumer does a SETNX on a Redis dedup key before sending and skips if it already exists. Combined with a frequency cap, this collapses duplicate triggers and retry-storms into a single delivered message.

← previous
Design a News Feed