A load balancer sits between clients and a pool of backend servers, distributing incoming traffic so no single server becomes a bottleneck. It is the cornerstone of horizontal scalability: add more servers and the load balancer spreads the work automatically. Without one, scaling out is operationally impractical, and with a single server any failure takes down the entire service.

⚡ Quick Takeaways
  • Layer 4 vs Layer 7 — L4 routes on TCP/UDP headers (fast, no content inspection); L7 routes on HTTP path/headers/cookies (smarter, slightly more CPU).
  • Round Robin works well for uniform workloads; Least Connections adapts to variable-duration requests.
  • IP Hash gives session affinity without storing state, but reshuffles all clients when a server is removed.
  • Sticky sessions are a workaround for stateful servers — the real fix is externalizing session state to Redis.
  • LB redundancy — run active-passive or active-active pairs with a VIP; the LB itself must not be a SPOF.
tldr

Load balancers route traffic across backend servers using algorithms like round-robin, least connections, or IP hash. Layer 4 LBs route on TCP/UDP metadata; Layer 7 LBs inspect HTTP content and can make smarter routing decisions. Health checks automatically remove unhealthy servers from the pool.

Load balancing distributing traffic across backend servers

Why Load Balancing?

Three core problems it solves:

Layer 4 vs. Layer 7

| Dimension | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Works on | TCP/UDP headers (IP + port) | HTTP headers, URL, cookies, body |
| Performance | Very fast (no content inspection) | Slightly higher CPU cost |
| Routing intelligence | Low (no awareness of application protocol) | High (can route by path, host, or header value) |
| TLS termination | Pass-through only | Can terminate TLS, inspect payload, re-encrypt |
| Examples | AWS NLB, HAProxy TCP mode | AWS ALB, Nginx, Envoy |

In practice, most modern architectures use Layer 7 load balancers at the edge (for HTTP routing, TLS termination, and header manipulation) and Layer 4 load balancers for internal TCP services where raw throughput matters more than routing intelligence.

Load Balancing Algorithms

Round Robin

Requests are distributed sequentially across the server pool in a circular order. Simple and effective when all servers have identical capacity and request cost is uniform. Weighted Round Robin extends this by assigning a weight to each server proportional to its capacity, so a server with weight 3 receives three times as many requests as one with weight 1.
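
As a rough illustration of the weighting described above, here is a minimal sketch that expands each server into as many slots as its weight and cycles through them. The server names and the `weighted_round_robin` helper are hypothetical; real balancers usually interleave picks more smoothly than this naive expansion does.

```python
import itertools

def weighted_round_robin(weights):
    """Return an endless iterator of server names, proportional to integer weights.

    `weights` maps a (hypothetical) server name to its weight.
    """
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

picker = weighted_round_robin({"a": 3, "b": 1})
first_eight = [next(picker) for _ in range(8)]
# Over two full cycles, "a" is picked three times as often as "b".
```

With equal weights this degenerates to plain round robin.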

Least Connections

Each new request goes to the server currently handling the fewest active connections. This adapts naturally to workloads with variable request duration: a server tied up with long-running requests accumulates active connections and therefore receives fewer new ones. Weighted Least Connections divides the active connection count by the server's weight before comparing.
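
The selection rule can be sketched in a few lines. This assumes the balancer already tracks a per-server count of active connections; the `pick_least_connections` name and the sample counts are made up for illustration.

```python
def pick_least_connections(active, weights=None):
    """Return the server with the lowest active-connections / weight ratio.

    `active` maps server name -> current active connections.
    `weights` is optional; a missing weight defaults to 1.
    """
    weights = weights or {}
    return min(active, key=lambda s: active[s] / weights.get(s, 1))

active = {"a": 12, "b": 4, "c": 9}
plain = pick_least_connections(active)
weighted = pick_least_connections(active, {"a": 4, "b": 1, "c": 1})
```

In the weighted call, server "a" wins despite having the most connections, because its weight of 4 brings its ratio down to 3.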

IP Hash

The client's IP address is hashed to deterministically select a server. The same client always lands on the same backend, providing session affinity without requiring the load balancer to store session state. The downside: with a simple hash-modulo-N scheme, adding or removing a server changes N, so most clients are re-hashed to a different backend (not only those on the affected server), potentially disrupting in-flight sessions.
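
A minimal sketch of hash-based selection, assuming a simple modulo over the pool size. The `pick_by_ip` helper and the addresses are illustrative only. A stable digest is used because Python's built-in `hash()` for strings is randomized per process.

```python
import hashlib

def pick_by_ip(client_ip, servers):
    """Map a client IP to one server, deterministically across processes."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
first = pick_by_ip("203.0.113.7", servers)
second = pick_by_ip("203.0.113.7", servers)   # same client, same backend
```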

Random

A server is selected uniformly at random. Statistically approaches round-robin with large request volumes, but with no guarantees for small volumes. Useful when you want zero coordination overhead and the pool is large and homogeneous.

Health Checks

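Load balancers probe backends at regular intervals to detect failures. A server is marked unhealthy after a configurable number of consecutive failures and removed from the pool. It is re-added once a configurable number of consecutive checks pass. There are two flavors: active checks, where the load balancer sends synthetic probes (for example, to /health) on a schedule, and passive checks, where it infers health from errors on real traffic (for example, 5xx responses).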

nginx
upstream backend {
    least_conn;                          # least-connections algorithm
    server 10.0.0.1:8080 weight=3;
    server 10.0.0.2:8080 weight=1;
    server 10.0.0.3:8080 backup;         # only used if primary servers fail
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_500;
    }
}

Sticky Sessions

Some applications store session state on the server (in-process caches, file uploads in progress). Sticky sessions (also called session persistence) ensure that a given client's requests always go to the same backend. The load balancer typically implements this by setting a cookie on the client's first request and reading it on subsequent ones.

Sticky sessions are a double-edged sword: they improve compatibility with stateful applications but reduce the even distribution of load and complicate failure handling — if the sticky server goes down, all its sessions are lost anyway. The preferred fix is to move session state to a shared store (Redis, a database) so any server can handle any request.

Load Balancer Redundancy

The load balancer itself is a potential single point of failure. Production deployments use an active-passive or active-active pair, often with a Virtual IP (VIP) address that floats between nodes. DNS or BGP routes the VIP to the active node; if it fails, the passive node claims the VIP within seconds via protocols like VRRP (HAProxy + Keepalived) or by updating the DNS record.

Consistent Hashing

IP Hash and simple modulo-based routing share a catastrophic flaw: adding or removing a server reshuffles nearly all client assignments. If you have 10 servers and add one, modulo 11 produces completely different targets for most keys — suddenly 90% of your cache keys miss on the new server mapping, causing a thundering herd on the backends. Consistent hashing solves this by mapping both servers and requests onto a circular hash ring, so that adding or removing one server affects only the keys in the range between it and its nearest neighbor — roughly 1/n of keys for n servers.
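
To put a number on the modulo problem, the following sketch hashes 10,000 synthetic keys and counts how many change buckets when the pool grows from 10 to 11 servers. The key names and the `bucket` helper are arbitrary choices for the demonstration.

```python
import hashlib

def bucket(key, n):
    """Naive placement: stable hash of the key, modulo the number of servers."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(bucket(k, 10) != bucket(k, 11) for k in keys)
moved_fraction = moved / len(keys)
```

A key keeps its bucket only when its hash gives the same remainder modulo 10 and modulo 11, which happens for about 1 key in 11, so roughly 10/11 of the keys are expected to move.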

How the Ring Works

Each server is hashed to one or more points on a 0–2³²−1 ring. Each request key is also hashed to a point on the ring; the request is routed to the first server encountered moving clockwise from that point. With virtual nodes (vnodes), each physical server occupies multiple points on the ring (e.g., 150 vnodes per server), which smooths out the uneven distribution that occurs when server hashes cluster in one region of the ring.

python
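import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=150):
        self.ring = {}        # hash → node
        self.keys = []        # sorted hash values
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}-{i}")   # one ring point per virtual node
                self.ring[h] = node
                self.keys.append(h)
        self.keys.sort()

    def _hash(self, key):
        # The first 4 bytes of MD5 give a position on the 0..2**32-1 ring.
        # MD5 is used here for distribution only, not for security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def get_node(self, key):
        if not self.keys:
            raise ValueError("ring has no nodes")
        h = self._hash(key)
        # First ring point at or after h; the modulo wraps past the end to index 0.
        idx = bisect.bisect_left(self.keys, h) % len(self.keys)
        return self.ring[self.keys[idx]]   # clockwise successor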
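
The "roughly 1/n" claim can be checked empirically. The sketch below uses a compact, function-based ring (a hypothetical `build_ring` / `lookup` pair, not the class above) and measures how many of 10,000 synthetic keys change owner when an eleventh server joins.

```python
import bisect
import hashlib

def _h(text):
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=150):
    """Return a sorted list of (hash, node) points, `vnodes` points per node."""
    return sorted((_h(f"{n}-{i}"), n) for n in nodes for i in range(vnodes))

def lookup(ring, key):
    """Find the first point clockwise from the key's hash, wrapping at the end."""
    idx = bisect.bisect(ring, (_h(key), "")) % len(ring)
    return ring[idx][1]

nodes = [f"server-{i}" for i in range(10)]
before = build_ring(nodes)
after = build_ring(nodes + ["server-10"])

keys = [f"key-{i}" for i in range(10_000)]
moved = [k for k in keys if lookup(before, k) != lookup(after, k)]
moved_fraction = len(moved) / len(keys)
```

Every key that moves lands on the new server, and the moved fraction should sit near 1/11, compared with roughly 10/11 under naive modulo placement.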

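Consistent hashing is used by distributed caches (for example, Memcached clients using ketama), Dynamo-style datastores such as Cassandra, CDN edge servers, and database sharding routers. Some well-known systems solve the same problem differently: Redis Cluster maps keys onto 16,384 fixed hash slots, and Kafka's default partitioner hashes the key modulo the partition count. Consistent hashing is the answer whenever session affinity or cache locality matters and you need to add or remove nodes without catastrophic key remapping.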

L4 vs L7 — Going Deeper

The high-level distinction is well known, but the implementation details determine which you should reach for in each scenario.

Layer 4 Under the Hood

An L4 load balancer operates at the TCP/UDP transport layer. It sees source IP, destination IP, source port, and destination port — and nothing else. The two dominant forwarding strategies are:

Layer 7 Under the Hood

An L7 load balancer terminates the TCP connection from the client, reads the HTTP/gRPC/WebSocket request, makes a routing decision based on content, and establishes a separate TCP connection to the backend. This two-connection model is what enables features that L4 cannot provide:

gRPC and HTTP/2

gRPC multiplexes many calls over a single long-lived HTTP/2 connection. An L4 load balancer routes the entire connection to one backend, so all calls on that connection land on the same server, which defeats load balancing for that client. The fix is an L7-aware proxy (Envoy, Nginx with gRPC proxy module, Linkerd) that understands HTTP/2 and can distribute individual gRPC calls (HTTP/2 streams) across backends.
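
A toy simulation makes the imbalance visible. It assumes two hypothetical clients that each send 300 calls over one long-lived connection to a pool of three backends, and compares choosing a backend once per connection (L4 behavior) with choosing once per call (L7 behavior).

```python
from collections import Counter
from itertools import cycle

backends = ["b1", "b2", "b3"]
calls = [("client-1", i) for i in range(300)] + [("client-2", i) for i in range(300)]

# L4: the backend is chosen once per connection; every call follows it.
conn_rr = cycle(backends)
conn_to_backend = {}
l4 = Counter()
for client, _ in calls:
    if client not in conn_to_backend:
        conn_to_backend[client] = next(conn_rr)
    l4[conn_to_backend[client]] += 1

# L7: the backend is chosen for each individual call.
call_rr = cycle(backends)
l7 = Counter(next(call_rr) for _ in calls)
```

Under per-connection balancing, one backend receives no calls at all, while per-call balancing splits the 600 calls evenly.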

SSL/TLS Termination Strategies

There are three standard approaches to TLS in a load-balanced architecture, each with different security and performance tradeoffs:

| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| TLS Termination at LB | LB decrypts traffic; backend receives plain HTTP | Centralized cert management; LB can inspect/route on content; backends simpler | Internal traffic unencrypted; LB holds the private keys and sees all plaintext |
| TLS Pass-through (L4) | LB forwards encrypted bytes; backend terminates TLS | End-to-end encryption; LB never sees plaintext | No L7 routing possible; cert management per backend |
| TLS Re-encryption | LB terminates TLS from client, then opens a new TLS connection to backend | Both segments encrypted; LB can inspect and route | Higher CPU cost; two cert pairs to manage |

A common pattern in cloud architectures is TLS termination at the edge LB (AWS ALB, Cloudflare) combined with internal mTLS (mutual TLS between services, enforced by a service mesh like Istio or Linkerd). This gives you L7 routing at the edge, centralized public cert management, and encryption on every network hop without hand-managing backend certificates. Note that the edge LB still sees plaintext, so this is hop-by-hop rather than strictly end-to-end encryption.

nginx
# TLS termination + re-encryption to backends
server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/certs/api.example.com.crt;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

    location / {
        proxy_pass              https://backend;  # re-encrypt to backends
        proxy_ssl_verify        on;
        proxy_ssl_trusted_certificate /etc/ssl/certs/internal-ca.crt;
        proxy_ssl_session_reuse on;              # reuse TLS sessions for perf
    }
}

Health Checks — Advanced Configuration

Basic health checks send a TCP SYN or an HTTP GET to /health. In production, the gap between "the process is alive" and "the process can serve production traffic" is where most failures hide. Advanced health check design closes this gap.

Health Check Endpoint Design

A well-designed /health endpoint should check all critical dependencies — database connectivity, cache availability, downstream service reachability — and return structured JSON with per-dependency status. This lets a load balancer distinguish between "server completely dead" (TCP failure) and "server alive but database is down" (HTTP 503), and route accordingly.

json
// GET /health/readiness — 200 OK if ready, 503 if not
{
  "status": "degraded",
  "checks": {
    "database":  { "status": "up",   "latency_ms": 3   },
    "cache":     { "status": "up",   "latency_ms": 0.5 },
    "payment_gw":{ "status": "down", "latency_ms": 5000 }
  }
}
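
One way such an endpoint could assemble its response is sketched below. It assumes each dependency is represented by a zero-argument callable that raises on failure; the `readiness` function, the check names, and the simulated timeout are all illustrative, and the HTTP framework that would serve the result is left out.

```python
import json

def readiness(checks):
    """Run each dependency check and return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = {"status": "up"}
        except Exception as exc:
            results[name] = {"status": "down", "error": str(exc)}
    all_up = all(r["status"] == "up" for r in results.values())
    body = {"status": "ok" if all_up else "degraded", "checks": results}
    return (200 if all_up else 503), json.dumps(body)

def failing_gateway():
    # Stand-in for a dependency probe that times out.
    raise TimeoutError("no response in 5000 ms")

code, body = readiness({"database": lambda: None, "payment_gw": failing_gateway})
```

Here one failing dependency turns the whole response into a 503, which is what the load balancer keys on. Whether a given dependency should be allowed to fail the readiness check is a per-service design decision.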

HAProxy Health Check Configuration

haproxy
backend api_servers
    balance leastconn
    option httpchk GET /health/readiness
    http-check expect status 200
    default-server inter 5s fall 3 rise 2   # check every 5s; 3 failures = down, 2 successes = up
    server api1 10.0.0.1:8080 check weight 10
    server api2 10.0.0.2:8080 check weight 10
    server api3 10.0.0.3:8080 check weight 5   # smaller instance — half the weight
    server api4 10.0.0.4:8080 check backup      # only activated if all primary servers fail
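
The fall/rise thresholds behave like a small state machine. The sketch below is a simplified model of that logic, not HAProxy's actual implementation; the `HealthState` class is hypothetical and uses the same `fall 3 rise 2` values as the configuration above.

```python
class HealthState:
    """Track one backend's health using consecutive-result thresholds."""

    def __init__(self, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self.streak = 0   # consecutive results that contradict the current state

    def record(self, check_passed):
        if check_passed == self.healthy:
            self.streak = 0          # result agrees with current state
        else:
            self.streak += 1
            threshold = self.rise if check_passed else self.fall
            if self.streak >= threshold:
                self.healthy = check_passed
                self.streak = 0
        return self.healthy

state = HealthState()
probes = [False, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
```

Two failures followed by a success do not remove the server, because the failures must be consecutive. Three failures in a row do, and two consecutive successes bring it back.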

DNS Load Balancing and Global Traffic Management

DNS-based load balancing happens at name-resolution time, before the client opens any connection: the DNS server returns different A records to different clients, distributing them across server pools or geographic regions. It is the foundation of global traffic management and multi-region deployments.

How DNS LB Works

The authoritative DNS server for your domain returns different IP addresses based on a policy — simple round-robin across IPs, or geographic routing based on the resolver's IP. Clients cache the response for the TTL duration (often 30–300 seconds), then re-resolve. This creates a natural load distribution at the DNS layer, before any TCP connection is made.

dns
; Simple DNS round-robin: the server rotates the order of these records between queries
api.example.com.  30  IN  A  54.1.2.3    ; US-East LB VIP
api.example.com.  30  IN  A  52.4.5.6    ; US-West LB VIP
api.example.com.  30  IN  A  13.7.8.9    ; EU-West LB VIP

; TTL = 30s: clients cache the answer for 30s, then re-resolve and may receive a different IP
; Low TTL enables fast failover but increases DNS query load

Limitations of DNS LB

GeoDNS and Latency-Based Routing

Services like AWS Route 53, Cloudflare, and Akamai extend DNS with geographic intelligence: they identify the resolver's location (typically a proxy for the user's location) and return the IP of the nearest healthy endpoint. A user in Europe resolves api.example.com to a Frankfurt IP; a user in Singapore resolves to a Singapore IP. Combined with health checks that remove unhealthy endpoints from rotation, GeoDNS provides region-level failover in addition to latency optimization.
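
The "nearest healthy endpoint" decision can be sketched as a preference list per resolver region. Everything here is an assumption for illustration: the region codes, the documentation-range IPs, the fallback order, and the `resolve` helper. Real providers derive proximity from latency measurements or IP geolocation databases.

```python
ENDPOINTS = {   # hypothetical regions mapped to documentation-range IPs
    "eu": "198.51.100.10",
    "ap": "198.51.100.20",
    "us": "198.51.100.30",
}
PREFERENCE = {  # assumed nearest-first fallback order per resolver region
    "eu": ["eu", "us", "ap"],
    "ap": ["ap", "us", "eu"],
    "us": ["us", "eu", "ap"],
}

def resolve(resolver_region, healthy_regions):
    """Return the IP of the closest healthy region, or None if all are down."""
    for region in PREFERENCE[resolver_region]:
        if region in healthy_regions:
            return ENDPOINTS[region]
    return None

normal = resolve("eu", {"eu", "ap", "us"})
failover = resolve("eu", {"ap", "us"})     # EU endpoint marked unhealthy
```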

Anycast and Global Load Balancing

Anycast is a routing technique where the same IP address is announced from multiple geographic locations simultaneously via BGP. The Internet's routing protocol naturally delivers packets to the topologically closest origin — traffic from Europe reaches a European PoP, traffic from Asia reaches an Asian PoP, all to the same destination IP address. This is how CDNs and DDoS scrubbing services work: a single IP address appears at hundreds of edge locations.

Anycast differs from DNS LB in an important way: the routing decision is made at the network layer (BGP), not the application layer. There is no TTL caching problem, because every router along the path forwards packets along its own best BGP route toward the nearest PoP. When a PoP fails and its announcement is withdrawn, BGP typically re-converges within seconds, so failover is usually faster and more reliable than waiting for DNS TTLs to expire.

Cloudflare example

Cloudflare operates the same IP address (e.g., 1.1.1.1 for their DNS resolver, or their anycast network for CDN) from 300+ data centers worldwide. A user in Tokyo connects to a Tokyo edge node; a user in São Paulo connects to a São Paulo edge node — all to the same IP. This is Anycast at scale: global load distribution and fault tolerance with a single IP address.

Session Persistence Deep Dive

Sticky sessions come in several implementation variants with meaningfully different tradeoffs. Understanding them matters for both production operations and interview discussions about stateful applications.

Cookie-Based Affinity

The load balancer sets a cookie (e.g., SERVERID=api2) on the client's first request. On subsequent requests, the LB reads the cookie and routes to the designated backend. This is the most common and flexible approach — it works across IP changes (mobile clients switching networks), doesn't expose backend IPs, and the cookie TTL controls how long affinity persists.

haproxy
backend web_servers
    balance roundrobin
    cookie SERVERID insert indirect nocache   # LB-managed sticky cookie
    server web1 10.0.0.1:80 check cookie web1
    server web2 10.0.0.2:80 check cookie web2
    server web3 10.0.0.3:80 check cookie web3
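
The cookie logic itself is small. Below is a minimal sketch, assuming requests are represented by a plain dict of cookies. The `CookieAffinityBalancer` class is hypothetical and ignores health checks, which a real balancer would consult before honoring the cookie.

```python
from itertools import cycle

class CookieAffinityBalancer:
    COOKIE = "SERVERID"

    def __init__(self, servers):
        self.servers = set(servers)
        self._rr = cycle(servers)      # round robin for clients without a valid cookie

    def route(self, cookies):
        """Return (backend, cookies_to_set) for one request."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.servers:
            return pinned, {}          # honor existing affinity
        backend = next(self._rr)
        return backend, {self.COOKIE: backend}

lb = CookieAffinityBalancer(["web1", "web2", "web3"])
backend, set_cookies = lb.route({})        # first request: no cookie yet
again, _ = lb.route(set_cookies)           # follow-up request carries the cookie
```

A cookie naming a server that is no longer in the pool is simply ignored and replaced, which is how affinity recovers after a backend is removed.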

IP Hash Affinity

Hash the client IP to select a backend deterministically. No cookie overhead, but mobile clients roaming between networks get reassigned, and corporate clients (many users behind one NAT IP) all land on the same backend. Adding or removing a server changes the hash result for a fraction of clients proportional to 1/n (assuming consistent hashing) or almost all clients (naive modulo).

The Right Fix: Externalize Session State

Both approaches are workarounds for a deeper problem: session state stored in the application server's memory. The production-grade solution is to move session state to a shared, fast store like Redis or Memcached. Any backend can then handle any request because the session is looked up from the external store, not local memory. This enables truly elastic horizontal scaling: add or remove servers without any client-visible disruption.
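
To make the idea concrete, here is a sketch in which two backends share one session store. The `SessionStore` class is an in-memory stand-in for something like Redis (an assumption for the example, so that it runs without a server), and the `Backend` class and cart data are invented.

```python
class SessionStore:
    """In-memory stand-in for a shared external store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session

class Backend:
    def __init__(self, name, store):
        self.name, self.store = name, store   # no session state kept locally

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart

store = SessionStore()
a, b = Backend("a", store), Backend("b", store)
a.add_to_cart("s1", "book")          # first request lands on backend a
cart = b.add_to_cart("s1", "pen")    # next request lands on backend b
```

Backend b sees the item added through backend a, so the load balancer is free to send each request anywhere.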

High Availability and Failover Patterns

A load balancer that itself has a single point of failure has solved the wrong problem. Production LB setups use one of two high-availability patterns.

Active-Passive with VIP

Two load balancer nodes share a Virtual IP (VIP) address. Under normal operation, only the active node claims the VIP and handles all traffic. The passive node monitors the active via heartbeat (VRRP, as implemented by Keepalived). If the active fails to send a heartbeat within a configurable window, the passive claims the VIP via gratuitous ARP and takes over. Failover typically completes within a few seconds; with VRRP the detection window is roughly three advertisement intervals. AWS Elastic Load Balancer, Google Cloud LB, and Azure Load Balancer manage this internally, so you never see the VIP mechanics.

bash
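# keepalived.conf (VRRP active-passive LB pair)
vrrp_script chk_haproxy {
    script "killall -0 haproxy"   # exits non-zero when no haproxy process exists; adjust for your system
    interval 2                    # run the check every 2s
    weight -20                    # on failure, drop priority below the BACKUP node's 100
}

vrrp_instance VI_1 {
    state MASTER            # set BACKUP on the passive node
    interface eth0
    virtual_router_id 51
    priority 110           # higher priority = preferred master; BACKUP = 100
    advert_int 1           # VRRP heartbeat every 1s
    authentication {
        auth_type PASS
        auth_pass secret123
    }
    virtual_ipaddress {
        203.0.113.10        # the VIP both nodes compete for
    }
    track_script {
        chk_haproxy         # demote if haproxy process dies
    }
}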
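
How long the backup waits before taking over follows from the advertisement interval and its own priority. The sketch below applies the VRRPv2 master-down formula from RFC 3768 to the values used in this configuration. The function name is made up, and the result covers detection only, not the time for the gratuitous ARP to propagate.

```python
def master_down_interval(advert_int, backup_priority):
    """Seconds a backup waits without advertisements before claiming the VIP.

    VRRPv2 (RFC 3768): 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256 seconds.
    """
    skew_time = (256 - backup_priority) / 256
    return 3 * advert_int + skew_time

detection_s = master_down_interval(advert_int=1, backup_priority=100)
```

With a 1 s advertisement interval and a backup priority of 100, detection takes about 3.6 s. Shorter failover requires a shorter advertisement interval.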

Active-Active

Both LB nodes handle traffic simultaneously, typically via DNS round-robin between two VIPs or via an upstream Anycast address. Total throughput is up to double that of a single node. On failure, the surviving node absorbs all traffic, so each node must be sized to carry the full load on its own. The tradeoff: slightly more complex configuration and state synchronization (connection tables must be replicated if TCP sessions are to survive failover seamlessly).

Algorithm Comparison and Selection Guide

| Algorithm | Best for | Weakness | Complexity |
| --- | --- | --- | --- |
| Round Robin | Uniform request cost, identical servers | Uneven if requests vary in duration or server capacity differs | O(1) |
| Weighted Round Robin | Heterogeneous server capacities | Still ignores current server load; static weights need manual tuning | O(1) |
| Least Connections | Variable request duration (long-polling, streaming) | All new connections sent to an idle server simultaneously (thundering herd on recovery) | O(log n) |
| IP Hash | Session affinity without cookie overhead | Reshuffles on server add/remove; NAT clients create hot spots | O(1) |
| Consistent Hash | Session affinity + cache locality at scale | Slightly uneven distribution without vnodes; more implementation complexity | O(log n) |
| Random | Very large homogeneous pools with stateless requests | No guarantees for small pools; no awareness of server load | O(1) |
| Least Response Time | Latency-sensitive APIs; weighted by actual response speed | Requires active measurement overhead; can amplify thundering herds | O(log n) |

Load Balancing in Modern Architectures

Service Mesh

In microservices architectures, a service mesh (Istio, Linkerd, Consul Connect) moves load balancing logic from a central LB appliance into a sidecar proxy (typically Envoy) running alongside every application Pod. Each sidecar intercepts all inbound and outbound traffic, applies load balancing, retries, circuit breaking, and mTLS — without the application code knowing. The control plane (Istio's istiod, Linkerd's control plane) distributes configuration updates to all sidecars.

The advantage: L7 load balancing with per-request routing decisions at every service-to-service call, not just at the edge. The cost: every Pod now has a sidecar consuming ~50–100 MB RAM and adding ~1–2 ms latency per hop.

AWS Application Load Balancer — Key Features

Common Pitfalls and Best Practices

takeaway

Choose Layer 7 for HTTP workloads that need path routing, TLS termination, or header inspection. Choose Layer 4 for raw TCP throughput, non-HTTP protocols, or extreme performance via Direct Server Return. Pick your algorithm based on request homogeneity: round-robin for uniform workloads, least-connections for variable ones, consistent hash for cache locality and session affinity. Always run at least two load balancers in active-passive or active-active configuration to avoid introducing the very SPOF you set out to eliminate.

🎯 interview hot-takes

Layer 4 vs Layer 7 — when do you pick each? Layer 4 for raw TCP throughput, non-HTTP protocols, or when you need Direct Server Return to keep return traffic off the LB; Layer 7 when you need path-based routing, TLS termination, gRPC support, or header-based decisions.
How does a load balancer detect a failed server? Active health checks (synthetic probes to /health on a schedule) or passive health checks (infer from 5xx error rates on real traffic). Active checks detect pre-traffic failures; passive checks detect in-flight failures.
Why are sticky sessions a problem? They reduce load distribution, complicate scaling, and mean all session data is lost if that server goes down — the fix is externalizing session state to Redis so any server can handle any request.
What is consistent hashing and when is it used? Maps servers and requests to a ring; adding/removing a server affects only ~1/n of keys instead of reshuffling everything. Used by distributed caches, CDNs, Dynamo-style datastores, and database sharding routers.
How does Anycast work? The same IP is announced via BGP from multiple geographic PoPs; the Internet's routing naturally delivers packets to the topologically nearest PoP. There is no per-region DNS answer and no TTL caching to wait out, and failover happens within seconds once BGP re-converges after a PoP goes down.
Why can't an L4 LB balance gRPC traffic? gRPC multiplexes many calls over a single HTTP/2 connection; an L4 LB routes the entire connection to one backend. Only an L7 proxy that understands HTTP/2 framing can distribute individual calls across backends.
