Apache Kafka — Pub/Sub, Partitions & Exactly-Once

Q: What triggers a consumer group rebalance?

Member joins/leaves, session timeout exceeded (session.timeout.ms), max poll interval breached (max.poll.interval.ms), or partition count changes. Use CooperativeStickyAssignor to make rebalances incremental instead of stop-the-world.

Kafka is an open-source distributed event streaming platform adopted by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. At its core it is a publish/subscribe system: producers write messages to topics, and consumers read from the topics they care about. It extends far beyond traditional message queues like ActiveMQ or RabbitMQ.

⚡ Quick Takeaways

Log, not a queue — messages persist after being read; each consumer tracks its own offset.
Partitions give horizontal scale and per-key ordering.
Consumer groups give parallel reads — one partition per member.
acks + ISR + idempotence tune durability all the way up to exactly-once.
Fast because of sequential writes, page cache, and zero-copy.

tldr

Kafka is a durable, partitioned, replicated commit log. Producers append records to partitions, consumers pull at their own pace, and replication + acks give you tunable durability — from fire-and-forget to exactly-once.

Major Use Cases

Peak traffic buffering — absorb bursts during high-demand windows (lunch rush for food delivery, flash sales for e-commerce) so downstream services aren't overwhelmed.
Service decoupling — separate data sources from consumers; Kafka MirrorMaker moves data across data lakes and availability zones.
Asynchronous communication — offload non-essential work off the request path to improve user-facing latency.

Pub/Sub Model

Different categories of messages flow into distinct topics from various producers. Unlike traditional queues, messages persist after consumption — Kafka manages the lifecycle independently of any single consumer, so multiple consumers can read the same topic at different offsets.

Producer Architecture

The producer calls .send() with a ProducerRecord (a message wrapper). The flow:

A serializer converts the message to a byte stream (custom serializers, Protobuf, or Avro via the Schema Registry).
The partitioner decides the destination partition, enabling distributed storage and parallel processing across the cluster.
Records queue in the RecordAccumulator (in-memory) before being sent. Two knobs control timing: BATCH_SIZE_CONFIG (queue threshold) and LINGER_MS_CONFIG (max wait).
The Sender creates a network client and awaits the acks confirmation.

Consumer Model

Kafka uses a pull model — consumers request data on their own schedule, giving more decoupling than push. Many independent consumers can read the same topic, but consumers within a single consumer group cannot share a partition: the group behaves as one logical subscriber, and each partition is owned by exactly one member.

Partitions & Replicas

Parallelism comes from splitting a topic into partitions spread across servers. The DefaultPartitioner uses hash(key) % num_partitions; custom implementations override partition().

Each partition is an append-only log segmented at 1 GB, with a sparse index record every 4 KB for fast lookups. Retention (log.retention.hours/minutes/ms, default 7 days) deletes or compacts old data based on log.cleanup.policy. Replication guards against single-point failures via a leader/follower scheme.

System Reliability

acks

This producer-side setting controls how many brokers must receive a record before the write is considered successful — the core durability/throughput tradeoff:

Setting	Behavior
acks=0	Success the moment the record is sent — no broker response awaited. Fastest, least durable.
acks=1	Success once the leader receives the record.
acks=all (-1)	Success only after all in-sync replicas (ISR) acknowledge. If ISRs drop offline, streaming blocks. `min.insync.replicas=n` guarantees a minimum number of copies.

reliable transmission

For durable delivery: acks=-1, num_partitions > 1, and min.insync.replicas > 1.

Retries & Delivery Semantics

RETRIES_CONFIG defaults to MAX_INT for Kafka ≥ 2.1. Retries risk duplication unless managed:

At-Least-Once — acks=-1; messages never lost but may duplicate.
At-Most-Once — acks=0; no duplicates but messages may be lost.
Exactly-Once — At-Least-Once + idempotence. Each producer gets a unique Producer Id (PID) with a monotonically increasing sequence number per message. Brokers track the largest <PID, sequence> per partition and reject duplicates (enable.idempotence=true). The TransactionManager adds database-like beginTransaction()/commitTransaction() wrapping .send().

Replicas

Multiple followers sync from the leader. Producers only write to the leader; followers fetch from it. The ISR lists in-sync followers; a follower lagging beyond replica.lag.time.max.ms moves to the OSR. ISR + OSR = Assigned Replicas (AR). On leader failure, a new leader is elected from the ISR. Each replica tracks its LEO (Log End Offset) to measure sync progress.

High-Speed Read/Write

Cluster computing — parallel processing for both producers and consumers.
Sparse indexing — .index files accelerate reads.
Append-only logs — sequential writes only, far faster than random access.
Page Cache — the OS caches file metadata/indices in RAM while data lives on disk; Kafka leans on this layered approach instead of a custom cache.
Zero-copy — data transfers directly from the read buffer to the socket buffer, skipping user/kernel context switches. (This advantage disappears under SSL, since brokers must decrypt then re-encrypt.)

ZooKeeper

A coordination service for distributed systems. With Kafka it tracks cluster node status and topic/message metadata. Five primary functions:

Controller election — picks the first AR broker also in the ISR (/kafka/controller).
Cluster membership — lists alive brokers (/kafka/brokers/ids).
Topic configuration — stores full topic details (/kafka/brokers/topics/{topic}/partitions/{id}/state).
Access Control Lists.
Quotas — monitors per-client read/write allowances.

CLI Examples

Create a topic with 2 partitions and replication factor 3:

bash

# test.config → bootstrap.servers=localhost:9092
kafka-topics \
  --bootstrap-server localhost:9092 \
  --command-config ./test.config \
  --topic test1 \
  --create \
  --replication-factor 3 \
  --partitions 2

Produce keyed messages, then consume from the beginning:

bash

# producer
kafka-console-producer \
  --topic test1 \
  --broker-list localhost:9092 \
  --property parse.key=true \
  --property key.separator=:
# > key:{"val":0}

# consumer
kafka-console-consumer \
  --topic test1 \
  --bootstrap-server localhost:9092 \
  --property print.key=true \
  --from-beginning

Consumer Groups and Rebalancing

A consumer group is the unit of parallel consumption in Kafka. Every consumer instance in a group is assigned a disjoint subset of partitions, and the group together consumes the full topic. Add a sixth consumer to a five-partition topic and that consumer will be idle — there are no spare partitions to assign. The constraint is deliberately simple: it means each partition has exactly one authoritative offset per group, and progress is never ambiguous.

The Rebalance Protocol

A rebalance is triggered whenever group membership changes: a new consumer joins, an existing one crashes or exceeds session.timeout.ms, a new partition is added to the topic, or an application calls unsubscribe(). During a rebalance, the Group Coordinator (a broker elected for each consumer group) stops all consumption, revokes partition assignments, then redistributes them. The cost is a pause — sometimes several seconds — during which no messages are processed. Minimizing rebalance frequency and duration is therefore a major operational concern.

properties

# Tune these to reduce unnecessary rebalances
session.timeout.ms=45000        # how long before a silent consumer is declared dead
heartbeat.interval.ms=15000     # must be < session.timeout.ms / 3
max.poll.interval.ms=300000    # max time between poll() calls before consumer is dropped
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Eager vs. Cooperative Rebalancing

The classic eager rebalance (used by RangeAssignor and RoundRobinAssignor) revokes all partition assignments before any redistribution begins — a global stop-the-world pause. Kafka 2.4+ introduced the cooperative sticky rebalance protocol (CooperativeStickyAssignor): consumers only revoke partitions that need to move, while keeping the ones they retain. This typically eliminates the pause entirely for consumers not involved in the reassignment, and keeps partition assignments "sticky" to minimize unnecessary state migration.

pitfall

max.poll.interval.ms is the silent killer. If your poll() loop takes longer than this (e.g., a slow database write downstream of message processing), the broker assumes the consumer is dead and triggers a rebalance. The fix: either increase the timeout or, better, move slow work to a separate thread and keep the poll loop tight.

Static Group Membership

Kafka 2.3+ added group.instance.id, which gives a consumer a persistent identity across restarts. With static membership, a consumer that crashes and rejoins within session.timeout.ms reclaims its old partitions without a rebalance — an important optimization for container workloads where pods restart frequently during deployments.

Log Compaction

Retention in Kafka comes in two flavors. The default delete policy simply removes segments older than log.retention.hours (or exceeding a size threshold). Log compaction is fundamentally different: instead of discarding messages by time, it retains only the most recent message for each key, creating a compacted log that behaves like an eventually-consistent key-value store.

bash

# Create a compacted topic for a user-profile changelog
kafka-topics \
  --bootstrap-server localhost:9092 \
  --topic user-profiles \
  --create \
  --partitions 12 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.1 \
  --config segment.ms=86400000    # roll a new segment daily

How the Log Cleaner Works

The broker's background log cleaner threads scan the "dirty" (recently written) portion of each partition. They build an offset map in memory (key → latest offset), then copy only the records from the dirty portion that are the latest for their key into a new "clean" segment, discarding older duplicates. The result: the compacted log grows only with distinct keys, not with time.

A special tombstone message — a record with a non-null key and a null value — signals deletion. After compaction, even the tombstone is eventually removed (after delete.retention.ms), so consumers that read the compacted log see the key fully erased.

use case

Log compaction powers Kafka Streams' state stores and changelog topics. When a KTable is rebuilt after a crash, the application replays only the compacted log — not the full history — to reconstruct current state. Database CDC topics (Debezium) also use compaction so downstream consumers can bootstrap from the latest row image per primary key.

KRaft — Kafka Without ZooKeeper

ZooKeeper served Kafka for years as its cluster coordination layer: controller election, broker membership, topic metadata. But it introduced a separate operational dependency, a separate scaling concern, and a bottleneck during cluster startup. KRaft (Kafka Raft Metadata mode), introduced in Kafka 2.8 and production-ready from Kafka 3.3, replaces ZooKeeper entirely with Kafka's own built-in Raft consensus implementation.

Architecture Shift

Under KRaft, a subset of brokers are designated controllers. They run a dedicated metadata partition (topic __cluster_metadata) replicated via the Raft protocol. The active controller is the Raft leader. All cluster metadata — broker registrations, topic configurations, partition leadership — lives in this single log rather than spread across ZooKeeper znodes.

properties

# KRaft server.properties (combined broker+controller node)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/data

KRaft Benefits

Single operational footprint — one system to deploy, monitor, and upgrade instead of two.
Faster startup and leader election — metadata is already in Kafka's log; no ZooKeeper epoch handshake. Cluster restart goes from minutes to seconds in large clusters.
Higher partition counts — ZooKeeper struggled beyond ~200K partitions; KRaft targets millions. The metadata log scales with Kafka's own storage, not ZooKeeper's memory-resident znodes.
Simplified security — one set of TLS/SASL configurations instead of separate Kafka and ZooKeeper configs.

migration path

Kafka 3.5+ supports a migration mode that lets you convert an existing ZooKeeper-based cluster to KRaft without downtime: run both coordinators simultaneously during the transition, then decommission ZooKeeper once all metadata is in KRaft. Never skip this dual-write phase in production.

Performance Tuning Parameters

Kafka's defaults are conservative. Real-world tuning targets three dimensions: producer throughput, consumer lag, and durability guarantees. The following parameters are the levers that matter most.

Producer Tuning

properties

# Throughput-optimized producer
acks=all                   # durable writes to all ISR replicas
enable.idempotence=true    # deduplicate retries, exactly-once within producer session
linger.ms=5               # wait up to 5ms to batch records; 0 = send immediately
batch.size=65536          # 64 KB batch; larger = fewer round trips, more latency
compression.type=lz4     # lz4 for best CPU/ratio tradeoff; snappy also common
buffer.memory=67108864   # 64 MB in-memory buffer; increase for high-volume producers
max.in.flight.requests.per.connection=5  # >1 is fine with idempotence enabled

Consumer Tuning

properties

fetch.min.bytes=1024         # wait until 1 KB available; reduces fetch round trips
fetch.max.wait.ms=500        # max wait for fetch.min.bytes to be met
max.poll.records=500         # records per poll(); reduce if processing is slow
auto.offset.reset=earliest  # on first run: start from oldest; latest = tail only
enable.auto.commit=false    # manual commit for exactly-once processing semantics

Broker and Topic Tuning

properties

# broker-level
num.io.threads=8             # number of I/O threads; set to disk count
num.network.threads=3       # network request threads
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
log.flush.interval.messages=10000  # fsync every 10K msgs; leave to OS for best perf

# topic-level overrides via kafka-configs
kafka-configs --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config retention.ms=604800000,min.insync.replicas=2

Kafka vs. RabbitMQ vs. Apache Pulsar

Choosing the right messaging system is one of the highest-leverage architecture decisions you can make. Kafka, RabbitMQ, and Pulsar each occupy a different point in the design space.

Dimension	Kafka	RabbitMQ	Pulsar
Message model	Durable log — persists after consumption; consumers replay at any offset	Queue — message is deleted after acknowledgement	Log + queue unified: topics backed by a durable log, queues as a cursor abstraction
Throughput	Millions of msg/sec per cluster via sequential disk I/O	~50K–100K msg/sec per queue (memory-bound)	Comparable to Kafka via Apache BookKeeper storage
Latency	Low single-digit ms with tuned batching; higher with linger	Sub-millisecond p50 (no batching overhead)	5–15 ms typical; geo-replication adds more
Routing	Topic + partition key; no content-based routing	Rich routing via exchanges (direct, fanout, topic, headers)	Namespaces + topic hierarchy; no exchange routing
Replay	Native — seek to any offset within retention window	Not supported — consumed messages are gone	Native — backed by BookKeeper ledgers
Multi-tenancy	Namespaces (limited)	Virtual hosts	First-class: tenants → namespaces → topics with quotas
Geo-replication	MirrorMaker 2 (async)	Shovel / Federation plugins	Built-in, synchronous cross-datacenter replication
Ops complexity	Moderate (KRaft removes ZK dependency)	Low — excellent management UI, easy setup	High — Kafka + ZooKeeper + BookKeeper triad historically
Best for	Event streaming, CDC, audit logs, high-throughput pipelines	Task queues, RPC patterns, complex routing, low latency jobs	Multi-tenant SaaS, native geo-replication, tiered storage at petabyte scale

decision guide

Pick Kafka when you need replay, high throughput, CDC, or event sourcing. Pick RabbitMQ for complex routing logic, task queues, or sub-millisecond latency requirements. Pick Pulsar when you need built-in geo-replication across datacenters or true multi-tenancy with per-namespace quota isolation — at the cost of significantly higher operational complexity.

Kafka Streams and Kafka Connect

Kafka is not just a message broker — it is the foundation for an entire streaming processing ecosystem. Two components extend it furthest: Kafka Streams for stream processing inside your JVM application, and Kafka Connect for plug-and-play data pipeline connectors.

Kafka Streams

Kafka Streams is a lightweight client library (not a cluster) for building stream processing applications. Unlike Apache Flink or Spark Streaming, there is no separate processing cluster to manage — the library runs inside your application process, reads from input topics, and writes results to output topics. Fault tolerance is achieved by storing state (windowed aggregations, join tables) in RocksDB on disk, backed by a Kafka changelog topic for recovery after crashes.

java

StreamsBuilder builder = new StreamsBuilder();

// read from "orders" topic, group by userId, count over 5-minute windows
KStream<String, Order> orders = builder.stream("orders");
orders
  .groupByKey()
  .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
  .count()
  .toStream()
  .to("orders-per-user-5min");  // write results back to Kafka

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

Key Kafka Streams concepts:

KStream — an unbounded stream of records; each record is an independent event.
KTable — a stream viewed as a changelog; each key's value is its latest update (like a materialized view).
GlobalKTable — a KTable fully replicated to every instance (useful for broadcast joins without repartitioning).
Stream-Table Join — enrich each event with the current state of a lookup table; the join is local because both are co-partitioned by key.
Windowing — tumbling (non-overlapping), hopping (overlapping), session (activity-driven), and sliding windows for time-based aggregations.
Exactly-once — Kafka Streams supports end-to-end exactly-once semantics via transactions spanning both Kafka reads and writes.

Kafka Connect

Kafka Connect is a scalable, fault-tolerant framework for moving data between Kafka and external systems — databases, object stores, search engines, SaaS APIs — without writing custom code. Connectors come in two flavors: source connectors read from an external system and write to Kafka (e.g., Debezium PostgreSQL CDC, S3 source), and sink connectors read from Kafka and write to an external system (e.g., Elasticsearch sink, Snowflake sink, JDBC sink).

json

{
  "name": "postgres-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.prod.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.inventory",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "publication.name": "dbz_publication"
  }
}

Connect runs as a cluster of workers that distribute connector tasks across themselves. If a worker fails, its tasks are reassigned to surviving workers — exactly the same consumer group rebalance mechanism. This means Connect scales horizontally by adding workers and survives node failures without operator intervention.

Schema Registry and Avro

Raw bytes in Kafka topics become a maintenance nightmare as schemas evolve. The Confluent Schema Registry (or AWS Glue Schema Registry) stores Avro, Protobuf, or JSON Schema definitions and assigns each version a numeric ID. Producers serialize messages by embedding a 4-byte schema ID in the payload prefix; consumers look up the schema by ID to deserialize. This enables schema evolution with compatibility checks: the registry rejects a new schema version that is backward- or forward-incompatible with the registered versions, preventing silent data corruption across independently deployed services.

avro

{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example.orders",
  "fields": [
    { "name": "orderId",  "type": "string" },
    { "name": "userId",   "type": "string" },
    { "name": "totalCents", "type": "long" },
    { "name": "currency", "type": "string", "default": "USD" }
  ]
}

Common Pitfalls and Best Practices

Partition Count — Getting It Right

Partitions are Kafka's unit of parallelism and they cannot be decreased once created (only increased). Choosing too few caps your consumer throughput ceiling; choosing too many wastes memory and increases rebalance time. A rough starting point: aim for 1–3 MB/sec throughput per partition based on your broker disk bandwidth, and target roughly 1 partition per consumer instance you expect to run at peak. A common mistake is creating topics with a single partition to "keep things simple" — this makes every consumer in the group idle except one.

Offset Management

With enable.auto.commit=true, offsets are committed periodically regardless of whether the consumer actually processed the message. A crash between the commit and the downstream write means the message is permanently skipped. Use enable.auto.commit=false and commit offsets only after your processing is complete and durable. For exactly-once, wrap the downstream write and the offset commit in a Kafka transaction.

Consumer Lag Monitoring

Consumer lag — the gap between the latest produced offset and the consumer's committed offset — is the primary operational health metric for a Kafka consumer. A growing lag that doesn't recover signals either a slow consumer or a sudden production spike. Monitor it with Kafka's built-in consumer group describe, Confluent's Kafka Consumer Lag Exporter, or LinkedIn's Burrow.

bash

# Check consumer group lag across all partitions
kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --group order-processing-group \
  --describe

# Sample output
# GROUP                   TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processing-group  orders  0          1023440         1023441         1
# order-processing-group  orders  1          998772          999201          429  ← lag growing!

Avoiding Common Anti-Patterns

One topic per event type, not one topic for everything — mixing unrelated event types in one topic forces consumers to deserialize and filter messages they don't care about, and prevents independent retention policies.
Don't store large payloads in Kafka — keep messages under 1 MB; for larger payloads (images, reports), store in S3 and put only the reference URL in Kafka. The default message.max.bytes is 1 MB for a reason.
Don't rely on Kafka as a database — while log compaction approximates a KV store, Kafka has no random-access read path. For lookups, materialize the compacted log into Redis or a database.
Dead-letter topics for poison pills — a single malformed message that fails deserialization will block a partition forever if your consumer crashes on it. Route failed messages to a dedicated DLQ topic and alert on it.

Real-World Usage Scenarios

Understanding why organizations reach for Kafka — rather than a simpler queue — solidifies when to recommend it in an interview.

LinkedIn (Kafka's origin) — user activity tracking: every page view, click, and job application published to topics that feed real-time dashboards, recommendation models, and A/B test analysis. Replay is essential: a model retraining job can scan 30 days of raw events.
Uber — geolocation updates from millions of drivers flow through Kafka topics at millions of events per second. Kafka decouples the driver app from the dispatch, maps, and surge pricing services. Topics are retained for 7 days so any downstream service that falls behind can catch up.
Netflix — CDC streams from production databases flow via Debezium into Kafka, then into Elasticsearch for search and Iceberg tables in S3 for analytics. Kafka's retention window gives the analytics pipeline a recovery window if the Spark job fails.
E-commerce checkout pipeline — the canonical system design example: the order service publishes an OrderCreated event, and separate consumers handle inventory deduction, payment processing, notification dispatch, and recommendation signal recording — all independently scaled and fault-isolated.

takeaway

Think of Kafka as a replicated log, not a queue. Partitions give you horizontal scale and ordering-per-key; consumer groups give you parallel reads; and acks + ISR + idempotence let you dial durability from fire-and-forget all the way to exactly-once. KRaft removes ZooKeeper from the equation; log compaction turns topics into a durable KV changelog; Kafka Streams and Connect complete a full streaming ecosystem without external dependencies.

🎯 interview hot-takes

How does Kafka guarantee exactly-once? Idempotent producer (PID + sequence number prevents duplicates on retry) plus transactions that wrap beginTransaction()/commitTransaction() around produce + offset-commit atomically.
Why is Kafka so fast? Sequential append-only writes + OS page cache (avoids double buffering) + zero-copy sendfile syscall + batch compression.
What triggers a consumer group rebalance? Member joins/leaves, session timeout exceeded (session.timeout.ms), max poll interval breached (max.poll.interval.ms), or partition count changes. Use CooperativeStickyAssignor to make rebalances incremental instead of stop-the-world.
Log compaction vs. retention? Retention deletes old segments by time/size; compaction retains only the latest record per key — effectively a durable KV changelog. Null-value tombstones signal deletes.
Kafka vs. RabbitMQ? Kafka for high-throughput event streaming, replay, and CDC; RabbitMQ for complex routing patterns, task queues, and sub-millisecond latency requirements.
What is KRaft and why does it matter? Kafka's built-in Raft metadata layer that replaces ZooKeeper, removing the separate dependency, enabling millions of partitions, and cutting cluster startup from minutes to seconds.