DDIA Ch.10 — Batch Processing

Chapter 10 opens Part III, "Derived Data," which is about systems that take data from one place and transform it into another — indexes, caches, aggregates, models. It starts with batch processing: jobs that take a large, bounded input, crunch it, and produce an output, with no expectation of low latency. Kleppmann builds the ideas from the ground up — starting with humble Unix command-line tools, whose design philosophy turns out to be the conceptual ancestor of MapReduce and the modern dataflow engines that power large-scale analytics.

⚡ Quick Takeaways

Three system types — services (online, latency-bound), batch (offline, bounded input, throughput-bound), and stream (near-real-time, unbounded input). This chapter is batch.
The Unix philosophy scales up — do one thing well, compose with a uniform interface (text over pipes), expect to be plugged together. MapReduce is this idea across thousands of machines.
MapReduce = map + shuffle + reduce — mappers extract key-value pairs, the framework sorts and groups by key (the shuffle), reducers process each group. Jobs chain into workflows.
Joins are the hard part — reduce-side (sort-merge) joins are general; map-side (broadcast/partitioned hash) joins are faster when one input is small or co-partitioned. Hot keys cause skew.
Immutable inputs make fault tolerance trivial — jobs are deterministic with no side effects, so a failed task is just rerun.
Dataflow engines (Spark, Flink, Tez) beat plain MapReduce — they model the whole workflow as one job and avoid writing every intermediate step to disk.

tldr

Batch processing transforms a big bounded dataset into a derived output. The Unix philosophy — small composable tools with a uniform interface — is the template; MapReduce applies it across a cluster on a distributed filesystem, with map/shuffle/reduce and chained workflows. Joins and skew are the main challenges. Because inputs are immutable and jobs deterministic, fault tolerance is just "rerun the failed task." Dataflow engines improve on MapReduce by keeping intermediate data in memory and scheduling the whole pipeline together.

Batch Processing with Unix Tools

Kleppmann begins with a deceptively simple task: find the five most popular pages in a web server log. You can do it with a chain of standard Unix tools, and the example carries surprising depth.

top 5 URLs from a web log — a Unix pipeline

cat access.log |             # stream the log
  awk '{print $7}' |          # extract the URL field
  sort |                     # group identical URLs together
  uniq -c |                  # count each run of duplicates
  sort -rn |                 # order by count, descending
  head -n 5                  # take the top 5

The magic is in the Unix philosophy: each program does one thing well, programs are expected to be composed, and they share a uniform interface — a stream of text on stdin/stdout. That uniformity is exactly why sort can feed uniq without either knowing about the other. A subtle but important detail: sort handles datasets larger than memory by spilling sorted chunks to disk and merging them (an external merge sort) — the same trick distributed systems use at scale.

MapReduce and Distributed Filesystems

MapReduce is, in spirit, the Unix pipeline spread across thousands of machines. Instead of reading and writing local files, jobs read and write a distributed filesystem like HDFS — a shared-nothing pool of disks across many machines, with files split into blocks that are replicated for fault tolerance. A MapReduce job has two callbacks the programmer writes, and one step the framework provides:

Map — called once per input record; it extracts and emits key-value pairs (e.g. emit (url, 1) for each log line).
Shuffle — the framework sorts all emitted pairs by key and partitions them so that every value for a given key arrives at the same reducer. This sort-and-group step is the heart of MapReduce.
Reduce — called once per distinct key with all its values; it produces the output (e.g. sum the 1s to get a count).

A single map-reduce step is limited, so real jobs are workflows: chains of MapReduce jobs where the output of one becomes the input of the next, stitched together by tools like Oozie or Airflow.

Joins and Grouping

Bringing related records together — joins — is where most of the engineering lives:

Reduce-side (sort-merge) join. Both inputs are mapped to emit the join key, the shuffle brings records with the same key to the same reducer, and the reducer joins them. Fully general, but pays the cost of sorting and shuffling all the data.
Map-side joins. Avoid the shuffle when possible. A broadcast hash join loads a small input entirely into memory in every mapper. A partitioned hash join works when both inputs are already partitioned the same way on the join key, so each mapper only needs the matching partition.
Handling skew. A "hot" key (a celebrity in a social graph) sends a huge amount of data to one reducer, which becomes a straggler. Techniques like skewed/sharded joins split the hot key's load across several reducers.

MapReduce vs Distributed Databases

MapReduce embodies a different philosophy from MPP (massively parallel processing) analytic databases. MPP databases want data loaded into a schema first and run optimized SQL. MapReduce (and the Hadoop ecosystem) embraces schema-on-read: dump raw data into the distributed filesystem and figure out its structure when you read it, enabling diverse, evolving processing (SQL, ML, custom code) over the same raw data. This made Hadoop a flexible "data lake" at the cost of MPP's query optimization.

The Output and Fault Tolerance

Batch jobs typically produce bulk outputs: building a search index (how Google originally used MapReduce), training a recommendation/ML model, or constructing read-only key-value stores to be loaded into serving systems. The crucial design principle is that inputs are immutable and jobs have no side effects — a job only reads its input and writes a fresh output, never modifying the input in place. This gives fault tolerance almost for free: if a task fails (machine crash), the framework simply reruns it on another machine, because the deterministic computation on the same immutable input yields the same result. It also makes experimentation safe — a buggy job can be rerun without having corrupted anything.

Beyond MapReduce: Dataflow Engines

MapReduce's biggest inefficiency: it materializes intermediate state to disk (the distributed filesystem) between every map-reduce step. For a workflow of many steps, this means writing and re-reading (and re-replicating) gigabytes repeatedly, and each step can't start until the previous fully finishes. Dataflow engines — Spark, Tez, Flink — fix this by modeling the entire workflow as a single job of connected operators:

They avoid unnecessary materialization, pipelining data between operators and keeping it in memory where possible.
They schedule the whole graph together, so operators can overlap and the engine can optimize across step boundaries.
Sorting (the expensive shuffle) is done only where actually needed, not at every boundary.

Aspect	MapReduce	Dataflow engine (Spark/Flink)
Intermediate state	Written to disk every step	Pipelined / kept in memory
Job model	Independent map+reduce steps	One operator graph (DAG)
Latency	High (disk + step barriers)	Lower (overlap, memory)
Fault tolerance	Re-read materialized output	Recompute from lineage (Spark RDDs)
Iterative jobs	Painful (re-read each pass)	Efficient (cache in memory)

The trade-off is fault tolerance: with no materialized intermediate state to fall back on, dataflow engines must recompute lost data. Spark tracks the lineage of each dataset (the operations that produced it) so it can recompute just the lost partitions — provided the computation was deterministic. For graph and iterative algorithms (e.g. PageRank), specialized models like Pregel's "think like a vertex" approach pass messages between vertices across iterations far more efficiently than re-running MapReduce each pass.

takeaway

Batch processing's enduring lessons are about design discipline, not just throughput: immutable inputs and side-effect-free, deterministic jobs make fault tolerance, retries, and experimentation trivial. MapReduce proved the model at scale by borrowing the Unix philosophy; dataflow engines refined it by not touching the disk between every step. The same "derived data from immutable inputs" idea reappears, in continuous form, in stream processing.

🎯 interview hot-takes

What are the three stages of MapReduce? Map (emit key-value pairs), shuffle (framework sorts and groups by key), reduce (process each key's values). The shuffle is the expensive, essential middle.
Reduce-side vs map-side join? Reduce-side (sort-merge) is general but shuffles everything; map-side (broadcast or partitioned hash) skips the shuffle when one input is small or both are co-partitioned.
Why is batch fault tolerance easy? Inputs are immutable and jobs deterministic with no side effects, so a failed task is simply rerun and produces the same result.
Why are Spark/Flink faster than MapReduce? They model the workflow as one operator graph and pipeline data in memory instead of materializing every intermediate step to disk.