Last modified: January 24, 2026


Batch Processing

Batch processing is a method for handling large volumes of data by collecting records into a batch and processing them together, typically without immediate user interaction. It is useful when tasks can be processed independently and do not require real-time results, such as nightly analytics jobs, training machine learning models, or transforming data for further analysis. A well-known paradigm in this domain is MapReduce, which operates across distributed clusters to handle massive datasets in parallel.

Batch Processing Flow

+---------------+    +--------------+    +----------------------------+    +--------------+
|               |    |              |    |                            |    |              |
| Data Sources  +--->+ Data Storage +--->+ Batch Processing System    +--->+ Final Output |
|               |    |              |    |   +-----+  +-----+  +-----+|    |              |
+---------------+    +--------------+    |   | J1  |  | J2  |  | J3  ||    +--------------+
      |                      |           |   +-----+  +-----+  +-----+|           |
      |                      |           +----------------------------+           |
      |                      |                           |                        |
      |                      |                           |                        |
      |______________________|___________________________|________________________|
                          (Accumulation Over Time)
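
The flow above can be sketched as a minimal batch job in Python. The record format and the in-memory list standing in for data storage are illustrative assumptions, not part of any particular framework:

```python
from collections import defaultdict

def run_batch_job(records):
    """Process one accumulated batch of (category, amount) records,
    returning total sales per category."""
    totals = defaultdict(int)
    for category, amount in records:
        totals[category] += amount
    return dict(totals)

# Records that accumulated in storage since the last run (hypothetical data).
accumulated = [("Electronics", 250), ("Clothing", 50), ("Electronics", 300)]

# The job runs on a schedule (e.g. nightly) over the whole batch at once.
print(run_batch_job(accumulated))  # {'Electronics': 550, 'Clothing': 50}
```

The point of the batch model is visible here: nothing is computed per arriving record; the work happens once, over everything accumulated since the previous run.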

MapReduce

MapReduce is a two-phase process (Map and Reduce) that breaks down a dataset into smaller tasks, distributing them across multiple nodes. It automates data distribution, task coordination, and failure handling, making it valuable for large-scale batch processing.

MapReduce Flow

                   +-----------------+
                   |    Input Data   |
                   +-----------------+
                           |
                           v
+-------------------------+-------------------------+
|    Split into Chunks / Distribute to Mappers     |
+-------------------------+-------------------------+
          |                                 |
          v                                 v
+---------------------+             +---------------------+
|  Map Function       |             |  Map Function       |
|  (Process Chunks)   |             |  (Process Chunks)   |
+---------+-----------+             +---------+-----------+
          |                                   |
          v                                   v
+---------------------+             +---------------------+
|  Intermediate Data  |             |  Intermediate Data  |
|  (Key-Value Pairs)  |             |  (Key-Value Pairs)  |
+---------+-----------+             +---------+-----------+
          |                                   |
          v                                   v
+---------------------+             +---------------------+
|     Shuffle &       |             |     Shuffle &       |
|     Sort Phase      |             |     Sort Phase      |
|   (Group by Key)    |             |   (Group by Key)    |
+---------+-----------+             +---------+-----------+
          |                                   |
          v                                   v
+---------------------+             +---------------------+
| Reduce Function     |             | Reduce Function     |
| (Aggregate Results) |             | (Aggregate Results) |
+---------+-----------+             +---------+-----------+
          |                                   |
          +----------------+------------------+
                           |
                           v
                   +-----------------+
                   |  Final Output   |
                   | (Combined Data) |
                   +-----------------+

Example Dataset

Customer     Category      Amount
Customer A   Electronics   250
Customer B   Clothing      50
Customer C   Electronics   300
Customer D   Books         20
Customer E   Clothing      80

Step 1: Map Phase

In the Map Phase, each record is processed to output a key-value pair where the key is the product category and the value is the purchase amount. For example, the mapping function (in pseudo-code) could look like this:

map(record):
    key = record.Category
    value = record.Amount
    emit(key, value)
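
The pseudo-code above maps directly onto runnable Python; the tuple record format below is an assumption made for illustration:

```python
def map_record(record):
    """Emit a (category, amount) key-value pair for one sales record."""
    customer, category, amount = record
    return (category, amount)

dataset = [
    ("Customer A", "Electronics", 250),
    ("Customer B", "Clothing", 50),
    ("Customer C", "Electronics", 300),
    ("Customer D", "Books", 20),
    ("Customer E", "Clothing", 80),
]

# Apply the map function independently to every record.
pairs = [map_record(r) for r in dataset]
print(pairs)
```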

For our dataset, the mapper will produce the following key-value pairs:

(Electronics, 250)
(Clothing, 50)
(Electronics, 300)
(Books, 20)
(Clothing, 80)

Step 2: Shuffle and Sort Phase

After mapping, the MapReduce framework automatically shuffles and sorts the key-value pairs so that pairs with the same key are grouped together. The grouping results in:

Electronics: [250, 300]
Clothing: [50, 80]
Books: [20]
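
The framework performs this grouping automatically, but it can be reproduced in Python with a dictionary of lists to make the step concrete:

```python
from collections import defaultdict

# Key-value pairs as emitted by the map phase.
pairs = [("Electronics", 250), ("Clothing", 50), ("Electronics", 300),
         ("Books", 20), ("Clothing", 80)]

# Group values by key, as the shuffle-and-sort phase would.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

print(dict(groups))
# {'Electronics': [250, 300], 'Clothing': [50, 80], 'Books': [20]}
```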

Step 3: Reduce Phase

In the Reduce Phase, the reducer takes each key and its list of values to perform an aggregation operation (in this example, summing the amounts). The pseudo-code for the reducer might be:

reduce(key, values):
    total = 0
    for value in values:
        total += value
    emit(key, total)
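
A runnable Python version of this reducer, applied to the grouped data from the previous step:

```python
def reduce_group(key, values):
    """Sum the amounts for one category, as in the pseudo-code above."""
    total = 0
    for value in values:
        total += value
    return (key, total)

grouped = {"Electronics": [250, 300], "Clothing": [50, 80], "Books": [20]}

# Each reducer call sees one key and all of its values.
results = [reduce_group(key, values) for key, values in grouped.items()]
print(results)  # [('Electronics', 550), ('Clothing', 130), ('Books', 20)]
```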

Applying this to the grouped data:

Electronics: 250 + 300 = 550
Clothing: 50 + 80 = 130
Books: 20

Final Output

The final output is a set of key-result pairs:

Electronics -> 550
Clothing -> 130
Books -> 20

This step-by-step example demonstrates how MapReduce can efficiently process and summarize data by breaking the task into mapping, shuffling, and reducing steps.

Joins in MapReduce

Joins combine records from two datasets or tables on a shared key. In a MapReduce context there are several strategies, including map-side joins, where a small dataset is broadcast to every mapper, and reduce-side (sort-merge) joins, where both datasets are shuffled by the join key and merged in the reducer:

Sort-Merge Join (Conceptual)

  Dataset A          Dataset B
  (Mapped by Key)    (Mapped by Key)
         \                 / 
          \               /
           \    Shuffle  /
            \          /
             \        /
        (Reduce: Merge on Key)
                 |
               Output
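
A minimal reduce-side (sort-merge) join can be sketched in Python: the map phase tags each record with its source, the shuffle groups records by key, and the reduce phase pairs up records from both sides. The datasets here are hypothetical:

```python
from collections import defaultdict

# Two datasets keyed by customer ID (hypothetical contents).
orders = [("C1", "Electronics"), ("C2", "Clothing")]
names = [("C1", "Alice"), ("C2", "Bob")]

# Map phase: tag each record with its source table.
tagged = [(key, ("order", value)) for key, value in orders] + \
         [(key, ("name", value)) for key, value in names]

# Shuffle: group tagged records by the join key.
groups = defaultdict(list)
for key, tagged_value in tagged:
    groups[key].append(tagged_value)

# Reduce: merge records from both sides that share a key.
joined = []
for key, values in sorted(groups.items()):
    order_values = [v for tag, v in values if tag == "order"]
    name_values = [v for tag, v in values if tag == "name"]
    for name in name_values:
        for order in order_values:
            joined.append((key, name, order))

print(joined)  # [('C1', 'Alice', 'Electronics'), ('C2', 'Bob', 'Clothing')]
```

Tagging each value with its source is what lets a single reducer distinguish the two sides of the join after they have been shuffled together.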

When to Use Batch Processing

Batch processing fits workloads where data can accumulate before being processed and results are not needed immediately: scheduled reporting and analytics, pipelines that transform data for downstream analysis, periodic model training, and large historical backfills. It is a poor fit when users expect low-latency responses to individual events; those cases call for stream processing instead.

Alternatives to MapReduce

Although MapReduce has been popular for massive-scale processing, newer frameworks and models offer different trade-offs:

Apache Spark

Apache Spark keeps intermediate data in memory using resilient distributed datasets (RDDs), which makes iterative workloads such as machine learning much faster than disk-based MapReduce, and it layers higher-level APIs (DataFrames, Spark SQL) on top of the same engine.

Pregel

Pregel is Google's vertex-centric model for large-scale graph processing. Iterative graph algorithms such as PageRank require many passes over the same data and map poorly onto chains of MapReduce jobs; Apache Giraph is an open-source implementation of the Pregel model.

Hive and Pig

Hive and Pig provide higher-level interfaces over Hadoop: Hive exposes a SQL-like query language (HiveQL), while Pig uses a dataflow scripting language (Pig Latin). Both compile down to underlying batch jobs, so analysts can express transformations without writing Map and Reduce functions by hand.