
Designing Parallel Programs

Designing parallel programs involves breaking down computational tasks into smaller, concurrent units to be executed simultaneously. This approach leverages the power of multiple processors to enhance performance and efficiency. Key steps in this process include partitioning, communication, agglomeration, and mapping.

Partitioning

Partitioning is a fundamental step in computational problem-solving: a large problem is broken into smaller, more manageable tasks that can be executed concurrently, improving efficiency and performance.

      +-----------------------+
      |     Big Problem       |
      +-----------------------+
                 |
     +-----------+-----------+
     |           |           |
+---------+ +---------+ +---------+
| Part 1  | | Part 2  | | Part 3  |
+---------+ +---------+ +---------+

Goals

I. Balance the Workload Among Tasks

II. Minimize Dependencies Between Tasks

Types of Decomposition

I. Domain Decomposition

Domain decomposition splits your data space into chunks that different workers handle in parallel. The trick is to choose chunks so each worker mostly uses its own data, only syncing at boundaries.

Example: in a weather simulation, split the map into regions; each core updates temperature and wind in its own region, exchanging only border values with neighbors after each timestep.
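Below is a minimal sketch of the same idea in code, assuming OpenMP (compile with `-fopenmp`). The array and chunking logic are purely illustrative: each thread owns one contiguous chunk of the data, works only on its own chunk, and touches shared state only when the partial results are merged.

```c
/* Minimal domain-decomposition sketch (assumes OpenMP; compile with -fopenmp).
 * The 1-D array plays the role of the data space: each thread owns one
 * contiguous chunk, works only on its own data, and synchronizes only when
 * the partial results are combined. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;        /* fill the "domain" */

    double total = 0.0;
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        int chunk    = (N + nthreads - 1) / nthreads;  /* cells per worker */
        int lo = tid * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        double local = 0.0;               /* each worker touches only its chunk */
        for (int i = lo; i < hi; i++) local += data[i];

        #pragma omp atomic                /* the only shared update ("boundary") */
        total += local;
    }
    printf("total = %.1f (expected %d)\n", total, N);
    return 0;
}
```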

II. Functional Decomposition

Functional decomposition splits by kind of work, not by data. You build a pipeline of stages; each stage does one job well.

Example (web request pipeline): one worker parses the incoming request, another runs the application logic, and a third formats and sends the response; each stage does one kind of work. (Another quick one: a GPU pipeline for large matrix multiplication runs Load → Multiply → Accumulate → Store. Same idea: different steps, specialized workers.)
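A minimal sketch of functional decomposition with POSIX threads (compile with `-pthread`). The two stages and the hand-rolled buffer are illustrative only, not a production pipeline: stage 1 does the compute work, stage 2 does the output work, and they run concurrently, handing items across.

```c
/* Minimal functional-decomposition sketch with POSIX threads (compile with
 * -pthread). Stage 1 does one kind of work (compute), stage 2 another
 * (output); they run concurrently and hand items over through a small
 * shared buffer. */
#include <pthread.h>
#include <stdio.h>

#define N 8

static int results[N];
static int produced = 0;                 /* how many items stage 1 finished */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

static void *stage_compute(void *arg) {  /* stage 1: "do the math" */
    (void)arg;
    for (int i = 0; i < N; i++) {
        int value = (i + 1) * (i + 1);   /* the actual work of this stage */
        pthread_mutex_lock(&lock);
        results[i] = value;
        produced = i + 1;
        pthread_cond_signal(&ready);     /* tell the next stage an item is ready */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *stage_output(void *arg) {   /* stage 2: "deliver the result" */
    (void)arg;
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&lock);
        while (produced <= i)            /* wait until item i exists */
            pthread_cond_wait(&ready, &lock);
        int value = results[i];
        pthread_mutex_unlock(&lock);
        printf("item %d -> %d\n", i, value);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage_compute, NULL);
    pthread_create(&t2, NULL, stage_output, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```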

III. Cyclic Decomposition

Cyclic decomposition (group theory) takes a permutation—a rule that reorders items—and breaks it into disjoint cycles. A cycle tells you how items rotate among themselves; disjoint means cycles don’t share elements, so they’re independent “mini-loops.”

Where does the permutation come from? Anywhere you reorder labels or indices: sorting records by a key, shuffling data between workers, renumbering mesh nodes, or permuting matrix rows and columns.

Concrete example:

You have items labeled ${1,2,3,4}$. A “rotate left among the first three, keep 4” rule sends $1 \to 2$, $2 \to 3$, $3 \to 1$, and $4 \to 4$.

In two-line notation that’s

$$ \sigma = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 1 & 4 \end{pmatrix}. $$

How to decompose into cycles (step-by-step):

  1. Start with the smallest label not yet used: 1. Follow where it goes: $1 \to 2$, then $2 \to 3$, then $3 \to 1$. You’ve returned to 1, so close the cycle: $(1\,2\,3)$.
  2. Next unused label: 4. Follow it: $4 \to 4$. That’s a 1-cycle (a fixed point): $(4)$.

So the cyclic decomposition is:

$$ \sigma = (1\,2\,3)(4). $$

What this tells you: labels 1, 2, and 3 rotate among themselves, label 4 is a fixed point, and because the two cycles are disjoint they act independently of each other.

Mini table to visualize the mapping:

input:  1 2 3 4
output: 2 3 1 4

You can trace arrows $1\to2\to3\to1$ and $4\to4$ to “see” the cycles.
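If you want the computer to do the tracing, here is a small sketch that follows the arrows exactly as described above; the permutation is the $\sigma$ from the example, stored 1-based in an array.

```c
/* Sketch: decompose a permutation into disjoint cycles by "following the
 * arrows". sigma[i] is where label i goes (1-based; the values below encode
 * the sigma from the example). */
#include <stdio.h>
#include <stdbool.h>

int main(void) {
    int sigma[] = {0, 2, 3, 1, 4};   /* index 0 unused; 1->2, 2->3, 3->1, 4->4 */
    int n = 4;
    bool used[5] = {false};

    for (int start = 1; start <= n; start++) {
        if (used[start]) continue;   /* label already placed in some cycle */
        printf("(");
        int x = start;
        do {                         /* follow x until it returns to start */
            printf("%d", x);
            used[x] = true;
            x = sigma[x];
            if (x != start) printf(" ");
        } while (x != start);
        printf(")");
    }
    printf("\n");                    /* prints (1 2 3)(4) */
    return 0;
}
```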

IV. Block Decomposition

Block decomposition slices a matrix into submatrices (blocks) so you operate on them as units. This clarifies structure (e.g., block-diagonal) and speeds things up (cache-friendly, reuse kernels).

Example (numbers + why it helps):

Let

$$ A=\begin{pmatrix} 1 & 2 & 9 & 8\\ 0 & 3 & 7 & 6\\ 4 & 5 & 0 & 0\\ 4 & 5 & 0 & 0 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix} $$

$$ A_{11}=\begin{pmatrix} 1 & 2\\ 0 & 3 \end{pmatrix}, \quad A_{12}=\begin{pmatrix} 9 & 8\\ 7 & 6 \end{pmatrix}, \quad A_{21}=\begin{pmatrix} 4 & 5\\ 4 & 5 \end{pmatrix}, \quad A_{22}=\begin{pmatrix} 0 & 0\\ 0 & 0 \end{pmatrix} $$

Bottom line: with blocks, you trade one huge problem for a few tidy medium-sized ones, which are simpler to reason about and often faster to compute.
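To make the block view concrete, here is a minimal sketch of a blocked 4×4 matrix multiply with 2×2 blocks, so the inner kernel only ever touches small submatrices. A is the matrix from above and B is the identity (my choice for illustration), so the result should reproduce A.

```c
/* Block-decomposition sketch: multiply two 4x4 matrices using 2x2 blocks. */
#include <stdio.h>

#define N  4
#define BS 2                                     /* block size */

int main(void) {
    double A[N][N] = {{1, 2, 9, 8},
                      {0, 3, 7, 6},
                      {4, 5, 0, 0},
                      {4, 5, 0, 0}};
    double B[N][N] = {{0}}, C[N][N] = {{0}};
    for (int i = 0; i < N; i++) B[i][i] = 1.0;   /* identity */

    for (int bi = 0; bi < N; bi += BS)           /* which block of C */
        for (int bj = 0; bj < N; bj += BS)
            for (int bk = 0; bk < N; bk += BS)   /* block-level "dot product" */
                for (int i = bi; i < bi + BS; i++)     /* work inside blocks */
                    for (int j = bj; j < bj + BS; j++)
                        for (int k = bk; k < bk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%4.0f", C[i][j]);
        printf("\n");
    }
    return 0;
}
```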

Communication

Communication in parallel computing involves the exchange of data between tasks. This process is essential when tasks depend on each other's results to proceed. Efficient communication is crucial for maintaining performance and minimizing delays.

Goals

I. Minimize the Volume and Frequency of Communication

Reducing the amount of data exchanged and the number of communications can significantly decrease overhead and improve overall performance.

II. Optimize the Use of Network Resources

Efficient use of network bandwidth and minimizing latency ensures faster data transfer and better utilization of computational resources.

Types of Communication

I. Local Communication

Local communication refers to data exchange between tasks that reside on the same physical processor. This type of communication is generally faster due to lower latency and higher bandwidth within a single processor.

Example:

In a multi-core processor, tasks running on different cores can share data through shared memory.

+-----------+     +-----------+
| Core 1    | <-> | Core 2    |
| Task A    |     | Task B    |
+-----------+     +-----------+
      |               |
      +-------+-------+
              |
       Shared Memory
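A minimal sketch of this, assuming OpenMP: two threads on one multi-core processor communicate simply by reading and writing the same variable in shared memory, with a barrier standing in for whatever synchronization the real program needs.

```c
/* Minimal local-communication sketch (assumes OpenMP): Task A writes a value;
 * after a barrier, Task B reads the same address directly from shared memory,
 * with no message passing involved. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_value = 0;

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            shared_value = 42;           /* Task A (core 1) writes */

        #pragma omp barrier              /* synchronize the two tasks */

        if (omp_get_thread_num() == 1)   /* Task B (core 2) reads */
            printf("Task B read %d from shared memory\n", shared_value);
    }
    return 0;
}
```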

II. Global Communication

Global communication involves data exchange between tasks located on different physical processors. This type of communication often involves higher latency and lower bandwidth due to the physical distance and network constraints.

Example:

In a distributed computing system, tasks running on separate machines need to communicate over a network.

+-------------+        +-------------+
| Processor 1 | <----> | Processor 2 |
| Task A      |        | Task B      |
+-------------+        +-------------+
      |                     |
      +---------------------+
             Network

III. Point-to-Point Communication

Point-to-point communication is direct data exchange between pairs of tasks. It is the simplest form of communication, where one task (sender) sends data directly to another task (receiver).

+---------+     +----------+
| Sender  | --> | Receiver |
+---------+     +----------+
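A minimal MPI sketch of this pattern (run with `mpirun -n 2`): rank 0 plays the sender and rank 1 the receiver; the payload value is arbitrary.

```c
/* Point-to-point sketch with MPI: rank 0 sends one integer directly to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 123;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* to rank 1 */
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* from rank 0 */
        printf("receiver got %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```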

IV. Collective Communication

Collective communication involves multiple tasks and includes operations such as broadcast, scatter, and gather.

Scatter Communication

In scatter communication, a single process sends different pieces of data to multiple processes. Each process receives a unique piece of the data.

      +-----------------------+
      |         ABC           |
      +-----------------------+
                 |
     +-----------+-----------+
     |           |           |
+---------+ +---------+ +---------+
|    A    | |    B    | |    C    |
+---------+ +---------+ +---------+

Broadcast Communication

In broadcast communication, a single process sends the same piece of data to all other processes.

      +-----------------------+
      |         ABC           |
      +-----------------------+
                 |
     +-----------+-----------+
     |           |           |
+---------+ +---------+ +---------+
|   ABC   | |   ABC   | |   ABC   |
+---------+ +---------+ +---------+
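A minimal MPI sketch of both collectives shown above (run with `mpirun -n 3`): the root scatters one character of "ABC" to each rank, then broadcasts the whole string so every rank ends up holding "ABC".

```c
/* Collective-communication sketch with MPI: scatter then broadcast. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char all[4] = "ABC";   /* only the root's copy matters for the scatter */
    char mine   = '?';     /* each rank's unique piece */

    MPI_Scatter(all, 1, MPI_CHAR, &mine, 1, MPI_CHAR, 0, MPI_COMM_WORLD);
    printf("rank %d received piece %c\n", rank, mine);

    MPI_Bcast(all, 3, MPI_CHAR, 0, MPI_COMM_WORLD);   /* everyone gets "ABC" */
    printf("rank %d now holds %.3s\n", rank, all);

    MPI_Finalize();
    return 0;
}
```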

Communication Modes

I. Synchronous Blocking Communication

II. Asynchronous Nonblocking Communication (see the sketch below)
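A minimal MPI sketch of the asynchronous nonblocking mode (run with `mpirun -n 2`): the receives and sends are posted, unrelated work runs while the messages are in flight, and MPI_Waitall completes the exchange. The synchronous blocking mode would instead use MPI_Send/MPI_Recv, which do not return until the buffers are safe to reuse.

```c
/* Nonblocking-mode sketch with MPI: post Irecv/Isend, overlap work, then wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;                           /* the partner rank */

    int send_val = rank, recv_val = -1, busywork = 0;
    MPI_Request reqs[2];
    MPI_Irecv(&recv_val, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < 1000; i++) busywork += i;   /* overlapped "useful work" */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);      /* exchange is now complete */
    printf("rank %d received %d (did %d units of work meanwhile)\n",
           rank, recv_val, busywork);

    MPI_Finalize();
    return 0;
}
```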

Concepts in Communication

I. Overhead

II. Latency

III. Bandwidth

Agglomeration

Agglomeration is the process of combining smaller tasks and data partitions into larger tasks. This technique aims to reduce communication overhead and improve overall efficiency by grouping tasks that frequently interact and ensuring the aggregated tasks fit within the processor’s memory.

Goals

I. Improve Computational Granularity

II. Minimize Communication

Granularity

Granularity in parallel computing refers to the ratio of computation to communication in a task or set of tasks. It indicates the amount of work done between communication events.

$$ \text{Granularity} = \frac{\text{computation}}{\text{communication}} $$

Types of Parallelism

I. Fine-Grained Parallelism

II. Coarse-Grained Parallelism

Practical Considerations

I. Task Grouping

II. Memory Management

III. Optimization Balance

Mapping

Mapping is the process of assigning agglomerated tasks to specific processors in a parallel computing environment. Effective mapping is crucial for balancing the computational load and minimizing communication overhead.

Goals

I. Balance the Computational Load Across Processors

II. Minimize Communication Overhead

III. Minimize the Total Execution Time

Strategies

I. Static Mapping

II. Dynamic Mapping

III. Hierarchical Mapping

IV. Task Clustering

V. Graph Partitioning

Example Workflow

2-D Heat Diffusion (4096×4096, 5,000 steps, 5-point stencil, FP64)

What we’re doing

Simulate heat diffusion on a 4096×4096 grid for 5,000 steps using a 5-point stencil in double precision. Two row-major arrays (A[nx][ny], B[nx][ny]) “ping-pong” each step. We checkpoint to HDF5 per tile with datasets named temp_step_<t>.

Why these tools: MPI moves halo data between tiles on different nodes, OpenMP threads the stencil update inside each rank, and HDF5 handles the per-tile checkpoints.

What “done” looks like: the full 4096×4096 run finishes all 5,000 steps, writes its checkpoints on schedule, and agrees with the analytic solution used in the quick verification run.

I. Partitioning (make work even and predictable)

Goal: split the domain into equal, independent tasks.

How: uniform 8×8 domain decomposition → 64 blocks.

+---------------------------------------------------------------+
|                        4096 x 4096 Grid                       |
+---------------------------------------------------------------+
|   Split into 8 x 8 = 64 blocks                                |
|   Each block = 512 x 512 cells                                |
+---------------------------------------------------------------+

 Example partition view (8x8 blocks):

 +----+----+----+----+----+----+----+----+
 | T0 | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
 +----+----+----+----+----+----+----+----+
 | T8 | T9 |T10 |T11 |T12 |T13 |T14 |T15 |
 +----+----+----+----+----+----+----+----+
 |T16 |T17 |T18 |T19 |T20 |T21 |T22 |T23 |
 +----+----+----+----+----+----+----+----+
 | ... repeats until T63 ...             |
 +---------------------------------------+

Each block has predictable work, ID, and size (512x512).
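A small illustrative helper (not part of heat2d.c) that maps a block ID to the 512×512 region it owns in the 8×8 partition, showing how predictable each block's extent is:

```c
/* Illustrative helper: map a block ID T0..T63 to the 512x512 region it owns
 * in the uniform 8x8 partition of the 4096x4096 grid. */
#include <stdio.h>

#define GRID   4096
#define BLOCKS 8                        /* 8 x 8 = 64 blocks */
#define BS     (GRID / BLOCKS)          /* 512 cells per side */

int main(void) {
    for (int id = 0; id < BLOCKS * BLOCKS; id += 9) {   /* sample the diagonal */
        int brow = id / BLOCKS;         /* block row, 0..7 */
        int bcol = id % BLOCKS;         /* block column, 0..7 */
        int row0 = brow * BS;           /* first global row owned by this block */
        int col0 = bcol * BS;           /* first global column */
        printf("T%-2d owns rows %4d-%4d, cols %4d-%4d\n",
               id, row0, row0 + BS - 1, col0, col0 + BS - 1);
    }
    return 0;
}
```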

II. Communication (keep it correct and overlapped)

Goal: define and overlap all required data exchanges.

How: nonblocking point-to-point halos; rare collectives.

Example: One block (512x512)

      North halo
        vvvvv
   +---------------------+
   | NNNNNNNNNNNNNNNNNNN |
   |                     |
   |   Interior work     |<---> East halo
   |                     |
   | SSSSSSSSSSSSSSSSSSS |
   +---------------------+
        ^^^^^
      South halo

Legend:
- N = North halo (1 row)
- S = South halo (1 row)
- W/E = West/East halo (1 column each)

Each step:
- Send/recv 4 edges (north, south, east, west).
- Overlap: compute interior first, halos later.
- Message size per edge = 512 × 1 × 8 B = 4 KB (see the halo-exchange sketch below).
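A hypothetical sketch of one step for a single 512×512 block with a one-cell ghost ring (the real logic lives in heat2d.c, per the Runbook): post nonblocking receives and sends for the north/south edges, update the interior while the messages are in flight, then finish the rows that depend on the halos. East/west columns (which need packing), the physics constants, and the Cartesian neighbor setup are assumed to exist elsewhere, and the 5-point update is simplified. Build with `mpicc -fopenmp`.

```c
/* Hypothetical halo-exchange sketch for one 512x512 block with a ghost ring. */
#include <mpi.h>

#define BS 512

void step(double A[BS + 2][BS + 2], double B[BS + 2][BS + 2],
          int north, int south, MPI_Comm comm) {
    MPI_Request reqs[4];
    /* North/south edges are contiguous rows: 512 x 8 B = 4 KB per message. */
    MPI_Irecv(&A[0][1],      BS, MPI_DOUBLE, north, 0, comm, &reqs[0]);
    MPI_Irecv(&A[BS + 1][1], BS, MPI_DOUBLE, south, 1, comm, &reqs[1]);
    MPI_Isend(&A[1][1],      BS, MPI_DOUBLE, north, 1, comm, &reqs[2]);
    MPI_Isend(&A[BS][1],     BS, MPI_DOUBLE, south, 0, comm, &reqs[3]);

    /* Overlap: interior cells need no halo, so update them first. */
    #pragma omp parallel for
    for (int i = 2; i < BS; i++)
        for (int j = 2; j < BS; j++)
            B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);   /* halos have arrived */

    /* Finish the top and bottom rows, which depend on the received halos. */
    for (int j = 1; j <= BS; j++) {
        B[1][j]  = 0.25 * (A[0][j]    + A[2][j]    + A[1][j-1]  + A[1][j+1]);
        B[BS][j] = 0.25 * (A[BS-1][j] + A[BS+1][j] + A[BS][j-1] + A[BS][j+1]);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    static double A[BS + 2][BS + 2], B[BS + 2][BS + 2];
    /* With MPI_PROC_NULL neighbors the exchange is a no-op, so this runs as a
     * single-rank smoke test; heat2d.c would use real Cartesian neighbors. */
    step(A, B, MPI_PROC_NULL, MPI_PROC_NULL, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```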

III. Agglomeration (cut rank count, keep memory safe)

Goal: reduce messages and ranks without exceeding per-rank memory.

How: merge 2×2 blocks into a tile; thread inside each rank.

Before: 64 MPI ranks (each 512x512)
 After:  16 MPI ranks (each 1024x1024 = 2x2 tile)

 Illustration (agglomeration from 2x2 to 1):

 +----+----+      +--------+
 | A  | B  |  =>  |        |
 +----+----+      |   R    |   (One rank handles all 4 blocks)
 | C  | D  |      |        |
 +----+----+      +--------+

Effect:
- Fewer ranks (64 -> 16)
- Larger messages (8 KB per edge instead of 4 KB)
- Threads (OpenMP) handle work inside each bigger block
- Memory per rank ~20 MB (safe under 512 MB limit; see the check below)
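A quick back-of-the-envelope check on that memory figure, assuming two FP64 ping-pong arrays per 1024×1024 tile plus its one-cell halo ring:

$$ 2 \times (1024 + 2)^2 \times 8\ \text{B} \approx 16.8\ \text{MB} $$

Message buffers and checkpoint staging push this toward the quoted ~20 MB per rank, still far below the 512 MB limit.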

IV. Mapping (place work where data moves least)

Goal: minimize cross-node traffic and cross-NUMA access.

How: Cartesian rank layout + core pinning.

Cluster: 4 nodes (32 cores each)
 Layout: 16 MPI ranks × 8 threads = 128 threads

 Example placement:

   Node A (ranks 0-3)     Node B (ranks 4-7)
   +-----+-----+          +-----+-----+
   | R0  | R1  |          | R4  | R5  |
   +-----+-----+          +-----+-----+
   | R2  | R3  |          | R6  | R7  |
   +-----+-----+          +-----+-----+

   Node C (ranks 8-11)    Node D (ranks 12-15)
   +-----+-----+          +-----+-----+
   | R8  | R9  |          |R12  |R13  |
   +-----+-----+          +-----+-----+
   |R10  |R11  |          |R14  |R15  |
   +-----+-----+          +-----+-----+

Inside each node:
- 2 ranks per socket
- 8 threads per rank pinned to cores
- Cartesian layout keeps neighbor IDs aligned with space
- East-West halos mostly intra-node, only North-South cross nodes

Runbook

Code: heat2d.c (ping-pong buffers, pack/unpack halos, 5-point update)

Build & quick check

mpicc -O3 -march=native -fopenmp heat2d.c -o heat2d
./heat2d --nx 512 --ny 512 --steps 10 --tile 512 --verify analytic

Distribute across servers (rank layout)

mpirun -n 16 --map-by ppr:4:node --bind-to core --report-bindings \
  ./heat2d --nx 4096 --ny 4096 --steps 5000 --tile 1024

Full run with threads + checkpoints

OMP_NUM_THREADS=8 OMP_PLACES=cores OMP_PROC_BIND=close \
mpirun -n 16 --map-by ppr:4:node \
  ./heat2d --nx 4096 --ny 4096 --steps 5000 --tile 1024 --checkpoint 200

Outputs & logging: per-tile HDF5 checkpoint files with datasets named temp_step_<t>, plus the rank-to-core binding report printed by --report-bindings.

Principles

Tools and Techniques
