Last modified: March 11, 2025

This article is written in: 🇺🇸

Hardware in Parallel Computing

Parallel computing is the process of breaking a task into smaller parts that can be processed simultaneously by multiple processors. These notes explore the different ways of achieving parallelism in hardware and their impact on parallel computing performance.

Ways of Achieving Parallelism

Parallelism refers to the simultaneous execution of multiple tasks or processes to improve performance and efficiency. There are several ways to achieve parallelism:

Single-Core CPU

A single-core CPU has one processing core and can therefore execute only one instruction stream at a time. The operating system can still juggle many tasks by time-slicing, rapidly switching the core from one task to another, but at any instant only one task is actually making progress, so independent tasks finish in roughly the sum of their individual run times. The minimal sketch below shows this sequential baseline.
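
The following C++ sketch (a hypothetical baseline, not tied to any particular machine) runs two independent summation tasks strictly one after the other, which is the only option when just one core is available:

```cpp
// Sequential baseline: two independent tasks run one after the other,
// exactly as they must on a single core. Build: g++ -std=c++17 -O2 seq.cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

long long sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

int main() {
    std::vector<int> a(10'000'000, 1), b(10'000'000, 2);

    auto start = std::chrono::steady_clock::now();
    long long sa = sum(a);  // task 1
    long long sb = sum(b);  // task 2 starts only after task 1 has finished
    auto stop = std::chrono::steady_clock::now();

    std::cout << "sums: " << sa << " " << sb << ", elapsed ms: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << "\n";
}
```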

Multi-Core CPU

Multi-core CPUs place several independent cores on a single processor chip, and each core can execute its own thread or process at the same time. The operating system schedules runnable threads onto the available cores, so a program that splits its work into enough independent pieces can speed up roughly in proportion to the core count, until synchronization overhead or memory bandwidth becomes the bottleneck. A sketch of splitting one workload across all available cores is shown below.
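
Here is a minimal C++ sketch (the chunking scheme is only an illustration) that divides a large summation across one thread per hardware thread reported by the system; build with something like `g++ -std=c++17 -O2 -pthread sum_threads.cpp`:

```cpp
// Multi-core sketch: split one summation across all available hardware threads.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 40'000'000;
    std::vector<int> data(n, 1);

    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> threads;

    const std::size_t chunk = n / workers;
    for (unsigned i = 0; i < workers; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i + 1 == workers) ? n : begin + chunk;
        // Each of these threads can run on its own core, truly in parallel.
        threads.emplace_back([&, i, begin, end] {
            partial[i] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::cout << "workers: " << workers << ", total: " << total << "\n";
}
```

Each partial sum is written to its own slot, so the threads never contend for the same data and can proceed independently until the final combine step.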

Graphics Processing Unit (GPU)

GPUs are specialized processors designed for parallel computing, particularly data parallelism, where many small processing cores work on different parts of the same problem simultaneously. Here's a deeper look into GPUs:

I. GPUs consist of thousands of smaller, efficient cores, making them ideal for tasks that can be divided into smaller, independent operations, such as rendering graphics and running complex simulations. This ability to handle multiple tasks simultaneously is due to their parallel architecture.

II. To leverage GPU parallel processing power, developers use specialized programming languages and libraries like CUDA, OpenCL, and OpenGL. These tools allow developers to write programs that can distribute work across the many cores of a GPU, optimizing performance.

III. Beyond graphics rendering, GPUs are utilized in various high-performance computing applications such as scientific simulations, data analysis, and machine learning. The parallel nature of these tasks makes GPUs a perfect fit, resulting in significant performance improvements over traditional CPU-based systems.

IV. Due to graphics frame buffer requirements, a GPU must be capable of moving extremely large amounts of data in and out of its main DRAM. The frame buffer is a dedicated block of memory that holds the image data currently being displayed or about to be displayed on the screen; this includes the color value of every pixel and, for 3D rendering, auxiliary per-pixel data such as depth (z-buffer) and stencil values.

The frame buffer must be large enough to store this data for the entire display resolution and needs to be updated rapidly to maintain smooth video playback and responsive graphics rendering.

Imagine you're watching a high-definition video or playing a video game. The GPU has to keep track of every tiny dot (pixel) on your screen, including its color and depth in a 3D space. All this data is stored in the frame buffer, a special part of the GPU's memory. To ensure that what you see is smooth and realistic, the GPU must quickly move this large amount of data in and out of its memory.
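
As a rough, illustrative calculation (the resolution, color depth, and refresh rate are assumptions, not the specification of any particular GPU), consider a 1920 x 1080 display with 32-bit color refreshed 60 times per second:

$$
\underbrace{1920 \times 1080}_{\text{pixels}} \times \underbrace{4\,\text{B}}_{\text{32-bit color}} \approx 8.3\,\text{MB per frame},
\qquad
8.3\,\text{MB} \times 60\,\text{Hz} \approx 0.5\,\text{GB/s}
$$

This only covers scanning the finished image out to the display; rendering a frame also reads textures, runs depth tests, and writes intermediate buffers, which multiplies the traffic and is why GPUs pair with very high-bandwidth DRAM.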

V. Floating-point calculations per video frame are another way to quantify the workload: lighting, shading, and geometry processing mean that every displayed pixel can require hundreds to thousands of floating-point operations, and a new frame is needed 30 to 120 times per second.
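
A back-of-the-envelope estimate, assuming (purely for illustration) 1000 floating-point operations per pixel at 1920 x 1080 and 60 frames per second:

$$
2{,}073{,}600\ \text{pixels} \times 1000\ \tfrac{\text{FLOP}}{\text{pixel}} \times 60\ \tfrac{\text{frames}}{\text{s}} \approx 1.2 \times 10^{11}\ \text{FLOP/s} \approx 124\ \text{GFLOP/s}
$$

Sustaining such rates with a handful of general-purpose cores is difficult, which is why GPUs devote most of their silicon to floating-point arithmetic units.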

Comparison: CPU vs GPU

I. CPU Architecture Diagram

+---------+---------+---------+
| Control |   ALU   |   ALU   |
|  CPU    +---------+---------+
|         |   ALU   |   ALU   |
+---------+---------+---------+
|                Cache        |
+-----------------------------+
|              DRAM           |
+-----------------------------+

II. GPU Architecture Diagram

+-----------------------------+
|  |  |  |  | GPU |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|              DRAM           |
+-----------------------------+

Below is a table comparing CPUs and GPUs, emphasizing their unique features and common use cases:

| Aspect | CPU (Central Processing Unit) | GPU (Graphics Processing Unit) |
|--------|-------------------------------|--------------------------------|
| Core Architecture | Few powerful cores designed for executing complex tasks and instructions. | Hundreds to thousands of simpler cores optimized for handling highly parallel tasks. |
| Control Unit | Sophisticated control units with features like branch prediction, out-of-order execution, and advanced instruction pipelining. | Simpler control units optimized for parallel processing rather than complex decision-making. |
| Cache System | Large multi-level caches (L1, L2, and sometimes L3) to minimize memory access latency. | Smaller, specialized caches designed to manage large data throughput efficiently. |
| Execution Model | Optimized for low-latency execution, handling fewer tasks very quickly. | Optimized for high throughput, capable of executing many tasks concurrently. |
| Memory Access | Complex memory hierarchy to ensure fast access to data. | High-bandwidth memory systems designed to handle large volumes of data efficiently (e.g., GDDR6). |
| Power Efficiency | Generally optimized for lower power consumption in typical computing tasks. | Often optimized for performance, resulting in higher power consumption. |
| Application Focus | Suited for general-purpose computing tasks, such as running operating systems and applications. | Specialized for tasks requiring massive parallelism, such as rendering graphics and computations in AI. |
| Programming Model | Uses conventional programming models and languages like C, C++, and Python. | Uses specialized programming models and languages like CUDA and OpenCL for parallel processing. |
| Cooling Requirements | Typically requires less aggressive cooling solutions. | Often requires more advanced cooling solutions due to higher power consumption and heat generation. |
| Market | Commonly found in personal computers, servers, and mobile devices. | Commonly found in gaming consoles, workstations, and data centers for tasks like machine learning. |

Reducing Latency vs Increasing Throughput

A processor design can spend its transistor and power budget on finishing a single task as quickly as possible (reducing latency) or on finishing as many tasks as possible per unit time (increasing throughput). CPUs lean toward the first goal, with large caches, branch prediction, and out-of-order execution; GPUs lean toward the second.

Throughput-Oriented Design

A throughput-oriented design such as a GPU uses many simple cores, wide SIMD execution, and fast switching among a very large number of threads to hide memory latency, accepting that any individual thread may run more slowly than it would on a CPU.

Parallel Computing Architectures

SISD (Single Instruction, Single Data)
A single instruction stream operates on a single data element at a time. This is the classic, straightforward model where a uniprocessor fetches instructions and data from memory and executes them in sequence.

SIMD (Single Instruction, Multiple Data)
A single instruction stream operates on multiple data elements simultaneously. This category includes array processors and vector processors, where one instruction can be applied to an entire set (or array) of data elements in parallel.
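
A minimal sketch of SIMD on a CPU, assuming an x86-64 machine with AVX support (compile with something like `g++ -O2 -mavx simd_add.cpp`); a single `_mm256_add_ps` instruction adds eight pairs of floats at once:

```cpp
// SIMD sketch: one AVX instruction operates on eight floats at a time.
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 floats
    __m256 vb = _mm256_load_ps(b);      // load 8 floats
    __m256 vc = _mm256_add_ps(va, vb);  // single instruction, 8 additions
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%.0f ", x);
    std::printf("\n");
}
```

A plain loop over the same arrays would compute one element at a time; with auto-vectorization enabled, compilers often emit these same instructions from such a loop.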

MISD (Multiple Instructions, Single Data)
Multiple instructions operate on a single data element. True MISD systems are rare, but systolic array processors are sometimes considered a loose example, as they involve multiple processing elements each performing different operations on data as it flows through.

MIMD (Multiple Instructions, Multiple Data)
Multiple instruction streams operate on multiple data elements concurrently. This category includes multiprocessor and multithreaded systems, where each processor or thread can execute its own sequence of instructions on its own set of data.
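
A small sketch of MIMD behavior on a multi-core CPU: two threads run different instruction streams (one sums integers, the other searches a string) on different data at the same time.

```cpp
// MIMD sketch: two threads execute different code on different data concurrently.
#include <iostream>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<int> numbers(1'000'000);
    std::iota(numbers.begin(), numbers.end(), 1);
    std::string text(1'000'000, 'a');
    text[123456] = 'z';

    long long sum = 0;
    std::size_t pos = 0;

    std::thread t1([&] { sum = std::accumulate(numbers.begin(), numbers.end(), 0LL); });
    std::thread t2([&] { pos = text.find('z'); });

    t1.join();
    t2.join();
    std::cout << "sum = " << sum << ", 'z' found at index " << pos << "\n";
}
```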

                            Parallel Computer Architectures
                                        |
      --------------------------------------------------------------------------------------------------
      |                                 |                              |                               |
    SISD                               SIMD                           MISD                            MIMD
    (Von Neumann)                       |                              ?                               |
                                        |                                                              |
                           ---------------------------                              -----------------------------
                          |                          |                              |                           |
                       Vector                     Array                    Multiprocessors            Multicomputers
                      Processor               Processor                          |                           |
                                                                 ------------------------                 ------------------
                                                                 |            |          |                |                |
                                                                UMA         COMA        NUMA             MPP              COW
                                                                 |                       |                |                
                                                           ------------              -----------          ----------
                                                           |          |              |         |          |        |
                                                          Bus      Switched       CC-NUMA    NC-NUMA    Grid    Hypercube

Programming model vs hardware execution model

The programming model a developer writes in does not have to match the hardware execution model one-to-one. GPU code, for instance, is written as if every thread were an independent program (an SPMD/SIMT style), yet the hardware executes groups of those threads in lockstep as SIMD lanes; conversely, a compiler can turn an ordinary sequential loop into SIMD instructions on a CPU.

Types of Parallelism: Data Parallelism vs Task Parallelism

Parallelism can be broadly classified into two types, based on how tasks are divided and executed:

I. Data Parallelism / SIMD (Single Instruction, Multiple Data): the same operation is applied to many elements of a data set at once, for example adding two large arrays element by element; the work is divided by splitting the data.

II. Task Parallelism / MIMD (Multiple Instruction, Multiple Data): different tasks, each with its own instruction stream and possibly its own data, run concurrently, for example decoding audio while rendering video; the work is divided by splitting the tasks. A short sketch contrasting the two styles follows.
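
A compact C++ sketch using OpenMP (assuming a compiler flag such as `-fopenmp`; without it the pragmas are ignored and the program simply runs sequentially). The first region is data parallelism, where every thread runs the same loop body on a different slice of the array; the second is task parallelism, where different sections run different code concurrently.

```cpp
// Data parallelism vs task parallelism with OpenMP.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 2.0);

    // Data parallelism: the same operation applied to different elements
    // of the array by different threads.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v.size()); ++i) {
        v[i] = std::sqrt(v[i]);
    }

    double sum = 0.0;
    std::size_t count = 0;

    // Task parallelism: two unrelated pieces of work run concurrently.
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (double x : v) sum += x; }              // task A: accumulate

        #pragma omp section
        { for (double x : v) if (x > 1.0) ++count; }  // task B: count elements
    }

    std::printf("sum = %.2f, count = %zu\n", sum, count);
}
```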

Shared Memory Architectures

Shared memory architectures enable multiple processors to access a common global address space, facilitating communication and data sharing among processors. This means that any changes made to a memory location by one processor are immediately visible to all other processors. There are two primary types of shared memory architectures:

I. Uniform Memory Access (UMA): every processor reaches any location in main memory with roughly the same latency. This is typical of small symmetric multiprocessor (SMP) systems and keeps memory placement simple.

II. Non-Uniform Memory Access (NUMA): memory is divided into nodes attached to particular processors or sockets, so access to a processor's local node is faster than access to a remote one. NUMA scales better to large core counts, but software benefits from placing data close to the threads that use it. A small sketch of threads communicating through a shared address space follows.
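
A minimal sketch of the shared-address-space idea using C++ threads as stand-ins for processors: one thread publishes a value, and the other observes it through the same memory location, with `std::atomic` providing the visibility and ordering guarantees.

```cpp
// Shared-memory sketch: two threads communicate through one address space.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<bool> ready{false};
    int payload = 0;  // ordinary shared variable, published via 'ready'

    std::thread producer([&] {
        payload = 42;                                   // write shared data
        ready.store(true, std::memory_order_release);   // make it visible
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        std::cout << "consumer saw payload = " << payload << "\n";
    });

    producer.join();
    consumer.join();
}
```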

Comparison of UMA, NUMA, SIMD, and MIMD

Below is a table that outlines the relevance of Shared Memory, UMA, NUMA, SIMD, and MIMD for both CPUs and GPUs:

| Concept | CPUs | GPUs |
|---------|------|------|
| SIMD (Single Instruction, Multiple Data) | Use SIMD instructions (e.g., SSE, AVX, or NEON) to process multiple data elements in parallel within each core. | Inherently rely on SIMD- or SIMT-based execution, where a single instruction is applied to multiple data elements across many threads. |
| MIMD (Multiple Instruction, Multiple Data) | Modern multi-core CPUs exhibit MIMD by running different instruction streams on separate cores simultaneously. | Mostly use SIMD at the hardware level, but can show MIMD-like behavior across different thread blocks or warps that may execute divergent instructions. |
| Shared Memory | All cores share a common physical memory space accessible to each processor, coordinated by caches and interconnects. | Provide on-chip shared memory (e.g., per-block or per-wavefront) that allows fast data exchange among threads within the same block. |
| UMA (Uniform Memory Access) | In a UMA system, all processors access the shared memory with uniform latency, simplifying memory management. | In integrated CPU-GPU systems with a unified memory architecture, GPUs share the same memory pool with uniform access (though performance can vary). |
| NUMA (Non-Uniform Memory Access) | Memory access times depend on physical proximity to specific memory regions (NUMA nodes), improving scalability for large systems. | Less common in typical GPU deployments, but can appear in high-end or multi-GPU systems for large-scale parallel processing with distributed memory. |

Distributed Computing and Cluster Computing Architectures

Distributed computing and cluster computing are approaches to parallelism that leverage multiple machines to work on a common task. They offer significant benefits in terms of scalability, fault tolerance, and performance.

Distributed Computing

Distributed computing involves a network of independent computers that work together to perform a task. These computers communicate and coordinate their actions by passing messages to one another.
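
As a sketch of explicit message passing, here is a minimal MPI program in C++ (MPI is just one possible choice; it assumes an MPI installation and is typically built and launched with `mpic++` and `mpirun -np 2`). Process 0 sends a value and process 1 receives it; the processes share no memory and cooperate only through messages.

```cpp
// Message-passing sketch with MPI: two processes exchange data explicitly.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    if (size >= 2) {
        if (rank == 0) {
            int value = 123;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // send to rank 1
            std::printf("rank 0 sent %d\n", value);
        } else if (rank == 1) {
            int value = 0;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          // receive from rank 0
            std::printf("rank 1 received %d\n", value);
        }
    }

    MPI_Finalize();
}
```

The same kind of program runs unchanged on a cluster, where the launcher places the processes on different machines connected by the network.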

I. Architecture

II. Characteristics

III. Examples

IV. Challenges

Cluster Computing

Cluster computing involves a group of closely connected computers that work together as a single system. Clusters are typically used for high-performance computing tasks.

I. Architecture

II. Characteristics

III. Examples

IV. Challenges

Comparison of Distributed and Cluster Computing

| Aspect | Distributed Computing | Cluster Computing |
|--------|-----------------------|-------------------|
| Infrastructure | Nodes can be geographically dispersed and connected over the internet. | Nodes are usually in close physical proximity, connected via a high-speed local network. |
| Management | Decentralized management with no single point of control. | Centralized management with a head node overseeing the cluster operations. |
| Use Cases | Suitable for tasks that require diverse resources and can tolerate higher latency (e.g., collaborative research projects, global computing networks). | Ideal for tasks needing intensive computation and low-latency communication (e.g., scientific simulations, data analysis). |
