
Hardware in Parallel Computing

Parallel computing is the process of breaking a task into smaller parts that can be processed simultaneously by multiple processors. These notes explore the different ways of achieving parallelism in hardware and their impact on parallel computing performance.

Ways of Achieving Parallelism

Parallelism refers to the simultaneous execution of multiple tasks or processes to improve performance and efficiency. There are several ways to achieve parallelism:

Single-Core CPU

A single-core CPU can execute only one instruction stream at a time, which limits its parallel computing capability. Apparent multitasking on such a processor is an illusion created by time-slicing: the operating system rapidly switches the core between tasks (context switching), so tasks run concurrently but never truly in parallel.

Multi-Core CPU

Multi-core CPUs contain multiple cores within a single processor chip, each capable of executing its own thread or process. This architecture significantly improves parallel computing performance: the operating system can schedule independent threads onto separate cores, so they genuinely run at the same time rather than taking turns on a single core. A minimal sketch of this idea appears below.
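As a small illustration (a hypothetical example, not from the original article), the following C++ program splits a summation across hardware threads with `std::thread`; on a multi-core CPU, each worker thread can run on its own core:

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> data(n, 1.0);

    // Use one worker per hardware core (fall back to 2 if unknown).
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;

    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;

    // Each thread sums a disjoint chunk of the array -- on a multi-core
    // CPU these loops execute simultaneously on different cores.
    for (unsigned t = 0; t < workers; ++t) {
        pool.emplace_back([&, t] {
            std::size_t begin = t * n / workers;
            std::size_t end   = (t + 1) * n / workers;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& th : pool) th.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << '\n';  // expected: 1e+06
}
```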

Graphics Processing Unit (GPU)

GPUs are specialized processors designed for parallel computing, particularly data parallelism, where many small processing cores work on different parts of the same problem simultaneously. Here's a deeper look into GPUs:

I. GPUs consist of thousands of smaller, efficient cores, making them ideal for tasks that can be divided into smaller, independent operations, such as rendering graphics and running complex simulations. This ability to handle multiple tasks simultaneously is due to their parallel architecture.

II. To leverage GPU parallel processing power, developers use specialized programming languages and libraries like CUDA, OpenCL, and OpenGL. These tools allow developers to write programs that can distribute work across the many cores of a GPU, optimizing performance.
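To make this concrete, here is a minimal CUDA sketch (a hypothetical illustration, not taken from the article) that adds two vectors; each of the many GPU threads handles one element, which is exactly the kind of small, independent operation described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements -- data parallelism.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                 // about one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);          // unified memory: visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expected: 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```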

III. Beyond graphics rendering, GPUs are utilized in various high-performance computing applications such as scientific simulations, data analysis, and machine learning. The parallel nature of these tasks makes GPUs a perfect fit, resulting in significant performance improvements over traditional CPU-based systems.

IV. Due to graphics frame buffer requirements, a GPU must be capable of moving extremely large amounts of data in and out of its main DRAM. The frame buffer is a dedicated block of memory that holds the image data currently being displayed or about to be displayed on the screen, which includes several types of data: the color value of every pixel, depth (Z-buffer) values used to resolve which surfaces are visible in 3D, and often auxiliary data such as stencil or transparency (alpha) information.

The frame buffer must be large enough to store this data for the entire display resolution and needs to be updated rapidly to maintain smooth video playback and responsive graphics rendering.

Imagine you're watching a high-definition video or playing a video game. The GPU has to keep track of every tiny dot (pixel) on your screen, including its color and depth in a 3D space. All this data is stored in the frame buffer, a special part of the GPU's memory. To ensure that what you see is smooth and realistic, the GPU must quickly move this large amount of data in and out of its memory.

V. Floating-point calculations per video frame are another driver of GPU design: producing a single frame involves millions of per-vertex and per-pixel floating-point operations (transformations, lighting, shading), and at interactive frame rates this adds up to billions of floating-point operations per second.
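A rough back-of-the-envelope calculation shows the scale involved (the per-pixel figures below are illustrative assumptions, not numbers from the article):

```cpp
#include <cstdio>

int main() {
    // Assumed workload: 1080p display, 60 frames per second,
    // 4 bytes of color per pixel, ~100 floating-point ops per pixel.
    const double pixels        = 1920.0 * 1080.0;   // ~2.07 million pixels
    const double fps           = 60.0;
    const double bytesPerPixel = 4.0;
    const double flopsPerPixel = 100.0;             // illustrative guess

    double framebufferMB = pixels * bytesPerPixel / 1e6;        // ~8.3 MB
    double scanoutGBs    = pixels * bytesPerPixel * fps / 1e9;  // ~0.5 GB/s
    double gflops        = pixels * flopsPerPixel * fps / 1e9;  // ~12.4 GFLOPS

    printf("frame buffer : %.1f MB per frame\n", framebufferMB);
    printf("scan-out     : %.2f GB/s just to refresh the screen\n", scanoutGBs);
    printf("shading      : %.1f GFLOPS at 60 fps\n", gflops);
}
```

Even with these conservative assumptions, the memory traffic and arithmetic demands dwarf what a latency-oriented CPU memory system is designed to deliver.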

Comparison: CPU vs GPU

I. CPU Architecture Diagram

+---------+---------+---------+
| Control |   ALU   |   ALU   |
|  CPU    +---------+---------+
|         |   ALU   |   ALU   |
+---------+---------+---------+
|                Cache        |
+-----------------------------+
|              DRAM           |
+-----------------------------+

II. GPU Architecture Diagram

+-----------------------------+
|  |  |  |  | GPU |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|  |  |  |  |  |  |  |  |  |  |
+-----------------------------+
|              DRAM           |
+-----------------------------+

Below is a table comparing CPUs and GPUs, emphasizing their unique features and common use cases:

| Aspect | CPU (Central Processing Unit) | GPU (Graphics Processing Unit) |
|--------|-------------------------------|--------------------------------|
| Core Architecture | Few powerful cores designed for executing complex tasks and instructions. | Hundreds to thousands of simpler cores optimized for handling highly parallel tasks. |
| Control Unit | Sophisticated control units with features like branch prediction, out-of-order execution, and advanced instruction pipelining. | Simpler control units optimized for parallel processing rather than complex decision-making. |
| Cache System | Large multi-level caches (L1, L2, and sometimes L3) to minimize memory access latency. | Smaller, specialized caches designed to manage large data throughput efficiently. |
| Execution Model | Optimized for low-latency execution, handling fewer tasks very quickly. | Optimized for high throughput, capable of executing many tasks concurrently. |
| Memory Access | Complex memory hierarchy to ensure fast access to data. | High-bandwidth memory systems designed to handle large volumes of data efficiently (e.g., GDDR6). |
| Power Efficiency | Generally optimized for lower power consumption in typical computing tasks. | Often optimized for performance, resulting in higher power consumption. |
| Application Focus | Suited for general-purpose computing tasks, such as running operating systems and applications. | Specialized for tasks requiring massive parallelism, such as rendering graphics and computations in AI. |
| Programming Model | Uses conventional programming models and languages like C, C++, and Python. | Uses specialized programming models and languages like CUDA and OpenCL for parallel processing. |
| Cooling Requirements | Typically requires less aggressive cooling solutions. | Often requires more advanced cooling solutions due to higher power consumption and heat generation. |
| Market | Commonly found in personal computers, servers, and mobile devices. | Commonly found in gaming consoles, workstations, and data centers for tasks like machine learning. |

Reducing Latency vs Increasing Throughput

CPU and GPU design reflect two different optimization goals. A latency-oriented design, typical of CPUs, tries to finish each individual task as quickly as possible, spending transistors on large caches, branch prediction, and out-of-order execution. A throughput-oriented design, typical of GPUs, accepts that any single task may take longer but maximizes the total amount of work completed per unit of time.

Throughput-Oriented Design

A throughput-oriented processor devotes its hardware budget to many simple cores and wide, high-bandwidth memory interfaces rather than to sophisticated control logic. When one group of threads stalls on a memory access, the scheduler switches to another group that is ready to run, hiding memory latency behind useful work. This is why GPUs can tolerate relatively slow individual memory accesses and still achieve very high aggregate performance on data-parallel workloads.

Parallel Computing Architectures

                            Parallel Computer Architectures
                                        |
      --------------------------------------------------------------------------------------------------
      |                                 |                              |                               |
    SISD                               SIMD                           MISD                            MIMD
    (Von Neumann)                       |                              ?                               |
                                        |                                                              |
                           ---------------------------                              -----------------------------
                          |                          |                              |                           |
                       Vector                     Array                    Multiprocessors            Multicomputers
                      Processor               Processor                          |                           |
                                                                 ------------------------                 ------------------
                                                                 |            |          |                |                |
                                                                UMA         COMA        NUMA             MPP              COW
                                                                 |                       |                |                
                                                           ------------              -----------          ----------
                                                           |          |              |         |          |        |
                                                          Bus      Switched       CC-NUMA    NC-NUMA    Grid    Hypercube

Types of Parallelism: Data Parallelism vs Task Parallelism

Parallelism can be broadly classified into two types, based on how tasks are divided and executed:

I. Data Parallelism / SIMD (Single Instruction, Multiple Data): the same operation is applied simultaneously to many elements of a data set. Vector instructions and GPU kernels are typical examples.

II. Task Parallelism / MIMD (Multiple Instruction, Multiple Data): different, independent tasks run concurrently, each with its own instruction stream and its own data. Multi-core CPUs running separate threads are a typical example. A small sketch contrasting the two styles follows.
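To make the distinction concrete, here is a small hypothetical C++ sketch: the first pair of threads performs the same operation on different parts of one array (data parallelism), while the second pair performs different operations at the same time (task parallelism):

```cpp
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};

    // Data parallelism: every thread applies the SAME operation
    // (doubling) to a different chunk of the same array.
    std::thread d1([&] { for (int i = 0; i < 4; ++i) v[i] *= 2; });
    std::thread d2([&] { for (int i = 4; i < 8; ++i) v[i] *= 2; });
    d1.join(); d2.join();

    // Task parallelism: the threads run DIFFERENT operations at once.
    int sum = 0, maxval = 0;
    std::thread t1([&] { for (int x : v) sum += x; });                       // task A: sum
    std::thread t2([&] { maxval = *std::max_element(v.begin(), v.end()); }); // task B: max
    t1.join(); t2.join();

    std::cout << "sum = " << sum << ", max = " << maxval << '\n';  // sum = 62, max = 18
}
```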

Shared Memory Architectures

Shared memory architectures enable multiple processors to access a common global address space, facilitating communication and data sharing among processors. This means that any changes made to a memory location by one processor are immediately visible to all other processors. There are two primary types of shared memory architectures:

I. Uniform Memory Access (UMA): every processor accesses any memory location with the same latency, typically over a shared bus. This is simple to program, but the shared interconnect becomes a bottleneck as the number of processors grows.

II. Non-Uniform Memory Access (NUMA): each processor has memory that is physically closer ("local") and faster to reach than memory attached to other processors ("remote"). NUMA scales to larger processor counts, but software performs best when it keeps data near the processor that uses it. The sketch below shows the basic shared-memory property that both designs have in common.
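As a minimal illustration of shared memory (a hypothetical sketch, independent of the UMA/NUMA distinction), one thread below writes a value and another observes it directly through the common address space, with an atomic flag providing the necessary synchronization:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    // One variable in the shared address space; both threads see it.
    std::atomic<int> flag{0};
    int payload = 0;  // ordinary shared data, published via the flag

    std::thread producer([&] {
        payload = 42;                              // write shared data
        flag.store(1, std::memory_order_release);  // signal: data is ready
    });

    std::thread consumer([&] {
        while (flag.load(std::memory_order_acquire) == 0) {}  // wait for signal
        std::cout << "consumer saw payload = " << payload << '\n';  // prints 42
    });

    producer.join();
    consumer.join();
}
```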

Comparison of UMA, NUMA, SIMD, and MIMD

Here's a summary table that outlines the relevance of shared memory, UMA, NUMA, SIMD, and MIMD to both CPUs and GPUs:

| Concept | CPUs | GPUs |
|---------|------|------|
| SIMD (Single Instruction, Multiple Data) | Used in vector processors and CPU instructions like SSE, AVX for parallel data processing. | Inherently SIMD, executing the same instruction on multiple data points in parallel. |
| MIMD (Multiple Instruction, Multiple Data) | Modern multi-core CPUs execute different instructions on different data independently. | Primarily SIMD, but also exhibits MIMD characteristics with different threads or blocks executing different instructions. |
| Shared Memory | Multiple processors can access the same physical memory space. | On-chip memory accessible by all threads within a block for fast exchange. |
| UMA (Uniform Memory Access) | Processors share physical memory uniformly with equal access time. | Shared memory architecture in integrated systems, simplifying the programming model. |
| NUMA (Non-Uniform Memory Access) | Memory access time depends on the memory location relative to the processor, improving scalability. | Less common, but can be implemented in high-end systems for large-scale parallel processing. |

Distributed Computing and Cluster Computing Architectures

Distributed computing and cluster computing are approaches to parallelism that leverage multiple machines to work on a common task. They offer significant benefits in terms of scalability, fault tolerance, and performance.

Distributed Computing

Distributed computing involves a network of independent computers that work together to perform a task. These computers communicate and coordinate their actions by passing messages to one another.

I. Architecture: loosely coupled, independent nodes, each with its own memory and operating system, connected over a network (often the internet) and cooperating purely by exchanging messages.

II. Characteristics: no shared memory or global clock, geographic dispersion, heterogeneous hardware, and an emphasis on scalability and fault tolerance, since individual nodes may join, leave, or fail at any time.

III. Examples: volunteer-computing projects such as SETI@home and Folding@home, content delivery networks, and large-scale web services spanning many data centers.

IV. Challenges: network latency and limited bandwidth, partial failures, keeping replicated data consistent, and securing communication between mutually untrusting machines. A minimal message-passing sketch follows.
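Message passing is the defining mechanism here. The sketch below uses MPI, a standard message-passing interface, to send a value from one process to another (a hypothetical illustration, not from the article):

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    if (rank == 0 && size > 1) {
        int payload = 42;
        // Process 0 sends one int to process 1 (message tag 0).
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", payload);
    }

    MPI_Finalize();
}
```

Run with, for example, `mpirun -np 2 ./a.out`: every process executes the same program, but each takes a different branch based on its rank, which is the usual pattern in message-passing programs.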

Cluster Computing

Cluster computing involves a group of closely connected computers that work together as a single system. Clusters are typically used for high-performance computing tasks.

I. Architecture: a head (or master) node schedules work onto a set of largely identical compute nodes, all connected by a high-speed local network such as Ethernet or InfiniBand and often sharing a common file system.

II. Characteristics: physical proximity, homogeneous hardware, centralized management, and low-latency, high-bandwidth communication, which together let the cluster behave like a single powerful machine.

III. Examples: Beowulf-style commodity clusters, university and national HPC clusters running batch-scheduled jobs, and render farms.

IV. Challenges: balancing load evenly across nodes, the head node as a potential single point of failure, and the cost of power, cooling, and interconnect hardware.

Comparison of Distributed and Cluster Computing

| Aspect | Distributed Computing | Cluster Computing |
|--------|-----------------------|-------------------|
| Infrastructure | Nodes can be geographically dispersed and connected over the internet. | Nodes are usually in close physical proximity, connected via a high-speed local network. |
| Management | Decentralized management with no single point of control. | Centralized management with a head node overseeing the cluster operations. |
| Use Cases | Suitable for tasks that require diverse resources and can tolerate higher latency (e.g., collaborative research projects, global computing networks). | Ideal for tasks needing intensive computation and low-latency communication (e.g., scientific simulations, data analysis). |
