GPU Programming

GPUs (Graphics Processing Units) excel at performing the same operation on many data elements in parallel. Originally, GPUs were used just for rendering images, but their highly parallel nature also makes them ideal for general-purpose tasks with large data sets, such as scientific simulations, machine learning, and other compute-intensive workloads.

What’s special about GPU programming?

Time-Space Duality

GPU architectures (and more broadly, parallel processors) often exploit a concept called time-space duality, which compares two main ways of doing parallel operations: array processors and vector processors. Understanding these helps illustrate general principles behind GPU designs.

====================================================
ARRAY PROCESSOR (Spatial Parallelism)
====================================================
         |   T0   |   T1   |   T2   |   T3   |
----------------------------------------------------
 PE0     |  LD0   |  ADD0  |  MUL0  |  ST0   |
 PE1     |  LD1   |  ADD1  |  MUL1  |  ST1   |
 PE2     |  LD2   |  ADD2  |  MUL2  |  ST2   |
 PE3     |  LD3   |  ADD3  |  MUL3  |  ST3   |
----------------------------------------------------
(Same operation across all PEs in each time slot)

====================================================
VECTOR PROCESSOR (Temporal Parallelism)
====================================================
Time -->    T0     T1     T2     T3     T4     T5     T6     T7
----------------------------------------------------------------
 Elem0:     LD0    ADD0   MUL0   ST0
 Elem1:            LD1    ADD1   MUL1   ST1
 Elem2:                   LD2    ADD2   MUL2   ST2
 Elem3:                          LD3    ADD3   MUL3   ST3
----------------------------------------------------------------
(Each data element "flows" through the pipeline over time)

How does this fit into GPU programming?

I. Array processors (spatial parallelism): the processing elements are replicated in space, so in each time slot every PE applies the same operation to a different data element.

II. Vector processors (temporal parallelism): a single pipelined functional unit is reused over time, so consecutive data elements occupy different pipeline stages in the same cycle.

In modern GPUs, there are elements of both approaches. You can think of each GPU “core” as being replicated (spatial parallelism), but within each core, pipelined operations occur (temporal parallelism). This combination helps GPUs handle large-scale parallelism efficiently.

Vector Processor

A vector processor is an architecture that operates on entire vectors (1D arrays) with a single instruction. This approach is relevant to GPU programming because GPUs often provide SIMD (Single Instruction, Multiple Data) capabilities similar to vector processors.

Benefits

Operating on whole vectors has several advantages: the elements of a vector are independent of one another, so they can be processed in a deep pipeline (or across parallel lanes) without stalls; a single vector instruction describes many operations, which reduces instruction fetch and decode overhead; and the memory access pattern of a vector is regular and known in advance, which makes prefetching and banked memory systems effective.

Strides

When data is laid out in memory, sometimes elements that belong together (e.g., all elements of a column in a matrix) are not next to each other. Stride tells the hardware how far to jump in memory to reach the next element of a vector.

Row-major layout example

Consider a $4 \times 4$ matrix $A$ in row-major order:

$$ \begin{bmatrix} A(0,0) & A(0,1) & A(0,2) & A(0,3) \\ A(1,0) & A(1,1) & A(1,2) & A(1,3) \\ A(2,0) & A(2,1) & A(2,2) & A(2,3) \\ A(3,0) & A(3,1) & A(3,2) & A(3,3) \end{bmatrix} $$

In memory (row-major), row 0 comes first, then row 1, and so on. Accessing a row is contiguous ($\text{stride} = 1$), but accessing a column is not contiguous ($\text{stride} = \text{number of columns}$).
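
To make the stride concrete, here is a small host-side sketch (the array and constant names are illustrative) that stores a $4 \times 4$ matrix in row-major order and walks one row with stride 1 and one column with a stride equal to the number of columns:

#include <stdio.h>

#define N 4                                   /* number of columns */

int main(void) {
    float A[N * N];                           /* A(i,j) lives at A[i * N + j] */
    for (int k = 0; k < N * N; ++k) A[k] = (float)k;

    for (int j = 0; j < N; ++j)               /* row 2: stride 1 (contiguous) */
        printf("A(2,%d) = %.0f\n", j, A[2 * N + j]);

    for (int i = 0; i < N; ++i)               /* column 1: stride N (= number of columns) */
        printf("A(%d,1) = %.0f\n", i, A[i * N + 1]);

    return 0;
}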

Matrix multiplication

When computing

$$ C = A \times B, $$

each element $C(i,j)$ is the dot product of row $i$ of $A$ and column $j$ of $B$. Both $A$ and $B$ are also in row-major format. So:

I. Load row of $A$

$$ [A(i,0), A(i,1), \dots, A(i,n-1)] $$

II. Load column of $B$

$$ [B(0,j), B(1,j), B(2,j), \dots, B(n-1,j)] $$

III. Vector multiply and accumulate

Multiply corresponding elements and sum them up:

$$ C(i,j) = \sum_{k=0}^{n-1} A(i,k) \cdot B(k,j) $$

By setting a vector stride correctly, a single vector instruction can fetch or process these elements in a single pass. This eliminates the need for complex address computations in a loop, letting the hardware handle it efficiently.
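
As a sketch of how this maps to code (the kernel name and launch shape are illustrative), the naive kernel below assigns one thread to each $C(i,j)$; the row of $A$ is read with stride 1 and the column of $B$ with stride $p$, exactly the access patterns described above. Here $A$ is $m \times n$, $B$ is $n \times p$, and $C$ is $m \times p$, all row-major.

__global__ void matmulNaive(const float* A, const float* B, float* C,
                            int m, int n, int p) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
    if (i < m && j < p) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[i * n + k] * B[k * p + j];      // A: stride 1, B: stride p
        C[i * p + j] = acc;
    }
}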

Loading and Storing Vectors from Memory

In practice, vector loads/stores are a key part of high-performance computing. When data is well-organized in memory and has a suitable stride, the hardware can fetch multiple elements at once, reducing the number of individual instructions and improving overall throughput.
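
As one concrete illustration, CUDA provides built-in vector types such as float4; when the data is 16-byte aligned, each thread can move four consecutive floats with a single wide load and store. The kernel below is a minimal sketch (the name is illustrative) that assumes the element count is a multiple of 4.

// Copy using 128-bit vector loads/stores: each thread moves four floats at once.
// Assumes src and dst are 16-byte aligned and nVec = n / 4.
__global__ void copyFloat4(const float4* src, float4* dst, int nVec) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVec)
        dst[i] = src[i];        // one wide load + one wide store per thread
}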

Memory Banking

Memory banking splits memory into multiple banks that can be accessed independently, so several memory operations can be in flight at once. Each bank has its own Memory Address Register (MAR) and Memory Data Register (MDR), and the banks connect to the CPU via a shared address bus and data bus. The following diagram shows the high-level structure of a system with multiple memory banks:

                 +----------------+
                 |      CPU       |
                 |  (exec unit,   |
                 |   control)     |
                 +--------+-------+
                          |
             +------------+------------+
             |   Address Bus &         |
             |     Data Bus            |
             +------------+------------+
                          |
         -------------------------------------------------
         |          |           |       ...       |      |
         v          v           v                 v      v
+-----------+  +-----------+  +-----------+   ...   +-----------+
|  Bank 0   |  |  Bank 1   |  |  Bank 2   |         | Bank 15   |
| (MAR,MDR) |  | (MAR,MDR) |  | (MAR,MDR) |         | (MAR,MDR) |
+-----------+  +-----------+  +-----------+   ...   +-----------+

Memory banking is used for high-performance vector or GPU architectures because it increases effective memory bandwidth. If each element of a vector is mapped to a different bank, the system can read multiple elements simultaneously or in rapid succession.
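
A common interleaving scheme (sketched below with illustrative names; real memory controllers may hash addresses differently) assigns consecutive elements to consecutive banks, so the bank index is simply the element address modulo the number of banks. Unit-stride accesses then spread across all banks, while a stride equal to the bank count hits the same bank repeatedly and serializes.

#include <stdio.h>

#define NUM_BANKS 16

/* Low-order interleaving: consecutive words map to consecutive banks. */
static unsigned bank_of(unsigned word_address) {
    return word_address % NUM_BANKS;
}

int main(void) {
    for (unsigned a = 0; a < 8; ++a)                        /* stride 1: 8 different banks */
        printf("element %3u -> bank %2u\n", a, bank_of(a));

    for (unsigned a = 0; a < 8 * NUM_BANKS; a += NUM_BANKS) /* stride 16: always the same bank */
        printf("element %3u -> bank %2u (conflict)\n", a, bank_of(a));
    return 0;
}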

Vectorizable Loops

Vectorizable loops are a prime candidate for GPU acceleration. Each GPU thread might handle a portion of the data, or specialized instructions can process multiple elements per thread. In either case, independence between loop iterations makes it easy to parallelize the computation.
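
For example, the classic SAXPY loop has no dependences between iterations, so it can be handed to the GPU one iteration per thread (the kernel name below is illustrative):

// CPU loop (vectorizable): for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
// CUDA version: each thread performs exactly one independent iteration.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}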

GPUs Are SIMD Engines

SIMD vs SIMT

===============================================
SIMD (Single Instruction, Multiple Data)
===============================================
          +------------------------+
          |  Instruction Fetch    |
          |  (one instruction)    |
          +-----------+-----------+
                      |
                      v
    +-----------------+-----------------+-----------------+-----------------+
    |     Lane 0      |     Lane 1      |     Lane 2      |     Lane 3      |
    |  (data element) |  (data element) |  (data element) |  (data element) |
    +-----------------+-----------------+-----------------+-----------------+

(One instruction is applied to multiple data elements simultaneously 
 in parallel lanes. All lanes share the same Program Counter.)

===============================================
SIMT (Single Instruction, Multiple Threads)
===============================================
           +-------------------------------------------------+
           |         Warp Scheduler / Dispatch Unit          |
           +------------------------+-------------------------+
                                    |
                (Groups threads     |  Same instruction if
                into warps          |  they converge in PC)
                                    v
   +------------+   +------------+   +------------+   +------------+
   |  Thread 0  |   |  Thread 1  |   |  Thread 2  |   |  Thread 3  |
   | Instr. Strm|   | Instr. Strm|   | Instr. Strm|   | Instr. Strm|
   +------------+   +------------+   +------------+   +------------+

(Each thread has its own instruction stream and Program Counter, 
 but threads that execute the same instruction can run in lockstep 
 as a warp, effectively behaving like SIMD when they follow one path.)

Fine-Grained Multithreading

To hide long memory latencies, GPU cores keep many warps resident at once and use fine-grained multithreading: the scheduler can issue an instruction from a different warp every cycle, so while one warp waits on a long-latency memory access, other warps keep the execution units busy. This latency hiding, rather than large caches or out-of-order execution, is the main way GPUs tolerate slow global memory.

High-level GPU architecture

                              +--------------------------------+
                              |           PC, Mask             |
                              |        +--------------+        |
                              |        |   I-Cache    |        |
                              |        +--------------+        |
                              |             Decode             |
                              |                                |
                              |  +--------------+  +--------+  |
                              |  | Scalar Pipe  |  |  ...   |  |  <-- Possibly multiple
                              |  +--------------+  +--------+  |      scalar pipelines
                              |  | Scalar Pipe  |              |
                              |  +--------------+              |
                              |         ...                    |
                              |        SIMD Exec               |
                              +---------------+----------------+
                                              |
                                              |
            +----------------------+-----------+--------------------+
            |                Interconnection Network                |
            +------------+------------+------------+----------------+
                         |            |            |
                         v            v            v
                  +-----------+  +-----------+  +-----------+  ...
                  |ShaderCore |  |ShaderCore |  |ShaderCore |
                  +-----------+  +-----------+  +-----------+
                         |            |            |
                         +-----+------+------+-----+
                               |      |      |
         +---------------------+------+------+---------------------+
         |                 Memory Controller(s)                    |
         |                 +--------------+                        |
         |                 |    GDDRx     |  (e.g., GDDR3, GDDR5)  |
         +-----------------+--------------+------------------------+

I. Front-End
  - PC (Program Counter), Mask: tracks instruction addresses and the active lanes (threads).
  - I-Cache: holds instructions to be decoded.
  - Decode: decodes instructions into micro-ops for the pipelines.
  - Scalar Pipelines: execute scalar operations (e.g., address calculations, control flow).
  - SIMD Execution: executes vector (wide) operations for parallel data processing (shaders).

II. Interconnection Network
  - Connects the front-end and memory subsystem to multiple shader cores (compute units).

III. Shader Cores
  - Each core has ALUs, registers, and scheduling hardware for many parallel threads (shaders).

IV. Memory Controllers + GDDR
  - Multiple controllers handle high-bandwidth GDDR memory (e.g., GDDR3, GDDR5) accesses in parallel.

General-Purpose Programming on GPUs

Typical GPU Execution Flow (Three Steps):

I. Load Data to GPU

II. Execute GPU Kernel

III. Copy Results Back

Here’s a simplified diagram showing the basic data flow:

+-------------------------+       Step 1        +------------------------+
|         CPU (Host)      | --(Host->Device)--> |       GPU (Device)     |
|                         | <-(Device->Host)--  |                        |
|   (Runs host code)      |       Step 3        | (Executes GPU kernels) |
+-------------------------+                     +-----------+------------+
                                                   ^
                                                   | Step 2
                                                   |
                                                 Kernel
                                                 Launch
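
The following is a minimal end-to-end sketch of these three steps (the kernel and variable names are illustrative): it allocates device memory, copies the input over, launches a trivial kernel, and copies the result back.

#include <cuda_runtime.h>
#include <vector>

__global__ void addOne(float* d, int n) {                 // Step 2 runs this on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 0.0f);                        // host data
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));                    // device allocation

    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // Step 1
    addOne<<<(n + 255) / 256, 256>>>(d, n);                              // Step 2
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);  // Step 3

    cudaFree(d);
    return 0;
}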

Warps and Blocks

Below is a diagram that visualizes how blocks and warps fit into the overall GPU execution model. Each grid consists of one or more blocks, and each block contains many threads. Under the hood, the GPU hardware groups these threads into warps (on NVIDIA GPUs, typically 32 threads per warp).

+-------------------------------------------------------------+
|                    GPU Grid                                 |
| (launched by Kernel<<< numBlocks, threadsPerBlock >>>(...)) | 
|                                                             |
|  +------------------+    +------------------+               |
|  |    Block 0       |    |    Block 1       |               |
|  | (threads)        |    | (threads)        |  ...          |
|  | +--------------+ |    | +--------------+ |               |
|  | |   Warp 0     | |    | |   Warp 0     | |               |
|  | | (th0..th31)  | |    | | (th0..th31)  | |               |
|  | +--------------+ |    | +--------------+ |               |
|  | +--------------+ |    | +--------------+ |               |
|  | |   Warp 1     | |    | |   Warp 1     | |               |
|  | | (th32..th63) | |    | | (th32..th63) | |               |
|  | +--------------+ |    | +--------------+ |               |
|  |       ...        |    |       ...        |               |
|  +------------------+    +------------------+               |
+-------------------------------------------------------------+

How Blocks and Warps Relate

I. Grid: the entire problem domain, composed of multiple blocks. A GPU kernel launch specifies:
  - the number of blocks (e.g., numBlocks)
  - the number of threads per block (e.g., threadsPerBlock)

II. Blocks: each block is an independent group of threads:
  - All threads in a block can cooperate using shared memory and barrier synchronization (__syncthreads() in CUDA).
  - Blocks (and the threads within them) are often arranged in one-, two-, or three-dimensional configurations (e.g., a 2D grid of threads for image processing).

III. Threads: each block has many threads:
  - Each thread has its own registers and local memory.
  - Threads in the same block can communicate and synchronize.

IV. Warps: the GPU hardware divides the threads in each block into warps:
  - A warp is typically 32 threads on NVIDIA GPUs (the exact number can vary by architecture/vendor).
  - All threads in a warp execute in lockstep (SIMD fashion).
  - If threads in the same warp diverge (e.g., take different branches of an if statement), the warp executes each branch sequentially, reducing overall efficiency, as the sketch after this list illustrates.
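
As a sketch of divergence (kernel and variable names are illustrative), the code below computes each thread's warp and lane indices and then branches on the lane, so half of every warp takes one path and half takes the other; the hardware runs the two paths one after another with the inactive half masked off.

// Illustrative divergence example: threads in the same warp take different branches.
__global__ void divergenceDemo(float* data, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;   // which warp within the block (warp size 32 assumed)
    int lane = threadIdx.x % 32;   // position within the warp

    if (i < n) {
        if (lane < 16)             // half of the warp takes this path...
            data[i] *= 2.0f;
        else                       // ...and the other half takes this one,
            data[i] += 1.0f;       // so the two paths are executed sequentially
    }
    (void)warp;                    // warp index shown only for illustration
}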

Warp-Based SIMD

From the programmer's point of view, each CUDA thread executes an ordinary scalar instruction stream. The hardware, however, fetches one instruction per warp and applies it to all threads of the warp at once, so a warp behaves like a wide SIMD unit; this execution style is often called warp-based SIMD.

Comparing Traditional SIMD vs. Warp-Based SIMD

In traditional SIMD (e.g., CPU vector extensions), the vector width is fixed by the instruction set and exposed to software: the programmer or compiler emits explicit vector instructions and must handle leftover elements. In warp-based SIMD, software is written as many independent scalar threads; the hardware groups them into warps, masks off inactive lanes on divergence, and treats the warp width as an implementation detail rather than part of the programming model.
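
To make the contrast concrete, here is a minimal sketch (the function names are illustrative, not part of any established API). The first version exposes a fixed vector width of 4 directly in the loop structure, the way explicit SIMD code does; the second is the SIMT version, written as scalar per-thread code that the hardware groups into warps.

// Traditional SIMD style: the vector width (4 here) is visible in the code,
// and the programmer is responsible for stepping through the data in chunks.
void addSimdStyle(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4)                    // one "vector step" per 4 elements
        for (int k = 0; k < 4 && i + k < n; ++k)      // conceptually one wide operation
            c[i + k] = a[i + k] + b[i + k];
}

// Warp-based SIMD (SIMT) style: each thread is ordinary scalar code; the
// hardware transparently executes 32 such threads per warp in lockstep.
__global__ void addSimtStyle(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}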

Program Structure in CUDA

A CUDA kernel is a function marked with __global__ that runs on the GPU across many threads; each thread computes its own index and works on a slice of the data.

__global__ void scale(float* d, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= s;
}

Before the GPU can touch your data, you allocate memory on the device; the pointer you hold on the CPU is just a handle to memory that physically lives on the GPU.

float* d_data = nullptr;
size_t bytes = n * sizeof(float);
cudaError_t err = cudaMalloc(&d_data, bytes);
if (err != cudaSuccess) { /* handle error */ }

Inputs must be moved from host RAM to device memory; cudaMemcpy does a raw byte copy and you must state the direction explicitly.

cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

Kernels are launched with the <<<grid, block>>> syntax that sets parallelism; pick a block size (often 128–1024) and compute how many blocks cover your data, then optionally synchronize to surface errors before continuing.

int block = 256;
int grid  = (n + block - 1) / block;
scale<<<grid, block>>>(d_data, 2.0f, n);
cudaDeviceSynchronize();  // wait & surface any runtime errors

Results live on the device until you copy them back; once you’re done, free the device memory to avoid leaks.

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_data);

Inside a kernel, registers hold most scalars (fast but limited), while __shared__ memory lets threads in the same block cooperate; use __syncthreads() to make sure all threads reach a safe point before using shared results.

// Sums 256 elements per block into out[blockIdx.x].
// Assumes blockDim.x == 256 and the input length is a multiple of 256.
__global__ void blockSum(const float* in, float* out) {
  __shared__ float buf[256];
  int t = threadIdx.x;
  int i = blockIdx.x * blockDim.x + t;
  buf[t] = in[i];                        // stage one element per thread in shared memory
  __syncthreads();                       // all elements visible before reducing
  // Tree reduction: halve the number of active threads each step.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (t < s) buf[t] += buf[t + s];
    __syncthreads();
  }
  if (t == 0) out[blockIdx.x] = buf[0];  // thread 0 writes this block's sum
}
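
A possible host-side use of blockSum (a sketch with illustrative variable names, assuming the input length is a multiple of the 256-thread block size) launches one block per 256 elements and finishes the reduction on the CPU:

int    n      = 1 << 20;                  // input length, a multiple of 256
int    blocks = n / 256;
float *d_in = nullptr, *d_part = nullptr;
cudaMalloc(&d_in,   n * sizeof(float));   // device input (assumed filled elsewhere)
cudaMalloc(&d_part, blocks * sizeof(float));

std::vector<float> h_part(blocks);
blockSum<<<blocks, 256>>>(d_in, d_part);                  // one partial sum per block
cudaMemcpy(h_part.data(), d_part, blocks * sizeof(float),
           cudaMemcpyDeviceToHost);

float total = 0.0f;
for (int b = 0; b < blocks; ++b) total += h_part[b];      // combine partial sums on the CPU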

Below is a table of common CUDA concepts. Each row outlines the concept, gives a brief description, shows a typical usage or syntax, and provides extra notes or examples for clarity.

| Concept | Description | Syntax / Example | Extra Notes / Examples |
|---------|-------------|------------------|------------------------|
| Kernel | A function executed on the GPU in parallel by many threads; marked with __global__. | __global__ void myKernel(float* data) { /* GPU code */ } launched with myKernel<<<numBlocks, threadsPerBlock>>>(d_data); | Each thread runs myKernel() independently; the grid and block dimensions specify the total thread count; often used for data-parallel tasks (e.g., vector addition). |
| Block | A group of threads that can cooperate via shared memory and synchronization. | Specified in the kernel launch: <<<numBlocks, threadsPerBlock>>> | myKernel<<<64, 256>>>(...) creates 64 blocks, each with 256 threads; threads in the same block can access shared memory declared with __shared__. |
| Grid | The entire set of blocks launched for a kernel. | Same launch syntax, one level up (numBlocks). | A grid can be 1D, 2D, or 3D, matching the shape of the data (e.g., image processing often uses 2D grids). |
| Thread | The basic unit of execution on the GPU; each thread has its own registers and identifies itself via threadIdx within a block and blockIdx within the grid. | int tx = threadIdx.x; int bx = blockIdx.x; int idx = bx * blockDim.x + tx; | threadIdx, blockIdx, and blockDim are built-in variables; each thread typically handles one or more elements of the data. |
| Barrier / __syncthreads() | Synchronizes all threads in the same block; no thread passes the barrier until all threads have reached it. | __shared__ float temp[256]; temp[threadIdx.x] = data[threadIdx.x]; __syncthreads(); /* all threads have updated temp here */ | Only valid within a kernel; ensures partial updates in shared memory are visible to all threads before proceeding; does not synchronize across multiple blocks. |
| Shared Memory | Fast on-chip memory visible to all threads in the same block; declared with __shared__. | __global__ void myKernel(float* in) { __shared__ float tile[256]; /* ... */ } | Useful for caching data reused by threads in a block; much faster than global memory but limited in size; must be accessed carefully to avoid bank conflicts. |
| Global Memory | Main device memory accessible by all threads, with higher access latency than shared memory. | Allocated via cudaMalloc((void**)&d_ptr, size); accessed in kernels by pointer (d_ptr[index]). | Large capacity but slower than shared or local memory; coalesced access (threads accessing consecutive addresses) is important for best performance. |
| Host Code | CPU-side code that sets up kernel launches, manages device memory, and coordinates data transfers. | cudaMalloc((void**)&d_data, size); myKernel<<<blocks, threads>>>(d_data); cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); | The host orchestrates GPU operations (allocation, copy, launch); synchronization functions (cudaDeviceSynchronize()) ensure the GPU has finished before the CPU uses the results. |
| Device Function | A function that runs on the GPU and is only callable from another GPU function (kernel or device function); declared with __device__. | __device__ float deviceFunc(float x) { return x * x; } then data[threadIdx.x] = deviceFunc(data[threadIdx.x]); inside a kernel | Not callable from the host; often used to break large kernels into smaller subroutines. |
| Host Function | A normal CPU function compiled for the host; __host__ is optional unless combined with __device__ in a function qualifier. | __host__ void hostFunc() { /* runs on CPU */ } | Host code typically handles memory management and kernel launches; __host__ __device__ makes a function callable from both CPU and GPU, with some limitations. |
