GPUs (Graphics Processing Units) excel at performing the same operation on many data elements in parallel. Originally, GPUs were used just for rendering images, but their highly parallel nature also makes them ideal for general-purpose tasks with large data sets, such as scientific simulations, machine learning, and other compute-intensive workloads.
What’s special about GPU programming?
GPU architectures (and more broadly, parallel processors) often exploit a concept called time-space duality, which compares two main ways of doing parallel operations: array processors and vector processors. Understanding these helps illustrate general principles behind GPU designs.
====================================================
ARRAY PROCESSOR (Spatial Parallelism)
====================================================
| T0 | T1 | T2 | T3 |
----------------------------------------------------
PE0 | LD0 | ADD0 | MUL0 | ST0 |
PE1 | LD1 | ADD1 | MUL1 | ST1 |
PE2 | LD2 | ADD2 | MUL2 | ST2 |
PE3 | LD3 | ADD3 | MUL3 | ST3 |
----------------------------------------------------
(Same operation across all PEs in each time slot)
====================================================
VECTOR PROCESSOR (Temporal Parallelism)
====================================================
Time --> T0 T1 T2 T3 T4 T5 T6 T7
----------------------------------------------------------------
Elem0: LD0 ADD0 MUL0 ST0
Elem1: LD1 ADD1 MUL1 ST1
Elem2: LD2 ADD2 MUL2 ST2
Elem3: LD3 ADD3 MUL3 ST3
----------------------------------------------------------------
(Each data element "flows" through the pipeline over time)
How does this fit into GPU programming?
I. Array Processors (Spatial Parallelism)
II. Vector Processors (Temporal Parallelism)
In modern GPUs, there are elements of both approaches. You can think of each GPU “core” as being replicated (spatial parallelism), but within each core, pipelined operations occur (temporal parallelism). This combination helps GPUs handle large-scale parallelism efficiently.
A vector processor is an architecture that operates on entire vectors (1D arrays) with a single instruction. This approach is relevant to GPU programming because GPUs often provide SIMD (Single Instruction, Multiple Data) capabilities similar to vector processors.
For example, a single vector instruction (such as `VADD`) can add pairs of elements across two vectors in one pipeline, instead of issuing one scalar add per element.
When data is laid out in memory, sometimes elements that belong together (e.g., all elements of a column in a matrix) are not next to each other. Stride tells the hardware how far to jump in memory to reach the next element of a vector.
Consider a $4 \times 4$ matrix $A$ in row-major order:
$$ \begin{bmatrix} A(0,0) & A(0,1) & A(0,2) & A(0,3) \\ A(1,0) & A(1,1) & A(1,2) & A(1,3) \\ A(2,0) & A(2,1) & A(2,2) & A(2,3) \\ A(3,0) & A(3,1) & A(3,2) & A(3,3) \end{bmatrix} $$
In memory (row-major), row 0 comes first, then row 1, and so on. Accessing a row is contiguous ($\text{stride} = 1$), but accessing a column is not ($\text{stride} = \text{number of columns}$).
When computing
$$ C = A \times B, $$
each element $C(i,j)$ is the dot product of row $i$ of $A$ and column $j$ of $B$. Both $A$ and $B$ are also in row-major format. So:
I. Load row of $A$
$$ [A(i,0), A(i,1), \dots, A(i,n-1)] $$
II. Load column of $B$
$$ [B(0,j), B(1,j), B(2,j), \dots, B(p-1,j)] $$
III. Vector multiply and accumulate
Multiply corresponding elements and sum them up:
$$ C(i,j) = \sum_{k=0}^{n-1} A(i,k) \cdot B(k,j) $$
By setting a vector stride correctly, a single vector instruction can fetch or process these elements in a single pass. This eliminates the need for complex address computations in a loop, letting the hardware handle it efficiently.
In practice, vector loads/stores are a key part of high-performance computing. When data is well-organized in memory and has a suitable stride, the hardware can fetch multiple elements at once, reducing the number of individual instructions and improving overall throughput.
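As a concrete sketch of this indexing (plain host-side C++; the $4 \times 4$ size and the element values are illustrative assumptions), the pointer arithmetic below shows how a unit-stride access walks a row of $A$ while a stride-$N$ access walks a column of $B$ in row-major storage:

#include <cstdio>

int main() {
    const int N = 4;                      // assumed 4x4 matrices, stored row-major
    float A[N * N], B[N * N];
    for (int k = 0; k < N * N; ++k) { A[k] = (float)k; B[k] = (float)k; }

    int i = 1, j = 2;                     // compute C(1,2) as an example
    float c = 0.0f;
    for (int k = 0; k < N; ++k) {
        float a = A[i * N + k];           // row of A: consecutive elements, stride 1
        float b = B[k * N + j];           // column of B: every N-th element, stride N
        c += a * b;
    }
    printf("C(%d,%d) = %f\n", i, j, c);
    return 0;
}

A vector load with stride 1 can fetch the row of `A`, and a vector load with stride `N` can fetch the column of `B`, without any per-element address computation in software.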
To feed data fast enough for such wide operations, memory is often divided into multiple banks. Each bank has its own Memory Address Register (MAR) and Memory Data Register (MDR), and they all connect to the CPU via a shared address bus and data bus. The following diagram shows the high-level structure of a system with multiple memory banks:
+----------------+
| CPU |
| (exec unit, |
| control) |
+--------+-------+
|
+------------+------------+
| Address Bus & |
| Data Bus |
+------------+------------+
|
-------------------------------------------------
| | | ... | |
v v v v v
+-----------+ +-----------+ +-----------+ ... +-----------+
| Bank 0 | | Bank 1 | | Bank 2 | | Bank 15 |
| (MAR,MDR) | | (MAR,MDR) | | (MAR,MDR) | | (MAR,MDR) |
+-----------+ +-----------+ +-----------+ ... +-----------+
Memory banking is used for high-performance vector or GPU architectures because it increases effective memory bandwidth. If each element of a vector is mapped to a different bank, the system can read multiple elements simultaneously or in rapid succession.
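A rough sketch of how that mapping works (the low-order interleaving scheme and the bank count of 16 are assumptions matching the diagram, not a specific machine's layout): consecutive elements of a stride-1 vector land in consecutive banks, so many elements can be serviced at once.

#include <cstdio>

int main() {
    const int NUM_BANKS = 16;                 // assumed bank count (matches the diagram)
    // Low-order interleaving: element k of a stride-1 vector lives in bank k % NUM_BANKS,
    // so up to 16 consecutive elements can be serviced by 16 different banks in parallel.
    for (int k = 0; k < 8; ++k) {
        int bank   = k % NUM_BANKS;           // which bank holds element k
        int offset = k / NUM_BANKS;           // word within that bank
        printf("element %d -> bank %d, offset %d\n", k, bank, offset);
    }
    return 0;
}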
A loop is vectorizable when its iterations are independent: computing each `C[i]` solely from `A[i]` and `B[i]` satisfies independence. The loop `for i = 0 to 49, C[i] = (A[i] + B[i]) / 2` qualifies for vectorization because each element is computed separately, whereas a loop referencing `C[i-1]` would introduce dependencies; for example, all 50 averages can be computed simultaneously without waiting for prior iterations.

A vector instruction sequence that loads `A` into V0, loads `B` into V1, adds them into V2, shifts V2 right into V3, and then stores V3 into `C` processes multiple indices per step, whereas scalar instructions repeat these operations per element; for example, a vector add replaces 50 scalar adds. On integer hardware, the shift `(A[i] + B[i]) >> 1` produces the same result as the division by 2, only faster.

Vectorizable loops are a prime candidate for GPU acceleration. Each GPU thread might handle a portion of the data, or specialized instructions can process multiple elements per thread. In either case, independence between loop iterations makes it easy to parallelize the computation.
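A minimal CUDA sketch of this idea (the kernel name, array length, and launch shape are illustrative assumptions): each thread computes one element of the average, which is exactly the independence property described above.

__global__ void averageKernel(const int* A, const int* B, int* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) {
        // Shift is equivalent to dividing by 2 for non-negative integers.
        C[i] = (A[i] + B[i]) >> 1;
    }
}

// Host-side launch sketch (assuming d_A, d_B, d_C were allocated with cudaMalloc
// and filled via cudaMemcpy):
//   int n = 50;
//   int block = 128;
//   int grid  = (n + block - 1) / block;
//   averageKernel<<<grid, block>>>(d_A, d_B, d_C, n);

Because no iteration reads another iteration's result, all 50 threads can run at the same time.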
On a CPU with SIMD hardware, a single `VADD` instruction can add eight numbers at once using vector registers:
===============================================
SIMD (Single Instruction, Multiple Data)
===============================================
+------------------------+
| Instruction Fetch |
| (one instruction) |
+-----------+-----------+
|
v
+-----------------+-----------------+-----------------+-----------------+
| Lane 0 | Lane 1 | Lane 2 | Lane 3 |
| (data element) | (data element) | (data element) | (data element) |
+-----------------+-----------------+-----------------+-----------------+
(One instruction is applied to multiple data elements simultaneously
in parallel lanes. All lanes share the same Program Counter.)
===============================================
SIMT (Single Instruction, Multiple Threads)
===============================================
+-------------------------------------------------+
| Warp Scheduler / Dispatch Unit |
+------------------------+-------------------------+
|
(Groups threads | Same instruction if
into warps | they converge in PC)
v
+------------+ +------------+ +------------+ +------------+
| Thread 0 | | Thread 1 | | Thread 2 | | Thread 3 |
| Instr. Strm| | Instr. Strm| | Instr. Strm| | Instr. Strm|
+------------+ +------------+ +------------+ +------------+
(Each thread has its own instruction stream and Program Counter,
but threads that execute the same instruction can run in lockstep
as a warp, effectively behaving like SIMD when they follow one path.)
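To make the "one path" caveat concrete, here is a small CUDA sketch (the kernel name and branch condition are made up for illustration): when threads of the same warp take different sides of the `if`, the hardware runs both sides one after the other with the non-participating threads masked off.

__global__ void divergentKernel(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads in the same warp disagree on this condition (even vs. odd values),
        // so the warp executes the "then" path and the "else" path sequentially,
        // masking off the threads that did not take each path.
        if (in[i] % 2 == 0) {
            out[i] = in[i] * 2;
        } else {
            out[i] = in[i] + 1;
        }
    }
}

The next diagram zooms out to a full GPU: a front end (PC, mask, instruction cache, decode, scalar pipelines, SIMD execution), an interconnection network, multiple shader cores, and memory controllers attached to GDDR memory.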
+--------------------------------+
| PC, Mask |
| +--------------+ |
| | I-Cache | |
| +--------------+ |
| Decode |
| |
| +--------------+ +--------+ |
| | Scalar Pipe | | ... | | <-- Possibly multiple
| +--------------+ +--------+ | scalar pipelines
| | Scalar Pipe | |
| +--------------+ |
| ... |
| SIMD Exec |
+---------------+----------------+
|
|
+----------------------+-----------+--------------------+
| Interconnection Network |
+------------+------------+------------+----------------+
| | |
v v v
+-----------+ +-----------+ +-----------+ ...
|ShaderCore | |ShaderCore | |ShaderCore |
+-----------+ +-----------+ +-----------+
| | |
+-----+------+------+-----+
| | |
+---------------------+------+------+---------------------+
| Memory Controller(s) |
| +--------------+ |
| | GDDRx | (e.g., GDDR3, GDDR5) |
+-----------------+--------------+------------------------+
SIMD Execution: Executes vector (wide) operations for parallel data processing (shaders).
Interconnection Network: Connects the front-end and memory subsystem to multiple shader cores (compute units).
Shader Cores: Each core has ALUs, registers, and scheduling hardware for parallel threads (shaders).
Memory Controllers + GDDR: Service memory requests from the shader cores and interface with off-chip graphics DRAM (e.g., GDDR3, GDDR5), which supplies the bandwidth the cores need.
Typical GPU Execution Flow (Three Steps):
I. Load Data to GPU
II. Execute GPU Kernel
III. Copy Results Back
Here’s a simplified diagram showing the basic data flow:
+-------------------------+ Step 1 +------------------------+
| CPU (Host) | --(Host->Device)--> | GPU (Device) |
| | <-(Device->Host)-- | |
| (Runs host code) | Step 3 | (Executes GPU kernels) |
+-------------------------+ +-----------+------------+
^
| Step 2
|
Kernel
Launch
Step 1 (Load Data to GPU): the host allocates device memory with `cudaMalloc` and transfers input data using `cudaMemcpy`, whereas skipping this step leaves the GPU without access to the necessary inputs; for example, image pixels must be copied to device memory before a filter kernel can run.

Step 2 (Execute GPU Kernel): the host launches a kernel such as `myKernel<<<blocks, threads>>>(args)`, whereas without this step the GPU performs no computation; for example, a matrix multiplication kernel assigns rows and columns to thousands of threads in parallel.

Step 3 (Copy Results Back): the host retrieves outputs with `cudaMemcpy`, whereas skipping this step is only possible if further GPU processing or direct display output is planned; for example, graphics rendering often leaves images in device memory for display without returning them to the CPU.

A launch such as `KernelA<<<nBlk, nThr>>>(args)` specifies grid and block dimensions, whereas omitting this launch leaves the GPU idle; for example, setting `nBlk=64` and `nThr=128` spawns 8192 threads to process data. The same kernel code, e.g., `C[i] = A[i] + B[i]`, is written once and applied across all threads. Threads in a warp execute in lockstep, so an `if` condition that splits threads forces the warp to execute both paths sequentially.

Below is a diagram that visualizes how blocks and warps fit into the overall GPU execution model. Each grid consists of one or more blocks, and each block contains many threads. Under the hood, the GPU hardware groups these threads into warps (on NVIDIA GPUs, typically 32 threads per warp).
+-------------------------------------------------------------+
| GPU Grid |
| (launched by Kernel<<< numBlocks, threadsPerBlock >>>(...)) |
| |
| +------------------+ +------------------+ |
| | Block 0 | | Block 1 | |
| | (threads) | | (threads) | ... |
| | +--------------+ | | +--------------+ | |
| | | Warp 0 | | | | Warp 0 | | |
| | | (th0..th31) | | | | (th0..th31) | | |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | Warp 1 | | | | Warp 1 | | |
| | | (th32..th63) | | | | (th32..th63) | | |
| | +--------------+ | | +--------------+ | |
| | ... | | ... | |
| +------------------+ +------------------+ |
+-------------------------------------------------------------+
How Blocks and Warps Relate
When you launch a kernel, you choose the number of blocks (e.g., `numBlocks`) and the number of threads per block (e.g., `threadsPerBlock`).
Blocks: Each block is an independent group of threads. Threads within a block can cooperate and synchronize (e.g., with `__syncthreads()` in CUDA). Blocks are also often arranged in one-, two-, or three-dimensional configurations (e.g., a 2D grid of threads for image processing).
Threads: Each block contains many threads, and threads in the same block can communicate and synchronize.
Warps: The GPU hardware divides the threads in each block into warps that execute in lockstep. When threads in a warp diverge (e.g., on an `if` statement), the warp executes each branch sequentially, reducing overall efficiency. A warp behaves much like a SIMD unit, where a single `VADD` instruction can add four 32-bit integers at once in a 128-bit register.

A CUDA kernel is a function marked with `__global__` that runs on the GPU across many threads; each thread computes its own index and works on a slice of the data.
__global__ void scale(float* d, float s, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) d[i] *= s;
}
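The same indexing pattern extends to 2D grids and blocks, which map naturally onto images; a minimal sketch (the kernel name `scale2D`, the `width`/`height` parameters, and the plain row-major layout are assumptions):

__global__ void scale2D(float* img, float s, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {
        img[y * width + x] *= s;                     // row-major pixel access
    }
}

// Launched with a 2D configuration, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
//   scale2D<<<grid, block>>>(d_img, 2.0f, width, height);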
Before the GPU can touch your data, you allocate memory on the device; the pointer you hold on the CPU is just a handle to memory that physically lives on the GPU.
float* d_data = nullptr;
size_t bytes = n * sizeof(float);
cudaError_t err = cudaMalloc(&d_data, bytes);
if (err != cudaSuccess) { /* handle error */ }
Inputs must be moved from host RAM to device memory; `cudaMemcpy` does a raw byte copy, and you must state the direction explicitly.
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
Kernels are launched with the `<<<grid, block>>>` syntax that sets parallelism; pick a block size (often 128–1024), compute how many blocks cover your data, then optionally synchronize to surface errors before continuing.
int block = 256;
int grid = (n + block - 1) / block;
scale<<<grid, block>>>(d_data, 2.0f, n);
cudaDeviceSynchronize(); // wait & surface any runtime errors
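One common pattern for actually surfacing those errors (a sketch, assuming `<cstdio>` is included) is to check `cudaGetLastError()` for launch-configuration problems and the return value of `cudaDeviceSynchronize()` for errors raised while the kernel ran:

cudaError_t launchErr = cudaGetLastError();        // errors from the launch itself (bad config, etc.)
cudaError_t execErr   = cudaDeviceSynchronize();   // errors that appeared while the kernel executed
if (launchErr != cudaSuccess) printf("launch failed: %s\n", cudaGetErrorString(launchErr));
if (execErr   != cudaSuccess) printf("kernel failed: %s\n", cudaGetErrorString(execErr));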
Results live on the device until you copy them back; once you’re done, free the device memory to avoid leaks.
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_data);
Inside a kernel, registers hold most scalars (fast but limited), while `__shared__` memory lets threads in the same block cooperate; use `__syncthreads()` to make sure all threads reach a safe point before using shared results.
__global__ void blockSum(const float* in, float* out) {
__shared__ float buf[256];
int t = threadIdx.x;
int i = blockIdx.x * blockDim.x + t;
buf[t] = in[i];
__syncthreads();
// (tiny reduction sketch)
for (int s = blockDim.x/2; s > 0; s >>= 1) {
if (t < s) buf[t] += buf[t + s];
__syncthreads();
}
if (t == 0) out[blockIdx.x] = buf[0];
}
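To tie this together, a host-side sketch of using that kernel (the device input `d_in`, the element count `n`, and the assumption that `n` is a multiple of 256 are all illustrative): the kernel writes one partial sum per block, and the host finishes the reduction.

int block = 256;
int grid  = n / block;                             // assumes n is a multiple of 256
float* d_partial = nullptr;
cudaMalloc(&d_partial, grid * sizeof(float));

blockSum<<<grid, block>>>(d_in, d_partial);        // one partial sum per block
cudaDeviceSynchronize();

float* h_partial = new float[grid];
cudaMemcpy(h_partial, d_partial, grid * sizeof(float), cudaMemcpyDeviceToHost);

float total = 0.0f;
for (int b = 0; b < grid; ++b) total += h_partial[b];   // finish the reduction on the CPU

delete[] h_partial;
cudaFree(d_partial);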
Below is a table of common CUDA concepts. Each row outlines the concept, gives a brief description, shows a typical usage or syntax, and provides extra notes or examples for clarity.
| Concept | Description | Syntax / Example | Extra Notes / Examples |
| --- | --- | --- | --- |
| Kernel | A function executed on the GPU in parallel by many threads. Marked with `__global__`. | `__global__ void myKernel(float* data) { /* GPU code */ }` <br> `myKernel<<<numBlocks, threadsPerBlock>>>(d_data);` | Each thread runs `myKernel()` independently. Grids and blocks specify the total thread count. Often used for data-parallel tasks (e.g., vector addition). |
| Block | A group of threads that can cooperate via shared memory and synchronization. | Specified in the kernel launch: `<<<numBlocks, threadsPerBlock>>>` | Example: `myKernel<<<64, 256>>>(...)` creates 64 blocks, each with 256 threads. Threads in the same block can access shared memory declared with `__shared__`. |
| Grid | The entire set of blocks launched for a kernel. | Same syntax as the block specification, but at a higher level (`numBlocks`). | A grid can be 1D, 2D, or 3D, matching the shape of the data (e.g., image processing often uses 2D grids). |
| Thread | The basic unit of execution on the GPU. Each thread has its own registers and can identify itself by `threadIdx` within a block and `blockIdx` within the grid. | `int tx = threadIdx.x;` <br> `int bx = blockIdx.x;` <br> `int idx = bx * blockDim.x + tx;` | `threadIdx`, `blockIdx`, and `blockDim` are built-in variables. Each thread typically handles one or more elements of the data. |
| Barrier / `__syncthreads()` | Synchronizes all threads in the same block. No thread passes the barrier until all threads have reached it. | `__shared__ float temp[256];` <br> `temp[threadIdx.x] = data[threadIdx.x];` <br> `__syncthreads(); // all threads have updated temp here` | Only valid within a kernel. Ensures partial updates in shared memory are visible to all threads before proceeding. Does not synchronize across multiple blocks. |
| Shared Memory | Fast on-chip memory visible to all threads in the same block. Declared with `__shared__`. | `__global__ void myKernel(float* in) { __shared__ float tile[256]; /* ... */ }` | Useful for caching data reused by threads in a block. Access is much faster than global memory, but capacity is limited. Must be accessed carefully to avoid bank conflicts. |
| Global Memory | Main device memory accessible by all threads, but with higher access latency than shared memory. | Allocated via `cudaMalloc((void**)&d_ptr, size);` <br> Accessed in kernels by pointer (`d_ptr[index]`). | Large capacity but slower than shared or local memory. Important to ensure coalesced access (threads access consecutive addresses) for best performance. |
| Host Code | CPU-side code that sets up kernel launches, manages device memory, and coordinates data transfers. | `cudaMalloc((void**)&d_data, size);` <br> `myKernel<<<blocks, threads>>>(d_data);` <br> `cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);` | The host orchestrates GPU operations (allocation, copy, launch). Synchronization functions (`cudaDeviceSynchronize()`) ensure the GPU has finished before the CPU uses the results. |
| Device Function | A function that runs on the GPU but is only called from another GPU function (kernel or device function). Declared with `__device__`. | `__device__ float deviceFunc(float x) { return x * x; }` <br> `__global__ void kernelFunc(float* data) { data[threadIdx.x] = deviceFunc(data[threadIdx.x]); }` | Not callable from the host. Often used to break large kernels into smaller subroutines. |
| Host Function | A normal CPU function compiled for the host. Marked with `__host__` (optional unless combined with `__device__`). | `__host__ void hostFunc() { /* runs on CPU */ }` | Host code typically handles memory management, kernel launches, etc. `__host__ __device__` makes a function callable from both CPU and GPU, although limitations apply. |
In practice, the host drives all of this through the CUDA runtime API (`cudaMalloc`, `cudaMemcpy`, etc.).