
CUDA for Scientific Computing: A Practical Guide

Bridging the gap between basic syntax and high-performance PDE solvers. This guide focuses on the engineering realities of GPU computing: hardware-aware memory management, stencil optimization, and why architectural design dictates algorithmic choices in computational physics.

Companion Materials: CUDA Crash Course

For a fast, structured introduction to GPU programming, you can download my CUDA Crash Course Note (PDF). Originally organized during my PhD research, this self-study guide pairs C and CUDA examples side-by-side so you can quickly compare the structural execution differences without getting lost in boilerplate. You can also download the accompanying source code archive (.rar) containing all the paired examples.

Contents

1. The Architectural Shift: CPU vs. GPU
2. Mapping Grids to CUDA Hierarchies
3. The Memory Bottleneck
4. 1D Laplace Benchmarking
5. 2D Laplace & Stencil Operations
6. Performance Analysis on Tesla T4
7. The Obsolescence of Texture Memory
8. Production-Grade Optimizations
9. References & Image Attribution

1. The Architectural Shift: CPU vs. GPU

CPUs are built to minimize latency for unpredictable workloads, dedicating vast die area to complex control logic and massive L3 caches. GPUs take the opposite approach: they maximize throughput by heavily prioritizing Arithmetic Logic Units (ALUs) over control logic.

Hardware allocation differences between CPU and GPU
Figure 1. Transistor allocation in CPU vs. GPU architecture. GPUs trade latency-hiding caches for sheer computational density.

For structured-grid algorithms—such as Finite-Difference Time-Domain (FDTD) or Jacobi iterations—this throughput-centric design is a natural fit. These solvers demand high arithmetic intensity across massive arrays with highly predictable memory access patterns. When properly implemented, a modern GPU can routinely deliver order-of-magnitude speedups over a multi-core CPU.

Conceptual analogy of CPU vs GPU
Figure 1.1. Single complex worker (CPU) vs. Massively parallel coordinated workers (GPU).

FDTD Simulation: Multi-threaded CPU vs. CUDA execution.

2. Mapping Grids to CUDA Hierarchies

CUDA abstractions map directly onto numerical grids. Computations are launched as kernels, executed by thousands of threads. Threads are bundled into blocks, and blocks together form a grid. In spatial simulations, the standard practice is to map one thread to one discrete mesh node.

CUDA hierarchical execution model
Figure 2. Mapping execution threads to a computational grid.

Global memory indexing in 1D:

int i = blockIdx.x * blockDim.x + threadIdx.x;

Extended to 2D for cross-section models or surface boundaries:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

3. The Memory Bottleneck

In computational physics, FLOPs are practically free; data movement is expensive. Real-world solver performance is almost entirely bound by memory bandwidth. Understanding the memory hierarchy is therefore a prerequisite for writing fast kernels.

Memory layout of a modern GPU
Figure 3. Memory hierarchy: balancing latency, bandwidth, and capacity.

4. 1D Laplace Benchmarking

To isolate memory access behavior from arithmetic complexity, we use the 1D discrete Laplace operator:

(∇²u)_i ≈ u_{i+1} − 2u_i + u_{i−1}   (grid spacing normalized to Δx = 1, matching the kernels below)

We evaluate four implementation strategies (naive global-memory loads, read-only cache loads via __ldg, texture objects, and shared-memory tiling) to measure how the hardware responds. The naive version:

__global__ void ker_laplace_naive(float* y, const float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Periodic boundary conditions
  int prev = (i == 0) ? n - 1 : i - 1;
  int next = (i == n - 1) ? 0 : i + 1;

  y[i] = x[next] - 2.0f * x[i] + x[prev];
}

5. 2D Laplace & Stencil Operations

Scaling up to 2D, the five-point stencil is the backbone of diffusion solvers, electrostatics, and TM/TE mode FDTD schemes:

(∇²u)_{i,j} ≈ u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} − 4u_{i,j}
5-point spatial stencil
Figure 4. Central difference representation for 2D spatial derivatives.
__global__ void ker_laplace2d_naive(float* y, const float* x, int nx, int ny) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i >= nx || j >= ny) return;

  int im = (i == 0)      ? nx - 1 : i - 1;
  int ip = (i == nx - 1) ? 0      : i + 1;
  int jm = (j == 0)      ? ny - 1 : j - 1;
  int jp = (j == ny - 1) ? 0      : j + 1;

  int c  = j * nx + i;
  y[c] = x[j * nx + ip] + x[j * nx + im] + x[jp * nx + i] + x[jm * nx + i] - 4.0f * x[c];
}
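Of the four variants benchmarked in Section 6, only the naive kernel is shown above. The read-only-cache (LDG) variant is essentially a one-line change, sketched here; it routes loads through the read-only data path with __ldg(), although on Turing-class GPUs the compiler often emits the same instruction automatically for const __restrict__ pointers, which is consistent with Naive ≈ LDG in the benchmark table:

```cuda
__global__ void ker_laplace2d_ldg(float* y, const float* __restrict__ x,
                                  int nx, int ny) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i >= nx || j >= ny) return;

  int im = (i == 0)      ? nx - 1 : i - 1;
  int ip = (i == nx - 1) ? 0      : i + 1;
  int jm = (j == 0)      ? ny - 1 : j - 1;
  int jp = (j == ny - 1) ? 0      : j + 1;

  int c = j * nx + i;
  // __ldg() forces each load through the read-only (texture) data cache.
  y[c] = __ldg(&x[j * nx + ip]) + __ldg(&x[j * nx + im])
       + __ldg(&x[jp * nx + i]) + __ldg(&x[jm * nx + i])
       - 4.0f * __ldg(&x[c]);
}
```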

6. Performance Analysis on Tesla T4

3D FDTD Performance Benchmark

Beyond lightweight stencil tests, it is also useful to show what a real 3D scientific solver can achieve after a CUDA pipeline has been implemented. In the present case, a 3D FDTD solver with dispersive material updates, real-time spectral accumulation, and Ex data output was executed on a publicly accessible NVIDIA T4 GPU.

Representative result: a 600 × 600 × 200 simulation with 12,000 time steps, including real-time spectral accumulation and Ex field output, was completed in less than 30 minutes on a T4 GPU.
3D FDTD performance benchmark on NVIDIA T4
Figure 5. Baseline 3D CUDA benchmark on an NVIDIA T4. This table gives a practical sense of how grid size and numerical precision affect sustained throughput, and helps connect introductory CUDA concepts to production-scale scientific computing.

This benchmark is not meant to replace the kernel-level discussion below. Rather, it gives readers a more concrete sense of the scale at which GPU acceleration becomes transformative for end-to-end computational electromagnetics workflows.

Executing the 2D benchmark on an NVIDIA T4 GPU (using 16×16 blocks) reveals crucial insights about modern hardware behavior.

Grid Resolution    Naive (ms)   LDG (ms)   TexObject (ms)   Shared (ms)
1024 × 1024        0.0398       0.0397     0.0468           0.0674
2048 × 2048        0.1432       0.1431     0.1644           0.2377
4096 × 4096        0.5662       0.5656     0.6569           0.9082
8192 × 8192        2.2476       2.2511     2.6381           3.6918

Engineering Takeaways

Execution time scales linearly with grid size, confirming that all four kernels are strictly memory-bound. More importantly, the speed ordering stays rigid across every resolution: Naive ≈ LDG < TexObject < Shared.

Practical Rule: Never assume shared memory automatically guarantees a speedup. For low arithmetic-intensity stencils on modern architectures (Volta/Turing/Ampere onwards), the hardware L1 cache often outperforms explicit manual tiling.

7. The Obsolescence of Texture Memory

In the Kepler and Maxwell eras, exploiting texture memory was mandatory for high-performance 2D/3D wave propagation solvers. The texture cache was explicitly designed for spatial locality, vastly outperforming the rudimentary data caches of the time.

However, as unified cache architectures evolved, standard global memory accesses (our "Naive" method) caught up. Today, texture units should be strictly reserved for their intended use cases: hardware-accelerated interpolation, boundary clamping, and unstructured image processing.

8. Production-Grade Optimizations

When hardware caches handle basic stencils efficiently, further speedups in industrial PDE solvers (e.g., 3D FDTD, Maxwell solvers) require addressing global memory bandwidth and hardware occupancy directly.

I. Enforcing Memory Coalescing

GPUs service memory requests in 32- or 128-byte segments. If all 32 threads in a warp access a contiguous, aligned region of memory, the accesses are "coalesced" into a single transaction instead of up to 32 separate ones.

Implementation: In multiphysics simulations, using Array of Structures (AoS) (e.g., struct {float Ex, Ey, Hz;} grid[N]) breaks continuity. To achieve maximum bandwidth, data structures must be refactored into a Structure of Arrays (SoA) (e.g., independent arrays for Ex[N], Ey[N]), ensuring 100% coalesced access for every field component.

II. Kernel Fusion

Writing intermediate fields to global memory, only to read them back in the next sequential step, is a severe bottleneck. Data transfer off-chip frequently dominates the runtime of a solver.

Implementation: Combine sequential operations—such as calculating gradients, updating field values, and applying absorbing boundary conditions (like UPML/CPML)—into a single, robust kernel. Retaining intermediate values within registers eliminates off-chip memory round-trips entirely.

III. Thread Coarsening and Register Reuse

The standard "one thread per node" mapping creates significant instruction overhead in massive 3D arrays.

Implementation: Assign a single thread to compute a small cluster of nodes (e.g., a 1×4 column of elements). This reduces the total number of blocks launched. Crucially, boundary data shared between those nodes can be loaded once from global memory and held in registers for near-zero-latency reuse. This keeps the reuse out of the cache hierarchy entirely and substantially increases instruction-level parallelism (ILP).

9. References & Image Attribution

  1. Figure 1: Hardware allocation derived from NVIDIA CUDA C++ Programming Guide.
  2. Figure 2: Thread hierarchy mapping, NVIDIA Developer Documentation.
  3. Figure 3: GPU memory architecture, Cornell Virtual Workshop.
  4. Figure 4: Standard numerical methods 5-point stencil representation.