CUDA Tutorial for Scientific Computing

This page introduces GPU computing from a computational physics perspective. The focus is not only on how to launch CUDA kernels, but also on why GPUs can accelerate structured numerical workloads, how memory hierarchy influences performance, and how these ideas connect to stencil-based algorithms such as the Laplace operator and Jacobi iteration.

Contents

  1. Why can a GPU be faster?
  2. CUDA execution model
  3. GPU memory hierarchy
  4. 1D Laplace benchmark
  5. 2D Laplace benchmark
  6. Scaling results and interpretation
  7. Texture memory and historical context
  8. Suggested usage in teaching
  9. Image sources and attribution

1. Why can a GPU be much faster than a CPU?

The main reason is architectural specialization. A CPU is optimized for low-latency execution, complex control flow, branch prediction, and strong single-thread performance. A GPU is optimized for throughput: it devotes a much larger fraction of its hardware budget to arithmetic units and parallel execution resources. This makes it especially effective when the same operation must be applied to many data elements.

In scientific computing, this situation appears naturally in vector operations, matrix-style updates, structured-grid PDE solvers, image-style processing, and stencil-based algorithms. If the computation is regular enough, many GPU threads can work simultaneously on different parts of the domain.

CPU versus GPU architecture
Figure 1. Conceptual comparison between CPU and GPU resource allocation. The CPU devotes more hardware to control and cache, while the GPU devotes more hardware to massively parallel arithmetic execution.
Key point. A GPU is not automatically faster for every workload. The strongest gains typically appear when the computation is data-parallel, arithmetic is repeated many times, and memory access is reasonably regular.

2. CUDA execution model

CUDA expresses GPU work through a kernel launched over many threads. Threads are organized hierarchically: threads are grouped into blocks, and blocks are grouped into a grid. In numerical computing, one thread is often mapped to one array element, one mesh node, or one grid point.

CUDA thread block grid hierarchy
Figure 2. CUDA thread hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.

In one dimension, the global index is usually written as

int i = blockIdx.x * blockDim.x + threadIdx.x;

In two dimensions, the same pattern is extended to a pair of indices:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

This indexing logic is the bridge between the numerical grid and the CUDA execution hierarchy.

3. GPU memory hierarchy

Many CUDA kernels are limited less by arithmetic throughput than by data movement. For this reason, memory hierarchy is central to performance. Fast storage levels are smaller and closer to computation, while large-capacity levels are slower.

GPU memory hierarchy
Figure 3. GPU memory hierarchy, including registers, shared memory, cache levels, and global memory.
Practical interpretation. In stencil kernels, performance often depends on whether neighboring data are repeatedly reloaded from global memory or effectively reused through cache or shared memory.

4. 1D Laplace operator benchmark

A clean entry point into CUDA stencil programming is the 1D discrete Laplace operator,

(∇²u)_i ≈ u_{i+1} − 2 u_i + u_{i−1}

This is a one-dimensional stencil update, not yet a full PDE solver. It is useful pedagogically because it separates the memory-access question from the broader complexity of iterative boundary-value solvers. The benchmark compares four implementations: a naive global-memory kernel, a read-only __ldg() kernel, a texture-object kernel, and a shared-memory tiled kernel. The naive version looks like this:

__global__ void ker_laplace_naive(float* y, const float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  int prev = (i == 0) ? n - 1 : i - 1;
  int next = (i == n - 1) ? 0 : i + 1;

  y[i] = x[next] - 2.0f * x[i] + x[prev];
}

This example is ideal for introducing thread indexing, cache-aware stencil execution, and the difference between architectural assumptions and measured performance.

5. 2D Laplace operator benchmark

The natural extension is the 2D five-point stencil,

(∇²u)_{i,j} ≈ u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1} − 4 u_{i,j}

This stencil is fundamental in computational physics and engineering. It appears in diffusion, electrostatics, heat conduction, pressure equations, and iterative relaxation methods. The 2D version is more realistic than the 1D case because it introduces two-dimensional indexing, more natural spatial locality, and a stronger connection to PDE discretization.

Laplace stencil and Jacobi iteration
Figure 4. Illustration of stencil-based update structure and Jacobi-style local-neighbor averaging.
__global__ void ker_laplace2d_naive(float* y, const float* x, int nx, int ny) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i >= nx || j >= ny) return;

  int im = (i == 0)      ? nx - 1 : i - 1;
  int ip = (i == nx - 1) ? 0      : i + 1;
  int jm = (j == 0)      ? ny - 1 : j - 1;
  int jp = (j == ny - 1) ? 0      : j + 1;

  int c  = j * nx + i;
  int xm = j * nx + im;
  int xp = j * nx + ip;
  int ym = jm * nx + i;
  int yp = jp * nx + i;

  y[c] = x[xp] + x[xm] + x[yp] + x[ym] - 4.0f * x[c];
}

In this tutorial, the 2D benchmark is implemented in a scaling-oriented form so that the same code can be used to study how different memory strategies behave as the grid size increases.

6. Scaling results and interpretation

The following benchmark was obtained on an NVIDIA Tesla T4 using a fixed block size of 16 × 16. Four memory-access strategies were compared for a 2D five-point Laplace stencil with periodic boundary conditions.

Grid size      Naive (ms)   LDG (ms)   TexObject (ms)   Shared (ms)
1024 × 1024    0.0398       0.0397     0.0468           0.0674
2048 × 2048    0.1432       0.1431     0.1644           0.2377
4096 × 4096    0.5662       0.5656     0.6569           0.9082
8192 × 8192    2.2476       2.2511     2.6381           3.6918

How the methods change with grid size

As the grid size increases, the runtime of all four methods grows approximately in proportion to the number of grid points. This is exactly what one expects from a stencil that performs a nearly fixed amount of work per point. The benchmark therefore behaves like a classical memory-bound scaling test rather than a compute-bound kernel.

The ranking is essentially unchanged across all tested grid sizes:

Naive ≈ LDG  <  TexObject  <  Shared

This consistency is itself an important result. It shows that the kernel’s limiting mechanism does not fundamentally change over the tested range. The performance is dominated by memory movement, and the relative overheads of the different implementations remain structurally similar from 1024² to 8192².

Naive and LDG

The naive and LDG versions are nearly indistinguishable in performance. This means that, for this regular row-major stencil on a Tesla T4, the default hardware cache hierarchy is already highly effective. The read-only path provided by __ldg() does not produce a meaningful additional gain in this case.

TexObject

The texture-object version is consistently slower. This indicates that texture fetches do not provide an advantage for this simple five-point stencil. The access pattern is regular enough that the ordinary caching path already handles the working set efficiently.

Shared memory tiling

The shared-memory version is slowest at every grid size. The reason is that this kernel is too lightweight for explicit tiling to pay off. Halo loading, corner handling, shared-memory bookkeeping, and block synchronization all add overhead. In this benchmark, those costs are larger than the reduction in global-memory traffic.

Key lesson. Shared memory is not automatically beneficial. For regular stencil workloads on modern GPUs, optimization decisions should be based on measurements rather than assumptions.

7. Historical note: why texture memory no longer shows a clear advantage here

In earlier CUDA generations, texture memory often provided a noticeable advantage for stencil-style workloads. The reason was architectural: the texture path had a dedicated cache optimized for spatial locality, and regular 2D access patterns could benefit from it more clearly than from ordinary global-memory loads.

Over time, GPU memory systems evolved substantially. Modern architectures include much more capable cache hierarchies, and in several generations the L1 cache and texture-related caching behavior became increasingly integrated. CUDA also introduced __ldg(), which routes read-only data through a specialized cache path derived from the same broader memory-system evolution.

As a result, the historical advantage of texture memory has been weakened for many regular stencil workloads. In this benchmark, that effect is very clear: the naive implementation and the LDG version perform almost identically, while the texture-object version does not improve performance. This strongly suggests that the ordinary cache hierarchy is already effective enough for this access pattern.

This does not mean that texture memory is obsolete. It can still be useful for irregular access patterns, image-style kernels, gather operations, or cases that benefit from hardware interpolation. However, for a simple row-major five-point Laplace stencil, the modern cache system largely removes the historical reason to expect a texture-memory advantage.

8. Suggested usage in teaching

The sequence used here works well in teaching because it progresses from simple to physically meaningful examples:

  1. Begin with 1D indexing and a simple 1D Laplace operator.
  2. Extend the same idea to a 2D five-point stencil.
  3. Discuss memory traffic and why stencil kernels are frequently memory-bound.
  4. Compare cache-based and shared-memory implementations empirically.
  5. Use the benchmark to demonstrate that optimization must be evidence-driven.

9. Image sources and attribution

  1. Figure 1: CPU vs GPU architecture. Based on the CPU/GPU comparison diagram discussed by Cornell Virtual Workshop, which notes that the figure is taken from the CUDA C++ Programming Guide.
  2. Figure 2: CUDA thread / block / grid hierarchy. Standard CUDA execution-model diagram from the NVIDIA CUDA C++ Programming Guide.
  3. Figure 3: GPU memory hierarchy. Cornell Virtual Workshop memory-levels diagram for the NVIDIA Tesla V100, based on NVIDIA/Citadel teaching material.
  4. Figure 4: Laplace / Jacobi stencil illustration. Used as a teaching illustration for stencil-style iterative updates and local-neighbor averaging.