How many hamsters can you fit in a GPU?

General Purpose GPU Computing:

The era of Moore’s law leading to constant speed bumps for processor clocks is drawing to a close, at least for a while. We can no longer write a simulation kernel and see the simulation times come down year on year by simply buying faster CPUs.

Intel & AMD have both stopped substantially increasing the clock speed of their processors, and are now using the still-increasing number of components per chip to create multi-core processors. However, particularly in the case of Intel’s multi-core offerings, the independent cores all share critical pathways to the system memory. Thus for the kind of matrix and vector operations that DG implementations rely on, it is relatively easy to reach maximum capacity for data flow between the memory and the cores (saturating the frontside bus). For instance, it is quite difficult to get 4x speed ups from a four-core processor, as the cores become starved of data due to conflicts for memory access.

Moving to shared clusters/super-computers is fine, apart from the problem that as such a resource inevitably becomes popular, the obtainable work throughput declines with the increasing number of users. If you count queue time as part of compute time, the speed ups offered by these systems are less impressive.

Running your own beowulf cluster can be expensive & time-consuming, in particular if you have to administer it yourself and foot the energy bill.

The picture may seem bleak. My personal (subjective) observations:

  i.   CPUs are not getting faster every year

  ii.  cores are becoming more numerous but are hard to utilize

  iii. shared cluster computing lacks interactivity (anyone remember the days when you had to book an international phone call in advance through the operator?)

  iv.  personal beowulf clusters are a pain in the neck to operate (personnel time is expensive compared to compute time)

We need an alternative, and the GPU may offer some options. In the following we discuss one possible compute model, based on the CUDA advanced programming interface for NVIDIA’s upper range of GPUs.

CUDA Terminology

See these notes for much more detail.

There is a fairly minimal amount of terminology that is needed to help understand the programming model used in the CUDA framework.

An individual GPU will be referred to as a device.

The CPU will be referred to as a host.

In particular it is useful to know what Thread, Warp, Block, and Grid refer to:

  1. Thread:
    concurrent code and associated state executed on the CUDA device (in parallel with other threads). The fine grain unit of parallelism in CUDA

  2. Warp:
    a group of threads executed physically in parallel on the G80. Technically, only a half-warp is executed “simultaneously”

  3. Block:
    a group of threads that are executed together and form the unit of resource assignment

  4. Grid:
    a group of thread blocks that must all complete before the next phase of the program can begin on the GPU.

For gpu-DG we assign the computations required for each node of the high-order nodal mesh to its own thread. No assumptions are made about the particular composition of the warps. Each element is assigned to a block. The grid corresponds to the set of all elements in the mesh [ for coarser grain parallelism we will actually assign a partition as the grid and use MPI to communicate between CPUs ].

This is by no means the only data-partitioning we could use, but it is natural given the structure of the mesh and the cohesive nature of the elemental communications.

Bank Conflicts

Shared memory on the device is divided into banks:

Each bank has a bandwidth of 32 bits per clock cycle.

Successive 32-bit words are assigned to successive banks.

The G80 has 16 banks, so bank = (32-bit word address) % 16.

16 is also the size of a half-warp.

Bank conflicts can only occur within a single half-warp; there are no bank conflicts between different half-warps.

CUDA Functions

The CUDA API can be thought of as an extension to C. It includes a number of new functions that are called by the host and allow interaction between the host and device. Even with a limited number of available functions, there are a few standout functions that provide most of the functionality that we need for creating a GPU implementation:

  1. cudaGetDevice: find the current GPU “context”

  2. cudaGetDeviceCount: find the number of graphics devices (independent GPUs) available

  3. cudaSetDevice: set the current graphics device (0-indexed integer)

  4. cudaMalloc: allocate an array on the device

  5. cudaMallocHost: an (optional) way to allocate an array on the host. Using this function to allocate memory gives CUDA the ability to track & align the memory on the host, accelerating transfers between host and device.

  6. cudaBindTexture: attach a handle to a CUDA array so that it can be treated as a texture by the device. Texture memory is the only type of device memory that is cached.

  7. cudaMemcpy: copy between arrays on host & device or two device arrays.

  8. cudaThreadSynchronize: the CPU and the GPU operate asynchronously. This function returns to the CPU when all previously invoked device kernels have completed. [ analogous to an MPI_Barrier ]

These functions are prototyped in cuda.h and cuda_runtime_api.h and can be called from your usual C code.

CUDA Syntax

Functions that are going to be executed as kernels on the device, i.e. launched from the host, should have their header prefaced with:

__global__

(Functions callable only from other device code are instead prefaced with __device__.)

Variables and arrays that are to be created in shared memory space on the multiprocessor should be prefaced by this notation as they are created:

__device__ __shared__

Variables that are to be created in the constant memory of the device should be declared with the following preface:

__device__ __constant__

See the nvidia notes for further details on declaring textures and binding arrays to these handles:

texture<float4, 1, cudaReadModeElementType> t_LIFT;

CUDA Kernels (CUrnels?)


kernel_function <<< BlocksPerGrid, ThreadsPerBlock >>> ( argument1, argument2, ... );

Simple CUDA Example


  To compile (assuming CUDA is installed in /usr/local/cuda ):


  nvcc -I/usr/local/cuda/include -c -o cudakernel.o cudakernel.cu

  gcc -I/usr/local/cuda/include -c -o cudademo.o cudademo.c

  gcc -o cudademo cudademo.o cudakernel.o -L/usr/local/cuda/lib -lcuda -lcudart -lm


#include <stdio.h>

#include "cuda.h"

#include "cuda_runtime_api.h"

int main(int argc, char **argv){


  /* registers */

  int n;

  /* device ID */

  int devid;


  /* device count */

  int devcount;


  /* number of entries in arrays */

  int N = 512;

  /* pointer to host array */

  float *h_array;


  /* pointer to gpu device array */

  float *g_array1, *g_array2;


  /* find the ID of the device in the current "context" */

  cudaGetDevice(&devid);


  /* find how many devices are available */

  cudaGetDeviceCount(&devcount);

  /* allocate array on host (via CUDA) */

  cudaMallocHost((void**) &h_array, N*sizeof(float));

  /* allocate arrays on device (via CUDA) */

  cudaMalloc((void**) &g_array1, N*sizeof(float));

  cudaMalloc((void**) &g_array2, N*sizeof(float));

  /* fill up the host array */

  for(n=0;n<N;++n)
    h_array[n] = n;


  /* copy from host array to device array */

  cudaMemcpy(g_array1, h_array, N*sizeof(float), cudaMemcpyHostToDevice);

  /* invoke kernel on device */

  void cudakernel(int N, float *g_data, float *g_result);

  cudakernel(N, g_array1, g_array2);

  /* copy from device array to host array */

  cudaMemcpy(h_array, g_array2, N*sizeof(float), cudaMemcpyDeviceToHost);


  /* print out results */

  for(n=0;n<N;++n)
    printf("%f ", h_array[n]);

  printf("\n");

  /* free device and host arrays */

  cudaFree(g_array1);
  cudaFree(g_array2);
  cudaFreeHost(h_array);

  return 0;
}
The one non-library function, cudakernel, is defined in a separate .cu CUDA file:

#include "cuda.h"

/* example kernel */

__global__ void kernel(int N, float *g_data, float *g_result){

  /* unique global thread identifier */
  int n = threadIdx.x + blockDim.x*blockIdx.x;

  /* compute squares of entries in data array,
     guarding against threads beyond the end of the array */
  if(n<N)
    g_result[n] = g_data[n]*g_data[n];
}

/* only use extern if calling code is C */

extern "C"


/* driver for kernel */

void cudakernel(int N, float *g_data, float *g_result){


  /* choose 256 threads per block for high occupancy */

  int ThreadsPerBlock = 256;


  /* find number of blocks */

  int BlocksPerGrid = (N+ThreadsPerBlock-1)/ThreadsPerBlock;


  /* invoke device on this block/thread grid */

  kernel <<< BlocksPerGrid, ThreadsPerBlock >>> (N, g_data, g_result);
}
The cudakernel function is callable from C. It launches an array of thread blocks, i.e. multiple threads; in this case a total of BlocksPerGrid*ThreadsPerBlock threads are invoked. Each thread can identify itself with a unique identifier; here we create a variable n that encodes this information.

CUDA Compilation Demo

Queue demo.

Performance Optimization

Coding a kernel for the GPU device takes a certain amount of care. It amounts to: taking care of, feeding, and keeping a whole lot of hamsters busy. Here are some rules of thumb:

  1. Problem Decomposition: Divide up your problem so that there are a relatively large number (>32, <512) of threads per block.

    Example: the 8800 gpus have 128 thread processors, each capable of organizing/executing 96 threads. The thread manager on each thread processor keeps threads in several queues. When a specific thread requests data from memory it is put into a wait queue, once the data is available it is put into a ready-to-resume queue and is resumed as soon as possible. This is why it is important to have a good number of threads available for execution in each block, allowing us to hide high memory access times.

  2. Memory types: There are several types of memory available on the device, and it is important to know their characteristics to achieve high data throughput:

    common: accessible for read & write operations by all thread processors. Memory access to this memory is very slow, taking up to 300 cycles. But keep in mind that while a thread is waiting for data from common memory, other threads can execute instructions.

    constant: accessible for read operations by all thread processors.  Limited to 64KB (?) and difficult to use efficiently.

    local:       accessible for read & write, dedicated to individual threads. Unfortunately it has been difficult to use efficiently.

    texture:   accessible for read operations by all thread processors. Cached. Fast if data is in cache. To use effectively I found it necessary to coalesce reads.

    shared:   accessible for read & write by all threads in a block. Faster than texture, but limited in size. Useful for storing modest amounts of data to be subsequently used by all threads in a block. The amount of shared memory required for a block affects the level of occupancy obtainable, i.e. how many threads can be executed simultaneously.

    register:  accessible for read & write local to each thread. This is likely the fastest memory available. But unfortunately it is very limited per thread and the occupancy rate is strongly dependent on the amount of register memory each thread requires.

    Schematically one can envision the GPU memory system relative to the block/grid thread task arrangement this way:

    [ memory hierarchy diagram omitted; see source ]

    The device diagram can be imagined as follows:

    [ device diagram omitted; see source ]

  3. Occupancy calculator: nvidia provides an Excel spreadsheet to compute the level of occupancy that your CUDA kernel can expect. First use nvcc to compile the .cu kernel file, then use ptxas to assemble the CUDA kernel. You can view the contents of the .cubin output file to find out how much local memory (lmem), register memory (reg), and shared memory (smem) your compiled/optimized kernel will actually require. Plugging these numbers into the occupancy calculator yields estimates of the occupancy rate.

    Here are two examples of the right hand side function that evaluates the elemental spatial derivative contributions for 3D Maxwells. The treatment of each is somewhat different, yielding different register counts and noticeably different performance.

    Example 1: MaxwellsGPU_VOL_Kernel (early version) (N=7):

    Parameters: lmem=0, smem=3133, reg=22, ThreadsPerBlock=120

    This can be a little tricky to interpret. It is evident that the occupancy is strongly dependent on the number of registers per thread, reg. Reducing reg by one would bump the occupancy up to 50% from 33%, but this has been a little tricky to achieve without performance degradation. We are rather at the mercy of the compilers (nvcc, ptxas).

    A faster implementation uses:

    Example 2: MaxwellsGPU_VOL_Kernel (N=7): 

    Parameters: lmem=0, smem=3133, reg=28, ThreadsPerBlock=120

    Despite the higher register count, the occupancy level is the same and more importantly the performance is higher. This is because we are using so many registers that a few more does not make much difference to the number of warps that can simultaneously execute on each multiprocessor.

  4. It’s an entertaining enterprise trying to maximize the threads per block, keep lmem=0, and minimize reg and smem in order to achieve a high occupancy rate. My experience is that it is difficult to achieve more than 66% occupancy (this requires 10 or fewer registers and modest shared memory), and 100% occupancy does not yield much better performance than 66% occupancy for non-trivial kernels. My best code performance has been achieved by a compromise between kernel complexity and register counts.

  5. Memory Bank Conflicts: The promise of the fast shared memory, simultaneously accessible by all the threads in a half-warp, is tempered by the possibility that the individual threads may request to read from the shared memory in a random-access manner. A simple rule of thumb: if you intend to use shared memory, try to ensure that each thread reads an entry that is offset by one 32-bit word from its thread neighbors. Examples:

    i. thread n grabs an entry from a shared memory array as:
    float u = s_u[n];

    ii. all threads read the same entry:
    float u = s_u[3];
    This is referred to as a “broadcast” and does not cause a conflict.

  6. Coalesced Memory Reads from Common Memory: threads in a half-warp should read consecutive, aligned addresses so that the hardware can combine the individual reads into a single memory transaction.

  7. CUDA (2.0 beta) Profiler: nvidia included a GUI-based profiler in the 2.0 version of CUDA. It provides medium-grain information at the function level, i.e. it does not give line-by-line profile information. cudaprofile gives information about the following characteristics:

    i. microseconds spent in GPU compute per kernel
    ii. microseconds spent in CPU compute (or busy wait)
    iii. number of coherent load and stores on the GPU (i.e. coalesced read/writes)
    iv. number of incoherent load and stores on the GPU
    v. number of branches (i.e. if statements processed on the GPU)
    vi. number of divergent branches (bad news)
    vii. number of local load and stores (quite bad news)
    viii. number of times the threads being processed in a warp had to be serialized, i.e. processed sequentially (bad news)

    Here is an example profile:

    [ profiler output omitted ]

    This was a good run apart from the large number of branches. To get an idea of how much time is spent in each CUrnel:

    [ per-kernel timing output omitted ]

    Not too much info here: it just tells us which CUrnel to examine most carefully for optimization.


CUDA Caveats

The current generation of nvidia’s GPUs (e.g. 8800, 9800, and C870) do not perform hardware-based double precision arithmetic operations. In what follows we assume that this is not going to be a problem (probably true for short-time time-domain computations). The real importance of this discussion is preparation for the next generation or two of devices, which will be double enabled.

The amount of memory available to each GPU is usually limited: typically 512MB for the 8800/9800 or 1.5GB for the Tesla cards.

The Big Question: Can CUDA Deliver for DG?

FDTD yielded a 5x speed up.

