Rocky Mountain Mathematics Consortium
Summer 2008: Parallel Numerical Methods for
Partial Differential Equations
Rocky Mountain Mathematics Consortium
Summer 2008: Parallel Numerical Methods for
Partial Differential Equations
How many hamsters can you fit in a GPU ?
General Purpose GPU Computing:
The era of Moore’s law leading to constant speed bumps for processor clocks is drawing to a close, at least for a while. We can no longer write a simulation kernel and see the simulation times come down year on year by simply buying faster CPUs.

Intel & AMD have both stopped substantially increasing the clock speed of their processors, and are now using the still increasing number of components per chip to create multi-core processors. However, in particular in the case of Intel’s multi-core offerings the independent cores are all sharing critical pathways to the system memory. Thus for the kind of matrix and vector operations that DG implementations rely on it is relatively easy to reach maximum capacity for data flow between the memory and the cores (saturating the frontside bus). For instance, it is quite difficult to get 4x speed ups from a four-core processor as the cores becomes starved of data due to conflicts for memory access.
Moving to shared clusters/super-computers is ok apart from the problem that as such a resource inevitably becomes popular the obtainable work throughput declines with the increasing number of users. If you count queue time as part of compute time then the speed ups offered by these systems are less impressive.
Running your own beowulf cluster can be expensive & time consuming, in particular if you have to run it yourself and have to foot the energy bill.
The picture may seem bleak, and my personal (subjective) observations:
i. CPUs are not getting faster every year
ii. cores are becoming more numerous but hard to utilize
iii. shared cluster computing lacks interactivity (anyone remember the days when you had to book an international phone call in advance through the operator ? )
iv. personal beowulf clusters are a pain in the neck to operate (personnel time is expensive compared to compute time)
We need an alternative: and the GPU may offer some options. In the following we discuss one possible compute model, based on the CUDA advanced programming interface for NVIDIA’s upper range of GPUs.

CUDA Terminology See these notes for much more detail
There is a fairly minimal amount of terminology that is needed to help understand the programming model used in the CUDA framework.
An individual GPU will be referred to as a device.
The CPU will be referred to as a host.
In particular it is useful to know what Thread, Warp, Block, and Grid refer to:
• Thread:
concurrent code and associated state executed on the CUDA device (in parallel with other threads). The fine grain unit of parallelism in CUDA
• Warp:
a group of threads executed physically in parallel in G80. Technically a half-warp is executed “simultaneously”
• Block:
a group of threads that are executed together and form the unit of resource assignment
• Grid:
a group of thread blocks that must all complete before the next phase of the program can begin on the GPU.
For gpu-DG we assign the computations required for each node of the high-order nodal mesh to its own thread. No assumptions are made about the particular composition of the Warps. Each element will be assigned to a block. The grid corresponds to the set of all elements in the mesh [ for coarser grain parallelism we will actually assign a partition as the grid and use MPI to communicate between CPU’s ].

This is by no means the only data-partitioning we could use, but it is natural given the structure of the mesh and the cohesive nature of the elemental communications.

Bank Conflicts
Each bank has a bandwidth of 32 bits per clock cycle
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp

CUDA Functions
The CUDA API can be thought of as an extension to C. It includes a number of new functions that are called by the host and allow interaction between the host and device. Even with a limited number of available functions, there are a few standout functions that provide most of the functionality that we need for creating a GPU implementation:
• cudaGetDevice: find the current GPU “context”
• cudaGetDeviceCount: find the number of graphics devices (independent GPUs) available
• cudaSetDevice: set the current graphics device (0-indexed integer)
• cudaMalloc: allocate an array on the device
• cudaMallocHost: (optional) way to allocate an array on the host. Using this function to allocate memory gives CUDA the ability to track & align the memory on the host and accelerate the process of transferring data between host and device.
• cudaBindTexture: attach a handle to a CUDA array so that it can be treated as a texture by the device. texture memory is the only type of memory that is cached.
• cudaMemcpy: copy between arrays on host & device or two device arrays.
• cudaThreadSynchronize: the CPU and the GPU operate asynchronously. This function returns to the CPU when all previously invoked device kernels have completed. [ analogous to an MPI_Barrier ]
These functions are prototyped in cuda.h and cuda_runtime_api.h and can be called from your usual C code.

CUDA Syntax
Functions that are going to be executed as kernels on the device should have their header prefaced with:
__device__ __global__
Variables and arrays that are to be created in shared memory space on the multiprocessor should be prefaced by this notation as they are created:
__device__ __shared__
Variables that are to be created in the constant memory of the device should be declared with the following preface:
__device__ __constant__
See the nvidia notes for further details on declaring textures and binding arrays to these handles:
texture<float4, 1, cudaReadModeElementType> t_LIFT;

CUDA Kernels (CUrnels ?)
__syncthreads
kernel_function <<<BlocksPerGrid, ThreadsPerGrid>>>>( argument1, argument2, ... );

Simple CUDA Example
/*
To compile (assuming cuda install in /usr/local/cuda ):
nvcc -I/usr/local/cuda/include -c -o cudakernel.o cudakernel.cu
gcc -I/usr/local/cuda/include -c -o cudademo.o cudademo.c
gcc -o cudademo cudademo.o cudakernel.o -L/usr/local/cuda/lib -lcuda -lcudart -lm
*/
#include <stdio.h>
#include "cuda.h"
#include "cuda_runtime_api.h"
main(int argc, char **argv){
/* registers */
int n;
/* device ID */
int devid;
/* device count */
int devcount;
/* number of entries in arrays */
int N = 512;
/* pointer to host array */
float *h_array;
/* pointer to gpu device array */
float *g_array1, *g_array2;
/* find number of device in current "context" */
cudaGetDevice(&devid);
/* find how many devices are available */
cudaGetDeviceCount(&devcount);
/* allocate array on host (via CUDA) */
cudaMallocHost((void**) &h_array, N*sizeof(float));
/* allocate arrays on device (via CUDA) */
cudaMalloc((void**) &g_array1, N*sizeof(float));
cudaMalloc((void**) &g_array2, N*sizeof(float));
/* fill up the host array */
for(n=0;n<N;++n)
h_array[n] = n;
/* copy from host array to device array */
cudaMemcpy(g_array1, h_array, N*sizeof(float), cudaMemcpyHostToDevice);
/* invoke kernel on device */
void cudakernel(int N, float *g_data, float *g_result);
cudakernel(N, g_array1, g_array2);
/* copy from device array to host array */
cudaMemcpy(h_array, g_array2, N*sizeof(float), cudaMemcpyDeviceToHost);
/* print out results */
for(n=0;n<N;++n){
printf("%f ", h_array[n]);
}
printf("\n");
}
The one non-library function, highlighted in green, is in a separate .cu CUDA file:
#include "cuda.h"
/* example kernel */
__global__ void kernel(int N, float *g_data, float *g_result){
int n = threadIdx.x + blockDim.x*blockIdx.x;
/* compute squares of entries in data array */
if(n<N)
g_result[n] = g_data[n]*g_data[n];
}
/* only use extern if calling code is C */
extern "C"
{
/* driver for kernel */
void cudakernel(int N, float *g_data, float *g_result){
/* choose 256 threads per block for high occupancy */
int ThreadsPerBlock = 256;
/* find number of blocks */
int BlocksPerGrid = (N+ThreadsPerBlock-1)/ThreadsPerBlock;
/* invoke device on this block/thread grid */
kernel <<< BlocksPerGrid, ThreadsPerBlock >>> (N, g_data, g_result);
}
}
The cudakernel function is callable from C. It invokes an array of thread blocks, i.e. multiple threads. In this case a total of BlocksPerGrid*ThreadsPerBlock threads are invoked. Each thread can identify itself with a unique identifier, in fact here we create a variable n that encodes this information.

CUDA Compilation Demo
Queue demo.

Performance Optimization
Coding a kernel for the GPU device takes a certain amount of care. It amounts to: taking care of, feeding, and keeping a whole lot of hamsters busy. Here are some rules of thumb:
1.Problem Decomposition: Divide up your problem so that there are a relatively large number (>32, <512) of threads per block.
Example: the 8800 gpus have 128 thread processors, each capable of organizing/executing 96 threads. The thread manager on each thread processor keeps threads in several queues. When a specific thread requests data from memory it is put into a wait queue, once the data is available it is put into a ready-to-resume queue and is resumed as soon as possible. This is why it is important to have a good number of threads available for execution in each block, allowing us to hide high memory access times.



4.It’s an entertaining enterprise trying to maximize the threads per block, keep lmem=0, and minimize rmem and smem in order to achieve a high occupancy rate. My experience is that it is difficult to achieve more than 66% occupancy (this requires 10 or less registers and modest shared memory) and 100% occupancy does not yield much better performance than 66% occupancy for non-trivial kernels. My best code performance has been achieved by a compromise of kernel complexity and register counts.
5.Memory Bank Conflicts: The promise of the fast shared memory, simultaneously accessible by all the threads in a half-warp, is tempered by the possibility that the individual threads may request to read from the shared memory in a random access manner. A simple rules of thumb is that if you intend to use shared memory, try to ensure that each thread is reading an entry that is 32bits shifted from its thread neighbors. Examples:
i. thread n grabs an entry from a shared memory array as: float u = s_u[n];
ii. all threads read the same entry float u = s_[3]; This is referred to as a “broadcast”
6.Coalesced Memory Reads from Common Memory:


8.

CUDA Caveats
The current generation of nvidia’s GPUS (e.g. 8800, 9800, and C870) do not perform hardware based double precision arithmetic operations. In what follows we are assuming that this is not going to be a problem (probably true for short time time-domain computations). The real importance of this discussion is the preparation for the next generation or two of devices that will be double enabled.
The amount of memory available to each GPU is usually limited: typically 512MB for 8800/9800 or 1.5GB for the tesla cards.

The Big Question: Can CUDA Deliver for DG ?
FDTD yielded 5x speed up

Sources
http://dis.um.es/~domingo/08/CD/26May/ProgramacionParalela/Fernandez/slides.pdf
http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/Accelerating_HPC.pdf
http://media.hpcwire.com/documents/nbody_gems3_ch31%5B1%5D.pdf
http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf
http://www.stanford.edu/class/cs148/Lectures/GPGPU%20via%20CUDA.pdf
http://multicore.amd.com/us-en/quadcore/
Excellent notes: