How many hamsters can you fit in a GPU ?

General Purpose GPU Computing & DG:

We have now seen the basic mechanics of general purpose GPU programs. The nvidia documentation

covers basic programming using CUDA for the GPU, so we need to get our hands dirty and try different strategies for DG.

Up to this point we have focused on choosing:

1. 1.the best nodes for the triangles/tetrahedra

2. 2.the best modes to form the operators

3. 3.the most loosely coupled numerical method (DG)

4. 4.an abstraction of the operations required

5. 5.a method with significant operator templating

6. 6.a method suitable for hyperbolic problems (with desirable locality)

7. 7.a highly scalable method (scales to 500+ processors even on older hardware)

Can we bring this approach to the desktop ?

Natural Data Decomposition

Recall that the CUDA thread manager requires us to decide on the number of blocks for the grid and the number of threads per block.

The natural data decomposition (probably not optimal) is:

# blocks for the grid :=  # elements in the mesh = K

# threads per block  :=  # nodes in an element   = Np

This is by no means the only data-partitioning we could use, but it is natural given the structure of the mesh and the cohesive nature of the elemental communications.

Let’s review the sequence of operations for the 2D Maxwell’s TM equations. In MATLAB:

It is initially very tempting to to take a BLAS based version of this
(see nudg++  trunk/Src/Examples2D/Maxwells2D/Maxwell2D_RHS.cpp)

and port directly to CUDA using nvidia’s cuBLAS.

// Maxwell2D_RHS.cpp

// function [rhsHx, rhsHy, rhsEz] = MaxwellRHS2D(Hx,Hy,Ez)

// 2007/06/06

//---------------------------------------------------------

#include "Maxwell2D.h"

//---------------------------------------------------------

void Maxwell2D::RHS()

//---------------------------------------------------------

{

// function [rhsHx, rhsHy, rhsEz] = MaxwellRHS2D(Hx,Hy,Ez)

// Purpose  : Evaluate RHS flux in 2D Maxwell TM form

// Define field differences at faces

dHx = Hx(vmapM)-Hx(vmapP);

dHy = Hy(vmapM)-Hy(vmapP);

dEz = Ez(vmapM)-Ez(vmapP);

// Impose reflective boundary conditions (Ez+ = -Ez-)

dHx(mapB)=0.0; dHy(mapB)=0.0; dEz(mapB)=2.0*Ez(vmapB);

// evaluate upwind fluxes

alpha = 1.0;

ndotdH =  nx.dm(dHx) + ny.dm(dHy);

fluxHx =  ny.dm(dEz) + alpha*(ndotdH.dm(nx) - dHx);

fluxHy = -nx.dm(dEz) + alpha*(ndotdH.dm(ny) - dHy);

fluxEz = -nx.dm(dHy) + ny.dm(dHx) - alpha*dEz;

// local derivatives of fields

// compute right hand sides of the PDE's

rhsHx = -Ezy  + LIFT*(Fscale.dm(fluxHx))/2.0;

rhsHy =  Ezx  + LIFT*(Fscale.dm(fluxHy))/2.0;

rhsEz =  CuHz + LIFT*(Fscale.dm(fluxEz))/2.0;

}

But here’s the bad news: First recall that there is a substantial latency for memory accesses and then look at the large number of independent BLAS like operations (dotmul, gemm). This is a critical problem. Each operation requires the loading of large arrays from the device’s common memory onto the thread processors and then transfer of results back to the common memory. Broadly speaking the operation partition of the above code is at the wrong scale. Furthermore, even if we were to neglect the level-1 BLAS operations as potentially small potatoes, and just focus on the level-3 matrix multiplies we would find results that echo other’s testing of cuBLAS for sgemm which revealed relatively low performance for the size of matrices we will be using.

In the absence of a really smart, application level optimizer for CUDA we are going to have to reorganize the sequence of operations to take advantage of data locality. A couple of useful rules:

1. 1.once we load a piece of data into memory we should do as many operations on it as possible

2. 2.it is useful to write a new version of the code that is:
i. in regular C
ii. is cache aware
iii. is based on floats
iv. is tested

3. 3.this kind of rewrite is not necessarily a bad thing.

4. 4.use the C based CPU code to test the GPU code for correctness.

5. 5.it is a lot easier to debug the C code running serially than it is to debug the GPU CUDA code running in up to 8192 threads !! [ this is also true for debugging MPI code. Always make sure you have a correct serial reference code first ]

CPU v. GPU code

Speed ups

As an example of the kind of speed up that is possible I took the 3D linearized acoustics code and compared the performance on CPU and GPU (running at with N between 2 and 9):

[Acoustics 3D Tesla C870]

Speed up is somewhat unfair -- we are comparing (time for compute on one core)/(time for compute on GPU). Thus we are only using 1/2 of the cores available on the CPU. Further, the reference CPU code was not optimized manually with SSE instructions (Streaming SIMD Extensions), although they may have been invoked by the compiler under optimization. The SSE instructions allow single instructions to be performed on multiple data, in particular on registers of 128bits. Clearly some avenues to explore.

However, in the next section we show some astonishing performance figures for computing on one GPU.

GFLOPS

It is straightforward to sum up the total number of floating point operations used during a DG simulation which we can use to determine how many billion operations per second (GFLOPS) are being performed.

[Acoustics 3D Tesla C870]

The reference CPU code achieves a fairly uniform 3 GFLOPS, essentially bound by the CPU’s clock speed of 3GHz. However, the GPU (a Tesla C870) obtains between 20 GFLOPS for N=2 and 150+GFLOPS for N=7. Even if we SSE optimized the CPU code and used the multiple cores the CPU would still be a long way behind the GPU.

The GPU can actually obtain higher GFLOPS. These calculations are slowed down by the requirement of reading/writing large arrays. Codes/algorithms with this requirement are sometimes referred to as “memory bound”

Haamster Power

The Tesla C870 is a very high end accelerator card from nvidia. We built a PC with off the shelf components to see if it could also achieve these high speed ups. We used two nvidia 9800GX2 graphics cards, each consisting of two 8800GT cards. We used a dual core Intel proc for CPU and most importantly a 1350W power supply. Specific details available on request.

For the 3D time-domain Maxwell’s equations we obtained the following speed up on the DIY GPU-PC, code-named haamster:

[Maxwells 3D 9800GX2 on 1 GPU]

The speed ups are slightly less impressive because the CPU on this PC is faster than those on the Tesla, and there is more data involved [ 6 fields v. 4 fields for acoustics ]. The variation for N>5 is due to variable amounts of loop unrolling, to be discussed in the next section.

gpuDG Kernels

How did the code achieve 150+ GFLOPS ?  Let’s examine the kernel used for the elemental derivatives for Maxwell’s equations (i.e. the curl of the electric and magnetic fields). It’s a little complicated to so we will break it up into pieces. Non-code comments in red:

1) It would be neat if we could load the Dr, Ds, Dt matrices into constant memory or even shared memory and only load them once. Unfortunately they are too big for these resources, so we chose to buffer the solution values and geometric factors:

__global__ void MaxwellsGPU_VOL_Kernel3D(float *g_rhsQ){

// We use shared memory as fast memory to store the fields values at all nodes in

// the current element (block)

__device__ __shared__ float s_Q[p_Nfields*BSIZE];

__device__ __shared__ float s_facs[12];

2) We can use the thread/block coordinates any way we wish to. Here we choose to assign each block to an element, and each thread in the block to a node:

// Find which node and element this thread is in charge of

const int k = blockIdx.x;

3) To buffer the field data we load it into shared memory, using (hopefully) coalesced reads. In this case each thread of a block loads one nodal value for the Hx,Hy,Hz... fields:

// Perform coalesced reads of the field data

int m = n+k*p_Nfields*BSIZE;

int id = n;

s_Q[id] = tex1Dfetch(t_Q, m); m+=BSIZE; id+=BSIZE;

s_Q[id] = tex1Dfetch(t_Q, m); m+=BSIZE; id+=BSIZE;

s_Q[id] = tex1Dfetch(t_Q, m); m+=BSIZE; id+=BSIZE;

s_Q[id] = tex1Dfetch(t_Q, m); m+=BSIZE; id+=BSIZE;

s_Q[id] = tex1Dfetch(t_Q, m); m+=BSIZE; id+=BSIZE;

s_Q[id] = tex1Dfetch(t_Q, m);

// Load in the geometric factors for this element

if(p_N<3 && n==0)

for(m=0;m<12;++m)

s_facs[m] = tex1Dfetch(t_vgeo, 12*k+m);

else if(n<12 && p_N>=3)

s_facs[n] = tex1Dfetch(t_vgeo, 12*k+n);

5) We have to make sure that all the threads get to here before we start using the shared memory variables. Fortunately the __syncthreads call only takes a few cycles:

// Make sure all threads have completed their reads into shared memory

6) I tried lots of tricks to reduce the number of variables, with little success. Sometimes more verbose code is easier for the compiler to optimize:

// Variables for local derivatives of magnetic and electric fields

float dHxdr=0,dHxds=0,dHxdt=0;

float dHydr=0,dHyds=0,dHydt=0;

float dHzdr=0,dHzds=0,dHzdt=0;

float dExdr=0,dExds=0,dExdt=0;

float dEydr=0,dEyds=0,dEydt=0;

float dEzdr=0,dEzds=0,dEzdt=0;

float Q;

7) 50% of the entire GPU time (including all kernels) is spent in this single loop. It is really the inner loop of three nested loops (think level 3 BLAS). We only need this inner loop because the thread/block decomposition is effectively taking care of the outer two loops:

// Perform matrix-vector product to differentiate each component of electric and magnetic fields

for(m=0;p_Np-m;){

// Use texture memory bound version of local derivative matrices

float4 D = tex1Dfetch(t_DrDsDt, n+m*BSIZE);

// Here D.x refers to entry in Dr

id = m;

Q = s_Q[id]; dHxdr += D.x*Q; dHxds += D.y*Q; dHxdt += D.z*Q; id += BSIZE;

Q = s_Q[id]; dHydr += D.x*Q; dHyds += D.y*Q; dHydt += D.z*Q; id += BSIZE;

Q = s_Q[id]; dHzdr += D.x*Q; dHzds += D.y*Q; dHzdt += D.z*Q; id += BSIZE;

Q = s_Q[id]; dExdr += D.x*Q; dExds += D.y*Q; dExdt += D.z*Q; id += BSIZE;

Q = s_Q[id]; dEydr += D.x*Q; dEyds += D.y*Q; dEydt += D.z*Q; id += BSIZE;

Q = s_Q[id]; dEzdr += D.x*Q; dEzds += D.y*Q; dEzdt += D.z*Q;

++m;

// ..... loop unrolling omitted for brevity

}

NOTE: the above implementation of the matrix-matrix multiply is the “naive” approach. We do not suffer too much because we are reusing the premultiplying matrix 6 times.

8) Again, lots of experiments with different variables yielded very little speed up compared to this relatively transparent notation for the geometric factors:

// Force geometric factors into registers

const float drdx= s_facs[0];

const float drdy= s_facs[1];

const float drdz= s_facs[2];

const float dsdx= s_facs[4];

const float dsdy= s_facs[5];

const float dsdz= s_facs[6];

const float dtdx= s_facs[8];

const float dtdy= s_facs[9];

const float dtdz= s_facs[10];

9) Finally we locate the destination for the result in the residual vector:

// Find location of this node in list

m = n+p_Nfields*BSIZE*k;

10) Once we know where the result should go we apply the chain rule to compute the curls of the electric and magnetic fields, and let the compiler sort this into multiply/add operations as best it can:

// Apply chain rule to evaluate curls

g_rhsQ[m] = -(drdy*dEzdr+dsdy*dEzds+dtdy*dEzdt - drdz*dEydr-dsdz*dEyds-dtdz*dEydt); m += BSIZE;

g_rhsQ[m] = -(drdz*dExdr+dsdz*dExds+dtdz*dExdt - drdx*dEzdr-dsdx*dEzds-dtdx*dEzdt); m += BSIZE;

g_rhsQ[m] = -(drdx*dEydr+dsdx*dEyds+dtdx*dEydt - drdy*dExdr-dsdy*dExds-dtdy*dExdt); m += BSIZE;

g_rhsQ[m] =  (drdy*dHzdr+dsdy*dHzds+dtdy*dHzdt - drdz*dHydr-dsdz*dHyds-dtdz*dHydt); m += BSIZE;

g_rhsQ[m] =  (drdz*dHxdr+dsdz*dHxds+dtdz*dHxdt - drdx*dHzdr-dsdx*dHzds-dtdx*dHzdt); m += BSIZE;

g_rhsQ[m] =  (drdx*dHydr+dsdx*dHyds+dtdx*dHydt - drdy*dHxdr-dsdy*dHxds-dtdy*dHxdt);

}

This kernel is the result of a good deal of experimentation, sometimes random, sometimes patiently systematic. It is likely that some simple tweaks or even an entire rewrite could offer up more speed !!

Notes:

1. 1.the code performs K*Nfields*Np loads of common memory data values into shared memory so that they can be efficiently used Nfields*3*Np*Np times per thread.

2. 2.the “small” derivative matrices are bundled together, aligned into 128 bit float4s, and read through the texture cache. Unfortunately the code has to load from this array K*Np*Np*4 times. This is the dominant memory access cost in all the kernels.

3. 3.the reads and writes are coalesced ( I hope )

4. 4.there is only one loop, with the conditional statement cost spread over many algebraic operations

5. 5.not shown: we perform manual unrolling. nvcc does not seem to do a great job of doing this automatically with the #pragma unroll directive.

GPU+MPI

We have demonstrated up to 40x speed ups for a single GPU compared with a single CPU core. This is definitely a good start, but not enough for computing in 3D. The next step is to partition the mesh and use multiple GPUs to compute on the partitions with inter-GPU communication handled by MPI. This is a generic solution for GPUs attached to multiple nodes in a server or cluster. It can also be used on a single PC with multiple GPU cards. One could also use openMP or pthreads or CUDAs own  threading solution for this latter case, but the MPI parallelism model helps avoid issues of making global CUDA variables (e.g. textures) thread safe.

As with the CPU+MPI set up, it also pays to interleave computation and communication by using non-blocked sends and receives. A simple send/recv implementation is given here:

static int Nmess = 0;

void MaxwellsGPU3D_MPI_Send(){

float foo = mpi_procid, bah;

MPI_Status status;

// Transfer outgoing data from GPU to CPU array

void get_partial_gpu_data(int Ntotal, int *g_index, float *h_partQ);

get_partial_gpu_data(parNtotalout, c_parmapOUT, f_outQ);

// Message counter

Nmess = 0;

// now post send & recv requests for each proc

int sk = 0;

for(int p=0;p<mpi_nprocs;++p){

if(p!=mpi_procid){

if(parNout[p]){

// symmetric communications (different ordering)

MPI_Isend(f_outQ+sk, parNout[p], MPI_FLOAT, p, 6666+p,

MPI_COMM_WORLD, mpi_out_requests +Nmess);

MPI_Irecv(f_inQ+sk,  parNout[p], MPI_FLOAT, p, 6666+mpi_procid,

MPI_COMM_WORLD,  mpi_in_requests +Nmess);

sk+=parNout[p];

++Nmess;

}

}

}

}

void MaxwellsGPU3D_MPI_Recv(float *c_partQ){

int p;

MPI_Status *instatus  = (MPI_Status*) calloc(mpi_nprocs, sizeof(MPI_Status));

MPI_Status *outstatus = (MPI_Status*) calloc(mpi_nprocs, sizeof(MPI_Status));

// Wait for all incoming messages to arrive

MPI_Waitall(Nmess, mpi_in_requests, instatus);

// Sort from proc order to correct local order

for(int n=0;n<parNtotalout;++n)

f_inpermQ[n] = f_inQ[parpermIN[n]];

// Copy into CUDA array

cudaMemcpy(c_partQ, f_inpermQ, parNtotalout*sizeof(float),cudaMemcpyHostToDevice);

// Make sure all outgoing messages have left

MPI_Waitall(Nmess, mpi_out_requests, outstatus);

// Tidy up

free(outstatus);

free(instatus);

}

NOTE: unfortunately CUDA engages the CPU in the undesirable practice of “busy waits” while the GPU is computing. I suspect that the CPU is continually polling the GPU to see if it has finished yet. This causes the CPU to reach 100% load and real problems start when there are more GPUs than there are CPUs on the host device. In this case there is substantial slow down.

Parallel Performance: the haamster PC

Of academic interest: how well does the MPI/CUDA scale on the dual 9800GX2 commodity PC ?

Strong scaling results for fixed size problem: Maxwells 3D with N=7, K=15628, 528 time steps, 5 stage RK4:

Comments: we see perfect scaling to two processes/GPUs. We see a dip in scaling at 4 GPUs because of the unfortunate “busy wait” that CUDA engages the CPUs in while the GPU is working.

We did not think that the CPUs would be busy so we only included a dual core Intel CPU in the build specs. This will be upgrade to a four core CPU and hopefully this will resolve the issue.

However, haamster is delivering a respectable performance of approximately 0.45TFLOPS. Not bad for a \$2K self built PC. This is approximately 150x the single core performance (no SSE).

First some perspective, using real life workhorse clusters at my own institution as reference bars.

The Rice terascale cluster went online in 2003 (making it 35 in “cluster years”):

Vivek Sarkar’s group at Rice recently received a new Tesla server from Nvidia:

(1x) HP DL320s w/ Dual Core 2.67GHz Xeon, 4GB memory, 2 x 750GB disks

(8x) HP DL140 G3 w/ Quad Core 2.33 GHz Xeon, 4GB memory, 80GB disk

(1x) HP ProCurve 1400-24G Gigabit Ethernet switch

(4x) NVIDIA Tesla S870

I have been fortunate to have extensive access to this new Tesla server. Strong scaling for a fixed size problem scaling on 1-16 nodes (Maxwells 3D, K=15628, N=7). Note that the 12 and 16 process cases were 2 processes per node:

Just for kicks we can estimate the number of GFLOPS ( ball park figures only here )

This little 8 node, 16 GPU cluster can obtain approximately 1.9TFLOPS.  Although this sounds like a lot of FLOPS it is still quite a bit shy of nvidia’s peak performance estimates.

NOTE: these are preliminary timings and not the final word.

Beyond the Desktop

The above performance numbers sound impressive, but check out this news release from Los Alamos National Lab and their roadrunner computer. They report 1.14 PFLOPS (P for peta). We have some way to go...but notice that they are also using accelerators:

Roadrunner was built using commercially available hardware, including aspects of commercial game console technologies. Roadrunner has a unique hybrid design comprised of nodes containing two AMD OpteronTM dual-core processors plus four PowerXCell 8iTM processors used as computational accelerators. The accelerators are a special IBM-developed variant of the Cell processors used in the Sony PlayStation® 3. Roadrunner uses a Linux operating system. The project’s total cost is approximately \$120 million.”

Question: how high can the CUDA/GPU solution scale ?