CUDA
We created a CUDA port of miniGMG to explore GPU performance. Unfortunately, it has not been maintained and it isn't pretty. At a very high level, miniGMG is similar to HPGMG; however, the data decomposition in HPGMG is more sophisticated, and its ability to build a true distributed V-Cycle makes it more scalable. This CUDA version of miniGMG is presented in the hope that it will provide insights into how to port and optimize HPGMG for GPUs. Please contact Sam Williams for access to the code.
Compilation is relatively simple. The 'compile' file includes instructions on how I compiled on Dirac (NERSC) and Titan (OLCF) under different modes.
There are two versions of the key kernels: a straightforward 2D blocked implementation (a 2D tiling of the ij iteration space) and a version that flattens the ij iteration space and strip-mines it. Both use shared memory and a number of other optimizations. A sketch of the two thread mappings follows below.
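To make the difference concrete, here is a minimal sketch of the two mappings for a generic 7-point stencil on a DIM^3 box. This is illustrative only, not the actual miniGMG kernels: DIM, IDX, the kernel names, and the stencil are hypothetical, and the shared-memory staging used by the real kernels is omitted for brevity.

  #define DIM 64
  #define IDX(i,j,k) ((i) + (j)*DIM + (k)*DIM*DIM)

  __device__ double laplacian(const double *in, int i, int j, int k) {
    return in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)]
         + in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)]
         + in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]
         - 6.0*in[IDX(i,j,k)];
  }

  // Version 1: 2D blocked.  Each thread block tiles a patch of the ij
  // plane; each thread owns one (i,j) column and marches through k.
  __global__ void smooth_2d_blocked(const double *in, double *out) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < 1 || i >= DIM-1 || j < 1 || j >= DIM-1) return;
    for (int k = 1; k < DIM-1; k++)
      out[IDX(i,j,k)] = laplacian(in, i, j, k);
  }

  // Version 2: flattened + strip-mined.  The ij plane is treated as one
  // 1D range of DIM*DIM cells and each thread strides through it.
  __global__ void smooth_flattened(const double *in, double *out) {
    for (int ij = blockIdx.x*blockDim.x + threadIdx.x;
         ij < DIM*DIM; ij += gridDim.x*blockDim.x) {
      int i = ij % DIM, j = ij / DIM;
      if (i < 1 || i >= DIM-1 || j < 1 || j >= DIM-1) continue;
      for (int k = 1; k < DIM-1; k++)
        out[IDX(i,j,k)] = laplacian(in, i, j, k);
    }
  }

One likely appeal of the flattened form is that thread-block occupancy no longer depends on the box dimensions, which matters for the small coarse-grid boxes deep in the V-Cycle, where a fixed 2D tile would leave most threads idle.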
Execution is the same as the CPU version of miniGMG (but differs from HPGMG-FV).
In miniGMG, you specify 4 arguments (single process) or 7 arguments (MPI+x). See jobs.titan.* for aprun examples.
e.g.

aprun -n 8 -N 1 ./run 6 4 4 4 2 2 2

will run 8 processes (1 per node) in a 2x2x2 process grid.
Each process will allocate a 4x4x4 grid of boxes.
Each box is (2^6)^3 = 64^3 cells on the finest level.
The MG solver restricts each box to 4^3 or 2^3 cells, at which point it switches to BiCGStab.
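The global domain at the finest level in this example is therefore 2*4*64 = 512 cells per dimension, i.e., a 512^3 problem.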