Performance and Algorithms Research

# miniGMG

miniGMG is a compact benchmark for understanding the performance challenges associated with geometric multigrid solvers found in applications built from AMR MG frameworks like CHOMBO or BoxLib when running on modern multi- and manycore-based supercomputers.  It includes both productive reference examples as well as highly-optimized implementations for CPUs and GPUs.  It is sufficiently general that it has been used to evaluate a broad range of research topics including PGAS programming models and algorithmic tradeoffs inherit in multigrid.  miniGMG was developed under the CACHE Joint Math-CS Institute.

Note, miniGMG code has been supersceded by HPGMG.  HPGMG is a FMG benchmark for evaluating Geometric MG challenges on petascale and exascale systems built from multiprocessors and manycore accelerators.  Nevertheless, miniGMG remains valuable for simplified, small-scale (less than 1000 nodes) geometric multigrid studies.

## Geometric Multigrid and Problem Definition

At a high level, the benchmark solves Au=f where u and f are cell-centered (finite volume) 3D structured grids.  The operator A is a 2nd order finite volume discretization of the helmholtz operator (a*alpha[]*u[] - b* div beta[] grad u[]), where a and b are scalar constants and alpha[] and beta[] are spatially varying coefficients.  The benchmark generates a u_exact[] for a large cubical 3D grid partitioned into subdomains(boxes) which are distributed across the supercomputer.  It then manually differentiates u[] for form f[], then uses a multigrid solver to calculate a u[].  It may then use u_exact[] to test correctness and order.

The basic relaxation operator is Gauss-Seidel Red-Black which is applied twice (red/black/red/black) at every level up and and down the v-cycle.  miniGMG implements a truncated v-cycle (u-cycle)... boxes are restricted until they reach 2^3 cells.  At this point, miniGMG switches to one of a few bottom solvers (BiCGStab is used often used in codes like BoxLib).  Once a sufficiently accurate solution is obtained on this coarse grid (the union of all 2^3 boxes), it is interpolated back up the v-cycle.

## Compilation

Although no make file currently exists, compilation is straightforward.  Please see the file 'compile' for examples on various platforms.

There are a few basic arguments that should be used.  Most are selfexplanatory.

`-D__TEST_MG_CONVERGENCE                 // terminate MG when it attains the desired convergence-D__PRINT_NORM                          // used for verifying the results-D__PRINT_COMMUNICATION_BREAKDOWN       // sepearates out various facets of communication-D__MPI                                 // compiles the distributed (MPI) version-D__COLLABORATIVE_THREADING=6           // controls inter- vs intra-box parallelism-D__USE_BICGSTAB                        // use BiCGStab as a bottom (coarse grid) solver-D__FUSION_RESIDUAL_RESTRICTION         // fuse the residual and restriction into one operation-D__PREFETCH_NEXT_PLANE_FROM_DRAM-D__LINEAR_INTERPOLATION`

Let us consider an example for Edison, the Cray XC30 at NERSC where the MPI compiler uses icc and is invoked as 'cc'.  The following compiles the baseline code.  Any box>=64^3 will be threaded.

cc -O3 -fno-alias -fno-fnalias -xAVX   -openmp miniGMG.c mg.c box.c solver.c operators.ompif.c timer.x86.c -D__MPI -D__COLLABORATIVE_THREADING=6 -D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB -o run.miniGMG

One can experiment with faster smoothers by using either of the following.  Both use the avx optimized operators (operators.avx.c), but the latter will also attempt to compare a communication-avoiding smoother to the baseline smoother implementation.

cc -O3 -fno-alias -fno-fnalias -xAVX -openmp miniGMG.c mg.c box.c solver.c operators.avx.c timer.x86.c -D__MPI -D__COLLABORATIVE_THREADING=6 -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB -o run.miniGMG

cc -O3 -fno-alias -fno-fnalias -xAVX -openmp   bench.c mg.c box.c solver.c operators.avx.c timer.x86.c -D__MPI -D__COLLABORATIVE_THREADING=6 -D__PREFETCH_NEXT_PLANE_FROM_DRAM -D__TEST_MG_CONVERGENCE -D__PRINT_NORM -D__USE_BICGSTAB -o run.miniGMG

There are similar optimizations for BGQ and to explore the OpenMP task model.  Similar optimization or programming model research can be conducted by creating a new subdirectory (based on opearators.ompif) and creating the appropriate includes.

## Running the Benchmark

The benchmark exploits OpenMP and/or MPI for parallelism.  You must thus set OMP_NUM_THREADS correctly.  For a machine like Edison at NERSC, this is simply

`% export OMP_NUM_THREADS=12`

Moreover, on multisocket architectures or when using MPI, you must set affinity correctly.  Consider the typical command line execution...

./run.miniGMG log2BoxSize [BoxesPerProcess_i BoxesPerProcess_j BoxesPerProcess_k] [Processes_i Processes_j Processes_k]

• log2BoxSize is the log base 2 of the dimension of each box on the finnest grid (e.g. 6 is a good proxy for real applications)
• As each process is given a rectahedral collection of boxes, BoxesPerProcess_ denote the number of boxes in i,j,k.  The product of these arguments is the total number of boxes per process.
• Processes_i,j,k are the number of MPI processes that form the overal process grid.  The product is the total number of processes and this must match the total number of processes used by mpirun or the equivalent.
• note, the total domain should be cubical.  Thus, BoxesPerProcess_i*Processes_i == BoxesPerProcess_j*Processes_j == BoxesPerProcess_k*Processes_k

On edison, (the Cray XC30 at nersc), one uses aprun to invoke mpi jobs.  A job script may include the following...

`#PBS -l mppwidth=96export OMP_NUM_THREADS=12aprun -n 8  -d 12  -N 2  -S 1  -ss  -cc numa_node ./run.miniGMG  6  2 4 8   4 2 1`

This will launch 8 mpi processes (-n 8 == 4*2*1) of 12 threads (-d 12 == OMP_NUM_THREADS) with 2 processes per node (-N 2), 1 process per NUMA node (-S 1) with the appropriate NUMA controls (-ss  -cc numa_node).  Moreover, each box is 2^6 on each size (64^3) and there are 64 boxes per node in a 2x4x8 array.

A number of example job scripts have been created and are included.

## Understanding the Results

During execution, the benchmark will output some debug information for understanding convergence and performance.  The following is an example and examines a key subset of this information.  The initial output details how MPI and OpenMP were initialized.  Moreover, it notes how deep the v-cycle is (down to 2^3 boxes).  It then shows the progress as it creates the structured grids noting their respective sizes and the total memory explicitly allocated with malloc().  Thus, the 2K^3 overall problem represents 8 billion degrees of freedom.

`+ aprun -n 512 -d 12 -N 2 -S 1 -ss -cc numa_node ./run.miniGMG.edison 7 2 2 2 8 8 8Requested MPI_THREAD_FUNNELED, got MPI_THREAD_FUNNELED512 MPI Tasks of 12 threadstruncating the v-cycle at 2^3 subdomainscreating domain...       done  128 x 128 x 128 (per subdomain)  256 x 256 x 256 (per process)  2048 x 2048 x 2048 (overall)  1-deep ghost zones  allocated 1865 MB`

As the multigrid solver progresses, the max (inf) norm of the residual is reported after each v-cycle.  One expects to reduce the norm by one digit on each v-cycle.  Thus to attain a norm less than 1e-15, we required 11 v-cycles.

`MGSolve...v-cycle= 1, norm=0.00002091903646017090 (2.091904e-05)v-cycle= 2, norm=0.00000079708396334668 (7.970840e-07)v-cycle= 3, norm=0.00000007951502395414 (7.951502e-08)v-cycle= 4, norm=0.00000000581619537788 (5.816195e-09)v-cycle= 5, norm=0.00000000048970464287 (4.897046e-10)v-cycle= 6, norm=0.00000000003900568126 (3.900568e-11)v-cycle= 7, norm=0.00000000000318039461 (3.180395e-12)v-cycle= 8, norm=0.00000000000025703104 (2.570310e-13)v-cycle= 9, norm=0.00000000000002088201 (2.088201e-14)v-cycle=10, norm=0.00000000000000170463 (1.704634e-15)v-cycle=11, norm=0.00000000000000014284 (1.428395e-16)done`

Finally, we see a timing report.  Vertically are the key operations within the v-cycle (communication is further broken down into its constituient operations).  Horizontally is a breakdown of time (in seconds) by level in the v-cycle.  Thus, one can see the difference in time spent in each operation at each level.  These times are totaled by level and by function.  Finally, the total time required to build the solver (note geometric multigrid solves can be built extremely quickly), the time spent in the solver, the number of v-cycles, and the total number of bottom solver (e.g. BiCGStab) iterations summed across all v-cycles is reported.

`                                  0            1            2            3            4            5            6                              128^3         64^3         32^3         16^3          8^3          4^3          2^3        totalsmooth                     2.244879     0.288221     0.020186     0.003279     0.000672     0.000267     0.000000     2.557504residual                   0.569046     0.035340     0.001833     0.000328     0.000077     0.000036     0.000030     0.606691restriction                0.041538     0.003994     0.000310     0.000072     0.000032     0.000028     0.000000     0.045975interpolation              0.076533     0.006586     0.000567     0.000105     0.000038     0.000032     0.000000     0.083860applyOp                    0.000000     0.000000     0.000000     0.000000     0.000000     0.000000     0.001715     0.001715BLAS1                      0.157396     0.004949     0.000776     0.000184     0.000055     0.000027     0.014614     0.178002BLAS3                      0.000000     0.000000     0.000000     0.000000     0.000000     0.000000     0.000000     0.000000communication              0.314615     0.069810     0.024858     0.017584     0.009740     0.005763     0.318338     0.760707  local exchange           0.047781     0.008262     0.001819     0.000730     0.000368     0.000233     0.001743     0.060936  pack MPI buffers         0.047688     0.007722     0.001089     0.000569     0.000294     0.000215     0.001630     0.059207  unpack MPI buffers       0.022835     0.004058     0.001226     0.000530     0.000349     0.000231     0.001712     0.030940  MPI_Isend                0.002422     0.002161     0.000856     0.000659     0.000779     0.000374     0.002755     0.010005  MPI_Irecv                0.000456     0.000402     0.000152     0.000205     0.000119     0.000079     0.000677     0.002089  MPI_Waitall              0.169658     0.047091     0.019666     0.014850     0.007801     0.004603     0.022721     0.286390  MPI_collectives          0.023637     0.000000     0.000000     0.000000     0.000000     0.000000     0.286850     0.310487--------------         ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------Total by level             3.386964     0.404774     0.047960     0.021436     0.010595     0.006159     0.334945     4.212834  Total time in MGBuild       0.081082  Total time in MGSolve       4.235200              " v-cycles      4.213100      number of v-cycles            11Bottom solver iterations           397`

We thus observe that this 8 billion DOF problem was solved in 4.23 seconds on 512 processes (6144 cores).  It required 397 BiCGStab iterations on the coarse grid spread over 11 vcycles (approx 36 terations per v-cycle).  As the time spent smoothing the fine grid was non-trivial (2.244 seconds), one might be motivated to analyze it.  Each box has (128+2)^3 cells including ghost zones.  There are 8 boxes (2 2 2) per process.  Each call to smooth moves 64 bytes of data per cell per stencil sweep.  There are 4 calls to smooth and 2 stencil sweeps per smooth in the v-cycle.  There are 11 v-cycles.  Thus, smooth requires one move at least 8 * 130^3 * 64 * 8 * 11 bytes of data = 98.99e9 bytes.  Moving this data in 2.244 seconds suggests each process attained an average DRAM bandwidth of 44GB/s.  This is quite good given this was vanilla OpenMP code without optimization and one could never hope for better than 54GB/s on this machine.

Variants and Features

 CPUs GPUs ompif omptask sse/avx qpx NVIDIA Parallelism Distributed Memory Parallelism MPI MPI MPI MPI MPI Threading OpenMP OpenMP OpenMP OpenMP CUDA for if() collapse(2) tasks parallel regions parallel regions Smoothers Gauss-Seidel, Red-Black (GSRB) comm. avoiding comm. avoiding comm. avoiding, wavefront, AVX comm. avoiding, wavefront, QPX yes, auto-tuned Weighted Jacobi comm. avoiding - - - - Chebyshev comm. avoiding - - - - Residual fusion with restriction yes - - - Interpolation pointwise-constant yes - - - yes pointwise-linear yes - - - - Bottom Solvers multiple smooths yes yes yes yes yes BiCGStab yes yes yes yes yes CG yes yes yes yes - CABiCGStab yes yes yes yes - CACG yes yes yes yes - Structure of MG Solver V-Cycles yes yes yes yes yes F-Cycles (FMG) yes yes yes yes - Boundary Conditions periodic yes yes yes yes yes Convergence Criteria fixed number of v-cycles yes yes yes yes yes norm of the residual yes yes yes yes -

Note, communication-avoiding aggregates MPI, but relies on HW/compiler to realize locality.  Wavefront variants make locality explicit.  If an optimized OpenMP variant is lacking, the code will default to the baseline (ompif) version

## CUDA

We created a MPI+CUDA port of miniGMG to explore GPU performance and porting challenges.  The lessons learned from this exercise facilitated the HPGMG port to CUDA.

Compilation is relatively simple. The 'compile' file includes instructions on how I compiled on Dirac(NERSC) and Titan(OLCF) under different modes.  The execution is the same as the cpu version of miniGMG (but is different than HPGMG-FV).  In miniGMG, you specify 4 args (single proc) or 7 args (MPI+x).  see jobs.titan.* for aprun examples.

e.g.
aprun -n 8  -N 1 ./run 6  4 4 4  2 2 2
Will run 8 processes (1 per node) in a 2x2x2 process grid.
Each process will allocate a 4x4x4 grid of boxes.
Each box is (2^6)^3 = 64^3 cells on the finest-level.
The MG solver will restrict each of these to 4^3 or 2^3 cells at which point it switches to BiCGStab.