HPGMG Submission Guidelines
Those wishing to submit HPGMG-FV results should email them to Samuel Williams. The current submission guidelines are as follows:
- Submissions must run the variable-coefficient, fourth-order "Laplacian" using the out-of-place GSRB smoother (3 pre-smooths and 3 post-smooths) with fourth-order homogeneous Dirichlet boundary conditions and the FMG (F-cycle) solver (i.e. the defaults). Moreover, second-order 1:2 volume-averaged prolongation (27-point) is required within the V-cycles and fourth-order 1:2 volume-averaged prolongation (125-point) is required between V-cycles. Restriction within the V-cycle is 2:1 piecewise constant.
- Submissions must benchmark 3 problem sizes at grid spacings of h, 2h, and 4h (e.g. 256^3 per socket, 128^3 per socket, 64^3 per socket), performing at least 10 solves and running for at least 60 seconds at each size. A single invocation of either the CPU or GPU reference implementation will benchmark these three problem sizes, summarize performance, and validate the error properties.
- Submissions may use either the provided BiCGStab Krylov solver or the provided iterative smoother-based solver (i.e. omit -DUSE_BICGSTAB) as a coarse grid solver. Submissions may not otherwise change the mathematical properties, e.g. by reducing the number of smooths, changing the discretization, using lower-order boundary conditions, or calculating the norm of the right-hand side (f) once per executable instead of once per solve.
- Submissions may use any problem size of the form (C * 2^k)³ for any odd C ≤ 11 and k ≥ 4. Any implementation of the V-cycle must eliminate all k factors of two from the problem dimension.
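The size rule above can be checked mechanically. The following sketch (illustrative Python, not part of HPGMG) enumerates the per-dimension sizes C * 2^k with odd C ≤ 11 and k ≥ 4 up to a given bound:

```python
# Enumerate problem dimensions of the form C * 2^k with odd C <= 11 and
# k >= 4, as required by the submission rules (sketch, not part of HPGMG).

def valid_dims(limit):
    dims = set()
    for c in (1, 3, 5, 7, 9, 11):   # odd C <= 11
        k = 4                       # k >= 4
        while c * 2**k <= limit:
            dims.add(c * 2**k)
            k += 1
    return sorted(dims)

print(valid_dims(256))
# -> [16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 224, 256]
```

Each listed dimension retains its k factors of two, so a conforming V-cycle can coarsen down to an odd C³ grid no larger than 11³.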
- Submissions may use any out-of-place implementation of GSRB so long as it does not transform the semantics of the smoother (red and black are defined based on global indices not local indices).
- HPGMG-FV may be ported to any other architecture or programming model. Ports of HPGMG-FV must conform to the mathematical properties of the MPI+OpenMP reference implementation and should produce identical results in the absence of BiCGStab or of rounding differences between discrete multiply/add instructions and a fused multiply-add instruction. That being said, constructs like messages, cubical boxes, and blocks (tiling) are purely artifacts of the MPI+OpenMP reference implementation. Implementations in other programming models may explore alternatives to messages, boxes, and blocks (tiling) so long as they preserve the mathematics of the program.
- Submissions must include the full text output of the HPGMG-FV benchmark (from the 'HPGMG-FV Benchmark' to 'Done' including the residuals and error analysis) and are subject to an internal (non-public) code-review.
- v0.3 of the reference MPI+OpenMP implementation conforms to these requirements. If using the reference MPI+OpenMP implementation, please note the following:
- Submissions may use any combination of threads and processes.
- Submissions may use any data decomposition (e.g. box size and boxes per process) and thus, aprun -n# ./hpgmg-fv 8 16 is perfectly acceptable.
- Submissions may vary the -DBLOCKCOPY_TILE_* tiling options and -DBOX_ALIGN_* padding options of the reference implementation to find whichever works best for the target machine. For flat MPI, the former enables cache blocking (tiling). The default 8x8x10000 works well on most CPUs. Users may disable tiling by specifying -DBLOCKCOPY_TILE_{I,J,K}=10000.
- Submissions may use either of the two alternate GSRB implementations included in the reference implementation. They are enabled with -DGSRB_FP and -DGSRB_BRANCH respectively.
- The previous XL/C compiler issues on BGQ systems have been resolved. If your compiler version cannot be updated, you may modify norm() as discussed on the hpgmg mailing list.
- The reference implementation allows for coarse grid problem dimensions of up to 11³ (C ≤ 11). This ensures that the coarse grid solve, which occurs on a single process, is not overly computationally expensive. Submissions may decrease the maximum coarse grid dimension by compiling with -DMAX_COARSE_DIM=7 or any other moderately small odd integer.
- Given a total number of processes and a target number of boxes per process, the driver calculates a valid problem size and total number of boxes. Given the MAX_COARSE_DIM constraint, this can be less than the product of processes and the maximum boxes per process. The load balancer then attempts to partition the boxes among processes; it can be impaired when the maximum number of boxes per process is small (e.g. 1). Although large boxes are usually more efficient for on-node computation, having more boxes can facilitate load balancing. That is, although one 128³ box and eight 64³ boxes per process require the same local work, the latter facilitates load balancing because the partitioner can assign 8 boxes to most processes and 7 to a subset in order to efficiently partition a large non-power-of-two problem. Researchers should experiment with this when conducting weak scaling experiments in order to determine which configuration maximizes performance (DOF/s).
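The 8-boxes-versus-7-boxes point above can be sketched with a trivial even-split calculation (illustrative Python, not HPGMG's actual partitioner; the process and box counts below are made up):

```python
# Distribute total_boxes as evenly as possible over procs processes.
# This is only the counting argument, not HPGMG's real load balancer.

def balance(total_boxes, procs):
    lo, extra = divmod(total_boxes, procs)
    # `extra` processes receive lo+1 boxes; the remainder receive lo.
    return {lo + 1: extra, lo: procs - extra} if extra else {lo: procs}

# Hypothetical: 7992 boxes of 64^3 over 1000 processes. Most ranks get
# 8 boxes and a few get 7, so the worst-case imbalance is one small box.
print(balance(7992, 1000))   # -> {8: 992, 7: 8}
```

With one 128³ box per process instead, the same problem could only be balanced at whole-box (i.e. full-rank) granularity, which is why more, smaller boxes ease partitioning of non-power-of-two problems.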
- When selecting a box size and the target number of boxes per process, note that the memory requirement of a b³ box (including all levels and all metadata) is roughly 200*b³ bytes. Thus, machines with 32GB per processor (far less of it usable) should use fewer than 64 x 128³ boxes per processor.
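As a rough illustration of that rule of thumb (the 200*b³ figure comes from the guideline above; the usable-memory fraction below is a made-up assumption):

```python
# Rough per-box memory estimate: ~200 * b^3 bytes for a b^3 box,
# counting all multigrid levels and metadata (an approximation).

def box_bytes(b):
    return 200 * b**3

def max_boxes(mem_bytes, b, usable_fraction=0.75):
    # usable_fraction is a guessed allowance for OS, MPI buffers, etc.
    return int(mem_bytes * usable_fraction) // box_bytes(b)

GiB = 1 << 30
print(box_bytes(128))             # ~419 MB per 128^3 box
print(max_boxes(32 * GiB, 128))   # comfortably under 64 boxes on a 32GB node
```

Under these assumptions a 32GB node supports on the order of 60 boxes of 128³, consistent with the "fewer than 64" guidance above.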
- When selecting the optimal number of boxes per process on machines with a non-power-of-two number of processors, one should balance full utilization against irregular decompositions. For example, on Mira with 48K nodes, one could run 36³ processes (46656 nodes), each with 8 x 128³ boxes. This results in a global problem of 9216³ = (9*1024)³ cells, 8:1 agglomeration in the V-cycle, and a coarse grid of 9³ cells. Conversely, one could choose to run with 48K processes, each with 18 boxes. This would produce a global problem of 12288³ = (3*4096)³ cells with a coarse grid of 3³ cells. The former has a shallower V-cycle but a more challenging bottom solve.
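The arithmetic behind both Mira decompositions can be checked with a short sketch (illustrative Python; 128³ boxes as in the example above):

```python
# Verify the global problem dimension for a cubic decomposition:
# procs * boxes_per_proc boxes must form a perfect cube of b^3 boxes.

def global_cells(procs, boxes_per_proc, box_dim=128):
    boxes = procs * boxes_per_proc
    boxes_per_side = round(boxes ** (1 / 3))
    assert boxes_per_side ** 3 == boxes   # decomposition must be cubic
    return boxes_per_side * box_dim

print(global_cells(36**3, 8))       # -> 9216  (72^3 boxes of 128^3)
print(global_cells(48 * 1024, 18))  # -> 12288 (96^3 boxes of 128^3)
```

That is, 46656 processes x 8 boxes = 72³ boxes giving 9216³ cells, while 49152 processes x 18 boxes = 96³ boxes giving 12288³ cells, as stated above.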