Computational Cosmology Center

MADbench2

MADbench2 is a tool for testing the integrated performance of the I/O, communication and calculation subsystems of massively parallel architectures under the stresses of a real scientific application.

MADbench2 is based on the MADspec code, which calculates the maximum likelihood angular power spectrum of the Cosmic Microwave Background radiation from a noisy pixelized map of the sky and its pixel-pixel noise correlation matrix.

MADbench2 retains the full computational complexity of its parent scientific application code, but uses self-generated pseudo-data to allow the myriad computationally irrelevant details associated with handling real CMB datasets to be by-passed.

MADbench2 can be run in two modes:

  1. regular mode, in which the full code is run.
  2. IO mode, in which all calculation/communication is replaced with busy-work.

In addition, MADbench2 can be run in single-gang or multi-gang mode. In single-gang mode, all of the matrix operations are distributed over all of the processors. In multi-gang mode, the matrices are built, summed and inverted over all of the processors (the S and D functions), but then redistributed over subsets of processors (gangs) for their subsequent manipulation (the W and C functions). This gang-parallelism keeps the data dense on each processor during the dominant matrix-matrix multiplication (W) phase, even with very large numbers of processors.

Compiling MADbench2

To run in regular mode, MADbench2 needs to be linked to the ScaLAPACK & LAPACK libraries and their dependencies (BLAS, PBLAS, BLACS). The MADbench2.h file contains system-specific definitions and declarations; this file should be augmented as needed and the code compiled with -D SYSTEM.

To run in IO mode, MADbench2 should be compiled with -D IO (in addition to -D SYSTEM) whereupon all of the library calls are redefined to busy-work so that none of the libraries are needed.  

Running MADbench2

MADbench2 is run as

> MADbench2.x   $NO_PIX   $NO_BIN   $NO_GANG   $SBLOCKSIZE   $FBLOCKSIZE   $RMOD   $WMOD


where

NO_PIX
Sets the size of the pseudo-data - all the component matrices have NO_PIX x NO_PIX elements
NO_BIN
Sets the size of the pseudo-dataset - there are NO_BIN component matrices
NO_GANG
Sets the level of gang-parallelism - there are NO_GANG gangs
SBLOCKSIZE
Sets the ScaLAPACK blocksize - all matrices will be block-cyclically distributed in blocks of side SBLOCKSIZE.
FBLOCKSIZE
Sets the file blocksize - all IO will start at a file-offset that is an integer multiple of FBLOCKSIZE.
RMOD
Sets the degree of simultaneous reading - one in every RMOD processors (1:RMOD) will read at once.
WMOD
Sets the degree of simultaneous writing - one in every WMOD processors (1:WMOD) will write at once.
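
The FBLOCKSIZE rule above (every IO starts at an integer multiple of FBLOCKSIZE) can be illustrated with a short Python sketch. This is not taken from the MADbench2 source; the helper names `align_offset` and `padded_offsets`, and the choice to round up to the next boundary, are assumptions made for illustration.

```python
def align_offset(offset: int, fblocksize: int) -> int:
    """Round offset up to the next integer multiple of fblocksize (assumed policy)."""
    if fblocksize <= 0:
        raise ValueError("fblocksize must be positive")
    return -(-offset // fblocksize) * fblocksize  # ceiling division, then scale


def padded_offsets(sizes, fblocksize):
    """Start offsets for consecutive records when each start is block-aligned.

    `sizes` is a hypothetical list of record sizes in bytes.
    """
    offsets = []
    position = 0
    for nbytes in sizes:
        position = align_offset(position, fblocksize)
        offsets.append(position)
        position += nbytes
    return offsets


if __name__ == "__main__":
    # Three records of different sizes with a 4096-byte file blocksize.
    print(padded_offsets([1000, 5000, 4096], 4096))
```

Under this policy a larger FBLOCKSIZE leaves more padding between records, which is the trade-off against keeping every IO aligned to the file system block.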

Running MADbench2 requires:

  • a square number of processors
  • a uniform square number of processors per gang
  • a uniform number of bins per gang
  • a ScaLAPACK blocksize that distributes some data to every processor
  • a file blocksize that is a whole number of doubles
  • a number of gangs that is exactly divisible by the read-modulus and the write-modulus

each of which is checked on initialization.
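
The checks listed above can be sketched in Python as follows. This is an illustrative reconstruction, not the initialization code of MADbench2: the function name `check_parameters`, the argument `no_pe` (total number of processors), the 8-byte size of a double, and in particular the test used for "distributes some data to every processor" (at least as many blocks along a matrix side as processors along a side of the square grid) are all assumptions.

```python
import math

DOUBLE_BYTES = 8  # assumed size of a double


def is_square(n: int) -> bool:
    """True if n is a positive perfect square."""
    if n <= 0:
        return False
    root = math.isqrt(n)
    return root * root == n


def check_parameters(no_pe, no_pix, no_bin, no_gang,
                     sblocksize, fblocksize, rmod, wmod):
    """Return a list of violated requirements (empty if all checks pass)."""
    values = (no_pe, no_pix, no_bin, no_gang, sblocksize, fblocksize, rmod, wmod)
    if min(values) <= 0:
        return ["all parameters must be positive"]
    errors = []
    if not is_square(no_pe):
        errors.append("number of processors is not square")
    if no_pe % no_gang != 0 or not is_square(no_pe // no_gang):
        errors.append("processors per gang is not a uniform square number")
    if no_bin % no_gang != 0:
        errors.append("bins per gang is not uniform")
    # Assumed test: enough blocks per matrix side to reach every grid row/column.
    blocks_per_side = -(-no_pix // sblocksize)
    if blocks_per_side < math.isqrt(no_pe):
        errors.append("ScaLAPACK blocksize leaves some processors without data")
    if fblocksize % DOUBLE_BYTES != 0:
        errors.append("file blocksize is not a whole number of doubles")
    if no_gang % rmod != 0 or no_gang % wmod != 0:
        errors.append("number of gangs not divisible by RMOD and WMOD")
    return errors


if __name__ == "__main__":
    print(check_parameters(16, 5000, 4, 4, 32, 4096, 1, 1))
```

Returning a list of messages rather than stopping at the first failure is a design choice of this sketch; it makes it easier to see every problem with a proposed run configuration at once.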

In addition, MADbench2 requires 5 x NO_PIX^2 x 8 bytes of memory per gang.
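
As a worked example of this memory requirement, the Python sketch below evaluates 5 x NO_PIX^2 x 8 bytes. The function names are hypothetical, and the per-processor figure assumes the per-gang memory is spread evenly over the NO_PE/NO_GANG processors of a gang, which the text above does not state.

```python
def memory_per_gang_bytes(no_pix: int) -> int:
    """Five NO_PIX x NO_PIX matrices of 8-byte doubles."""
    return 5 * no_pix ** 2 * 8


def memory_per_processor_bytes(no_pix: int, no_pe: int, no_gang: int) -> float:
    """Assumes an even split across the processors of one gang."""
    return memory_per_gang_bytes(no_pix) / (no_pe // no_gang)


if __name__ == "__main__":
    print(memory_per_gang_bytes(5000))               # 1000000000 bytes per gang
    print(memory_per_processor_bytes(5000, 16, 4))   # 250000000.0 bytes per processor
```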

Environment Variables

Variable    Allowed Values       Default
IOMETHOD    POSIX, MPI           POSIX
IOMODE      SYNC, ASYNC          SYNC
FILETYPE    UNIQUE, SHARED       UNIQUE
REMAP       CUSTOM, SCALAPACK    CUSTOM
BWEXP       Any number           None
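
A minimal Python sketch of how these variables and their defaults could be resolved is given here for illustration. It is not the MADbench2 implementation: `SETTINGS` and `read_settings` are hypothetical names, and exact, case-sensitive matching of the allowed values is an assumption.

```python
import os

# (allowed values, default) for each enumerated variable in the table above.
SETTINGS = {
    "IOMETHOD": (("POSIX", "MPI"), "POSIX"),
    "IOMODE": (("SYNC", "ASYNC"), "SYNC"),
    "FILETYPE": (("UNIQUE", "SHARED"), "UNIQUE"),
    "REMAP": (("CUSTOM", "SCALAPACK"), "CUSTOM"),
}


def read_settings(env=None):
    """Resolve the run settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    resolved = {}
    for name, (allowed, default) in SETTINGS.items():
        value = env.get(name, default)
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
        resolved[name] = value
    # BWEXP accepts any number and has no default.
    bwexp = env.get("BWEXP")
    resolved["BWEXP"] = None if bwexp is None else float(bwexp)
    return resolved


if __name__ == "__main__":
    print(read_settings({"IOMETHOD": "MPI", "BWEXP": "1.5"}))
```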

NOTES

(i) The remap options are only used in multi-gang mode; the custom remap is provided for cases where the ScaLAPACK remap function (pdgemr2d) performs poorly or fails.

(ii) The busy-work exponent, BWEXP, sets the amount of busy-work done as a function of the IO data size. For N data elements written/read the busy-work function will perform N^BWEXP floating-point operations.

If the IO data are thought of as matrices, as in the parent application, then
   BWEXP = 1.0 corresponds to level 2 BLAS
   BWEXP = 1.5 corresponds to level 3 BLAS

If the IO data are thought of as vectors then
   BWEXP = 1.0 corresponds to level 1 BLAS
   BWEXP = 2.0 corresponds to level 2 BLAS
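
The correspondence between BWEXP and the BLAS levels follows from simple arithmetic, sketched below in Python. The function name `busy_work_flops` is hypothetical; only the N^BWEXP rule comes from the note above. For an n x n matrix, N = n^2, so BWEXP = 1.5 gives n^3 operations (the scaling of a matrix-matrix multiply) and BWEXP = 1.0 gives n^2 (the scaling of a matrix-vector multiply).

```python
def busy_work_flops(n_elements: int, bwexp: float) -> float:
    """Number of floating-point operations for N elements: N ** BWEXP."""
    return float(n_elements) ** bwexp


if __name__ == "__main__":
    n = 1000                 # matrix side
    n_elements = n * n       # N = 10^6 elements read or written
    print(busy_work_flops(n_elements, 1.0))  # ~n^2 operations, level 2 BLAS scaling
    print(busy_work_flops(n_elements, 1.5))  # ~n^3 operations, level 3 BLAS scaling
```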

Component Functions

S
Purpose: Build the signal correlation matrix as the weighted sum of the signal derivative matrices with respect to the bin-powers:
   $S = \sum_b C_b \, \partial S / \partial C_b$
Calculations: Legendre polynomial recursion. Weighted summation.
Communication: None.
Input/Output: NO_BIN writes, each of O(NO_PIX^2) bytes, on NO_PE processors.

D
Purpose: Build the data correlation matrix from the signal and (pseudo)noise matrices and invert it:
   $D^{-1} = (S + N)^{-1}$
Calculations: 1 Cholesky decomposition (pdpotrf) & matrix inversion (pdpotri) on NO_PE processors.
Communication: ScaLAPACK BLACS calls.
Input/Output: None.

W
Purpose: Calculate the matrix product for each bin:
   $W_b = D^{-1} \, \partial S / \partial C_b$
Calculations: NO_BIN general matrix-matrix multiplications (pdgemm), each on NO_PE/NO_GANG processors.
Communication: If NO_GANG>1 then NO_BIN+1 all-to-gang matrix remappings. ScaLAPACK BLACS calls.
Input/Output: NO_BIN reads, each of O(NO_PIX^2) bytes, on NO_PE processors. NO_BIN writes, each of O(NO_PIX^2) bytes, on NO_PE/NO_GANG processors.

C
Purpose: Calculate the first two derivatives of the likelihood function of the (pseudo)data d:
   $\partial L / \partial C_b = d^T W_b D^{-1} d - \mathrm{Tr}\, W_b$
   $\partial^2 L / \partial C_b \partial C_{b'} = \mathrm{Tr} [ W_b W_{b'} ]$
and the quadratic bin-power correction:
   $dC_b = - (\partial^2 L / \partial C_b \partial C_{b'})^{-1} \, \partial L / \partial C_b$
Calculations: 1 symmetric matrix-vector multiplication (pdsymv) over NO_PE/NO_GANG processors. NO_BIN general matrix-vector multiplications (pdgemv), each on NO_PE/NO_GANG processors. NO_BIN matrix transpositions (pdtran), each on NO_PE/NO_GANG processors. 1 symmetric triangular solve (dpotrs) on 1 processor.
Communication: If NO_GANG>1 then O(NO_BIN^2) inter-gang matrix transfers. ScaLAPACK BLACS calls.
Input/Output: O(NO_BIN^2/NO_GANG) reads, each of O(NO_PIX^2) bytes, on NO_PE/NO_GANG processors.

Note that in IO mode:
   (i) the component functions replace their calculation and communication with busy-work.
   (ii) the D function is skipped entirely.
   (iii) the C function performs only NO_BIN/NO_GANG reads, each of O(NO_PIX^2) bytes, on NO_PE/NO_GANG processors.
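
The read and write counts given for each component function can be tabulated with a short Python sketch. This is an order-of-magnitude illustration only: the function name `io_operation_counts` is hypothetical, the O(...) expressions are evaluated with a constant factor of 1, and the counts for the C function use integer division, all of which are assumptions.

```python
def io_operation_counts(no_bin: int, no_gang: int, io_mode: bool = False):
    """Leading-order numbers of matrix reads and writes per component function."""
    if io_mode:
        c_reads = no_bin // no_gang        # IO mode: NO_BIN/NO_GANG reads
    else:
        c_reads = no_bin ** 2 // no_gang   # regular mode: O(NO_BIN^2/NO_GANG) reads
    return {
        "S": {"reads": 0, "writes": no_bin},
        "D": {"reads": 0, "writes": 0},
        "W": {"reads": no_bin, "writes": no_bin},
        "C": {"reads": c_reads, "writes": 0},
    }


if __name__ == "__main__":
    for name, counts in io_operation_counts(no_bin=4, no_gang=2).items():
        print(name, counts)
```

The sketch makes the difference between the two modes visible: in regular mode the C function dominates the read count as NO_BIN grows, whereas in IO mode its reads scale only linearly with NO_BIN.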

Error checking

All mallocs and IO calls are explicitly checked for success and MADbench2 aborts if any one fails.
In case of failure, the processor ID and attempted action are reported before exiting.

Output

MADbench2 reports the mean, minimum and maximum times spent in calculation/communication, busy-work, reading and writing in each function.

In addition, the first element of the MADspec solution vector is reported to check that the code performed correctly. In regular mode, NO_PIX = 5000 & NO_BIN = 4 should return dC[0] = -9.22431e-01; IO mode always returns dC[0] = 0.00000.
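
A reported dC[0] could be compared against these reference values with a check like the Python sketch below. The function name `solution_ok` and the relative tolerance of 1e-5 (chosen to match the six significant digits quoted above) are assumptions, and the regular-mode reference applies only to NO_PIX = 5000 and NO_BIN = 4.

```python
import math

EXPECTED_DC0 = -9.22431e-01  # regular mode, NO_PIX = 5000 & NO_BIN = 4


def solution_ok(dc0: float, io_mode: bool = False, rel_tol: float = 1e-5) -> bool:
    """Compare a reported dC[0] with the documented reference value."""
    if io_mode:
        return dc0 == 0.0  # IO mode always returns zero
    return math.isclose(dc0, EXPECTED_DC0, rel_tol=rel_tol)


if __name__ == "__main__":
    print(solution_ok(-0.922431))             # matches the regular-mode reference
    print(solution_ok(0.0, io_mode=True))     # matches the IO-mode reference
```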
