MADbench2
MADbench2 is a tool for testing the integrated performance of the I/O, communication and calculation subsystems of massively parallel architectures under the stresses of a real scientific application.
MADbench2 is based on the MADspec code, which calculates the maximum likelihood angular power spectrum of the Cosmic Microwave Background (CMB) radiation from a noisy pixelized map of the sky and its pixel-pixel noise correlation matrix.
MADbench2 retains the full computational complexity of its parent scientific application code, but uses self-generated pseudo-data so that the myriad computationally irrelevant details associated with handling real CMB datasets can be bypassed.
MADbench2 can be run in two modes:
 regular mode, in which the full code is run.
 IO mode, in which all calculation/communication is replaced with busywork.
In addition, MADbench2 can be run as single-gang or multi-gang. In single-gang runs all the matrix operations are carried out distributed over all of the processors. In multi-gang runs the matrices are built, summed and inverted over all the processors (the S and D functions), but are then redistributed over subsets of processors (gangs) for their subsequent manipulations (the W and C functions). This gang-parallelism allows the data to be dense on the processors for the dominant matrix-matrix multiplication (W) phase even with very large numbers of processors.
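To illustrate why gang-parallelism keeps the data dense, the following minimal sketch counts how many elements of one NO_PIX x NO_PIX matrix each processor holds when the matrix is spread evenly over a gang. The function name and the example values (64 processors, 4 gangs) are illustrative assumptions, not taken from the MADbench2 source.

```python
def elements_per_processor(no_pix: int, no_pe: int, no_gang: int) -> int:
    """Elements of one NO_PIX x NO_PIX matrix held by each processor of a gang.

    Assumes the matrix is spread evenly over the NO_PE / NO_GANG processors
    of a single gang (NO_GANG = 1 is the single-gang case).
    """
    pe_per_gang = no_pe // no_gang  # processors in each gang
    return (no_pix * no_pix) // pe_per_gang


single = elements_per_processor(5000, 64, 1)  # matrix spread over all 64 processors
multi = elements_per_processor(5000, 64, 4)   # matrix spread over one gang of 16
print(single, multi)
```

With four gangs each processor holds four times as many elements of a given matrix as in the single-gang case, which is the density gain described above.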
Compiling MADbench2
To run in regular mode, MADbench2 needs to be linked to the ScaLAPACK and LAPACK libraries and their dependencies (BLAS, PBLAS, BLACS). The MADbench2.h file contains system-specific definitions and declarations; this file should be augmented as needed and the code compiled with -D SYSTEM.
To run in IO mode, MADbench2 should be compiled with -D IO (in addition to -D SYSTEM), whereupon all of the library calls are redefined to busywork so that none of the libraries are needed.
Running MADbench2
MADbench2 is run as
where
NO_PIX
Sets the size of the pseudo-data: all the component matrices have NO_PIX x NO_PIX elements.
NO_BIN
Sets the size of the pseudo-dataset: there are NO_BIN component matrices.
NO_GANG
Sets the level of gang-parallelism: there are NO_GANG gangs.
SBLOCKSIZE
Sets the ScaLAPACK blocksize: all matrices will be block-cyclically distributed in square blocks of side SBLOCKSIZE.
FBLOCKSIZE
Sets the file blocksize: all IO will start at a file offset that is an integer multiple of FBLOCKSIZE.
RMOD
Sets the degree of simultaneous reading: 1 in RMOD processors will read at once.
WMOD
Sets the degree of simultaneous writing: 1 in WMOD processors will write at once.
Running MADbench2 requires:
 a square number of processors
 a uniform square number of processors per gang
 a uniform number of bins per gang
 a ScaLAPACK blocksize that distributes some data to every processor
 a file blocksize that is a whole number of doubles
 a number of gangs that is exactly divisible by the read modulus (RMOD) and the write modulus (WMOD)
each of which is checked on initialization.
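The requirements above can be sketched as a pre-flight check. This is a hypothetical helper, not the initialization code of MADbench2 itself: the function name, the argument order, the 8-byte size of a double, and the test used for "distributes some data to every processor" (at least as many blocks along one matrix side as the side of the full processor grid) are all assumptions.

```python
import math


def check_run_parameters(no_pe, no_pix, no_bin, no_gang,
                         sblocksize, fblocksize, rmod, wmod):
    """Return a list of violated requirements (empty if the run looks valid)."""

    def is_square(n):
        return n > 0 and math.isqrt(n) ** 2 == n

    problems = []
    if not is_square(no_pe):
        problems.append("number of processors is not square")
    if no_pe % no_gang != 0 or not is_square(no_pe // no_gang):
        problems.append("processors per gang is not a uniform square number")
    if no_bin % no_gang != 0:
        problems.append("bins per gang is not uniform")
    # Assumption: every processor receives data if the number of blocks along
    # one matrix side is at least the side of the full processor grid.
    blocks_per_side = -(-no_pix // sblocksize)  # ceiling division
    if blocks_per_side < math.isqrt(no_pe):
        problems.append("ScaLAPACK blocksize leaves some processors empty")
    # Assumption: a double occupies 8 bytes.
    if fblocksize % 8 != 0:
        problems.append("file blocksize is not a whole number of doubles")
    if no_gang % rmod != 0 or no_gang % wmod != 0:
        problems.append("number of gangs is not divisible by RMOD and WMOD")
    return problems


print(check_run_parameters(64, 5000, 8, 4, 32, 1048576, 1, 1))
print(check_run_parameters(60, 5000, 8, 4, 32, 1048576, 3, 1))
```

The first call describes a valid configuration under these assumptions; the second fails the processor-count and read-modulus requirements.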
In addition, MADbench2 requires 5 x NO_PIX^{2} x 8 bytes of memory per gang.
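As a worked example of the memory requirement, the short sketch below evaluates the formula for a given NO_PIX. The function name is hypothetical; the formula itself is the one stated above.

```python
def memory_per_gang_bytes(no_pix: int) -> int:
    """Memory needed per gang: 5 x NO_PIX^2 x 8 bytes."""
    return 5 * no_pix ** 2 * 8


print(memory_per_gang_bytes(5000))  # 1000000000 bytes, i.e. 1 GB per gang
```

For NO_PIX = 5000 each gang therefore needs about 1 GB in total across its processors.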
Environment Variables
Variable    Allowed Values       Default
IOMETHOD    POSIX, MPI           POSIX
IOMODE      SYNC, ASYNC          SYNC
FILETYPE    UNIQUE, SHARED       UNIQUE
REMAP       CUSTOM, SCALAPACK    CUSTOM
BWEXP       Any number           None
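The defaults and allowed values listed here can be summarized in a small sketch that resolves the settings from an environment mapping. This is an illustration only: the function name is hypothetical, and raising an error on a disallowed value is an assumption, since how MADbench2 itself reacts to an invalid setting is not stated.

```python
import os

# Allowed values and defaults as listed for the environment variables.
ALLOWED = {
    "IOMETHOD": ("POSIX", "MPI"),
    "IOMODE": ("SYNC", "ASYNC"),
    "FILETYPE": ("UNIQUE", "SHARED"),
    "REMAP": ("CUSTOM", "SCALAPACK"),
}
DEFAULTS = {"IOMETHOD": "POSIX", "IOMODE": "SYNC",
            "FILETYPE": "UNIQUE", "REMAP": "CUSTOM"}


def resolve_settings(env=None):
    """Return the effective settings, applying defaults for unset variables."""
    env = os.environ if env is None else env
    settings = {}
    for name, allowed in ALLOWED.items():
        value = env.get(name, DEFAULTS[name])
        if value not in allowed:
            # Assumption: an invalid value is treated as an error.
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
        settings[name] = value
    raw_bwexp = env.get("BWEXP")  # any number; no default
    settings["BWEXP"] = float(raw_bwexp) if raw_bwexp is not None else None
    return settings


print(resolve_settings({"IOMETHOD": "MPI", "BWEXP": "1.5"}))
```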
NOTES
(i) The remap options are only used in multi-gang mode; the custom remap is provided for cases where the ScaLAPACK remap function (pdgemr2d) performs poorly or fails.
(ii) The busywork exponent, BWEXP, sets the amount of busywork done as a function of the IO data size. For N data elements written/read the busywork function will perform N^{BWEXP} floating-point operations.
If the IO data are thought of as matrices, as in the parent application, then
BWEXP = 1.0 corresponds to level 2 BLAS
BWEXP = 1.5 corresponds to level 3 BLAS
If the IO data are thought of as vectors then
BWEXP = 1.0 corresponds to level 1 BLAS
BWEXP = 2.0 corresponds to level 2 BLAS
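The correspondence between BWEXP and BLAS level follows from counting operations in terms of the side length n. A minimal sketch, using a hypothetical function name and an illustrative n = 1000:

```python
def busywork_flops(n_elements: int, bwexp: float) -> float:
    """Floating-point operations performed for N data elements: N^BWEXP."""
    return float(n_elements) ** bwexp


n = 1000

# Data viewed as an n x n matrix: N = n^2 elements.
matrix_elements = n * n
print(busywork_flops(matrix_elements, 1.0))  # ~n^2 operations, like level 2 BLAS
print(busywork_flops(matrix_elements, 1.5))  # ~n^3 operations, like level 3 BLAS

# Data viewed as a vector of length n: N = n elements.
vector_elements = n
print(busywork_flops(vector_elements, 1.0))  # ~n operations, like level 1 BLAS
print(busywork_flops(vector_elements, 2.0))  # ~n^2 operations, like level 2 BLAS
```

In the matrix view N = n^{2}, so N^{1.5} = n^{3}, the cost of a matrix-matrix multiplication; in the vector view N = n, so N^{2.0} = n^{2}, the cost of a matrix-vector multiplication.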
Component Functions
S
Purpose: Build the signal correlation matrix as the weighted sum of the signal derivative matrices with respect to the bin-powers: S = sum_{b} C_{b} dS/dC_{b}
Calculations: Legendre polynomial recursion. Weighted summation.
Communication: None.
Input/Output: NO_BIN writes each of O(NO_PIX^{2}) bytes on NO_PE processors.

D
Purpose: Build the data correlation matrix from the signal and (pseudo-)noise matrices and invert it: D^{-1} = (S + N)^{-1}
Calculations: 1 Cholesky decomposition (pdpotrf) and matrix inversion (pdpotri) on NO_PE processors.
Communication: ScaLAPACK BLACS calls.
Input/Output: None.

W
Purpose: Calculate the matrix product for each bin: W_{b} = D^{-1} dS/dC_{b}
Calculations: NO_BIN general matrix-matrix multiplications (pdgemm) each on NO_PE/NO_GANG processors.
Communication: If NO_GANG>1 then NO_BIN+1 all-to-gang matrix remappings. ScaLAPACK BLACS calls.
Input/Output: NO_BIN reads each of O(NO_PIX^{2}) bytes on NO_PE processors. NO_BIN writes each of O(NO_PIX^{2}) bytes on NO_PE/NO_GANG processors.

C
Purpose: Calculate the first two derivatives of the likelihood function of the (pseudo-)data d,
  dL/dC_{b} = d^{T} W_{b} D^{-1} d - Tr[W_{b}]
  d^{2}L/dC_{b}dC_{b'} = Tr[W_{b} W_{b'}]
and the quadratic bin-power correction
  dC_{b} = - [d^{2}L/dC_{b}dC_{b'}]^{-1} dL/dC_{b}
Calculations: 1 symmetric matrix-vector multiplication (pdsymv) over NO_PE/NO_GANG processors. NO_BIN general matrix-vector multiplications (pdgemv) each on NO_PE/NO_GANG processors. NO_BIN matrix transpositions (pdtran) each on NO_PE/NO_GANG processors. 1 symmetric triangular solve (dpotrs) on 1 processor.
Communication: If NO_GANG>1 then O(NO_BIN^{2}) inter-gang matrix transfers. ScaLAPACK BLACS calls.
Input/Output: O(NO_BIN^{2}/NO_GANG) reads each of O(NO_PIX^{2}) bytes on NO_PE/NO_GANG processors.
Note that in IO mode:
(i) the component functions replace their calculation and communication with busywork.
(ii) the D function is skipped entirely.
(iii) the C function performs only NO_BIN/NO_GANG reads each of O(NO_PIX^{2}) bytes on NO_PE/NO_GANG processors.
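The read and write counts quoted for each component function can be tallied in a short sketch. The function name is hypothetical, and the count for C in regular mode takes the leading constant of the O(NO_BIN^{2}/NO_GANG) estimate to be 1, which is an assumption made purely for illustration.

```python
def io_operations(no_bin: int, no_gang: int, io_mode: bool = False) -> dict:
    """Approximate read/write counts per component function.

    Each operation moves O(NO_PIX^2) bytes. In IO mode the D function is
    skipped entirely and C performs only NO_BIN/NO_GANG reads.
    """
    if io_mode:
        c_reads = no_bin // no_gang
    else:
        # Assumption: leading constant of O(NO_BIN^2 / NO_GANG) taken as 1.
        c_reads = no_bin ** 2 // no_gang
    return {
        "S": {"reads": 0, "writes": no_bin},
        "D": {"reads": 0, "writes": 0},
        "W": {"reads": no_bin, "writes": no_bin},
        "C": {"reads": c_reads, "writes": 0},
    }


print(io_operations(8, 4))                # regular mode
print(io_operations(8, 4, io_mode=True))  # IO mode
```

With NO_BIN = 8 and NO_GANG = 4 the C function goes from roughly 16 reads in regular mode to 2 reads in IO mode, while the S and W counts are unchanged.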
Error checking
All mallocs and IO calls are explicitly checked for success and MADbench2 aborts if any one fails.
In case of failure, the processor ID and attempted action are reported before exiting.
Output
MADbench2 reports the mean, minimum and maximum times spent in calculation/communication, busywork, reading and writing in each function.
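A minimal sketch of this kind of summary, assuming the statistics are taken over one timing per processor (the function name and sample values are hypothetical):

```python
def summarize_times(times):
    """Mean, minimum and maximum of a list of per-processor timings (seconds)."""
    return {
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
    }


# Hypothetical read times for one function on four processors.
print(summarize_times([1.0, 2.0, 3.0, 2.0]))
```

A large gap between the minimum and maximum would indicate that some processors wait much longer than others in that phase.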
In addition, the first element of the MADspec solution vector is reported to check that the code performed correctly. In regular mode, NO_PIX = 5000 and NO_BIN = 4 should return dC[0] = 9.22431e-01; IO mode always returns dC[0] = 0.00000.
Downloads

MADbench2.tar MADbench2 code tarball