Empirical Roofline Tool (ERT)
The Empirical Roofline Tool (ERT) empirically determines the characteristics of a machine (CPU or GPU-accelerated) that are needed to generate its roofline model. These characteristics are the bandwidth of each level in the memory hierarchy and the maximum GFLOP/s rate. This tool is needed for several reasons:
- It is time-consuming and very difficult, if not impossible, to estimate the machine characteristics needed for roofline analysis.
- Even if the machine characteristics can be estimated, the estimates are theoretical maximums, and there may be no code that can achieve them.
- The theoretical maximums give a developer no guidance as to what code achieves maximum performance, what types of parallelism are needed, which compiler(s) to use, what options to use, and how to run code(s) optimally.
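To make concrete how the two kinds of measured characteristics are used, here is a minimal sketch of the standard roofline bound: attainable GFLOP/s is the smaller of the peak compute rate and arithmetic intensity times bandwidth. The function name `roofline_gflops` and all numbers are hypothetical placeholders chosen for illustration; they are not ERT output, and the same arithmetic intensity is assumed at every memory level for simplicity.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidths_gbs):
    """Return the attainable GFLOP/s bound for each memory level.

    arithmetic_intensity: FLOPs performed per byte moved.
    peak_gflops: measured maximum GFLOP/s of the machine.
    bandwidths_gbs: dict mapping a memory level name to its measured GB/s.
    """
    return {
        level: min(peak_gflops, arithmetic_intensity * bandwidth)
        for level, bandwidth in bandwidths_gbs.items()
    }


# Placeholder values for illustration only (not measurements from any machine).
example_bandwidths = {"L1": 1000.0, "DRAM": 100.0}
bounds = roofline_gflops(0.5, peak_gflops=200.0, bandwidths_gbs=example_bandwidths)
print(bounds)
```

With these placeholder values, the L1 bound is capped by the compute peak, while the DRAM bound is limited by bandwidth.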
Below is an example from running ERT on NERSC's Cori (single 68-core Knights Landing (KNL) manycore processor). Note that ERT always labels the first level of memory as L1 and the last level as DRAM. As such, in quadflat mode with numactl -m1, ERT labels the MCDRAM memory as 'DRAM'. Keep in mind that KNL MCDRAM bandwidth depends on array size, clustering mode (quadrant, SNC2, SNC4), and whether the MCDRAM is configured as cache or as memory. Larger problem sizes have been observed to attain higher bandwidth (over 450 GB/s), while -qopt-streaming-stores=never has been observed to be beneficial in quadcache mode.
Similarly, we can compile ERT for CUDA and run it on ORNL's SummitDev (4 x NVIDIA P100 GPUs). As ERT uses a read-modify-write kernel, it is nominally cached at the GPU L2 (L1 is invisible). Nevertheless, ERT labels the first cache it sees (GPU L2) as L1. Moreover, as with KNL, it labels the last level of memory it detects (HBM on P100) as DRAM. With 4 GPUs, each SummitDev node has nearly 2 TB/s of aggregate sustained HBM bandwidth.
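The read-modify-write pattern mentioned above can be illustrated with a small sketch. This is a hypothetical model of the general pattern, not ERT's actual kernel source: the function name `read_modify_write`, the constant `alpha`, and the byte accounting (one 8-byte read plus one 8-byte write per element, as for a C array of doubles) are all assumptions made for illustration.

```python
def read_modify_write(data, flops_per_element):
    """Update every element in place; return the (flops, bytes) counted.

    Each element is read once, updated with multiply-add steps, and
    written back once. Byte counts assume 8-byte doubles.
    """
    alpha = 0.5
    steps = flops_per_element // 2       # each step is 1 multiply + 1 add
    for i in range(len(data)):
        x = data[i]                      # read
        for _ in range(steps):
            x = x * alpha + alpha        # modify: 2 FLOPs per step
        data[i] = x                      # write
    flops = len(data) * steps * 2
    bytes_moved = len(data) * 2 * 8      # one read and one write per element
    return flops, bytes_moved


values = [1.0] * 1024
flops, bytes_moved = read_modify_write(values, flops_per_element=8)
intensity = flops / bytes_moved          # FLOPs per byte
print(flops, bytes_moved, intensity)
```

Raising `flops_per_element` increases the arithmetic intensity without changing the memory traffic, which is how a kernel of this shape can sweep from bandwidth-bound to compute-bound behavior.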
Please contact Charlene Yang with any questions about the Roofline Toolkit.