Roofline Performance Model
Roofline is a visually intuitive performance model used to bound the performance of various numerical methods and operations running on multicore, manycore, or accelerator processor architectures. Rather than simply using percent-of-peak estimates, the model can be used to assess the quality of attained performance by combining locality, bandwidth, and different parallelization paradigms into a single performance figure. One can examine the resultant Roofline figure in order to determine both the implementation and inherent performance limitations.
The core parameter behind the Roofline model is Arithmetic Intensity. Arithmetic Intensity is the ratio of total floating-point operations to total data movement (bytes). A BLAS-1 vector-vector increment ( x[i]+=y[i] ) in double precision has a very low arithmetic intensity of 0.0417 (N flops / 24N bytes: read x, read y, write x), and this is independent of the vector size. Conversely, FFTs perform 5*N*logN flops for an N-point double complex transform. If performed out of place on a write-allocate cache architecture, the transform moves at least 48N bytes. As such, FFTs have an arithmetic intensity of 0.104*logN, which grows slowly with data size. Unfortunately, cache capacities limit FFT arithmetic intensity to perhaps 2 flops per byte. Finally, BLAS-3 and N-Body Particle-Particle methods have an arithmetic intensity that grows very quickly with problem size.
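The two worked figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 8-byte double-precision words and a base-2 logarithm in the FFT flop count; the function names are illustrative and not part of any Roofline tool.

```python
import math


def ai_vector_increment(word_bytes=8):
    """Arithmetic intensity of x[i] += y[i].

    One flop per element; traffic is read x, read y, write x (3 words).
    """
    flops_per_element = 1.0
    bytes_per_element = 3 * word_bytes
    return flops_per_element / bytes_per_element


def ai_fft(n):
    """Arithmetic intensity of an N-point double complex FFT.

    Uses the 5*N*log2(N) flop count and the 48*N byte lower bound for an
    out-of-place transform on a write-allocate cache architecture.
    Cache capacity effects are ignored here.
    """
    flops = 5.0 * n * math.log2(n)
    bytes_moved = 48.0 * n
    return flops / bytes_moved


print(f"vector increment: {ai_vector_increment():.4f} flops/byte")
for n in (2 ** 10, 2 ** 20):
    print(f"FFT, N = {n}: {ai_fft(n):.3f} flops/byte")
```

Because the FFT expression depends only on logN, doubling the transform size adds a constant 0.104 flops per byte rather than doubling the intensity.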
The most basic Roofline model can be used to bound Floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity.
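Written out, this bound is the smaller of two limits: the compute peak, and the rate at which memory can feed the computation. The symbol names below are notation chosen here for illustration, with performance in GFlop/s, bandwidth in GB/s, and arithmetic intensity (AI) in flops per byte.

```latex
P_{\text{attainable}} \;=\; \min\left( P_{\text{peak}},\; B_{\text{peak}} \times \mathrm{AI} \right)
```

The two limits meet at $\mathrm{AI} = P_{\text{peak}} / B_{\text{peak}}$, often called the machine balance.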
One can visualize the Roofline model by plotting the performance bound (GFlop/s) as a function of Arithmetic Intensity. The resultant curve can be viewed as a performance envelope under which kernel or application performance exists.
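As a rough sketch of how such a curve can be generated, the snippet below takes the bound to be the minimum of peak performance and peak bandwidth times arithmetic intensity, then samples it at power-of-two intensities. The machine numbers (1000 GFlop/s, 100 GB/s) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def roofline_bound(ai, peak_gflops, peak_gbs):
    """Attainable GFlop/s: the lower of the compute and bandwidth limits."""
    return min(peak_gflops, peak_gbs * ai)


def machine_balance(peak_gflops, peak_gbs):
    """Arithmetic intensity (flops/byte) at which the two limits meet."""
    return peak_gflops / peak_gbs


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 1000.0
PEAK_GBS = 100.0

# Sample the envelope at intensities from 1/16 to 64 flops/byte.
curve = [(2.0 ** e, roofline_bound(2.0 ** e, PEAK_GFLOPS, PEAK_GBS))
         for e in range(-4, 7)]

for ai, gflops in curve:
    print(f"AI = {ai:8.4f} flops/byte -> bound = {gflops:7.1f} GFlop/s")
```

Plotted on log-log axes, these points form the familiar slanted bandwidth line that flattens into the compute peak at the machine balance.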
Effects of NUMA on Memory Bandwidth
Modern SMPs and GPU-accelerated systems will present non-uniform memory access. Depending on the locality of data and the placement of threads, memory bandwidth can vary dramatically. The following example highlights the impact of first-touch allocation of data in OpenMP on bandwidth. In effect, the resultant lower bandwidth can depress performance for virtually any attainable arithmetic intensity.
Effects of Cache Behavior on Arithmetic Intensity
The Roofline model requires an estimate of total data movement. On cache-based architectures, the 3C's cache model highlights the fact that there can be more than simply compulsory data movement. Cache capacity and conflict misses can increase data movement and reduce arithmetic intensity. Similarly, superfluous cache write-allocations can result in a doubling of data movement. The vector initialization operation x[i]=0.0 demands one write allocate and one write back per cache line touched. The write allocate is superfluous as all elements of that cache line are to be overwritten. Unfortunately, the presence of hardware stream prefetchers can make it very difficult to quantify how much data movement beyond compulsory actually occurred.
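The write-allocate accounting can be made concrete with a small traffic model. This is a simplified sketch that counts only compulsory traffic plus write-allocate fills for unit-stride streams of 8-byte words; it ignores capacity misses, conflict misses, and prefetcher effects, and the function name and parameters are illustrative.

```python
def traffic_bytes_per_element(read_only=0, read_write=0, write_only=0,
                              word_bytes=8, write_allocate=True):
    """Estimate memory traffic per loop iteration for streaming arrays.

    read_only  : arrays that are only read (one load each)
    read_write : arrays that are read then written (load + write back)
    write_only : arrays that are only written; with write-allocate the
                 line is first filled from memory, then written back
    """
    words = read_only + 2 * read_write
    words += write_only * (2 if write_allocate else 1)
    return words * word_bytes


# x[i] = 0.0 : one write-only array.
with_fill = traffic_bytes_per_element(write_only=1)
without_fill = traffic_bytes_per_element(write_only=1, write_allocate=False)
print(f"x[i] = 0.0   : {with_fill} bytes/element with write-allocate, "
      f"{without_fill} without")

# x[i] += y[i] : y is read-only, x is read and written.
increment = traffic_bytes_per_element(read_only=1, read_write=1)
print(f"x[i] += y[i] : {increment} bytes/element")
```

In this model the initialization loop moves twice its compulsory traffic, which is the doubling described above.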
Instruction-Level Parallelism and Performance
Modern processor architectures are deeply pipelined. Although this can increase frequency and peak performance, it also increases the latency of various instructions. In order to avoid bubbles in the pipeline and attain peak performance, the programmer and compiler must collaborate to ensure that independent instructions are issued in sequence (instruction-level parallelism, or ILP). A lack of ILP can depress performance on sufficiently compute-intensive kernels. Conversely, on memory-intensive operations, a lack of ILP may not impede performance. The example below highlights this in a contrived summation example in which partial sums are constructed and summed at the conclusion of the loop.
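The structure of such a partial-sum reduction can be sketched as follows. Python is used only to show the dependency pattern: an interpreter will not exhibit the pipeline effect, which appears in compiled code where each floating-point add must wait for the previous one in the same chain. The function names and the choice of four accumulators are assumptions made for illustration.

```python
def sum_single_chain(values):
    """One accumulator: every add depends on the result of the previous add."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_partial(values, lanes=4):
    """Several accumulators: adds in different lanes are independent."""
    partial = [0.0] * lanes
    n_main = len(values) - len(values) % lanes

    # Main loop: each lane forms its own dependency chain.
    for i in range(0, n_main, lanes):
        for k in range(lanes):
            partial[k] += values[i + k]

    # Combine the partial sums at the conclusion of the loop,
    # then add any leftover elements.
    total = 0.0
    for p in partial:
        total += p
    for v in values[n_main:]:
        total += v
    return total


data = [float(i) for i in range(1003)]
print(sum_single_chain(data), sum_partial(data))
```

Note that reordering floating-point additions can change rounding, so the two functions agree exactly only when every intermediate sum is representable, as it is for the small integer values used here.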
Data-Level Parallelism and Performance
Data-level parallelism (vectorization, SIMDization, etc.) has become a very attractive approach to maximizing performance and energy efficiency. Unfortunately, attainable performance is highly dependent on the compiler's or programmer's ability to exploit these instructions. For high arithmetic intensity, a lack of efficient SIMDization can depress performance. However, for low arithmetic intensities, the impact on performance may be negligible.
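One way to picture this is as a lowered compute ceiling beneath the roofline. The sketch below assumes, as a simplification, that compute throughput scales with the fraction of SIMD lanes actually used; the machine numbers and the 8-lane width are hypothetical.

```python
def bound_with_simd_ceiling(ai, peak_gflops, peak_gbs,
                            simd_width=8, lanes_used=8):
    """Roofline bound with the compute ceiling scaled by SIMD utilization."""
    compute_ceiling = peak_gflops * lanes_used / simd_width
    return min(compute_ceiling, peak_gbs * ai)


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 1000.0
PEAK_GBS = 100.0

for ai in (0.5, 32.0):
    full = bound_with_simd_ceiling(ai, PEAK_GFLOPS, PEAK_GBS)
    scalar = bound_with_simd_ceiling(ai, PEAK_GFLOPS, PEAK_GBS, lanes_used=1)
    print(f"AI = {ai:5.1f}: vectorized bound = {full:7.1f} GFlop/s, "
          f"scalar bound = {scalar:7.1f} GFlop/s")
```

At the low intensity both bounds sit on the bandwidth line and are identical; only at the high intensity does the scalar ceiling cut into attainable performance.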
Run Time vs Arithmetic Intensity
Rather than viewing Roofline as performance as a function of arithmetic intensity, one can use the model to understand the relationship between run time and arithmetic intensity. To do so, one must use the data set size to convert performance (per element) into run time for all elements. In the example below, we show that run time is independent of the degree of the polynomial until one reaches the machine balance. Beyond that point, time to solution should increase linearly with the degree of the polynomial. Researchers should view this as an opportunity: one can change the algorithm (e.g., move to a higher-order method and attain better error) without affecting run time.
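The same bound can be restated as a run time: the larger of the time needed to execute the flops at peak and the time needed to move the bytes at peak bandwidth. The sketch below applies this to polynomial evaluation, assuming Horner's rule (2 flops per degree per element) and 16 bytes of traffic per element (one 8-byte read, one 8-byte write, no write-allocate). The machine numbers are hypothetical.

```python
def run_time_seconds(n_elements, flops_per_element, bytes_per_element,
                     peak_gflops, peak_gbs):
    """Roofline run-time bound: max of compute time and memory time."""
    compute_time = n_elements * flops_per_element / (peak_gflops * 1e9)
    memory_time = n_elements * bytes_per_element / (peak_gbs * 1e9)
    return max(compute_time, memory_time)


def polynomial_run_time(degree, n_elements, peak_gflops, peak_gbs,
                        bytes_per_element=16.0):
    """Run-time bound for evaluating a degree-d polynomial per element."""
    flops_per_element = 2.0 * degree  # Horner: one multiply + one add per degree
    return run_time_seconds(n_elements, flops_per_element, bytes_per_element,
                            peak_gflops, peak_gbs)


# Hypothetical machine: balance = 1000 / 100 = 10 flops/byte, so with
# 16 bytes/element the crossover is at 160 flops/element (degree 80).
N = 10 ** 9
for degree in (10, 40, 80, 160, 320):
    t = polynomial_run_time(degree, N, 1000.0, 100.0)
    print(f"degree {degree:3d}: {t:.3f} s")
```

Below the crossover degree the memory time dominates and the bound is flat; above it the bound grows in proportion to the degree.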
The Roofline Toolkit is a set of tools designed to assist in software development and optimization by providing detailed, accurate information about machine characteristics, the current location of a given code or kernel on the roofline graph, and code analysis and characterization that informs the software developer about possible opportunities for optimization. The current public release is available via Bitbucket.
Empirical Roofline Tool (ERT)
The Empirical Roofline Tool, ERT, empirically determines the machine characteristics that are needed to generate the roofline graph and do roofline analysis. These characteristics are the bandwidth of each level in the memory hierarchy and the maximum GFlop/s rate. This tool is needed for several reasons:
- It is time consuming and very difficult, if not impossible, to estimate the machine characteristics needed for roofline analysis.
- Even if the machine characteristics can be estimated, these are theoretical maximums and there may exist no code that can achieve them.
- The theoretical maximums give a developer no guidance as to what code achieves maximum performance, what types of parallelism are needed, which compiler(s) to use, what options to use, and how to run code(s) optimally.
The Roofline Visualizer can be used to view the results generated locally by the Empirical Roofline Tool or a remote Roofline repository.
Please contact Terry Ligocki with any questions about the Roofline Toolkit.
The following represents a core list of Roofline-related publications. They can provide a more in-depth discussion of the theory, application, and nuances associated with using the Roofline model.
Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016
- Download File: PMBS16-KNL.pdf (pdf: 789 KB)
Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699
Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2
Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016
- Download File: ixpug16-roofline.pdf (pdf: 575 KB)
Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12
Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7
- Download File: PMBS14-Roofline.pdf (pdf: 340 KB)
H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125
- Download File: ipdps14mfdnfinal.pdf (pdf: 631 KB)
S. Williams, "The Roofline Model", chapter in Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010)
S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009, doi: 10.1145/1498765.1498785
S. Williams, "Auto-tuning Performance on Multicore Computers", PhD thesis, 2008
S. Williams, et al., "The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures", Hot Chips 20, August 10, 2008
- Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)