Auto-tuning
Automatic Performance Tuning or “Auto-tuning,” is an empirical, feedback-driven performance optimization technique designed to maximize performance across a wide variety of architectures without sacrificing portability or productivity. Over the years, auto-tuning has expanded from simple loop tiling and unroll-and-jam to encompass transformations to data structures, parallelization, and algorithmic parameters.
Perhaps the genesis of auto-tuning was PHiPAC (Portable High Performance ANSI C) and eventually ATLAS. When optimizing Matrix-Matrix multiplication, one often cache-blocks the computation to maximize locality in each level of the cache hierarchy. This transforms the reference three-nested loop example into a 6-deep loop nest. Rather than hard coding the tiling, researchers observed they could simply leverage the compute capabilities of modern processors to search over all possibilities to automatically find the fastest configuration for the target machine.
Subsequently, auto-tuning was applied to sparse linear algebra, principally Sparse Matrix-Vector Multiplication (SpMV). Unlike dense matrices, most elements of a sparse matrix are zero. As such, most data structures only encode the nonzeros. Indexing into this data structure can impair in-core performance and increase data movement. As a result, SpMV performance often falls below its Roofline bound. To ameliorate this, one can select alternate data structures on a matrix by matrix or cache block by cache block basis. Alternate data structures can amortize metadata, and increase instruction- and data-level parallelism. As a result, one can improve performance. Rather than requiring the user to select and implement a format, the Optimized Sparse Kernel Interface (OSKI) attempted to auto-tune the process.
Research Topics
Broadly speaking, auto-tuning is premised on three fundamental research topics:
- creating novel application- and architecture-relevant optimizations and their associated parameter space
- automating the generation of code for this optimization space
- efficiently searching this optimization space (a mathematical optimization problem).
At LBL, our research has primarily focused on researching and prototyping novel optimizations and automating the code generation process. This has been applied in several projects beginning with an LDRD, followed by our original X-Stack auto-tuning project, and finally SUPER and our current X-Stack project: X-Tune.
Researchers
Publications
2022
Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022,
- Download File: MCHPC22_final.pdf (pdf: 401 KB)
2018
Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,
- Download File: p3hpc-bricks-final.pdf (pdf: 1.3 MB)
2017
Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,
- Download File: hips17-snowflake.pdf (pdf: 475 KB)
2015
Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,
- Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)
2014
Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,
- Download File: wosc14chill.pdf (pdf: 973 KB)
2013
Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,
- Download File: hipc13chill.pdf (pdf: 989 KB)
James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376
2011
S. Williams, et al., Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning, Supercomputing (SC), 2011,
- Download File: sc11-lbmhd-talk.pptx (pptx: 933 KB)
S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th Internation Conference on Space Mission Challenges for Information Technology, 2011,
- Download File: smc11-lbnl-talk.pptx (pptx: 3 MB)
David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)
Samuel Williams, Oliker, Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458
- Download File: sc11-lbmhd.pdf (pdf: 666 KB)
- Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)
2010
S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, ( 2010)
S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,
- Download File: peeps10-lbmhd-talk.pdf (pdf: 1.2 MB)
- Download File: peeps10-lbmhd-talk.pptx (pptx: 1.3 MB)
A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415
- Download File: ipdps10-fmm.pdf (pdf: 671 KB)
K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, ( 2010) Pages: 219--254 doi: 10.1201/b10376
Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421
- Download File: ipdps10-ast.pdf (pdf: 789 KB)
S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman \& Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14
2009
"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,
Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,
- Download File: cug09-autotune.pdf (pdf: 354 KB)
Best Paper Award
S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,
- Download File: cug09-lbmhd.pdf (pdf: 443 KB)
S. Williams, et al., A Generalized Framework for Auto-tuning Stencil Computations, Cray User Group (CUG), 2009,
- Download File: cug09-ast-talk.pdf (pdf: 835 KB)
- Download File: cug09-ast-talk.pptx (pptx: 814 KB)
S. Williams, et al., Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4, Cray User Group (CUG), 2009,
- Download File: cug09-hybridLBMHD-talk.pdf (pdf: 911 KB)
- Download File: cug09-hybridLBMHD-talk.pptx (pptx: 981 KB)
K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,
- Download File: iwapt09-27pt.pdf (pdf: 465 KB)
Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072
- Download File: sc09-cotuning.pdf (pdf: 912 KB)
S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002
- Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)
2008
Auto-tuning Performance on Multicore Computers, Samuel Williams, PhD, 2008,
S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. Thesis Dissertation Talk, University of California at Berkeley, 2008,
- Download File: SWWilliams-Thesis-Talk.pdf (pdf: 9.8 MB)
- Download File: SWWilliams-Thesis-Talk.ppt (ppt: 5 MB)
S. Williams, et al, "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,
- Download File: swwilliams-mundie-poster.pdf (pdf: 866 KB)
S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,
- Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)
S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,
- Download File: scidac08-peri-talk.pdf (pdf: 9.5 MB)
- Download File: scidac08-peri-talk.ppt (ppt: 5.5 MB)
D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,
- Download File: jpconf8125012038.pdf (pdf: 1.2 MB)
S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,
- Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)
S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038
- Download File: jpconf8125012089.pdf (pdf: 874 KB)
K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,
- Download File: parlab08-stencillbmhd-poster.pdf (pdf: 3.6 MB)
K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004
- Download File: sc08-stencil.pdf (pdf: 598 KB)
S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, PERI -- Auto-tuning Memory-intensive Kernels for Multicore, Journal of Physics: Conference Series, Pages: 012038 2008,
S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, Parlab Winter Retreat, 2008,
- Download File: parlab08-spmvstructured-talk.pdf (pdf: 8.1 MB)
- Download File: parlab08-spmvstructured-talk.ppt (ppt: 2.9 MB)
K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,
- Download File: parlab08-structured-poster.pdf (pdf: 1.8 MB)
S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, IEEE International Symposium on Parallel & Distributed Processing (IPDPS)., Pages: 1-14 2008,
- Download File: ipdps08-lbmhd-talk.pdf (pdf: 10 MB)
- Download File: ipdps08-lbmhd-talk.ppt (ppt: 2.6 MB)
S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295
- Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)
2007
S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,
- Download File: hpa07-spmv-talk.pdf (pdf: 7.9 MB)
- Download File: hpa07-spmv-talk.ppt (ppt: 2.7 MB)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674
- Download File: sc07-spmv.pdf (pdf: 438 KB)
S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Supercomputing (SC), 2007,
- Download File: sc07-spmv-talk.pdf (pdf: 6.4 MB)
- Download File: sc07-spmv-talk.ppt (ppt: 2.5 MB)
S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,
- Download File: cscads07-spmv-talk.pdf (pdf: 1.4 MB)
- Download File: cscads07-spmv-talk.ppt (ppt: 754 KB)
S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, Parlab Seminar, 2007,
- Download File: parlab07-spmv-talk.pdf (pdf: 1.2 MB)
- Download File: parlab07-spmv-talk.ppt (ppt: 1.6 MB)
S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674
- Download File: parco08-spmv.pdf (pdf: 1.5 MB)