Skip to navigation Skip to content
Careers | Phone Book | A - Z Index
Performance and Algorithms Research

Auto-tuning

Automatic Performance Tuning or “Auto-tuning,” is an empirical, feedback-driven performance optimization technique designed to maximize performance across a wide variety of architectures without sacrificing portability or productivity. Over the years, auto-tuning has expanded from simple loop tiling and unroll-and-jam to encompass transformations to data structures, parallelization, and algorithmic parameters. 

Perhaps the genesis of auto-tuning was PHiPAC (Portable High Performance ANSI C) and eventually ATLAS. When optimizing Matrix-Matrix multiplication, one often cache-blocks the computation to maximize locality in each level of the cache hierarchy. This transforms the reference three-nested loop example into a 6-deep loop nest. Rather than hard coding the tiling, researchers observed they could simply leverage the compute capabilities of modern processors to search over all possibilities to automatically find the fastest configuration for the target machine.

Subsequently, auto-tuning was applied to sparse linear algebra, principally Sparse Matrix-Vector Multiplication (SpMV).  Unlike dense matrices, most elements of a sparse matrix are zero.  As such, most data structures only encode the nonzeros.  Indexing into this data structure can impair in-core performance and increase data movement.  As a result, SpMV performance often falls below its Roofline bound.  To ameliorate this, one can select alternate data structures on a matrix by matrix or cache block by cache block basis.  Alternate data structures can amortize metadata, and increase instruction- and data-level parallelism.  As a result, one can improve performance.  Rather than requiring the user to select and implement a format, the Optimized Sparse Kernel Interface (OSKI) attempted to auto-tune the process.

Research Topics

Broadly speaking, auto-tuning is premised on three fundamental research topics:

  • creating novel application- and architecture-relevant optimizations and their associated parameter space
  • automating the generation of code for this optimization space
  • efficiently searching this optimization space (a mathematical optimization problem).

At LBL, our research has primarily focused on researching and prototyping novel optimizations and automating the code generation process. This has been applied in several projects beginning with an LDRD, followed by our original X-Stack auto-tuning project, and finally SUPER and our current X-Stack project: X-Tune.

Researchers

Publications

Journal Article

2009

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

2008

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

2007

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Conference Paper

2022

Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022,

2018

Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

2017

Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,

2015

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

2014

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

2013

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

2011

Samuel Williams, Oliker, Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

2010

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

2009

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Best Paper Award

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

2008

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

2007

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Book

2011

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

Book Chapter

2013

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

2010

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, ( 2010)

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, ( 2010) Pages: 219--254 doi: 10.1201/b10376

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman \& Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

Presentation/Talk

2011

S. Williams, et al., Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning, Supercomputing (SC), 2011,

S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th Internation Conference on Space Mission Challenges for Information Technology, 2011,

2010

S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,

2009

S. Williams, et al., A Generalized Framework for Auto-tuning Stencil Computations, Cray User Group (CUG), 2009,

S. Williams, et al., Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4, Cray User Group (CUG), 2009,

2008

S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. Thesis Dissertation Talk, University of California at Berkeley, 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, Parlab Winter Retreat, 2008,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, IEEE International Symposium on Parallel & Distributed Processing (IPDPS)., Pages: 1-14 2008,

2007

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Supercomputing (SC), 2007,

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, Parlab Seminar, 2007,

Thesis/Dissertation

2008

Web Article

2009

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

Poster

2008

S. Williams, et al, "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,

K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,

Other

2008

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, PERI -- Auto-tuning Memory-intensive Kernels for Multicore, Journal of Physics: Conference Series, Pages: 012038 2008,