
Auto-tuning

Automatic performance tuning, or "auto-tuning", is an empirical, feedback-driven performance optimization technique designed to maximize performance across a wide variety of architectures without sacrificing portability or productivity. Over the years, auto-tuning has expanded from simple loop tiling and unroll-and-jam to encompass transformations to data structures, parallelization, and algorithmic parameters.
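
At its core, the feedback loop is simple: generate candidate implementations of a kernel, run each on the target machine, and keep the fastest. The C sketch below illustrates that loop with two hypothetical dot-product variants; the kernels, names, and problem size are ours for illustration, not taken from any particular auto-tuner.

/* A minimal sketch of the empirical, feedback-driven tuning loop:
 * run every candidate variant on the target machine, time it, and
 * keep the fastest.  The variants below are illustrative only. */
#include <stdio.h>
#include <time.h>

#define N 1000000
static double x[N], y[N];

/* candidate 1: reference loop */
static double dot_ref(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}

/* candidate 2: 4-way unrolled with separate partial sums
 * (exposes more instruction-level parallelism) */
static double dot_unroll4(void) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    double (*variant[])(void) = { dot_ref, dot_unroll4 };
    const char *name[] = { "reference", "unroll4" };
    int best = 0; double best_time = 1e30;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    for (int v = 0; v < 2; v++) {
        clock_t t0 = clock();
        volatile double s = variant[v]();   /* volatile: keep the call */
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%-10s %.4fs (result %g)\n", name[v], dt, s);
        if (dt < best_time) { best_time = dt; best = v; }
    }
    printf("fastest on this machine: %s\n", name[best]);
    return 0;
}

In practice the candidate variants are generated automatically and can number in the thousands, which is why searching the space efficiently becomes a research problem in its own right.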

Perhaps the genesis of auto-tuning was PHiPAC (Portable High Performance ANSI C), followed by ATLAS. When optimizing matrix-matrix multiplication, one often cache-blocks the computation to maximize locality at each level of the cache hierarchy. This transforms the reference three-deep loop nest into a six-deep loop nest. Rather than hard-coding the tiling, researchers observed they could leverage the compute capabilities of modern processors to search over all possible tile sizes and automatically find the fastest configuration for the target machine.
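
The sketch below makes this concrete under assumed, illustrative matrix and tile sizes: cache blocking turns the three-deep reference nest into a six-deep nest parameterized by a tile size B, and a simple empirical search times each candidate B and keeps the winner.

/* Sketch of the PHiPAC/ATLAS idea: the tile size B of a cache-blocked
 * matrix multiply becomes a tunable parameter searched empirically.
 * Matrix size and candidate tile list are illustrative. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 512
static double A[N][N], Bm[N][N], C[N][N];

static inline int min(int a, int b) { return a < b ? a : b; }

/* six-deep nest: outer three loops walk tiles,
 * inner three loops do the work within one tile */
static void gemm_blocked(int B) {
    memset(C, 0, sizeof C);
    for (int ii = 0; ii < N; ii += B)
    for (int jj = 0; jj < N; jj += B)
    for (int kk = 0; kk < N; kk += B)
        for (int i = ii; i < min(ii + B, N); i++)
        for (int k = kk; k < min(kk + B, N); k++) {
            double a = A[i][k];
            for (int j = jj; j < min(jj + B, N); j++)
                C[i][j] += a * Bm[k][j];
        }
}

int main(void) {
    int tiles[] = { 8, 16, 32, 64, 128 };   /* candidate tile sizes */
    int best = tiles[0]; double best_time = 1e30;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 2.0; }

    /* empirical search: time every candidate, keep the fastest */
    for (unsigned t = 0; t < sizeof tiles / sizeof *tiles; t++) {
        clock_t t0 = clock();
        gemm_blocked(tiles[t]);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("B = %3d: %.3fs\n", tiles[t], dt);
        if (dt < best_time) { best_time = dt; best = tiles[t]; }
    }
    printf("best tile size on this machine: %d\n", best);
    return 0;
}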

Subsequently, auto-tuning was applied to sparse linear algebra, principally sparse matrix-vector multiplication (SpMV). Unlike in dense matrices, most elements of a sparse matrix are zero. As such, most data structures encode only the nonzeros. Indexing into such a data structure can impair in-core performance and increase data movement, so SpMV performance often falls below its Roofline bound. To ameliorate this, one can select alternate data structures on a matrix-by-matrix or cache-block-by-cache-block basis. Alternate data structures can amortize metadata overhead and increase instruction- and data-level parallelism, improving performance. Rather than requiring the user to select and implement a format, the Optimized Sparse Kernel Interface (OSKI) was developed to auto-tune this selection.
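
For reference, here is the baseline compressed sparse row (CSR) kernel such tuners start from. The indirect access through the column-index array is exactly the indexing overhead described above; the tiny matrix is our own illustration, and OSKI's tuned formats (e.g., register-blocked variants) restructure these arrays rather than change the kernel's semantics.

/* y = A*x for a 4x4 matrix with 7 nonzeros, stored in CSR:
 * rowptr[i]..rowptr[i+1]-1 index the nonzeros of row i.
 * Only nonzeros are stored, at the cost of an extra level of
 * indexing on every element. */
#include <stdio.h>

int main(void) {
    int    rowptr[] = { 0, 2, 4, 5, 7 };
    int    colidx[] = { 0, 1,  1, 2,  2,  0, 3 };
    double val[]    = { 4, 1,  3, 2,  5,  1, 6 };
    double x[4] = { 1, 1, 1, 1 }, y[4];

    for (int i = 0; i < 4; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[colidx[k]];   /* indirect access to x */
        y[i] = sum;
    }
    for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}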

Research Topics

Broadly speaking, auto-tuning is premised on three fundamental research topics: 

  • creating novel application- and architecture-relevant optimizations and their associated parameter space
  • automating the generation of code for this optimization space
  • efficiently searching this optimization space (a mathematical optimization problem; a toy search sketch follows this list).
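
As a toy illustration of the third topic, the sketch below runs a greedy hill climb over a two-parameter grid. The synthetic cost function stands in for a measured runtime (a real tuner would benchmark generated code at each point); the takeaway is that a guided search can reach a good configuration with far fewer evaluations than exhaustive enumeration, at the risk of stopping in a local minimum.

/* Toy search over two integer tuning parameters on a 16x16 grid.
 * runtime() is a synthetic stand-in for an actual benchmark. */
#include <stdio.h>

static double runtime(int p, int q) {
    double dp = p - 6, dq = q - 3;          /* minimum at (6,3) */
    return 1.0 + dp * dp + 0.5 * dq * dq;
}

int main(void) {
    int p = 0, q = 0, evals = 1;
    double best = runtime(p, q);

    for (;;) {                              /* greedy hill climb */
        int moved = 0;
        int dp[] = { 1, -1, 0, 0 }, dq[] = { 0, 0, 1, -1 };
        for (int d = 0; d < 4; d++) {
            int np = p + dp[d], nq = q + dq[d];
            if (np < 0 || np > 15 || nq < 0 || nq > 15) continue;
            double t = runtime(np, nq); evals++;
            if (t < best) { best = t; p = np; q = nq; moved = 1; }
        }
        if (!moved) break;                  /* no neighbor improves */
    }
    printf("minimum %.2f at (%d,%d) after %d evaluations"
           " (exhaustive would need 256)\n", best, p, q, evals);
    return 0;
}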

At LBL, our research has primarily focused on prototyping novel optimizations and automating the code generation process. This has been applied in several projects, beginning with an LDRD, followed by our original X-Stack auto-tuning project, and finally SUPER and our current X-Stack project, X-Tune.

Researchers

Publications


2015

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

2014

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

2013

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

2011


S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th International Conference on Space Mission Challenges for Information Technology, 2011,

A. Buluç, S. Williams, L. Oliker, J. Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", International Parallel Distributed Processing Symposium (IPDPS), May 2011, doi: 10.1109/IPDPS.2011.73

Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, November 2011, 55, doi: 10.1145/2063384.2063458

David H. Bailey, Robert F. Lucas, Samuel W. Williams, eds., Performance Tuning of Scientific Applications, (CRC Press: 2011)

2010

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, (2010)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-tuning Stencil Computations on Multicore and Accelerators", Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, (2010)

S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Samuel Williams, Kaushik Datta, Leonid Oliker, Jonathan Carter, John Shalf, Katherine Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Performance Tuning of Scientific Applications, edited by D. H. Bailey, R. F. Lucas, S. W. Williams, (CRC Press: 2010) Pages: 219

2009

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E. Best Paper Award.

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,


Marghoob Mohiyuddin, Mark Murphy, Leonid Oliker, John Shalf, John Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing (JPDC), 2009, 69:762-777, doi: 10.1016/j.jpdc.2009.04.002

2008

S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. dissertation talk, University of California at Berkeley, 2008,

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, "Stencil Computation Optimization and Auto-Tuning on State-of-the-Art Multicore Architectures", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2008, doi: 10.1109/SC.2008.5222004

S. Williams, et al, "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,


S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", IEEE International Symposium on Parallel and Distributed Processing. BEST PAPER AWARD - Applications Track, 2008, doi: 10.1109/IPDPS.2008.4536295

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Parallel Computing, 2008, 35:38, doi: 10.1016/j.parco.2008.12.006


K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", ParLab Winter Retreat, 2008,

S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, ParLab Winter Retreat, 2008,


2007

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674


S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, ParLab Seminar, 2007,