Skip to navigation Skip to content
Careers | Phone Book | A - Z Index
Performance and Algorithms Research

Samuel Williams

PhotoIDSamuelWilliams
Samuel Williams
Senior Scientist
Phone: 510-486-5936
LBNL Office: 59-3040 (CRT)
One Cyclotron Rd.
MS:59R4104
Berkeley, CA 94720-8142 us

Biographical Sketch

Sam Williams is a senior scientist in the Performance and Algorithms Research Group at the Lawrence Berkeley National Laboratory (LBNL). His research interests include high-performance computing, performance modeling, data-driven surrogate and reduced order modeling (DD-ROM), computer architecture, and hardware/software co-design.

Dr. Williams received his Ph.D. in Computer Science from the University of California at Berkeley (UCB) in December of 2008 and his masters in December of 2003. During this period, his doctoral research focused on multicore architectures and automated performance tuning under the guidance of David Patterson. To that end, Dr. Williams created the Roofline Model to enable developers, computer scientists, computer architects, and applied mathematicians to quickly and visually assess performance bottlenecks on multicore, manycore, and GPU-accelerated systems.

Previously, as a graduate student, Sam worked in several Parallel Computing Laboratory (ParLab) research groups including: BeBOP, Architecture, and the Berkeley View. His masters research was funded through the IRAM project under which he implemented the RTL for the integer and floating-point datapaths, verified the simulators and all RTL, floorplanned the entire VIRAM1 chip, and performed all necessary place-and-route (PnR) work. Sam received bachelor degrees in Electrical Engineering (computer specialization), Mathematics (applied), and Physics from Southern Methodist University (SMU) graduating summa cum laude in 1999. During that period, he spent five semesters as an intern at Cyrix corporation where he worked on RTL and gate-level verification, PnR (place and route), and silicon debug.

Current Research

  • Roofline Model (A performance model for throughput computing)
  • NERSC Architectural Evaluation and Performance Modeling
    • Value of FPGAs for HPC computing
    • Deep Neural Networks for Machine Learning for Science
    • Disaggregated Memory Architectures
    • Disaggregated (GPU) Accelerated Architectures
    • Scientific Workflows 
  • PROTEAS-TUNE (a compiler-based approach to auto-tuning) ECP Project
    • Development of novel data structure transformations for scalable, performance-portable computations on GPU-accelerated systems.
  • RAPIDS SciDAC4 CS/Data Institute
  • Data-Driven Surrogate/Reduced Order Modeling for simulation of fusion devices
  • EFIT-AI SciDAC4 FES Partnership
    • Performance-Portable GPU acceleration of Tokmak plasma equilibrium reconstruction codes
  • CTTS (Center for Tokamak Transient Simulations) SciDAC4 FES Partnership
    • Optimization of GPU-accelerated (direct solver) preconditioners for fusion simulations
  • ISEP (Integrated Simulation of Energetic Particles) SciDAC4 FES Partnership
    • Development of surrogate models to accelerate/replace fusion simulations
    • Optimization of particle-in-cell simulations on GPU-accelerated HPC systems
  • HPGMG-FV (High Performance Geometric Multigrid)
    • scalable compact benchmark for understanding the challenges of Geometric Multigrid on petascale and exascale systems built from multicore processors, manycore processors, and accelerators
    • used to evaluate alternatives to HPL for the Top500 rankings.
    • source code

Previous Research

 

Books

altPerformance Tuning of Scientific Applications, edited by: David H. Bailey, Robert F. Lucas, Samuel W. Williams, CRC Press, 2010, ISBN: 978-1-4398156-9-4.

 

 

 

Honors and Awards

  • Best Paper, Performance Modeling, Benchmarking, and Simulation (PMBS), 2021
  • Best Paper, Performance Modeling, Benchmarking, and Simulation (PMBS), 2020
  • Best Paper, International Symposium on Benchmarking, Measuring, and Optimizing (Bench), 2019
  • Best Paper, Performance Modeling, Benchmarking, and Simulation (PMBS), 2019
  • Best Paper, Cray Users Group (CUG), 2009
  • Best Paper, Applications Track, International Parallel and Distributed Processing Symposium (IPDPS) 2008
  • 2nd Place, Student Design Competition, Design Automation Conference, 2004.
  • Phi Beta Kappa, 1999
  • Eta Kappa Nu, 1996
  • Tau Beta Pi, 1995
  • Sigma Pi Sigma, 1998
  • Robert Stewart Hyer Society (highest academic honor at SMU), 1996
  • Outstanding Senior in SMU's School of Engineering and Applied Science, 1999
  • SMU's Charles J. Pipes Award for Outstanding Performance in Mathematics, 1998
  • University of California Microelectronics Fellowship, 1999-2000
  • Robert Stewart Hyer Scholarship (outstanding physics student), 1996
  • J. Lindsay Embrey Scholarship, 1994-1999
  • SMU University Scholarship, 1994-1999

 

Journal Articles

Marco Siracusa, Emanuele Del Sozzo, Marco Rabozzi, Lorenzo Di Tucci, Samuel Williams, Donatella Sciuto, Marco Domenico Santambrogio, "A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model", Transactions on Computers (TC), September 2021, doi: 10.1109/TC.2021.3111761

Tan Nguyen, Colin MacLean, Marco Siracusa, Douglas Doerfler, Nicholas J. Wright, Samuel Williams, "FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency", CCPE, August 22, 2021, doi: 10.1002/cpe.6570

Nan Ding, Muaaz Awan, Samuel Williams, "Instruction Roofline: An insightful visual performance model for GPUs", CCPE, August 4, 2021, doi: 10.1002/cpe.6591

Charlene Yang, Yunsong Wang, Thorsten Kurth, Steven Farrell, Samuel Williams, "Hierarchical Roofline Performance Analysis for Deep Learning Applications", Intelligent Computing, LNNS, July 15, 2021, doi: 10.1007/978-3-030-80126-7

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system", Concurrency and Computation: Practice and Experience (CCPE), August 2019, doi: 10.1002/cpe.5547

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blashke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, May 2019, doi: 10.21105/joss.01370

Wenjing Ma, Yulong Ao, Chao Yang, Samuel Williams, "Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight", Cluster Computing, May 2019, doi: 10.1007/s10586-019-02938-w

Jack Deslippe, Doug Doerfler, Brandon Cook, Tareq Malas, Samuel Williams, Sudip Dosanjh, "Optimizing science applications for the Cori, Knights Landing, System at NERSC", Advances in Parallel Computing, New Frontiers in High Performance Computing and Big Data, August 2017, 30, doi: 10.3233/978-1-61499-816-7-235

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Modern Gyrokinetic Particle-in-cell Simulation of Fusion Plasmas on Top Supercomputers", International Journal of High-Performance Computing Applications (IJHPCA), May 2017, doi: https://doi.org/10.1177/1094342017712059

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

 

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

Nicholas Chaimov, Khaled Ibrahim, Samuel Williams, Costin Iancu, "Exploiting Communication Concurrency on High Performance Computing Systems", IJHPCA, April 17, 2015,

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert,, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Bei Wang, Stephane Ethier, Leonid Oliker, "Analysis and optimization of gyrokinetic toroidal simulations on homogeneous and heterogeneous platforms", International Journal of High Performance Computing Applications (IJHPCA), July 2013, doi: 10.1177/1094342013492446

K Madduri, J Su, S Williams, L Oliker, S Ethier, K Yelick, "Optimization of parallel particle-to-grid interpolation on leading multicore platforms", IEEE Transactions on Parallel and Distributed Systems, January 1, 2012, 23:1915--1922, doi: 10.1109/TPDS.2012.28

Kamesh Madduri, Eun-Jin Im, Khaled Z. Ibrahim, Samuel Williams, Stephane Ethier, Leonid Oliker, "Gyrokinetic Particle-in-cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing (PARCO), January 2011, 37:501 - 520, doi: 10.1016/j.parco.2011.02.001

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models", Concurrency and Computation:Practice and Experience (CCPE), August 2009, doi: 10.1002/cpe.1482

S. Williams, A. Waterman, D. Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM (CACM), April 2009, doi: 10.1145/1498765.1498785

K Datta, S Kamill, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

Conference Papers

Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability & Productivity in HPC (P3HPC), November 2023,

Oscar Antepara, Samuel Williams, Scott Kruger, Torrin Bechtel, Joseph McClenaghan, Lang Lao, "Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code", Workshop on Accelerator Programming and Directives (WACCPD), November 2023,

Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023,

Nan Ding, Muhammad Haseeb, Taylor Groves, Samuel Williams, "Evaluating the Performance of One-sided Communication on CPUs and GPUs", 2023 International Workshop on Performance, Portability & Productivity in HPC, November 12, 2023,

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, Christopher Delay, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", RESDIS, https://resdis.github.io/ws/2022/sc/, November 18, 2022,

Taylor Groves, Chris Daley, Rahulkumar Gayatri, Hai Ah Nam, Nan Ding, Lenny Oliker, Nicholas J. Wright, Samuel Williams, "A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures", PMBS, November 2022,

Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022,

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, Wahid Bhimji, Steven Farrell, Leonid Oliker, Michael Rowan, Nicholas J. Wright, Samuel Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", (BEST PAPER), Performance Modeling, Benchmarking, and Simulation (PMBS), November 2021,

Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021,

Douglas Doerfler, Farzad Fatollahi-Fard, Colin MacLean, Tan Nguyen, Samuel Williams, Nicholas J. Wright, Marco Siracusa, "Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs", International Workshop on OpenCL (iWOCL), April 2021, doi: 10.1145/3456669.3456671

Tuowen Zhao, Mary Hall, Hans Johansen, Samuel Williams, "Improving Communication by Optimizing On-Node Data Movement with Data Layout", PPoPP, February 2021,

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas Doerfler, Nicholas J. Wright, "The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing", (BEST PAPER) Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS), November 2020,

Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams, "Time-Based Roofline for Deep Learning Performance Analysis", Deep Learning on Supercomputing (DLonSC), November 2020,

Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, Lorenzo Di Tucci, Samuel Williams, Marco D. Santambrogio, "A CAD-based methodology to optimize HLS code via the Roofline model", International Conference on Computer Aided Design (ICCAD), November 2020, doi: 10.1145/3400302.3415730

Christopher Daley, Hadia Ahmed, Samuel Williams, Nicholas Wright, "A case study of porting HPGMG from CUDA to OpenMP target offload", The International Workshop on OpenMP (IWOMP), September 2020,

Jonathan R Madsen, Muaaz G Awan, Hugo Brunie, Jack Deslippe, Rahul Gayatri, Leonid Oliker, Yunsong Wang, Charlene Yang, Samuel Williams, "TiMemory: Modular Performance Analysis for HPC", International Supercomputing Conference (ISC), June 2020, doi: 10.1007/978-3-030-50743-5_22

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019,

Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", Cray User Group (CUG), May 2019,

Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Charlene Yang, Rahulkumar Gayatri, Thorsten Kurth, Protonu Basu, Zahra Ronaghi, Adedoyin Adetokunbo, Brian Friesen, Brandon Cook, Douglas Doerfler, Leonid Oliker, Jack Deslippe, Samuel Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Hongzhang Shan, Samuel Williams, Calvin W. Johnson, "Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2018,

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPCS Special Session on High Performance Computing Benchmarking and Optimization (HPBench), July 2018,

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, and Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, June 2018,

Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook, Samuel Williams, "Toward Automated Application Profiling on Cray Systems", Cray User Group (CUG), May 2018,

Tuowen Zhao, Mary Hall, Protonu Basu, Samuel Williams, Hans Johansen, "SIMD code generation for stencils on brick decompositions", Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018,

Philip C. Roth, Hongzhang Shan, David Riegner, Nikolas Antolin, Sarat Sreepathi, Leonid Oliker, Samuel Williams, Shirley Moore, Wolfgang Windl, "Performance Analysis and Optimization of the RAMPAGE Metal Alloy Potential Generation Software", SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS), October 2017,

Hongzhang Shan, Samuel Williams, Calvin Johnson, Kenneth McElvain, "A Locality-based Threading Algorithm for the Configuration-Interaction Method", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe, "Performance Variability on Xeon Phi", Intel Xeon Phi Users Group (IXPUG), June 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, and Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", Intel Xeon Phi Users Group (IXPUG), June 2017,

Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,

William Tang, Bei Wang, Stephane Ethier, Grzegorz Kwasniewski, Torsten Hoefler, Khaled Z. Ibrahim4, Kamesh Madduri, Samuel Williams, Leonid Oliker, Carlos Rosales-Fernandez, Tim Williams, "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing, November 2016,

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2

Douglas Doerfer, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016,

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, IEEE, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", PROCEEDINGS OF PAW 2016: 1ST PGAS APPLICATIONS WORKSHOP (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008

Hongzhang Shan, Kenneth McElvain, Calvin Johnson, Samuel Williams, W. Erich Ormand, "Parallel Implementation and Performance Optimization of the Configuration-Interaction Method", Supercomputing (SC), November 2015, doi: 10.1145/2807591.2807618

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick,, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges the unstructured mesh-based ocean modeling code, MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns, that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra- node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Khaled Z. Ibrahim, Samuel W. Williams, Evgeny Epifanovsky, Anna I. Krylov, "Analysis and Tuning of Libtensor Framework on Multicore Architectures", High Performance Computing Conference (HIPC), December 2014,

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013, doi: 10.1145/2503210.2503258

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul H.J. Kelly, Gihan Mudalige, Brian Van Straalen, Sam Williams, "Loop chaining: A programming abstraction for balancing locality and parallelism", Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, 375--384, doi: 10.1109/IPDPSW.2013.68

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

Aydın Buluç, Samuel Williams, Leonid Oliker, James Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", IPDPS, IEEE, 2011, doi: https://doi.org/10.1109/IPDPS.2011.73

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Samuel Williams, Oliker, Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel,, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Best Paper Award

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

J. Gebis, S. Williams, D. Patterson, C. Kozyrakis, "VIRAM1: A Media-Oriented Vector Processor with Embedded DRAM", 41st Design Automation Student Design Contest (DAC), 2004,

Books

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

Book Chapters

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Samuel W. Williams, David H. Bailey, "Parallel Computer Architecture", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010) Pages: 11-33

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, ( 2010)

S. Williams, "The Roofline Model", chapter in Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010)

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman \& Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, ( 2010) Pages: 219--254 doi: 10.1201/b10376

Presentation/Talks

Nan Ding, Muhammad Haseeb, Taylor Groves, Samuel Williams, Evaluating the Performance of One-sided Communication on CPUs and GPUs, 2023 International Workshop on Performance, Portability & Productivity in HPC, November 13, 2023,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 8, 2023,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, May 2022,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, April 2021,

Samuel Williams, Roofline Analysis on NVIDIA GPUs, ECP Annual Meeting, April 2021,

Samuel Williams, Introduction to the Roofline Model, Supercomputing (SC), November 2020,

Samuel Williams, The Roofline Model: A Bridge between Computer Science, Applied Math, and Computational Science, SciDAC Meeting, July 2020,

Samuel Williams, Introduction to the Roofline Model, NERSC NVIDIA Roofline Hackathon, July 2020,

Samuel Williams, Introduction to the Roofline Model, NERSC GPU For Science Workshop, July 2020,

Samuel Williams, Charlene Yang, Yunsong Wang, Roofline Performance Modeling for HPC and Deep Learning Applications, NVIDIA GPU Technology Conference (GTC), March 2020,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 2020,

Samuel Williams, Roofline on GPUs (Advanced Topics), ECP Annual Meeting, February 2020,

Charlene Yang, Samuel Williams, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, GPU Technology Conference (GTC), March 2019,

Samuel Williams, Performance Modeling and Analysis, CS267 Lecture, University of California at Berkeley, February 14, 2019,

Samuel Williams, Introduction to the Roofline Model, Roofline Tutorial, ECP Annual Meeting, January 2019,

Samuel Williams, Roofline on CPU-based Systems, Roofline Tutorial, ECP Annual Meeting, January 2019,

Samuel Williams, Introduction to the Roofline Model, Supercomputing, November 2018,

Samuel Williams, Roofline on Manycore and Accelerated Systems, ModSim, August 2018,

Samuel Williams, Parallelism and Performance, MolSSI Summer School, August 2018,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 8, 2018,

Samuel Williams, Advisor Hand-On: Stencil Example, ECP Annual Meeting, February 8, 2018,

Samuel Williams, Performance Modeling and Analysis, CS267 lecture, University of California at Berkeley, January 30, 2018,

Samuel Williams, Introduction to the Roofline Model, Roofline Training, November 2017,

Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

Samuel Williams, HPGMG on the Knights Landing Processor, HPGMG BoF, Supercomputing, November 2016,

Samuel Williams, HPGMG Benchmark, Top500 BoF, Supercomputing, November 2016,

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Moutain, March 2016,

S Williams, D Patterson, L Oliker, J Shalf, K Yelick, The roofline model: A pedagogical tool for program analysis and optimization, 2008 IEEE Hot Chips 20 Symposium, HCS 2008, 2016, doi: 10.1109/HOTCHIPS.2008.7476531

Samuel Williams, X-TUNE, X-Stack PI Meeting, December 2015,

Samuel Williams, 4th Order HPGMG-FV Implementation, HPGMG BoF, Supercomputing, November 2015,

Samuel Williams, HPGMG-FV, FastForward2 Proxy App Presentation, December 2014,

Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

Samuel Williams, At Exascale, Will Bandwidth Be Free?, DOE ModSim Workshop, 2013,

Samuel Williams, Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors, Supercomputing (SC), November 2012,

S. Williams, et al., Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning, Supercomputing (SC), 2011,

S. Williams, et al., Stencil Computations on CPUs, Stanford Earth Sciences Algorithms and Architectures Initiative (SESAAI), 2011,

S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th Internation Conference on Space Mission Challenges for Information Technology, 2011,

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

S. Williams, et al, Stencil Computations on CPUs, Society of Exploration Geophysicists High-Performance Computing Workshop (SEG), July 2011,

S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,

S. Williams, et al., A Generalized Framework for Auto-tuning Stencil Computations, Cray User Group (CUG), 2009,

S. Williams, et al., Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4, Cray User Group (CUG), 2009,

Kamesh Madduri, Williams, Ethier, Oliker, Shalf, Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. Thesis Dissertation Talk, University of California at Berkeley, 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

S. Williams, et al., A Vision for Integrating Performance Counters into the Roofline model, UPCRC PMU Workshop (Performance Counters), 2008,

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, IEEE International Symposium on Parallel & Distributed Processing (IPDPS)., Pages: 1-14 2008,

S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, Parlab Winter Retreat, 2008,

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Supercomputing (SC), 2007,

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, Parlab Seminar, 2007,

S. Williams, et al., Structured Grids and Sparse Matrix Vector Multiplication on the Cell Processor, Global Signal Processing Expo (GSPx), 2006,

S. Williams, et al., 3D Lattice Boltzmann Magneto-hydrodynamics (LBMHD3D), UTK Summit on Software and Algorithms for the Cell Processor, 2006,

S. Williams, et al., The Potential of the Cell Processor for Scientific Computing, LBL Scientific Computing Seminar, 2006,

Samuel Williams, Shalf, Oliker, Kamil, Husbands, Katherine A. Yelick, The potential of the cell processor for scientific computing, Conf. Computing Frontiers, Pages: 9-20 2006,

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, The potential of the cell processor for scientific computing, Proceedings of the 3rd Conference on Computing Frontiers 2006, CF 06, Pages: 9--20 2006, doi: 10.1145/1128022.1128027

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

Reports

Esmond Ng, Katherine J. Evans, Peter Caldwell, Forrest M. Hoffman, Charles Jackson, Kerstin Van Dam, Ruby Leung, Daniel F. Martin, George Ostrouchov, Raymond Tuminaro, Paul Ullrich, Stefan Wild, Samuel Williams, "Advances in Cross-Cutting Ideas for Computational Climate Science (AXICCS)", January 2017, doi: 10.2172/1341564

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends (tech report version)", LBNL. - Report Number: LBNL-1005853, July 1, 2016, LBNL 1005853, doi: 10.2172/1274416

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy,
Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker,
"Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

J. Krueger, P. Micikevicius, S. Williams, "Optimization of Forward Wave Modeling on Contemporary HPC Architectures", LBNL Technical Report, 2012, LBNL 5751E,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

Samuel Webb Williams, Andrew Waterman, David A. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", EECS Tech Report UCB/EECS-2008-134, October 2008,

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

S. Williams, "Verification of VIRAM1", Masters Report, University of California at Berkeley, 2003,

Thesis/Dissertations

Web Articles

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

Posters

Nan Ding, Samuel Williams, Sherry Li, Yang Liu, "Leveraging One-Sided Communication for Sparse Triangular Solvers", SciDAC19, July 18, 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Alex Druinsky, Brian Austin, Sherry Li, Osni Marques, Eric Roman, Samuel Williams, "A Roofline Performance Analysis of an Algebraic Multigrid Solver", Supercomputing (SC), November 2014,

B. Wang, S. Ethier, W. Tang, K. Ibrahim, K. Madduri, S. Williams, "Advances in gyrokinetic particle in cell simulation for fusion plasmas to Extreme scale", Supercomputing (SC), 2012,

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

S. Williams, et al, "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "Impacts of the IBM Cell Processor on Supporting Climate Models", International Supercomputing Conference (ISC), 2008,

S. Williams, et. al, "The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization", Parlab Summer Retreat, 2008,

K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,