Publications

Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

Download File: SC16-HPGMG-BoF-Intro.pdf (pdf: 1020 KB)

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

Download File: CU16SWWilliams.pptx (pptx: 1 MB)

Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

Download File: SC14HPGMGBoF.pdf (pdf: 1.9 MB)

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Download File: hpgmg.pdf (pdf: 183 KB)

Christopher Daley, Hadia Ahmed, Samuel Williams, Nicholas Wright, "A case study of porting HPGMG from CUDA to OpenMP target offload", The International Workshop on OpenMP (IWOMP), September 2020,

Download File: p24-daley.pdf (pdf: 272 KB)

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125

Download File: ipdps14mfdnfinal.pdf (pdf: 631 KB)

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Download File: miniGMGLBNL-6676E.pdf (pdf: 906 KB)

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

Download File: sc12-mg.pdf (pdf: 808 KB)
Download File: sc12mgtalk.pdf (pdf: 1.9 MB)

Oscar Antepara, Samuel Williams, Hans Johansen, Mary Hall, "High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs", Performance, Portability & Productivity in HPC (P3HPC), November 10, 2024,

Download File: P3HPC24_bricks_mg_final.pdf (pdf: 358 KB)

Oscar Antepara, Samuel Williams, Max Carlson, Jerry Watkins, "Performance Portable Optimizations of an Ice-sheet Modeling Code on GPU-supercomputers", Performance, Portability & Productivity in HPC (P3HPC), November 2024,

Download File: P3HPC24_IceSheet_final-v2.pdf (pdf: 1.4 MB)

Sterling Smith, Zichuan Anthony Xing, Torrin Bechtel, Severin Denk, Earl DeShazer, Orso Meneghini, Tom Neiser, Laurie Stephey, Oscar Antepara, Christopher Mitchell Clark, Eli Dart, Pengfei Ding, Sean Flanagan, Raffi Nazikian, David Schissel, Christine Simpson, Nicholas Tyler, Thomas D. Uram, Samuel Williams, "Expediting Higher Fidelity Plasma State Reconstructions for the DIII-D National Fusion Facility Using Leadership Class Computing Resources", Extreme-Scale Experiment-in-the-Loop Computing (XLOOP), November 2024,

Mahesh Lakshminarasimhan, Oscar Antepara, Tuowen Zhao, Benjamin Sepanski, Protonu Basu, Hans Johansen, Mary Hall, Samuel Williams, "Bricks: A high-performance portability layer for computations on block-structured grids", The International Journal of High Performance Computing Applications (IJHPCA), August 19, 2024, doi: 10.1177/10943420241268288

Mahesh Lakshminarasimhan, Mary Hall, Samuel Williams, Oscar Antepara, "BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs", Proceedings of the 53rd International Conference on Parallel Processing (ICPP), August 12, 2024,

Download File: ICPP24_BrickDL_final-v2.pdf (pdf: 1.7 MB)

Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability & Productivity in HPC (P3HPC), November 2023,

Download File: P3HPC23_bricks_final-v4.pdf (pdf: 684 KB)

Oscar Antepara, Samuel Williams, Scott Kruger, Torrin Bechtel, Joseph McClenaghan, Lang Lao, "Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code", Workshop on Accelerator Programming and Directives (WACCPD), November 2023,

Download File: WACCPD23_EFIT_final.pdf (pdf: 697 KB)

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Download File: Workflow_roofline-6.pdf (pdf: 1.2 MB)

Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji, "Comprehensive Performance Modeling and System Design Insights for Foundation Models", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_ModelingTransformerTraining_final.pdf (pdf: 736 KB)

Brian Austin, Dhruva Kulkarni, Brandon Cook, Samuel Williams, Nicholas J. Wright, "System-Wide Roofline Profiling - a Case Study on NERSC’s Perlmutter Supercomputer", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_DCGM_final.pdf (pdf: 319 KB)

Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe, "Performance Variability on Xeon Phi", Intel Xeon Phi Users Group (IXPUG), June 2017,

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

Ariful Azad, Georgios A. Pavlopoulos, Christos A. Ouzounis, Nikos C. Kyrpides, Aydin Buluç, "HipMCL: A high-performance parallel implementation of the Markov cluster algorithm for large scale networks", Nucleic Acids Research, April 2018,

Ariful Azad, Aydin Buluc, "Towards a GraphBLAS Library in Chapel", IPDPS Workshops, Orlando, FL, May 2017,

Download File: GraphBLAS-Chapel.pdf (pdf: 368 KB)

Ariful Azad, Aydin Buluc, "A work-efficient parallel sparse matrix-sparse vector multiplication algorithm", IEEE International Parallel & Distributed Processing Symposium (IPDPS), Orlando, FL, May 2017,

Download File: SpMSpV-ipdps17.pdf (pdf: 422 KB)

Ariful Azad, Mathias Jacquelin, Aydin Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

Ariful Azad, Bartek Rajwa, Alex Pothen, "flowVS: Channel-Specific Variance Stabilization in Flow Cytometry", BMC Bioinformatics, June 2016,

Ariful Azad, Aydın Buluç, "A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs", Parallel Computing, June 2016,

Ariful Azad, Aydin Buluç, "Distributed-Memory Algorithms for Maximum Cardinality Matching in Bipartite Graphs", IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2016,

Download File: MaximumMatchingDist-IPDPS16.pdf (pdf: 620 KB)

Ariful Azad, Aydın Buluç, Alex Pothen, "Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting", IEEE Transactions on Parallel and Distributed Systems (TPDS), May 2016,

Download File: matchingGraft-TPDS.pdf (pdf: 1.4 MB)

Ariful Azad, Aydın Buluç, Distributed-memory algorithms for cardinality matching using matrix algebra, SIAM Conference on Parallel Processing for Scientific Computing (PP), Paris, France, April 2016,

Download File: Azad-SIAM-PP16.pdf (pdf: 1.3 MB)

P Koanantakool, A Azad, A Buluc, D Morozov, SY Oh, L Oliker, K Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, January 2016, 842--853, doi: 10.1109/IPDPS.2016.117

Ariful Azad, Aydin Buluc, "Distributed-Memory Algorithms for Maximal Cardinality Matching using Matrix Algebra", IEEE Cluster, Chicago, IL, September 2015,

Download File: maximalMatching-distribute.pdf (pdf: 659 KB)

Mahantesh Halappanavar, Alex Pothen, Ariful Azad, Fredrik Manne, Johannes Langguth, Arif Khan, "Codesign Lessons Learned from Implementing Graph Matching on Multithreaded Architectures", IEEE Computer, August 2015,

Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

Download File: triangles-gabb.pdf (pdf: 384 KB)

Ariful Azad, Aydin Buluç, Alex Pothen, "A Parallel Tree Grafting Algorithm for Maximum Cardinality Matching in Bipartite Graphs", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: matchingGraft.pdf (pdf: 518 KB)

Mustafa Mutiur Rahman, Zhe Bai, Jacob Robert King, Carl R. Sovinec, Xishuo Wei, Samuel Williams, Yang Liu, "Sparsified time-dependent Fourier neural operators for fusion simulations", Phys. Plasmas, December 4, 2024, 31:12, doi: 10.1063/5.0232503

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

David H. Bailey, Lin-Wang Wang, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, Byounghak Lee, "Tuning an electronic structure code", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2011) Pages: 339-354 doi: 10.1201/b10509

Samuel W. Williams, David H. Bailey, "Parallel Computer Architecture", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010) Pages: 11-33

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

Zhengji Zhao, Juan Meza, Byounghak Lee, Hongzhang Shan, Erich Strohmaier, David H. Bailey, Lin-Wang Wang, "The linearly scaling 3D fragment method for large scale electronic structure calculations", Journal of Physics: Conference Series, July 1, 2009,

Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, David H. Bailey, "Linearly scaling 3D fragment method for large-scale electronic structure calculations", Proceedings of SC08, November 2008,

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

Download File: jpconf8125012038.pdf (pdf: 1.2 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Download File: jpconf8125012089.pdf (pdf: 874 KB)

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Download File: Top500PowerNov14SC07.pdf (pdf: 3.8 MB)

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

Charlene Yang, Rahulkumar Gayatri, Thorsten Kurth, Protonu Basu, Zahra Ronaghi, Adedoyin Adetokunbo, Brian Friesen, Brandon Cook, Douglas Doerfler, Leonid Oliker, Jack Deslippe, Samuel Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Download File: p3hpc-roofline-final.pdf (pdf: 372 KB)

Tuowen Zhao, Mary Hall, Protonu Basu, Samuel Williams, Hans Johansen, "SIMD code generation for stencils on brick decompositions", Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018,

Protonu Basu, Using Empirical Roofline Toolkit and Nvidia nvprof, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-4-NVProf.pdf (pdf: 1.4 MB)

Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,

Download File: hips17-snowflake.pdf (pdf: 475 KB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Download File: wosc14chill.pdf (pdf: 973 KB)

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

Download File: hipc13chill.pdf (pdf: 989 KB)

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

Best Paper Award

Leonid Oliker, Julian Borrill, Hongzhang Shan, John Shalf, Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark, 2007,

Download File: SC07-MadBench-talk.ppt (ppt: 2.7 MB)

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Download File: PMBS20-NVSHMEM-final.pdf (pdf: 659 KB)

Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç, "BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), 2021, doi: 10.1101/464420

Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017,

Aydin Buluc, Tim Mattson, Scott McMillan, Jose Moreira, Carl Yang, "Design of the GraphBLAS API for C", IEEE Workshop on Graph Algorithm Building Blocks, IPDPSW, 2017,

Download File: GABB17.pdf (pdf: 359 KB)

E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109

M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, 10417 LN:79--91, doi: 10.1007/978-3-319-64203-1_6

Jeremy Kepner, Peter Aaltonen, David Bader, Aydin Buluç, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, José Moreira, John Owens, Carl Yang, Marcin Zalewski, Timothy Mattson, "Mathematical foundations of the GraphBLAS", IEEE High Performance Extreme Computing (HPEC), September 1, 2016,

Aydin Buluç, Scott Beamer, Kamesh Madduri, Krste Asanović, David Patterson, "Distributed-memory breadth-first search on massive graphs", In D. Bader (editor), Parallel Graph Algorithms. CRC Press/Taylor-Francis, (2015)

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96

Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high-performance distributed hash table (seed index) and uses the one-sided communication capabilities of Unified Parallel C to facilitate fine-grained parallelism. We leverage communication optimizations during construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O-intensive phases and implement an effective load-balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.

Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, Christian Schulz, "Recent advances in graph partitioning", ArXiv, (2015)

E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, 15-20-No, doi: 10.1145/2807591.2807664

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Tim Mattson, David Bader, Jon Berry, Aydin Buluc, Jack Dongarra, Christos Faloutsos, John Feo, John Gilbert, Joseph Gonzalez, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Andrew Lumsdaine, David Padua, Stephen Poole, Steve Reinhardt, Mike Stonebraker, Steve Wallach, Andrew Yoo, "Standards for Graph Algorithm Primitives", HPEC, 2013,

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Download File: CactusGRB.pdf (pdf: 712 KB)

Book Chapter

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Download File: sc11-lbmhd.pdf (pdf: 666 KB)
Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, (2010) Pages: 219--254 doi: 10.1201/b10376

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman & Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Download File: cug09-lbmhd.pdf (pdf: 443 KB)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Download File: iwapt09-27pt.pdf (pdf: 465 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", Extended Version: Lecture Notes in Computer Science, 2007,

Download File: LNCS07.pdf (pdf: 445 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

J. Carter, Y. He, J. Shalf, H. Shan, E. Strohmaier, H. Wasserman, "The Performance Effect of Multi-core on Scientific Applications", Proceedings of Cray User Group, 2007, LBNL 62662,

Download File: CUG2007FINAL.pdf (pdf: 149 KB)

J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, Y. He, H. Wasserman, J. Shalf, H. Shan, E. Strohmaier, "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture", 2007, LBNL 62500,

Download File: LBNL-62500.v3.pdf (pdf: 2.4 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Download File: ipdps10-ast.pdf (pdf: 789 KB)

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

Best Paper Award

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures, such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible enough to compute scattered light intensities in all spatial directions, allowing full reconstruction of GISAXS patterns for any complex structures at high resolution while reducing simulation times from months to minutes.

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)

Xuan Jiang, Raja Sengupta, James Demmel, Samuel Williams, "Large scale multi-GPU based parallel traffic simulation for accelerated traffic assignment and propagation", Transportation Research Part C: Emerging Technologies, December 2024, 169:104873, doi: 10.1016/j.trc.2024.104873

Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017,

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, "Accelerating Time-to-Solution for Computational Science and Engineering", SciDAC Review, Number 15, December 2009,

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Download File: Workflow_roofline-6.pdf (pdf: 1.2 MB)

Nan Ding, Pieter Maris, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, Samuel Williams, "Evaluating the potential of disaggregated memory systems for HPC applications", Concurrency and Computation, Practice and Experience (CCPE), May 2024, doi: 10.1002/cpe.8147

Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023,

Download File: SC23_3DSpTRSV_final.pdf (pdf: 2.9 MB)

Nan Ding, Muhammad Haseeb, Taylor Groves, Samuel Williams, "Evaluating the Performance of One-sided Communication on CPUs and GPUs", 2023 International Workshop on Performance, Portability & Productivity in HPC, November 12, 2023,

Download File: OneSided_MPI_P3HPC_.pdf (pdf: 2.5 MB)

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", RESDIS, https://resdis.github.io/ws/2022/sc/, November 18, 2022,

Download File: Methodology-for-Evaluating-the-Potential-of-Disaggregated-Memory-Systems.pdf (pdf: 5.1 MB)

Taylor Groves, Chris Daley, Rahulkumar Gayatri, Hai Ah Nam, Nan Ding, Lenny Oliker, Nicholas J. Wright, Samuel Williams, "A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures", PMBS, November 2022,

Download File: PMBS22_GPU_final.pdf (pdf: 719 KB)

Nan Ding, Muaaz Awan, Samuel Williams, "Instruction Roofline: An insightful visual performance model for GPUs", CCPE, August 4, 2021, doi: 10.1002/cpe.6591

Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021,

Download File: Multi-GPU-SpTRSV-ACDA21-.pdf (pdf: 897 KB)

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

Download File: One-side-SPTRS-SIAM-PP20-.pdf (pdf: 2.9 MB)

Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Download File: InstructionRooflineModel-PMBS19-.pdf (pdf: 970 KB)

Nan Ding, Samuel Williams, An Instruction Roofline Model for GPUs, Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Download File: PMBS19-InstructionRoofline-1-NanDing.pdf (pdf: 6.3 MB)

Nan Ding, Samuel Williams, Sherry Li, Yang Liu, "Leveraging One-Sided Communication for Sparse Triangular Solvers", SciDAC19, July 18, 2019,

Download File: SciDAC19-Poster-SpTRSV-NanDing.pdf (pdf: 774 KB)

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Download File: SciDAC19-Poster-Roofline-SWWilliams.pdf (pdf: 4.9 MB)

Nan Ding, Victor W Lee, Wei Xue, Weimin Zheng, "Understanding Potential Performance Issues Using Resource-based Alongside Time Models", SC'18, November 13, 2018,

Download File: Modeling-SC18poster.pdf (pdf: 1.2 MB)
Download File: 2-Modeling.pdf (pdf: 1.8 MB)

Nan Ding, Wei Xue, Zhenya Song, Haohuan Fu, Shiming Xu, Weimin Zheng, "An automatic performance model-based scheduling tool for coupled climate system models", JPDC, January 31, 2018,

Haohuan Fu, Junfeng Liao, Nan Ding, Xiaohui Duan, Lin Gan, Yishuang Liang, Xinliang Wang, Jinzhe Yang, Yan Zheng, Weiguo Liu, Lanning Wang, Guangwen Yang, "Redesigning CAM-SE for peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight (ACM Gordon Bell Prize Finalist)", SC'17, November 12, 2017,

Download File: redesign.pdf (pdf: 4.8 MB)

Haohuan Fu, Junfeng Liao, Wei Xue, Lanning Wang, Dexun Chen, Long Gu, Jinxiu Xu, Nan Ding, Xinliang Wang, Conghui He, Shizhen Xu, Yishuang Liang, Jiarui Fang, Yuanchao Xu, Weijie Zheng, et al., "Refactoring and optimizing the community atmosphere model (CAM) on the Sunway TaihuLight supercomputer", SC'16, November 13, 2016,

Download File: refactoring.pdf (pdf: 1.7 MB)

Nan Ding, Wei Xue, Xu Ji, Haoyu Xu, Zhenya Song, "CESMTuner: An Auto-Tuning Framework for the Community Earth System Model", HPCC'14, IEEE, August 20, 2014, doi: 10.1109/HPCC.2014.51

Download File: CESMTuner.pdf (pdf: 860 KB)

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç, "BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), 2021, doi: 10.1101/464420

E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109

M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, 10417 LN:79--91, doi: 10.1007/978-3-319-64203-1_6

Douglas Doerfler, Farzad Fatollahi-Fard, Colin MacLean, Tan Nguyen, Samuel Williams, Nicholas J. Wright, Marco Siracusa, "Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs", International Workshop on OpenCL (iWOCL), April 2021, doi: 10.1145/3456669.3456671

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96

Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high performance distributed hash table (seed index) and uses the one-sided communication capabilities of Unified Parallel C to facilitate fine-grained parallelism. We leverage communication optimizations at the construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O intensive phases and implement an effective load balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

Download File: triangles-gabb.pdf (pdf: 384 KB)

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, 15-20-No, doi: 10.1145/2807591.2807664

Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15-servers.pdf (pdf: 1.2 MB)

Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,

Download File: paper58-iancu.pdf (pdf: 469 KB)

K. Ibrahim, L. Oliker, "Preprocessing Pipeline Optimization for Scientific Deep-Learning Workloads", IPDPS 22, June 3, 2022,

Download File: SciML-optimization-12.pdf (pdf: 17 MB)

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, Wahid Bhimji, Steven Farrell, Leonid Oliker, Michael Rowan, Nicholas J. Wright, Samuel Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", (BEST PAPER), Performance Modeling, Benchmarking, and Simulation (PMBS), November 2021,

Download File: pmbs21-DL-final.pdf (pdf: 632 KB)

Khaled Ibrahim, Roofline on GPUs (advanced topics), ECP Annual Meeting, April 2021,

Download File: ECP21-Roofline-6-advanced.pdf (pdf: 15 MB)

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Download File: PMBS20-NVSHMEM-final.pdf (pdf: 659 KB)

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPCS Special Session on High Performance Computing Benchmarking and Optimization (HPBench), July 2018,

Download File: hpbench18-roofline.pdf (pdf: 2.4 MB)

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Modern Gyrokinetic Particle-in-cell Simulation of Fusion Plasmas on Top Supercomputers", International Journal of High-Performance Computing Applications (IJHPCA), May 2017, doi: 10.1177/1094342017712059

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010

William Tang, Bei Wang, Stephane Ethier, Grzegorz Kwasniewski, Torsten Hoefler, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, Carlos Rosales-Fernandez, Tim Williams, "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing, November 2016,

Download File: sc16-gtcp-submit.pdf (pdf: 971 KB)

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends (tech report version)", LBNL Technical Report, July 1, 2016, LBNL 1005853, doi: 10.2172/1274416

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15-servers.pdf (pdf: 1.2 MB)

Khaled Z. Ibrahim, Samuel W. Williams, Evgeny Epifanovsky, Anna I. Krylov, "Analysis and Tuning of Libtensor Framework on Multicore Architectures", High Performance Computing Conference (HIPC), December 2014,

Download File: HIPC14-libtensor.pdf (pdf: 277 KB)

Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013, doi: 10.1145/2503210.2503258

Download File: sc13gtc.pdf (pdf: 1.3 MB)

Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Bei Wang, Stephane Ethier, Leonid Oliker, "Analysis and optimization of gyrokinetic toroidal simulations on homogeneous and heterogeneous platforms", International Journal of High Performance Computing Applications (IJHPCA), July 2013, doi: 10.1177/1094342013492446

B. Wang, S. Ethier, W. Tang, K. Ibrahim, K. Madduri, S. Williams, "Advances in gyrokinetic particle in cell simulation for fusion plasmas to Extreme scale", Supercomputing (SC), 2012,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Download File: sc11-gtc.pdf (pdf: 1.3 MB)

Kamesh Madduri, Eun-Jin Im, Khaled Z. Ibrahim, Samuel Williams, Stephane Ethier, Leonid Oliker, "Gyrokinetic Particle-in-cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing (PARCO), January 2011, 37:501-520, doi: 10.1016/j.parco.2011.02.001

Download File: parco11-gtc.pdf (pdf: 2 MB)

Khaled Z. Ibrahim, Erich Strohmaier, "Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions", The 39th International Conference on Parallel Processing (ICPP), 2010, 353-362,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

Ariful Azad, Mathias Jacquelin, Aydın Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

Oscar Antepara, Samuel Williams, Hans Johansen, Mary Hall, "High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs", Performance, Portability & Productivity in HPC (P3HPC), November 10, 2024,

Download File: P3HPC24_bricks_mg_final.pdf (pdf: 358 KB)

Mahesh Lakshminarasimhan, Oscar Antepara, Tuowen Zhao, Benjamin Sepanski, Protonu Basu, Hans Johansen, Mary Hall, Samuel Williams, "Bricks: A high-performance portability layer for computations on block-structured grids", The International Journal of High Performance Computing Applications (IJHPCA), August 19, 2024, doi: 10.1177/10943420241268288

Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability & Productivity in HPC (P3HPC), November 2023,

Download File: P3HPC23_bricks_final-v4.pdf (pdf: 684 KB)

Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022,

Download File: MCHPC22_final.pdf (pdf: 401 KB)

Tuowen Zhao, Mary Hall, Hans Johansen, Samuel Williams, "Improving Communication by Optimizing On-Node Data Movement with Data Layout", PPoPP, February 2021,

Download File: PPoPP-Bricks-MPI-final.pdf (pdf: 864 KB)

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019,

Download File: SC19-VectorScatter-final.pdf (pdf: 1019 KB)

Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Download File: p3hpc-bricks-final.pdf (pdf: 1.3 MB)

Tuowen Zhao, Mary Hall, Protonu Basu, Samuel Williams, Hans Johansen, "SIMD code generation for stencils on brick decompositions", Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018,

Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Download File: pgas15-hpgmg.pdf (pdf: 803 KB)

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Download File: ipdps10-ast.pdf (pdf: 789 KB)

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

Best Paper Award

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

Download File: sirev09-stencil.pdf (pdf: 2.8 MB)

K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,

Download File: parlab08-structured-poster.pdf (pdf: 1.8 MB)

Shoaib Kamil, John Shalf, Erich Strohmaier, "Power efficiency in high performance computing", IPDPS, 2008, 1-8,

Download File: powereffreportfull.pdf (pdf: 312 KB)

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Download File: CactusGRB.pdf (pdf: 712 KB)

Book Chapter

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Download File: ijpp07-cell.pdf (pdf: 1000 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Download File: Top500PowerNov14SC07.pdf (pdf: 3.8 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Download File: cf06-cell-potential.pdf (pdf: 213 KB)

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

Download File: mspc06-stencil.pdf (pdf: 421 KB)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

P Koanantakool, A Azad, A Buluc, D Morozov, SY Oh, L Oliker, K Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, January 2016, 842--853, doi: 10.1109/IPDPS.2016.117

Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

Download File: cug11-fusion.pdf (pdf: 377 KB)

J. Krueger, P. Micikevicius, S. Williams, "Optimization of Forward Wave Modeling on Contemporary HPC Architectures", LBNL Technical Report, 2012, LBNL 5751E,

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

Download File: sc11-greenwave.pdf (pdf: 614 KB)

Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023,

Download File: SC23_3DSpTRSV_final.pdf (pdf: 2.9 MB)

Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021,

Download File: Multi-GPU-SpTRSV-ACDA21-.pdf (pdf: 897 KB)

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

Download File: One-side-SPTRS-SIAM-PP20-.pdf (pdf: 2.9 MB)

S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Download File: PMBS14-Roofline.pdf (pdf: 340 KB)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Download File: CactusGRB.pdf (pdf: 712 KB)

Book Chapter

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

Mustafa Mutiur Rahman, Zhe Bai, Jacob Robert King, Carl R. Sovinec, Xishuo Wei, Samuel Williams, Yang Liu, "Sparsified time-dependent Fourier neural operators for fusion simulations", Phys. Plasmas, December 4, 2024, 31:12, doi: 10.1063/5.0232503

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Download File: Workflow_roofline-6.pdf (pdf: 1.2 MB)

Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023,

Download File: SC23_3DSpTRSV_final.pdf (pdf: 2.9 MB)

Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021,

Download File: Multi-GPU-SpTRSV-ACDA21-.pdf (pdf: 897 KB)

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

Download File: One-side-SPTRS-SIAM-PP20-.pdf (pdf: 2.9 MB)

Tan Nguyen, Colin MacLean, Marco Siracusa, Douglas Doerfler, Nicholas J. Wright, Samuel Williams, "FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency", CCPE, August 22, 2021, doi: 10.1002/cpe.6570

Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas Doerfler, Nicholas J. Wright, "The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing", (BEST PAPER) Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS), November 2020,

Download File: PMBS20-FPGA-final.pdf (pdf: 2.9 MB)

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Modern Gyrokinetic Particle-in-cell Simulation of Fusion Plasmas on Top Supercomputers", International Journal of High-Performance Computing Applications (IJHPCA), May 2017, doi: 10.1177/1094342017712059

Aydin Buluç, Scott Beamer, Kamesh Madduri, Krste Asanović, David Patterson, "Distributed-memory breadth-first search on massive graphs", In D. Bader (editor), Parallel Graph Algorithms. CRC Press/Taylor-Francis, (2015)

B. Wang, S. Ethier, W. Tang, K. Ibrahim, K. Madduri, S. Williams, "Advances in gyrokinetic particle in cell simulation for fusion plasmas to Extreme scale", Supercomputing (SC), 2012,

K Madduri, J Su, S Williams, L Oliker, S Ethier, K Yelick, "Optimization of parallel particle-to-grid interpolation on leading multicore platforms", IEEE Transactions on Parallel and Distributed Systems, January 1, 2012, 23:1915--1922, doi: 10.1109/TPDS.2012.28

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Download File: sc11-gtc.pdf (pdf: 1.3 MB)

Kamesh Madduri, Eun-Jin Im, Khaled Z. Ibrahim, Samuel Williams, Stephane Ethier, Leonid Oliker, "Gyrokinetic Particle-in-cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing (PARCO), January 2011, 37:501 - 520, doi: 10.1016/j.parco.2011.02.001

Download File: parco11-gtc.pdf (pdf: 2 MB)

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)
Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

Download File: sc09-gtc.pdf (pdf: 3 MB)

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Esmond Ng, Katherine J. Evans, Peter Caldwell, Forrest M. Hoffman, Charles Jackson, Kerstin Van Dam, Ruby Leung, Daniel F. Martin, George Ostrouchov, Raymond Tuminaro, Paul Ullrich, Stefan Wild, Samuel Williams, "Advances in Cross-Cutting Ideas for Computational Climate Science (AXICCS)", January 2017, doi: 10.2172/1341564

Download File: AXICCS-Report.pdf (pdf: 4 MB)

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

Download File: cms2.pdf (pdf: 613 KB)

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

Download File: sc11-greenwave.pdf (pdf: 614 KB)

Marghoob Mohiyuddin, Mark Murphy, Leonid Oliker, John Shalf, John Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

Download File: sc09-cotuning.pdf (pdf: 912 KB)

P Koanantakool, A Azad, A Buluc, D Morozov, SY Oh, L Oliker, K Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, January 2016, 842--853, doi: 10.1109/IPDPS.2016.117

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

Download File: cug11-fusion.pdf (pdf: 377 KB)

Ariful Azad, Mathias Jacquelin, Aydın Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, Wahid Bhimji, Steven Farrell, Leonid Oliker, Michael Rowan, Nicholas J. Wright, Samuel Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", (BEST PAPER), Performance Modeling, Benchmarking, and Simulation (PMBS), November 2021,

Download File: pmbs21-DL-final.pdf (pdf: 632 KB)

Douglas Doerfler, Farzad Fatollahi-Fard, Colin MacLean, Tan Nguyen, Samuel Williams, Nicholas J. Wright, Marco Siracusa, "Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs", International Workshop on OpenCL (iWOCL), April 2021, doi: 10.1145/3456669.3456671

Nan Ding, Pieter Maris, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, Samuel Williams, "Evaluating the potential of disaggregated memory systems for HPC applications", Concurrency and Computation: Practice and Experience (CCPE), May 2024, doi: 10.1002/cpe.8147

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", RESDIS, https://resdis.github.io/ws/2022/sc/, November 18, 2022,

Download File: Methodology-for-Evaluating-the-Potential-of-Disaggregated-Memory-Systems.pdf (pdf: 5.1 MB)

Taylor Groves, Chris Daley, Rahulkumar Gayatri, Hai Ah Nam, Nan Ding, Lenny Oliker, Nicholas J. Wright, Samuel Williams, "A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures", PMBS, November 2022,

Download File: PMBS22_GPU_final.pdf (pdf: 719 KB)

K. Ibrahim, L. Oliker, "Preprocessing Pipeline Optimization for Scientific Deep-Learning Workloads", IPDPS 22, June 3, 2022,

Download File: SciML-optimization-12.pdf (pdf: 17 MB)

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Download File: PMBS20-NVSHMEM-final.pdf (pdf: 659 KB)

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Download File: SciDAC19-Poster-Roofline-SWWilliams.pdf (pdf: 4.9 MB)

Charlene Yang, Rahulkumar Gayatri, Thorsten Kurth, Protonu Basu, Zahra Ronaghi, Adedoyin Adetokunbo, Brian Friesen, Brandon Cook, Douglas Doerfler, Leonid Oliker, Jack Deslippe, Samuel Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Download File: p3hpc-roofline-final.pdf (pdf: 372 KB)

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPCS Special Session on High Performance Computing Benchmarking and Optimization (HPBench), July 2018,

Download File: hpbench18-roofline.pdf (pdf: 2.4 MB)

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, and Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, June 2018,

Download File: ISC18-RooflineAdvisor-final.pdf (pdf: 966 KB)

Philip C. Roth, Hongzhang Shan, David Riegner, Nikolas Antolin, Sarat Sreepathi, Leonid Oliker, Samuel Williams, Shirley Moore, Wolfgang Windl, "Performance Analysis and Optimization of the RAMPAGE Metal Alloy Potential Generation Software", SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS), October 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, and Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", Intel Xeon Phi Users Group (IXPUG), June 2017,

Download File: ixpug17-nesap.pdf (pdf: 395 KB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109

M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, 10417 LN:79--91, doi: 10.1007/978-3-319-64203-1_6

William Tang, Bei Wang, Stephane Ethier, Grzegorz Kwasniewski, Torsten Hoefler, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, Carlos Rosales-Fernandez, Tim Williams, "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing, November 2016,

Download File: sc16-gtcp-submit.pdf (pdf: 971 KB)

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Download File: PMBS16-KNL.pdf (pdf: 789 KB)

Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016,

Download File: ixpug16-roofline.pdf (pdf: 575 KB)

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges of the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96

Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high performance distributed hash table (seed index) and uses the one-sided communication capabilities of Unified Parallel C to facilitate fine-grained parallelism. We leverage communication optimizations at the construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O intensive phases and implement an effective load balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15nwchem.pdf (pdf: 1.1 MB)

E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, 15-20-No, doi: 10.1145/2807591.2807664

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Download File: PMBS14-Roofline.pdf (pdf: 340 KB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Download File: wosc14chill.pdf (pdf: 973 KB)

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

Download File: rpt83549.PDF (PDF: 615 KB)

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

Download File: hipc13chill.pdf (pdf: 989 KB)

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013, doi: 10.1145/2503210.2503258

Download File: sc13gtc.pdf (pdf: 1.3 MB)

Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Bei Wang, Stephane Ethier, Leonid Oliker, "Analysis and optimization of gyrokinetic toroidal simulations on homogeneous and heterogeneous platforms", International Journal of High Performance Computing Applications (IJHPCA), July 2013, doi: 10.1177/1094342013492446

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Download File: miniGMGLBNL-6676E.pdf (pdf: 906 KB)

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

Download File: sc12-mg.pdf (pdf: 808 KB)
Download File: sc12mgtalk.pdf (pdf: 1.9 MB)

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Download File: sc11-lbmhd.pdf (pdf: 666 KB)
Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, (2010)

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, (2010) Pages: 219--254 doi: 10.1201/b10376

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Download File: ipdps10-ast.pdf (pdf: 789 KB)

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman & Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Download File: ipdps10-fmm.pdf (pdf: 671 KB)

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

Best Paper Award

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Download File: cug09-lbmhd.pdf (pdf: 443 KB)

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Download File: arcs09-viva.pdf (pdf: 448 KB)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Download File: iwapt09-27pt.pdf (pdf: 465 KB)

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

Download File: sirev09-stencil.pdf (pdf: 2.8 MB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)

Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Download File: siampp10-gtc-talk.pdf (pdf: 2.7 MB)
Download File: siampp10-gtc-talk.pptx (pptx: 1.3 MB)

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Download File: jpconf8125012089.pdf (pdf: 874 KB)

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

Leonid Oliker, Julian Borrill, Hongzhang Shan, John Shalf, Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark, 2007,

Download File: SC07-MadBench-talk.ppt (ppt: 2.7 MB)

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Download File: CactusGRB.pdf (pdf: 712 KB)

Book Chapter

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", Extended Version: Lecture Notes in Computer Science, 2007,

Download File: LNCS07.pdf (pdf: 445 KB)

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Download File: ijpp07-cell.pdf (pdf: 1000 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, ( 2007)

Chapter

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Download File: cf06-cell-potential.pdf (pdf: 213 KB)

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", High Performance Computing for Computational Science, 2006,

Download File: vecpar06-vector.pdf (pdf: 410 KB)

Highest Ranked Conference Paper

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, 2006,

Download File: vecpar06-vector.pdf (pdf: 410 KB)

Jonathan Carter, Leonid Oliker, John Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, Springer Berlin/Heidelberg, 2006, 4395:490-503,

Download File: LNCS07-vector.pdf (pdf: 445 KB)

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

Download File: mspc06-stencil.pdf (pdf: 421 KB)

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

H. Shan, E. Strohmaier, L. Oliker, "Optimizing Performance of Superscalar Codes for a Single Cray X1 MSP", Proceedings of the 46th Cray User Group Conference:CUG, 2004,

Aydin Buluç, Scott Beamer, Kamesh Madduri, Krste Asanović, David Patterson, "Distributed-memory breadth-first search on massive graphs", In D. Bader (editor), Parallel Graph Algorithms. CRC Press/Taylor-Francis, (2015)

S. Williams, A. Waterman, D. Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM (CACM), April 2009, doi: 10.1145/1498765.1498785

Samuel Webb Williams, Andrew Waterman, David A. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", EECS Tech Report UCB/EECS-2008-134, October 2008,

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

Download File: hotchips00-viram-talk.pdf (pdf: 57 KB)

"Machine learning and understanding for intelligent extreme scale scientific computing and discovery", DOE ASCR Machine Learning Workshop Report, January 2015, doi: 10.2172/1471083

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.


S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Download File: PMBS16-KNL.pdf (pdf: 789 KB)

Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

Abhinav Sarje, Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers, Cray User Group (CUG), May 12, 2016,

Abhinav Sarje, Particle Swarm Optimization, DUNE Wire-Cell Reconstruction Summit, December 2015,

Abhinav Sarje, Parallel Performance Optimizations on Unstructured Mesh-Based Simulations, International Conference on Computational Science, June 2015,

Download File: SarjeICCS2015.pdf (pdf: 4.6 MB)

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code, MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud, "High-Performance I/O: HDF5 for Lattice QCD", arXiv:1501.06992, January 2015,

Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.

"Machine learning and understanding for intelligent extreme scale scientific computing and discovery", DOE ASCR Machine Learning Workshop Report, January 2015, doi: 10.2172/1471083

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as Small Angle X-ray Scattering (SAXS), which is the focus of this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

Abhinav Sarje, Next-Generation Scientific Computing with Graphics Processors, Beijing Computational Science Research Center, February 2012,

Nan Ding, Pieter Maris, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, Samuel Williams, "Evaluating the potential of disaggregated memory systems for HPC applications", Concurrency and Computation, Practice and Experience (CCPE), May 2024, doi: 10.1002/cpe.8147

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", RESDIS, https://resdis.github.io/ws/2022/sc/, November 18, 2022,

Download File: Methodology-for-Evaluating-the-Potential-of-Disaggregated-Memory-Systems.pdf (pdf: 5.1 MB)

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

Download File: cms2.pdf (pdf: 613 KB)

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Download File: hpgmg.pdf (pdf: 183 KB)

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Download File: miniGMGLBNL-6676E.pdf (pdf: 906 KB)

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

Download File: sc12-mg.pdf (pdf: 808 KB)
Download File: sc12mgtalk.pdf (pdf: 1.9 MB)

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Download File: sc11-gtc.pdf (pdf: 1.3 MB)

Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Download File: sc11-lbmhd.pdf (pdf: 666 KB)
Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

Download File: sc11-greenwave.pdf (pdf: 614 KB)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, ( 2010) Pages: 219--254 doi: 10.1201/b10376

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Download File: ipdps10-ast.pdf (pdf: 789 KB)

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman & Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

Best Paper Award

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Download File: cug09-lbmhd.pdf (pdf: 443 KB)

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

Download File: sc09-gtc.pdf (pdf: 3 MB)

Marghoob Mohiyuddin, Mark Murphy, Leonid Oliker, John Shalf, John Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

Download File: sc09-cotuning.pdf (pdf: 912 KB)

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Download File: arcs09-viva.pdf (pdf: 448 KB)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Download File: iwapt09-27pt.pdf (pdf: 465 KB)

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

Download File: sirev09-stencil.pdf (pdf: 2.8 MB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)

Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Download File: siampp10-gtc-talk.pdf (pdf: 2.7 MB)
Download File: siampp10-gtc-talk.pptx (pptx: 1.3 MB)

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Download File: jpconf8125012089.pdf (pdf: 874 KB)

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)

Shoaib Kamil, John Shalf, Erich Strohmaier, "Power efficiency in high performance computing", IPDPS, 2008, 1-8,

Download File: powereffreportfull.pdf (pdf: 312 KB)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

Leonid Oliker, Julian Borrill, Hongzhang Shan, John Shalf, Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark, 2007,

Download File: SC07-MadBench-talk.ppt (ppt: 2.7 MB)

J. Shalf, L. Oliker, M. Lijewski, S. Kamil, J. Carter, A. Canning, S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications", Chapman & Hall/CRC Computational Science, (CRC Press: 2007) Pages: 1

Download File: CactusGRB.pdf (pdf: 712 KB)

Book Chapter

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", Extended Version: Lecture Notes in Computer Science, 2007,

Download File: LNCS07.pdf (pdf: 445 KB)

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Download File: ijpp07-cell.pdf (pdf: 1000 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

J. Carter, Y. He, J. Shalf, H. Shan, E. Strohmaier, H. Wasserman, "The Performance Effect of Multi-core on Scientific Applications", Proceedings of Cray User Group, 2007, LBNL 62662,

Download File: CUG2007FINAL.pdf (pdf: 149 KB)

J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, Y. He, H. Wasserman, J. Shalf, H. Shan, E. Strohmaier, "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture", 2007, LBNL 62500,

Download File: LBNL-62500.v3.pdf (pdf: 2.4 MB)

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Download File: Top500PowerNov14SC07.pdf (pdf: 3.8 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Download File: cf06-cell-potential.pdf (pdf: 213 KB)

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", High Performance Computing for Computational Science, 2006,

Download File: vecpar06-vector.pdf (pdf: 410 KB)

Highest Ranked Conference Paper

J. Carter, L. Oliker, J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, 2006,

Download File: vecpar06-vector.pdf (pdf: 410 KB)

Jonathan Carter, Leonid Oliker, John Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems", VECPAR, Springer Berlin/Heidelberg, 2006, 4395:490-503,

Download File: LNCS07-vector.pdf (pdf: 445 KB)

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

Download File: mspc06-stencil.pdf (pdf: 421 KB)

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

Hongzhang Shan, Samuel Williams, Calvin W. Johnson, "Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2018,

Download File: pmbs18-reduce-final.pdf (pdf: 572 KB)

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, and Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, June 2018,

Download File: ISC18-RooflineAdvisor-final.pdf (pdf: 966 KB)

Philip C. Roth, Hongzhang Shan, David Riegner, Nikolas Antolin, Sarat Sreepathi, Leonid Oliker, Samuel Williams, Shirley Moore, Wolfgang Windl, "Performance Analysis and Optimization of the RAMPAGE Metal Alloy Potential Generation Software", SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS), October 2017,

Hongzhang Shan, Samuel Williams, Calvin Johnson, Kenneth McElvain, "A Locality-based Threading Algorithm for the Configuration-Interaction Method", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Download File: pdsec17-bigstick.pdf (pdf: 715 KB)

H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", Proceedings of PAW 2016: 1st PGAS Applications Workshop (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008

Download File: PAW16-stencil.pdf (pdf: 601 KB)

Hongzhang Shan, Kenneth McElvain, Calvin Johnson, Samuel Williams, W. Erich Ormand, "Parallel Implementation and Performance Optimization of the Configuration-Interaction Method", Supercomputing (SC), November 2015, doi: 10.1145/2807591.2807618

Download File: sc15-bigstick.pdf (pdf: 864 KB)

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Download File: pgas15-hpgmg.pdf (pdf: 803 KB)

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15nwchem.pdf (pdf: 1.1 MB)

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

Download File: rpt83549.PDF (PDF: 615 KB)

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

Hongzhang Shan, Erich Strohmaier, James Amundson, Eric G. Stern, "Optimizing The Advanced Accelerator Simulation Framework Synergia Using OpenMP", IWOMP'12 Proceedings of the 8th International Conference on OpenMP, June 11, 2012,

Download File: synergia.pdf (pdf: 850 KB)

David H. Bailey, Lin-Wang Wang, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, Byounghak Lee, "Tuning an electronic structure code", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2011) Pages: 339-354 doi: 10.1201/b10509

Hongzhang Shan, Erich Strohmaier, "Developing a Parameterized Performance Proxy for Sequential Scientific Kernels", 12th IEEE International Conference on High Performance Computing and Communications (HPCC), September 1, 2010, doi: 10.1109/HPCC.2010.50

Zhengji Zhao, Juan Meza, Byounghak Lee, Hongzhang Shan, Eric Strohmaier, David H. Bailey, Lin-Wang Wang, "The linearly scaling 3D fragment method for large scale electronic structure calculations", Journal of Physics: Conference Series, July 1, 2009,

Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, David H. Bailey, "Linearly scaling 3D fragment method for large-scale electronic structure calculations", Proceedings of SC08, November 2008,

Leonid Oliker, Julian Borrill, Hongzhang Shan, John Shalf, Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark, 2007,

Download File: SC07-MadBench-talk.ppt (ppt: 2.7 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

J. Carter, Y. He, J. Shalf, H. Shan, E. Strohmaier, H. Wasserman, "The Performance Effect of Multi-core on Scientific Applications", Proceedings of Cray User Group, 2007, LBNL 62662,

Download File: CUG2007FINAL.pdf (pdf: 149 KB)

J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, Y. He, H. Wasserman, J. Shalf, H. Shan, E. Strohmaier, "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture", 2007, LBNL 62500,

Download File: LBNL-62500.v3.pdf (pdf: 2.4 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

E. Strohmaier, Hongzhang Shan, "Architecture Independent Performance Characterization and Benchmarking for Scientific Applications", International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Volendam, The Netherlands, October 2004,

Download File: mascots.pdf (pdf: 382 KB)

Hongzhang Shan, E. Strohmaier, "Performance Characterization of Cray X1 and Their Implications for Application Performance Tuning", International Conference of Supercomputing, Saint-Malo, France, June 2004,

Download File: ics04-x1.pdf (pdf: 292 KB)

H. Shan, E. Strohmaier, L. Oliker, "Optimizing Performance of Superscalar Codes for a Single Cray X1 MSP", Proceedings of the 46th Cray User Group Conference (CUG), 2004,

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Download File: hpgmg.pdf (pdf: 183 KB)

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

Hongzhang Shan, Erich Strohmaier, James Amundson, Eric G. Stern, "Optimizing The Advanced Accelerator Simulation Framework Synergia Using OpenMP", IWOMP'12 Proceedings of the 8th International Conference on OpenMP, June 11, 2012,

Download File: synergia.pdf (pdf: 850 KB)

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

David H. Bailey, Lin-Wang Wang, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, Byounghak Lee, "Tuning an electronic structure code", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2011) Pages: 339-354 doi: 10.1201/b10509

Khaled Z. Ibrahim, Erich Strohmaier, "Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions", The 39th International Conference on Parallel Processing (ICPP), 2010, 353--362,

Hongzhang Shan, Erich Strohmaier, "Developing a Parameterized Performance Proxy for Sequential Scientific Kernels", 12th IEEE International Conference on High Performance Computing and Communications (HPCC), September 1, 2010, doi: 10.1109/HPCC.2010.50

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

Zhengji Zhao, Juan Meza, Byounghak Lee, Hongzhang Shan, Eric Strohmaier, David H. Bailey, Lin-Wang Wang, "The linearly scaling 3D fragment method for large scale electronic structure calculations", Journal of Physics: Conference Series, July 1, 2009,

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

Download File: sc09-gtc.pdf (pdf: 3 MB)

Kamesh Madduri, Samuel Williams, S. Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Download File: siampp10-gtc-talk.pdf (pdf: 2.7 MB)
Download File: siampp10-gtc-talk.pptx (pptx: 1.3 MB)

Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, David H. Bailey, "Linearly scaling 3D fragment method for large-scale electronic structure calculations", Proceedings of SC08, November 2008,

Shoaib Kamil, John Shalf, Erich Strohmaier, "Power efficiency in high performance computing", IPDPS, 2008, 1-8,

Download File: powereffreportfull.pdf (pdf: 312 KB)

Costin Iancu, Erich Strohmaier, "Optimizing communication overlap for high-speed networks", Principles and Practice of Parallel Programming (PPoPP), 2007,

Download File: paper58-iancu.pdf (pdf: 469 KB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Scientific application performance on candidate petascale platforms", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM, 2007, doi: 10.1109/IPDPS.2007.370259

Download File: ipdps07-petascale.pdf (pdf: 4.4 MB)

J. Carter, Y. He, J. Shalf, H. Shan, E. Strohmaier, H. Wasserman, "The Performance Effect of Multi-core on Scientific Applications", Proceedings of Cray User Group, 2007, LBNL 62662,

Download File: CUG2007FINAL.pdf (pdf: 149 KB)

J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, Y. He, H. Wasserman, J. Shalf, H. Shan, E. Strohmaier, "Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture", 2007, LBNL 62500,

Download File: LBNL-62500.v3.pdf (pdf: 2.4 MB)

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

Download File: Top500PowerNov14SC07.pdf (pdf: 3.8 MB)

L. Oliker, A. Canning, J. Carter, C. Iancu, M. Lijewski, S. Kamil, J. Shalf, H. Shan, E. Strohmaier, S. Ethier, T. Goodale, "Performance Characteristics of Potential Petascale Scientific Applications", Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC Computational Science Series (Hardcover), edited by David A. Bader, (2007)

Chapter

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

E. Strohmaier, Hongzhang Shan, "Architecture Independent Performance Characterization and Benchmarking for Scientific Applications", International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Volendam, The Netherlands, October 2004,

Download File: mascots.pdf (pdf: 382 KB)

Hongzhang Shan, E. Strohmaier, "Performance Characterization of Cray X1 and Their Implications for Application Performance Tuning", International Conference of Supercomputing, Saint-Malo, France, June 2004,

Download File: ics04-x1.pdf (pdf: 292 KB)

H. Shan, E. Strohmaier, L. Oliker, "Optimizing Performance of Superscalar Codes for a Single Cray X1 MSP", Proceedings of the 46th Cray User Group Conference (CUG), 2004,

E. Strohmaier, Performance Characterization and Benchmarking for High Performance Systems and Applications, University of Tennessee, CS Seminar, November 8, 2002,

Download File: PerfCharBench-Knox.pdf (pdf: 348 KB)

Erich Strohmaier, Performance Characterization and Benchmarking for High Performance Systems and Applications, CCS Seminar, October 9, 2002,

Download File: SC20022.pdf (pdf: 115 KB)

Erich Strohmaier, Benchmarking for High Performance Systems and Applications, DARPA HPCS Performance Workshop, September 19, 2002,

Download File: Bench-Darpa.pdf (pdf: 414 KB)

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

Download File: CU16SWWilliams.pptx (pptx: 1 MB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Download File: PMBS14-Roofline.pdf (pdf: 340 KB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Download File: wosc14chill.pdf (pdf: 973 KB)

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Download File: hpgmg.pdf (pdf: 183 KB)

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

Download File: hipc13chill.pdf (pdf: 989 KB)

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul H.J. Kelly, Gihan Mudalige, Brian Van Straalen, Sam Williams, "Loop chaining: A programming abstraction for balancing locality and parallelism", Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, 375--384, doi: 10.1109/IPDPSW.2013.68

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Download File: miniGMGLBNL-6676E.pdf (pdf: 906 KB)

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

Download File: sc12-mg.pdf (pdf: 808 KB)
Download File: sc12mgtalk.pdf (pdf: 1.9 MB)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

Esmond Ng, Katherine J. Evans, Peter Caldwell, Forrest M. Hoffman, Charles Jackson, Kerstin Van Dam, Ruby Leung, Daniel F. Martin, George Ostrouchov, Raymond Tuminaro, Paul Ullrich, Stefan Wild, Samuel Williams, "Advances in Cross-Cutting Ideas for Computational Climate Science (AXICCS)", January 2017, doi: 10.2172/1341564

Download File: AXICCS-Report.pdf (pdf: 4 MB)

Mustafa Mutiur Rahman, Zhe Bai, Jacob Robert King, Carl R. Sovinec, Xishuo Wei, Samuel Williams, Yang Liu, "Sparsified time-dependent Fourier neural operators for fusion simulations", Phys. Plasmas, December 4, 2024, 31:12, doi: 10.1063/5.0232503

Xuan Jiang, Raja Sengupta, James Demmel, Samuel Williams, "Large scale multi-GPU based parallel traffic simulation for accelerated traffic assignment and propagation", Transportation Research Part C: Emerging Technologies, December 2024, 169:104873, doi: 10.1016/j.trc.2024.104873

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Download File: Workflow_roofline-6.pdf (pdf: 1.2 MB)

Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji, "Comprehensive Performance Modeling and System Design Insights for Foundation Models", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_ModelingTransformerTraining_final.pdf (pdf: 736 KB)

Brian Austin, Dhruva Kulkarni, Brandon Cook, Samuel Williams, Nicholas J. Wright, "System-Wide Roofline Profiling - a Case Study on NERSC’s Perlmutter Supercomputer", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_DCGM_final.pdf (pdf: 319 KB)

Oscar Antepara, Samuel Williams, Hans Johansen, Mary Hall, "High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs", Performance, Portability & Productivity in HPC (P3HPC), November 10, 2024,

Download File: P3HPC24_bricks_mg_final.pdf (pdf: 358 KB)

Oscar Antepara, Samuel Williams, Max Carlson, Jerry Watkins, "Performance Portable Optimizations of an Ice-sheet Modeling Code on GPU-supercomputers", Performance, Portability & Productivity in HPC (P3HPC), November 2024,

Download File: P3HPC24_IceSheet_final-v2.pdf (pdf: 1.4 MB)

Sterling Smith, Zichuan Anthony Xing, Torrin Bechtel, Severin Denk, Earl DeShazer, Orso Meneghini, Tom Neiser, Laurie Stephey, Oscar Antepara, Christopher Mitchell Clark, Eli Dart, Pengfei Ding, Sean Flanagan, Raffi Nazikian, David Schissel, Christine Simpson, Nicholas Tyler, Thomas D. Uram, Samuel Williams, "Expediting Higher Fidelity Plasma State Reconstructions for the DIII-D National Fusion Facility Using Leadership Class Computing Resources", Extreme-Scale Experiment-in-the-Loop Computing (XLOOP), November 2024,

Mahesh Lakshminarasimhan, Oscar Antepara, Tuowen Zhao, Benjamin Sepanski, Protonu Basu, Hans Johansen, Mary Hall, Samuel Williams, "Bricks: A high-performance portability layer for computations on block-structured grids", The International Journal of High Performance Computing Applications (IJHPCA), August 19, 2024, doi: 10.1177/10943420241268288

Mahesh Lakshminarasimhan, Mary Hall, Samuel Williams, Oscar Antepara, "BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs", Proceedings of the 53rd International Conference on Parallel Processing (ICPP), August 12, 2024,

Download File: ICPP24_BrickDL_final-v2.pdf (pdf: 1.7 MB)

Nan Ding, Pieter Maris, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, Samuel Williams, "Evaluating the potential of disaggregated memory systems for HPC applications", Concurrency and Computation, Practice and Experience (CCPE), May 2024, doi: 10.1002/cpe.8147

Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability & Productivity in HPC (P3HPC), November 2023,

Download File: P3HPC23_bricks_final-v4.pdf (pdf: 684 KB)

Oscar Antepara, Samuel Williams, Scott Kruger, Torrin Bechtel, Joseph McClenaghan, Lang Lao, "Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code", Workshop on Accelerator Programming and Directives (WACCPD), November 2023,

Download File: WACCPD23_EFIT_final.pdf (pdf: 697 KB)

Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023,

Download File: SC23_3DSpTRSV_final.pdf (pdf: 2.9 MB)

Nan Ding, Muhammad Haseeb, Taylor Groves, Samuel Williams, "Evaluating the Performance of One-sided Communication on CPUs and GPUs", 2023 International Workshop on Performance, Portability & Productivity in HPC, November 12, 2023,

Download File: OneSided_MPI_P3HPC_.pdf (pdf: 2.5 MB)

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 8, 2023,

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", RESDIS, https://resdis.github.io/ws/2022/sc/, November 18, 2022,

Download File: Methodology-for-Evaluating-the-Potential-of-Disaggregated-Memory-Systems.pdf (pdf: 5.1 MB)

Taylor Groves, Chris Daley, Rahulkumar Gayatri, Hai Ah Nam, Nan Ding, Lenny Oliker, Nicholas J. Wright, Samuel Williams, "A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures", PMBS, November 2022,

Download File: PMBS22_GPU_final.pdf (pdf: 719 KB)

Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022,

Download File: MCHPC22_final.pdf (pdf: 401 KB)

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, May 2022,

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, Wahid Bhimji, Steven Farrell, Leonid Oliker, Michael Rowan, Nicholas J. Wright, Samuel Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", (BEST PAPER), Performance Modeling, Benchmarking, and Simulation (PMBS), November 2021,

Download File: pmbs21-DL-final.pdf (pdf: 632 KB)

Marco Siracusa, Emanuele Del Sozzo, Marco Rabozzi, Lorenzo Di Tucci, Samuel Williams, Donatella Sciuto, Marco Domenico Santambrogio, "A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model", Transactions on Computers (TC), September 2021, doi: 10.1109/TC.2021.3111761

Tan Nguyen, Colin MacLean, Marco Siracusa, Douglas Doerfler, Nicholas J. Wright, Samuel Williams, "FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency", CCPE, August 22, 2021, doi: 10.1002/cpe.6570

Nan Ding, Muaaz Awan, Samuel Williams, "Instruction Roofline: An insightful visual performance model for GPUs", CCPE, August 4, 2021, doi: 10.1002/cpe.6591

Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021,

Download File: Multi-GPU-SpTRSV-ACDA21-.pdf (pdf: 897 KB)

Charlene Yang, Yunsong Wang, Thorsten Kurth, Steven Farrell, Samuel Williams, "Hierarchical Roofline Performance Analysis for Deep Learning Applications", Intelligent Computing, LNNS, July 15, 2021, doi: 10.1007/978-3-030-80126-7

Douglas Doerfler, Farzad Fatollahi-Fard, Colin MacLean, Tan Nguyen, Samuel Williams, Nicholas J. Wright, Marco Siracusa, "Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs", International Workshop on OpenCL (iWOCL), April 2021, doi: 10.1145/3456669.3456671

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, April 2021,

Download File: ECP21-Roofline-1-intro.pdf (pdf: 22 MB)

Samuel Williams, Roofline Analysis on NVIDIA GPUs, ECP Annual Meeting, April 2021,

Download File: ECP21-Roofline-2-NVIDIA.pdf (pdf: 14 MB)

Tuowen Zhao, Mary Hall, Hans Johansen, Samuel Williams, "Improving Communication by Optimizing On-Node Data Movement with Data Layout", PPoPP, February 2021,

Download File: PPoPP-Bricks-MPI-final.pdf (pdf: 864 KB)

Anastasiia Butko, George Michelogiannakis, Samuel Williams, Costin Iancu, David Donofrio, John Shalf, Jonathan Carter, Irfan Siddiqi, "Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization", IEEE Conference on Rebooting Computing (ICRC), December 2020,

Download File: ICRC20-QUASAR-final.pdf (pdf: 1.1 MB)

Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas Doerfler, Nicholas J. Wright, "The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing", (BEST PAPER) Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS), November 2020,

Download File: PMBS20-FPGA-final.pdf (pdf: 2.9 MB)

Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams, "Time-Based Roofline for Deep Learning Performance Analysis", Deep Learning on Supercomputing (DLonSC), November 2020,

Download File: DLonSC20-TimeRoofline-final.pdf (pdf: 534 KB)

Samuel Williams, Introduction to the Roofline Model, Supercomputing (SC), November 2020,

Download File: 2020.11.09-1005-tut108-Tutorial-Williams-Samuel.pdf (pdf: 25 MB)

Marco Siracusa, Marco Rabozzi, Emanuele Del Sozzo, Lorenzo Di Tucci, Samuel Williams, Marco D. Santambrogio, "A CAD-based methodology to optimize HLS code via the Roofline model", International Conference on Computer Aided Design (ICCAD), November 2020, doi: 10.1145/3400302.3415730

Christopher Daley, Hadia Ahmed, Samuel Williams, Nicholas Wright, "A case study of porting HPGMG from CUDA to OpenMP target offload", The International Workshop on OpenMP (IWOMP), September 2020,

Download File: p24-daley.pdf (pdf: 272 KB)

Samuel Williams, The Roofline Model: A Bridge between Computer Science, Applied Math, and Computational Science, SciDAC Meeting, July 2020,

Download File: SciDAC20-Roofline-SWWilliams.pdf (pdf: 13 MB)

Samuel Williams, Introduction to the Roofline Model, NERSC NVIDIA Roofline Hackathon, July 2020,

Download File: NVIDIA-Roofline-intro.pdf (pdf: 33 MB)

Samuel Williams, Introduction to the Roofline Model, NERSC GPU For Science Workshop, July 2020,

Download File: GPU-For-Science-Roofline-SWWilliams.pdf (pdf: 9.6 MB)

Samuel Williams, Charlene Yang, Yunsong Wang, Roofline Performance Modeling for HPC and Deep Learning Applications, NVIDIA GPU Technology Conference (GTC), March 2020,

Download File: S21565-Roofline-1-Intro.pdf (pdf: 22 MB)

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

Download File: One-side-SPTRS-SIAM-PP20-.pdf (pdf: 2.9 MB)

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 2020,

Download File: ECP20-Roofline-1-intro.pdf (pdf: 24 MB)

Samuel Williams, Roofline on GPUs (Advanced Topics), ECP Annual Meeting, February 2020,

Download File: ECP20-Roofline-3-advanced-gpu.pdf (pdf: 18 MB)

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Download File: PMBS20-NVSHMEM-final.pdf (pdf: 659 KB)

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019,

Download File: SC19-VectorScatter-final.pdf (pdf: 1019 KB)

Nan Ding, Samuel Williams, "An Instruction Roofline Model for GPUs", Performance Modeling, Benchmarking, and Simulation (PMBS), BEST PAPER AWARD, November 18, 2019,

Download File: InstructionRooflineModel-PMBS19-.pdf (pdf: 970 KB)

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Performance Analysis of GPU Programming Models using the Roofline Scaling Trajectories", International Symposium on Benchmarking, Measuring and Optimizing (Bench), BEST PAPER AWARD, November 2019,

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system", Concurrency and Computation: Practice and Experience (CCPE), August 2019, doi: 10.1002/cpe.5547

Nan Ding, Samuel Williams, Sherry Li, Yang Liu, "Leveraging One-Sided Communication for Sparse Triangular Solvers", SciDAC19, July 18, 2019,

Download File: SciDAC19-Poster-SpTRSV-NanDing.pdf (pdf: 774 KB)

Samuel Williams, Charlene Yang, Khaled Ibrahim, Thorsten Kurth, Nan Ding, Jack Deslippe, Leonid Oliker, "Performance Analysis using the Roofline Model", SciDAC PI Meeting, July 2019,

Download File: SciDAC19-Poster-Roofline-SWWilliams.pdf (pdf: 4.9 MB)

Charlene Yang, Thorsten Kurth, Samuel Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", Cray User Group (CUG), May 2019,

Download File: cug19-roofline-final.pdf (pdf: 493 KB)

Wenjing Ma, Yulong Ao, Chao Yang, Samuel Williams, "Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight", Cluster Computing, May 2019, doi: 10.1007/s10586-019-02938-w

Charlene Yang, Samuel Williams, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, GPU Technology Conference (GTC), March 2019,

Download File: GTC19-Roofline.pdf (pdf: 73 MB)

Samuel Williams, Performance Modeling and Analysis, CS267 Lecture, University of California at Berkeley, February 14, 2019,

Download File: CS267-2019-Roofline-SWWilliams.pptx (pptx: 15 MB)
Download File: CS267-2019-Roofline-SWWilliams.pdf (pdf: 35 MB)

Samuel Williams, Introduction to the Roofline Model, Roofline Tutorial, ECP Annual Meeting, January 2019,

Download File: ECP19-Roofline-1-intro.pdf (pdf: 9.9 MB)

Samuel Williams, Roofline on CPU-based Systems, Roofline Tutorial, ECP Annual Meeting, January 2019,

Download File: ECP19-Roofline-3-cpu.pdf (pdf: 26 MB)

Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Download File: p3hpc-bricks-final.pdf (pdf: 1.3 MB)

Charlene Yang, Rahulkumar Gayatri, Thorsten Kurth, Protonu Basu, Zahra Ronaghi, Adedoyin Adetokunbo, Brian Friesen, Brandon Cook, Douglas Doerfler, Leonid Oliker, Jack Deslippe, Samuel Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018,

Download File: p3hpc-roofline-final.pdf (pdf: 372 KB)

Hongzhang Shan, Samuel Williams, Calvin W. Johnson, "Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2018,

Download File: pmbs18-reduce-final.pdf (pdf: 572 KB)

Samuel Williams, Introduction to the Roofline Model, Supercomputing, November 2018,

Download File: SC18-Roofline-1-intro.pdf (pdf: 18 MB)

Samuel Williams, Roofline on Manycore and Accelerated Systems, ModSim, August 2018,

Download File: ModSim18-SWWilliams.pdf (pdf: 15 MB)

Samuel Williams, Parallelism and Performance, MolSSI Summer School, August 2018,

Download File: MolSSI18-SWWilliams.pdf (pdf: 17 MB)

Khaled Ibrahim, Samuel Williams, Leonid Oliker, "Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis", HPCS Special Session on High Performance Computing Benchmarking and Optimization (HPBench), July 2018,

Download File: hpbench18-roofline.pdf (pdf: 2.4 MB)

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, and Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", ISC, June 2018,

Download File: ISC18-RooflineAdvisor-final.pdf (pdf: 966 KB)

Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook, Samuel Williams, "Toward Automated Application Profiling on Cray Systems", Cray User Group (CUG), May 2018,

Download File: CUG18-profiling.pdf (pdf: 184 KB)

Tuowen Zhao, Mary Hall, Protonu Basu, Samuel Williams, Hans Johansen, "SIMD code generation for stencils on brick decompositions", Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018,

Samuel Williams, Introduction to the Roofline Model, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-1-intro.pdf (pdf: 9.1 MB)

Samuel Williams, Advisor Hands-On: Stencil Example, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-6-stencil.pdf (pdf: 3.3 MB)

Samuel Williams, Performance Modeling and Analysis, CS267 lecture, University of California at Berkeley, January 30, 2018,

Download File: CS267-Roofline-SWWilliams.pdf (pdf: 18 MB)
Download File: CS267-Roofline-SWWilliams.pptx (pptx: 17 MB)

Samuel Williams, Introduction to the Roofline Model, Roofline Training, November 2017,

Download File: roofline-intro.pptx (pptx: 3.1 MB)
Download File: roofline-intro.pdf (pdf: 3.6 MB)

Philip C. Roth, Hongzhang Shan, David Riegner, Nikolas Antolin, Sarat Sreepathi, Leonid Oliker, Samuel Williams, Shirley Moore, Wolfgang Windl, "Performance Analysis and Optimization of the RAMPAGE Metal Alloy Potential Generation Software", SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS), October 2017,

Jack Deslippe, Doug Doerfler, Brandon Cook, Tareq Malas, Samuel Williams, Sudip Dosanjh, "Optimizing science applications for the Cori, Knights Landing, System at NERSC", Advances in Parallel Computing, New Frontiers in High Performance Computing and Big Data, August 2017, 30, doi: 10.3233/978-1-61499-816-7-235

Hongzhang Shan, Samuel Williams, Calvin Johnson, Kenneth McElvain, "A Locality-based Threading Algorithm for the Configuration-Interaction Method", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Download File: pdsec17-bigstick.pdf (pdf: 715 KB)

Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,

Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe, "Performance Variability on Xeon Phi", Intel Xeon Phi Users Group (IXPUG), June 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, and Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", Intel Xeon Phi Users Group (IXPUG), June 2017,

Download File: ixpug17-nesap.pdf (pdf: 395 KB)

Nathan Zhang, Michael Driscoll, Armando Fox, Charles Markley, Samuel Williams, Protonu Basu, "Snowflake: A Lightweight Portable Stencil DSL", High-level Parallel Programming Models and Supportive Environments (HIPS), May 2017,

Download File: hips17-snowflake.pdf (pdf: 475 KB)

Bei Wang, Stephane Ethier, William Tang, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Modern Gyrokinetic Particle-in-cell Simulation of Fusion Plasmas on Top Supercomputers", International Journal of High-Performance Computing Applications (IJHPCA), May 2017, doi: 10.1177/1094342017712059

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Phillip Colella, Mary Hall, "Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers", Parallel Computing (PARCO), April 2017, doi: 10.1016/j.parco.2017.04.002

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010

Esmond Ng, Katherine J. Evans, Peter Caldwell, Forrest M. Hoffman, Charles Jackson, Kerstin Van Dam, Ruby Leung, Daniel F. Martin, George Ostrouchov, Raymond Tuminaro, Paul Ullrich, Stefan Wild, Samuel Williams, "Advances in Cross-Cutting Ideas for Computational Climate Science (AXICCS)", January 2017, doi: 10.2172/1341564

Download File: AXICCS-Report.pdf (pdf: 4 MB)

Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

Download File: SC16-HPGMG-BoF-Intro.pdf (pdf: 1020 KB)

Samuel Williams, HPGMG on the Knights Landing Processor, HPGMG BoF, Supercomputing, November 2016,

Download File: SC16-HPGMG-BoF-KNL.pdf (pdf: 958 KB)

Samuel Williams, HPGMG Benchmark, Top500 BoF, Supercomputing, November 2016,

Download File: SC16-Top500-BoF-HPGMG.pdf (pdf: 1003 KB)

William Tang, Bei Wang, Stephane Ethier, Grzegorz Kwasniewski, Torsten Hoefler, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, Carlos Rosales-Fernandez, Tim Williams, "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing, November 2016,

Download File: sc16-gtcp-submit.pdf (pdf: 971 KB)

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Download File: PMBS16-KNL.pdf (pdf: 789 KB)

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, Costin Iancu, "Reaching Bandwidth Saturation Using Transparent Injection Parallelization", International Journal of High Performance Computing Applications (IJHPCA), November 2016, doi: 10.1177/1094342016672720

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends (tech report version)", LBNL. - Report Number: LBNL-1005853, July 1, 2016, LBNL 1005853, doi: 10.2172/1274416

Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016,

Download File: ixpug16-roofline.pdf (pdf: 575 KB)

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

Download File: CU16SWWilliams.pptx (pptx: 1 MB)

S Williams, D Patterson, L Oliker, J Shalf, K Yelick, The roofline model: A pedagogical tool for program analysis and optimization, 2008 IEEE Hot Chips 20 Symposium, HCS 2008, 2016, doi: 10.1109/HOTCHIPS.2008.7476531

Download File: parlab08-roofline-talk.pdf (pdf: 4.2 MB)
Download File: parlab08-roofline-talk.ppt (ppt: 4.3 MB)

H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", Proceedings of PAW 2016: 1st PGAS Applications Workshop (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008

Download File: PAW16-stencil.pdf (pdf: 601 KB)

Samuel Williams, X-TUNE, X-Stack PI Meeting, December 2015,

Download File: XStackPI2015XTuneSWWilliams.pdf (pdf: 5.9 MB)

Samuel Williams, 4th Order HPGMG-FV Implementation, HPGMG BoF, Supercomputing, November 2015,

Download File: SC15HPGMGBoF4thOrder.pdf (pdf: 1.6 MB)

Hongzhang Shan, Kenneth McElvain, Calvin Johnson, Samuel Williams, W. Erich Ormand, "Parallel Implementation and Performance Optimization of the Configuration-Interaction Method", Supercomputing (SC), November 2015, doi: 10.1145/2807591.2807618

Download File: sc15-bigstick.pdf (pdf: 864 KB)

Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Download File: pgas15-hpgmg.pdf (pdf: 803 KB)

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

Protonu Basu, Samuel Williams, Brian Van Straalen, Mary Hall, Leonid Oliker, Phillip Colella, "Compiler-Directed Transformation for Higher-Order Stencils", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Download File: ipdps15CHiLL.pdf (pdf: 1.8 MB)

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15nwchem.pdf (pdf: 1.1 MB)

Costin Iancu, Nicholas Chaimov, Khaled Z. Ibrahim, Samuel Williams, "Exploiting Communication Concurrency on High Performance Computing Systems", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15-servers.pdf (pdf: 1.2 MB)

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

Khaled Z. Ibrahim, Samuel W. Williams, Evgeny Epifanovsky, Anna I. Krylov, "Analysis and Tuning of Libtensor Framework on Multicore Architectures", High Performance Computing Conference (HIPC), December 2014,

Download File: HIPC14-libtensor.pdf (pdf: 277 KB)

Samuel Williams, HPGMG-FV, FastForward2 Proxy App Presentation, December 2014,

Download File: HPGMG-FV-FF2-Proxy-App.pptx (pptx: 985 KB)
Download File: HPGMG-FV-FF2-Proxy-App.pdf (pdf: 1.9 MB)

Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

Download File: SC14HPGMGBoF.pdf (pdf: 1.9 MB)

Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7

Download File: PMBS14-Roofline.pdf (pdf: 340 KB)

Protonu Basu, Samuel Williams, Brian Van Straalen, Leonid Oliker, Mary Hall, "Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid", Workshop on Stencil Computations (WOSC), October 2014,

Download File: wosc14chill.pdf (pdf: 973 KB)

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

Download File: rpt83549.PDF (PDF: 615 KB)

Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

George Michelogiannakis, Alexander Williams, Samuel Williams, John Shalf, "Collective Memory Transfers for Multi-Core Chips", International Conference on Supercomputing (ICS), June 2014, doi: 10.1145/2597652.2597654

Download File: cms2.pdf (pdf: 613 KB)

H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125

Download File: ipdps14mfdnfinal.pdf (pdf: 631 KB)

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Download File: hpgmg.pdf (pdf: 183 KB)

Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

Protonu Basu, Anand Venkat, Mary Hall, Samuel Williams, Brian Van Straalen, Leonid Oliker, "Compiler generation and autotuning of communication-avoiding operators for geometric multigrid", 20th International Conference on High Performance Computing (HiPC), December 2013, 452--461,

Download File: hipc13chill.pdf (pdf: 989 KB)

Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker, "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2013, doi: 10.1145/2503210.2503258

Download File: sc13gtc.pdf (pdf: 1.3 MB)

Samuel Williams, At Exascale, Will Bandwidth Be Free?, DOE ModSim Workshop, 2013,

Download File: modsim2013SWWilliams.pdf (pdf: 408 KB)

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Khaled Z Ibrahim, Kamesh Madduri, Samuel Williams, Bei Wang, Stephane Ethier, Leonid Oliker, "Analysis and optimization of gyrokinetic toroidal simulations on homogeneous and heterogeneous platforms", International Journal of High Performance Computing Applications (IJHPCA), July 2013, doi: 10.1177/1094342013492446

P. Basu, A. Venkat, M. Hall, S. Williams, B. Van Straalen, L. Oliker, "Compiler Generation and Autotuning of Communication-Avoiding Operators for Geometric Multigrid", Workshop on Stencil Computations (WOSC), 2013,

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul H.J. Kelly, Gihan Mudalige, Brian Van Straalen, Sam Williams, "Loop chaining: A programming abstraction for balancing locality and parallelism", Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, 375--384, doi: 10.1109/IPDPSW.2013.68

Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Download File: miniGMGLBNL-6676E.pdf (pdf: 906 KB)

Samuel Williams, Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors, Supercomputing (SC), November 2012,

Download File: sc12-mg-talk.pdf (pdf: 1.9 MB)

S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker, "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2012, doi: 10.1109/SC.2012.85

Download File: sc12-mg.pdf (pdf: 808 KB)
Download File: sc12mgtalk.pdf (pdf: 1.9 MB)

B. Wang, S. Ethier, W. Tang, K. Ibrahim, K. Madduri, S. Williams, "Advances in gyrokinetic particle in cell simulation for fusion plasmas to Extreme scale", Supercomputing (SC), 2012,

A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

J. Krueger, P. Micikevicius, S. Williams, "Optimization of Forward Wave Modeling on Contemporary HPC Architectures", LBNL Technical Report, 2012, LBNL 5751E,

K Madduri, J Su, S Williams, L Oliker, S Ethier, K Yelick, "Optimization of parallel particle-to-grid interpolation on leading multicore platforms", IEEE Transactions on Parallel and Distributed Systems, January 1, 2012, 23:1915--1922, doi: 10.1109/TPDS.2012.28

S. Williams, et al., Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning, Supercomputing (SC), 2011,

Download File: sc11-lbmhd-talk.pptx (pptx: 933 KB)

S. Williams, et al., Stencil Computations on CPUs, Stanford Earth Sciences Algorithms and Architectures Initiative (SESAAI), 2011,

Download File: SESAAI11-stencilsonCPUs-talk.pptx (pptx: 2.3 MB)

S. Williams, et al., Performance Optimization of HPC Applications on Multi- and Manycore Processors, Workshop on Hybrid Technologies for NASA Applications, 4th International Conference on Space Mission Challenges for Information Technology, 2011,

Download File: smc11-lbnl-talk.pptx (pptx: 3 MB)

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

S. Williams, et al., Stencil Computations on CPUs, Society of Exploration Geophysicists High-Performance Computing Workshop (SEG), July 2011,

Download File: SEG11-stencilsonCPUs-talk.pptx (pptx: 1.5 MB)

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

Download File: cug11-fusion.pdf (pdf: 377 KB)

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

Kamesh Madduri, Khaled Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker, "Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 23, doi: 10.1145/2063384.2063415

Download File: sc11-gtc.pdf (pdf: 1.3 MB)

Samuel Williams, Oliker, Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, January 2011, 55, doi: 10.1145/2063384.2063458

Download File: sc11-lbmhd.pdf (pdf: 666 KB)
Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)

Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund, "Hardware/software co-design for energy-efficient seismic modeling", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), January 2011, 73, doi: 10.1145/2063384.2063482

Download File: sc11-greenwave.pdf (pdf: 614 KB)

Kamesh Madduri, Eun-Jin Im, Khaled Z. Ibrahim, Samuel Williams, Stephane Ethier, Leonid Oliker, "Gyrokinetic Particle-in-cell Optimization on Emerging Multi- and Manycore Platforms", Parallel Computing (PARCO), January 2011, 37:501 - 520, doi: 10.1016/j.parco.2011.02.001

Download File: parco11-gtc.pdf (pdf: 2 MB)

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

M. Christen, N. Keen, T. Ligocki, L. Oliker, J. Shalf, B. van Straalen, S. Williams, "Automatic Thread-Level Parallelization in the Chombo AMR Library", LBNL Technical Report, 2011, LBNL 5109E,

Samuel W. Williams, David H. Bailey, "Parallel Computer Architecture", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010) Pages: 11-33

S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators", chapter in Scientific Computing with Multicore and Accelerators, edited by Jack Dongarra, David A. Bader, Jakub Kurzak, (2010)

S. Williams, "The Roofline Model", chapter in Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010)

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

S. Williams, et al., Lattice Boltzmann Hybrid Auto-tuning on High-End Computational Platforms, Workshop on Programming Environments for Emerging Parallel Systems (PEEPS), 2010,

Download File: peeps10-lbmhd-talk.pdf (pdf: 1.2 MB)
Download File: peeps10-lbmhd-talk.pptx (pptx: 1.3 MB)

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, (2010) Pages: 219--254 doi: 10.1201/b10376

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams, "An auto-tuning framework for parallel multicore stencil computations", International Parallel & Distributed Processing Symposium (IPDPS), January 1, 2010, 1-12, doi: 10.1109/IPDPS.2010.5470421

Download File: ipdps10-ast.pdf (pdf: 789 KB)

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman & Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc, "Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures", International Parallel & Distributed Processing Symposium (IPDPS), 2010, doi: 10.1109/IPDPS.2010.5470415

Download File: ipdps10-fmm.pdf (pdf: 671 KB)

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models", Concurrency and Computation: Practice and Experience (CCPE), August 2009, doi: 10.1002/cpe.1482

Shoaib Kamil, Cy Chan, Samuel Williams, Leonid Oliker, John Shalf, Mark Howison, E. Wes Bethel, Prabhat, "A Generalized Framework for Auto-tuning Stencil Computations", BEST PAPER AWARD - Cray User Group Conference (CUG), Atlanta, GA, May 4, 2009, LBNL 2078E,

Download File: cug09-autotune.pdf (pdf: 354 KB)

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Download File: cug09-lbmhd.pdf (pdf: 443 KB)

S. Williams, et al., A Generalized Framework for Auto-tuning Stencil Computations, Cray User Group (CUG), 2009,

Download File: cug09-ast-talk.pdf (pdf: 835 KB)
Download File: cug09-ast-talk.pptx (pptx: 814 KB)

S. Williams, et al., Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4, Cray User Group (CUG), 2009,

Download File: cug09-hybridLBMHD-talk.pdf (pdf: 911 KB)
Download File: cug09-hybridLBMHD-talk.pptx (pptx: 981 KB)

S. Williams, A. Waterman, D. Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM (CACM), April 2009, doi: 10.1145/1498765.1498785

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

Download File: sc09-gtc.pdf (pdf: 3 MB)

Marghoob Mohiyuddin, Murphy, Oliker, Shalf, Wawrzynek, Samuel Williams, "A design methodology for domain-optimized power-efficient supercomputing", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009, doi: 10.1145/1654059.1654072

Download File: sc09-cotuning.pdf (pdf: 912 KB)

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Download File: arcs09-viva.pdf (pdf: 448 KB)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Download File: iwapt09-27pt.pdf (pdf: 465 KB)

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

Download File: sirev09-stencil.pdf (pdf: 2.8 MB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)

Kamesh Madduri, Williams, Ethier, Oliker, Shalf, Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Download File: siampp10-gtc-talk.pdf (pdf: 2.7 MB)
Download File: siampp10-gtc-talk.pptx (pptx: 1.3 MB)

Auto-tuning Performance on Multicore Computers, Samuel Williams, PhD, 2008,

S. Williams, Auto-tuning Performance on Multicore Computers, Ph.D. Thesis Dissertation Talk, University of California at Berkeley, 2008,

Download File: SWWilliams-Thesis-Talk.pdf (pdf: 9.8 MB)
Download File: SWWilliams-Thesis-Talk.ppt (ppt: 5 MB)

Samuel Webb Williams, Andrew Waterman, David A. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", EECS Tech Report UCB/EECS-2008-134, October 2008,

S. Williams, et al., "Auto-tuning and the Roofline model", View From the Top: Craig Mundie (Ph.D student poster session), 2008,

Download File: swwilliams-mundie-poster.pdf (pdf: 866 KB)

S. Williams, et al., The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures, Hot Chips 20, August 10, 2008,

Download File: hotchips08-roofline-talk.pdf (pdf: 8 MB)

S. Williams, et al., A Vision for Integrating Performance Counters into the Roofline model, UPCRC PMU Workshop (Performance Counters), 2008,

Download File: pmu08-roofline-talk.pdf (pdf: 2.9 MB)
Download File: pmu08-roofline-talk.ppt (ppt: 2 MB)

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

Download File: jpconf8125012038.pdf (pdf: 1.2 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Download File: jpconf8125012089.pdf (pdf: 874 KB)

S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

Download File: scidac08-peri-talk.pdf (pdf: 9.5 MB)
Download File: scidac08-peri-talk.ppt (ppt: 5.5 MB)

K. Datta, S. Williams, V. Volkov, M. Murphy, "Autotuning Structured Grid Kernels", ParLab Summer Retreat, 2008,

Download File: parlab08-stencillbmhd-poster.pdf (pdf: 3.6 MB)

S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, M. Halem, "Impacts of the IBM Cell Processor on Supporting Climate Models", International Supercomputing Conference (ISC), 2008,

S. Williams, et al., "The Roofline Model: A Pedagogical Tool for Program Analysis and Optimization", Parlab Summer Retreat, 2008,

Download File: parlab08-roofline-poster.pdf (pdf: 1.3 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, Lattice Boltzmann simulation optimization on leading multicore platforms, IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Pages: 1-14 2008,

Download File: ipdps08-lbmhd-talk.pdf (pdf: 10 MB)
Download File: ipdps08-lbmhd-talk.ppt (ppt: 2.6 MB)

K. Datta, S. Williams, S. Kamil, "Autotuning Structured Grid Kernels", Parlab Winter Retreat, 2008,

Download File: parlab08-structured-poster.pdf (pdf: 1.8 MB)

S. Williams, et al., Autotuning Sparse and Structured Grid Kernels, Parlab Winter Retreat, 2008,

Download File: parlab08-spmvstructured-talk.pdf (pdf: 8.1 MB)
Download File: parlab08-spmvstructured-talk.ppt (ppt: 2.9 MB)

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, DOE/DOD Workshop on Emerging High-Performance Architectures and Applications, 2007,

Download File: hpa07-spmv-talk.pdf (pdf: 7.9 MB)
Download File: hpa07-spmv-talk.ppt (ppt: 2.7 MB)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

S. Williams, et al., Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms, Supercomputing (SC), 2007,

Download File: sc07-spmv-talk.pdf (pdf: 6.4 MB)
Download File: sc07-spmv-talk.ppt (ppt: 2.5 MB)

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core processors, Center for Scalable Application Development Software (CScADS), 2007,

Download File: cscads07-spmv-talk.pdf (pdf: 1.4 MB)
Download File: cscads07-spmv-talk.ppt (ppt: 754 KB)

S. Williams, et al., Tuning Sparse Matrix Vector Multiplication for multi-core SMPs, Parlab Seminar, 2007,

Download File: parlab07-spmv-talk.pdf (pdf: 1.2 MB)
Download File: parlab07-spmv-talk.ppt (ppt: 1.6 MB)

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Download File: ijpp07-cell.pdf (pdf: 1000 KB)

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

S. Williams, et al., 3D Lattice Boltzmann Magneto-hydrodynamics (LBMHD3D), UTK Summit on Software and Algorithms for the Cell Processor, 2006,

Download File: utk06-lbmhd-talk.pdf (pdf: 3.7 MB)
Download File: utk06-lbmhd-talk.ppt (ppt: 784 KB)

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Download File: cf06-cell-potential.pdf (pdf: 213 KB)

S. Williams, et al., The Potential of the Cell Processor for Scientific Computing, LBL Scientific Computing Seminar, 2006,

Download File: lbl06-cell-talk.pdf (pdf: 4.8 MB)

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, The potential of the cell processor for scientific computing, Proceedings of the 3rd Conference on Computing Frontiers 2006, CF 06, Pages: 9--20 2006, doi: 10.1145/1128022.1128027

Download File: transmeta06-cell-talk.ppt (ppt: 896 KB)

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

Download File: mspc06-stencil.pdf (pdf: 421 KB)

Samuel Williams, Shalf, Oliker, Kamil, Husbands, Katherine A. Yelick, The potential of the cell processor for scientific computing, Conf. Computing Frontiers, Pages: 9-20 2006,

Download File: edge06-handout.pdf (pdf: 270 KB)

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

Download File: hotchips00-viram-talk.pdf (pdf: 57 KB)

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Download File: Workflow_roofline-6.pdf (pdf: 1.2 MB)

Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji, "Comprehensive Performance Modeling and System Design Insights for Foundation Models", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_ModelingTransformerTraining_final.pdf (pdf: 736 KB)

Brian Austin, Dhruva Kulkarni, Brandon Cook, Samuel Williams, Nicholas J. Wright, "System-Wide Roofline Profiling - a Case Study on NERSC’s Perlmutter Supercomputer", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

Download File: PMBS24_DCGM_final.pdf (pdf: 319 KB)

Nan Ding, Pieter Maris, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, Samuel Williams, "Evaluating the potential of disaggregated memory systems for HPC applications", Concurrency and Computation, Practice and Experience (CCPE), May 2024, doi: 10.1002/cpe.8147

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, Wahid Bhimji, Steven Farrell, Leonid Oliker, Michael Rowan, Nicholas J. Wright, Samuel Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", (BEST PAPER), Performance Modeling, Benchmarking, and Simulation (PMBS), November 2021,

Download File: pmbs21-DL-final.pdf (pdf: 632 KB)

Tan Nguyen, Colin MacLean, Marco Siracusa, Douglas Doerfler, Nicholas J. Wright, Samuel Williams, "FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency", CCPE, August 22, 2021, doi: 10.1002/cpe.6570

Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

P. Narayanan, A. Koniges, L. Oliker, R. Preissl, S. Williams, N. Wright, M. Umansky, X. Xu, S. Ethier, W. Wang, J. Candy, J. Cary, "Performance Characterization for Fusion Co-design Applications", Cray Users Group (CUG), May 2011,

Download File: cug11-fusion.pdf (pdf: 377 KB)

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.


Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç, "BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), 2021, doi: 10.1101/464420

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

Download File: PMBS20-NVSHMEM-final.pdf (pdf: 659 KB)

E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109

M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, 10417 LN:79--91, doi: 10.1007/978-3-319-64203-1_6

P Koanantakool, A Azad, A Buluc, D Morozov, SY Oh, L Oliker, K Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, January 2016, 842--853, doi: 10.1109/IPDPS.2016.117

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Download File: pgas15-hpgmg.pdf (pdf: 803 KB)

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96

Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high performance distributed hash table (seed index) and uses one-sided communication capabilities of the Unified Parallel C to facilitate a fine-grained parallelism. We leverage communication optimizations at the construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O intensive phases and implement an effective load balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.

E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, 15-20-No, doi: 10.1145/2807591.2807664

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Download File: ScaleUsingOneSided.pdf (pdf: 522 KB)

K Madduri, J Su, S Williams, L Oliker, S Ethier, K Yelick, "Optimization of parallel particle-to-grid interpolation on leading multicore platforms", IEEE Transactions on Parallel and Distributed Systems, January 1, 2012, 23:1915--1922, doi: 10.1109/TPDS.2012.28

J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, (2010) Pages: 219--254 doi: 10.1201/b10376

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman & Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Download File: cug09-lbmhd.pdf (pdf: 443 KB)

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

Download File: sc09-gtc.pdf (pdf: 3 MB)

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, 5455 LNC:146--158, doi: 10.1007/978-3-642-00454-4_16

Download File: arcs09-viva.pdf (pdf: 448 KB)

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

Download File: iwapt09-27pt.pdf (pdf: 465 KB)

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

Download File: sirev09-stencil.pdf (pdf: 2.8 MB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

Download File: jpdc09-lbmhd.pdf (pdf: 1.1 MB)

Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

Download File: siampp10-gtc-talk.pdf (pdf: 2.7 MB)
Download File: siampp10-gtc-talk.pptx (pptx: 1.3 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Download File: jpconf8125012089.pdf (pdf: 874 KB)

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

Download File: scidac08-peri-talk.pdf (pdf: 9.5 MB)
Download File: scidac08-peri-talk.ppt (ppt: 5.5 MB)

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

Download File: sc08-stencil.pdf (pdf: 598 KB)

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Download File: ipdps08-lbmhd.pdf (pdf: 560 KB)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Download File: sc07-spmv.pdf (pdf: 438 KB)

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

Download File: parco08-spmv.pdf (pdf: 1.5 MB)

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Download File: ijpp07-cell.pdf (pdf: 1000 KB)

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Download File: cf06-cell-potential.pdf (pdf: 213 KB)

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

Download File: mspc06-stencil.pdf (pdf: 421 KB)

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

Download File: hotchips00-viram-talk.pdf (pdf: 57 KB)

H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", Proceedings of PAW 2016: 1st PGAS Applications Workshop (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008

Download File: PAW16-stencil.pdf (pdf: 601 KB)

D Unat, C Chan, W Zhang, S Williams, J Bachan, J Bell, J Shalf, "ExaSAT: An exascale co-design tool for performance modeling", International Journal of High Performance Computing Applications, January 2015, 29:209--232, doi: 10.1177/1094342014568690

Download File: International-Journal-of-High-Performance-Computing-Applications-2015-Unat-209-32.pdf (pdf: 4.3 MB)

H Shan, S Williams, Y Zheng, W Zhang, B Wang, S Ethier, Z Zhao, "Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication", Proceedings of PAW 2016: 1st PGAS Applications Workshop (PAW), January 2016, 17--24, doi: 10.1109/PAW.2016.008

Download File: PAW16-stencil.pdf (pdf: 601 KB)

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Download File: pgas15-hpgmg.pdf (pdf: 803 KB)

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Download File: PGAS14-miniGMG.pdf (pdf: 1.2 MB)

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,

Download File: pmam15nwchem.pdf (pdf: 1.1 MB)

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,

Download File: rpt83549.PDF (PDF: 615 KB)

Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13

Neil Mehta, Roofline on NVIDIA at NERSC, ECP Annual Meeting, May 2022,

Download File: ECP22-Roofline-2-NVIDIA-and-NERSC.pdf (pdf: 2.6 MB)

JaeHyuk Kwack, Roofline Performance Analysis w/ Intel Advisor on Intel CPUs & GPUs, ECP Annual Meeting, May 2022,

Download File: ECP22-Roofline-4-Intel-and-ALCF.pdf (pdf: 14 MB)

Charlene Yang, Hierarchical Roofline Analysis on GPUs, ECP Annual Meeting, February 2020,

Download File: ECP20-Roofline-2-gpu.pdf (pdf: 38 MB)

Charlene Yang, Hierarchical Roofline Analysis on CPUs, ECP Annual Meeting, February 2020,

Download File: ECP20-Roofline-4-cpu.pdf (pdf: 26 MB)

Jack Deslippe, Guiding Optimization with the Roofline Model, ECP Annual Meeting, February 2020,

Download File: ECP20-Roofline-5-CaseStudies-Conclusions.pdf (pdf: 62 MB)

Charlene Yang, Intel Advisor on Cori, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-5-Advisor.pdf (pdf: 2.7 MB)

Charlene Yang, LIKWID at NERSC, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-3-LIKWID.pdf (pdf: 4.8 MB)

Jack Deslippe, Guiding Optimization on KNL with the Roofline Model, ECP Annual Meeting, February 8, 2018,

Download File: ECP18-Roofline-2-NESAP.pdf (pdf: 7.1 MB)

Vladimir Marjanovic, HPC Benchmarking, HPGMG BoF, Supercomputing, November 2015,

Download File: SC15HPGMGBoFProfiling.pdf (pdf: 1.7 MB)

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, PERI -- Auto-tuning Memory-intensive Kernels for Multicore, Journal of Physics: Conference Series, Pages: 012038 2008,

Publications

Mark F Adams

2016

Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

2014

Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

Hadia Ahmed

2020

Christopher Daley, Hadia Ahmed, Samuel Williams, Nicholas Wright, "A case study of porting HPGMG from CUDA to OpenMP target offload", The International Workshop on OpenMP (IWOMP), September 2020,


Hasan Metin Aktulga

2016

2014

H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125

Ann S. Almgren

2014

2012

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Oscar Antepara

2024

Oscar Antepara, Samuel Williams, Hans Johansen, Mary Hall, "High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs", Performance, Portability & Productivity in HPC (P3HPC), November 10, 2024,

Oscar Antepara, Samuel Williams, Max Carlson, Jerry Watkins, "Performance Portable Optimizations of an Ice-sheet Modeling Code on GPU-supercomputers", Performance, Portability & Productivity in HPC (P3HPC), November 2024,

Mahesh Lakshminarasimhan, Mary Hall, Samuel Williams, Oscar Antepara, "BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs", Proceedings of the 53rd International Conference on Parallel Processing (ICPP), August 12, 2024,

2023

Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability & Productivity in HPC (P3HPC), November 2023,

Oscar Antepara, Samuel Williams, Scott Kruger, Torrin Bechtel, Joseph McClenaghan, Lang Lao, "Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code", Workshop on Accelerator Programming and Directives (WACCPD), November 2023,

Brian Austin

2024

Nan Ding, Brian Austin, Yang Liu, Neil Mehta, Steven Farrell, Johannes P. Blaschke, Leonid Oliker, Hai Ah Nam, Nicholas J. Wright, Samuel Williams, "A Workflow Roofline Model for End-to-End Workflow Performance Analysis", Supercomputing (SC), November 17, 2024,

Brian Austin, Dhruva Kulkarni, Brandon Cook, Samuel Williams, Nicholas J. Wright, "System-Wide Roofline Profiling - a Case Study on NERSC’s Perlmutter Supercomputer", Performance Modeling, Benchmarking, and Simulation (PMBS), November 2024,

2017

Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe, "Performance Variability on Xeon Phi", Intel Xeon Phi Users Group (IXPUG), June 2017,

2013

2012

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Space Programming Models, October 10, 2012,

Ariful Azad

2018

Ariful Azad, Georgios A. Pavlopoulos, Christos A. Ouzounis, Nikos C. Kyrpides, Aydin Buluç, "HipMCL: A high-performance parallel implementation of the Markov cluster algorithm for large scale networks", Nucleic Acids Research, April 2018,

2017

Ariful Azad, Aydin Buluc, "Towards a GraphBLAS Library in Chapel", IPDPS Workshops, Orlando, FL, May 2017,

Ariful Azad, Aydin Buluc, "A work-efficient parallel sparse matrix-sparse vector multiplication algorithm", IEEE International Parallel & Distributed Processing Symposium (IPDPS), Orlando, FL, May 2017,

Ariful Azad, Mathias Jacquelin, Aydın Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

2016

Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

Ariful Azad, Bartek Rajwa, Alex Pothen, "flowVS: Channel-Specific Variance Stabilization in Flow Cytometry", BMC Bioinformatics, June 2016,

Ariful Azad, Aydın Buluç, "A matrix-algebraic formulation of distributed-memory maximal cardinality matching algorithms in bipartite graphs", Parallel Computing, June 2016,

Ariful Azad, Aydin Buluç, "Distributed-Memory Algorithms for Maximum Cardinality Matching in Bipartite Graphs", IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2016,

Ariful Azad, Aydın Buluç, Alex Pothen, "Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting", IEEE Transactions on Parallel and Distributed Systems (TPDS), May 2016,

Ariful Azad, Aydın Buluç, Distributed-memory algorithms for cardinality matching using matrix algebra, SIAM Conference on Parallel Processing for Scientific Computing (PP), Paris, France, April 2016,

2015

Ariful Azad, Aydin Buluc, "Distributed-Memory Algorithms for Maximal Cardinality Matching using Matrix Algebra", IEEE Cluster, Chicago, IL, September 2015,

Mahantesh Halappanavar, Alex Pothen, Ariful Azad, Fredrik Manne, Johannes Langguth, Arif Khan, "Codesign Lessons Learned from Implementing Graph Matching on Multithreaded Architectures", IEEE Computer, August 2015,

Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

Ariful Azad, Aydin Buluç, Alex Pothen, "A Parallel Tree Grafting Algorithm for Maximum Cardinality Matching in Bipartite Graphs", International Parallel and Distributed Processing Symposium (IPDPS), May 2015,

Zhe Bai

2024

Mustafa Mutiur Rahman, Zhe Bai, Jacob Robert King, Carl R. Sovinec, Xishuo Wei, Samuel Williams, Yang Liu, "Sparsified time-dependent Fourier neural operators for fusion simulations", Phys. Plasmas, December 4, 2024, 31:12, doi: 10.1063/5.0232503

David H. Bailey

2013

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

2011

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

David H. Bailey, Robert F. Lucas, Samuel W. Williams, ed., Performance Tuning of Scientific Applications, (CRC Press: 2011)

2010

Samuel W. Williams, David H. Bailey, "Parallel Computer Architecture", Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, (CRC Press: 2010) Pages: 11-33

A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950


2009

Zhengji Zhao, Juan Meza, Byounghak Lee, Hongzhang Shan, Erich Strohmaier, David H. Bailey, Lin-Wang Wang, "The linearly scaling 3D fragment method for large scale electronic structure calculations", Journal of Physics: Conference Series, July 1, 2009,

2008

Lin-Wang Wang, Byounghak Lee, Hongzhang Shan, Zhengji Zhao, Juan Meza, Erich Strohmaier, David H. Bailey, "Linearly scaling 3D fragment method for large-scale electronic structure calculations", Proceedings of SC08, November 2008,

D. Bailey, J. Chame, C. Chen, J. Dongarra, M. Hall, J. Hollingsworth, P. Hovland, S. Moore, K. Seymour, J. Shin, A. Tiwari, S. Williams, H. You, "PERI Auto-tuning", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012001, 2008,

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

2007

John Shalf, Shoaib Kamil, David Bailey, Erich Strohmaier, Power Efficiency and the Top500, 2007,

2006

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker, "Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark", December 2012, LBNL 6676E,

Ariful Azad, Bartek Rajwa, Alex Pothen, "flowVS: Channel-Specific Variance Stabilization in Flow Cytometry", BMC Bioinformatics, June 2016,