SciDAC-3
Researchers from FTG are engaged in a number of activities within the Scientific Discovery through Advanced Computing (SciDAC) initiative. The SciDAC program was initiated in 2001 to develop the scientific computing software and hardware infrastructure needed to advance scientific discovery using supercomputers. As supercomputers continuously evolve, direct engagement of computer scientists and applied mathematicians with the scientists of targeted application domains becomes ever more necessary for taking full advantage of these new systems. To that end, SciDAC is a partnership involving all of the Department of Energy (DOE) Office of Science (SC) programs (Advanced Scientific Computing Research (ASCR), Basic Energy Sciences (BES), Biological and Environmental Research (BER), Fusion Energy Sciences (FES), High Energy Physics (HEP), and Nuclear Physics (NP)) whose goal is to dramatically accelerate progress in scientific computing and deliver breakthrough scientific results through teams of applied mathematicians, computer scientists, and scientists from other disciplines.
Researchers from FTG leverage their skills in auto-tuning, optimization, and performance modeling to improve the performance of these partner applications. Common techniques and solutions are centralized and shared through the Institute for Sustained Performance, Energy, and Resilience (SUPER). The table below summarizes the FTG personnel involved with each of the SciDAC partnerships.
SUPER - Institute for Sustained Performance, Energy, and Resilience
Atmospheric and Ocean Climate Modeling: Leonid Oliker, Abhinav Sarje, Samuel Williams
Greenland and Antarctic Ice Sheet Modeling:
Quantum Chromodynamics: Samuel Williams, Hongzhang Shan
Chemistry (LibTensor):
Chemistry (NWChem): Samuel Williams, Khaled Ibrahim
Roofline Toolkit: Leonid Oliker, Hongzhang Shan
Publications
Journal Article
2017
Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends", Journal of Parallel and Distributed Computing (JPDC), February 2017, doi: 10.1016/j.jpdc.2017.02.010
2016
Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699
 Download File: ieeetpdsmfdnlobpcgrev.pdf (pdf: 889 KB)
Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM Journal on Scientific Computing, 38(5), pp. S358-S384, October 2016, doi: 10.1137/15M1010117
J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262
The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+), two- (H_2, He), ten- (CH_4), and 56-electron (C_8H_8) systems.
2015
Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud, "High-Performance I/O: HDF5 for Lattice QCD", arXiv:1501.06992, January 2015,
Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC-supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.
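The parallel-write pattern behind such an HDF5 framework can be sketched in plain Python: each MPI rank computes the offset and extent of the hyperslab it writes into the shared dataset. This is an illustrative stand-in under assumed names and an assumed even block decomposition, not the actual USQCD/HDF5 code:

```python
def hyperslab(global_dims, grid_dims, coords):
    """Compute the per-rank offset and local extent of an even block
    decomposition, as used when each MPI rank writes its local
    subvolume into one shared HDF5 dataset.

    global_dims: full lattice dimensions
    grid_dims:   process-grid dimensions
    coords:      this rank's coordinates in the process grid
    """
    offset, count = [], []
    for g, p, c in zip(global_dims, grid_dims, coords):
        assert g % p == 0, "each dimension must divide evenly"
        local = g // p
        offset.append(c * local)   # where this rank's block starts
        count.append(local)        # how many sites it owns per dimension
    return offset, count

# A 16^3 x 32 lattice split across a 2 x 2 x 2 x 4 process grid:
off, cnt = hyperslab([16, 16, 16, 32], [2, 2, 2, 4], [1, 0, 1, 3])
# off == [8, 0, 8, 24], cnt == [8, 8, 8, 8]
```

Aligning dataset chunks with these per-rank subvolumes is the kind of tuning the abstract's closing remark refers to.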
Conference Paper
2018
Hongzhang Shan, Samuel Williams, Calvin W. Johnson, "Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2018,
 Download File: pmbs18reducefinal.pdf (pdf: 572 KB)
2017
Philip C. Roth, Hongzhang Shan, David Riegner, Nikolas Antolin, Sarat Sreepathi, Leonid Oliker, Samuel Williams, Shirley Moore, Wolfgang Windl, "Performance Analysis and Optimization of the RAMPAGE Metal Alloy Potential Generation Software", SIGPLAN International Workshop on Software Engineering for Parallel Systems (SEPS), October 2017,
Hongzhang Shan, Samuel Williams, Calvin Johnson, Kenneth McElvain, "A Locality-based Threading Algorithm for the Configuration-Interaction Method", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,
 Download File: pdsec17bigstick.pdf (pdf: 715 KB)
Bryce Adelstein Lelbach, Hans Johansen, Samuel Williams, "Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon", Parallel and Distributed Scientific and Engineering Computing (PDSEC), June 2017,
Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe, "Performance Variability on Xeon Phi", Intel Xeon Phi Users Group (IXPUG), June 2017,
Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, and Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", Intel Xeon Phi Users Group (IXPUG), June 2017,
 Download File: ixpug17nesap.pdf (pdf: 395 KB)
2016
Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,
 Download File: PMBS16KNL.pdf (pdf: 789 KB)
Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, and Andrea L. Bertozzi, "OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms", 12th International Workshop on OpenMP (iWOMP), October 2016, doi: 10.1007/978-3-319-45550-1_2
Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti, "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor", Intel Xeon Phi User Group Workshop (IXPUG), June 2016,
 Download File: ixpug16roofline.pdf (pdf: 575 KB)
Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,
2015
Hongzhang Shan, Kenneth McElvain, Calvin Johnson, Samuel Williams, W. Erich Ormand, "Parallel Implementation and Performance Optimization of the Configuration-Interaction Method", Supercomputing (SC), November 2015, doi: 10.1145/2807591.2807618
 Download File: sc15bigstick.pdf (pdf: 864 KB)
Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12
Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, ISSN 1877-0509, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466
This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitionings with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2x. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
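The idea behind the ordering of data elements for cache efficiency can be illustrated with a minimal sketch: renumber mesh cells so that cells adjacent in the mesh tend to be adjacent in memory. A simple breadth-first traversal (a stand-in for the more sophisticated predictive ordering the paper describes; all names are illustrative) already captures the effect:

```python
from collections import deque

def locality_ordering(neighbors):
    """Breadth-first reordering of mesh cells so that neighboring cells
    land near each other in memory, improving cache reuse for
    unstructured (e.g., Voronoi-mesh) stencil sweeps.

    neighbors: mapping from cell index (0..n-1) to its adjacent cells.
    Returns order, where order[i] is the old index of the cell placed
    at new position i.
    """
    n = len(neighbors)
    order, seen = [], [False] * n
    for start in range(n):          # handle disconnected components
        if seen[start]:
            continue
        queue = deque([start])
        seen[start] = True
        while queue:
            cell = queue.popleft()
            order.append(cell)
            for nb in neighbors[cell]:
                if not seen[nb]:
                    seen[nb] = True
                    queue.append(nb)
    return order

# Tiny example: a chain 0-2-4 plus a separate pair 1-3, given scrambled.
adj = {0: [2], 1: [3], 2: [0, 4], 3: [1], 4: [2]}
print(locality_ordering(adj))  # -> [0, 2, 4, 1, 3]
```

Production codes would typically use a graph partitioner or space-filling-curve ordering instead, but the goal is the same: neighbors in the mesh become neighbors in memory.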
Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", Programming Models and Applications for Multicores and Manycores (PMAM), February 2015,
 Download File: pmam15nwchem.pdf (pdf: 1.1 MB)
2014
Khaled Z. Ibrahim, Samuel W. Williams, Evgeny Epifanovsky, Anna I. Krylov, "Analysis and Tuning of Libtensor Framework on Multicore Architectures", International Conference on High Performance Computing (HiPC), December 2014,
 Download File: HIPC14libtensor.pdf (pdf: 277 KB)
Yu Jung Lo, Samuel Williams, Brian Van Straalen, Terry J. Ligocki, Matthew J. Cordery, Leonid Oliker, Mary W. Hall, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2014, doi: 10.1007/978-3-319-17248-4_7
 Download File: PMBS14Roofline.pdf (pdf: 340 KB)
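The model that the toolkit above characterizes empirically reduces to a one-line bound: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch (the machine numbers are hypothetical, not measurements from the paper):

```python
def roofline(peak_gflops, peak_bw_gbs, ai):
    """Attainable performance (GFLOP/s) under the Roofline model.

    peak_gflops: peak floating-point rate of the machine
    peak_bw_gbs: peak memory bandwidth in GB/s
    ai:          arithmetic intensity of the kernel in FLOPs per byte
    """
    # Bandwidth-bound below the ridge point, compute-bound above it.
    return min(peak_gflops, peak_bw_gbs * ai)

# Hypothetical machine: 2300 GFLOP/s peak, 100 GB/s DRAM bandwidth.
for ai in (0.25, 1.0, 4.0, 64.0):
    print(ai, roofline(2300.0, 100.0, ai))
# Low-intensity kernels (e.g., stencils at 0.25 FLOPs/byte) are capped
# at 25 GFLOP/s by bandwidth; only above 23 FLOPs/byte does this
# machine's compute peak become the binding limit.
```

The toolkit's contribution is measuring realistic, rather than theoretical, values for the peak terms via microbenchmarks.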
W.A. de Jong, L. Lin, H. Shan, C. Yang and L. Oliker, "Towards modelling complex mesoscale molecular environments", International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), 2014,
H. M. Aktulga, A. Buluc, S. Williams, C. Yang, "Optimizing Sparse Matrix-Multiple Vector Multiplication for Nuclear Configuration Interaction Calculations", International Parallel and Distributed Processing Symposium (IPDPS 2014), May 2014, doi: 10.1109/IPDPS.2014.125
 Download File: ipdps14mfdnfinal.pdf (pdf: 631 KB)
2013
Hongzhang Shan, Brian Austin, Wibe de Jong, Leonid Oliker, Nick Wright, Edoardo Apra, "Performance Tuning of Fock Matrix and Two Electron Integral Calculations for NWChem on Leading HPC Platforms", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2013, doi: 10.1007/978-3-319-10214-6_13
2011
Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf, "Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, ACM, November 2011, Article 55, doi: 10.1145/2063384.2063458
 Download File: sc11lbmhd.pdf (pdf: 666 KB)
 Download File: sc11lbmhdtalk.pdf (pdf: 1.4 MB)
Presentation/Talk
2017
Samuel Williams, Introduction to the Roofline Model, Roofline Training, November 2017,
 Download File: rooflineintro.pptx (pptx: 3.1 MB)
 Download File: rooflineintro.pdf (pdf: 3.6 MB)
2016
Abhinav Sarje, Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers, Cray Users Group (CUG), May 12, 2016,
2015
Abhinav Sarje, Parallel Performance Optimizations on Unstructured Mesh-Based Simulations, International Conference on Computational Science, June 2015,
 Download File: SarjeICCS2015.pdf (pdf: 4.6 MB)
Report
2016
Khaled Z. Ibrahim, Evgeny Epifanovsky, Samuel Williams, Anna I. Krylov, "Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends (tech report version)", LBNL Technical Report LBNL-1005853, July 1, 2016, doi: 10.2172/1274416
2014
Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "ThreadLevel Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E,
 Download File: rpt83549.PDF (PDF: 615 KB)