# Publications

### 2024

### Jan Balewski, Mercy G Amankwah, Roel Van Beeumen, E Wes Bethel, Talita Perciano, Daan Camps, "Quantum-parallel vectorized data encodings and computations on trapped-ion and transmon QPUs", Scientific Reports, February 10, 2024, 14, doi: https://doi.org/10.1038/s41598-024-53720-x

### 2023

### John Bachan, Scott B. Baden, Dan Bonachea, Johnny Corbino, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2023.9.0", Lawrence Berkeley National Laboratory Tech Report LBNL-2001560, December 2023, doi: 10.25344/S4P01J

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### Yang Liu, Nan Ding, Piyush Sao, Samuel Williams, Xiaoye Sherry Li, "Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters", Supercomputing (SC), November 2023

- Download File: SC23_3DSpTRSV_final.pdf (pdf: 2.9 MB)

### Julian Bellavita, Mathias Jacquelin, Esmond G. Ng, Dan Bonachea, Johnny Corbino, Paul H. Hargrove, "symPACK: A GPU-Capable Fan-Out Sparse Cholesky Solver", 2023 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM'23), ACM, November 13, 2023, doi: 10.1145/3624062.3624600

Sparse symmetric positive definite systems of equations are ubiquitous in scientific workloads and applications. Parallel sparse Cholesky factorization is the method of choice for solving such linear systems. Therefore, the development of parallel sparse Cholesky codes that can efficiently run on today’s large-scale heterogeneous distributed-memory platforms is of vital importance. Modern supercomputers offer nodes that contain a mix of CPUs and GPUs. To fully utilize the computing power of these nodes, scientific codes must be adapted to offload expensive computations to GPUs.

We present symPACK, a GPU-capable parallel sparse Cholesky solver that uses one-sided communication primitives and remote procedure calls provided by the UPC++ library. We also utilize the UPC++ "memory kinds" feature to enable efficient communication of GPU-resident data. We show that on a number of large problems, symPACK outperforms comparable state-of-the-art GPU-capable Cholesky factorization codes by up to 14x on the NERSC Perlmutter supercomputer.

### E Wes Bethel, Mercy G Amankwah, Jan Balewski, Roel Van Beeumen, Daan Camps, Daniel Huang, Talita Perciano, "Quantum computing and visualization: A disruptive technological change ahead", IEEE Computer Graphics and Applications, November 6, 2023, 43, doi: https://doi.org/10.1109/MCG.2023.3316932

### John Bachan, Scott B. Baden, Dan Bonachea, Johnny Corbino, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2023.3.0", Lawrence Berkeley National Laboratory Tech Report, March 30, 2023, LBNL 2001517, doi: 10.25344/S43591

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### 2022

### X. Li, Y. Liu, P. Lin, P. Sao, "Newly released capabilities in distributed-memory SuperLU sparse direct solver", ACM Transactions on Mathematical Software, November 19, 2022

- Download File: 3577197.pdf (pdf: 1.1 MB)

### M. Wang, Y. Liu, P. Ghysels, A. C. Yucel, "VoxImp: Impedance Extraction Simulator for Voxelized Structures", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2, 2022, doi: 10.1109/TCAD.2022.3218768

### John Bachan, Scott B. Baden, Dan Bonachea, Johnny Corbino, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2022.9.0", Lawrence Berkeley National Laboratory Tech Report, September 30, 2022, LBNL 2001479, doi: 10.25344/S4QW26

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### Hengrui Luo, Younghyun Cho, James W. Demmel, Xiaoye S. Li, Yang Liu, "Hybrid models for mixed variables in Bayesian optimization", June 6, 2022

### M. G. Amankwah, D. Camps, E. W. Bethel, R. Van Beeumen, T. Perciano, "Quantum pixel representations and compression for N-dimensional images", Nature Scientific Reports, May 11, 2022, 12:7712, doi: 10.1038/s41598-022-11024-y

### Lipeng Wan, Axel Huebl, Junmin Gu, Franz Poeschel, Ana Gainaru, Ruonan Wang, Jieyang Chen, Xin Liang, Dmitry Ganyushin, Todd Munson, Ian Foster, Jean-Luc Vay, Norbert Podhorszki, Kesheng Wu, Scott Klasky, "Improving I/O Performance for Exascale Applications Through Online Data Layout Reorganization", IEEE Transactions on Parallel and Distributed Systems, 2022, 33:878-890, doi: 10.1109/TPDS.2021.3100784

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2022.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2022, LBNL 2001453, doi: 10.25344/S41C7Q

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### X. Zhu, Y. Liu, P. Ghysels, D. Bindal, X. S. Li, "GPTuneBand: multi-task and multi-fidelity Bayesian optimization for autotuning large-scale high performance computing applications", SIAM PP, February 23, 2022

- Download File: GPTuneBand.pdf (pdf: 1.4 MB)

### E. Wes Bethel, Burlen Loring, Utkarsh Ayachit, P. N. Duque, Nicola Ferrier, Joseph Insley, Junmin Gu, Kress, Patrick O’Leary, Dave Pugmire, Silvio Rizzi, Thompson, Will Usher, Gunther H. Weber, Brad Whitlock, Wolf, Kesheng Wu, "Proximity Portability and In Transit, M-to-N Data Partitioning and Movement in SENSEI", In Situ Visualization for Computational Science, 2022, doi: 10.1007/978-3-030-81627-8_20

### E. Wes Bethel, Burlen Loring, Utkarsh Ayachit, David Camp, P. N. Duque, Nicola Ferrier, Joseph Insley, Junmin Gu, Kress, Patrick O’Leary, David Pugmire, Silvio Rizzi, Thompson, Gunther H. Weber, Brad Whitlock, Matthew Wolf, Kesheng Wu, "The SENSEI Generic In Situ Interface: Tool and Processing Portability at Scale", In Situ Visualization for Computational Science, 2022, doi: 10.1007/978-3-030-81627-8_13

### 2021

### Y. Cho, J. W. Demmel, X. S. Li, Y. Liu, H. Luo, "Enhancing autotuning capability with a history database", IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), December 20, 2021

- Download File: GPTuneHistoryDB.pdf (pdf: 390 KB)

### Franz Poeschel, Juncheng E, William F. Godoy, Norbert Podhorszki, Scott Klasky, Greg Eisenhauer, Philip E. Davis, Lipeng Wan, Ana Gainaru, Junmin Gu, Fabian Koller, René Widera, Michael Bussmann, Axel Huebl, "Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2", Smoky Mountains Computational Sciences and Engineering Conference (SMC2021), 2021

### Pietro Benedusi, Michael L Minion, Rolf Krause, "An experimental comparison of a space-time multigrid method with PFASST for a reaction-diffusion problem", Computers & Mathematics with Applications, October 1, 2021

- Download File: Benedusi-Minion-Krause.pdf (pdf: 372 KB)

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2021.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2021, LBNL 2001424, doi: 10.25344/S4SW2T

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

### H. Luo, J. W. Demmel, Y. Cho, X. S. Li, Y. Liu, "Non-smooth Bayesian optimization in tuning problems", arXiv preprint, September 21, 2021

### Tommaso Buvoli, Michael Minion, "IMEX Runge-Kutta Parareal for Non-diffusive Equations", Springer Proceedings in Mathematics & Statistics, August 25, 2021

### Sebastian Götschel, Michael Minion, Daniel Ruprecht, Robert Speck, "Twelve Ways To Fool The Masses When Giving Parallel-In-Time Results", Springer Proceedings in Mathematics & Statistics, August 25, 2021

- Download File: Twelve-Ways.pdf (pdf: 847 KB)

### Nan Ding, Yang Liu, Samuel Williams, Xiaoye S. Li, "A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), July 19, 2021

- Download File: Multi-GPU-SpTRSV-ACDA21-.pdf (pdf: 897 KB)

### Yang Liu, Pieter Ghysels, Lisa Claus, Xiaoye Sherry Li, "Sparse Approximate Multifrontal Factorization with Butterfly Compression for High Frequency Wave Equations", SIAM J. Sci. Comput., June 22, 2021

### J. Goings, H. Hu, C. Yang, X. Li, "Reinforcement Learning Configuration Interaction", March 31, 2021

### Karol Kowalski, Raymond Bair, Nicholas P. Bauman, Jeffery S. Boschen, Eric J. Bylaska, Jeff Daily, Wibe A. de Jong, Thom Dunning, Niranjan Govind, Robert J. Harrison, Murat Keceli, Kristopher Keipert, Sriram Krishnamoorthy, Suraj Kumar, Erdal Mutlu, Bruce Palmer, Ajay Panyala, Bo Peng, Ryan M. Richard, T. P. Straatsma, Peter Sushko, Edward F. Valeev, Marat Valiev, Hubertus J. J. van Dam, Jonathan M. Waldrop, David B. Williams-Young, Chao Yang, Marcin Zalewski, Theresa L. Windus, "From NWChem to NWChemEx: Evolving with the Computational Chemistry Landscape", Chemical Reviews, March 31, 2021, doi: 10.1021/acs.chemrev.0c00998

### Yang Liu, Xin Xing, Han Guo, Eric Michielssen, Pieter Ghysels, Xiaoye Sherry Li, "Butterfly factorization via randomized matrix-vector multiplications", SIAM J. Sci. Comput., March 9, 2021

### Thijs Steel, Daan Camps, Karl Meerbergen, Raf Vandebril, "A Multishift, Multipole Rational QZ Method with Aggressive Early Deflation", SIAM Journal on Matrix Analysis and Applications, February 19, 2021, 42:753-774, doi: 10.1137/19M1249631

In the article “A Rational QZ Method” by D. Camps, K. Meerbergen, and R. Vandebril [SIAM J. Matrix Anal. Appl., 40 (2019), pp. 943–972], we introduced rational QZ (RQZ) methods. Our theoretical examinations revealed that the convergence of the RQZ method is governed by rational subspace iteration, thereby generalizing the classical QZ method, whose convergence relies on polynomial subspace iteration. Moreover, the RQZ method operates on a pencil more general than Hessenberg–upper triangular, namely, a Hessenberg pencil, which is a pencil consisting of two Hessenberg matrices. However, the RQZ method can only be made competitive to advanced QZ implementations by using crucial add-ons such as small bulge multishift sweeps, aggressive early deflation, and optimal packing. In this paper we develop these techniques for the RQZ method. In the numerical experiments we compare the results with state-of-the-art routines for the generalized eigenvalue problem and show that the presented method is competitive in terms of speed and accuracy.

### Y. Liu, W. M. Sid-Lakhdar, O. Marques, X. Zhu, C. Meng, J. W. Demmel, X. S. Li, "GPTune: multitask learning for autotuning exascale applications", PPoPP, February 17, 2021, doi: 10.1145/3437801.3441621

### R. Van Beeumen, L. Perisa, D. Kressner, C. Yang, "A Flexible Power Method for Solving Infinite Dimensional Tensor Eigenvalue Problems", January 30, 2021

### Jean Luca Bez, Houjun Tang, Bing Xie, David Williams-Young, Rob Latham, Rob Ross, Sarp Oral, Suren Byna, "I/O Bottleneck Detection and Tuning: Connecting the Dots using Interactive Log Analysis", 2021 IEEE/ACM Sixth International Parallel Data Systems Workshop (PDSW), January 1, 2021, 15-22, doi: 10.1109/PDSW54622.2021.00008

### 2020

### C. Yang, J. Brabec, L. Veis, D. B. Williams-Young, K. Kowalski, "Solving Coupled Cluster Equations by the Newton Krylov Method", Frontiers in Chemistry, December 10, 2020, 8:987, doi: 10.3389/fchem.2020.590184

### D. B. Williams-Young, W. A. de Jong, H. J. J. van Dam and C. Yang, "On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters", Frontiers in Chemistry, December 10, 2020, 8:951, doi: 10.3389/fchem.2020.581058

### Roel Van Beeumen, Khaled Z. Ibrahim, Gregory D. Kahanamoku-Meyer, Norman Y. Yao, Chao Yang, "Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization", December 1, 2020

### William F. Godoy, Norbert Podhorszki, Ruonan Wang, Chuck Atkins, Greg Eisenhauer, Junmin Gu, Philip Davis, Jong Choi, Kai Germaschewski, Kevin Huck, Axel Huebl, Mark Kim, James Kress, Tahsin Kurc, Qing Liu, Jeremy Logan, Kshitij Mehta, George Ostrouchov, Manish Parashar, Franz Poeschel, David Pugmire, Eric Suchyta, Keichi Takahashi, Nick Thompson, Seiji Tsutsumi, Lipeng Wan, Matthew Wolf, Kesheng Wu, Scott Klasky, "ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management", SoftwareX, 2020, 12

### Daan Camps, Roel Van Beeumen, "Approximate quantum circuit synthesis using block encodings", Physical Review A, November 11, 2020, 102, doi: 10.1103/PhysRevA.102.052411

One of the challenges in quantum computing is the synthesis of unitary operators into quantum circuits with polylogarithmic gate complexity. Exact synthesis of generic unitaries requires an exponential number of gates in general. We propose a novel approximate quantum circuit synthesis technique by relaxing the unitary constraints and interchanging them for ancilla qubits via block encodings. This approach combines smaller block encodings, which are easier to synthesize, into quantum circuits for larger operators. Due to the use of block encodings, our technique is not limited to unitary operators and can be applied for the synthesis of arbitrary operators. We show that operators which can be approximated by a canonical polyadic expression with a polylogarithmic number of terms can be synthesized with polylogarithmic gate complexity with respect to the matrix dimension.

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2020.10.0", Lawrence Berkeley National Laboratory Tech Report, October 2020, LBNL 2001368, doi: 10.25344/S4HG6Q

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### T. Hernandez, R. Van Beeumen, M. Caprio, C. Yang, "A greedy algorithm for computing eigenvalues of a symmetric matrix with localized eigenvectors", Numerical Linear Algebra and Applications, October 9, 2020, 28:e2341, doi: https://doi.org/10.1002/nla.2341

### D. B. Williams-Young, P. G. Beckman, C. Yang, "A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation", ACM Trans. Math. Software, October 1, 2020, 46, doi: 10.1145/3409571

### Daan Camps, Thomas Mach, Raf Vandebril, David Watkins, "On pole-swapping algorithms for the eigenvalue problem", ETNA - Electronic Transactions on Numerical Analysis, September 18, 2020, 52:480-508, doi: 10.1553/etna_vol52s480

Pole-swapping algorithms, which are generalizations of the QZ algorithm for the generalized eigenvalue problem, are studied. A new modular (and therefore more flexible) convergence theory that applies to all pole-swapping algorithms is developed. A key component of all such algorithms is a procedure that swaps two adjacent eigenvalues in a triangular pencil. An improved swapping routine is developed, and its superiority over existing methods is demonstrated by a backward error analysis and numerical tests. The modularity of the new convergence theory and the generality of the pole-swapping approach shed new light on bi-directional chasing algorithms, optimally packed shifts, and bulge pencils, and allow the design of novel algorithms.

### D. Camps, R. Van Beeumen, C. Yang, "Quantum Fourier Transform Revisited", Numerical Linear Algebra and Applications, September 15, 2020, 28:e2331, doi: https://doi.org/10.1002/nla.2331

### Li Zhou, Lihao Yan, Mark A. Caprio, Weiguo Gao, Chao Yang, "Solving the k-sparse Eigenvalue Problem with Reinforcement Learning", September 9, 2020

### Miroslav Urbanek, Daan Camps, Roel Van Beeumen, Wibe A. de Jong, "Chemistry on quantum computers with virtual quantum subspace expansion", Journal of Chemical Theory and Computation, 2020, 16:5425–5431, doi: 10.1021/acs.jctc.0c00447

### D. B. Williams-Young, C. Yang, "Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators", ICPP20, ACM, August 1, 2020, 1-11, doi: https://doi.org/10.1145/3404397.3404416

### Gustavo Chavez, Elizaveta Rebrova, Yang Liu, Pieter Ghysels, Xiaoye Sherry Li, "Scalable and memory-efficient kernel ridge regression", 34th IEEE International Parallel and Distributed Processing Symposium, July 14, 2020

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, "UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (ALCF'20)", Argonne Leadership Computing Facility (ALCF) Webinar Series, May 27, 2020

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this webinar, hosted by DOE’s Exascale Computing Project and the ALCF, we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

### Shao-Jun Dong, Chao Wang, Yong-Jian Han, Chao Yang and Lixin He, "Stable diagonal stripes in the t–J model at $\bar{n}_h = 1/8$ doping from fPEPS calculations", npj Quantum Materials, May 8, 2020, 5:28, doi: https://doi.org/10.1038/s41535-020-0226-4

### C. T. Kelley, J. Bernholc, E. L. Briggs, S. Hamilton, L. Lin and C. Yang, "Mesh Independence of the Generalized Davidson Algorithm", Journal of Computational Physics, May 1, 2020, 409:109322, doi: https://doi.org/10.1016/j.jcp.2020.109322

### Kai-Hsin Liou, Chao Yang and James R. Chelikowsky, "Scalable Implementation of Polynomial Filtering for Density Functional Theory Calculation in PARSEC", Computer Physics Communications, April 28, 2020, In press, doi: https://doi.org/10.1016/j.cpc.2020.107330

### Li Zhou, Chao Yang, Weiguo Gao, Talita Perciano, Karen M. Davies, Nicholas K. Sauter, "Subcellular structure segmentation from cryo-electron tomograms via machine learning", PLOS Journal of Computational Biology, April 2, 2020, submitted, doi: https://doi.org/10.1101/2020.04.09.034025

### F. Henneke, L. Lin, C. Vorwerk, C. Draxl, R. Klein and C. Yang, "Fast optical absorption spectra calculations for periodic solid state systems", Communications in Applied Mathematics and Computational Science, March 16, 2020, in press

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2020, LBNL 2001269, doi: 10.25344/S4P88Z

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020

- Download File: One-side-SPTRS-SIAM-PP20-.pdf (pdf: 2.9 MB)

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, "UPC++: A PGAS/RPC Library for Asynchronous Exascale Communication in C++ (ECP'20)", Tutorial at Exascale Computing Project (ECP) Annual Meeting 2020, February 6, 2020

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

### R. Van Beeumen, G. D. Kahanamoku-Meyer, N. Y. Yao and C. Yang, "A scalable matrix-free iterative eigensolver for studying many-body localization", HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, ACM, January 7, 2020, 179-187, doi: 10.1145/3368474.3368497

### W. Hu, J. Liu, Y. Li, Z. Ding, C. Yang, J. Yang, "Accelerating Excitation Energy Computation in Molecules and Solids within Linear-Response Time-Dependent Density Functional Theory via Interpolative Separable Density Fitting Decomposition", J. Chem. Theory Comput., January 3, 2020, 16:964–973, doi: https://doi.org/10.1021/acs.jctc.9b01019

### 2019

### Junmin Gu, Burlen Loring, Kesheng Wu, E. Wes Bethel, "HDF5 as a vehicle for in transit data movement", The Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV'19), 2019, doi: 10.1145/3364228.3364237

### L. Yang, Z. Wen, C. Yang and Y. Zhang, "Block Algorithms with Augmented Rayleigh-Ritz Projections for Large-Scale Eigenpair Computation", Journal of Computational Mathematics, November 1, 2019, 37:889-915, doi: 10.4208/jcm.1910-m2019-0034

### Mark Adams, Stephen Cornford, Daniel Martin, Peter McCorquodale, "Composite matrix construction for structured grid adaptive mesh refinement", Computer Physics Communications, November 2019, 244:35-39, doi: 10.1016/j.cpc.2019.07.006

- Download File: AdamsCornfordMartinMcCorquodale.pdf (pdf: 1.2 MB)

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, "UPC++ Tutorial (NERSC Nov 2019)", National Energy Research Scientific Computing Center (NERSC), November 1, 2019

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through implementing basic algorithms in UPC++. We will also look at irregular applications and how to take advantage of UPC++ features to optimize their performance.

### Brandon Krull, Michael Minion, "Parallel-In-Time Magnus Integrators", SIAM Journal on Scientific Computing, October 1, 2019,

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### M. Zingale, M.P. Katz, J.B. Bell, M.L. Minion, A.J. Nonaka, W. Zhang, "Improved Coupling of Hydrodynamics and Nuclear Reactions via Spectral Deferred Corrections", August 14, 2019,

### John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

### Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Parallel-in-Time Multi-Level Integration of the Shallow-Water Equations on the Rotating Sphere", April 12, 2019,

Submitted to Journal of Computational Physics

### B. Peng, R. Van Beeumen, D.B. Williams-Young, K. Kowalski, C. Yang, "Approximate Green’s function coupled cluster method employing effective dimension reduction", Journal of Chemical Theory and Computation, 2019, 15:3185-3196, doi: 10.1021/acs.jctc.9b00172

### P. Benner, V. Khoromskaia, B. N. Khoromskij and C. Yang, "Computing the density of states for optical spectra of molecules by low-rank and QTT tensor approximation", Journal of Computational Physics, April 1, 2019, 382:221-239, doi: https://doi.org/10.1016/j.jcp.2019.01.011

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2019, LBNL 2001191, doi: 10.25344/S4F301

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Sebastian Götschel, Michael Minion, "An Efficient Parallel-in-Time Method for Optimization with Parabolic PDEs", SIAM Journal on Scientific Computing, January 21, 2019,

In submission

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++ (ECP'19)", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

### M. Emmett, E. Motheau, W. Zhang, M. Minion, J. B. Bell, "A Fourth-Order Adaptive Mesh Refinement Algorithm for the Multicomponent, Reacting Compressible Navier-Stokes Equations", Combustion Theory and Modeling, 2019,

### Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A parallel hierarchical blocked adaptive cross approximation algorithm", The International Journal of High Performance Computing Applications, January 1, 2019,

### Victor Yu, William Dawson, Alberto Garcia, Ville Havu, Ben Hourahine, William Huhn, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, others, "Large-Scale Benchmark of Electronic Structure Solvers with the ELSI Infrastructure", Bulletin of the American Physical Society, 2019,

### Junmin Gu, Burlen Loring, Kesheng Wu, E Wes Bethel, "HDF5 as a vehicle for in transit data movement", Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, 2019, 39--43,

### Francois P. Hamon, Martin Schreiber, Michael L. Minion, "Multi-Level Spectral Deferred Corrections Scheme for the Shallow Water Equations on the Rotating Sphere", Journal of Computational Physics, January 1, 2019, 376:435-454,

- Download File: HMM.pdf (pdf: 1.4 MB)

### Daan Camps, Karl Meerbergen, Raf Vandebril, "A rational QZ method", SIAM J. Matrix Anal. Appl., 2019, 40:943--972, doi: 10.1137/18M1170480

### Daan Camps, Karl Meerbergen, Raf Vandebril, "An implicit filter for rational Krylov using core transformations", Linear Algebra Appl., 2019, 561:113--140, doi: 10.1016/j.laa.2018.09.021

### Daan Camps, "Pole swapping methods for the eigenvalue problem - Rational QR algorithms", 2019,

### Daan Camps, Nicola Mastronardi, Raf Vandebril, Paul Van Dooren, "Swapping 2 × 2 blocks in the Schur and generalized Schur form", Journal of Computational and Applied Mathematics, 2019, doi: https://doi.org/10.1016/j.cam.2019.05.022

### 2018

### Y. Li, Z. Wen, C. Yang, Y. Yuan, "A Semi-smooth Newton Method For semidefinite programs and its applications in electronic structure calculations", SIAM J. Sci. Comput., December 18, 2018, 40:A4131–A415, doi: 10.1137/18M1188069

### ML Minion, RI Saye, "Higher-order temporal integration for the incompressible Navier–Stokes equations in bounded domains", Journal of Computational Physics, 2018, 375:797--822, doi: 10.1016/j.jcp.2018.08.054

### R. Van Beeumen, O. Marques, E.G. Ng, C. Yang, Z. Bai, L. Ge, O. Kononenko, Z. Li, C.-K. Ng, L. Xiao, "Computing resonant modes of accelerator cavities by solving nonlinear eigenvalue problems via rational approximation", Journal of Computational Physics, 2018, 374:1031-1043, doi: 10.1016/j.jcp.2018.08.017

### Shichao Sun, David B. Williams-Young, Torin F. Stetina, Xiaosong Li, "Generalized Hartree-Fock with a Non-perturbative Treatment of Strong Magnetic Fields: Application to Molecular Spin Phase Transitions", Journal of Chemical Theory and Computation, 2018, 51:348-356, doi: 10.1021/acs.jctc.8b01140

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18) Research Poster, November 2018,

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### Gianina Alina Negoita, James P. Vary, Glenn R. Luecke, Pieter Maris, Andrey M. Shirokov, Ik Jae Shin, Youngman Kim, Esmond G. Ng, Chao Yang, Matthew Lockner, Gurpur M. Prabhu, "Deep Learning: Extrapolation Tool for Ab Initio Nuclear Theory", CoRR, November 10, 2018,

### Hongyuan Zhan, Gabriel Gomes, Xiaoye S Li, Kamesh Madduri, Kesheng Wu, "Efficient Online Hyperparameter Optimization for Kernel Ridge Regression with Applications to Traffic Time Series Prediction", arXiv preprint arXiv:1811.00620, 2018,

### J. Duersch, M. Shao, C. Yang, M. Gu, "A Robust and Efficient Implementation of LOBPCG", SIAM J. Sci. Comput., October 4, 2018, 40:C655–C676, doi: 10.1137/17M1129830

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2018.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2018, LBNL 2001180, doi: 10.25344/S49G6V

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 8", Lawrence Berkeley National Laboratory Tech Report, September 26, 2018, LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### M. C. Clement, J. Zhang, C. A. Lewis, C. Yang, Edward F. Valeev, "Optimized Pair Natural Orbitals for the Coupled Cluster Methods", J. Chem. Theory Comput., August 1, 2018, 14:4581–4589, doi: 10.1021/acs.jctc.8b00294

### Alessio Petrone, David B. Williams-Young, Shichao Sun, Torin F. Stetina, Xiaosong Li, "An Efficient Implementation of Two-Component Relativistic Density Functional Theory with Torque-Free Auxiliary Variables", European Physical Journal B, 2018, 91:169, doi: 10.1140/epjb/e2018-90170-1

### R. Huang, J. Sun, C. Yang, "Recursive integral method with Cayley transformation", Numerical Linear Algebra with Applications, July 10, 2018, 25:1-12, doi: 10.1002/nla.2199

### W. Hu, M. Shao, A. Cepellotti, F. H. Jornada, L. Lin, K. Thicke, C. Yang, S. Louie, "Accelerating Optical Absorption Spectra and Exciton Energy Computation via Interpolative Separable Density Fitting", International Conference on Computational Science (ICCS2018), Lecture Notes in Computer Science, Springer, Cham, June 12, 2018, 10861:604-617, doi: 10.1007/978-3-319-93701-4_48

### Meiyue Shao, Felipe H. da Jornada, Lin Lin, Chao Yang, Jack Deslippe, Steven G. Louie, "A structure preserving Lanczos algorithm for computing the optical absorption spectrum", SIAM Journal on Matrix Analysis and Applications, 2018, 39:683--711, doi: 10.1137/16M1102641

### T. Ke, A. S. Brewster, S. X. Yu, D. Ushizima, C. Yang, N. K. Sauter, "A convolutional neural network-based screening tool for X-ray serial crystallography", Journal of Synchrotron Radiation, April 24, 2018, 25:665-670, doi: 10.1107/S1600577518004873

### A. S. Banerjee, L. Lin, P. Suryanarayana, C. Yang, J. E. Pask, "Two-level Chebyshev filter based complementary subspace method for pushing the envelope of large-scale electronic structure calculations", J. Chem. Theory Comput., April 16, 2018, 14:2930–2946, doi: 10.1021/acs.jctc.7b01243

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian Van Straalen, "UPC++ Specification v1.0, Draft 6", Lawrence Berkeley National Laboratory Tech Report, March 26, 2018, LBNL 2001135, doi: 10.2172/1430689

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, "UPC++ Programmer’s Guide, v1.0-2018.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2018, LBNL 2001136, doi: 10.2172/1430693

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### J. M. Kasper, D. B. Williams-Young, E. Vecharynski, C. Yang, X. Li, "A Well-Tempered Hybrid Method for Solving Challenging Time-Dependent Density Functional Theory (TDDFT) Systems", J. Chem. Theory Comput., March 16, 2018, 14:2034–2041, doi: 10.1021/acs.jctc.8b00141

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes (ECP'18)", Poster at Exascale Computing Project (ECP) Annual Meeting 2018, February 2018,

### M. Papadopoulos, R. Van Beeumen, S. François, G. Degrande, G. Lombaert, "Modal characteristics of structures considering dynamic soil-structure interaction effects", Soil Dynamics and Earthquake Engineering, 2018, 105:114-118, doi: 10.1016/j.soildyn.2017.11.012

### P. Benner, H. Faßbender, C. Yang, "Some remarks on the complex J-symmetric eigenproblem", Linear Algebra and its Applications, January 14, 2018, 544:407-442, doi: 10.1016/j.laa.2018.01.014

### Patrick J. Lestrange, David B. Williams-Young, Alessio Petrone, Carlos A. Jimenez-Hoyos, Xiaosong Li, "Efficient Implementation of Variation after Projection Generalized Hartree-Fock", Journal of Chemical Theory and Computation, 2018, 14:588-596, doi: 10.1021/acs.jctc.7b00832

### Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight, "A 3D Parallel Algorithm for QR Decomposition", SPAA '18, 2018,

### Hongyuan Zhan, Gabriel Gomes, Xiaoye S Li, Kamesh Madduri, Alex Sim, Kesheng Wu, "Consensus ensemble system for traffic flow prediction", IEEE Transactions on Intelligent Transportation Systems, 2018, 19:3903--3914,

### E. Rebrova, G. Chavez, Y. Liu, P. Ghysels, X. S. Li, "A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression", IEEE IPDPSW, 2018,

### Yang Liu, Mathias Jacquelin, Pieter Ghysels, Xiaoye S Li, "Highly scalable distributed-memory sparse triangular solution algorithms", 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, 2018, 87--96,

### Hongyuan Zhan, Gabriel Gomes, Xiaoye S Li, Kamesh Madduri, Kesheng Wu, "Efficient online hyperparameter learning for traffic flow prediction", 2018 21st International Conference on Intelligent Transportation Systems (ITSC), 2018, 164--169,

### Mathias Jacquelin, Lin Lin, Weile Jia, Yonghua Zhao, Chao Yang, "A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems", Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, January 1, 2018, 54--63,

### Meiyue Shao, Hasan Metin Aktulga, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver", Computer Physics Communications, 2018, 222:1--13, doi: 10.1016/j.cpc.2017.09.004

### Mathias Jacquelin, Lin Lin, Chao Yang, "PSelInv--A distributed memory parallel algorithm for selected inversion: The non-symmetric case", Parallel Computing, 2018, 74:84--98,

### Victor Wen-zhe Yu, Fabiano Corsetti, Alberto Garcia, William P Huhn, Mathias Jacquelin, Weile Jia, Bjorn Lange, Lin Lin, Jianfeng Lu, Wenhui Mi, others, "ELSI: A unified software interface for Kohn--Sham electronic structure solvers", Computer Physics Communications, 2018, 222:267--285,

### William Huhn, Alberto Garcia, Luigi Genovese, Ville Havu, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, Yingzhou Li, Lin Lin, others, "Unified Access To Kohn-Sham DFT Solvers for Different Scales and HPC: The ELSI Project", Bulletin of the American Physical Society, American Physical Society, 2018,

### Junmin Gu, Scott Klasky, Norbert Podhorszki, Ji Qiang, Kesheng Wu, "Querying large scientific data sets with adaptable IO system ADIOS", Asian Conference on Supercomputing Frontiers, 2018, 51--69,

### FP Hamon, MS Day, ML Minion, "Concurrent implicit spectral deferred correction scheme for low-Mach number combustion with detailed chemistry", Combustion Theory and Modelling, 2018, doi: 10.1080/13647830.2018.1524156

### M Minion, S Goetschel, "Parallel-in-Time for Parabolic Optimal Control Problems Using PFASST", Domain Decomposition Methods in Science and Engineering XXIV, (Springer: 2018)

### Mathias Jacquelin, Esmond G Ng, Barry W Peyton, "Fast and effective reordering of columns within supernodes using partition refinement", 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, 2018, 76--86,

### 2017

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, "UPC++: a PGAS C++ Library", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'17) Research Poster, November 2017,

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden, "The UPC++ PGAS library for Exascale Computing", Proceedings of the Second Annual PGAS Applications Workshop (PAW17), November 13, 2017, doi: 10.1145/3144779.3169108

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017,

### M. Jacquelin, L. Lin and C. Yang, "PSelInv – A distributed memory parallel algorithm for selected inversion: the symmetric case", Parallel Computing, November 9, 2017, 74:84-98, doi: 10.1016/j.parco.2017.11.009

### E. J. Bylaska, E. Apra, K. Kowalski, M. Jacquelin, W.A. de Jong, A. Vishnu, B. Palmer, J. Daily, T.P. Straatsma, J.R. Hammond, M. Klemm, "Transitioning NWChem to the Next Generation of Many Core Machines", Exascale Scientific Applications Scalability and Performance Portability, edited by Tjerk P. Straatsma, Katerina B. Antypas, Timothy J. Williams, (Taylor & Francis: November 9, 2017)

### E.J. Bylaska, J. Hammond, M. Jacquelin, W.A. de Jong, M. Klemm, "Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi© Processor", High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, Springer, Cham, October 21, 2017, 404-418, doi: 10.1007/978-3-319-67630-2_30

### W. Hu, L. Lin, R. Zhang, C. Yang, J. Yang, "Highly efficient photocatalytic water splitting over edge-modified phosphorene nanoribbons", J. Am. Chem. Soc., October 13, 2017, 139:15429–1543, doi: 10.1021/jacs.7b08474

### Meiyue Shao and Chao Yang, "Properties of Definite Bethe--Salpeter Eigenvalue Problems", Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing. EPASA 2015. Lecture Notes in Computational Science and Engineering, vol 117., 2017, 91--105, doi: 10.1007/978-3-319-62426-6_7

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2017.9", Lawrence Berkeley National Laboratory Tech Report, September 2017, LBNL 2001065, doi: 10.2172/1398522

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### W. Hu, L. Lin, C. Yang, "Interpolative Separable Density Fitting Decomposition for Accelerating Hybrid Density Functional Calculations with Applications to Defects in Silicon", J. Chem. Theory Comput., September 29, 2017, 13:5420–5431, doi: 10.1021/acs.jctc.7b00807

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian Van Straalen, "UPC++ Specification v1.0, Draft 4", Lawrence Berkeley National Laboratory Tech Report, September 27, 2017, LBNL 2001066, doi: 10.2172/1398521

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Matthew S. Barclay, Timothy J. Quincy, David B. Williams-Young, Marco Caricato, Christopher G. Elles, "Accurate Assignments of Excited-State Resonance Raman Spectra: A Benchmark Study Combining Experiment and Theory", Journal of Physical Chemistry A, 2017, 121:7937-7946, doi: 10.1021/acs.jpca.7b09467

### W. Hu, L. Lin, C. Yang, "Projected Commutator DIIS Method for Accelerating Hybrid Functional Electronic Structure Calculations", J. Chem. Theory Comput., September 22, 2017, 13:5458–5467, doi: 10.1021/acs.jctc.7b00892

### V. Yu, F. Corsetti, A. García, W. P. Huhn, M. Jacquelin, W. Jia, B. Lange, L. Lin, J. Lu, W. Mi, A. Seifitokaldan, Á. Vazquez-Mayagoitia, C. Yang, H. Yang, V. Blum, "ELSI: A unified software interface for Kohn–Sham electronic structure solvers", Computer Physics Communications, September 7, 2017, 222:267-285, doi: 10.1016/j.cpc.2017.09.007

### R. Van Beeumen, D.B. Williams-Young, J.M. Kasper, C. Yang, E.G. Ng, X. Li, "Model order reduction algorithm for estimating the absorption spectrum", Journal of Chemical Theory and Computation, 2017, 13:4950-4961, doi: 10.1021/acs.jctc.7b00402

### E. Vecharynski, J. Brabec, M. Shao, N. Govind, C. Yang, "Efficient Block Preconditioned Eigensolvers for Linear Response Time-dependent Density Functional Theory", Computer Physics Communications, 2017, 221:42-52, doi: https://doi.org/10.1016/j.cpc.2017.07.017

We present two efficient iterative algorithms for solving the linear response eigenvalue problem arising from the time dependent density functional theory. Although the matrix to be diagonalized is nonsymmetric, it has a special structure that can be exploited to save both memory and floating point operations. In particular, the nonsymmetric eigenvalue problem can be transformed into a product eigenvalue problem that is self-adjoint with respect to a K-inner product. This product eigenvalue problem can be solved efficiently by a modified Davidson algorithm and a modified locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm that make use of the K-inner product. The solution of the product eigenvalue problem yields one component of the eigenvector associated with the original eigenvalue problem. However, the other component of the eigenvector can be easily recovered in a postprocessing procedure. Therefore, the algorithms we present here are more efficient than existing algorithms that try to approximate both components of the eigenvectors simultaneously. The efficiency of the new algorithms is demonstrated by numerical examples.

### Alessio Petrone, David B. Williams-Young, David B. Lingerfelt, Xiaosong Li, "Ab Initio Transient Raman Analysis", Journal of Physical Chemistry A, 2017, 121:3958-3965, doi: 10.1021/acs.jpca.7b02905

### Franco Egidi, David B. Williams-Young, Alberto Baiardi, Julien Bloino, Giovanni Scalmani, Michael J. Frisch, Xiaosong Li, Vincenzo Barone, "Effective Inclusion of Mechanical and Electrical Anharmonicity in Excited Electronic States: VPT2-TDDFT Route", Journal of Chemical Theory and Computation, 2017, 13:2789-2803, doi: 10.1021/acs.jctc.7b00218

### W. Hu, L. Lin, A. Banerjee, E. Vecharynski, C. Yang, "Adaptively compressed exchange operator for large scale hybrid density functional calculations with applications to the adsorption of water on silicene", J. Chem. Theory Comput., February 8, 2017, 13:1188–1198,

### K. Meerbergen, W. Michiels, R. Van Beeumen, E. Mengi, "Computation of pseudospectral abscissa for large-scale nonlinear eigenvalue problems", IMA Journal of Numerical Analysis, 2017, 37:1831-1863, doi: 10.1093/imanum/drw065

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes (ECP'17)", Poster at Exascale Computing Project (ECP) Annual Meeting 2017, January 2, 2017,

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", Lecture Notes in Computational Science, January 1, 2017,

### Mathias Jacquelin, Lin Lin, Chao Yang, "PSelInv—A distributed memory parallel algorithm for selected inversion: The symmetric case", ACM Transactions on Mathematical Software (TOMS), 2017, 43:21,

### MF Adams, E Hirvijoki, MG Knepley, J Brown, T Isaac, R Mills, "Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures", SIAM J. Sci. Comput., 2017, 39:C452--C465, doi: 10.1137/17M1118828

### R Hager, J Lang, CS Chang, S Ku, Y Chen, SE Parker, MF Adams, "Verification of long wavelength electromagnetic modes with a gyrokinetic-fluid hybrid model in the XGC code", Physics of Plasmas, 2017, 24, doi: 10.1063/1.4983320

### E Hirvijoki, MF Adams, "Conservative discretization of the Landau collision integral", Physics of Plasmas, 2017, 24, doi: 10.1063/1.4979122

### RW Grout, H Kolla, ML Minion, JB Bell, "Achieving algorithmic resilience for temporal integration through spectral deferred corrections", Communications in Applied Mathematics and Computational Science, 2017, 12:25--50, doi: 10.2140/camcos.2017.12.25

### Ariful Azad, Mathias Jacquelin, Aydin Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

- Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

### Mathias Jacquelin, Wibe De Jong, Eric Bylaska, "Towards highly scalable Ab initio molecular dynamics (AIMD) simulations on the Intel Knights Landing manycore processor", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 1, 2017, 234--243,

### 2016

### M. Jacquelin, L. Lin and C. Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion: the non-symmetric case", PMAA, December 30, 2016,

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

### Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

- Download File: SC16-HPGMG-BoF-Intro.pdf (pdf: 1020 KB)

### Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

- Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

### Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

- Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

### Alessio Petrone, David Lingerfelt, David Williams-Young, Xiaosong Li, "Ab Initio Transient Vibrational Spectral Analysis", Journal of Physical Chemistry Letters, 2016, 7:4501-4508, doi: 10.1021/acs.jpclett.6b02292

### David Bruce Williams-Young, Joshua J. Goings, Xiaosong Li, "Accelerating Real-Time Time-Dependent Density Functional Theory with a Non-Recursive Chebyshev Expansion of the Quantum Propagator", Journal of Chemical Theory and Computation, 2016, 12:5333-5338, doi: 10.1021/acs.jctc.6b00693

### Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

### A. S. Banerjee, L. Lin, W. Hu, C. Yang and J. E. Pask, "Chebyshev polynomial filtered subspace iteration in the discontinuous Galerkin method for large-scale electronic structure calculations", Journal of Chemical Physics, October 1, 2016,

### David Bruce Williams-Young, Franco Egidi, Xiaosong Li, "Relativistic Particle-Particle Tamm-Dancoff Approximation", Journal of Chemical Theory and Computation, 2016, 12:5379-5384, doi: 10.1021/acs.jctc.6b00833

### W.A. de Jong, M. Jacquelin, E.J. Bylaska, "Advancing Algorithms to Increase Performance of Correlated and Dynamical Electronic Structure Simulation", CMMSE 2016: Proceedings of the 16th International Conference on Mathematical Methods in Science and Engineering, September 1, 2016, 5:1342-1346,

### Veronika Strnadova-Neeley, Aydin Buluc, John R. Gilbert, Leonid Oliker, Weimin Ouyang, "LiRa: A New Likelihood-Based Similarity Score for Collaborative Filtering", August 30, 2016,

### Mathias Jacquelin, Yili Zheng, Esmond Ng, Katherine Yelick, "An Asynchronous Task-based Fan-Both Sparse Cholesky Solver", August 23, 2016,

Systems of linear equations arise at the heart of many scientific and engineering applications. Many of these linear systems are sparse; i.e., most of the elements in the coefficient matrix are zero. Direct methods based on matrix factorizations are sometimes needed to ensure accurate solutions. For example, accurate solution of sparse linear systems is needed in shift-invert Lanczos to compute interior eigenvalues. The performance and resource usage of sparse matrix factorizations are critical to time-to-solution and maximum problem size solvable on a given platform. In many applications, the coefficient matrices are symmetric, and exploiting symmetry will reduce both the amount of work and storage cost required for factorization. When the factorization is performed on large-scale distributed memory platforms, communication cost is critical to the performance of the algorithm. At the same time, network topologies have become increasingly complex, so that modern platforms exhibit a high level of performance variability. This makes scheduling of computations an intricate and performance-critical task. In this paper, we investigate the use of an asynchronous task paradigm, one-sided communication and dynamic scheduling in implementing sparse Cholesky factorization (symPACK) on large-scale distributed memory platforms. Our solver symPACK relies on efficient and flexible communication primitives provided by the UPC++ library. Performance evaluation shows good scalability and that symPACK outperforms state-of-the-art parallel distributed memory factorization packages, validating our approach on practical cases.

### Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad, "A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems", SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534, 2016, doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.

### Mark Adams, Jed Brown, Matt Knepley, Ravi Samtaney, "Segmental Refinement: A Multigrid Technique for Data Locality", SIAM J. Sci. Comput., 2016, 38:4,

### Osni Marques, Paulo B. Vasconcelos, "Computing the Bidiagonal SVD through an Associated Tridiagonal Eigenproblem", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Naoya Nomura, Akihiro Fujii, Teruo Tanaka, Kengo Nakajima, Osni Marques, "Performance Analysis of SA-AMG Method by Setting Extracted Near-kernel Vectors", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Fabien Bruneval, Tonatiuh Rangel, Samia M. Hamed, Meiyue Shao, Chao Yang, Jeffrey B. Neaton, "MOLGW 1: many-body perturbation theory software for atoms, molecules, and clusters", Computer Physics Communications, 2016, 208:149–161, doi: 10.1016/j.cpc.2016.06.019

### Osni Marques, Alex Druinsky, Xiaoye S. Li, Andrew T. Barker, Panayot Vassilevski, Delyan Kalchev, "Tuning the Coarse Space Construction in a Spectral AMG Solver", ICCS 2016 (The International Conference on Computational Science), San Diego, CA, Elsevier, June 2016,

### Meiyue Shao, Lin Lin, Chao Yang, Fang Liu, Felipe H. da Jornada, Jack Deslippe and Steven G. Louie, "Low rank approximation in G0W0 calculations", Science China Mathematics, June 4, 2016, 59:1593–1612, doi: 10.1007/s11425-016-0296-x

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, The 11th Scheduling for Large Scale Systems Workshop, May 18, 2016,

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, SIAM PP'16, April 15, 2016,

### M. Jacquelin, L. Lin, W. Jia, Y. Zhao and C. Yang, "A Left-looking selected inversion algorithm and task parallelism on shared memory systems", April 9, 2016,

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

### Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

- Download File: CU16SWWilliams.pptx (pptx: 1 MB)

### R. Van Beeumen, E. Jarlebring, W. Michiels, "A rank-exploiting infinite Arnoldi algorithm for nonlinear eigenvalue problems", Numerical Linear Algebra with Applications, 2016, 23:607-628, doi: 10.1002/nla.2043

### Z. Wen, C. Yang, X. Liu and Y. Zhang, "A Penalty-based Trace Minimization Method for Large-scale Eigenspace Computation", J. Sci. Comp., March 1, 2016, 66:1175-1203, doi: 10.1007/s10915-015-0061-0

### E. Vecharynski, C. Yang, and F. Xue, "Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems", SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527, 2016, doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### David Lingerfelt, David Bruce Williams-Young, Alessio Petrone, Xiaosong Li, "Direct Ab Initio (Meta-)Surface-Hopping Dynamics", Journal of Chemical Theory and Computation, 2016, 12:935-945, doi: 10.1021/acs.jctc.5b00697

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", to appear in Proceedings of International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, in Lecture Notes in Computational Science and Engineering, Springer, 2016,

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our method is based on the Lanczos- and Arnoldi-type of iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.

### Wei Hu, Lin Lin, Chao Yang, Jun Dai and Jinlong Yang, "Edge-Modified Phosphorene Nanoflake Heterojunctions as Highly Efficient Solar Cells", Nano Lett, February 5, 2016, 16:1675–1682, doi: 10.1021/acs.nanolett.5b04593

### L. Lin, Y. Saad and C. Yang, "Approximating spectral densities of large matrices", SIAM Review, February 1, 2016, 58:34–65, doi: 10.1137/130934283

### P. Li, X. Liu, M. Chen, P. Lin, X. Ren, L. Lin, C. Yang, L. He, "Large-scale ab initio simulations based on systematically improvable atomic basis", Computational Materials Science, February 1, 2016, 112:503–517, doi: 10.1016/j.commatsci.2015.07.004

### J. Brabec, C. Yang, E. Epifanovsky, A.I. Krylov, and E. Ng, "Reduced-cost sparsity-exploiting algorithm for solving coupled-cluster equations", Journal of Computational Chemistry, January 24, 2016, 37:1059–1067, doi: 10.1002/jcc.24293

### Burlen Loring, Suren Byna, Prabhat, Junmin Gu, Hari Krishnan, Michael Wehner, and Oliver Ruebel, "TECA an Extreme Event Detection and Climate Analysis Package for High Performance Computing", The AMS (American Meteorological Society) 96th Annual Meeting, January 6, 2016,

### Meiyue Shao, Felipe H. da Jornada, Chao Yang, Jack Deslippe, Steven G. Louie, "Structure preserving parallel algorithms for solving the Bethe–Salpeter eigenvalue problem", Linear Algebra and its Applications, 2016, 488:148–167, doi: 10.1016/j.laa.2015.09.036

### David Pugmire, James Kress, Jong Choi, Scott Klasky, Tahsin Kurc, Randy Michael Churchill, Matthew Wolf, Greg Eisenhauer, Hank Childs, Kesheng Wu, others, "Visualization and analysis for near-real-time decision making in distributed workflows", 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016, 1007--1013,

### Utkarsh Ayachit, Andrew Bauer, Earl PN Duque, Greg Eisenhauer, Nicola Ferrier, Junmin Gu, Kenneth E Jansen, Burlen Loring, Zarija Lukic, Suresh Menon, others, "Performance analysis, design considerations, and applications of extreme-scale in situ infrastructures", SC 16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, 921--932, LBNL 1007264,

### D. Pugmire, J. Kress, J. Choi, S. Klasky, T. Kurc, R. M. Churchill, M. Wolf, G. Eisenhauer, H. Childs, K. Wu, A. Sim, J. Gu, J. Low, "Visualization and Analysis for Near-Real-Time Decision Making in Distributed Workflows", 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016, 1007--1013, doi: 10.1109/IPDPSW.2016.175

### WE Pazner, A Nonaka, JB Bell, MS Day, ML Minion, "A high-order spectral deferred correction strategy for low Mach number flow with complex chemistry", Combustion Theory and Modelling, 2016, 20:521--547, doi: 10.1080/13647830.2016.1150519

### R Speck, D Ruprecht, M Minion, M Emmett, R Krause, "Inexact Spectral Deferred Corrections", Domain Decomposition Methods in Science and Engineering XXII, 2016, 389--396, doi: 10.1007/978-3-319-18827-0_39

### Mathias Jacquelin, Lin Lin, Nathan Wichmann, Chao Yang, Enhancing scalability and load balancing of Parallel Selected Inversion via tree-based asynchronous communication, Parallel and Distributed Processing Symposium, 2016 IEEE International, January 1, 2016, 192--201,

### 2015

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

### M. Jacquelin, L. Lin, N. Wichmann and C. Yang, "Enhancing the scalability tree-based asynchronous communication", accepted IPDPS16, November 25, 2015,

### Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light-sources. This is an important tool for characterization of macromolecules and nano-particle systems applicable to applications such as design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.

We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.

We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

### M. van Setten, F. Caruso, S. Sharifzadeh, X. Ren, M. Scheffler, F. Liu, J. Lischner, L. Lin, J. Deslippe, S. Louie, C. Yang, F. Weigend, J. Neaton, F. Evers, P. Rinke, "GW100: Benchmarking G0W0 for molecular systems", Journal of Chemical Theory and Computation, October 22, 2015,

### Jiri Brabec, Lin Lin, Meiyue Shao, Niranjan Govind, Chao Yang, Yousef Saad, Esmond G. Ng, "Fast Algorithms for Estimating the Absorption Spectrum within Linear Response Time-dependent Density Functional Theory", Journal of Chemical Theory and Computation, 2015, 11:5197–5208, doi: 10.1021/acs.jctc.5b00887

### Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

### Abhinav Sarje, Xiaoye Li, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Reconstructing Nanostructures from X-Ray Scattering Data", OLCF User Meeting, June 2015,

### M. Ulbrich, Z. Wen, C. Yang, D. Klockner, Z. Lu, "A proximal gradient method for ensemble density functional theory", SIAM J. Sci. Comp., June 20, 2015, 37:A1975--A20, doi: 10.1137/14098973X

### R. Van Beeumen, K. Meerbergen, W. Michiels, "Compact rational Krylov methods for nonlinear eigenvalue problems", SIAM Journal on Matrix Analysis and Applications, 2015, 36:820-838, doi: 10.1137/140976698

### Mathias Jacquelin, Lin Lin, Chao Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion : the Symmetric Case", To appear in ACM Transactions on Mathematical Software (TOMS), May 28, 2015,

### Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

- Download File: triangles-gabb.pdf (pdf: 384 KB)

### C. Yang, Absorption Spectrum Estimation via Linear Response TDDFT, Applied Math Seminar, Stanford University, May 13, 2015,

### C. Yang, Fast Numerical Algorithms for Large-scale Electronic Structure Calculations, DOE BES Computational and Theoretical Chemistry PI Meeting, April 28, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Math Colloquium, Michigan Tech University, April 24, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Applied math & PDE seminar, UC Davis, April 14, 2015,

### Fang Liu, Lin Lin, Derek Vigil-Fowler, Johannes Lischner, Alexander F. Kemper, Sahar Sharifzadeh, Felipe H. da Jornada, Jack Deslippe, Chao Yang, Jeffrey B. Neaton, Steven G. Louie, "Numerical integration for ab initio many-electron self energy calculations within the GW approximation", Journal of Computational Physics, April 1, 2015,

### R. Van Beeumen, W. Michiels, K. Meerbergen, "Linearization of Lagrange and Hermite interpolating matrix polynomials", IMA Journal of Numerical Analysis, 2015, 35:909-930, doi: 10.1093/imanum/dru019

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.

### C. Yang, Fast Numerical Methods for Computational Materials Science and Chemistry, CRD All-hands meeting, March 4, 2015,

### Marc Baboulin, Xiaoye S. Li, Francois-Henry Rouet, "Using random butterfly transformations to avoid pivoting in sparse direct methods", High Performance Computing for Computational Science - VECPAR 2014, Lecture Notes in Computer Science, Springer. Preprint, 2015,

### E. Vecharynski, C. Yang, J. E. Pask, "A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix", Journal of Computational Physics, Vol. 290, pp. 73–89, 2015,

We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix *A*. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of *A*. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.

### Wei Hu, Lin Lin and Chao Yang, "Edge reconstruction in armchair phosphorene nanoribbons revealed by discontinuous Galerkin density functional theory", Phys. Chem. Chem. Phys., 2015, Advance Article, February 11, 2015, doi: 10.1039/C5CP00333D

With the help of our recently developed massively parallel DGDFT (Discontinuous Galerkin Density Functional Theory) methodology, we perform large-scale Kohn–Sham density functional theory calculations on phosphorene nanoribbons with armchair edges (ACPNRs) containing a few thousand to ten thousand atoms. The use of DGDFT allows us to systematically achieve a conventional plane wave basis set type of accuracy, but with a much smaller number (about 15) of adaptive local basis (ALB) functions per atom for this system. The relatively small number of degrees of freedom required to represent the Kohn–Sham Hamiltonian, together with the use of the pole expansion and selected inversion (PEXSI) technique that circumvents the need to diagonalize the Hamiltonian, results in a highly efficient and scalable computational scheme for analyzing the electronic structures of ACPNRs as well as their dynamics. The total wall clock time for calculating the electronic structures of large-scale ACPNRs containing 1080–10 800 atoms is only 10–25 s per self-consistent field (SCF) iteration, with accuracy fully comparable to that obtained from conventional planewave DFT calculations. For the ACPNR system, we observe that the DGDFT methodology can scale to 5000–50 000 processors. We use DGDFT based ab initio molecular dynamics (AIMD) calculations to study the thermodynamic stability of ACPNRs. Our calculations reveal that a 2 × 1 edge reconstruction appears in ACPNRs at room temperature.

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Workshop on High Performance and Parallel Computing Methods and Algorithms for Materials Defects, Singapore, February 9, 2015,

### M. Adams, P. Colella, D. T. Graves, J.N. Johnson, N.D. Keen, T. J. Ligocki, D. F. Martin, P.W. McCorquodale, D. Modiano, P.O. Schwartz, T.D. Sternberg, B. Van Straalen, "Chombo Software Package for AMR Applications - Design Document", Lawrence Berkeley National Laboratory Technical Report LBNL-6616E, January 9, 2015,

- Download File: chomboDesign.pdf (pdf: 994 KB)

### D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A.I. Krylov, "New algorithms for iterative matrix-free eigensolvers in quantum chemistry", Journal of Computational Chemistry, Vol. 36, Issue 5, pp. 273–284, 2015,

New algorithms for iterative diagonalization procedures that solve for a small set of eigen-states of a large matrix are described. The performance of the algorithms is illustrated by calculations of low and high-lying ionized and electronically excited states using equation-of-motion coupled-cluster methods with single and double substitutions (EOM-IP-CCSD and EOM-EE-CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum-chemical calculations. The second is a recently developed solver, called the “Generalized Preconditioned Locally Harmonic Residual (GPLHR) method.” We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM-CC suite of methods in the Q-Chem electronic structure program.

### ML Minion, R Speck, M Bolten, M Emmett, D Ruprecht, "Interweaving PFASST and parallel multigrid", SIAM Journal on Scientific Computing, 2015, 37:S244--SS26, doi: 10.1137/14097536X

### R Speck, D Ruprecht, M Emmett, M Minion, M Bolten, R Krause, "A multi-level spectral deferred correction method", BIT Numerical Mathematics, 2015, 55:843--867, doi: 10.1007/s10543-014-0517-x

### 2014

### Siegfried Cools, Pieter Ghysels, Wim van Aarle, Wim Vanroose, "A multi-level preconditioned Krylov method for the efficient solution of algebraic tomographic reconstruction problems", To appear in Journal of Computational and Applied Mathematics, December 28, 2014,

### François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, Artem Napov, "A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization", Submitted to ACM Transactions on Mathematical Software, December 2014,

### S. Güttel, R. Van Beeumen, K. Meerbergen, W. Michiels, "NLEIGS: A class of fully rational Krylov methods for nonlinear eigenvalue problems", SIAM Journal on Scientific Computing, 2014, 36:A2842-A286, doi: 10.1137/130935045

### Wei Hu, Lin Lin, Chao Yang and Jinlong Yang, "Electronic structure and aromaticity of large-scale hexagonal graphene nanoflakes", J. Chem. Phys. 141, 214704 (2014), December 2, 2014, 141:214704, doi: 10.1063/1.4902806

- Download File: JCPGNFs.pdf (pdf: 3.7 MB)

With the help of the recently developed SIESTA-PEXSI method [L. Lin, A. García, G. Huhs, and C. Yang, J. Phys.: Condens. Matter 26, 305503 (2014)], we perform Kohn-Sham density functional theory calculations to study the stability and electronic structure of hydrogen passivated hexagonal graphene nanoflakes (GNFs) with up to 11 700 atoms. We find the electronic properties of GNFs, including their cohesive energy, edge formation energy, highest occupied molecular orbital-lowest unoccupied molecular orbital energy gap, edge states, and aromaticity, depend sensitively on the type of edges (armchair graphene nanoflakes (ACGNFs) and zigzag graphene nanoflakes (ZZGNFs)), size and the number of electrons. We observe that, due to the edge-induced strain effect in ACGNFs, large-scale ACGNFs’ edge formation energy decreases as their size increases. This trend does not hold for ZZGNFs due to the presence of many edge states in ZZGNFs. We find that the energy gaps E_g of GNFs all decay with respect to 1/L, where L is the size of the GNF, in a linear fashion. But as their size increases, ZZGNFs exhibit more localized edge states. We believe the presence of these states makes their gap decrease more rapidly. In particular, when L is larger than 6.40 nm, we find that ZZGNFs exhibit metallic characteristics. Furthermore, we find that the aromatic structures of GNFs appear to depend only on whether the system has 4N or 4N + 2 electrons, where N is an integer.

### David Trebotich, Mark F. Adams, Sergi Molins, Carl I. Steefel, Chaopeng Shen, "High-Resolution Simulation of Pore-Scale Reactive Transport Processes Associated with Carbon Sequestration", Computing in Science and Engineering, December 2014, 16:22-31, doi: 10.1109/MCSE.2014.77

- Download File: CISE-16-06-Trebotichappeared.pdf (pdf: 2.7 MB)

### Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

- Download File: SC14HPGMGBoF.pdf (pdf: 1.9 MB)

### Alex Druinsky, Brian Austin, Sherry Li, Osni Marques, Eric Roman, Samuel Williams, "A Roofline Performance Analysis of an Algebraic Multigrid Solver", Supercomputing (SC), November 2014,

### Veronika Strnadova, Aydın Buluç, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, John Gilbert, Daniel Rokhsar, Leonid Oliker, "Efficient and accurate clustering for large-scale genetic mapping", IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14), November 1, 2014,

- Download File: bibm14.pdf (pdf: 764 KB)

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Adaptation and Policy-Based Resource Allocation for Efficient Bulk Data Transfers in High Performance Computing Environments", 4th International Workshop on Network-aware Data Management (NDM'14), 2014,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.
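The general Reverse Monte Carlo idea mentioned in the abstract can be sketched in a few lines. This is a toy, serial illustration and not the authors' C++/CUDA code: the 1D particle model, the pair-distance histogram used as a stand-in for a scattering observable, and all function names (`pair_histogram`, `chi_squared`, `reverse_monte_carlo`) are assumptions made for the example. A random single-particle move is kept if it lowers the misfit, and otherwise kept only with probability exp(-Δχ²/2).

```python
import math
import random


def pair_histogram(positions, n_bins, r_max):
    """Histogram of pairwise distances: a toy stand-in for a measured observable."""
    hist = [0] * n_bins
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = abs(positions[i] - positions[j])
            if r < r_max:
                # Clamp the index to guard against rounding at the upper edge.
                hist[min(n_bins - 1, int(r / r_max * n_bins))] += 1
    return hist


def chi_squared(model, target, sigma=1.0):
    """Sum of squared residuals between model and target, scaled by sigma**2."""
    return sum((m - t) ** 2 for m, t in zip(model, target)) / sigma ** 2


def reverse_monte_carlo(target, n_particles, box, n_bins, steps, max_shift, rng):
    """Fit particle positions to a target histogram with Metropolis-style moves."""
    positions = [rng.uniform(0.0, box) for _ in range(n_particles)]
    chi2 = chi_squared(pair_histogram(positions, n_bins, box), target)
    for _ in range(steps):
        i = rng.randrange(n_particles)
        old = positions[i]
        # Propose a random displacement, clipped to stay inside the box.
        positions[i] = min(box, max(0.0, old + rng.uniform(-max_shift, max_shift)))
        new_chi2 = chi_squared(pair_histogram(positions, n_bins, box), target)
        # Accept improvements always, worse fits with probability exp(-delta/2).
        if new_chi2 <= chi2 or rng.random() < math.exp(-(new_chi2 - chi2) / 2.0):
            chi2 = new_chi2
        else:
            positions[i] = old  # reject: restore the previous configuration
    return positions, chi2


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "experimental" target: evenly spaced particles.
    target = pair_histogram([0.5 * i for i in range(12)], 6, 6.0)
    fitted, misfit = reverse_monte_carlo(target, 12, 6.0, 6, 2000, 0.5, rng)
    print("final chi-squared:", misfit)
```

Each step recomputes the full observable, which is the expensive part; that cost is what makes the fitting problem a natural candidate for the parallel hardware the abstract targets.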

### Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

### James Demmel, Hong-Diep Nguyen, "Parallel Reproducible Summation", IEEE Transactions on Computers, Special Section on Computer Arithmetic 2014, August 11, 2014, doi: 10.1109/TC.2014.2345391

Reproducibility, i.e. getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources, and floating point non-associativity, makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [7], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm reproducibly computes highly accurate results with an absolute error bound of (formula) at a cost of 7n FLOPs and a small constant amount of extra memory usage. Higher accuracies are also possible by increasing the number of error-free transformations. As long as all operations are performed in to-nearest rounding mode, results computed by the proposed algorithms are reproducible for any run on any platform. In particular, our algorithm requires the minimum number of reductions, i.e. one reduction of an array of six double precision floating point numbers per sum, and hence is well suited for massively parallel environments.
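Two ingredients of the abstract, floating-point non-associativity and error-free transformations, can be illustrated with a small sketch. It is not the paper's algorithm: it only shows Knuth's TwoSum, a standard error-free transformation of a single addition, under the assumption of IEEE 754 double precision in round-to-nearest mode with no overflow. The function name `two_sum` is chosen for this example.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) where s is the rounded sum a + b and e is its exact rounding error."""
    s = a + b
    b_virtual = s - a          # the part of b that actually made it into s
    a_virtual = s - b_virtual  # the part of a that actually made it into s
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


# Floating-point addition is not associative, so the summation order matters.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# TwoSum recovers the digit that plain addition drops.
s, e = two_sum(1e16, 1.0)
print(s, e)

# Exact rational arithmetic confirms that s + e equals a + b with no error.
print(Fraction(s) + Fraction(e) == Fraction(1e16) + Fraction(1.0))  # True
```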

### W.G. Vandenberghe, M.V. Fischetti, R. Van Beeumen, K. Meerbergen, W. Michiels, C. Effenberger, "Determining bound states in a semiconductor device with contacts using a nonlinear eigenvalue solver", Journal of Computational Electronics, 2014, 13:753-762, doi: 10.1007/s10825-014-0597-5

### W.A. de Jong, L. Lin, H. Shan, C. Yang and L. Oliker, "Towards modelling complex mesoscale molecular environments", International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), 2014,

### Pieter Ghysels, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Jianlin Xia, Hierarchically Low-Rank Structured Sparse Factorization with Reduced Communication and Synchronization, Householder Symposium XIX, June 2014,

### Pieter Ghysels, Wim Vanroose, Karl Meerbergen, High Performance Implementation of Deflated Preconditioned Conjugate Gradients with Approximate Eigenvectors, Householder Symposium XIX, June 8-13, Spa, Belgium, June 2014, Pages: 84,

### Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

- Download File: hpgmg.pdf (pdf: 183 KB)

### Abhinav Sarje, Xiaoye Li, Slim Chourou, Alexander Hexemer, "Petascale X-Ray Scattering Simulations With GPUs", GPU Technology Conference, March 2014,

### Abhinav Sarje, Xiaoye Li, Alexander Hexemer, "Inverse Modeling of X-Ray Scattering Data With Reverse Monte Carlo Simulations", GPU Technology Conference, March 2014,

### Xiaoye S. Li, Artem Napov, François-Henry Rouet, Designing multifrontal solvers using hierarchically semiseparable structures, SIAM Conference on Parallel Processing for Scientific Computing (PP14), Portland, OR, USA, February 2014,

### D. Verhees, R. Van Beeumen, K. Meerbergen, N. Guglielmi, W. Michiels, "Fast algorithms for computing the distance to instability of nonlinear eigenvalue problems, with application to time-delay systems", International Journal of Dynamics and Control, 2014, 2:133-142, doi: 10.1007/s40435-014-0059-8

### E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, U. V. Catalyurek, "An Out-of-core Task-based Middleware for Data Intensive Scientific Computing", Handbook on Data Centers, in press, (Springer: February 1, 2014)

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Efficient Data Staging Using Performance-Based Adaptation and Policy-Based Resource Allocation", 22nd Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2014,

### J. Kaye, L. Lin and C. Yang, "A posteriori error estimator for adaptive local basis functions to solve Kohn-Sham density functional theory", Comm. Math. Sci., January 5, 2014, 13:1741--1740, doi: http://dx.doi.org/10.4310/CMS.2015.v13.n7.a5

### Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

- Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
- Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

### G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, Hong Diep Nguyen, E. Solomonik, "Reconstructing Householder Vectors from Tall-Skinny QR", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014, 1159-1170, doi: 10.1109/IPDPS.2014.120

### J. González-Domínguez, O. Marques, M. J. Martín and J. Touriño, "A 2D Algorithm with Asymmetric Workload for the UPC Conjugate Gradient Method", The Journal of Supercomputing, 2014, 70:816-829,

### A. Fujii, O. Marques, "Axis Communication Method for Algebraic Multigrid Solver", IEICE Transactions on Information and Systems, 2014, E97-D:2955-2958,

### M Emmett, ML Minion, "Efficient Implementation of a Multi-Level Parallel in Time Algorithm", DOMAIN DECOMPOSITION METHODS IN SCIENCE AND ENGINEERING XXI, 2014, 98:359--366, doi: 10.1007/978-3-319-05789-7_33

### ML Minion, R Speck, M Bolten, M Emmett, D Ruprecht, "Interweaving PFASST and Parallel Multigrid", SIAM Journal on Scientific Computing, 2014, abs/1407,

### Laura Grigori, Mathias Jacquelin, Amal Khabou, "Performance predictions of multilevel communication optimal LU and QR factorizations on hierarchical platforms", International Supercomputing Conference, 2014, 76--92,

### Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Hong Diep Nguyen, Edgar Solomonik, "Reconstructing Householder vectors from tall-skinny QR", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 1, 2014, 1159--1170,

### Borbala Hunyadi, Daan Camps, Laurent Sorber, Wim Van Paesschen, Maarten De Vos, Sabine Van Huffel, Lieven De Lathauwer, "Block term decomposition for modelling epileptic seizures", EURASIP Journal on Advances in Signal Processing, 2014, 2014, doi: 10.1186/1687-6180-2014-139

### 2013

### H. M. Aktulga, L. Lin, C. Haine, E. G. Ng, C. Yang, "Parallel Eigenvalue Calculation based on Multiple Shift-invert Lanczos and Contour Integral based Spectral Projection Method", Parallel Computing, December 6, 2013, in press,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

- Download File: sarje-thmmcs-pmbs.pdf (pdf: 2 MB)

### M. Jung, E. H. Wilson III, W. Choi, J. Shalf, H. M. Aktulga, C. Yang, E. Saule, U. V. Catalyurek, M. Kandemir, "Exploring the Future of Out-of-core Computing with Compute-Local Non-Volatile Memory", International Conference for High Performance Computing, Networking, Storage and Analysis 2013 (SC13), NY, USA, ACM New York, November 17, 2013, doi: 10.1145/2503210.2503261

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

- Download File: sbac2013personal.pdf (pdf: 195 KB)

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.
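The core idea of an exact, order-independent sum via integer expansion can be sketched in software. This is only an illustration of the arithmetic, not the paper's BigInt format or its NIC hardware design: it assumes IEEE 754 doubles, relies on Python's arbitrary-precision integers, and the names `SCALE_BITS`, `to_fixed`, and `exact_sum` are invented for the example.

```python
import math
import random

# Every finite double is an integer multiple of 2**-1074 (the smallest subnormal).
SCALE_BITS = 1074


def to_fixed(x):
    """Convert a finite double to an exact integer count of 2**-1074 units."""
    num, den = x.as_integer_ratio()  # exact; den is a power of two
    return num * ((1 << SCALE_BITS) // den)


def exact_sum(values):
    """Sum doubles exactly in integer arithmetic, then round once at the end."""
    total = 0
    for x in values:
        total += to_fixed(x)  # integer addition is exact and associative
    # Python's int / int true division is correctly rounded.
    return total / (1 << SCALE_BITS)


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16]
    print(sum(data))        # naive left-to-right double summation loses the 1.0
    print(exact_sum(data))  # the integer expansion keeps it

    rng = random.Random(1)
    values = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
    shuffled = values[:]
    rng.shuffle(shuffled)
    print(exact_sum(values) == exact_sum(shuffled))
```

Because integer addition is associative, any reduction order gives the same accumulator, so the single final rounding is the only place where precision is lost.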

### H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, "Improving the Scalability of a Symmetric Iterative Eigensolver for Multi-core Platforms", Concurrency and Computation: Practice & Experience, September 12, 2013, online, doi: 10.1002/cpe.3129

### Alfredo Buttari, Serge Gratton, Xiaoye S. Li, Marième Ngom, François-Henry Rouet, David Titley-Peloquin, Clément Weisbecker, "Error Analysis of the Block Low-Rank LU factorization of dense matrices", IRIT-CERFACS, RT-APO-13-7, August 2013,

### Emmanuel Agullo, Patrick R. Amestoy, Alfredo Buttari, Abdou Guermouche, Guillaume Joslin, Jean-Yves L'Excellent, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Mohamed Sid-Lakhdar, Shen Wang, Clément Weisbecker, Ichitaro Yamazaki., "Recent Advances in Sparse Direct Solvers", 22nd Conference on Structural Mechanics in Reactor Technology, August 18, 2013,

- Download File: paper3.pdf (pdf: 243 KB)

### Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, Maarten V. de Hoop, "A parallel geometric multifrontal solver using hierarchically semiseparable structure", Submitted to ACM Transaction on Mathematical Software, 2013,

### James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

### P. Maris, H. M. Aktulga, S. Binder, A. Calci, U. V. Catalyurek, J. Langhammer, E. G. Ng, E. Saule, R. Roth, J. P. Vary, C. Yang, "No Core CI calculations for light nuclei with chiral 2- and 3-body forces", J. Phys. Conf. Ser., IOP Publishing, August 1, 2013, 454:012063, doi: 10.1088/1742-6596/454/1/012063

### P. Ghysels, W. Vanroose, "Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm", Parallel Computing, June 24, 2013, doi: 10.1016/j.parco.2013.06.001

### Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, Sivan Toledo, "Communication optimal parallel multiplication of sparse random matrices", SPAA 2013: The 25th ACM Symposium on Parallelism in Algorithms and Architectures, Montreal, Canada, 2013, 222-231, doi: 10.1145/2486159.2486196

- Download File: spaa134-ballard.pdf (pdf: 301 KB)

### E. Solomonik, A. Buluç, J. Demmel, "Minimizing communication in all-pairs shortest paths", International Parallel and Distributed Processing Symposium (IPDPS), 2013,

- Download File: 25dapspipdps13.pdf (pdf: 256 KB)

### Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

- Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

### James Demmel, Hong-Diep Nguyen, "Fast Reproducible Floating-Point Summation", Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013, doi: 10.1109/ARITH.2013.9

Reproducibility, i.e. getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [1]. However, the combination of dynamic scheduling of parallel computing resources, and floating point nonassociativity, makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [2], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm trades off efficiency and accuracy: we reproducibly attain reasonably accurate results (with an absolute error bound $c \cdot n^{2} \cdot \mathrm{macheps} \cdot \max |v_i|$ for a small constant $c$) with just $2n + O(1)$ floating-point operations, and quite accurate results (with an absolute error bound $c \cdot n^{3} \cdot \mathrm{macheps}^{2} \cdot \max |v_i|$) with $5n + O(1)$ floating-point operations, both with just two reduction operations. Higher accuracies are also possible by increasing the number of error-free transformations. As long as the same rounding mode is used, results computed by the proposed algorithms are reproducible for any run on any platform.
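The pre-rounding idea behind the cheaper variant can be sketched as follows. This is a simplified single-level illustration, not the authors' implementation: it assumes IEEE 754 doubles in round-to-nearest mode, fewer than 2^51 summands, and no overflow or underflow, and the name `reproducible_sum` is invented here. Every summand is first rounded onto a common grid determined by the largest magnitude and the vector length; the rounded values then add without any rounding error, so the result cannot depend on the order. The two passes correspond to the two reductions in the abstract (a maximum, then a sum).

```python
import math
import random


def reproducible_sum(values):
    """Order-independent sum: pre-round every value to a shared grid, then add."""
    n = len(values)
    if n == 0:
        return 0.0
    vmax = max(abs(v) for v in values)  # first reduction: order-independent max
    if vmax == 0.0:
        return 0.0
    m = math.frexp(vmax)[1]             # vmax < 2**m
    k = m + (n - 1).bit_length() + 1    # (n - 1).bit_length() == ceil(log2(n))
    big = 1.5 * math.ldexp(1.0, k)      # big + v stays inside [2**k, 2**(k + 1))
    total = 0.0
    for v in values:                    # second reduction: the sum itself
        q = (big + v) - big             # v rounded to a multiple of 2**(k - 52)
        total += q                      # exact: partial sums stay on the grid
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-5, 5) for _ in range(1000)]
    results = set()
    for _ in range(20):
        rng.shuffle(data)
        results.add(reproducible_sum(data))
    print(len(results))  # number of distinct results across the 20 orderings
    print(abs(reproducible_sum(data) - math.fsum(data)))
```

The price of reproducibility here is accuracy: each summand loses the bits below the shared grid, which is where the n²-type error bound in the abstract comes from.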

### James Demmel, Hong-Diep Nguyen, Numerical Accuracy and Reproducibility at Exascale, Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH'13), April 10, 2013,

- Download File: pres_33.pdf (pdf: 300 KB)

### R. Van Beeumen, K. Meerbergen, W. Michiels, "A rational Krylov method based on Hermite interpolation for nonlinear eigenvalue problems", SIAM Journal on Scientific Computing, 2013, 35:A327-A350, doi: 10.1137/120877556

### P. Ghysels, T. J. Ashby, K. Meerbergen, W. Vanroose, "Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines", SIAM Journal on Scientific Computing, January 8, 2013, 35:1, doi: 10.1137/12086563X

### L. Lin, M. Chen, C. Yang, L. He, "Accelerating Atomic Orbital-based Electronic Structure Calculation via Pole Expansion and Selected Inversion", J. Phys.: Condens. Matter, 2013,

### L. Lin, C. Yang, "Elliptic preconditioner for accelerating the self-consistent field iteration in Kohn-Sham Density Functional Theory", SIAM J. Sci. Comp., 2013,

### AS Almgren, AJ Aspden, JB Bell, ML Minion, "On the use of higher-order projection methods for incompressible turbulent flow", SIAM Journal on Scientific Computing, 2013, 35, doi: 10.1137/110829386

### Jack Dongarra, Mathieu Faverge, Thomas Herault, Mathias Jacquelin, Julien Langou, Yves Robert, "Hierarchical QR factorization algorithms for multi-core clusters", Parallel Computing, 2013, 39:212--232,

### 2012

### P. Maris, H. M. Aktulga, M. A. Caprio, U. V. Catalyurek, E. G. Ng, D. Oryspayev, H. Potter, E. Saule, M. Sosonkina, J. P. Vary, C. Yang, Z. Zhou, "Large-scale Ab-initio Configuration Interaction Calculations for Light Nuclei", J. Phys. Conf. Ser., IOP Publishing, December 18, 2012, 403:012019, doi: 10.1088/1742-6596/403/1/012019

### H. Hu, C. Yang, K. Zhao, "Absorption correction A* for cylindrical and spherical crystals with extended range and high accuracy calculated by Thorkildsen & Larsen analytical method", in press Acta Crystallographica, A, 2012,

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

### Junmin Gu, David Smith, Ann L. Chervenak, Alex Sim, "Adaptive Data Transfers that Utilize Policies for Resource Sharing", The 2nd International Workshop on Network-Aware Data Management Workshop (NDM2012), 2012,

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

### A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-of-core Eigensolver on SSD-equipped Clusters", 2012 IEEE International Conference on Cluster Computing (CLUSTER), Beijing, China, September 26, 2012, 248 - 256, doi: 10.1109/CLUSTER.2012.76

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-Of-Core Dataflow Middleware to Reduce the Cost of Large Scale Iterative Solvers", 2012 41st International Conference on Parallel Processing Workshops (ICPPW), Pittsburgh, PA, September 10, 2012, 71 - 80, doi: 10.1109/ICPPW.2012.13

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Topology-Aware Mappings for Large-Scale Eigenvalue Problems", Euro-Par 2012 Parallel Processing Conference, Rhodes Island, Greece, August 31, 2012, LNCS 748:830-842, doi: 10.1007/978-3-642-32820-6_82

### R. Van Beeumen, K. Van Nimmen, G. Lombaert, K. Meerbergen, "Model reduction for dynamical systems with quadratic output", International Journal for Numerical Methods in Engineering, 2012, 91:229-248, doi: 10.1002/nme.4255

### A. Buluç, J. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments", SIAM Journal on Scientific Computing (SISC), 2012,

- Download File: spgemmsisc12.pdf (pdf: 1.2 MB)

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use the GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### A. Lugowski, D. Alber, A. Buluç, J. Gilbert, S. Reinhardt, Y. Teng, A. Waranis, "A flexible open-source toolbox for scalable complex graph analysis", SIAM Conference on Data Mining (SDM), 2012,

- Download File: kdt-final.pdf (pdf: 753 KB)

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)

### A. Lugowski, A. Buluç, J. Gilbert, S. Reinhardt, "Scalable complex graph analysis with the knowledge discovery toolbox", International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012,

### Eliot Gann, Slim Chourou, Abhinav Sarje, Harald Ade, Cheng Wang, Elaine Chan, Xiaodong Ding, Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano structured thin films systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters, American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility makes it easy to tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

### D. Y. Parkinson, C. Yang, C. Knoechel, C. A. Larabell, M. Le Gros, "Automatic alignment and reconstruction of images for soft X-ray tomography", J Struct Biol, February 2012, 177:259--266, doi: 10.1016/j.jsb.2011.11.027

### D. Yu, D. Katramatos, A. Shoshani, A. Sim, J. Gu, V. Natarajan, "StorNet: Integrating Storage Resource Management with Dynamic Network Provisioning for Automated Data Transfer", International Committee for Future Accelerators (ICFA) Standing Committee on Inter-Regional Connectivity (SCIC) 2012 Report: Networking for High Energy Physics, 2012,

### P. Ghysels, P. Kłosiewicz, W. Vanroose, "Improving the arithmetic intensity of multigrid with the help of polynomial smoothers", Numerical Linear Algebra with Applications, February 1, 2012, 19:2, doi: 10.1002/nla.1808

### J. González-Domínguez, O. Marques, M. Martín, G. Taboada, J. Touriño, "Design and Performance Issues of Cholesky and LU Solvers using UPCBLAS", 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, Madrid, 2012,

### M. Kawai, T. Iwashita, H. Nakashima and O. Marques, "Parallel Smoother Based on Block Red-Black Ordering for Multigrid Poisson Solver", LNCS, Proc. VECPAR 2012, Kobe, Japan, Springer, 2012, 7851:292-299,

### Zaiwen Wen, Chao Yang, Xin Liu, Stefano Marchesini, "Alternating direction methods for classical and ptychographic phase retrieval", Inverse Problems, January 2012, 28:115010,

### A Nonaka, JB Bell, MS Day, C Gilet, AS Almgren, ML Minion, "A deferred correction coupling strategy for low Mach number flow with complex chemistry", Combustion Theory and Modelling, 2012, 16:1053--1088, doi: 10.1080/13647830.2012.701019

- Download File: LMCSDC.pdf (pdf: 601 KB)

### R. Speck, D. Ruprecht, R. Krause, M. Emmett, M. Minion, M. Winkel, and P. Gibbon, "Integrating an N-body problem with SDC and PFASST", Proceedings of the 21st International Conference on Domain Decomposition Methods, DD21, 2012,

### R Speck, D Ruprecht, R Krause, M Emmett, M Minion, M Winkel, P Gibbon, "A massively space-time parallel N-body solver", 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012, doi: 10.1109/sc.2012.6

### M Emmett, ML Minion, "Toward an efficient parallel in time method for partial differential equations", Communications in Applied Mathematics and Computational Science, 2012, 7:105--132, doi: 10.2140/camcos.2012.7.105

### Tudor David, Mathias Jacquelin, Loris Marchal, "Scheduling streaming applications on a complex multicore platform", Concurrency and Computation: Practice and Experience, 2012, 24:1726--1750,

### 2011

### Ichitaro Yamazaki, Xiaoye Sherry Li, François-Henry Rouet, Bora Uçar, "Partitioning, Ordering and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver", Institut National Polytechnique de Toulouse, RT-APO-12-2, November 2011,

- Download File: reportPDSLin.pdf (pdf: 634 KB)

### L. Lin, C. Yang, J. Lu, L. Ying, W. E, "A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2D electronic structure calculations", SIAM J. Sci. Comput., 2011, 33:1329,

### R. Ryne, B. Austin, J. Byrd, J. Corlett, E. Esarey, C. G. R. Geddes, W. Leemans, X. Li, Prabhat, J. Qiang, O. Rübel, J.-L. Vay, M. Venturini, K. Wu, B. Carlsten, D. Higdon and N. Yampolsky, "High Performance Computing in Accelerator Science: Past Successes, Future Challenges", Workshop on Data and Communications in Basic Energy Sciences: Creating a Pathway for Scientific Discovery, October 2011,

### H. M. Aktulga, C. Yang, U. V. Catalyurek, P. Maris, J. P. Vary, E. G. Ng, "On Reducing I/O Overheads in Large-Scale Invariant Subspace Projections", Euro-Par 2011: Parallel Processing Workshops, Bordeaux, France, August 29, 2011, LNCS 715:305-314, doi: 10.1007/978-3-642-29737-3_35

### L. Lin, C. Yang, J. Meza, J. Lu, L. Ying, W. E, "SelInv -- An algorithm for selected inversion of a sparse symmetric matrix", ACM Trans. Math. Software, 2011, 37:40,

### J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, "Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms", Hot Chips 23, 2011,

### E. G. Ng, J. Sarich, S. M. Wild, T. Munson, H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, N. Schunck, M. G. Bertolli, M. Kortelainen, W. Nazarewicz, T. Papenbrock, M. V. Stoitsov, "Advancing Nuclear Physics Through TOPS Solvers and Tools", SciDAC 2011 Conference, Denver, CO, July 10, 2011, arXiv:1110.1708,

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Large-scale Parallel Null Space Calculation for Nuclear Configuration Interaction", 2011 International Conference on High Performance Computing and Simulation (HPCS), Istanbul, Turkey, July 8, 2011, 176 - 185, doi: 10.1109/HPCSim.2011.5999822

### A. Buluç, J. R. Gilbert, V. B. Shah, "Implementing Sparse Matrices for Graph Algorithms", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### A. Buluç, J. R. Gilbert, "New Ideas in Sparse Matrix-Matrix Multiplication", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### J. Gu, D. Katramatos, X. Liu, V. Natarajan, A. Shoshani, A. Sim, D. Yu, S. Bradley, S. McKee, "StorNet: Integrated Dynamic Storage and Network Resource Provisioning and Management for Automated Data Transfers", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/1/012002

### G. Garzoglio, J. Bester, K. Chadwick, D. Dykstra, D. Groep, J. Gu, T. Hesselroth, O. Koeroo, T. Levshina, S. Martin, M. Salle, N. Sharma, A. Sim, S. Timm, A. Verstegen, "Adoption of a SAML-XACML Profile for Authorization Interoperability across Grid Middleware in OSG and EGEE", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/6/062011

### A. Buluç, J. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications", International Journal of High-Performance Computing Applications (IJHPCA), 2011,

- Download File: combblas-r2.pdf (pdf: 288 KB)

### Aydın Buluç, Samuel Williams, Leonid Oliker, James Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", IPDPS, IEEE, 2011, doi: https://doi.org/10.1109/IPDPS.2011.73

- Download File: ipdps2011.pdf (pdf: 770 KB)

### Junmin Gu, Dimitrios Katramatos, Xin Liu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Dantong Yu, Scott Bradley, Shawn McKee, "StorNet: Co-Scheduling of End-to-End Bandwidth Reservation on Storage and Network Systems for High Performance Data Transfers", IEEE INFOCOM HSN 2011, 2011,

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

### Xiaoye S. Li, Meiyue Shao, "A supernodal approach to incomplete LU factorization with partial pivoting", ACM Transactions on Mathematical Software, 2011, 37:43:1--43:2, doi: 10.1145/1916461.1916467

### Filipe RNC Maia, Chao Yang, Stefano Marchesini, "Compressive auto-indexing in femtosecond nanocrystallography", Ultramicroscopy, 2011, 111:807--811, LBNL 4598E,

### Dean N. Williams, Ian T. Foster, Don E. Middleton, Rachana Ananthakrishnan, Neill Miller, Mehmet Balman, Junmin Gu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Gavin Bell, Robert Drach, Michael Ganzberger, Jim Ahrens, Phil Jones, Daniel Crichton, Luca Cinquini, David Brown, Danielle Harper, Nathan Hook, Eric Nienhouse, Gary Strand, Hannah Wilcox, Nathan Wilhelmi, Stephan Zednik, Steve Hankin, Roland Schweitzer, John Harney, Ross Miller, Galen Shipman, Feiyi Wang, Peter Fox, Patrick West, Stephan Zednik, Ann Chervenak, Craig Ward, "Earth System Grid Center for Enabling Technologies (ESG-CET): A Data Infrastructure for Data-Intensive Climate Research", SciDAC Conference, 2011,

### Mathias Jacquelin, "Memory-aware algorithms and scheduling techniques: from multicore processors to petascale supercomputers", Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, Pages: 2038--2041 2011,

### Mathias Jacquelin, Loris Marchal, Yves Robert, Bora Uçar, "On optimal tree traversals for sparse matrix factorization", Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, 556--567,

### Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, Yves Robert, "Tiled QR factorization algorithms", Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, 7,

### Franck Cappello, Mathias Jacquelin, Loris Marchal, Yves Robert, Marc Snir, "Comparing archival policies for Blue Waters", High Performance Computing (HiPC), 2011 18th International Conference on, 2011, 1--10,

### 2010

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Multiscale Modeling of Viscoelastic Plant Tissue", International Journal for Multiscale Computational Engineering, 2010, 8:4, doi: 10.1615/IntJMultCompEng.v8.i4.30

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Coarse Implicit Time Integration of a Cellular Scale Particle Model for Plant Tissue Deformation", International Journal for Multiscale Computational Engineering, 2010, 8, doi: 10.1615/IntJMultCompEng.v8.i4.50

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

### P. Van Liedekerke, E. Tijskens, H. Ramon, P. Ghysels, G. Samaey, D. Roose, "Particle-based model to simulate the micromechanics of biological cells", Physical Review E, June 3, 2010, 81:6, doi: 10.1103/PhysRevE.81.061906

### E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

### A. Buluç, J. R. Gilbert, C. Budak, "Solving path problems on the GPU", Parallel Computing, 36(5-6):241-253, 2010, doi: http://dx.doi.org/10.1016/j.parco.2009.12.002

- Download File: parcoapsp.pdf (pdf: 160 KB)

### P. Van Liedekerke, P. Ghysels, E. Tijskens, G. Samaey, B. Smeedts, D. Roose, H. Ramon, "A particle-based model to simulate the micromechanics of single-plant parenchyma cells and aggregates", Physical Biology, May 26, 2010, 7:2, doi: 10.1088/1478-3975/7/2/026006

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

### Ichitaro Yamazaki, Zhaojun Bai, Horst D. Simon, Lin-Wang Wang, Kesheng Wu, "Adaptive Projection Subspace Dimension for the Lanczos Method", ACM Transactions on Mathematical Software, 2010, 37, doi: 10.1145/1824801.1824805

### Matthieu Gallet, Mathias Jacquelin, Loris Marchal, "Scheduling complex streaming applications on the Cell processor", Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, 1--8,

### 2009

### J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, "Accelerating Time-to-Solution for Computational Science and Engineering", SciDAC Review, Number 15, December 2009,

### A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks", SPAA '09 Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, doi: http://dx.doi.org/10.1145/1583991.1584053

- Download File: csb2009.pdf (pdf: 347 KB)

### M. Riedel, E. Laure, Th. Soddemann, L. Field, J. P. Navarro, J. Casey, M. Litmaath, J. Ph. Baud, B. Koblitz, C. Catlett, D. Skow, C. Zheng, P. M. Papadopoulos, M. Katz, N. Sharma, O. Smirnova, B. Kónya, P. Arzberger, F. Würthwein, A. S. Rana, T. Martin, M. Wan, V. Welch, T. Rimovsky, S. Newhouse, A. Vanni, Y. Tanaka, Y. Tanimura, T. Ikegami, D. Abramson, C. Enticott, G. Jenkins, R. Pordes, N. Sharma, S. Timm, N. Sharma, G. Moont, M. Aggarwal, D. Colling, O. van der Aa, A. Sim, V. Natarajan, A. Shoshani, J. Gu, S. Chen, G. Galang, R. Zappi, L. Magnoni, V. Ciaschini, M. Pace, V. Venturi, M. Marzolla, P. Andreetto, B. Cowles, S. Wang, Y. Saeki, H. Sato, S. Matsuoka, P. Uthayopas, S. Sriprayoonsakul, O. Koeroo, M. Viljoen, L. Pearlman, S. Pickles, David Wallom, G. Moloney, J. Lauret, J. Marsteller, P. Sheldon, S. Pathak, S. De Witt, J. Mencák, J. Jensen, M. Hodges, D. Ross, S. Phatanapherom, G. Netzer, A. R. Gregersen, M. Jones, S. Chen, P. Kacsuk, A. Streit, D. Mallmann, F. Wolf, T. Lippert, Th. Delaitre, E. Huedo, N. Geddes, "Interoperation of world-wide production e-Science infrastructures", Concurrency and Computation: Practice and Experience, 2009, 21(8):961-990,

### Arie Shoshani, Flavia Donno, Junmin Gu, Jason Hick, Maarten Litmaath, Alex Sim, "Dynamic Storage Management", Scientific Data Management: Challenges, Technology, and Deployment, edited by Arie Shoshani, Doron Rotem, (Chapman & Hall/CRC Computational Science: 2009)

### P. Ghysels, G. Samaey, B. Tijskens, P Van Liedekerke, H Ramon, D Roose, "Multi-scale simulation of plant tissue deformation using a model for individual cell mechanics", Physical Biology, March 25, 2009, 6:1, doi: 10.1088/1478-3975/6/1/016009

### Xiaoye S. Li, Meiyue Shao, Ichitaro Yamazaki, Esmond G. Ng, "Factorization-based sparse solvers and preconditioners", (SciDAC 2009) Journal of Physics: Conference Series 180(2009) 012015, 2009, doi: 10.1088/1742-6596/180/1/012015

### K Wu, S Ahern, EW Bethel, J Chen, H Childs, C Geddes, J Gu, H Hagen, B Hamann, J Lauret, others, "FastBit: Interactively Searching Massive Data", Proc. of SciDAC 2009, 2009, LBNL 2164E,

- Download File: LBNL-2164E.pdf (pdf: 3.2 MB)

### Mathias Jacquelin, Loris Marchal, Yves Robert, "Complexity analysis and performance evaluation of matrix product on multicore architectures", Parallel Processing, 2009. ICPP 09. International Conference on, 2009, 196--203,

### 2008

### A. Buluç, J. Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", Proceedings of the 37th International Conference on Parallel Processing (ICPP), 2008, doi: 10.1109/ICPP.2008.45

- Download File: spgemmicpp08.pdf (pdf: 206 KB)

### S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

- Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

### O. Marques, J. Demmel, C. Voemel, B. Parlett, "A Testing Infrastructure for Symmetric Tridiagonal Eigensolvers", ACM TOMS, 2008, 35,

### P. Jakl, J. Lauret, A. Hanushevsky, A. Shoshani, A. Sim, J. Gu, "Grid data access on widely distributed worker nodes using scalla and SRM", Journal of Physics: Conf. Ser., 2008, 119, doi: 10.1088/1742-6596/119/7/072019

### Alex Sim, Arie Shoshani (Editors), Paolo Badino, Olof Barring, Jean‐Philippe Baud, Ezio Corso, Shaun De Witt, Flavia Donno, Junmin Gu, Michael Haddox‐Schatz, Bryan Hess, Jens Jensen, Andy Kowalski, Maarten Litmaath, Luca Magnoni, Timur Perelmutov, Don Petravick, Chip Watson, "The Storage Resource Manager Interface Specification Version 2.2", Open Grid Forum, Document in Full Recommendation, GFD.129, 2008,

### A. Buluç, J.R. Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008, doi: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2008.4536313

- Download File: hypersparse-ipdps08.pdf (pdf: 194 KB)

### J. Demmel, O. Marques, C. Voemel, B. Parlett, "Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers", SIAM Journal on Scientific Computing, 2008, 30:1508–1526,

### C. Vömel, S.Z. Tomov, O.A. Marques, A. Canning, L.-W. Wang, J.J. Dongarra, "State-of-the-art eigensolvers for electronic structure calculations of large scale nano-systems", Journal of Computational Physics, 2008, 227:7113-7124, doi: 10.1016/j.jcp.2008.01.018

### A. Canning, O. Marques, C. Voemel, L.-W. Wang, J. Dongarra, J. Langou, S. Tomov, "New eigensolvers for large-scale nanoscience simulations", Journal of Physics: Conference Series, 2008, 125, doi: 10.1088/1742-6596/125/1/012074

### 2007

### Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

- Download File: sc07-spmv.pdf (pdf: 438 KB)

### L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, F. Donno, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Managers: Recent International Experience on Requirements and Multiple Co-Operating Implementations", the 24th IEEE Conference on Mass Storage Systems and Technologies, 2007,

### F. Donno, L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Manager version 2.2: design, implementation, and testing experience", Journal of Physics: Conf. Ser., 2007, 119, doi: 10.1088/1742-6596/119/6/062028

### S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

- Download File: parco08-spmv.pdf (pdf: 1.5 MB)

### 2006

### W. Kramer, J. Carter, D. Skinner, L. Oliker, P. Husbands, P. Hargrove, J. Shalf, O. Marques, E. Ng, A. Drummond, K. Yelick, "Software Roadmap to Plug and Play Petaflop/s", 2006,

### O. Marques, C. Voemel, J. Riedy, "Benefits of IEEE-754 Features in Modern Symmetric Tridiagonal Eigensolvers", SIAM J. Sci. Comput., 2006, 28:1613-1633,

### J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, J. Riedy, C. Vömel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, S. Tomov, "Prospectus for the next LAPACK and ScaLAPACK Libraries", PARA 2006, Umeå, Sweden, 2006,

### O. Marques, B. Parlett, C. Voemel, "Computations of Eigenpair Subsets with the MRRR Algorithm", Numer. Linear Algebra Appl., 2006, 13:643–653,

### 2005

### Kesheng Wu, Junmin Gu, Jerome Lauret, Arthur Poskanzer, Arie Shoshani, Alexander Sim, Wei-Ming Zhang, "Grid Collector: Facilitating Efficient Selective Access from Data Grids", International Supercomputer Conference 2005, 2005,

### 2004

### Alex Sim, Junmin Gu, Arie Shoshani, Vijaya Natarajan, "DataMover: Robust Terabytes-Scale Multi-file Replication over Wide-Area Networks", the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), 2004,

### 2003

### Arie Shoshani, Alexander Sim, Junmin Gu, "Storage Resource Managers: Essential Components for the Grid", Grid Resource Management: State of the Art and Future Trends, edited by Jarek Nabrzyski, Jennifer M. Schopf, Jan Weglarz, (Kluwer Academic Publishers: 2003)

### A. Sim, J. Gu, A. Shoshani, E. Hjort, D. Olson, "Experience with Deploying Storage Resource Managers to Achieve Robust File Replication", Computing in High Energy Physics, 2003,

### Arie Shoshani, Alex Sim, Junmin Gu, "Storage Resource Managers: Essential Components for Grid Applications", Globus World, 2003,

### D. Vasco, L. Johnson, O. Marques, "Resolution, Uncertainty and Whole Earth Tomography", Journal of Geophysical Research, Solid Earth, 2003, 108,

### Kesheng Wu, Wei-Ming Zhang, Alexander Sim, Junmin Gu, Arie Shoshani, "Grid Collector: An Event Catalog With Automated File Management", Proceedings of IEEE Nuclear Science Symposium 2003, 2003, doi: 10.1109/NSSMIC.2003.1351830

### Kesheng Wu, Wei-Ming Zhang, Alexander Sim, Junmin Gu, Arie Shoshani, "Grid collector: An event catalog with automated file management", 2003 IEEE Nuclear Science Symposium. Conference Record (IEEE Cat. No. 03CH37515), 2003, LBNL 55563,

### 2002

### A. Shoshani, A. Sim, J. Gu, "Storage Resource Managers: Middleware components for Grid Storage", the 19th IEEE Symposium on Mass Storage Systems, 2002,

### BR Gaeke, P Husbands, XS Li, L Oliker, KA Yelick, R Biswas, "Memory-intensive benchmarks: IRAM vs. cache-based machines", Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2002, 2002, 290--296, doi: 10.1109/IPDPS.2002.1015506

- Download File: ipdps02-iram.pdf (pdf: 91 KB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

- Download File: sirev02-sparse.pdf (pdf: 475 KB)

### 2001

### L. Oliker, R. Biswas, P. Husbands, X. Li, "Ordering Sparse Matrices for Cache-Based Systems", SIAM Conference on Parallel Processing, 2001,

- Download File: siampp01abstactb.pdf (pdf: 2.1 MB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

- Download File: ai01.pdf (pdf: 163 KB)

### DL Brown, R Cortez, ML Minion, "Accurate Projection Methods for the Incompressible Navier-Stokes Equations", Journal of Computational Physics, 2001, 168:464--499, doi: 10.1006/jcph.2001.6715

- Download File: BCMIIJCP.pdf (pdf: 281 KB)

### 2000

### L. Oliker, X. Li, G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th International Conference on Parallel and Distributed Computing Systems, 2000,

- Download File: pdcs00-pcg.pdf (pdf: 167 KB)

### L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on Solving Irregularly Structured Problems in Parallel, 2000,

- Download File: irr00awk.pdf (pdf: 130 KB)

### B. Parlett, O. Marques, "An Implementation of the dqds Algorithm (positive case)", Linear Algebra and its Applications, 2000, 309:217-259,

### 1997

### ML Minion, DL Brown, "Performance of under-resolved two-dimensional incompressible flow simulations, II", Journal of Computational Physics, 1997, 138:734--765, doi: 10.1006/jcph.1997.5843

- Download File: underIIJCP.pdf (pdf: 882 KB)

### 1996

### S. Chatterjee, J. Gilbert, L. Oliker, R. Schreiber, and T. Sheffler, "Algorithms for Automatic Alignment of Arrays", Journal of Parallel and Distributed Computing (JPDC), July 1996,

- Download File: jpdc96.ps.gz (gz: 89 KB)

### 1995

### DL Brown, ML Minion, "Performance of under-resolved two-dimensional incompressible flow simulations", Journal of Computational Physics, 1995, 122:165--183, doi: 10.1006/jcph.1995.1205

- Download File: underjcp.pdf (pdf: 3.2 MB)