# Publications

### 2020

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications, Argonne Leadership Computing Facility (ALCF) Webinar Series, May 27, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this webinar, hosted by DOE’s Exascale Computing Project and the ALCF, we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

### Shao-Jun Dong, Chao Wang, Yong-Jian Han, Chao Yang and Lixin He, "Stable diagonal stripes in the t–J model at $\bar{n}_h = 1/8$ doping from fPEPS calculations", npj Quantum Materials, May 8, 2020, 5:28, doi: https://doi.org/10.1038/s41535-020-0226-4

### D. B. Williams-Young, P. Beckman and C. Yang, "A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation", submitted, May 7, 2020,

### C. T. Kelley, J. Bernholc, E. L. Briggs, S. Hamilton, L. Lin and C. Yang, "Mesh Independence of the Generalized Davidson Algorithm", Journal of Computational Physics, May 1, 2020, 409:109322, doi: https://doi.org/10.1016/j.jcp.2020.109322

### Kai-Hsin Liou, Chao Yang and James R. Chelikowsky, "Scalable Implementation of Polynomial Filtering for Density Functional Theory Calculation in PARSEC", Computer Physics Communications, April 28, 2020, In press, doi: https://doi.org/10.1016/j.cpc.2020.107330

### Li Zhou, Chao Yang, Weiguo Gao, Talita Perciano, Karen M. Davies, Nicholas K. Sauter, "Subcellular structure segmentation from cryo-electron tomograms via machine learning", PLOS Journal of Computational Biology, April 2, 2020, submitted, doi: https://doi.org/10.1101/2020.04.09.034025

### F. Henneke, L. Lin, C. Vorwerk, C. Draxl, R. Klein and C. Yang, "Fast optical absorption spectra calculations for periodic solid state systems", Communications in Applied Mathematics and Computational Science, March 16, 2020, in press,

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2020, LBNL 2001269, doi: 10.25344/S4P88Z

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### D. Camps, R. Van Beeumen and C. Yang, "Quantum Fourier Transform Revisited", March 6, 2020,

### Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li, "Leveraging One-Sided Communication for Sparse Triangular Solvers", 2020 SIAM Conference on Parallel Processing for Scientific Computing, February 14, 2020,

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++: A PGAS/RPC Library for Asynchronous Exascale Communication in C++, Exascale Computing Project (ECP) Annual Meeting 2020, February 6, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

### R. Van Beeumen, G. D. Kahanamoku-Meyer, N. Y. Yao and C. Yang, "A scalable matrix-free iterative eigensolver for studying many-body localization", HPCAsia2020: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, ACM, January 7, 2020, 179-187, doi: 10.1145/3368474.3368497

### 2019

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++ Tutorial, National Energy Research Scientific Computing Center (NERSC), November 1, 2019,

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through implementing basic algorithms in UPC++. We will also look at irregular applications and how to take advantage of UPC++ features to optimize their performance.

### L. Yang, Z. Wen, C. Yang and Y. Zhang, "Block Algorithms with Augmented Rayleigh-Ritz Projections for Large-Scale Eigenpair Computation", Journal of Computational Mathematics, November 1, 2019, 37:889-915, doi: 10.4208/jcm.1910-m2019-0034

### Mark Adams, Stephen Cornford, Daniel Martin, Peter McCorquodale, "Composite matrix construction for structured grid adaptive mesh refinement", Computer Physics Communications, November 2019, 244:35-39, doi: 10.1016/j.cpc.2019.07.006

### T. Hernandez, R. Van Beeumen, M. Caprio and C. Yang, "A greedy algorithm for computing eigenvalues of a symmetric matrix", submitted, October 1, 2019,

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2019, LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ v1.0 Specification, Revision 2019.9.0", Lawrence Berkeley National Laboratory Tech Report, September 14, 2019, LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed, "UPC++: A High-Performance Communication Framework for Asynchronous Computation", 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19), Rio de Janeiro, Brazil, IEEE, May 2019, doi: 10.25344/S4V88H

### B. Peng, R. Van Beeumen, D.B. Williams-Young, K. Kowalski, C. Yang, "Approximate Green’s function coupled cluster method employing effective dimension reduction", Journal of Chemical Theory and Computation, 2019, 15:3185-3196, doi: 10.1021/acs.jctc.9b00172

### P. Benner, V. Khoromskaia, B. N. Khoromskij and C. Yang, "Computing the density of states for optical spectra of molecules by low-rank and QTT tensor approximation", Journal of Computational Physics, April 1, 2019, 382:221-239, doi: https://doi.org/10.1016/j.jcp.2019.01.011

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2019.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2019, LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 10", Lawrence Berkeley National Laboratory Tech Report, March 15, 2019, LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "Pagoda: Lightweight Communications and Global Address Space Support for Exascale Applications - UPC++", Poster at Exascale Computing Project (ECP) Annual Meeting 2019, January 2019,

### Victor Yu, William Dawson, Alberto Garcia, Ville Havu, Ben Hourahine, William Huhn, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, others, Large-Scale Benchmark of Electronic Structure Solvers with the ELSI Infrastructure, Bulletin of the American Physical Society, 2019,

### Y. Liu, W. Sid-Lakhdar, E. Rebrova, P. Ghysels, X. Sherry Li, "A Hierarchical Low-Rank Decomposition Algorithm Based on Blocked Adaptive Cross Approximation Algorithms", arXiv e-prints, January 1, 2019,

### 2018

### Y. Li, Z. Wen, C. Yang, Y. Yuan, "A Semi-smooth Newton Method For semidefinite programs and its applications in electronic structure calculations", SIAM J. Sci. Comput., December 18, 2018, 40:A4131–A415, doi: 10.1137/18M1188069

### R. Van Beeumen, O. Marques, E.G. Ng, C. Yang, Z. Bai, L. Ge, O. Kononenko, Z. Li, C.-K. Ng, L. Xiao, "Computing resonant modes of accelerator cavities by solving nonlinear eigenvalue problems via rational approximation", Journal of Computational Physics, 2018, 374:1031-1043, doi: 10.1016/j.jcp.2018.08.017

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'18), November 13, 2018,

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### Gianina Alina Negoita, James P. Vary, Glenn R. Luecke, Pieter Maris, Andrey M. Shirokov, Ik Jae Shin, Youngman Kim, Esmond G. Ng, Chao Yang, Matthew Lockner, Gurpur M. Prabhu, "Deep Learning: Extrapolation Tool for Ab Initio Nuclear Theory", CoRR, November 10, 2018,

### J. Duersch, M. Shao, C. Yang, M. Gu, "A Robust and Efficient Implementation of LOBPCG", SIAM J. Sci. Comput., October 4, 2018, 40:C655–C676, doi: 10.1137/17M1129830

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2018.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2018, LBNL 2001180, doi: 10.25344/S49G6V

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 8", Lawrence Berkeley National Laboratory Tech Report, September 26, 2018, LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### M. C. Clement, J. Zhang, C. A. Lewis, C. Yang, Edward F. Valeev, "Optimized Pair Natural Orbitals for the Coupled Cluster Methods", J. Chem. Theory Comput., August 1, 2018, 14:4581–4589, doi: 10.1021/acs.jctc.8b00294

### R. Huang, J. Sun, C. Yang, "Recursive integral method with Cayley transformation", Numerical Linear Algebra with Applications, July 10, 2018, 25:1-12, doi: 10.1002/nla.2199

### W. Hu, M. Shao, A. Cepellotti, F. H. Jornada, L. Lin, K. Thicke, C. Yang, S. Louie, "Accelerating Optical Absorption Spectra and Exciton Energy Computation via Interpolative Separable Density Fitting", International Conference on Computational Science (ICCS2018), Lecture Notes in Computer Science, Springer, Cham, June 12, 2018, 10861:604-617, doi: 10.1007/978-3-319-93701-4_48

### Meiyue Shao, Felipe H. da Jornada, Lin Lin, Chao Yang, Jack Deslippe, Steven G. Louie, "A structure preserving Lanczos algorithm for computing the optical absorption spectrum", SIAM Journal on Matrix Analysis and Applications, 2018, 39:683–711, doi: 10.1137/16M1102641

### T. Ke, A. S. Brewster, S. X. Yu, D. Ushizima, C. Yang, N. K. Sauter, "A convolutional neural network-based screening tool for X-ray serial crystallography", Journal of Synchrotron Radiation, April 24, 2018, 25:665-670, doi: 10.1107/S1600577518004873

### A. S. Banerjee, L. Lin, P. Suryanarayana, C. Yang, J. E. Pask, "Two-level Chebyshev filter based complementary subspace method for pushing the envelope of large-scale electronic structure calculations", J. Chem. Theory Comput., April 16, 2018, 14:2930–2946, doi: 10.1021/acs.jctc.7b01243

### Junmin Gu, Scott Klasky, Norbert Podhorszki, Ji Qiang, Kesheng Wu, "Querying Large Scientific Data Sets with Adaptable IO System ADIOS", Supercomputing Frontiers (Best Paper Award), Springer International Publishing, 2018, 51-69,

### J Bachan, S Baden, D Bonachea, P Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B Lelbach, B van Straalen, "UPC++ Specification v1.0, Draft 6", Lawrence Berkeley National Laboratory Tech Report, March 26, 2018, LBNL 2001135, doi: 10.2172/1430689

### J Bachan, S Baden, D Bonachea, PH Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B van Straalen, "UPC++ Programmer’s Guide, v1.0-2018.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2018, LBNL 2001136, doi: 10.2172/1430693

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### J. M. Kasper, D. B. Williams-Young, E. Vecharynski, C. Yang, X. Li, "A Well-Tempered Hybrid Method for Solving Challenging Time-Dependent Density Functional Theory (TDDFT) Systems", J. Chem. Theory Comput., March 16, 2018, 14:2034–2041, doi: 10.1021/acs.jctc.8b00141

### H. Zhan, G. Gomes, X. S. Li, K. Madduri, A. Sim, K. Wu, "Consensus Ensemble System for Traffic Flow Prediction", IEEE Transactions on Intelligent Transportation Systems, 2018, doi: 10.1109/TITS.2018.2791505

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes", Poster at Exascale Computing Project (ECP) Annual Meeting 2018, February 2018

### P. Benner, H. Faßbender, C. Yang, "Some remarks on the complex J-symmetric eigenproblem", Linear Algebra and its Applications, January 14, 2018, 544:407-442, doi: 10.1016/j.laa.2018.01.014

### E. Rebrova, G. Chavez, Y. Liu, P. Ghysels, X. S. Li, "A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression", IEEE IPDPSW, 2018,

### Yang Liu, Mathias Jacquelin, Pieter Ghysels, Xiaoye S Li, "Highly scalable distributed-memory sparse triangular solution algorithms", 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, 2018, 87–96

### Mathias Jacquelin, Lin Lin, Chao Yang, "PSelInv – A distributed memory parallel algorithm for selected inversion: The non-symmetric case", Parallel Computing, 2018, 74:84–98

### Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight, "A 3D Parallel Algorithm for QR Decomposition", SPAA '18, 2018,

### Victor Wen-zhe Yu, Fabiano Corsetti, Alberto Garcia, William P Huhn, Mathias Jacquelin, Weile Jia, Bjorn Lange, Lin Lin, Jianfeng Lu, Wenhui Mi, others, "ELSI: A unified software interface for Kohn–Sham electronic structure solvers", Computer Physics Communications, 2018, 222:267–285

### Mathias Jacquelin, Lin Lin, Weile Jia, Yonghua Zhao, Chao Yang, "A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems", Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, January 1, 2018, 54–63

### William Huhn, Alberto Garcia, Luigi Genovese, Ville Havu, Mathias Jacquelin, Weile Jia, Murat Keceli, Raul Laasner, Yingzhou Li, Lin Lin, others, "Unified Access To Kohn-Sham DFT Solvers for Different Scales and HPC: The ELSI Project", Bulletin of the American Physical Society, American Physical Society, 2018,

### Meiyue Shao, Hasan Metin Aktulga, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver", Computer Physics Communications, 2018, 222:1–13, doi: 10.1016/j.cpc.2017.09.004

### Mathias Jacquelin, Esmond G Ng, Barry W Peyton, "Fast and effective reordering of columns within supernodes using partition refinement", 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, 2018, 76–86

### 2017

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, "UPC++: a PGAS C++ Library", ACM/IEEE Conference on Supercomputing, SC'17, November 2017,

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden, "The UPC++ PGAS library for Exascale Computing", Proceedings of the Second Annual PGAS Applications Workshop (PAW17), November 13, 2017, 7, doi: 10.1145/3144779.3169108

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017,

### E. J. Bylaska, E. Apra, K. Kowalski, M. Jacquelin, W.A. de Jong, A. Vishnu, B. Palmer, J. Daily, T.P. Straatsma, J.R. Hammond, M. Klemm, "Transitioning NWChem to the Next Generation of Many Core Machines", Exascale Scientific Applications Scalability and Performance Portability, edited by Tjerk P. Straatsma, Katerina B. Antypas, Timothy J. Williams, (Taylor & Francis: November 9, 2017)

### M. Jacquelin, L. Lin and C. Yang, "PSelInv – A distributed memory parallel algorithm for selected inversion: the symmetric case", Parallel Computing, November 9, 2017, 74:84-98, doi: 10.1016/j.parco.2017.11.009

### E.J. Bylaska, J. Hammond, M. Jacquelin, W.A. de Jong, M. Klemm, "Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor", High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, Springer, Cham, October 21, 2017, 404-418, doi: 10.1007/978-3-319-67630-2_30

### W. Hu, L. Lin, R. Zhang, C. Yang, J. Yang, "Highly efficient photocatalytic water splitting over edge-modified phosphorene nanoribbons", J. Am. Chem. Soc., October 13, 2017, 139:15429–1543, doi: 10.1021/jacs.7b08474

### Meiyue Shao and Chao Yang, "Properties of Definite Bethe–Salpeter Eigenvalue Problems", Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing. EPASA 2015. Lecture Notes in Computational Science and Engineering, vol 117, 2017, 91–105, doi: 10.1007/978-3-319-62426-6_7

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2017.9", Lawrence Berkeley National Laboratory Tech Report, September 2017, LBNL 2001065, doi: 10.2172/1398522

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### W. Hu, L. Lin, C. Yang, "Interpolative Separable Density Fitting Decomposition for Accelerating Hybrid Density Functional Calculations with Applications to Defects in Silicon", J. Chem. Theory Comput., September 29, 2017, 13:5420–5431, doi: 10.1021/acs.jctc.7b00807

### J Bachan, S Baden, D Bonachea, P Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B Lelbach, B van Straalen, "UPC++ Specification v1.0, Draft 4", Lawrence Berkeley National Laboratory Tech Report, September 27, 2017, LBNL 2001066, doi: 10.2172/1398521

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### W. Hu, L. Lin, C. Yang, "Projected Commutator DIIS Method for Accelerating Hybrid Functional Electronic Structure Calculations", J. Chem. Theory Comput., September 22, 2017, 13:5458–5467, doi: 10.1021/acs.jctc.7b00892

### V. Yu, F. Corsetti, A. García, W. P. Huhn, M. Jacquelin, W. Jia, B. Lange, L. Lin, J. Lu, W. Mi, A. Seifitokaldani, Á. Vázquez-Mayagoitia, C. Yang, H. Yang, V. Blum, "ELSI: A unified software interface for Kohn–Sham electronic structure solvers", Computer Physics Communications, September 7, 2017, 222:267-285, doi: 10.1016/j.cpc.2017.09.007

### R. Van Beeumen, D.B. Williams-Young, J.M. Kasper, C. Yang, E.G. Ng, X. Li, "Model order reduction algorithm for estimating the absorption spectrum", Journal of Chemical Theory and Computation, 2017, 13:4950-4961, doi: 10.1021/acs.jctc.7b00402

### E. Vecharynski, J. Brabec, M. Shao, N. Govind, C. Yang, "Efficient Block Preconditioned Eigensolvers for Linear Response Time-dependent Density Functional Theory", Computer Physics Communications, 2017, 221:42-52, doi: https://doi.org/10.1016/j.cpc.2017.07.017

We present two efficient iterative algorithms for solving the linear response eigenvalue problem arising from the time-dependent density functional theory. Although the matrix to be diagonalized is nonsymmetric, it has a special structure that can be exploited to save both memory and floating point operations. In particular, the nonsymmetric eigenvalue problem can be transformed into a product eigenvalue problem that is self-adjoint with respect to a K-inner product. This product eigenvalue problem can be solved efficiently by a modified Davidson algorithm and a modified locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm that make use of the K-inner product. The solution of the product eigenvalue problem yields one component of the eigenvector associated with the original eigenvalue problem. However, the other component of the eigenvector can be easily recovered in a postprocessing procedure. Therefore, the algorithms we present here are more efficient than existing algorithms that try to approximate both components of the eigenvectors simultaneously. The efficiency of the new algorithms is demonstrated by numerical examples.
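The reduction to a product eigenvalue problem can be sketched as follows. This is only an illustration under an assumed block structure that is common for linear response problems; the symbols $A$, $B$, $K$, $M$, $u$, $v$, $x$, $y$ are introduced here for the sketch and are not taken from the abstract.

```latex
% Assumed structure: A and B real symmetric, K = A - B, M = A + B.
\[
H = \begin{pmatrix} A & B \\ -B & -A \end{pmatrix}, \qquad
H \begin{pmatrix} u \\ v \end{pmatrix}
  = \lambda \begin{pmatrix} u \\ v \end{pmatrix}, \qquad
K = A - B, \quad M = A + B .
\]
% Add and subtract the two block rows, with x = u + v and y = u - v:
\[
K y = \lambda x, \qquad M x = \lambda y
\quad\Longrightarrow\quad
M K \, y = \lambda^{2} y .
\]
% If K is symmetric positive definite, MK is self-adjoint in the K-inner product,
% because K (M K) = K M K is symmetric:
\[
\langle M K a, b \rangle_{K} = b^{T} K M K a = \langle a, M K b \rangle_{K} .
\]
% The other component follows in a postprocessing step: x = K y / \lambda.
```

Under these assumptions only one component ($y$) has to be iterated on, and $u = (x + y)/2$, $v = (x - y)/2$ are recovered afterwards.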

### W. Hu, L. Lin, A. Banerjee, E. Vecharynski, C. Yang, "Adaptively compressed exchange operator for large scale hybrid density functional calculations with applications to the adsorption of water on silicene", J. Chem. Theory Comput., February 8, 2017, 13:1188–1198,

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes", Poster at Exascale Computing Project (ECP) Annual Meeting 2017, January 2017,

### Mathias Jacquelin, Lin Lin, Chao Yang, "PSelInv—A distributed memory parallel algorithm for selected inversion: The symmetric case", ACM Transactions on Mathematical Software (TOMS), 2017, 43:21,

### U Ayachit, A Bauer, EPN Duque, G Eisenhauer, N Ferrier, J Gu, KE Jansen, B Loring, Z Lukic, S Menon, D Morozov, P O'Leary, R Ranjan, M Rasquin, CP Stone, V Vishwanath, GH Weber, B Whitlock, M Wolf, KJ Wu, EW Bethel, "Performance Analysis, Design Considerations, and Applications of Extreme-Scale in Situ Infrastructures", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, 2017, 921--932, LBNL 1007264, doi: 10.1109/SC.2016.78

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", Lecture Notes in Computational Science, January 1, 2017,

### Mathias Jacquelin, Wibe De Jong, Eric Bylaska, "Towards highly scalable Ab initio molecular dynamics (AIMD) simulations on the Intel knights landing manycore processor", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 1, 2017, 234--243,

### MF Adams, E Hirvijoki, MG Knepley, J Brown, T Isaac, R Mills, "Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures", SIAM J. Sci. Comput., 2017, 39:C452--C465, doi: 10.1137/17M1118828

### R Hager, J Lang, CS Chang, S Ku, Y Chen, SE Parker, MF Adams, "Verification of long wavelength electromagnetic modes with a gyrokinetic-fluid hybrid model in the XGC code", Physics of Plasmas, 2017, 24, doi: 10.1063/1.4983320

### E Hirvijoki, MF Adams, "Conservative discretization of the Landau collision integral", Physics of Plasmas, 2017, 24, doi: 10.1063/1.4979122

### Ariful Azad, Mathias Jacquelin, Aydin Buluç, Esmond G Ng, "The reverse Cuthill-McKee algorithm in distributed-memory", Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, January 2017, 22--31,

- Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

### 2016

### M. Jacquelin, L. Lin and C. Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion: the non-symmetric case", PMAA, December 30, 2016,

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

### Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

- Download File: SC16-HPGMG-BoF-Intro.pdf (pdf: 1020 KB)

### Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

- Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

### Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

- Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

### A. S. Banerjee, L. Lin, W. Hu, C. Yang and J. E. Pask, "Chebyshev polynomial filtered subspace iteration in the discontinuous Galerkin method for large-scale electronic structure calculations", Journal of Chemical Physics, October 1, 2016,

### Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

### W.A. de Jong, M. Jacquelin, E.J. Bylaska, "Advancing Algorithms to Increase Performance of Correlated and Dynamical Electronic Structure Simulation", CMMSE 2016: Proceedings of the 16th International Conference on Mathematical Methods in Science and Engineering, September 1, 2016, 5:1342-1346,

### Veronika Strnadova-Neeley, Aydin Buluc, John R. Gilbert, Leonid Oliker, Weimin Ouyang, "LiRa: A New Likelihood-Based Similarity Score for Collaborative Filtering", August 30, 2016,

### Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad, "A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems", SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534, 2016, doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.

### Mark Adams, Jed Brown, Matt Knepley, Ravi Samtaney, "Segmental Refinement: A Multigrid Technique for Data Locality", SIAM J. Sci. Comput., 2016, 38:4,

### Osni Marques, Paulo B. Vasconcelos, "Computing the Bidiagonal SVD through an Associated Tridiagonal Eigenproblem", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Naoya Nomura, Akihiro Fujii, Teruo Tanaka, Kengo Nakajima, Osni Marques, "Performance Analysis of SA-AMG Method by Setting Extracted Near-kernel Vectors", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Fabien Bruneval, Tonatiuh Rangel, Samia M. Hamed, Meiyue Shao, Chao Yang, Jeffrey B. Neaton, "MOLGW 1: many-body perturbation theory software for atoms, molecules, and clusters", Computer Physics Communications, 2016, 208:149–161, doi: 10.1016/j.cpc.2016.06.019

### Osni Marques, Alex Druinsky, Xiaoye S. Li, Andrew T. Barker, Panayot Vassilevski, Delyan Kalchev, "Tuning the Coarse Space Construction in a Spectral AMG Solver", ICCS 2016 (The International Conference on Computational Science), San Diego, CA, Elsevier, June 2016,

### Meiyue Shao, Lin Lin, Chao Yang, Fang Liu, Felipe H. da Jornada, Jack Deslippe and Steven G. Louie, "Low rank approximation in G0W0 calculations", Science China Mathematics, June 4, 2016, 59:1593–1612, doi: 10.1007/s11425-016-0296-x

### D. Pugmire, J. Kress, H. Childs, M. Wolf, G. Eisenhauer, J. Low, R. M. Churchill, T. Kurc, K. Wu, A. Sim, J. Gu, J. Choi, S. Klasky, "Visualization and Analysis for Near-Real-Time Decision Making in Distributed Workflows", High Performance Data Analysis and Visualization Workshop (HPDAV2016) in conjunction with the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016), 2016, doi: 10.1109/IPDPSW.2016.175

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, The 11th Scheduling for Large Scale Systems Workshop, May 18, 2016,

### Mathias Jacquelin, Yili Zheng, Esmond Ng, Katherine Yelick, "An Asynchronous Task-based Fan-Both Sparse Cholesky Solver", Submitted to SuperComputing'16, May 10, 2016,

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, SIAM PP'16, April 15, 2016,

### M. Jacquelin, L. Lin, W. Jia, Y. Zhao and C. Yang, "A Left-looking selected inversion algorithm and task parallelism on shared memory systems", April 9, 2016,

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.
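To give a feel for the scaling statement above, the following sketch compares an N log N operation count with an N^4 count on cubic grids with N = n^3 basis functions. The grid sizes, the base-2 logarithm, and the unit constants are illustrative assumptions, not figures from the paper.

```python
import math


def basis_size(n: int) -> int:
    """Number of basis functions on an n x n x n cubic grid (N = n^3)."""
    return n ** 3


def fast_cost(N: int) -> float:
    """Illustrative O(N log N) operation count (unit constant, log base 2)."""
    return N * math.log2(N)


def naive_cost(N: int) -> float:
    """Illustrative O(N^4) operation count (unit constant)."""
    return float(N) ** 4


if __name__ == "__main__":
    # Hypothetical grid sizes chosen only to show the trend.
    for n in (10, 20, 40):
        N = basis_size(n)
        ratio = naive_cost(N) / fast_cost(N)
        print(f"n={n:3d}  N={N:6d}  N log N={fast_cost(N):.3e}  "
              f"N^4={naive_cost(N):.3e}  ratio={ratio:.3e}")
```

Because N itself grows as n^3, doubling the grid resolution multiplies the N^4 count by 4096 but the N log N count by only a little more than 8.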

### Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

- Download File: CU16SWWilliams.pptx (pptx: 1 MB)

### Z. Wen, C. Yang, X. Liu and Y. Zhang, "A Penalty-based Trace Minimization Method for Large-scale Eigenspace Computation", J. Sci. Comp., March 1, 2016, 66:1175-1203, doi: 10.1007/s10915-015-0061-0

### E. Vecharynski, C. Yang, and F. Xue, "Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems", SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527, 2016, doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", to appear in Proceedings of International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, in Lecture Notes in Computational Science and Engineering, Springer, 2016,

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our method is based on Lanczos- and Arnoldi-type iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.
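For readers unfamiliar with the quantity being estimated, here is a minimal sketch of an exact eigenvalue count on a small example. It does not implement the preconditioned Lanczos/Arnoldi estimators of the paper; it swaps in the classical Sturm-sequence (inertia) count for a real symmetric tridiagonal matrix. The function names and the test matrix are hypothetical.

```python
def count_below(diag, off, x, tiny=1e-300):
    """Count eigenvalues smaller than x of a symmetric tridiagonal matrix.

    diag holds the n diagonal entries, off the n-1 off-diagonal entries.
    The count equals the number of negative pivots in the LDL^T
    factorization of T - x*I (Sylvester's law of inertia).
    """
    count = 0
    q = 1.0
    for i, d in enumerate(diag):
        e2 = off[i - 1] ** 2 if i > 0 else 0.0
        q = d - x - e2 / q
        if abs(q) < tiny:
            q = -tiny  # guard against a zero pivot
        if q < 0.0:
            count += 1
    return count


def count_in_interval(diag, off, a, b):
    """Number of eigenvalues in the half-open interval [a, b)."""
    return count_below(diag, off, b) - count_below(diag, off, a)


if __name__ == "__main__":
    # 5x5 second-difference matrix tridiag(-1, 2, -1); its eigenvalues are
    # 2 - 2*cos(k*pi/6) for k = 1..5.
    n = 5
    diag = [2.0] * n
    off = [-1.0] * (n - 1)
    print(count_in_interval(diag, off, 0.5, 3.5))
```

In a spectrum-slicing setting, counts of this kind are what let one split the spectrum into intervals holding comparable numbers of eigenvalues; the paper's contribution is obtaining such counts as estimates for large matrices where an exact factorization is not practical.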

### Wei Hu, Lin Lin, Chao Yang, Jun Dai and Jinlong Yang, "Edge-Modified Phosphorene Nanoflake Heterojunctions as Highly Efficient Solar Cells", Nano Lett., February 5, 2016, 16:1675–1682, doi: 10.1021/acs.nanolett.5b04593

### L. Lin, Y. Saad and C. Yang, "Approximating spectral densities of large matrices", SIAM Review, February 1, 2016, 58:34–65, doi: 10.1137/130934283

### P. Li, X. Liu, M. Chen, P. Lin, X. Ren, L. Lin, C. Yang, L. He, "Large-scale ab initio simulations based on systematically improvable atomic basis", Computational Materials Science, February 1, 2016, 112:503–517, doi: 10.1016/j.commatsci.2015.07.004

### J. Brabec, C. Yang, E. Epifanovsky, A.I. Krylov, and E. Ng, "Reduced-cost sparsity-exploiting algorithm for solving coupled-cluster equations", Journal of Computational Chemistry, January 24, 2016, 37:1059–1067, doi: 10.1002/jcc.24293

### Burlen Loring, Suren Byna, Prabhat, Junmin Gu, Hari Krishnan, Michael Wehner, and Oliver Ruebel, "TECA an Extreme Event Detection and Climate Analysis Package for High Performance Computing", The AMS (American Meteorological Society) 96th Annual Meeting, January 6, 2016,

### Meiyue Shao, Felipe H. da Jornada, Chao Yang, Jack Deslippe, Steven G. Louie, "Structure preserving parallel algorithms for solving the Bethe–Salpeter eigenvalue problem", Linear Algebra and its Applications, 2016, 488:148–167, doi: 10.1016/j.laa.2015.09.036

### D. Pugmire, J. Kress, J. Choi, S. Klasky, T. Kurc, R. M. Churchill, M. Wolf, G. Eisenhauer, H. Childs, K. Wu, A. Sim, J. Gu, J. Low, "Visualization and Analysis for Near-Real-Time Decision Making in Distributed Workflows", 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016, 1007--1013, doi: 10.1109/IPDPSW.2016.175

### Mathias Jacquelin, Lin Lin, Nathan Wichmann, Chao Yang, Enhancing scalability and load balancing of Parallel Selected Inversion via tree-based asynchronous communication, Parallel and Distributed Processing Symposium, 2016 IEEE International, Pages: 192--201, January 1, 2016,

### 2015

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

### M. Jacquelin, L. Lin, N. Wichmann and C. Yang, "Enhancing scalability and load balancing of Parallel Selected Inversion via tree-based asynchronous communication", accepted IPDPS16, November 25, 2015,

### Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light sources. This is an important tool for characterization of macromolecules and nano-particle systems, with applications such as the design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.

We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.

We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

### M. van Setten; F. Caruso; S. Sharifzadeh; X. Ren; M. Scheffler; F. Liu; J. Lischner; L. Lin; J. Deslippe; S. Louie; C. Yang; F. Weigend; J. Neaton; F. Evers; P. Rinke, "GW100: Benchmarking G0W0 for molecular systems", Journal of Chemical Theory and Computation, October 22, 2015,

### Jiri Brabec, Lin Lin, Meiyue Shao, Niranjan Govind, Chao Yang, Yousef Saad, Esmond G. Ng, "Fast Algorithms for Estimating the Absorption Spectrum within Linear Response Time-dependent Density Functional Theory", Journal of Chemical Theory and Computation, 2015, 11:5197–5208, doi: 10.1021/acs.jctc.5b00887

### Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

### Abhinav Sarje, Xiaoye Li, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Reconstructing Nanostructures from X-Ray Scattering Data", OLCF User Meeting, June 2015,

### M. Ulbrich, Z. Wen, C. Yang, D. Klockner, Z. Lu, "A proximal gradient method for ensemble density functional theory", SIAM J. Sci. Comp., June 20, 2015, 37:A1975--A20, doi: 10.1137/14098973X

### Mathias Jacquelin, Lin Lin, Chao Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion : the Symmetric Case", To appear in ACM Transactions on Mathematical Software (TOMS), May 28, 2015,

### Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

- Download File: triangles-gabb.pdf (pdf: 384 KB)

### C. Yang, Absorption Spectrum Estimation via Linear Response TDDFT, Applied Math Seminar, Stanford University, May 13, 2015,

### C. Yang, Fast Numerical Algorithms for Large-scale Electronic Structure Calculations, DOE BES Computational and Theoretical Chemistry PI Meeting, April 28, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Math Colloquium, Michigan Tech University, April 24, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Applied math & PDE seminar, UC Davis, April 14, 2015,

### Fang Liu, Lin Lin, Derek Vigil-Fowler, Johannes Lischner, Alexander F. Kemper, Sahar Sharifzadeh, Felipe H. da Jornada, Jack Deslippe, Chao Yang, Jeffrey B. Neaton, Steven G. Louie, "Numerical integration for ab initio many-electron self energy calculations within the GW approximation", Journal of Computational Physics, April 1, 2015,

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems, with numerous applications such as the design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.

### C. Yang, Fast Numerical Methods for Computational Materials Science and Chemistry, CRD All-hands meeting, March 4, 2015,

### Marc Baboulin, Xiaoye S. Li, Francois-Henry Rouet, "Using random butterfly transformations to avoid pivoting in sparse direct methods", High Performance Computing for Computational Science - VECPAR 2014, Lecture Notes in Computer Science, Springer. Preprint, 2015,

### E. Vecharynski, C. Yang, J. E. Pask, "A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix", Journal of Computational Physics, Vol. 290, pp. 73–89, 2015,

We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix *A*. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of *A*. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.

### Wei Hu, Lin Lin and Chao Yang, "Edge reconstruction in armchair phosphorene nanoribbons revealed by discontinuous Galerkin density functional theory", Phys. Chem. Chem. Phys., 2015, Advance Article, February 11, 2015, doi: 10.1039/C5CP00333D

With the help of our recently developed massively parallel DGDFT (Discontinuous Galerkin Density Functional Theory) methodology, we perform large-scale Kohn–Sham density functional theory calculations on phosphorene nanoribbons with armchair edges (ACPNRs) containing a few thousand to ten thousand atoms. The use of DGDFT allows us to systematically achieve a conventional plane wave basis set type of accuracy, but with a much smaller number (about 15) of adaptive local basis (ALB) functions per atom for this system. The relatively small number of degrees of freedom required to represent the Kohn–Sham Hamiltonian, together with the use of the pole expansion and selected inversion (PEXSI) technique that circumvents the need to diagonalize the Hamiltonian, results in a highly efficient and scalable computational scheme for analyzing the electronic structures of ACPNRs as well as their dynamics. The total wall clock time for calculating the electronic structures of large-scale ACPNRs containing 1080–10 800 atoms is only 10–25 s per self-consistent field (SCF) iteration, with accuracy fully comparable to that obtained from conventional planewave DFT calculations. For the ACPNR system, we observe that the DGDFT methodology can scale to 5000–50 000 processors. We use DGDFT based ab initio molecular dynamics (AIMD) calculations to study the thermodynamic stability of ACPNRs. Our calculations reveal that a 2 × 1 edge reconstruction appears in ACPNRs at room temperature.

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Workshop on High Performance and Parallel Computing Methods and Algorithms for Materials Defects, Singapore, February 9, 2015,

### M. Adams, P. Colella, D. T. Graves, J.N. Johnson, N.D. Keen, T. J. Ligocki, D. F. Martin, P.W. McCorquodale, D. Modiano, P.O. Schwartz, T.D. Sternberg, B. Van Straalen, "Chombo Software Package for AMR Applications - Design Document", Lawrence Berkeley National Laboratory Technical Report LBNL-6616E, January 9, 2015,

- Download File: chomboDesign.pdf (pdf: 994 KB)

### D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A.I. Krylov, "New algorithms for iterative matrix-free eigensolvers in quantum chemistry", Journal of Computational Chemistry, Vol. 36, Issue 5, pp. 273–284, 2015,

New algorithms for iterative diagonalization procedures that solve for a small set of eigen-states of a large matrix are described. The performance of the algorithms is illustrated by calculations of low and high-lying ionized and electronically excited states using equation-of-motion coupled-cluster methods with single and double substitutions (EOM-IP-CCSD and EOM-EE-CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum-chemical calculations. The second is a recently developed solver, called the “Generalized Preconditioned Locally Harmonic Residual (GPLHR) method.” We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM-CC suite of methods in the Q-Chem electronic structure program.

### 2014

### Siegfried Cools, Pieter Ghysels, Wim van Aarle, Wim Vanroose, "A multi-level preconditioned Krylov method for the efficient solution of algebraic tomographic reconstruction problems", To appear in Journal of Computational and Applied Mathematics, December 28, 2014,

### François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, Artem Napov, "A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization", Submitted to ACM Transactions on Mathematical Software, December 2014,

### Wei Hu, Lin Lin, Chao Yang and Jinlong Yang, "Electronic structure and aromaticity of large-scale hexagonal graphene nanoflakes", J. Chem. Phys. 141, 214704 (2014), December 2, 2014, 141:214704, doi: 10.1063/1.4902806

- Download File: JCPGNFs.pdf (pdf: 3.7 MB)

With the help of the recently developed SIESTA-PEXSI method [L. Lin, A. García, G. Huhs, and C. Yang, J. Phys.: Condens. Matter 26, 305503 (2014)], we perform Kohn-Sham density functional theory calculations to study the stability and electronic structure of hydrogen passivated hexagonal graphene nanoflakes (GNFs) with up to 11 700 atoms. We find the electronic properties of GNFs, including their cohesive energy, edge formation energy, highest occupied molecular orbital-lowest unoccupied molecular orbital energy gap, edge states, and aromaticity, depend sensitively on the type of edges (armchair graphene nanoflakes (ACGNFs) and zigzag graphene nanoflakes (ZZGNFs)), size and the number of electrons. We observe that, due to the edge-induced strain effect in ACGNFs, large-scale ACGNFs’ edge formation energy decreases as their size increases. This trend does not hold for ZZGNFs due to the presence of many edge states in ZZGNFs. We find that the energy gaps E_g of GNFs all decay with respect to 1/L, where L is the size of the GNF, in a linear fashion. But as their size increases, ZZGNFs exhibit more localized edge states. We believe the presence of these states makes their gap decrease more rapidly. In particular, when L is larger than 6.40 nm, we find that ZZGNFs exhibit metallic characteristics. Furthermore, we find that the aromatic structures of GNFs appear to depend only on whether the system has 4N or 4N + 2 electrons, where N is an integer.

### David Trebotich, Mark F. Adams, Sergi Molins, Carl I. Steefel, Chaopeng Shen, "High-Resolution Simulation of Pore-Scale Reactive Transport Processes Associated with Carbon Sequestration", Computing in Science and Engineering, December 2014, 16:22-31, doi: 10.1109/MCSE.2014.77

- Download File: CISE-16-06-Trebotichappeared.pdf (pdf: 2.7 MB)

### Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

- Download File: SC14HPGMGBoF.pdf (pdf: 1.9 MB)

### Alex Druinsky, Brian Austin, Sherry Li, Osni Marques, Eric Roman, Samuel Williams, "A Roofline Performance Analysis of an Algebraic Multigrid Solver", Supercomputing (SC), November 2014,

### Veronika Strnadova, Aydın Buluç, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, John Gilbert, Daniel Rokhsar, Leonid Oliker, "Efficient and accurate clustering for large-scale genetic mapping", IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14), November 1, 2014,

- Download File: bibm14.pdf (pdf: 764 KB)

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Adaptation and Policy-Based Resource Allocation for Efficient Bulk Data Transfers in High Performance Computing Environments", 4th International Workshop on Network-aware Data Management (NDM'14), 2014,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.

### Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

### W.A. de Jong, L. Lin, H. Shan, C. Yang and L. Oliker, "Towards modelling complex mesoscale molecular environments", International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), 2014,

### Pieter Ghysels, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Jianlin Xia, Hierarchically Low-Rank Structured Sparse Factorization with Reduced Communication and Synchronization, Householder Symposium XIX, June 2014,

### Pieter Ghysels, Wim Vanroose, Karl Meerbergen, High Performance Implementation of Deflated Preconditioned Conjugate Gradients with Approximate Eigenvectors, Householder Symposium XIX June 8-13, Spa Belgium, Pages: 84 June 2014,

### Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

- Download File: hpgmg.pdf (pdf: 183 KB)

### Abhinav Sarje, Xiaoye Li, Slim Chourou, Alexander Hexemer, "Petascale X-Ray Scattering Simulations With GPUs", GPU Technology Conference, March 2014,

### Abhinav Sarje, Xiaoye Li, Alexander Hexemer, "Inverse Modeling of X-Ray Scattering Data With Reverse Monte Carlo Simulations", GPU Technology Conference, March 2014,

### Xiaoye S. Li, Artem Napov, Francois-Henry Rouet, Designing multifrontal solvers using hierarchically semiseparable structures, SIAM Conference on Parallel Processing for Scientific Computing (PP14), Portland, OR, USA, February 2014,

### E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, U. V. Catalyurek, "An Out-of-core Task-based Middleware for Data Intensive Scientific Computing", Handbook on Data Centers, in press, (Springer: February 1, 2014)

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Efficient Data Staging Using Performance-Based Adaptation and Policy-Based Resource Allocation", 22nd Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2014,

### J. Kaye, L. Lin and C. Yang, "A posteriori error estimator for adaptive local basis functions to solve Kohn-Sham density functional theory", Comm. Math. Sci., January 5, 2014, 13:1741--1740, doi: http://dx.doi.org/10.4310/CMS.2015.v13.n7.a5

### G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, Hong Diep Nguyen, E. Solomonik, "Reconstructing Householder Vectors from Tall-Skinny QR", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014, 1159-1170, doi: 10.1109/IPDPS.2014.120

### A. Fujii, O. Marques, "Axis Communication Method for Algebraic Multigrid Solver", IEICE Transactions on Information and Systems, 2014, E97-D:2955-2958,

### Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

- Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
- Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

### J. González-Domínguez, O. Marques, M. J. Martín and J. Touriño, "A 2D Algorithm with Asymmetric Workload for the UPC Conjugate Gradient Method", The Journal of Supercomputing, 2014, 70:816-829,

### Laura Grigori, Mathias Jacquelin, Amal Khabou, "Performance predictions of multilevel communication optimal LU and QR factorizations on hierarchical platforms", International Supercomputing Conference, 2014, 76--92,

### 2013

### H. M. Aktulga, L. Lin, C. Haine, E. G. Ng, C. Yang, "Parallel Eigenvalue Calculation based on Multiple Shift-invert Lanczos and Contour Integral based Spectral Projection Method", Parallel Computing, December 6, 2013, in press,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

- Download File: sarje-thmmcs-pmbs.pdf (pdf: 2 MB)

### M. Jung, E. H. Wilson III, W. Choi, J. Shalf, H. M. Aktulga, C. Yang, E. Saule, U. V. Catalyurek, M. Kandemir, "Exploring the Future of Out-of-core Computing with Compute-Local Non-Volatile Memory", International Conference for High Performance Computing, Networking, Storage and Analysis 2013 (SC13), NY, USA, ACM New York, November 17, 2013, doi: 10.1145/2503210.2503261

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

- Download File: sbac2013personal.pdf (pdf: 195 KB)

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.

### H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, "Improving the Scalability of a Symmetric Iterative Eigensolver for Multi-core Platforms", Concurrency and Computation: Practice & Experience, September 12, 2013, online, doi: 10.1002/cpe.3129

### Alfredo Buttari, Serge Gratton, Xiaoye S. Li, Marième Ngom, François-Henry Rouet, David Titley-Peloquin, Clément Weisbecker, "Error Analysis of the Block Low-Rank LU factorization of dense matrices", IRIT-CERFACS, RT-APO-13-7, August 2013,

### Emmanuel Agullo, Patrick R. Amestoy, Alfredo Buttari, Abdou Guermouche, Guillaume Joslin, Jean-Yves L'Excellent, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Mohamed Sid-Lakhdar, Shen Wang, Clément Weisbecker, Ichitaro Yamazaki, "Recent Advances in Sparse Direct Solvers", 22nd Conference on Structural Mechanics in Reactor Technology, August 18, 2013,

- Download File: paper3.pdf (pdf: 243 KB)

### Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, Maarten V. de Hoop, "A parallel geometric multifrontal solver using hierarchically semiseparable structure", Submitted to ACM Transactions on Mathematical Software, 2013,

### James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

### P. Maris, H. M. Aktulga, S. Binder, A. Calci, U. V. Catalyurek, J. Langhammer, E. G. Ng, E. Saule, R. Roth, J. P. Vary, C. Yang, "No Core CI calculations for light nuclei with chiral 2- and 3-body forces", J. Phys. Conf. Ser., IOP Publishing, August 1, 2013, 454:012063, doi: 10.1088/1742-6596/454/1/012063

### P. Ghysels, W. Vanroose, "Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm", Parallel Computing, June 24, 2013, doi: 10.1016/j.parco.2013.06.001

### Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, Sivan Toledo, "Communication optimal parallel multiplication of sparse random matrices", SPAA 2013: The 25th ACM Symposium on Parallelism in Algorithms and Architectures, Montreal, Canada, 2013, 222-231, doi: 10.1145/2486159.2486196

- Download File: spaa134-ballard.pdf (pdf: 301 KB)

### Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

- Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

### E. Solomonik, A. Buluç, J. Demmel, "Minimizing communication in all-pairs shortest paths", International Parallel and Distributed Processing Symposium (IPDPS), 2013,

- Download File: 25dapspipdps13.pdf (pdf: 256 KB)

### P. Ghysels, T. J. Ashby, K. Meerbergen, W. Vanroose, "Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines", SIAM Journal on Scientific Computing, January 8, 2013, 35:1, doi: 10.1137/12086563X

### Jack Dongarra, Mathieu Faverge, Thomas Herault, Mathias Jacquelin, Julien Langou, Yves Robert, "Hierarchical QR factorization algorithms for multi-core clusters", Parallel Computing, 2013, 39:212--232,

### L. Lin, M. Chen, C. Yang, L. He, "Accelerating Atomic Orbital-based Electronic Structure Calculation via Pole Expansion and Selected Inversion", J. Phys.: Condens. Matter, 2013,

### L. Lin, C. Yang, "Elliptic preconditioner for accelerating the self-consistent field iteration in Kohn-Sham Density Functional Theory", SIAM J. Sci. Comp., 2013,

### 2012

### P. Maris, H. M. Aktulga, M. A. Caprio, U. V. Catalyurek, E. G. Ng, D. Oryspayev, H. Potter, E. Saule, M. Sosonkina, J. P. Vary, C. Yang, Z. Zhou, "Large-scale Ab-initio Configuration Interaction Calculations for Light Nuclei", J. Phys. Conf. Ser., IOP Publishing, December 18, 2012, 403:012019, doi: 10.1088/1742-6596/403/1/012019

### H. Hu, C. Yang, K. Zhao, "Absorption correction A* for cylindrical and spherical crystals with extended range and high accuracy calculated by Thorkildsen & Larsen analytical method", in press Acta Crystallographica, A, 2012,

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

### C. Mendl, L. Lin, "Towards the Kantorovich dual solution for strictly correlated electrons in atoms and molecules", submitted to Phys. Rev. B, 2012,

### Junmin Gu, David Smith, Ann L. Chervenak, Alex Sim, "Adaptive Data Transfers that Utilize Policies for Resource Sharing", The 2nd International Workshop on Network-Aware Data Management Workshop (NDM2012), 2012,

### L. Lin, S. Shao, W.E, "Efficient iterative method for solving the Dirac-Kohn-Sham density functional theory", submitted to J. Comput. Phys., 2012,

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

### A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-of-core Eigensolver on SSD-equipped Clusters", 2012 IEEE International Conference on Cluster Computing (CLUSTER), Beijing, China, September 26, 2012, 248 - 256, doi: 10.1109/CLUSTER.2012.76

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-Of-Core Dataflow Middleware to Reduce the Cost of Large Scale Iterative Solvers", 2012 41st International Conference on Parallel Processing Workshops (ICPPW), Pittsburgh, PA, September 10, 2012, 71 - 80, doi: 10.1109/ICPPW.2012.13

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Topology-Aware Mappings for Large-Scale Eigenvalue Problems", Euro-Par 2012 Parallel Processing Conference, Rhodes Island, Greece, August 31, 2012, LNCS 748:830-842, doi: 10.1007/978-3-642-32820-6_82

### L. Lin, L. Ying, "Element orbitals for Kohn-Sham density functional theory", Phys. Rev. B, 2012, 85:235144,

### A. Buluç, J. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments", SIAM Journal on Scientific Computing (SISC), 2012,

- Download File: spgemmsisc12.pdf (pdf: 1.2 MB)

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL 5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use the GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### A. Lugowski, D. Alber, A. Buluç, J. Gilbert, S. Reinhardt, Y. Teng, A. Waranis, "A flexible open-source toolbox for scalable complex graph analysis", SIAM Conference on Data Mining (SDM), 2012,

- Download File: kdt-final.pdf (pdf: 753 KB)

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)

### A. Lugowski, A. Buluç, J. Gilbert, S. Reinhardt, "Scalable complex graph analysis with the knowledge discovery toolbox", International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012,

### L. Lin, J. Lu, L. Ying and W. E, "Optimized local basis set for Kohn-Sham density functional theory", J. Comput. Phys., 2012, 231:4515,

### Eliot Gann, Slim Chourou, Abhinav Sarje, Harald Ade, Cheng Wang, Elaine Chan, Xiaodong Ding, Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano-structured thin-film systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters, American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility makes it easy to tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

### D. Yu, D. Katramatos, A. Shoshani, A. Sim, J. Gu, V. Natarajan, "StorNet: Integrating Storage Resource Management with Dynamic Network Provisioning for Automated Data Transfer", International Committee for Future Accelerators (ICFA) Standing Committee on Inter-Regional Connectivity (SCIC) 2012 Report: Networking for High Energy Physics, 2012,

### D. Y. Parkinson, C. Yang, C. Knoechel, C. A. Larabell, M. Le Gros, "Automatic alignment and reconstruction of images for soft X-ray tomography", J Struct Biol, February 2012, 177:259--266, doi: 10.1016/j.jsb.2011.11.027

### P. Ghysels, P. Kłosiewicz, W. Vanroose, "Improving the arithmetic intensity of multigrid with the help of polynomial smoothers", Numerical Linear Algebra with Applications, February 1, 2012, 19:2, doi: 10.1002/nla.1808

### L. Lin, J. Lu, L. Ying and W. E, "Adaptive local basis set for Kohn-Sham density functional theory in a discontinuous Galerkin framework I: Total energy calculation", J. Comput. Phys., 2012, 231:2140,

### D. Flammini, A. Pietropaolo, R. Senesi, C. Andreani, F. McBride, A. Hodgson, M. Adams, L. Lin, and R. Car, "Spherical momentum distribution of the protons in hexagonal ice from modeling of inelastic neutron scattering data", J. Chem. Phys., 2012, 136:024504,

### M. Kawai, T. Iwashita, H. Nakashima and O. Marques, "Parallel Smoother Based on Block Red-Black Ordering for Multigrid Poisson Solver", LNCS, Proc. VECPAR 2012, Kobe, Japan, Springer, 2012, 7851:292-299,

### Tudor David, Mathias Jacquelin, Loris Marchal, "Scheduling streaming applications on a complex multicore platform", Concurrency and Computation: Practice and Experience, 2012, 24:1726--1750,

### Zaiwen Wen, Chao Yang, Xin Liu, Stefano Marchesini, "Alternating direction methods for classical and ptychographic phase retrieval", Inverse Problems, January 2012, 28:115010,

### J. González-Domínguez, O. Marques, M. Martín, G. Taboada, J. Touriño, "Design and Performance Issues of Cholesky and LU Solvers using UPCBLAS", 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, Madrid, 2012,

### 2011

### Ichitaro Yamazaki, Xiaoye Sherry Li, François-Henry Rouet, Bora Uçar, "Partitioning, Ordering and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver", Institut National Polytechnique de Toulouse, RT-APO-12-2, November 2011,

- Download File: reportPDSLin.pdf (pdf: 634 KB)

### L. Lin, C. Yang, J. Lu, L. Ying, W. E, "A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2D electronic structure calculations", SIAM J. Sci. Comput., 2011, 33:1329,

### R. Ryne, B. Austin, J. Byrd, J. Corlett, E. Esarey, C. G. R. Geddes, W. Leemans, X. Li, Prabhat, J. Qiang, O. Rübel, J.-L. Vay, M. Venturini, K. Wu, B. Carlsten, D. Higdon and N. Yampolsky, "High Performance Computing in Accelerator Science: Past Successes, Future Challenges", Workshop on Data and Communications in Basic Energy Sciences: Creating a Pathway for Scientific Discovery, October 2011,

### H. M. Aktulga, C. Yang, U. V. Catalyurek, P. Maris, J. P. Vary, E. G. Ng, "On Reducing I/O Overheads in Large-Scale Invariant Subspace Projections", Euro-Par 2011: Parallel Processing Workshops, Bordeaux, France, August 29, 2011, LNCS 715:305-314, doi: 10.1007/978-3-642-29737-3_35

### L. Lin, C. Yang, J. Meza, J. Lu, L. Ying, W. E, "SelInv -- An algorithm for selected inversion of a sparse symmetric matrix", ACM Trans. Math. Software, 2011, 37:40,

### L. Lin, J.A. Morrone and R. Car, "Correlated tunneling in hydrogen bonds", J. Stat. Phys., 2011, 145:365,

### J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

### E. G. Ng, J. Sarich, S. M.Wild, T. Munson, H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, N. Schunck, M. G. Bertolli, M. Kortelainen, W. Nazarewicz, T. Papenbrock, M. V. Stoitsov, "Advancing Nuclear Physics Through TOPS Solvers and Tools", SciDAC 2011 Conference, Denver, CO, July 10, 2011, arXiv:1110.1708,

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Large-scale Parallel Null Space Calculation for Nuclear Configuration Interaction", 2011 International Conference on High Performance Computing and Simulation (HPCS), Istanbul, Turkey, July 8, 2011, 176 - 185, doi: 10.1109/HPCSim.2011.5999822

### J. Gu, D. Katramatos, X. Liu, V. Natarajan, A. Shoshani, A. Sim, D. Yu, S. Bradley, S. McKee, "StorNet: Integrated Dynamic Storage and Network Resource Provisioning and Management for Automated Data Transfers", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/1/012002

### G. Garzoglio, J. Bester, K. Chadwick, D. Dykstra, D. Groep, J. Gu, T. Hesselroth, O. Koeroo, T. Levshina, S. Martin, M. Salle, N. Sharma, A. Sim, S. Timm, A. Verstegen, "Adoption of a SAML-XACML Profile for Authorization Interoperability across Grid Middleware in OSG and EGEE", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/6/062011

### A. Buluç, J. R. Gilbert, V. B. Shah, "Implementing Sparse Matrices for Graph Algorithms", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### A. Buluç, J. R. Gilbert, "New Ideas in Sparse Matrix-Matrix Multiplication", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### A. Buluç, J. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications", International Journal of High Performance Computing Applications (IJHPCA), 2011,

- Download File: combblas-r2.pdf (pdf: 288 KB)

### L. Lin, J.A. Morrone, R. Car, M. Parrinello, "Momentum distribution, vibrational dynamics and the potential of the mean force in ice", Phys. Rev. B (Rapid Communication), 2011, 83:220302,

### Aydın Buluç, Samuel Williams, Leonid Oliker, James Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", IPDPS, IEEE, 2011, doi: https://doi.org/10.1109/IPDPS.2011.73

- Download File: ipdps2011.pdf (pdf: 770 KB)

### Junmin Gu, Dimitrios Katramatos, Xin Liu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Dantong Yu, Scott Bradley, Shawn McKee, "StorNet: Co-Scheduling of End-to-End Bandwidth Reservation on Storage and Network Systems for High Performance Data Transfers", IEEE INFOCOM HSN 2011, 2011,

### L. Lin, J. Lu and L. Ying, "Fast construction of hierarchical matrix representation from matrix-vector multiplication", J. Comput. Phys., 2011, 230:4071,

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

### Mathias Jacquelin, "Memory-aware algorithms and scheduling techniques: from multicore processors to petascale supercomputers", Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, Pages: 2038--2041 2011,

### Dean N. Williams, Ian T. Foster, Don E. Middleton, Rachana Ananthakrishnan, Neill Miller, Mehmet Balman, Junmin Gu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Gavin Bell, Robert Drach, Michael Ganzberger, Jim Ahrens, Phil Jones, Daniel Crichton, Luca Cinquini, David Brown, Danielle Harper, Nathan Hook, Eric Nienhouse, Gary Strand, Hannah Wilcox, Nathan Wilhelmi, Stephan Zednik, Steve Hankin, Roland Schweitzer, John Harney, Ross Miller, Galen Shipman, Feiyi Wang, Peter Fox, Patrick West, Stephan Zednik, Ann Chervenak, Craig Ward, "Earth System Grid Center for Enabling Technologies (ESG-CET): A Data Infrastructure for Data-Intensive Climate Research", SciDAC Conference, 2011,

### Mathias Jacquelin, Loris Marchal, Yves Robert, Bora Uçar, "On optimal tree traversals for sparse matrix factorization", Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, 2011, 556--567,

### Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, Yves Robert, "Tiled QR factorization algorithms", Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, 7,

### Franck Cappello, Mathias Jacquelin, Loris Marchal, Yves Robert, Marc Snir, "Comparing archival policies for Blue Waters", High Performance Computing (HiPC), 2011 18th International Conference on, 2011, 1--10,

### Xiaoye S. Li, Meiyue Shao, "A supernodal approach to incomplete LU factorization with partial pivoting", ACM Transactions on Mathematical Software, 2011, 37:43:1--43:2, doi: 10.1145/1916461.1916467

### Filipe RNC Maia, Chao Yang, Stefano Marchesini, "Compressive auto-indexing in femtosecond nanocrystallography", Ultramicroscopy, 2011, 111:807--811, LBNL 4598E,

### 2010

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Multiscale Modeling of Viscoelastic Plant Tissue", International Journal for Multiscale Computational Engineering, 2010, 8:4, doi: 10.1615/IntJMultCompEng.v8.i4.30

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Coarse Implicit Time Integration of a Cellular Scale Particle Model for Plant Tissue Deformation", International Journal for Multiscale Computational Engineering, 2010, 8, doi: 10.1615/IntJMultCompEng.v8.i4.50

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

### P. Van Liedekerke, E. Tijskens, H. Ramon, P. Ghysels, G. Samaey, D. Roose, "Particle-based model to simulate the micromechanics of biological cells", Physical Review E, June 3, 2010, 81:6, doi: 10.1103/PhysRevE.81.061906

### E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

### A. Buluç, J. R. Gilbert, C. Budak, "Solving path problems on the GPU", Parallel Computing, 36(5-6):241-253, 2010, doi: http://dx.doi.org/10.1016/j.parco.2009.12.002

- Download File: parcoapsp.pdf (pdf: 160 KB)

### P. Van Liedekerke, P. Ghysels, E. Tijskens, G. Samaey, B. Smeedts, D. Roose, H. Ramon, "A particle-based model to simulate the micromechanics of single-plant parenchyma cells and aggregates", Physical Biology, May 26, 2010, 7:2, doi: 10.1088/1478-3975/7/2/026006

### L. Lin, J.A. Morrone, R. Car and M. Parrinello, "Displaced path integral formulation for the momentum distribution of quantum particles", Phys. Rev. Lett., 2010, 105:110602,

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

### Matthieu Gallet, Mathias Jacquelin, Loris Marchal, "Scheduling complex streaming applications on the Cell processor", Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, 1--8,

### Ichitaro Yamazaki, Zhaojun Bai, Horst D. Simon, Lin-Wang Wang, Kesheng Wu, "Adaptive Projection Subspace Dimension for the Lanczos Method", ACM Transactions on Mathematical Software, 2010, 37, doi: 10.1145/1824801.1824805

### 2009

### J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, "Accelerating Time-to-Solution for Computational Science and Engineering", SciDAC Review, Number 15, December 2009,

### L. Lin, J. Lu, L. Ying, W. E, "Pole-based approximation of the Fermi-Dirac function", Chin. Ann. Math., 2009, 30B:729,

### A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks", SPAA '09 Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, doi: http://dx.doi.org/10.1145/1583991.1584053

- Download File: csb2009.pdf (pdf: 347 KB)

### L. Lin, J. Lu, L. Ying, R. Car, W. E, "Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems", Commun. Math. Sci., 2009, 7:755,

### M. Riedel, E. Laure, Th. Soddemann, L. Field, J. P. Navarro, J. Casey, M. Litmaath, J. Ph. Baud, B. Koblitz, C. Catlett, D. Skow, C. Zheng, P. M. Papadopoulos, M. Katz, N. Sharma, O. Smirnova, B. Kónya, P. Arzberger, F. Würthwein, A. S. Rana, T. Martin, M. Wan, V. Welch, T. Rimovsky, S. Newhouse, A. Vanni, Y. Tanaka, Y. Tanimura, T. Ikegami, D. Abramson, C. Enticott, G. Jenkins, R. Pordes, N. Sharma, S. Timm, N. Sharma, G. Moont, M. Aggarwal, D. Colling, O. van der Aa, A. Sim, V. Natarajan, A. Shoshani, J. Gu, S. Chen, G. Galang, R. Zappi, L. Magnoni, V. Ciaschini, M. Pace, V. Venturi, M. Marzolla, P. Andreetto, B. Cowles, S. Wang, Y. Saeki, H. Sato, S. Matsuoka, P. Uthayopas, S. Sriprayoonsakul, O. Koeroo, M. Viljoen, L. Pearlman, S. Pickles, David Wallom, G. Moloney, J. Lauret, J. Marsteller, P. Sheldon, S. Pathak, S. De Witt, J. Mencák, J. Jensen, M. Hodges, D. Ross, S. Phatanapherom, G. Netzer, A. R. Gregersen, M. Jones, S. Chen, P. Kacsuk, A. Streit, D. Mallmann, F. Wolf, T. Lippert, Th. Delaitre, E. Huedo, N. Geddes, "Interoperation of world-wide production e-Science infrastructures", Concurrency and Computation: Practice and Experience, 2009, 21(8):961-990,

### Arie Shoshani, Flavia Donno, Junmin Gu, Jason Hick, Maarten Litmaath, Alex Sim, "Dynamic Storage Management", Scientific Data Management: Challenges, Technology, and Deployment, edited by Arie Shoshani, Doron Rotem, (Chapman & Hall/CRC Computational Science: 2009)

### J.A. Morrone, L. Lin, R. Car, "Tunneling and delocalization effects in hydrogen bonded systems: A study in position and momentum space", J. Chem. Phys., 2009, 130:204511,

### P. Ghysels, G. Samaey, B. Tijskens, P. Van Liedekerke, H. Ramon, D. Roose, "Multi-scale simulation of plant tissue deformation using a model for individual cell mechanics", Physical Biology, March 25, 2009, 6:1, doi: 10.1088/1478-3975/6/1/016009

### L. Lin, J. Lu, R. Car, W. E, "Multipole representation of the Fermi operator with application to the electronic structure analysis of metallic systems", Phys. Rev. B, 2009, 79:115133,

### Xiaoye S. Li, Meiyue Shao, Ichitaro Yamazaki, Esmond G. Ng, "Factorization-based sparse solvers and preconditioners", (SciDAC 2009) Journal of Physics: Conference Series 180(2009) 012015, 2009, doi: 10.1088/1742-6596/180/1/012015

### Mathias Jacquelin, Loris Marchal, Yves Robert, "Complexity analysis and performance evaluation of matrix product on multicore architectures", Parallel Processing, 2009. ICPP 09. International Conference on, 2009, 196--203,

### K. Wu et al., "FastBit: Interactively Searching Massive Data", SciDAC 2009, 2009, LBNL 2164E, doi: 10.1088/1742-6596/180/1/012053

- Download File: LBNL-2164E.pdf (pdf: 3.2 MB)

### 2008

### A. Buluç, J. Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", Proceedings of the 37th International Conference on Parallel Processing (ICPP), 2008, doi: 10.1109/ICPP.2008.45

- Download File: spgemmicpp08.pdf (pdf: 206 KB)

### P. Jakl, J. Lauret, A. Hanushevsky, A. Shoshani, A. Sim, J. Gu, "Grid data access on widely distributed worker nodes using scalla and SRM", Journal of Physics: Conf. Ser., 2008, 119, doi: 10.1088/1742-6596/119/7/072019

### S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

- Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

### O. Marques, J. Demmel, C. Voemel, B. Parlett, "A Testing Infrastructure for Symmetric Tridiagonal Eigensolvers", ACM TOMS, 2008, 35,

### Alex Sim, Arie Shoshani (Editors), Paolo Badino, Olof Barring, Jean‐Philippe Baud, Ezio Corso, Shaun De Witt, Flavia Donno, Junmin Gu, Michael Haddox‐Schatz, Bryan Hess, Jens Jensen, Andy Kowalski, Maarten Litmaath, Luca Magnoni, Timur Perelmutov, Don Petravick, Chip Watson, The Storage Resource Manager Interface Specification Version 2.2, Open Grid Forum, Document in Full Recommendation, GFD.129, 2008,

### A. Buluç, J.R. Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008, doi: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2008.4536313

- Download File: hypersparse-ipdps08.pdf (pdf: 194 KB)

### Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine A. Yelick, James Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Parallel Computing, 2008, 35:38, doi: 10.1016/j.parco.2008.12.006

- Download File: parco08-spmv.pdf (pdf: 1.5 MB)

### A. Canning, O. Marques, C. Voemel, L.-W. Wang, J. Dongarra, J. Langou, S. Tomov, "New eigensolvers for large-scale nanoscience simulations", Journal of Physics: Conference Series, 2008, 125, doi: 10.1088/1742-6596/125/1/012074

### C. Vömel, S.Z. Tomov, O.A. Marques, A. Canning, L.-W. Wang, J.J. Dongarra, "State-of-the-art eigensolvers for electronic structure calculations of large scale nano-systems", Journal of Computational Physics, 2008, 227:7113-7124, doi: 10.1016/j.jcp.2008.01.018

### J. Demmel, O. Marques, C. Voemel, B. Parlett, "Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers", SIAM Journal on Scientific Computing, 2008, 30:1508–1526,

### 2007

### Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

- Download File: sc07-spmv.pdf (pdf: 438 KB)

### L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, F. Donno, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Managers: Recent International Experience on Requirements and Multiple Co-Operating Implementations", the 24th IEEE Conference on Mass Storage Systems and Technologies, 2007,

### F. Donno, L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Manager version 2.2: design, implementation, and testing experience", Journal of Physics: Conf. Ser., 2007, 119, doi: 10.1088/1742-6596/119/6/062028

### 2006

### O. Marques, B. Parlett, C. Voemel, "Computations of Eigenpair Subsets with the MRRR Algorithm", Numer. Linear Algebra Appl., 2006, 13:643–653,

### W. Kramer, J. Carter, D. Skinner, L. Oliker, P. Husbands, P. Hargrove, J. Shalf, O. Marques, E. Ng, A. Drummond, K. Yelick, "Software Roadmap to Plug and Play Petaflop/s", 2006,

### O. Marques, C. Voemel, J. Riedy, "Benefits of IEEE-754 Features in Modern Symmetric Tridiagonal Eigensolvers", SIAM J. Sci. Comput., 2006, 28:1613-1633,

### J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, J. Riedy, C. Vömel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, S. Tomov, "Prospectus for the next LAPACK and ScaLAPACK Libraries", PARA 2006, Umeå, Sweden, 2006,

### 2005

### Kesheng Wu, Junmin Gu, Jerome Lauret, Arthur Poskanzer, Arie Shoshani, Alexander Sim, Zhang, "Grid Collector: Facilitating Efficient Selective from Data Grids", International Supercomputer Conference 2005, 2005,

### 2004

### K. Wu, W. Zhang, A. Sim, J. Gu, A. Shoshani, "Grid Collector: an Event Catalog with Automated File Management", 2004, LBNL 55563,

### Alex Sim, Junmin Gu, Arie Shoshani, Vijaya Natarajan, "DataMover: Robust Terabytes-Scale Multi-file Replication over Wide-Area Networks", the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), 2004,

### 2003

### Arie Shoshani, Alexander Sim, Junmin Gu, "Storage Resource Managers: Essential Components for the Grid", Grid Resource Management: State of the Art and Future Trends, edited by Jarek Nabrzyski, Jennifer M. Schopf, Jan Weglarz, (Kluwer Academic Publishers: 2003)

### A. Sim, J. Gu, A. Shoshani, E. Hjort, D. Olson, "Experience with Deploying Storage Resource Managers to Achieve Robust File Replication", Computing in High Energy Physics, 2003,

### Arie Shoshani, Alex Sim, Junmin Gu, Storage Resource Managers: Essential Components for Grid Applications, Globus World, 2003,

### D. Vasco, L. Johnson, O. Marques, "Resolution, Uncertainty and Whole Earth Tomography", Journal of Geophysical Research, Solid Earth, 2003, 108,

### Kesheng Wu, Wei-Ming Zhang, Alexander Sim, Gu, Arie Shoshani, "Grid Collector: An Event Catalog With Automated File", Proceedings of IEEE Nuclear Science Symposium 2003, 2003, doi: 10.1109/NSSMIC.2003.1351830

### 2002

### A. Shoshani, A. Sim, J. Gu, "Storage Resource Managers: Middleware components for Grid Storage", the 19th IEEE Symposium on Mass Storage Systems, 2002,

### B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, "Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines", International Parallel & Distributed Processing Symposium (IPDPS), 2002,

- Download File: ipdps02-iram.pdf (pdf: 91 KB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

- Download File: sirev02-sparse.pdf (pdf: 475 KB)

### 2001

### L. Oliker, R. Biswas, P. Husbands, X. Li, Ordering Sparse Matrices for Cache-Based Systems, SIAM Conference on Parallel Processing, 2001,

- Download File: siampp01abstactb.pdf (pdf: 2.1 MB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

- Download File: ai01.pdf (pdf: 163 KB)

### 2000

### L. Oliker, X. Li, G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th International Conference on Parallel and Distributed Computing Systems, 2000,

- Download File: pdcs00-pcg.pdf (pdf: 167 KB)

### L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on solving Irregularly Structured Problems in Parallel, 2000,

- Download File: irr00awk.pdf (pdf: 130 KB)

### B. Parlett, O. Marques, "An Implementation of the dqds Algorithm (positive case)", Linear Algebra and its Applications, 2000, 309:217-259,

### 1996

### S. Chatterjee, J. Gilbert, L. Oliker, R. Schreiber, and T. Sheffler, "Algorithms for Automatic Alignment of Arrays", Journal of Parallel and Distributed Computing (JPDC), July 1996,

- Download File: jpdc96.ps.gz (gz: 89 KB)