# Publications

### 2018

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes", The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'18), November 13, 2018,

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Specification v1.0, Draft 8", Lawrence Berkeley National Laboratory Tech Report, September 26, 2018, LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer's Guide, v1.0-2018.9.0", Lawrence Berkeley National Laboratory Tech Report, September 26, 2018, LBNL 2001180, doi: 10.25344/S49G6V

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Meiyue Shao, Felipe H. da Jornada, Lin Lin, Chao Yang, Jack Deslippe, Steven G. Louie, "A structure preserving Lanczos algorithm for computing the optical absorption spectrum", SIAM Journal on Matrix Analysis and Applications, 2018, 39:683–711, doi: 10.1137/16M1102641

### Junmin Gu, Scott Klasky, Norbert Podhorszki, Ji Qiang, Kesheng Wu, "Querying Large Scientific Data Sets with Adaptable IO System ADIOS", Supercomputing Frontiers (Best Paper Award), Springer International Publishing, 2018, 51-69,

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2018.3.0", Lawrence Berkeley National Laboratory Tech Report, March 31, 2018, LBNL 2001136, doi: 10.2172/1430693

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian van Straalen, "UPC++ Specification v1.0, Draft 6", Lawrence Berkeley National Laboratory Tech Report, March 26, 2018, LBNL 2001135, doi: 10.2172/1430689

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### H. Zhan, G. Gomes, X. S. Li, K. Madduri, A. Sim, K. Wu, "Consensus Ensemble System for Traffic Flow Prediction", IEEE Transactions on Intelligent Transportation Systems, 2018, doi: 10.1109/TITS.2018.2791505

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes", Poster at Exascale Computing Project (ECP) Annual Meeting 2018, February 2018,

### Meiyue Shao, Hasan Metin Aktulga, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "Accelerating nuclear configuration interaction calculations through a preconditioned block iterative eigensolver", Computer Physics Communications, 2018, 222:1–13, doi: 10.1016/j.cpc.2017.09.004

### 2017

### John Bachan, Dan Bonachea, Paul H Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, Scott Baden, "The UPC++ PGAS library for exascale computing", PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017, November 12, 2017, doi: 10.1145/3144779.3169108

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### Yang You, Aydin Buluc, James Demmel, "Scaling deep learning on GPU and Knights Landing clusters", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17), 2017,

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, "UPC++: a PGAS C++ Library", ACM/IEEE Conference on Supercomputing, SC'17, November 2017,

### Meiyue Shao and Chao Yang, "Properties of Definite Bethe–Salpeter Eigenvalue Problems", Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing. EPASA 2015. Lecture Notes in Computational Science and Engineering, vol 117, 2017, 91–105, doi: 10.1007/978-3-319-62426-6_7

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ Programmer’s Guide, v1.0-2017.9", Lawrence Berkeley National Laboratory Tech Report, September 29, 2017, LBNL 2001065, doi: 10.2172/1398522

This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136)

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian van Straalen, "UPC++ Specification v1.0, Draft 4", Lawrence Berkeley National Laboratory Tech Report, September 27, 2017, LBNL 2001066, doi: 10.2172/1398521

This document has been superseded by: UPC++ Specification v1.0, Draft 6 (LBNL-2001135)

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Ariful Azad, Mathias Jacquelin, Aydin Buluc, Esmond G. Ng, "The Reverse Cuthill-McKee Algorithm in Distributed-Memory", IEEE International Parallel & Distributed Processing Symposium (IPDPS), Orlando, FL, May 2017,

- Download File: RCM-ipdps17.pdf (pdf: 1.1 MB)

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes", Poster at Exascale Computing Project (ECP) Annual Meeting 2017, January 2017,

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", Lecture Notes in Computational Science, January 1, 2017,

### 2016

### M. Jacquelin, L. Lin and C. Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion: the non-symmetric case", PMAA, December 30, 2016,

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

### Utkarsh Ayachit, Andrew Bauer, Earl P. N. Duque, Greg Eisenhauer, Nicola Ferrier, Junmin Gu, Kenneth E. Jansen, Burlen Loring, Zarija Lukić, Suresh Menon, Dmitriy Morozov, Patrick O'Leary, Reetesh Ranjan, Michel Rasquin, Christopher P. Stone, Venkat Vishwanath, Gunther H. Weber, Brad Whitlock, Matthew Wolf, K. John Wu, E. Wes Bethel, "Performance analysis, design considerations, and applications of extreme-scale in situ infrastructures", Supercomputing, 2016, 921-932, LBNL 1007264, doi: 10.1109/SC.2016.78

### Mark Adams, Samuel Williams, HPGMG BoF - Introduction, HPGMG BoF, Supercomputing, November 2016,

- Download File: SC16-HPGMG-BoF-Intro.pdf (pdf: 1020 KB)

### Utkarsh Ayachit, Andrew Bauer, Earl P. N. Duque, Greg Eisenhauer, Nicola Ferrier, Junmin Gu, Kenneth Jansen, Burlen Loring, Zarija Lukić, Suresh Menon, Dmitriy Morozov, Patrick O'Leary, Michel Rasquin, Christopher P. Stone, Venkat Vishwanath, Gunther H. Weber, Brad Whitlock, Matthew Wolf, K. John Wu, E. Wes Bethel, "Performance Analysis, Design Considerations, and Applications of Extreme-scale In Situ Infrastructures", ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, UT, USA, 2016, doi: 10.1109/SC.2016.78

### Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams, "Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication", SIAM Journal on Scientific Computing, 38(6), C624–C651, November 2016, doi: 10.1137/15M104253X

- Download File: SISC-SpGEMM.pdf (pdf: 1.5 MB)

### Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary, "A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations", IEEE Transactions on Parallel and Distributed Systems (TPDS), November 2016, doi: 10.1109/TPDS.2016.2630699

- Download File: ieeetpds-mfdn-lobpcg-rev.pdf (pdf: 889 KB)

### Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

### A. S. Banerjee, L. Lin, W. Hu, C. Yang and J. E. Pask, "Chebyshev polynomial filtered subspace iteration in the discontinuous Galerkin method for large-scale electronic structure calculations", Journal of Chemical Physics, October 1, 2016,

### Veronika Strnadova-Neeley, Aydin Buluc, John R. Gilbert, Leonid Oliker, Weimin Ouyang, "LiRa: A New Likelihood-Based Similarity Score for Collaborative Filtering", August 30, 2016,

### Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad, "A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems", SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534, 2016, doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.

### Osni Marques, Paulo B. Vasconcelos, "Computing the Bidiagonal SVD through an Associated Tridiagonal Eigenproblem", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Naoya Nomura, Akihiro Fujii, Teruo Tanaka, Kengo Nakajima, Osni Marques, "Performance Analysis of SA-AMG Method by Setting Extracted Near-kernel Vectors", VECPAR 2016, Porto, Portugal, Springer, June 2016,

### Fabien Bruneval, Tonatiuh Rangel, Samia M. Hamed, Meiyue Shao, Chao Yang, Jeffrey B. Neaton, "MOLGW 1: many-body perturbation theory software for atoms, molecules, and clusters", Computer Physics Communications, 2016, 208:149–161, doi: 10.1016/j.cpc.2016.06.019

### Osni Marques, Alex Druinsky, Xiaoye S. Li, Andrew T. Barker, Panayot Vassilevski, Delyan Kalchev, "Tuning the Coarse Space Construction in a Spectral AMG Solver", ICCS 2016 (The International Conference on Computational Science), San Diego, CA, Elsevier, June 2016,

### Meiyue Shao, Lin Lin, Chao Yang, Fang Liu, Felipe H. da Jornada, Jack Deslippe and Steven G. Louie, "Low rank approximation in G0W0 calculations", Science China Mathematics, June 4, 2016, 59:1593–1612, doi: 10.1007/s11425-016-0296-x

### Mathias Jacquelin, Lin Lin, Nathan Wichmann, Chao Yang, Enhancing scalability and load balancing of Parallel Selected Inversion via tree-based asynchronous communication, IPDPS 16, May 24, 2016,

### D. Pugmire, J. Kress, H. Childs, M. Wolf, G. Eisenhauer, J. Low, R. M. Churchill, T. Kurc, K. Wu, A. Sim, J. Gu, J. Choi, S. Klasky, "Visualization and Analysis for Near-Real-Time Decision Making in Distributed Workflows", High Performance Data Analysis and Visualization Workshop (HPDAV2016) in conjunction with the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016), 2016, doi: 10.1109/IPDPSW.2016.175

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, The 11th Scheduling for Large Scale Systems Workshop, May 18, 2016,

### Mathias Jacquelin, Yili Zheng, Esmond Ng, Katherine Yelick, "An Asynchronous Task-based Fan-Both Sparse Cholesky Solver", Submitted to SuperComputing'16, May 10, 2016,

### Mathias Jacquelin, Lin Lin, Weile Jia, Yonghua Zhao, Chao Yang, "A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems", Submitted to SuperComputing'16, May 10, 2016,

### Mathias Jacquelin, Scheduling Sparse Symmetric Fan-Both Cholesky Factorization, SIAM PP'16, April 15, 2016,

### M. Jacquelin, L. Lin, W. Jia, Y. Zhao and C. Yang, "A Left-looking selected inversion algorithm and task parallelism on shared memory systems", April 9, 2016,

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

### Samuel Williams, Mark Adams, Brian Van Straalen, Performance Portability in Hybrid and Heterogeneous Multigrid Solvers, Copper Mountain, March 2016,

- Download File: CU16SWWilliams.pptx (pptx: 1 MB)

### Z. Wen, C. Yang, X. Liu and Y. Zhang, "A Penalty-based Trace Minimization Method for Large-scale Eigenspace Computation", J. Sci. Comp., March 1, 2016, 66:1175-1203, doi: 10.1007/s10915-015-0061-0

### E. Vecharynski, C. Yang, and F. Xue, "Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems", SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527, 2016, doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### E. Vecharynski and C. Yang, "Preconditioned iterative methods for eigenvalue counts", to appear in Proceedings of International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, in Lecture Notes in Computational Science and Engineering, Springer, 2016,

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our method is based on the Lanczos- and Arnoldi-type of iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.

### Wei Hu, Lin Lin, Chao Yang, Jun Dai and Jinlong Yang, "Edge-Modified Phosphorene Nanoflake Heterojunctions as Highly Efficient Solar Cells", Nano Lett, February 5, 2016, 16:1675–1682, doi: 10.1021/acs.nanolett.5b04593

### L. Lin, Y. Saad and C. Yang, "Approximating spectral densities of large matrices", SIAM Review, February 1, 2016, 58:34–65, doi: 10.1137/130934283

### P. Li, X. Liu, M. Chen, P. Lin, X. Ren, L. Lin, C. Yang, L. He, "Large-scale ab initio simulations based on systematically improvable atomic basis", Computational Materials Science, February 1, 2016, 112:503–517, doi: 10.1016/j.commatsci.2015.07.004

### J. Brabec, C. Yang, E. Epifanovsky, A.I. Krylov, and E. Ng, "Reduced-cost sparsity-exploiting algorithm for solving coupled-cluster equations", Journal of Computational Chemistry, January 24, 2016, 37:1059–1067, doi: 10.1002/jcc.24293

### Burlen Loring, Suren Byna, Prabhat, Junmin Gu, Hari Krishnan, Michael Wehner, and Oliver Ruebel, "TECA an Extreme Event Detection and Climate Analysis Package for High Performance Computing", The AMS (American Meteorological Society) 96th Annual Meeting, January 6, 2016,

### Meiyue Shao, Felipe H. da Jornada, Chao Yang, Jack Deslippe, Steven G. Louie, "Structure preserving parallel algorithms for solving the Bethe–Salpeter eigenvalue problem", Linear Algebra and its Applications, 2016, 488:148–167, doi: 10.1016/j.laa.2015.09.036

### 2015

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

### M. Jacquelin, L. Lin, N. Wichmann and C. Yang, "Enhancing the scalability tree-based asynchronous communication", accepted IPDPS16, November 25, 2015,

### Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light-sources. This is an important tool for characterization of macromolecules and nano-particle systems applicable to applications such as design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.

We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.

We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

### E. Vecharynski, J. Brabec, M. Shao, N. Govind, C. Yang, "Efficient Block Preconditioned Eigensolvers for Linear Response Time-dependent Density Functional Theory", submitted to JCC, 2015,

We present two efficient iterative algorithms for solving the linear response eigenvalue problem arising from the time dependent density functional theory. Although the matrix to be diagonalized is nonsymmetric, it has a special structure that can be exploited to save both memory and floating point operations. In particular, the nonsymmetric eigenvalue problem can be transformed into a product eigenvalue problem that is self-adjoint with respect to a K-inner product. This product eigenvalue problem can be solved efficiently by a modified Davidson algorithm and a modified locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm that make use of the K-inner product. The solution of the product eigenvalue problem yields one component of the eigenvector associated with the original eigenvalue problem. However, the other component of the eigenvector can be easily recovered in a postprocessing procedure. Therefore, the algorithms we present here are more efficient than existing algorithms that try to approximate both components of the eigenvectors simultaneously. The efficiency of the new algorithms is demonstrated by numerical examples.

### Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Nicholas Knight, Hong Diep Nguyen, "Reconstructing Householder vectors from Tall-Skinny QR", Journal of Parallel and Distributed Computing, November 1, 2015, 85:3-31, doi: 10.1016/j.jpdc.2015.06.003

### M. van Setten; F. Caruso; S. Sharifzadeh; X. Ren; M. Scheffler; F. Liu; J. Lischner; L. Lin; J. Deslippe; S. Louie; C. Yang; F. Weigend; J. Neaton; F. Evers; P. Rinke, "GW 100: Benchmarking G0W0 for molecular systems", Journal of Chemical Theory and Computation, October 22, 2015,

### Jiri Brabec, Lin Lin, Meiyue Shao, Niranjan Govind, Chao Yang, Yousef Saad, Esmond G. Ng, "Fast Algorithms for Estimating the Absorption Spectrum within Linear Response Time-dependent Density Functional Theory", Journal of Chemical Theory and Computation, 2015, 11:5197–5208, doi: 10.1021/acs.jctc.5b00887

### Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

### Abhinav Sarje, Xiaoye Li, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Reconstructing Nanostructures from X-Ray Scattering Data", OLCF User Meeting, June 2015,

### M. Ulbrich, Z. Wen, C. Yang, D. Klockner, Z. Lu, "A proximal gradient method for ensemble density functional theory", SIAM J. Sci. Comp., June 20, 2015, 37:A1975--A20, doi: 10.1137/14098973X

### Mathias Jacquelin, Lin Lin, Chao Yang, "A Distributed Memory Parallel Algorithm for Selected Inversion: the Symmetric Case", To appear in ACM Transactions on Mathematical Software (TOMS), May 28, 2015,

### Ariful Azad, Aydin Buluc, John Gilbert, "Parallel Triangle Counting and Enumeration using Matrix Algebra", Workshop on Graph Algorithms Building Blocks (GABB), in conjunction with IPDPS, IEEE, May 2015,

- Download File: triangles-gabb.pdf (pdf: 384 KB)

### C. Yang, Absorption Spectrum Estimation via Linear Response TDDFT, Applied Math Seminar, Stanford University, May 13, 2015,

### C. Yang, Fast Numerical Algorithms for Large-scale Electronic Structure Calculations, DOE BES Computational and Theoretical Chemistry PI Meeting, April 28, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Math Colloquium, Michigan Tech University, April 24, 2015,

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Applied math & PDE seminar, UC Davis, April 14, 2015,

### Fang Liu, Lin Lin, Derek Vigil-Fowler, Johannes Lischner, Alexander F. Kemper, Sahar Sharifzadeh, Felipe H. da Jornada, Jack Deslippe, Chao Yang, Jeffrey B. Neaton, Steven G. Louie, "Numerical integration for ab initio many-electron self energy calculations within the GW approximation", Journal of Computational Physics, April 1, 2015,

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.
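For readers unfamiliar with Particle Swarm Optimization, the following is a minimal serial sketch of the basic (global-best) variant. It is not the authors' GPU implementation: the objective `sphere`, the function name `pso`, and the coefficient values (`w`, `c1`, `c2`) are illustrative assumptions, and a real fitting code would replace the objective with a comparison between simulated and measured scattering data.

```python
import random


def sphere(x):
    """Toy objective (assumed for illustration): minimum 0 at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim, n_particles=30, iters=200,
        w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


best_pos, best_val = pso(sphere, dim=2)
```

Each particle's objective evaluation is independent of the others within an iteration, which is the kind of structure that lends itself to the massively parallel execution the abstract describes.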

### C. Yang, Fast Numerical Methods for Computational Materials Science and Chemistry, CRD All-hands meeting, March 4, 2015,

### Marc Baboulin, Xiaoye S. Li, Francois-Henry Rouet, "Using random butterfly transformations to avoid pivoting in sparse direct methods", High Performance Computing for Computational Science - VECPAR 2014, Lecture Notes in Computer Science, Springer. Preprint, 2015,

### E. Vecharynski, C. Yang, J. E. Pask, "A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix", Journal of Computational Physics, Vol. 290, pp. 73–89, 2015,

We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix *A*. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of *A*. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.

### Wei Hu, Lin Lin and Chao Yang, "Edge reconstruction in armchair phosphorene nanoribbons revealed by discontinuous Galerkin density functional theory", Phys. Chem. Chem. Phys., 2015, Advance Article, February 11, 2015, doi: 10.1039/C5CP00333D

With the help of our recently developed massively parallel DGDFT (Discontinuous Galerkin Density Functional Theory) methodology, we perform large-scale Kohn–Sham density functional theory calculations on phosphorene nanoribbons with armchair edges (ACPNRs) containing a few thousand to ten thousand atoms. The use of DGDFT allows us to systematically achieve a conventional plane wave basis set type of accuracy, but with a much smaller number (about 15) of adaptive local basis (ALB) functions per atom for this system. The relatively small number of degrees of freedom required to represent the Kohn–Sham Hamiltonian, together with the use of the pole expansion and selected inversion (PEXSI) technique that circumvents the need to diagonalize the Hamiltonian, results in a highly efficient and scalable computational scheme for analyzing the electronic structures of ACPNRs as well as their dynamics. The total wall clock time for calculating the electronic structures of large-scale ACPNRs containing 1080–10 800 atoms is only 10–25 s per self-consistent field (SCF) iteration, with accuracy fully comparable to that obtained from conventional planewave DFT calculations. For the ACPNR system, we observe that the DGDFT methodology can scale to 5000–50 000 processors. We use DGDFT based ab initio molecular dynamics (AIMD) calculations to study the thermodynamic stability of ACPNRs. Our calculations reveal that a 2 × 1 edge reconstruction appears in ACPNRs at room temperature.

### C. Yang, Fast Numerical Methods for Electronic Structure Calculations, Workshop on High Performance and Parallel Computing Methods and Algorithms for Materials Defects, Singapore, February 9, 2015,

### M. Adams, P. Colella, D. T. Graves, J.N. Johnson, N.D. Keen, T. J. Ligocki, D. F. Martin, P.W. McCorquodale, D. Modiano, P.O. Schwartz, T.D. Sternberg, B. Van Straalen, "Chombo Software Package for AMR Applications - Design Document", Lawrence Berkeley National Laboratory Technical Report LBNL-6616E, January 9, 2015,

- Download File: chomboDesign.pdf (pdf: 994 KB)

### D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A.I. Krylov, "New algorithms for iterative matrix-free eigensolvers in quantum chemistry", Journal of Computational Chemistry, Vol. 36, Issue 5, pp. 273–284, 2015,

New algorithms for iterative diagonalization procedures that solve for a small set of eigen-states of a large matrix are described. The performance of the algorithms is illustrated by calculations of low and high-lying ionized and electronically excited states using equation-of-motion coupled-cluster methods with single and double substitutions (EOM-IP-CCSD and EOM-EE-CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum-chemical calculations. The second is a recently developed solver, called the “Generalized Preconditioned Locally Harmonic Residual (GPLHR) method.” We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM-CC suite of methods in the Q-Chem electronic structure program.

### 2014

### Siegfried Cools, Pieter Ghysels, Wim van Aarle, Wim Vanroose, "A multi-level preconditioned Krylov method for the efficient solution of algebraic tomographic reconstruction problems", To appear in Journal of Computational and Applied Mathematics, December 28, 2014,

### François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, Artem Napov, "A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization", Submitted to ACM Transactions on Mathematical Software, December 2014,

### Wei Hu, Lin Lin, Chao Yang and Jinlong Yang, "Electronic structure and aromaticity of large-scale hexagonal graphene nanoflakes", J. Chem. Phys. 141, 214704 (2014), December 2, 2014, 141:214704, doi: 10.1063/1.4902806

- Download File: JCPGNFs.pdf (pdf: 3.7 MB)

With the help of the recently developed SIESTA-PEXSI method [L. Lin, A. García, G. Huhs, and C. Yang, J. Phys.: Condens. Matter 26, 305503 (2014)], we perform Kohn-Sham density functional theory calculations to study the stability and electronic structure of hydrogen passivated hexagonal graphene nanoflakes (GNFs) with up to 11 700 atoms. We find the electronic properties of GNFs, including their cohesive energy, edge formation energy, highest occupied molecular orbital-lowest unoccupied molecular orbital energy gap, edge states, and aromaticity, depend sensitively on the type of edges (armchair graphene nanoflakes (ACGNFs) and zigzag graphene nanoflakes (ZZGNFs)), size and the number of electrons. We observe that, due to the edge-induced strain effect in ACGNFs, large-scale ACGNFs’ edge formation energy decreases as their size increases. This trend does not hold for ZZGNFs due to the presence of many edge states in ZZGNFs. We find that the energy gaps E_g of GNFs all decay with respect to 1/L, where L is the size of the GNF, in a linear fashion. But as their size increases, ZZGNFs exhibit more localized edge states. We believe the presence of these states makes their gap decrease more rapidly. In particular, when L is larger than 6.40 nm, we find that ZZGNFs exhibit metallic characteristics. Furthermore, we find that the aromatic structures of GNFs appear to depend only on whether the system has 4N or 4N + 2 electrons, where N is an integer.

### David Trebotich, Mark F. Adams, Sergi Molins, Carl I. Steefel, Chaopeng Shen, "High-Resolution Simulation of Pore-Scale Reactive Transport Processes Associated with Carbon Sequestration", Computing in Science and Engineering, December 2014, 16:22-31, doi: 10.1109/MCSE.2014.77

- Download File: CISE-16-06-Trebotichappeared.pdf (pdf: 2.7 MB)

### Mark Adams, Samuel Williams, Jed Brown, HPGMG, Birds of a Feather (BoF), Supercomputing, November 2014,

- Download File: SC14HPGMGBoF.pdf (pdf: 1.9 MB)

### Alex Druinsky, Brian Austin, Sherry Li, Osni Marques, Eric Roman, Samuel Williams, "A Roofline Performance Analysis of an Algebraic Multigrid Solver", Supercomputing (SC), November 2014,

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Adaptation and Policy-Based Resource Allocation for Efficient Bulk Data Transfers in High Performance Computing Environments", 4th International Workshop on Network-aware Data Management (NDM'14), 2014,

### Veronika Strnadova, Aydın Buluç, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, John Gilbert, Daniel Rokhsar, Leonid Oliker, "Efficient and accurate clustering for large-scale genetic mapping", IEEE International Conference on Bioinformatics and Biomedicine (BIBM'14), November 1, 2014,

- Download File: bibm14.pdf (pdf: 764 KB)

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Because of the sheer size of the raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors, its analysis has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as Small Angle X-ray Scattering (SAXS), which is the focus of this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.
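The accept/reject idea behind Reverse Monte Carlo fitting can be illustrated with a toy serial sketch. Everything below is an assumption for illustration, not the paper's code: the two-parameter exponential `model` stands in for a scattering simulation, the synthetic `data` stands in for measured intensities, and the acceptance rule (always accept a lower chi-squared, otherwise accept with probability exp(-Δχ²/2)) is the form commonly associated with the method.

```python
import math
import random


def model(params, x):
    """Hypothetical forward model: amplitude * exp(-decay * x)."""
    amplitude, decay = params
    return amplitude * math.exp(-decay * x)


def chi2(params, xs, data, sigma):
    """Sum of squared, sigma-normalized residuals between model and data."""
    return sum(((model(params, x) - d) / sigma) ** 2 for x, d in zip(xs, data))


def reverse_monte_carlo(xs, data, start, sigma=0.1, step=0.05,
                        n_steps=5000, seed=0):
    """Random-walk fit; returns the best parameters seen and their chi2."""
    rng = random.Random(seed)
    current = list(start)
    current_chi2 = chi2(current, xs, data, sigma)
    best, best_chi2 = current[:], current_chi2
    for _ in range(n_steps):
        # propose a small random move of every parameter
        trial = [p + rng.uniform(-step, step) for p in current]
        trial_chi2 = chi2(trial, xs, data, sigma)
        delta = trial_chi2 - current_chi2
        # accept improvements always, worse moves with probability exp(-delta/2)
        if delta <= 0 or rng.random() < math.exp(-delta / 2.0):
            current, current_chi2 = trial, trial_chi2
            if current_chi2 < best_chi2:
                best, best_chi2 = current[:], current_chi2
    return best, best_chi2


xs = [0.2 * i for i in range(21)]
data = [model((2.0, 0.5), x) for x in xs]   # synthetic, noise-free "measurement"
start = (1.0, 1.0)
initial_chi2 = chi2(start, xs, data, sigma=0.1)
best, best_chi2 = reverse_monte_carlo(xs, data, start)
```

In a realistic setting the forward model dominates the cost of each step, which is consistent with the abstract's emphasis on parallelizing the computation.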

### Adam Lugowski, Shoaib Kamil, Aydın Buluç, Samuel Williams, Erika Duriakova, Leonid Oliker, Armando Fox, John R. Gilbert, "Parallel processing of filtered queries in attributed semantic graphs", Journal of Parallel and Distributed Computing (JPDC), September 2014, doi: 10.1016/j.jpdc.2014.08.010

### W.A. de Jong, L. Lin, H. Shan, C. Yang and L. Oliker, "Towards modelling complex mesoscale molecular environments", International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), 2014,

### Mark Adams, Jed Brown, Matt Knepley, Ravi Samtaney, "Segmental Refinement: A Multigrid Technique for Data Locality", Submitted to SISC, June 30, 2014,

### Pieter Ghysels, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Jianlin Xia, Hierarchically Low-Rank Structured Sparse Factorization with Reduced Communication and Synchronization, Householder Symposium XIX, June 2014,

### Pieter Ghysels, Wim Vanroose, Karl Meerbergen, High Performance Implementation of Deflated Preconditioned Conjugate Gradients with Approximate Eigenvectors, Householder Symposium XIX June 8-13, Spa Belgium, Pages: 84 June 2014,

### Mark F. Adams, Jed Brown, John Shalf, Brian Van Straalen, Erich Strohmaier, Samuel Williams, "HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems", LBNL Technical Report, 2014, LBNL 6630E,

- Download File: hpgmg.pdf (pdf: 183 KB)

### Abhinav Sarje, Xiaoye Li, Slim Chourou, Alexander Hexemer, "Petascale X-Ray Scattering Simulations With GPUs", GPU Technology Conference, March 2014,

### Abhinav Sarje, Xiaoye Li, Alexander Hexemer, "Inverse Modeling of X-Ray Scattering Data With Reverse Monte Carlo Simulations", GPU Technology Conference, March 2014,

### Xiaoye S. Li, Artem Napov, Francois-Henry Rouet, Designing multifrontal solvers using hierarchically semiseparable structures, SIAM Conference on Parallel Processing for Scientific Computing (PP14), Portland, OR, USA, February 2014,

### A. L. Chervenak, A. Sim, J. Gu, R. Schuler, N. Hirpathak, "Efficient Data Staging Using Performance-Based Adaptation and Policy-Based Resource Allocation", 22nd Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2014,

### E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, U. V. Catalyurek, "An Out-of-core Task-based Middleware for Data Intensive Scientific Computing", Handbook on Data Centers, in press, (Springer: February 1, 2014)

### J. Kaye, L. Lin and C. Yang, "A posteriori error estimator for adaptive local basis functions to solve Kohn-Sham density functional theory", Comm. Math. Sci., January 5, 2014, 13:1741--1740, doi: 10.4310/CMS.2015.v13.n7.a5

### J. González-Domínguez, O. Marques, M. J. Martín and J. Touriño, "A 2D Algorithm with Asymmetric Workload for the UPC Conjugate Gradient Method", The Journal of Supercomputing, 2014, 70:816-829,

### A. Fujii, O. Marques, "Axis Communication Method for Algebraic Multigrid Solver", IEICE Transactions on Information and Systems, 2014, E97-D:2955-2958,

### Samuel Williams, Mike Lijewski, Ann Almgren, Brian Van Straalen, Erin Carson, Nicholas Knight, James Demmel, "s-step Krylov subspace methods as bottom solvers for geometric multigrid", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, January 2014, 1149--1158, doi: 10.1109/IPDPS.2014.119

- Download File: ipdps14cabicgstabfinal.pdf (pdf: 943 KB)
- Download File: ipdps14CABiCGStabtalk.pdf (pdf: 944 KB)

### G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, Hong Diep Nguyen, E. Solomonik, "Reconstructing Householder Vectors from Tall-Skinny QR", Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, 2014, 1159-1170, doi: 10.1109/IPDPS.2014.120

### Laura Grigori, Mathias Jacquelin, Amal Khabou, "Performance predictions of multilevel communication optimal LU and QR factorizations on hierarchical platforms", 29th IEEE International Supercomputing Conference (ISC'2014), Springer, 2014, 76--92,

### 2013

### H. M. Aktulga, L. Lin, C. Haine, E. G. Ng, C. Yang, "Parallel Eigenvalue Calculation based on Multiple Shift-invert Lanczos and Contour Integral based Spectral Projection Method", Parallel Computing, December 6, 2013, in press,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

- Download File: sarje-thmmcs-pmbs.pdf (pdf: 2 MB)

### M. Jung, E. H. Wilson III, W. Choi, J. Shalf, H. M. Aktulga, C. Yang, E. Saule, U. V. Catalyurek, M. Kandemir, "Exploring the Future of Out-of-core Computing with Compute-Local Non-Volatile Memory", International Conference for High Performance Computing, Networking, Storage and Analysis 2013 (SC13), NY, USA, ACM New York, November 17, 2013, doi: 10.1145/2503210.2503261

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility makes it easy to tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

- Download File: sbac2013personal.pdf (pdf: 195 KB)

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.
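The arithmetic idea behind exact, reproducible summation can be sketched in software. This is only an illustration of summing doubles through an integer representation: the paper proposes doing this with logic in the NIC, whereas the sketch below leans on Python's arbitrary-precision integers, and the names `to_scaled_int`, `exact_sum`, and `naive_sum` are hypothetical.

```python
import math
from fractions import Fraction

# Every finite double is an integer multiple of 2**-1074 (the smallest subnormal).
SCALE_BITS = 1074


def to_scaled_int(x):
    """Represent a finite double exactly as an integer count of 2**-1074 units."""
    if not math.isfinite(x):
        raise ValueError("only finite values can be represented")
    num, den = x.as_integer_ratio()          # den is a power of two <= 2**1074
    return num * ((1 << SCALE_BITS) // den)


def exact_sum(values):
    """Sum doubles without intermediate rounding; round once at the end."""
    total = sum(to_scaled_int(v) for v in values)   # exact integer arithmetic
    return float(Fraction(total, 1 << SCALE_BITS))  # single final rounding


def naive_sum(values):
    """Left-to-right double precision accumulation, for comparison."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


values = [1e16, 1.0, -1e16]
exact = exact_sum(values)
naive = naive_sum(values)
```

Because integer addition is associative, `exact_sum` gives the same result for any ordering of the inputs, which is the reproducibility property the abstract emphasizes. In this example the left-to-right floating point accumulation loses the `1.0` term while the integer-based sum keeps it.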

### H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, "Improving the Scalability of a Symmetric Iterative Eigensolver for Multi-core Platforms", Concurrency and Computation: Practice & Experience, September 12, 2013, online, doi: 10.1002/cpe.3129

### Alfredo Buttari, Serge Gratton, Xiaoye S. Li, Marième Ngom, François-Henry Rouet, David Titley-Peloquin, Clément Weisbecker, "Error Analysis of the Block Low-Rank LU factorization of dense matrices", IRIT-CERFACS, RT-APO-13-7, August 2013,

### Emmanuel Agullo, Patrick R. Amestoy, Alfredo Buttari, Abdou Guermouche, Guillaume Joslin, Jean-Yves L'Excellent, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Mohamed Sid-Lakhdar, Shen Wang, Clément Weisbecker, Ichitaro Yamazaki., "Recent Advances in Sparse Direct Solvers", 22nd Conference on Structural Mechanics in Reactor Technology, August 18, 2013,

- Download File: paper3.pdf (pdf: 243 KB)

### Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, Maarten V. de Hoop, "A parallel geometric multifrontal solver using hierarchically semiseparable structure", Submitted to ACM Transaction on Mathematical Software, 2013,

### James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

### P. Maris, H. M. Aktulga, S. Binder, A. Calci, U. V. Catalyurek, J. Langhammer, E. G. Ng, E. Saule, R. Roth, J. P. Vary, C. Yang, "No Core CI calculations for light nuclei with chiral 2- and 3-body forces", J. Phys. Conf. Ser., IOP Publishing, August 1, 2013, 454:012063, doi: 10.1088/1742-6596/454/1/012063

### P. Ghysels, W. Vanroose, "Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm", Parallel Computing, June 24, 2013, doi: 10.1016/j.parco.2013.06.001

### Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, Sivan Toledo, "Communication optimal parallel multiplication of sparse random matrices", SPAA 2013: The 25th ACM Symposium on Parallelism in Algorithms and Architectures, Montreal, Canada, 2013, 222-231, doi: 10.1145/2486159.2486196

- Download File: spaa134-ballard.pdf (pdf: 301 KB)

### E. Solomonik, A. Buluç, J. Demmel, "Minimizing communication in all-pairs shortest paths", International Parallel and Distributed Processing Symposium (IPDPS), 2013,

- Download File: 25dapspipdps13.pdf (pdf: 256 KB)

### Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams, "High-Productivity and High-Performance Analysis of Filtered Semantic Graphs", International Parallel and Distributed Processing Symposium (IPDPS), 2013, doi: 10.1145/2370816.2370897

- Download File: ipdps13-kdtsejits.pdf (pdf: 398 KB)

### P. Ghysels, T. J. Ashby, K. Meerbergen, W. Vanroose, "Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines", SIAM Journal on Scientific Computing, January 8, 2013, 35:1, doi: 10.1137/12086563X

### L. Lin, M. Chen, C. Yang, L. He, "Accelerating Atomic Orbital-based Electronic Structure Calculation via Pole Expansion and Selected Inversion", J. Phys.: Condens. Matter, 2013,

### L. Lin, C. Yang, "Elliptic preconditioner for accelerating the self-consistent field iteration in Kohn-Sham Density Functional Theory", SIAM J. Sci. Comp., 2013,

### Jack Dongarra, Mathieu Faverge, Thomas Herault, Mathias Jacquelin, Julien Langou, Yves Robert, "Hierarchical QR factorization algorithms for multi-core clusters", Parallel Computing, 2013, 39:212--232,

### 2012

### P. Maris, H. M. Aktulga, M. A. Caprio, U. V. Catalyurek, E. G. Ng, D. Oryspayev, H. Potter, E. Saule, M. Sosonkina, J. P. Vary, C. Yang, Z. Zhou, "Large-scale Ab-initio Configuration Interaction Calculations for Light Nuclei", J. Phys. Conf. Ser., IOP Publishing, December 18, 2012, 403:012019, doi: 10.1088/1742-6596/403/1/012019

### H. Hu, C. Yang, K. Zhao, "Absorption correction A* for cylindrical and spherical crystals with extended range and high accuracy calculated by Thorkildsen & Larsen analytical method", in press Acta Crystallographica, A, 2012,

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible enough to compute scattered light intensities in all spatial directions, allowing full reconstruction of GISAXS patterns for any complex structures at high resolution while reducing simulation times from months to minutes.

### C. Mendl, L. Lin, "Towards the Kantorovich dual solution for strictly correlated electrons in atoms and molecules", submitted to Phys. Rev. B, 2012,

### Junmin Gu, David Smith, Ann L. Chervenak, Alex Sim, "Adaptive Data Transfers that Utilize Policies for Resource Sharing", The 2nd International Workshop on Network-Aware Data Management Workshop (NDM2012), 2012,

### L. Lin, S. Shao, W.E, "Efficient iterative method for solving the Dirac-Kohn-Sham density functional theory", submitted to J. Comput. Phys., 2012,

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

### A. Buluç, A. Fox, J. R. Gilbert, S. Kamil, A. Lugowski, L. Oliker, S. Williams, "High-performance analysis of filtered semantic graphs", PACT '12 Proceedings of the 21st international conference on Parallel architectures and compilation techniques (extended abstract), 2012, doi: 10.1145/2370816.2370897

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-of-core Eigensolver on SSD-equipped Clusters", 2012 IEEE International Conference on Cluster Computing (CLUSTER), Beijing, China, September 26, 2012, 248 - 256, doi: 10.1109/CLUSTER.2012.76

### Z. Zhou, E. Saule, H. M. Aktulga, C. Yang, E. G. Ng, P. Maris, J. P. Vary, U. V. Catalyurek, "An Out-Of-Core Dataflow Middleware to Reduce the Cost of Large Scale Iterative Solvers", 2012 41st International Conference on Parallel Processing Workshops (ICPPW), Pittsburgh, PA, September 10, 2012, 71 - 80, doi: 10.1109/ICPPW.2012.13

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Topology-Aware Mappings for Large-Scale Eigenvalue Problems", Euro-Par 2012 Parallel Processing Conference, Rhodes Island, Greece, August 31, 2012, LNCS 748:830-842, doi: 10.1007/978-3-642-32820-6_82

### L. Lin, L. Ying, "Element orbitals for Kohn-Sham density functional theory", Phys. Rev. B, 2012, 85:235144,

### A. Buluç, J. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments", SIAM Journal on Scientific Computing (SISC), 2012,

- Download File: spgemmsisc12.pdf (pdf: 1.2 MB)

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### A. Lugowski, D. Alber, A. Buluç, J. Gilbert, S. Reinhardt, Y. Teng, A. Waranis, "A flexible open-source toolbox for scalable complex graph analysis", SIAM Conference on Data Mining (SDM), 2012,

- Download File: kdt-final.pdf (pdf: 753 KB)

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)

### A. Lugowski, A. Buluç, J. Gilbert, S. Reinhardt, "Scalable complex graph analysis with the knowledge discovery toolbox", International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2012,

### L. Lin, J. Lu, L. Ying and W. E, "Optimized local basis set for Kohn-Sham density functional theory", J. Comput. Phys., 2012, 231:4515,

### Eliot Gann, Slim Chourou, Abhinav Sarje, Harald Ade, Cheng Wang, Elaine Chan, Xiaodong Ding, Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in the characterization of the ensemble statistical properties of complex layered and nano-structured thin-film systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters, American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility makes it easy to tackle a wide range of possible sample geometries, such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

### P. Ghysels, P. Kłosiewicz, W. Vanroose, "Improving the arithmetic intensity of multigrid with the help of polynomial smoothers", Numerical Linear Algebra with Applications, February 1, 2012, 19:2, doi: 10.1002/nla.1808

### D. Yu, D. Katramatos, A. Shoshani, A. Sim, J. Gu, V. Natarajan, "StorNet: Integrating Storage Resource Management with Dynamic Network Provisioning for Automated Data Transfer", International Committee for Future Accelerators (ICFA) Standing Committee on Inter-Regional Connectivity (SCIC) 2012 Report: Networking for High Energy Physics, 2012,

### D. Y. Parkinson, C. Yang, C. Knoechel, C. A. Larabell, M. Le Gros, "Automatic alignment and reconstruction of images for soft X-ray tomography", J Struct Biol, February 2012, 177:259--266, doi: 10.1016/j.jsb.2011.11.027

### L. Lin, J. Lu, L. Ying and W. E, "Adaptive local basis set for Kohn-Sham density functional theory in a discontinuous Galerkin framework I: Total energy calculation", J. Comput. Phys., 2012, 231:2140,

### D. Flammini, A. Pietropaolo, R. Senesi, C. Andreani, F. McBride, A. Hodgson, M. Adams, L. Lin, and R. Car, "Spherical momentum distribution of the protons in hexagonal ice from modeling of inelastic neutron scattering data", J. Chem. Phys., 2012, 136:024504,

### J. González-Domínguez, O. Marques, M. Martín, G. Taboada, J. Touriño, "Design and Performance Issues of Cholesky and LU Solvers using UPCBLAS", 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, Madrid, 2012,

### M. Kawai, T. Iwashita, H. Nakashima and O. Marques, "Parallel Smoother Based on Block Red-Black Ordering for Multigrid Poisson Solver", LNCS, Proc. VECPAR 2012, Kobe, Japan, Springer, 2012, 7851:292-299,

### Zaiwen Wen, Chao Yang, Xin Liu, Stefano Marchesini, "Alternating direction methods for classical and ptychographic phase retrieval", Inverse Problems, January 2012, 28:115010,

### 2011

### Ichitaro Yamazaki, Xiaoye Sherry Li, François-Henry Rouet, Bora Uçar, "Partitioning, Ordering and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver", Institut National Polytechnique de Toulouse, RT-APO-12-2, November 2011,

- Download File: reportPDSLin.pdf (pdf: 634 KB)

### L. Lin, C. Yang, J. Lu, L. Ying, W. E, "A fast parallel algorithm for selected inversion of structured sparse matrices with application to 2D electronic structure calculations", SIAM J. Sci. Comput., 2011, 33:1329,

### R. Ryne, B. Austin, J. Byrd, J. Corlett, E. Esarey, C. G. R. Geddes, W. Leemans, X. Li, Prabhat, J. Qiang, O. Rübel, J.-L. Vay, M. Venturini, K. Wu, B. Carlsten, D. Higdon and N. Yampolsky, "High Performance Computing in Accelerator Science: Past Successes, Future Challenges", Workshop on Data and Communications in Basic Energy Sciences: Creating a Pathway for Scientific Discovery, October 2011,

### H. M. Aktulga, C. Yang, U. V. Catalyurek, P. Maris, J. P. Vary, E. G. Ng, "On Reducing I/O Overheads in Large-Scale Invariant Subspace Projections", Euro-Par 2011: Parallel Processing Workshops, Bordeaux, France, August 29, 2011, LNCS 715:305-314, doi: 10.1007/978-3-642-29737-3_35

### L. Lin, C. Yang, J. Meza, J. Lu, L. Ying, W. E, "SelInv -- An algorithm for selected inversion of a sparse symmetric matrix", ACM Trans. Math. Software, 2011, 37:40,

### L. Lin, J.A. Morrone and R. Car, "Correlated tunneling in hydrogen bonds", J. Stat. Phys., 2011, 145:365,

### J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

### E. G. Ng, J. Sarich, S. M. Wild, T. Munson, H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, N. Schunck, M. G. Bertolli, M. Kortelainen, W. Nazarewicz, T. Papenbrock, M. V. Stoitsov, "Advancing Nuclear Physics Through TOPS Solvers and Tools", SciDAC 2011 Conference, Denver, CO, July 10, 2011, arXiv:1110.1708,

### H. M. Aktulga, C. Yang, P. Maris, J. P. Vary, E. G. Ng, "Large-scale Parallel Null Space Calculation for Nuclear Configuration Interaction", 2011 International Conference on High Performance Computing and Simulation (HPCS), Istanbul, Turkey, July 8, 2011, 176 - 185, doi: 10.1109/HPCSim.2011.5999822

### J. Gu, D. Katramatos, X. Liu, V. Natarajan, A. Shoshani, A. Sim, D. Yu, S. Bradley, S. McKee, "StorNet: Integrated Dynamic Storage and Network Resource Provisioning and Management for Automated Data Transfers", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/1/012002

### G. Garzoglio, J. Bester, K. Chadwick, D. Dykstra, D. Groep, J. Gu, T. Hesselroth, O. Koeroo, T. Levshina, S. Martin, M. Salle, N. Sharma, A. Sim, S. Timm, A. Verstegen, "Adoption of a SAML-XACML Profile for Authorization Interoperability across Grid Middleware in OSG and EGEE", Journal of Physics: Conf. Ser., 2011, 331, doi: 10.1088/1742-6596/331/6/062011

### A. Buluç, J. R. Gilbert, V. B. Shah, "Implementing Sparse Matrices for Graph Algorithms", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### A. Buluç, J. R. Gilbert, "New Ideas in Sparse Matrix-Matrix Multiplication", Graph Algorithms in the Language of Linear Algebra. SIAM Press, (2011)

### A. Buluç, J. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications", International Journal of High-Performance Computing Applications (IJHPCA), 2011,

- Download File: combblas-r2.pdf (pdf: 288 KB)

### L. Lin, J.A. Morrone, R. Car, M. Parrinello, "Momentum distribution, vibrational dynamics and the potential of the mean force in ice", Phys. Rev. B (Rapid Communication), 2011, 83:220302,

### Junmin Gu, Dimitrios Katramatos, Xin Liu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Dantong Yu, Scott Bradley, Shawn McKee, "StorNet: Co-Scheduling of End-to-End Bandwidth Reservation on Storage and Network Systems for High Performance Data Transfers", IEEE INFOCOM HSN 2011, 2011,

### A. Buluç, S. Williams, L. Oliker, J. Demmel, "Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication", International Parallel Distributed Processing Symposium (IPDPS), May 2011, doi: 10.1109/IPDPS.2011.73

- Download File: ipdps11-spmv.pdf (pdf: 761 KB)

### L. Lin, J. Lu and L. Ying, "Fast construction of hierarchical matrix representation from matrix-vector multiplication", J. Comput. Phys., 2011, 230:4071,

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "TORCH Computational Reference Kernels: A Testbed for Computer Science Research", LBNL Technical Report, 2011, LBNL 4172E,

### Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, Yves Robert, "Tiled QR factorization algorithms", SC, 2011, 7,

### Franck Cappello, Mathias Jacquelin, Loris Marchal, Yves Robert, Marc Snir, "Comparing archival policies for Blue Waters", HiPC, 2011, 1-10,

### Tudor David, Mathias Jacquelin, Loris Marchal, "Scheduling streaming applications on a complex multicore platform", Concurrency and Computation: Practice and Experience, 2011, doi: 10.1002/cpe.1874

### Dean N. Williams, Ian T. Foster, Don E. Middleton, Rachana Ananthakrishnan, Neill Miller, Mehmet Balman, Junmin Gu, Vijaya Natarajan, Arie Shoshani, Alex Sim, Gavin Bell, Robert Drach, Michael Ganzberger, Jim Ahrens, Phil Jones, Daniel Crichton, Luca Cinquini, David Brown, Danielle Harper, Nathan Hook, Eric Nienhouse, Gary Strand, Hannah Wilcox, Nathan Wilhelmi, Stephan Zednik, Steve Hankin, Roland Schweitzer, John Harney, Ross Miller, Galen Shipman, Feiyi Wang, Peter Fox, Patrick West, Stephan Zednik, Ann Chervenak, Craig Ward, "Earth System Grid Center for Enabling Technologies (ESG-CET): A Data Infrastructure for Data-Intensive Climate Research", SciDAC Conference, 2011,

### Filipe RNC Maia, Chao Yang, Stefano Marchesini, "Compressive auto-indexing in femtosecond nanocrystallography", Ultramicroscopy, 2011, 111:807--811, LBNL 4598E,

### Xiaoye S. Li, Meiyue Shao, "A supernodal approach to incomplete LU factorization with partial pivoting", ACM Transactions on Mathematical Software, 2011, 37:43:1--43:2, doi: 10.1145/1916461.1916467

### Mathias Jacquelin, "Memory-Aware Algorithms and Scheduling Techniques: From Multicore Processors to Petascale Supercomputers", IPDPS Workshops, 2011, 2038-2041,

### Mathias Jacquelin, Loris Marchal, Yves Robert, Bora Uçar, "On Optimal Tree Traversals for Sparse Matrix Factorization", IPDPS, 2011, 556-567,

### 2010

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Multiscale Modeling of Viscoelastic Plant Tissue", International Journal for Multiscale Computational Engineering, 2010, 8:4, doi: 10.1615/IntJMultCompEng.v8.i4.30

### P. Ghysels, G. Samaey, P. Van Liedekerke, E. Tijskens, H. Ramon, D. Roose, "Coarse Implicit Time Integration of a Cellular Scale Particle Model for Plant Tissue Deformation", International Journal for Multiscale Computational Engineering, 2010, 8, doi: 10.1615/IntJMultCompEng.v8.i4.50

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs-poster.pdf (pdf: 679 KB)

### P. Van Liedekerke, E. Tijskens, H. Ramon, P. Ghysels, G. Samaey, D. Roose, "Particle-based model to simulate the micromechanics of biological cells", Physical Review E, June 3, 2010, 81:6, doi: 10.1103/PhysRevE.81.061906

### A. Buluç, J. R. Gilbert, C. Budak, "Solving path problems on the GPU", Parallel Computing, 36(5-6):241-253, 2010, doi: http://dx.doi.org/10.1016/j.parco.2009.12.002

- Download File: parcoapsp.pdf (pdf: 160 KB)

### E. Strohmaier, S. Williams, A. Kaiser, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, "A Kernel Testbed for Parallel Architecture, Language, and Performance Research", International Conference of Numerical Analysis and Applied Mathematics (ICNAAM), June 1, 2010, doi: 10.1063/1.3497950

### P. Van Liedekerke, P. Ghysels, E. Tijskens, G. Samaey, B. Smeedts, D. Roose, H. Ramon, "A particle-based model to simulate the micromechanics of single-plant parenchyma cells and aggregates", Physical Biology, May 26, 2010, 7:2, doi: 10.1088/1478-3975/7/2/026006

### L. Lin, J.A. Morrone, R. Car and M. Parrinello, "Displaced path integral formulation for the momentum distribution of quantum particles", Phys. Rev. Lett., 2010, 105:110602,

### A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. Bailey, J. Demmel, E. Strohmaier, "A Principled Kernel Testbed for Hardware/Software Co-Design Research", Proceedings of the 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar), 2010,

- Download File: hotpar10-dwarfs.pdf (pdf: 128 KB)

### Matthieu Gallet, Mathias Jacquelin, Loris Marchal, "Scheduling complex streaming applications on the Cell processor", IPDPS Workshops, 2010, 1-8,

### Ichitaro Yamazaki, Zhaojun Bai, Horst D. Simon, Lin-Wang Wang, Kesheng Wu, "Adaptive Projection Subspace Dimension for the Lanczos Method", ACM Transactions on Mathematical Software, 2010, 37, doi: 10.1145/1824801.1824805

### 2009

### J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, "Accelerating Time-to-Solution for Computational Science and Engineering", SciDAC Review, Number 15, December 2009,

### L. Lin, J. Lu, L. Ying, W. E, "Pole-based approximation of the Fermi-Dirac function", Chin. Ann. Math., 2009, 30B:729,

### A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks", SPAA '09 Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, doi: http://dx.doi.org/10.1145/1583991.1584053

- Download File: csb2009.pdf (pdf: 347 KB)

### L. Lin, J. Lu, L. Ying, R. Car, W. E, "Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems", Commun. Math. Sci., 2009, 7:755,

### M. Riedel, E. Laure, Th. Soddemann, L. Field, J. P. Navarro, J. Casey, M. Litmaath, J. Ph. Baud, B. Koblitz, C. Catlett, D. Skow, C. Zheng, P. M. Papadopoulos, M. Katz, N. Sharma, O. Smirnova, B. Kónya, P. Arzberger, F. Würthwein, A. S. Rana, T. Martin, M. Wan, V. Welch, T. Rimovsky, S. Newhouse, A. Vanni, Y. Tanaka, Y. Tanimura, T. Ikegami, D. Abramson, C. Enticott, G. Jenkins, R. Pordes, N. Sharma, S. Timm, N. Sharma, G. Moont, M. Aggarwal, D. Colling, O. van der Aa, A. Sim, V. Natarajan, A. Shoshani, J. Gu, S. Chen, G. Galang, R. Zappi, L. Magnoni, V. Ciaschini, M. Pace, V. Venturi, M. Marzolla, P. Andreetto, B. Cowles, S. Wang, Y. Saeki, H. Sato, S. Matsuoka, P. Uthayopas, S. Sriprayoonsakul, O. Koeroo, M. Viljoen, L. Pearlman, S. Pickles, David Wallom, G. Moloney, J. Lauret, J. Marsteller, P. Sheldon, S. Pathak, S. De Witt, J. Mencák, J. Jensen, M. Hodges, D. Ross, S. Phatanapherom, G. Netzer, A. R. Gregersen, M. Jones, S. Chen, P. Kacsuk, A. Streit, D. Mallmann, F. Wolf, T. Lippert, Th. Delaitre, E. Huedo, N. Geddes, "Interoperation of world-wide production e-Science infrastructures", Concurrency and Computation: Practice and Experience, 2009, 21(8):961-990,

### Arie Shoshani, Flavia Donno, Junmin Gu, Jason Hick, Maarten Litmaath, Alex Sim, "Dynamic Storage Management", Scientific Data Management: Challenges, Technology, and Deployment, edited by Arie Shoshani, Doron Rotem, (Chapman & Hall/CRC Computational Science: 2009)

### J.A. Morrone, L. Lin, R. Car, "Tunneling and delocalization effects in hydrogen bonded systems: A study in position and momentum space", J. Chem. Phys., 2009, 130:204511,

### P. Ghysels, G. Samaey, B. Tijskens, P. Van Liedekerke, H. Ramon, D. Roose, "Multi-scale simulation of plant tissue deformation using a model for individual cell mechanics", Physical Biology, March 25, 2009, 6:1, doi: 10.1088/1478-3975/6/1/016009

### L. Lin, J. Lu, R. Car, W. E, "Multipole representation of the Fermi operator with application to the electronic structure analysis of metallic systems", Phys. Rev. B, 2009, 79:115133,

### Mathias Jacquelin, Loris Marchal, Yves Robert, "Complexity analysis and performance evaluation of matrix product on multicore architectures", International Conference on Parallel Processing (ICPP '09), 2009, 196--203,

### K Wu et al., "FastBit: Interactively Searching Massive Data", SciDAC 2009, 2009, LBNL 2164E, doi: 10.1088/1742-6596/180/1/012053

- Download File: LBNL-2164E.pdf (pdf: 3.2 MB)

### Xiaoye S. Li, Meiyue Shao, Ichitaro Yamazaki, Esmond G. Ng, "Factorization-based sparse solvers and preconditioners", (SciDAC 2009) Journal of Physics: Conference Series 180(2009) 012015, 2009, doi: 10.1088/1742-6596/180/1/012015

### 2008

### A. Buluç, J. Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", Proceedings of the 37th International Conference on Parallel Processing (ICPP), 2008, doi: 10.1109/ICPP.2008.45

- Download File: spgemmicpp08.pdf (pdf: 206 KB)

### C. Voemel, S. Tomov, O. Marques, A. Canning, L.-W. Wang, J. Dongarra, "State-of-the-art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-systems", Journal of Computational Physics, 2008, 227:7113-7124,

### S. Williams, J. Carter, J. Demmel, L. Oliker, D. Patterson, J. Shalf, K. Yelick, R. Vuduc, "Autotuning Scientific Kernels on Multicore Systems", ASCR PI Meeting, 2008,

- Download File: ascrpi08-autotuning-poster.pdf (pdf: 2.2 MB)

### P. Jakl, J. Lauret, A. Hanushevsky, A. Shoshani, A. Sim, J. Gu, "Grid data access on widely distributed worker nodes using scalla and SRM", Journal of Physics: Conf. Ser., 2008, 119, doi: 10.1088/1742-6596/119/7/072019

### O. Marques, J. Demmel, C. Voemel, B. Parlett, "A Testing Infrastructure for Symmetric Tridiagonal Eigensolvers", ACM TOMS, 2008, 35,

### Alex Sim, Arie Shoshani (Editors), Paolo Badino, Olof Barring, Jean‐Philippe Baud, Ezio Corso, Shaun De Witt, Flavia Donno, Junmin Gu, Michael Haddox‐Schatz, Bryan Hess, Jens Jensen, Andy Kowalski, Maarten Litmaath, Luca Magnoni, Timur Perelmutov, Don Petravick, Chip Watson, The Storage Resource Manager Interface Specification Version 2.2, Open Grid Forum, Document in Full Recommendation, GFD.129, 2008,

### A. Buluç, J.R. Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008, doi: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2008.4536313

- Download File: hypersparse-ipdps08.pdf (pdf: 194 KB)

### J. Demmel, O. Marques, C. Voemel, B. Parlett, "Performance and Accuracy of LAPACK's Symmetric Tridiagonal Eigensolvers", SIAM Journal on Scientific Computing, 2008, 30:1508–1526,

### Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine A. Yelick, James Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Parallel Computing, 2008, 35:38, doi: 10.1016/j.parco.2008.12.006

- Download File: parco08-spmv.pdf (pdf: 1.5 MB)

### 2007

### Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

- Download File: sc07-spmv.pdf (pdf: 438 KB)

### L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, F. Donno, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Managers: Recent International Experience on Requirements and Multiple Co-Operating Implementations", the 24th IEEE Conference on Mass Storage Systems and Technologies, 2007,

### F. Donno, L. Abadie, P. Badino, J. Baud, E. Corso, M. Crawford, S. De Witt, A. Forti, P. Fuhrmann, G. Grosdidier, J. Gu, J. Jensen, S. Lemaitre, M. Litmaath, D. Litvinsev, G. Lo Presti, L. Magnoni, T. Mkrtchan, A. Moibenko, V. Natarajan, G. Oleynik, T. Perelmutov, D. Petravick, A. Shoshani, A. Sim, M. Sponza, R. Zappi, "Storage Resource Manager version 2.2: design, implementation, and testing experience", Journal of Physics: Conf. Ser., 2007, 119, doi: 10.1088/1742-6596/119/6/062028

### 2006

### O. Marques, C. Voemel, J. Riedy, "Benefits of IEEE-754 Features in Modern Symmetric Tridiagonal Eigensolvers", SIAM J. Sci. Comput., 2006, 28:1613-1633,

### J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, J. Riedy, C. Vömel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, S. Tomov, "Prospectus for the next LAPACK and ScaLAPACK Libraries", PARA 2006, Umeå, Sweden, 2006,

### O. Marques, B. Parlett, C. Voemel, "Computations of Eigenpair Subsets with the MRRR Algorithm", Numer. Linear Algebra Appl., 2006, 13:643–653,

### W. Kramer, J. Carter, D. Skinner, L. Oliker, P. Husbands, P. Hargrove, J. Shalf, O. Marques, E. Ng, A. Drummond, K. Yelick, "Software Roadmap to Plug and Play Petaflop/s", 2006,

### 2005

### Kesheng Wu, Junmin Gu, Jerome Lauret, Arthur Poskanzer, Arie Shoshani, Alexander Sim, Wei-Ming Zhang, "Grid Collector: Facilitating Efficient Selective Access from Data Grids", International Supercomputer Conference 2005, 2005,

### 2004

### K. Wu, W. Zhang, A. Sim, J. Gu, A. Shoshani, "Grid Collector: an Event Catalog with Automated File Management", 2004, LBNL 55563,

### Alex Sim, Junmin Gu, Arie Shoshani, Vijaya Natarajan, "DataMover: Robust Terabytes-Scale Multi-file Replication over Wide-Area Networks", the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), 2004,

### 2003

### Arie Shoshani, Alexander Sim, Junmin Gu, "Storage Resource Managers: Essential Components for the Grid", Grid Resource Management: State of the Art and Future Trends, edited by Jarek Nabrzyski, Jennifer M. Schopf, Jan Weglarz, (Kluwer Academic Publishers: 2003)

### A. Sim, J. Gu, A. Shoshani, E. Hjort, D. Olson, "Experience with Deploying Storage Resource Managers to Achieve Robust File Replication", Computing in High Energy Physics, 2003,

### Arie Shoshani, Alex Sim, Junmin Gu, Storage Resource Managers: Essential Components for Grid Applications, Globus World, 2003,

### D. Vasco, L. Johnson, O. Marques, "Resolution, Uncertainty and Whole Earth Tomography", Journal of Geophysical Research, Solid Earth, 2003, 108,

### Kesheng Wu, Wei-Ming Zhang, Alexander Sim, Junmin Gu, Arie Shoshani, "Grid Collector: An Event Catalog With Automated File Management", Proceedings of IEEE Nuclear Science Symposium 2003, 2003, doi: 10.1109/NSSMIC.2003.1351830

### 2002

### A. Shoshani, A. Sim, J. Gu, "Storage Resource Managers: Middleware components for Grid Storage", the 19th IEEE Symposium on Mass Storage Systems, 2002,

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

- Download File: sirev02-sparse.pdf (pdf: 475 KB)

### B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, "Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines", International Parallel & Distributed Processing Symposium (IPDPS), 2002,

- Download File: ipdps02-iram.pdf (pdf: 91 KB)

### 2001

### L. Oliker, R. Biswas, P. Husbands, X. Li, Ordering Sparse Matrices for Cache-Based Systems, SIAM Conference on Parallel Processing, 2001,

- Download File: siampp01abstactb.pdf (pdf: 2.1 MB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

- Download File: ai01.pdf (pdf: 163 KB)

### 2000

### L. Oliker, X. Li, G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th International Conference on Parallel and Distributed Computing Systems, 2000,

- Download File: pdcs00-pcg.pdf (pdf: 167 KB)

### B. Parlett, O. Marques, "An Implementation of the dqds Algorithm (positive case)", Linear Algebra and its Applications, 2000, 309:217-259,

### L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on solving Irregularly Structured Problems in Parallel, 2000,

- Download File: irr00awk.pdf (pdf: 130 KB)

### 1996

### S. Chatterjee, J. Gilbert, L. Oliker, R. Schreiber, and T. Sheffler, "Algorithms for Automatic Alignment of Arrays", Journal of Parallel and Distributed Computing (JPDC), July 1996,

- Download File: jpdc96.ps.gz (gz: 89 KB)