# Xiaoye Sherry Li

Sherry Li is a Senior Scientist in the Computational Research Division, Lawrence Berkeley National Laboratory. She has worked on diverse problems in high performance scientific computations, including parallel computing, sparse matrix computations, high precision arithmetic, and combinatorial scientific computing. She has (co)authored over 90 publications, and contributed to several book chapters. She is the lead developer of SuperLU, a widely-used sparse direct solver, and has contributed to the development of several other mathematical libraries, including ARPREC, LAPACK, PDSLin, STRUMPACK, and XBLAS. She has collaborated with many domain scientists to deploy the advanced mathematical software in their application codes, including those from accelerator engineering, chemical science, earth science, plasma fusion energy science, and materials science. She earned Ph.D. in Computer Science from UC Berkeley in 1996. She has served on the editorial boards of the SIAM J. Scientific Comput. and ACM Trans. Math. Software, as well as many program committees of the scientific conferences. She is a SIAM Fellow and a Senior Member of ACM.

## Journal Articles

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

### Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov, "An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling", SIAM J. Sci. Comput. 38-5, pp. S358-S384, October 2016, doi: 10.1137/15M1010117

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno, "An efficient basis set representation for calculating electrons in molecules", Journal of Molecular Physics, 2016, doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", Springer International Journal of Parallel Programming, December 2015, 43:6:1218-1243, doi: 10.1007/s10766-014-0326-5

### François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, Artem Napov, "A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization", Submitted to ACM Transactions on Mathematical Software, December 2014,

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/ S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows to easily tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostrutures.

### Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, Maarten V. de Hoop, "A parallel geometric multifrontal solver using hierarchically semiseparable structure", Submitted to ACM Transaction on Mathematical Software, 2013,

### Xiaoye S. Li, Meiyue Shao, "A supernodal approach to incomplete LU factorization with partial pivoting", ACM Transactions on Mathematical Software, 2011, 37:43:1--43:2, doi: 10.1145/1916461.1916467

### L. Oliker. X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

- Download File: sirev02-sparse.pdf (pdf: 475 KB)

## Conference Papers

### Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

### Osni Marques, Alex Druinsky, Xiaoye S. Li, Andrew T. Barker, Panayot Vassilevski, Delyan Kalchev, "Tuning the Coarse Space Construction in a Spectral AMG Solver", ICCS 2016 (The International Conference on Computational Science), San Diego, CA, Elsevier, June 2016,

### Alex Druinsky, Pieter Ghysels, Xiaoye S. Li, Osni Marques, Samuel Williams, Andrew Barker, Delyan Kalchev, Panayot Vassilevski, "Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures", International Conference on Parallel Processing and Applied Mathematics (PPAM), September 6, 2015, doi: 10.1007/978-3-319-32149-3_12

### Marc Baboulin, Xiaoye S. Li, Francois-Henry Rouet, "Using random butterfly transformations to avoid pivoting in sparse direct methods", High Performance Computing for Computational Science - VECPAR 2014, Lecture Notes in Computer Science, Springer. Preprint, 2015,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi and many-core architectures, there is a constant need for architecture-specific tuning of application-codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively-parallel state-of-the-art supercomputers based on multi and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high-performance and high-efficiency.

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle material science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its size, analysis of raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as the Small Angle X-ray Scattering (SAXS), which we talk about in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as the Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

- Download File: sbac2013personal.pdf (pdf: 195 KB)

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.

### Emmanuel Agullo, Patrick R. Amestoy, Alfredo Buttari, Abdou Guermouche, Guillaume Joslin, Jean-Yves L'Excellent, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Mohamed Sid-Lakhdar, Shen Wang, Clément Weisbecker, Ichitaro Yamazaki., "Recent Advances in Sparse Direct Solvers", 22nd Conference on Structural Mechanics in Reactor Technology, August 18, 2013,

- Download File: paper3.pdf (pdf: 243 KB)

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)

### Xiaoye S. Li, Meiyue Shao, Ichitaro Yamazaki, Esmond G. Ng, "Factorization-based sparse solvers and preconditioners", (SciDAC 2009) Journal of Physics: Conference Series 180(2009) 012015, 2009, doi: 10.1088/1742-6596/180/1/012015

### B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, "Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines", International Parallel & Distributed Processing Symposium (IPDPS), 2002,

- Download File: ipdps02-iram.pdf (pdf: 91 KB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

- Download File: ai01.pdf (pdf: 163 KB)

### L. Oliker, X. Li. G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th Interational Conference on Parallel and Distributed Computing Systems, 2000,

- Download File: pdcs00-pcg.pdf (pdf: 167 KB)

### L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on solving Irregularly Structured Problems in Parallel, 2000,

- Download File: irr00awk.pdf (pdf: 130 KB)

## Presentation/Talks

### Pieter Ghysels, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Jianlin Xia, Hierarchically Low-Rank Structured Sparse Factorization with Reduced Communication and Synchronization, Householder Symposium XIX, June 2014,

### Xiaoye S. Li, Artem Napov, Francois-Henry Rouet, Designing multifrontal solvers using hierarchically semiseparable structures, SIAM Conference on Parallel Processing for Scientific Computing (PP12), Portland, OR, USA, February 2014,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

- Download File: sarje-thmmcs-pmbs.pdf (pdf: 2 MB)

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

### Eliot Gann , Slim Chourou , Abhinav Sarje , Harald Ade , Cheng Wang , Elaine Chan , Xiaodong Ding , Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano structured thin films systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters., American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

### L. Oliker, R. Biswas, P. Husbands, X. Li, Ordering Sparse Matrices for Cache-Based Systems, SIAM Conference on Parallel Processing, 2001,

- Download File: siampp01abstactb.pdf (pdf: 2.1 MB)

## Reports

### Alfredo Buttari, Serge Gratton, Xiaoye S. Li, Marième Ngom, François-Henry Rouet, David Titley-Peloquin, Clément Weisbecker, "Error Analysis of the Block Low-Rank LU factorization of dense matrices", IRIT-CERFACS, RT-APO-13-7, August 2013,

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space; thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the in X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### Ichitaro Yamazaki, Xiaoye Sherry Li, François-Henry Rouet, Bora Uçar, "Partitioning, Ordering and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver", Institut National Polytechnique de Toulouse, RT-APO-12-2, November 2011,

- Download File: reportPDSLin.pdf (pdf: 634 KB)

## Posters

### Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light-sources. This is an important tool for characterization of macromolecules and nano-particle systems applicable to applications such as design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.

We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.

We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

### Abhinav Sarje, Xiaoye Li, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Reconstructing Nanostructures from X-Ray Scattering Data", OLCF User Meeting, June 2015,

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.