# Xiaoye Sherry Li

Sherry Li is a Staff Scientist in the Computational Research Division of Lawrence Berkeley National Laboratory. She has worked on diverse problems in high performance scientific computations, including parallel computing, sparse matrix computations, high precision arithmetic, and combinatorial scientific computing. She has (co)authored over 75 publications in referred journals or conference proceedings, and contributed to several book chapters. She has contributed to the design and implementation of the following high quality, open source software packages: SuperLU, PDSLin, XBLAS, ARPREC, QD, and LAPACK. She has collaborated with many domain scientists to deploy the advanced mathematical software in their simulation codes, including those from accelerator structure modeling, plasma fusion energy study, and materials sciences. She earned her Ph.D. in Computer Science from UC Berkeley in 1996. She has served on the editorial boards of the *SIAM J. Scientific Comput.* and *ACM Trans. Math. Software*, as well as many program committees of the scientific conferences. She is a Senior Member of ACM.

## Journal Articles

### Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/ S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows to easily tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostrutures.

### Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, Maarten V. de Hoop, "A parallel geometric multifrontal solver using hierarchically semiseparable structure", Submitted to TOMS, 2013,

### L. Oliker. X. Li, P. Husbands, R. Biswas, "Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations", SIAM Review Journal, 2002,

- Download File: sirev02-sparse.pdf (pdf: 475 KB)

### L. Oliker, X. Li, G. Heber, R. Biswas, "Ordering Unstructured Meshes for Sparse Matrix Computations on Leading Parallel Systems", Seventh International Workshop on solving Irregularly Structured Problems in Parallel, 2000,

- Download File: irr00awk.pdf (pdf: 130 KB)

## Conference Papers

### Marc Baboulin, Xiaoye S. Li, Francois-Henry Rouet, "Using random butterfly transformations to avoid pivoting in sparse direct methods", Accepted to Vecpar 2014, 2014,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems at Supercomputing (SC13), Denver, CO, 2013,

### George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf, "Extending Summation Precision for Network Reduction Operations", 25th International Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, October 2013,

- Download File: sbac2013personal.pdf (pdf: 195 KB)

Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.

### Emmanuel Agullo, Patrick R. Amestoy, Alfredo Buttari, Abdou Guermouche, Guillaume Joslin, Jean-Yves L'Excellent, Xiaoye S. Li, Artem Napov, François-Henry Rouet, Mohamed Sid-Lakhdar, Shen Wang, Clément Weisbecker, Ichitaro Yamazaki., "Recent Advances in Sparse Direct Solvers", 22nd Conference on Structural Mechanics in Reactor Technology, August 18, 2013,

- Download File: paper3.pdf (pdf: 243 KB)

### Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)

### B. Gaeke, P. Husbands, X. Li, L. Oliker, K. Yelick, and R. Biswas, "Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines", International Parallel & Distributed Processing Symposium (IPDPS), 2002,

- Download File: ipdps02-iram.pdf (pdf: 91 KB)

### L. Oliker, R. Biswas, P. Husbands, X. Li, "Ordering Sparse Matrices for Cache-Based Systems", SIAM Conference on Parallel Processing, 2001,

- Download File: siampp01abstactb.pdf (pdf: 2.1 MB)

### L. Oliker, X. Li, P. Husbands, R. Biswas, "Ordering Schemes for Sparse Matrices using Modern Programming Paradigms", The IASTED International Conference on Applied Informatics (AI), 2001,

- Download File: ai01.pdf (pdf: 163 KB)

### L. Oliker, X. Li. G. Heber, R. Biswas, "Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms", 13th Interational Conference on Parallel and Distributed Computing Systems, 2000,

- Download File: pdcs00-pcg.pdf (pdf: 167 KB)

## Presentation/Talks

### Xiaoye S. Li, Artem Napov, Francois-Henry Rouet, Designing multifrontal solvers using hierarchically semiseparable structures, SIAM Conference on Parallel Processing for Scientific Computing (PP12), Portland, OR, USA, February 2014,

### Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

### Eliot Gann , Slim Chourou , Abhinav Sarje , Harald Ade , Cheng Wang , Elaine Chan , Xiaodong Ding , Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano structured thin films systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

### S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters., American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

## Reports

### Alfredo Buttari, Serge Gratton, Xiaoye S. Li, Marième Ngom, François-Henry Rouet, David Titley-Peloquin, Clément Weisbecker, "Error Analysis of the Block Low-Rank LU factorization of dense matrices", IRIT-CERFACS, RT-APO-13-7, August 2013,

### Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space; thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the in X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

### Ichitaro Yamazaki, Xiaoye Sherry Li, François-Henry Rouet, Bora Uçar, "Partitioning, Ordering and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver", Institut National Polytechnique de Toulouse, RT-APO-12-2, November 2011,

- Download File: reportPDSLin.pdf (pdf: 634 KB)