Performance and Algorithms Research

Abhinav Sarje

Research Scientist
Computer Science Department
Phone: 1 510 495 2793
Fax: 1 510 486 6459
Office: 59-4029M
One Cyclotron Rd
M/S 59R4104
Berkeley, CA 94720, USA

Abhinav Sarje is a research scientist in the Performance and Algorithms Research group of the Computer Science department at the Berkeley Lab. He completed his doctoral studies in computer engineering at Iowa State University.

His primary research interests are in parallel algorithms, high-performance computing, machine learning, performance engineering, string and graph algorithms, and computational biology.

Also visit www.abhinavsarje.net for some outdated information.

Some Project Webpages

HipGISAXS

HipRMC

SciDAC/SUPER

Multiscale

Journal Articles

S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer, "A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering", Journal of Applied Crystallography, December 2016, 49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.
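
The central quantity mentioned above, the laterally averaged refractive index as a function of depth, can be illustrated with a minimal sketch (not the paper's Fourier-based algorithm). The Python/NumPy fragment below assumes a voxelized sample; the grid dimensions, refractive index values and the embedded spherical particle are illustrative assumptions, and the code simply averages each depth slice to obtain the reference profile a multi-slice DWBA calculation would start from.

    import numpy as np

    # Minimal sketch: given a voxelized sample with a complex refractive index
    # assigned to every voxel, compute the laterally averaged refractive index
    # as a function of depth, one value per slice. Grid shape, index values and
    # the embedded sphere are illustrative assumptions.
    nz, ny, nx = 64, 128, 128                       # depth, lateral dimensions (voxels)
    n_film, n_particle = 1 - 2e-6 + 1e-8j, 1 - 5e-6 + 2e-8j

    sample = np.full((nz, ny, nx), n_film, dtype=complex)
    zz, yy, xx = np.mgrid[0:nz, 0:ny, 0:nx]
    # embed a spherical nanoparticle as an example structure
    sample[(zz - 32)**2 + (yy - 64)**2 + (xx - 64)**2 < 20**2] = n_particle

    # average refractive index per slice (axis 0 = depth), the quantity used to
    # define the unperturbed reference system in a multi-slice DWBA calculation
    avg_profile = sample.mean(axis=(1, 2))
    for iz in range(5):
        print(f"slice {iz}: <n> = {avg_profile[iz]}")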

Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud, "High-Performance I/O: HDF5 for Lattice QCD", arXiv:1501.06992, January 2015,

Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.
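
As a rough illustration of the kind of chunked, parallel HDF5 output discussed in the paper, here is a minimal h5py/mpi4py sketch; it is not the USQCD implementation. It assumes an MPI-enabled build of HDF5/h5py, and the lattice dimensions, dataset name and one-chunk-per-time-slice layout are illustrative assumptions.

    # Minimal sketch of chunked, parallel HDF5 output (not the USQCD code).
    # Assumes an MPI-enabled build of HDF5/h5py; lattice dimensions, dataset
    # name and the chunking scheme are illustrative assumptions.
    from mpi4py import MPI
    import h5py
    import numpy as np

    comm = MPI.COMM_WORLD
    nt, nz, ny, nx = 64, 32, 32, 32        # hypothetical lattice dimensions
    t_local = nt // comm.size              # assumes comm.size divides nt evenly
    t0 = comm.rank * t_local               # each rank owns a block of time slices

    with h5py.File("gauge_field.h5", "w", driver="mpio", comm=comm) as f:
        # chunk by time slice so every rank writes whole chunks
        dset = f.create_dataset("configuration/gauge",
                                shape=(nt, nz, ny, nx, 4, 3, 3, 2),
                                dtype="f8",
                                chunks=(1, nz, ny, nx, 4, 3, 3, 2))
        local = np.zeros((t_local, nz, ny, nx, 4, 3, 3, 2))   # placeholder field data
        dset[t0:t0 + t_local] = local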

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows one to easily tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostructures.

Abhinav Sarje, Srinivas Aluru, "All-pairs computations on many-core graphics processors", Parallel Computing, 2013, 39-2:79-93, doi: 10.1016/j.parco.2013.01.002

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize due to the independence of computing each pairwise interaction from all others, development of techniques to address multi-layered memory hierarchies, mapping within the restrictions imposed by the small, low-latency on-chip memories, and striking the right balance between concurrency, reuse and memory traffic is crucial to obtaining high performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that a careful tuning of the involved set of decomposition parameters is essential to achieve high efficiency on the GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.
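
The output-matrix decomposition idea can be sketched in a few lines of plain NumPy; this is only a CPU analogue of the scheme, not the CUDA, Cell or TBB implementations, and the tile size and data shapes are arbitrary assumptions. On a GPU the tiles would be sized to fit the small on-chip memories the abstract refers to.

    import numpy as np

    # Illustrative sketch of the output-tile decomposition for all-pairs
    # distances. Tile size and data dimensions are arbitrary; on a GPU each
    # tile would be sized to fit shared memory and mapped to a thread block.
    def allpairs_distances(points, tile=256):
        n, d = points.shape
        out = np.empty((n, n))
        for i0 in range(0, n, tile):          # rows of the output matrix
            a = points[i0:i0 + tile]
            for j0 in range(0, n, tile):      # columns of the output matrix
                b = points[j0:j0 + tile]
                diff = a[:, None, :] - b[None, :, :]
                out[i0:i0 + a.shape[0], j0:j0 + b.shape[0]] = np.sqrt((diff**2).sum(-1))
        return out

    pts = np.random.rand(1000, 3)
    dist = allpairs_distances(pts)
    print(dist.shape)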

Abhinav Sarje, Srinivas Aluru, "Accelerating Pairwise Computations on Cell Processors", Transactions on Parallel and Distributed Systems (TPDS), January 2011, 22-1:69-77, doi: 10.1109/TPDS.2010.65

Direct computation of all pairwise distances or interactions is a fundamental problem that arises in many application areas including particle or atomistic simulations, fluid dynamics, computational electromagnetics, materials science, genomics and systems biology, and clustering and data mining. In this paper, we present methods for performing such pairwise computations efficiently in parallel on Cell processors. This problem is particularly challenging on the Cell processor due to the small-sized Local Stores of the Synergistic Processing Elements, the main computational cores of the processor. We present techniques for different variants of this problem including those with a large number of entities or when the dimensionality of the information per entity is large. We demonstrate our methods in the context of multiple applications drawn from fluid dynamics, materials science and systems biology, and present detailed experimental results. Our software library is open source and can be readily used by application scientists to accelerate pairwise computations using Cell accelerators.

Jaroslaw Zola, Maneesha Aluru, Abhinav Sarje, Srinivas Aluru, "Parallel Information-Theory-Based Construction of Genome-Wide Gene Regulatory Networks", Transactions on Parallel and Distributed Systems, December 2010, 21-12:1721-1733, doi: 10.1109/TPDS.2010.59

Abhinav Sarje, Srinivas Aluru, "Parallel Genomic Alignments on the Cell Broadband Engine", Transactions on Parallel and Distributed Systems (TPDS), November 2009, 20-11:1600-1610, doi: 10.1109/TPDS.2008.254

Conference Papers

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, Brian Friesen, Yun (Helen) He, Thorsten Kurth, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Abhinav Sarje, Jean-Luc Vay, Henri Vincenti, Samuel Williams, Pierre Carrier, Nathan Wichmann, Marcus Wagner, Paul Kent, Christopher Kerr, John Dennis, "Evaluating and Optimizing the NERSC Workload on Knights Landing", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2016,

Abhinav Sarje, Xiaoye S Li, Nicholas Wright, "Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis", International Workshop on Multicore Software Engineering at EuroPar, 2016,

Abhinav Sarje, Douglas W. Jacobsen, Samuel W. Williams, Todd Ringler, Leonid Oliker, "Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers", Cray User Group (CUG), London, UK, May 2016,

Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker, "Parallel Performance Optimizations on Unstructured Mesh-Based Simulations", Procedia Computer Science, June 2015, 51:2016-2025, doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitionings with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.
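
As one concrete illustration of locality-improving reordering of unstructured mesh data (not necessarily the predictive ordering used for MPAS-Ocean in the paper), the sketch below applies a reverse Cuthill-McKee permutation to a small, made-up cell adjacency graph so that neighbouring cells end up close together in memory.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    # Illustration only: reorder mesh cells so that neighbouring cells are close
    # in memory, one standard way to improve cache reuse on unstructured meshes.
    # The 6-cell adjacency below is made up; RCM stands in for the paper's own
    # ordering scheme.
    edges = [(0, 3), (0, 5), (1, 4), (2, 5), (3, 4), (4, 5)]
    rows = [i for i, j in edges] + [j for i, j in edges]
    cols = [j for i, j in edges] + [i for i, j in edges]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))

    perm = reverse_cuthill_mckee(adj, symmetric_mode=True)
    print("new cell order:", perm)   # apply this permutation to the per-cell arrays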

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, Denver, CO, Springer International Publishing, 2014, 8551:217-238, doi: 10.1007/978-3-319-10214-6_11

With the continual development of multi- and many-core architectures, there is a constant need for architecture-specific tuning of application codes in order to realize high computational performance and energy efficiency, closer to the theoretical peaks of these architectures. In this paper, we present optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code [9], on various massively parallel state-of-the-art supercomputers based on multi- and many-core processors. In particular, we target clusters of general-purpose multi-cores such as Intel Sandy Bridge and AMD Magny Cours, and many-core accelerators like Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies on these platforms. We cover a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters for dynamic selection of their optimal values to ensure high performance and high efficiency.
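
The autotuning idea, timing a kernel over a small set of candidate parameter values and keeping the fastest, can be sketched as below. The blocked matrix product and the candidate block sizes are stand-ins for illustration, not HipGISAXS's actual kernels or tuning parameters.

    import time
    import numpy as np

    # Minimal autotuning sketch: time a kernel for each candidate value of a
    # tunable parameter and keep the fastest. The kernel and block sizes here
    # are illustrative stand-ins.
    def blocked_matmul(a, b, block):
        n = a.shape[0]
        c = np.zeros((n, n))
        for i in range(0, n, block):
            for k in range(0, n, block):
                c[i:i+block] += a[i:i+block, k:k+block] @ b[k:k+block]
        return c

    a, b = np.random.rand(512, 512), np.random.rand(512, 512)
    timings = {}
    for block in (32, 64, 128, 256):
        t0 = time.perf_counter()
        blocked_matmul(a, b, block)
        timings[block] = time.perf_counter() - t0

    best = min(timings, key=timings.get)
    print("selected block size:", best)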

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", 43rd International Conference on Parallel Processing, Minneapolis, MN, IEEE, September 2014, 201-210, doi: 10.1109/ICPP.2014.29

In the field of nanoparticle materials science, X-ray scattering techniques are widely used for characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their structural properties at the micro- and nano-scales. Numerous applications utilize these, including design and fabrication of energy-relevant nanodevices such as photovoltaic and energy storage devices. Due to its sheer volume, analysis of the raw data obtained through present ultra-fast light beamlines and X-ray scattering detectors has been a primary bottleneck in such characterization processes. To address this hurdle, we are developing high-performance parallel algorithms and codes for analysis of X-ray scattering data for several of the scattering methods, such as Small-Angle X-ray Scattering (SAXS), which we discuss in this paper. As an inverse modeling problem, structural fitting of the raw data obtained through SAXS experiments is a method used for extracting meaningful information on the structural properties of materials. Such fitting processes involve a large number of variable parameters and, hence, require a large amount of computational power. In this paper, we focus on this problem and present a high-performance and scalable parallel solution based on the Reverse Monte Carlo simulation algorithm, on highly-parallel systems such as clusters of multicore CPUs and graphics processors. We have implemented and optimized our algorithm on generic multi-core CPUs as well as Nvidia GPU architectures with C++ and CUDA. We also present detailed performance results and computational analysis of our code.
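
A toy version of the Reverse Monte Carlo loop conveys the structure of the method; this is only a sketch of the general RMC idea, not HipRMC. A binary 2-D grid is perturbed one pixel swap at a time, |FFT|^2 of the grid serves as the simulated scattering pattern, and moves are accepted or rejected by a Metropolis criterion on the chi-squared misfit to a reference pattern. Grid size, step count and the acceptance temperature are arbitrary choices.

    import numpy as np

    # Toy Reverse Monte Carlo loop: perturb a binary grid, score |FFT|^2 against
    # a reference pattern, and accept or reject by a Metropolis criterion.
    rng = np.random.default_rng(0)
    N, steps, temperature = 64, 2000, 1e-3

    target_grid = rng.random((N, N)) < 0.3          # "unknown" structure
    target = np.abs(np.fft.fft2(target_grid))**2    # reference scattering pattern

    grid = rng.random((N, N)) < 0.3                 # initial guess
    chi2 = ((np.abs(np.fft.fft2(grid))**2 - target)**2).sum()

    for _ in range(steps):
        # propose a move: swap one occupied pixel with one empty pixel
        occ, emp = np.argwhere(grid), np.argwhere(~grid)
        a = tuple(occ[rng.integers(len(occ))])
        b = tuple(emp[rng.integers(len(emp))])
        grid[a], grid[b] = False, True
        new_chi2 = ((np.abs(np.fft.fft2(grid))**2 - target)**2).sum()
        if new_chi2 < chi2 or rng.random() < np.exp((chi2 - new_chi2) / (temperature * chi2)):
            chi2 = new_chi2                          # accept the move
        else:
            grid[a], grid[b] = True, False           # reject: undo the swap

    print("final chi2:", chi2)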

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code flexible enough to compute scattered light intensities in all spatial directions, allowing full reconstruction of GISAXS patterns for arbitrarily complex structures at high resolution while reducing simulation times from months to minutes.

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

Abhinav Sarje, Srinivas Aluru, "A MapReduce Style Framework for Computations on Trees", International Conference on Parallel Processing (ICPP), September 2010, doi: 10.1109/ICPP.2010.42

The emergence of cloud computing and Google's MapReduce paradigm is renewing interest in the development of broadly applicable high level abstractions as a means to deliver easy programmability and cyber resources to the user, while hiding complexities of system architecture, parallelism and algorithms, heterogeneity, and fault-tolerance. In this paper, we present a high-level framework for computations on tree structures. Despite the diversity and types of tree structures, and the algorithmic ways in which they are utilized, our abstraction provides sufficient generality to be broadly applicable. We show how certain frequently used operations on tree structures can be cast in terms of our framework. We further demonstrate the applicability of our framework by solving two applications -- k-nearest neighbors and fast multipole method (FMM) based simulations -- by merely using our framework in multiple ways. We developed a generic programming based implementation of the framework using C++ and MPI, and demonstrate its performance on the aforementioned applications using homogeneous multi-core clusters.
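
The flavor of such a user-defined-function-driven tree abstraction can be sketched as follows; the interface and names here are hypothetical and much simpler than the paper's MPI-based framework, but they show how a user supplies only a leaf computation and a combine step while the framework owns the traversal.

    # Hypothetical sketch of a tree-computation abstraction driven by two
    # user-defined functions (not the paper's actual API): the framework walks
    # the tree, the user provides the leaf computation and the combine step.
    from dataclasses import dataclass, field
    from typing import Any, Callable, List

    @dataclass
    class Node:
        data: Any = None
        children: List["Node"] = field(default_factory=list)

    def tree_compute(node: Node,
                     leaf_fn: Callable[[Any], Any],
                     combine_fn: Callable[[List[Any]], Any]) -> Any:
        """Generic bottom-up traversal; the two callables define the application."""
        if not node.children:
            return leaf_fn(node.data)
        return combine_fn([tree_compute(c, leaf_fn, combine_fn) for c in node.children])

    # Example use: count the points stored in the leaves of a spatial tree.
    tree = Node(children=[Node(data=[1, 2, 3]),
                          Node(children=[Node(data=[4]), Node(data=[5, 6])])])
    print(tree_compute(tree, leaf_fn=len, combine_fn=sum))   # -> 6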

Book Chapters

Abhinav Sarje, Jaroslaw Zola, Srinivas Aluru, "Pairwise Computations on the Cell Processor with Applications in Computational Biology", Scientific Computing with Multicore and Accelerators, edited by J. Dongarra, D. A. Bader, J. Kurzak, (Chapman & Hall/CRC: December 2010) Pages: 297–327 doi: 10.1201/b10376-22

Abhinav Sarje, Srinivas Aluru, "Parallel Algorithms for Alignments on the Cell B.E.", Bioinformatics: High Performance Parallel Computer Architectures, edited by Bertil Schmidt, (Taylor & Francis Group/CRC: July 2010) Pages: 59–83 doi: 10.1201/EBK1439814888-c4

Presentation/Talks

Abhinav Sarje, Achieving High Parallel Efficiency on Modern Processors for X-ray Scattering Data Analysis, EuroPar 2016, August 22, 2016,

Abhinav Sarje, Exploiting Thread Parallelism for Ocean Modeling on Cray XC Supercomputers, Cray Users Group (CUG), May 12, 2016,

Abhinav Sarje, Particle Swarm Optimization, DUNE Wire-Cell Reconstruction Summit, December 2015,

Abhinav Sarje, High-Performance X-Ray Scattering Data Analysis, Intel Xeon Phi Users Group (IXPUG) Annual Meeting 2015, September 30, 2015,

Abhinav Sarje, Computing Nanostructures at Scale, OLCF User Meeting, June 2015,

The inverse modeling, or structural fitting, problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. X-ray scattering based extraction of structural information from material samples is an important tool for nanostructure prediction through characterization of macromolecules and nanoparticle systems, applicable to numerous applications such as the design of energy-relevant nano-devices. At Berkeley Lab, we are developing high-performance solutions for analysis of such raw data. In our work we exploit the massive parallelism available in clusters of GPUs, such as the Titan supercomputer, to gain efficiency in the reconstruction process. We explore the application of various numerical optimization algorithms, ranging from gradient-based quasi-Newton methods and derivative-free trust-region-based methods to the stochastic algorithm of Particle Swarm Optimization, in a massively parallel fashion. https://vimeo.com/133558018
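
For reference, a serial, toy particle swarm optimization loop looks like the sketch below; the objective function is a stand-in for the scattering misfit, and the inertia and acceleration constants are common textbook values rather than the settings used in the runs described above.

    import numpy as np

    # Minimal serial PSO sketch with a toy objective standing in for the misfit.
    rng = np.random.default_rng(1)
    dim, n_particles, iters = 4, 32, 200
    w, c1, c2 = 0.7, 1.5, 1.5               # textbook inertia/acceleration values

    def objective(x):                       # stand-in for the scattering misfit
        return np.sum((x - 0.5)**2, axis=-1)

    pos = rng.random((n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), objective(pos)
    gbest = pbest[np.argmin(pbest_val)].copy()

    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        val = objective(pos)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()

    print("best parameters:", gbest, "misfit:", pbest_val.min())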

Abhinav Sarje, Parallel Performance Optimizations on Unstructured Mesh-Based Simulations, International Conference on Computational Science, June 2015,

Abhinav Sarje, Performance Profiling of Parallel Codes, April 2015,

Abhinav Sarje, Recovering Structural Information about Nanoparticle Systems, Nvidia GPU Technology Conference, March 19, 2015,

The inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. This session will give an introduction and overview to this problem and its solutions as being developed at the Berkeley Lab. X-ray scattering based extraction of structural information from material samples is an important tool applicable to numerous applications such as design of energy-relevant nano-devices. We exploit the use of parallelism available in clusters of GPUs to gain efficiency in the reconstruction process. To develop a solution, we apply Particle Swarm Optimization (PSO) in a massively parallel fashion, and develop high-performance codes and analyze the performance.

Abhinav Sarje, Xiaoye Li, Alexander Hexemer, High-Performance Inverse Modeling with Reverse Monte Carlo Simulations, International Conference on Parallel Processing (ICPP), September 2014,

Abhinav Sarje, Towards Real-Time Nanostructure Prediction with GPUs, GPU Technology Conference, March 2014,

In the field of nanoparticle materials science, synchrotron light-sources play a crucial role where X-ray scattering techniques are used for nanostructure prediction through characterization of macromolecules and nanoparticle systems based on their structural properties. Applications of these are widespread, including artificial photosynthesis, solar cell membranes, photovoltaics and energy storage devices, smart windows, high-density data storage media and drug discovery. Current state-of-the-art high-throughput beamlines at light sources worldwide are capable of generating terabytes of raw scattering data per week, and this volume is continually growing. This has created a big gap between data generation and data analysis. Consequently, beamline scientists and users have been faced with an extremely inefficient utilization of the light sources, and they are expressing a growing need for real-time data analysis tools to bridge this gap.

X-ray scattering comes in many flavors, such as the widely used small angle X-ray scattering (SAXS) and grazing incidence SAXS (GISAXS), which are the case studies in this session. Efforts are underway at Berkeley Lab to bring scattering data analysis up to the speed of data generation through high-performance and parallel computing. Such analysis is generally composed of two steps: 1) Forward Simulation, and 2) Structural Fitting, which uses forward simulation as a building block. Forward simulation of X-ray scattering experiments is an embarrassingly parallel computational problem, making it an ideal candidate for implementation on many-core architectures such as graphics processors and massively parallel computing. An example of such a simulation code developed under these efforts is HipGISAXS, a high-performance and massively parallel code capable of harnessing the computational power offered by clusters of GPUs. HipGISAXS is a step towards real-time scattering data analysis, as it has already brought simulation times down from hours and days to the order of milliseconds and seconds through the power of GPUs. The second component, structural fitting, can be described as an inverse modeling and computational optimization problem involving a large number of variable parameters, making it highly compute-intensive. An example of an inverse modeling code, also developed at Berkeley Lab, is HipRMC. It is an implementation based on Reverse Monte Carlo (RMC), a popular method for extracting information from SAXS data, and it utilizes GPU computing to provide fast results.

Although GPUs are able to deliver high computational power even through naive implementations, they require intensive architecture-aware code tuning in order to attain performance rates closer to their theoretical peaks. Such optimizations involve mapping computations and data transfers onto the architecture as closely as possible. HipGISAXS and HipRMC include such optimizations, which enable them to perform significantly better than implementations on other processor architectures.

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

Abhinav Sarje, Synchrotron Light-source Data Analysis through Massively-parallel GPU Computing, NVIDIA GPU Technology Conference (GTC), March 2013,

Light scattering techniques are widely used for the characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their properties, such as their structures and sizes, at the micro/nano-scales. One of the major applications of these is in the characterization of materials for the design and fabrication of energy-relevant nanodevices, such as photovoltaic and energy storage devices. Although current high-throughput synchrotron light-sources can generate tremendous amounts of raw data at a high rate, analysis of this data for the characterization processes remains the primary bottleneck, demanding large amounts of computational resources. In this work, we are developing high-performance parallel algorithms and codes on large-scale GPU clusters to address this bottleneck. Here, we will discuss our efforts and experiences in developing "architecture-aware" hybrid multi-GPU multi-CPU codes for two of the most important analysis steps. The first is simulation of X-ray scattering patterns for any given sample morphology, and the second is structural fitting of such scattering patterns. Both steps involve a large number of variable parameters, and hence, require high computational power.

Our X-ray scattering pattern simulation code is based on the Distorted Wave Born Approximation (DWBA) theory, and involves a large number of compute-intensive form-factor calculations. A form factor is computed as an integral over the shape functions of the nanoparticles in a sample. A simulated sample structure is taken as input in the form of discretized shape-surfaces, such as a triangulated surface. The resolutions of such discretizations, as well as of the 3-D spatial grid involved, also contribute to the compute intensity of the simulations. Our code uses hybrid GPU and multicore CPU acceleration for the generation of high-resolution scattering patterns for the given input, using various possible values of the input parameters. These parameters include a number of sample-definition, experimental-setup and computational parameters. The patterns obtained through the X-ray scattering simulations carry vital information about the structural properties of the materials in the sample. In order to extract meaningful structural information from the scattering patterns, structural fitting, as an inverse modeling problem, is used. Our codes implement a fast and scalable solution to this process through a Reverse Monte Carlo (RMC) simulation algorithm. This process computes structure factors in each simulation step until a result fits the input image pattern within an allowed error range. These computations require a large number of fast Fourier transform (FFT) calculations, which are also accelerated on hybrid GPU and CPU systems in our codes for high performance.

Our codes are designed as GPU-architecture-aware implementations, and deliver high performance through dynamic selection of the best-performing computational parameters, such as the computation decomposition parameters and block sizes, for the system being used. We perform a detailed study of the effects of these parameters on the code performance, and use this information to guide the parameter value selection process. We also carry out performance analysis of the optimized codes and study their scaling, including scaling to large GPU clusters. Our codes obtain near-linear scaling with respect to cluster size in our experiments, and we believe that they are "future-ready".
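
To make the form-factor notion concrete, the sketch below evaluates the one shape with a simple closed form, a sphere of radius R, for which F(q) = 4*pi*R^3*(sin(qR) - qR*cos(qR))/(qR)^3. The production code instead evaluates form factors numerically over triangulated shape surfaces for arbitrary morphologies, and the radius and q-range chosen here are arbitrary.

    import numpy as np

    # Analytic form factor of a sphere of radius R, the textbook special case;
    # arbitrary shapes require numerical integration over the shape surface.
    R = 10.0                                   # radius in nm (arbitrary choice)
    q = np.linspace(1e-4, 1.0, 2000)           # scattering vector magnitudes, 1/nm
    qR = q * R
    form_factor = 4 * np.pi * R**3 * (np.sin(qR) - qR * np.cos(qR)) / qR**3
    intensity = np.abs(form_factor)**2         # what a detector pixel would measure
    print("I(q) at q = 0.1 1/nm:", np.interp(0.1, q, intensity))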

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Alexander Hexemer, Massively Parallel X-Ray Scattering Simulations, Supercomputing (SC), November 2012,

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

Eliot Gann, Slim Chourou, Abhinav Sarje, Harald Ade, Cheng Wang, Elaine Chan, Xiaodong Ding, Alexander Hexemer, "An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns", American Physical Society March Meeting 2012, March 2012,

Grazing incidence scattering is becoming critical in the characterization of the ensemble statistical properties of complex layered and nano-structured thin-film systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculations.

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "GISAXS simulation and analysis on GPU clusters", American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows one to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

Abhinav Sarje, Next-Generation Scientific Computing with Graphics Processors, Beijing Computational Science Research Center, February 2012,

Reports

"Machine learning and understanding for intelligent extreme scale scientific computing and discovery", DOE ASCR Machine Learning Workshop Report, January 2015, doi: 10.2172/1471083

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups of over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the X-ray scattering simulation model, the earlier single-CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use a GPU gave around a 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

Abhinav Sarje, Srinivas Aluru, "A MapReduce Style Framework for Trees", Iowa State University, June 2009,

Thesis/Dissertations

Applications on emerging paradigms in parallel computing, Abhinav Sarje, PhD, Iowa State University, July 2010,

The area of computing is seeing parallelism increasingly being incorporated at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi-threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and scaling them to large distributed systems such as server farms and clouds. Together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools, which make use of the available parallelism, is inevitable in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically the parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise mutual information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structure discovery in the design of flapping-wing Micro Air Vehicles; and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google’s MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming based library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest-neighbors computations, and Fast Multipole Method (FMM) based simulations.

Posters

Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism", Supercomputing (SC'15), November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light-sources. This is an important tool for characterization of macromolecules and nano-particle systems, with applications such as the design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.
We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.
We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

Abhinav Sarje, Xiaoye Li, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer, "Reconstructing Nanostructures from X-Ray Scattering Data", OLCF User Meeting, June 2015,

Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer, "Recovering Nanostructures from X-Ray Scattering Data", Nvidia GPU Technology Conference (GTC), March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.

Abhinav Sarje, Xiaoye Li, Slim Chourou, Alexander Hexemer, "Petascale X-Ray Scattering Simulations With GPUs", GPU Technology Conference, March 2014,

Abhinav Sarje, Xiaoye Li, Alexander Hexemer, "Inverse Modeling of X-Ray Scattering Data With Reverse Monte Carlo Simulations", GPU Technology Conference, March 2014,

Abhinav Sarje, Jack Pien, Xiaoye Li, "GPU Clusters for Large-Scale Analysis of X-ray Scattering Data", GPU Technology Conference (GTC), January 2012,

Today's emerging architectures incorporate higher levels of parallelism within a processor, and require efficient strategies to extract the performance they have to offer. In our work, we develop architecture-aware parallel strategies to perform various kinds of pairwise computations: pairwise genomic alignments, and scheduling a large number of general pairwise computations, with applications to computational systems biology and materials science. We present our schemes in the context of the IBM Cell BE, an example of a heterogeneous multicore, but they are nevertheless applicable to any similar architecture, as well as to general multicores, with our strategies being cache-aware.