A - Z Index | Phone Book | Careers

Abhinav Sarje

ASarje
Abhinav Sarje
Research Scientist
Computational Research Division
Phone: 1 510 495 2793
Future Technologies Group
One Cyclotron Rd
MS 50A1148
Berkeley, CA 94720 US

Abhinav Sarje is a research scientist in the Future Technologies Group at Berkeley Lab. He was previously a postdoctoral researcher in the same group. He completed his doctoral studies in computer engineering at Iowa State University.

His research interests are in parallel algorithms and computing, high-performance scientific computing, algorithms and applications on emerging parallel architectures including multi/many core CPUs and GPUs, performance optimization, string and graph algorithms, machine learning, and computational biology.

Some Project Webpages

HipGISAXS

AllPairs

Journal Articles

Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/ S0021889813025843

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows to easily tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostrutures.

Abhinav Sarje, Srinivas Aluru, "All-pairs computations on many-core graphics processors", Parallel Computing, 2013, 39-2:79-93, doi: 10.1016/j.parco.2013.01.002

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize due to the independence of computing each pairwise interaction from all others, development of techniques to address multi-layered memory hierarchies, mapping within the restrictions imposed by the small and low-latency on-chip memories, striking the right balanced between concurrency, reuse and memory traffic etc., are crucial to obtain high-performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that a careful tuning of the involved set of decomposition parameters is essential to achieve high efficiency on the GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.

Developing high-performance applications on emerging multi- and many-core
architectures requires efficient mapping techniques and architecture-specific
tuning methodologies to realize performance closer to their peak compute
capability and memory bandwidth. In this paper, we develop architecture-aware
methods to accelerate all-pairs computations on many-core graphics processors.
Pairwise computations occur frequently in numerous application areas in
scientific computing. While they appear easy to parallelize due to the
independence of computing each pairwise interaction from all others, development
of techniques to address multi-layered memory hierarchies, mapping within the
restrictions imposed by the small and low-latency on-chip memories, striking the
right balanced between concurrency, reuse and memory traffic etc., are crucial
to obtain high-performance. We present a hierarchical decomposition scheme for
GPUs based on decomposition of the output matrix and input data. We demonstrate
that a careful tuning of the involved set of decomposition parameters is
essential to achieve high efficiency on the GPUs. We also compare the
performance of our strategies with an implementation on the STI Cell processor
as well as multi-core CPU parallelizations using OpenMP and Intel Threading
Building Blocks.Developing high-performance applications on emerging multi- and many-core
architectures requires efficient mapping techniques and architecture-specific
tuning methodologies to realize performance closer to their peak compute
capability and memory bandwidth. In this paper, we develop architecture-aware
methods to accelerate all-pairs computations on many-core graphics processors.
Pairwise computations occur frequently in numerous application areas in
scientific computing. While they appear easy to parallelize due to the
independence of computing each pairwise interaction from all others, development
of techniques to address multi-layered memory hierarchies, mapping within the
restrictions imposed by the small and low-latency on-chip memories, striking the
right balanced between concurrency, reuse and memory traffic etc., are crucial
to obtain high-performance. We present a hierarchical decomposition scheme for
GPUs based on decomposition of the output matrix and input data. We demonstrate
that a careful tuning of the involved set of decomposition parameters is
essential to achieve high efficiency on the GPUs. We also compare the
performance of our strategies with an implementation on the STI Cell processor
as well as multi-core CPU parallelizations using OpenMP and Intel Threading
Building Blocks.

Abhinav Sarje, Srinivas Aluru, "Accelerating Pairwise Computations on Cell Processors", Transactions on Parallel and Distributed Systems (TPDS), January 2011, 22-1:69-77, doi: 10.1109/TPDS.2010.65

Direct computation of all pairwise distances or interactions is a fundamental problem that arises in many application areas including particle or atomistic simulations, fluid dynamics, computational electromagnetics, materials science, genomics and systems biology, and clustering and data mining. In this paper, we present methods for performing such pairwise computations efficiently in parallel on Cell processors. This problem is particularly challenging on the Cell processor due to the small sized Local Stores of the Synergistic Processing Elements, the main computational cores of the processor. We present techniques for different variants of this problem including those with large number of entities or when the dimensionality of the information per entity is large. We demonstrate our methods in the context of multiple applications drawn from fluid dynamics, materials science and systems biology, and present detailed experimental results. Our software library is an open source and can be readily used by application scientists to accelerate pairwise computations using Cell accelerators.

Jaroslaw Zola, Maneesha Aluru, Abhinav Sarje, Srinivas Aluru, "Parallel Information-Theory-Based Construction of Genome-Wide Gene Regulatory Networks", Transactions on Parallel and Distributed Systems, December 2010, 21-12:1721-1733, doi: http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.59

Abhinav Sarje, Srinivas Aluru, "Parallel Genomic Alignments on the Cell Broadband Engine", Transactions on Parallel and Distributed Systems (TPDS), November 2009, 20-11:1600-1610, doi: http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.254

Conference Papers

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "High-Performance Inverse Modeling with Reverse Monte Carlo Simulations", International Conference on Parallel Processing, Minneapolis, MN, September 2014,

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems at Supercomputing (SC13), Denver, CO, 2013,

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,

Abhinav Sarje. Srinivas Aluru, "A MapReduce Style Framework for Computations on Trees", International Conference on Parallel Processing (ICPP), September 2010, doi: http://doi.ieeecomputersociety.org/10.1109/ICPP.2010.42

The emergence of cloud computing and Google's MapReduce paradigm is renewing interest in the development of broadly applicable high level abstractions as a means to deliver easy programmability and cyber resources to the user, while hiding complexities of system architecture, parallelism and algorithms, heterogeneity, and fault-tolerance. In this paper, we present a high-level framework for computations on tree structures. Despite the diversity and types of tree structures, and the algorithmic ways in which they are utilized, our abstraction provides sufficient generality to be broadly applicable. We show how certain frequently used operations on tree structures can be cast in terms of our framework. We further demonstrate the applicability of our framework by solving two applications -- k-nearest neighbors and fast multipole method (FMM) based simulations -- by merely using our framework in multiple ways. We developed a generic programming based implementation of the framework using C++ and MPI, and demonstrate its performance on the aforementioned applications using homogeneous multi-core clusters.

Book Chapters

Abhinav Sarje, Jaroslaw Zola, Srinivas Aluru, "Pairwise Computations on the Cell Processor with Applications in Computational Biology", Scientific Computing with Multicore and Accelerators, edited by J. Dongarra, D. A. Bader, J. Kurzak, (Chapman & Hall/CRC: December 2010) Pages: 297–327 doi: 10.1201/b10376-22

Abhinav Sarje, Srinivas Aluru, "Parallel Algorithms for Alignments on the Cell B.E.", Bioinformatics: High Performance Parallel Computer Architectures, edited by Bertil Schmidt, (Taylor & Francis Group/CRC: July 2010) Pages: 59–83 doi: 10.1201/EBK1439814888-c4

Presentation/Talks

Abhinav Sarje, Xiaoye Li, Alexander Hexemer, High-Performance Inverse Modeling with Reverse Monte Carlo Simulations, International Conference on Parallel Processing (ICPP), September 2014,

Abhinav Sarje, Towards Real-Time Nanostructure Prediction with GPUs, GPU Technology Conference, March 2014,

In the field of nanoparticle materials science, synchrotron light-sources play a
crucial role where X-ray scattering techniques are used for nanostructure prediction
through characterization of macromolecules and nanoparticle systems based on their
structural properties. Applications of these are widespread, including artificial
photosynthesis, solar cell membranes, photovoltaics and energy storage devices, smart
windows, high-density data storage media and drug discovery. Current state-of-the-art
high-throughput beamlines at light sources worldwide are capable of generating
terabytes of raw scattering data per week, and is continually growing. This has
created a big gap between data generation and data analysis. Consequently, the
beamline scientists and users have been faced with an extremely inefficient
utilization of the light sources, and they are expressing a growing need for real-time
data analysis tools to bridge this gap.

X-ray scattering comes in many flavors such as the widely used small angle X-ray
scattering (SAXS) and grazing incidence SAXS (GISAXS) which will be the case studies
in this session. Efforts are underway at Berkeley Lab to bring scattering data
analysis up to the speed of data generation through high-performance and parallel
computing. Such analysis is generally composed of two steps: 1) Forward Simulation,
and 2) Structural Fitting, which uses forward simulation as a building block. Forward
simulation of X-ray scattering experiments is an embarrassingly parallel computational
problem, making it an ideal candidate for implementation on many-core architectures
such as graphics processors and massively parallel computing. An example of such a
simulation code developed under these efforts is HipGISAXS, which is a
high-performance and massively parallel code capable of harnessing the computational
power offered by clusters of GPUs. HipGISAXS is a step towards real-time scattering
data analysis as it has already brought simulation times down to the order of
milliseconds and seconds from hours and days through the power of GPUs. The second
component, structural fitting, can be described as an inverse modeling and
computational optimization problem, involving a large number of variable parameters,
making it highly compute-intensive. An example of inverse modeling code, also
developed at Berkeley Lab, is HipRMC. This is a Reverse Monte Carlo (RMC), a popular
method to extract information from SAXS data, based implementation which utilizes GPU
computing to provide fast results.

Although GPUs are able to deliver high computational power through naive
implementations, they require intensive architecture-aware code tuning in order to
attain performance rates closer to their theoretical peaks. Such optimizations involve
mapping of computations and data transfers perfectly on to the architecture. HipGISAXS
and HipRMC include optimizations which enable them to perform significantly better
than other processor architectures.

Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems at Supercomputing (SC'13), November 18, 2013,

Abhinav Sarje, Synchrotron Light-source Data Analysis through Massively-parallel GPU Computing, NVIDIA GPU Technology Conference (GTC), March 2013,

Light scattering techniques are widely used for the characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their properties, such as their structures and sizes, at the micro/nano-scales. One of the major applications of these is in the characterization of materials for the design and fabrication of energy-relevant nanodevices, such as photovoltaic and energy storage devices. Although current high-throughput synchrotron light-sources can generate tremendous amounts of raw data at a high rate, analysis of this data for the characterization processes remains the primary bottleneck, demanding large amounts of computational resources. In this work, we are developing high-performance parallel algorithms and codes on large-scale GPU clusters to address this bottleneck. Here, we will discuss our efforts and experiences in developing "architecture-aware" hybrid multi-GPU multi-CPU codes for two such most important analysis steps. First is simulation of X-ray scattering patterns for any given sample morphology, and second is structural fitting of such scattering patterns. Both steps involve a large number of variable parameters, and hence, require high computational power.
Our X-ray scattering pattern simulation code is based on the Distorted Wave Born Approximation (DWBA) theory, and involves a large number of compute-intensive form-factor calculations. A form-factor is computed as an integral over the shape functions of the nanoparticles in a sample. A simulated sample structure is taken as an input in the form of discretized shape-surfaces, such as a triangulated surface. Resolutions of such discretization, as well as of a spatial 3-D grid involved, also contribute toward the compute-intensity of the simulations. Our code uses hybrid GPU and multicore CPU acceleration for generation of high-resolution scattering patterns, for the given input, using various possible values of the input parameters. These parameters include a number of sample definition, experimental setup and computational parameters. The patterns obtained through the X-ray scattering simulations carry vital information about the structural properties of the materials in the sample. In order to extract meaningful structural information from the scattering patterns, structural fitting, as an inverse modeling problem, is used. Our codes implement a fast and scalable solution to this process through a Reverse Monte Carlo (RMC) simulation algorithm. This process computes structure-factors in each simulation step until a result fits the input image pattern within allowed error range. These computations require a large number of fast Fourier transform (FFTs) calculations which are also accelerated on hybrid GPU and CPU systems in our codes for high-performance.
Our codes are designed as GPU-architecture-aware implementations, and deliver high-performance through dynamic selection of the best-performing computational parameters, such as the computation decomposition parameters, block sizes, for the system being used. We perform a detailed study of the effects of these parameters on the code performance, and use this information to guide the parameter value selection process. We also carry out performance analysis of the optimal codes and study its scalings, including scaling to large GPU clusters. Our codes obtain near linear scaling with respect to the cluster size in our experiments, and we believe that these are "future-ready".

Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Alexander Hexemer, Massively Parallel X-Ray Scattering Simulations, Supercomputing (SC), November 2012,

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,

Tutorial session

Eliot Gann , Slim Chourou , Abhinav Sarje , Harald Ade , Cheng Wang , Elaine Chan , Xiaodong Ding , Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,

Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano structured thin films systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.

S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters., American Physical Society March Meeting 2012, February 2012,

We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.

Reports

Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", LBNL Technical Report, February 2013, LBNL 6076E,

Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL LBNL-5351E,

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space; thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the in X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.

Abhinav Sarje, Srinivas Aluru, "A Mapreduce Style Framework for Trees", Iowa State University, June 2009,

Thesis/Dissertations

Applications on emerging paradigms in parallel computing, Abhinav Sarje, PhD, Iowa State University, July 2010,

The area of computing is seeing parallelism increasingly being incorporated at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi- threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and scaling them to large distributed systems as server farms and clouds. All together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools, which make use of the available parallelism, is inevitable in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically, parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors, and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi- core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles, and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google’s MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processor, and demonstrate its applicability through two applications: all-k-nearest neighbors computations, and Fast Multipole Method (FMM) based simulations.

Posters

Abhinav Sarje, Xiaoye Li, Slim Chourou, Alexander Hexemer, "Petascale X-Ray Scattering Simulations With GPUs", GPU Technology Conference, March 2014,

Abhinav Sarje, Xiaoye Li, Alexander Hexemer, "Inverse Modeling of X-Ray Scattering Data With Reverse Monte Carlo Simulations", GPU Technology Conference, March 2014,

Abhinav Sarje, Jack Pien, Xiaoye Li, "GPU Clusters for Large-Scale Analysis of X-ray Scattering Data", GPU Technology Conference (GTC), January 2012,

Today's emerging architectures have higher levels of parallelism incorporated within a processor. They require efficient strategies to extract the performance they have to offer. In our work, we develop architecture-aware parallel strategies to perform various kinds of pairwise computations - pairwise genomic alignments, and scheduling large number of general pairwise computations with applications to computational systems biology and materials science. We present our schemes in the context of the IBM Cell BE, an example of a heterogeneous multicore, but are nevertheless applicable to any similar architecture, as well as general multicores with our strategies being cache-aware.