Abhinav Sarje's research interests are in parallel algorithms and computing, high-performance scientific computing, algorithms and applications on emerging parallel architectures including multi/many core CPUs and GPUs, performance optimization, and string and graph algorithms.
He completed his doctoral studies in computer engineering at Iowa State University, under Dr. Srinvas Aluru. He is presently a postdoctoral researcher in the Future Technologies Group at Berkeley Lab since September 2011.
Some Project Webpages
Slim T. Chourou, Abhinav Sarje, Xiaoye Li, Elaine Chan and Alexander Hexemer, "HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data", Journal of Applied Crystallography, 2013, 46:1781-1795, doi: 10.1107/ S0021889813025843
We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code in the framework of the Distorted Wave Born Approximation (DWBA) that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies in a user-defined region of the reciprocal space for all possible grazing incidence angles and sample orientations. This flexibility then allows to easily tackle a wide range of possible sample structures such as nanoparticles on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests show good agreement with experimental data for a variety of commonly encountered nanostrutures.
Abhinav Sarje, Srinivas Aluru, "All-pairs computations on many-core graphics processors", Parallel Computing, 2013, 39-2:79-93, doi: 10.1016/j.parco.2013.01.002
Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize due to the independence of computing each pairwise interaction from all others, development of techniques to address multi-layered memory hierarchies, mapping within the restrictions imposed by the small and low-latency on-chip memories, striking the right balanced between concurrency, reuse and memory traffic etc., are crucial to obtain high-performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that a careful tuning of the involved set of decomposition parameters is essential to achieve high efficiency on the GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.
Abhinav Sarje, Srinivas Aluru, "Accelerating Pairwise Computations on Cell Processors", Transactions on Parallel and Distributed Systems (TPDS), January 2011, 22-1:69-77, doi: 10.1109/TPDS.2010.65
Direct computation of all pairwise distances or interactions is a fundamental problem that arises in many application areas including particle or atomistic simulations, fluid dynamics, computational electromagnetics, materials science, genomics and systems biology, and clustering and data mining. In this paper, we present methods for performing such pairwise computations efficiently in parallel on Cell processors. This problem is particularly challenging on the Cell processor due to the small sized Local Stores of the Synergistic Processing Elements, the main computational cores of the processor. We present techniques for different variants of this problem including those with large number of entities or when the dimensionality of the information per entity is large. We demonstrate our methods in the context of multiple applications drawn from fluid dynamics, materials science and systems biology, and present detailed experimental results. Our software library is an open source and can be readily used by application scientists to accelerate pairwise computations using Cell accelerators.
Jaroslaw Zola, Maneesha Aluru, Abhinav Sarje, Srinivas Aluru, "Parallel Information-Theory-Based Construction of Genome-Wide Gene Regulatory Networks", Transactions on Parallel and Distributed Systems, December 2010, 21-12:1721-1733, doi: http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.59
Abhinav Sarje, Srinivas Aluru, "Parallel Genomic Alignments on the Cell Broadband Engine", Transactions on Parallel and Distributed Systems (TPDS), November 2009, 20-11:1600-1610, doi: http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.254
Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, "Tuning HipGISAXS on Multi and Many Core Supercomputers", Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems at Supercomputing (SC13), Denver, CO, 2013,
Abhinav Sarje, Xiaoye S. Li, Slim Chourou, Elaine R. Chan, Alexander Hexemer, "Massively Parallel X-ray Scattering Simulations", Supercomputing, November 2012,
Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible to compute scattered light intensities in all spatial directions allowing full reconstruction of GISAXS patterns for any complex structures and with high-resolutions while reducing simulation times from months to minutes.
S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, "High-Performance GISAXS Code for Polymer Science", Synchrotron Radiation in Polymer Science, April 2012,
- Download File: SRPS-2012-ABSTRACT-CHOUROU-rev.pdf (pdf: 764 KB)
Abhinav Sarje. Srinivas Aluru, "A MapReduce Style Framework for Computations on Trees", International Conference on Parallel Processing (ICPP), September 2010, doi: http://doi.ieeecomputersociety.org/10.1109/ICPP.2010.42
The emergence of cloud computing and Google's MapReduce paradigm is renewing interest in the development of broadly applicable high level abstractions as a means to deliver easy programmability and cyber resources to the user, while hiding complexities of system architecture, parallelism and algorithms, heterogeneity, and fault-tolerance. In this paper, we present a high-level framework for computations on tree structures. Despite the diversity and types of tree structures, and the algorithmic ways in which they are utilized, our abstraction provides sufficient generality to be broadly applicable. We show how certain frequently used operations on tree structures can be cast in terms of our framework. We further demonstrate the applicability of our framework by solving two applications -- k-nearest neighbors and fast multipole method (FMM) based simulations -- by merely using our framework in multiple ways. We developed a generic programming based implementation of the framework using C++ and MPI, and demonstrate its performance on the aforementioned applications using homogeneous multi-core clusters.
Abhinav Sarje, Jaroslaw Zola, Srinivas Aluru, "Constructing Gene Regulatory Networks on Clusters of Cell Processors", ICPP, September 2009, doi: http://doi.ieeecomputersociety.org/10.1109/ICPP.2009.35
Abhinav Sarje, Srinivas Aluru, "Parallel biological sequence alignments on the Cell Broadband Engine", IPDPS, April 2008, doi: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2008.4536328
Abhinav Sarje, Jaroslaw Zola, Srinivas Aluru, "Pairwise Computations on the Cell Processor with Applications in Computational Biology", Scientific Computing with Multicore and Accelerators, edited by J. Dongarra, D. A. Bader, J. Kurzak, (Chapman & Hall/CRC: December 2010) Pages: 297–327 doi: 10.1201/b10376-22
Abhinav Sarje, Srinivas Aluru, "Parallel Algorithms for Alignments on the Cell B.E.", Bioinformatics: High Performance Parallel Computer Architectures, edited by Bertil Schmidt, (Taylor & Francis Group/CRC: July 2010) Pages: 59–83 doi: 10.1201/EBK1439814888-c4
Abhinav Sarje, Xiaoye S Li, Alexander Hexemer, Tuning HipGISAXS on Multi and Many Core Supercomputers, Performance Modeling, Benchmarking and Simulations of High Performance Computer Systems, November 18, 2013,
Abhinav Sarje, Synchrotron Light-source Data Analysis through Massively-parallel GPU Computing, NVIDIA GPU Technology Conference (GTC), March 2013,
Light scattering techniques are widely used for the characterization of macromolecules and particle systems (ordered, partially-ordered or custom) based on their properties, such as their structures and sizes, at the micro/nano-scales. One of the major applications of these is in the characterization of materials for the design and fabrication of energy-relevant nanodevices, such as photovoltaic and energy storage devices. Although current high-throughput synchrotron light-sources can generate tremendous amounts of raw data at a high rate, analysis of this data for the characterization processes remains the primary bottleneck, demanding large amounts of computational resources. In this work, we are developing high-performance parallel algorithms and codes on large-scale GPU clusters to address this bottleneck. Here, we will discuss our efforts and experiences in developing "architecture-aware" hybrid multi-GPU multi-CPU codes for two such most important analysis steps. First is simulation of X-ray scattering patterns for any given sample morphology, and second is structural fitting of such scattering patterns. Both steps involve a large number of variable parameters, and hence, require high computational power.
Our X-ray scattering pattern simulation code is based on the Distorted Wave Born Approximation (DWBA) theory, and involves a large number of compute-intensive form-factor calculations. A form-factor is computed as an integral over the shape functions of the nanoparticles in a sample. A simulated sample structure is taken as an input in the form of discretized shape-surfaces, such as a triangulated surface. Resolutions of such discretization, as well as of a spatial 3-D grid involved, also contribute toward the compute-intensity of the simulations. Our code uses hybrid GPU and multicore CPU acceleration for generation of high-resolution scattering patterns, for the given input, using various possible values of the input parameters. These parameters include a number of sample definition, experimental setup and computational parameters. The patterns obtained through the X-ray scattering simulations carry vital information about the structural properties of the materials in the sample. In order to extract meaningful structural information from the scattering patterns, structural fitting, as an inverse modeling problem, is used. Our codes implement a fast and scalable solution to this process through a Reverse Monte Carlo (RMC) simulation algorithm. This process computes structure-factors in each simulation step until a result fits the input image pattern within allowed error range. These computations require a large number of fast Fourier transform (FFTs) calculations which are also accelerated on hybrid GPU and CPU systems in our codes for high-performance.
Our codes are designed as GPU-architecture-aware implementations, and deliver high-performance through dynamic selection of the best-performing computational parameters, such as the computation decomposition parameters, block sizes, for the system being used. We perform a detailed study of the effects of these parameters on the code performance, and use this information to guide the parameter value selection process. We also carry out performance analysis of the optimal codes and study its scalings, including scaling to large GPU clusters. Our codes obtain near linear scaling with respect to the cluster size in our experiments, and we believe that these are "future-ready".
S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS School: The HipGISAXS Software, Advanced Light Source User Meeting, October 2012,
Eliot Gann , Slim Chourou , Abhinav Sarje , Harald Ade , Cheng Wang , Elaine Chan , Xiaodong Ding , Alexander Hexemer, An Interactive 3D Interface to Model Complex Surfaces and Simulate Grazing Incidence X-ray Scatter Patterns, American Physical Society March Meeting 2012, March 2012,
Grazing Incidence Scattering is becoming critical in characterization of the ensemble statistical properties of complex layered and nano structured thin films systems over length scales of centimeters. A major bottleneck in the widespread implementation of these techniques is the quantitative interpretation of the complicated grazing incidence scatter. To fill this gap, we present the development of a new interactive program to model complex nano-structured and layered systems for efficient grazing incidence scattering calculation.
S. Chourou, A. Sarje, X. Li, E. Chan, A. Hexemer, GISAXS simulation and analysis on GPU clusters., American Physical Society March Meeting 2012, February 2012,
We have implemented a flexible Grazing Incidence Small-Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory that effectively utilizes the parallel processing power provided by the GPUs. This constitutes a handy tool for experimentalists facing a massive flux of data, allowing them to accurately simulate the GISAXS process and analyze the produced data. The software computes the diffraction image for any given superposition of custom shapes or morphologies (e.g. obtained graphically via a discretization scheme) in a user-defined region of k-space (or region of the area detector) for all possible grazing incidence angles and in-plane sample rotations. This flexibility then allows to easily tackle a wide range of possible sample geometries such as nanostructures on top of or embedded in a substrate or a multilayered structure. In cases where the sample displays regions of significant refractive index contrast, an algorithm has been implemented to perform an optimal slicing of the sample along the vertical direction and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. Preliminary tests on a single GPU show a speedup of over 200 times compared to the sequential code.
Abhinav Sarje, Samuel Williams, David H. Bailey, "MPQC: Performance analysis and optimization", February 1, 2013, LBNL LBNL-6076E,
Abhinav Sarje, Jack Pien, Xiaoye S. Li, Elaine Chan, Slim Chourou, Alexander Hexemer, Arthur Scholz, Edward Kramer, "Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters", LBNL Tech Report, May 15, 2012, LBNL LBNL-5351E,
X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analyses of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analyses of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space; thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters for fitting in the in X-ray scattering simulation model, the earlier single CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use GPU gave around 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.
Abhinav Sarje, Srinivas Aluru, "A Mapreduce Style Framework for Trees", Iowa State University, June 2009,
Applications on emerging paradigms in parallel computing, Abhinav Sarje, PhD, Iowa State University, July 2010,
The area of computing is seeing parallelism increasingly being incorporated at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi- threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and scaling them to large distributed systems as server farms and clouds. All together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools, which make use of the available parallelism, is inevitable in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically, parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors, and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi- core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles, and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google’s MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processor, and demonstrate its applicability through two applications: all-k-nearest neighbors computations, and Fast Multipole Method (FMM) based simulations.
Abhinav Sarje, Jack Pien, Xiaoye Li, "GPU Clusters for Large-Scale Analysis of X-ray Scattering Data", GPU Technology Conference (GTC), January 2012,
Today's emerging architectures have higher levels of parallelism incorporated within a processor. They require efficient strategies to extract the performance they have to offer. In our work, we develop architecture-aware parallel strategies to perform various kinds of pairwise computations - pairwise genomic alignments, and scheduling large number of general pairwise computations with applications to computational systems biology and materials science. We present our schemes in the context of the IBM Cell BE, an example of a heterogeneous multicore, but are nevertheless applicable to any similar architecture, as well as general multicores with our strategies being cache-aware.