CRD Research Team Optimizing Spark Data Analytics Framework for HPC
Awarded two-year, $110,000 Intel Parallel Computing Center grant to tackle scalability challenges
November 13, 2015
Contact: Kathy Kincade, firstname.lastname@example.org, 510-495-2124
A team of scientists from Berkeley Lab’s Computational Research Division (CRD) has been awarded a two-year, $110,000 grant by Intel to support their goal of enabling data analytics software stacks—notably Spark—to scale out on next-generation high performance computing (HPC) systems.
Functioning as an Intel Parallel Computing Center (IPCC), the new research effort will be led by Costin Iancu and Khaled Ibrahim, both computational scientists in CRD’s Computer Languages and Systems Software Group.
Spark is an open source computing framework for processing large datasets. It was developed in 2009 in the University of California, Berkeley’s AMPLab by then Ph.D. student Matei Zaharia and went open source in 2010 before being donated to the Apache Software Foundation in 2013. Spark’s ability to cache datasets in memory makes it well suited for large data analysis, especially on systems with large memory space. Programmers can write programs for the Spark runtime environment using Java, Python or Scala, and these programs can be executed in either a standard batch execution mode or using an interactive shell. Spark’s speed and flexibility make it ideal for rapid, iterative processes such as machine learning.
“Spark evolved in the commercial sector and is run in data centers, where the hardware is very distributed and the compute nodes assume there is a local disk,” Iancu said. “In the data center, the I/O system is optimized for latency and the networks are optimized more for throughput (or bandwidth). But if you move Spark to HPC systems, the opposite is true: the I/O systems care more about bandwidth and the networks care more about latency.”
Through the new IPCC project, Iancu and Ibrahim will address the differences between Spark as it has evolved on traditional data center system architectures versus what HPC platforms require in order to successfully adapt it to the HPC ecosystem and make it highly scalable. In the first phase of the project they will systematically redesign the Spark stack to accommodate the different performance characteristics of Lustre-based HPC systems. In the second phase of the project, they will re-examine memory management in Spark to accommodate the deeper vertical memory hierarchies present in HPC systems.
“The challenging part of the data analytic framework is the data movement,” Ibrahim said. “Where does this movement come from, the filesystem or the compute node or from movement between the nodes? So we are looking at optimizing the data movement vertically from memory to disk and also horizontally between compute nodes. We will also look at how to optimize the computation within the compute nodes.”
All of this is intended to support the project’s overarching goal: scalability. For the first year they will focus on improving execution efficiency at the scale of 1,000 cores. But that is only the beginning, Iancu emphasized.
“When we deploy Spark on an HPC system”—and he is quick to point out not just Cray architectures—“we will be able to improve its scalability to tens of thousands of cores by adapting it to the system architecture(s). Our goal is to improve Spark performance in the software stack and figure out how to make it evolve with the technology.”
They will also be testing it out on NERSC’s new Cori system as part of the center’s Burst Buffer Early Users program.
“We want to extend the use of Spark from data analytics to include more scientific computing,” Ibrahim said. “Typical applications for these frameworks are graph analytics and distributed databases, but we would like to include more scientific computing applications.”