
Intel Parallel Computing Center: Big Data Support on HPC Systems

Personnel: Costin Iancu and Khaled Ibrahim (LBNL), Nicholas Chaimov (U. Oregon)

To extend their mission and to open new science frontiers, operators of large-scale supercomputers have a vested interest in deploying existing data analytics frameworks such as Spark or Hadoop. So far, this deployment has been hampered by differences in system architecture, which are reflected in the design of the analytics stacks. HPC systems are the mirror image of data centers: in a data center, the file system is optimized for latency (with local disks) and the network is optimized for bandwidth; in a supercomputer, the file system is optimized for bandwidth (without local disks) while the network is optimized for latency.
We propose research and development efforts to ensure the successful adoption of data analytics frameworks in the HPC ecosystem, with emphasis on Spark. Our initial evaluation on a Cray XC30 and on an InfiniBand cluster indicates that performance is hampered by the interaction with Lustre: we observe a 2-4X slowdown directly attributable to it.
Scalability is the main target of our project. Currently, data analytics frameworks scale up to O(100) cores; we plan to demonstrate scalability up to O(10,000) cores. As a reference, we plan to use data center installations such as Amazon EC2. Initial R&D efforts will be performed on Cray XC30 and InfiniBand clusters with server-class processors. In the later stages of the project we plan to experiment with the Intel KNL-based system (Cori) at NERSC.
In the first stage of the project we plan to systematically redesign Spark internals to accommodate the different performance characteristics of Lustre-based systems. Examples include hiding the latency of expensive I/O operations and exploiting the global name space. The goal at this stage is to at least match EC2 performance; this will provide a solid starting point from which to initiate collaborations with external application groups.
The initial results are very promising. We were able to improve scalability from O(100) to O(1,000) cores when using the Lustre file system. When running entirely in memory, which appears to be the de facto HPC application setting, we observe scalability up to O(10,000) cores. Highlights of this work include: 1) on Lustre, performance is dominated by fopen(); 2) we can hide fopen() latency and improve scalability with user-level caching techniques, which also reduce system-call overhead; 3) interposing NVRAM between the cores and Lustre helps scalability up to O(1,000). For many more details, see this summary and the "Spark on HPC" paper below.
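The caching idea behind highlight 2) can be illustrated with a small sketch. This is hypothetical Python, not the actual Spark or libc-interposition code: by memoizing open file handles and stat() results at user level, repeated accesses to the same shuffle files avoid re-issuing open() and stat() calls, each of which on Lustre involves a round-trip to the metadata server.

```python
import os

class CachingFileLayer:
    """Sketch of a user-level cache that avoids repeated metadata
    operations against a parallel file system. On Lustre, each open()
    and stat() contacts the metadata server; caching handles and stat
    results amortizes that cost across a task's many small accesses."""

    def __init__(self):
        self._handles = {}   # (path, mode) -> open file object
        self._stats = {}     # path -> cached os.stat_result

    def open(self, path, mode="rb"):
        # Reuse an existing handle instead of re-issuing open(),
        # which would trigger another metadata round-trip.
        key = (path, mode)
        if key not in self._handles:
            self._handles[key] = open(path, mode)
        return self._handles[key]

    def stat(self, path):
        # Cache stat() results; shuffle phases re-stat the same
        # block files many times.
        if path not in self._stats:
            self._stats[path] = os.stat(path)
        return self._stats[path]

    def close_all(self):
        # Flush the cache, e.g. at task or stage boundaries.
        for f in self._handles.values():
            f.close()
        self._handles.clear()
        self._stats.clear()
```

In the actual work this interposition happens below the JVM (e.g. via library preloading), but the caching principle is the same: the second and subsequent requests for the same file are served locally.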
In the second stage of the project, we plan to re-examine memory management in Spark to accommodate the deeper vertical memory hierarchies present in HPC systems. Data center nodes provide two logical levels of storage (disk and memory), and data management in Spark reflects this hierarchy. In a sense, Lustre already provides a third level of storage in HPC systems, and newer systems, such as the expected NERSC Cori system, incorporate an extra level of NVRAM memory. We plan to explore Spark-specific modifications: 1) using compositions of existing layers (e.g. Tachyon) as a memory manager for vertical hierarchies; 2) exploring an HDFS emulation layer that targets NVRAM storage. We also plan to explore low-level memory management policies at the OS level. We see an exciting research opportunity to re-examine the data coherency models in both Lustre and Spark and to propose parallel file system coherency modes for data analytics.
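To illustrate the kind of vertical hierarchy such a memory manager would have to handle, here is a minimal sketch (hypothetical Python; the tier names and capacities are illustrative, not a proposed design): blocks enter the fastest tier and spill downward, memory to NVRAM to Lustre, as capacities fill, and reads promote blocks back to memory.

```python
class TieredBlockStore:
    """Sketch of a vertical block store: blocks spill from memory to a
    faster local tier (e.g. NVRAM) and finally to the parallel file
    system. Capacities are block counts for simplicity."""

    def __init__(self, mem_capacity, nvram_capacity):
        # Ordered fastest to slowest; the last tier is unbounded.
        self.tiers = [
            {"name": "memory", "capacity": mem_capacity, "data": {}},
            {"name": "nvram", "capacity": nvram_capacity, "data": {}},
            {"name": "lustre", "capacity": float("inf"), "data": {}},
        ]

    def put(self, block_id, block):
        # Insert into the fastest tier; cascade-spill the oldest
        # blocks downward when a tier exceeds its capacity.
        self.tiers[0]["data"][block_id] = block
        for upper, lower in zip(self.tiers, self.tiers[1:]):
            while len(upper["data"]) > upper["capacity"]:
                victim = next(iter(upper["data"]))  # oldest block
                lower["data"][victim] = upper["data"].pop(victim)

    def get(self, block_id):
        # Look up from fastest to slowest; promote hits to memory.
        for tier in self.tiers:
            if block_id in tier["data"]:
                block = tier["data"].pop(block_id)
                self.put(block_id, block)
                return block
        raise KeyError(block_id)

    def location(self, block_id):
        # Report which tier currently holds the block.
        for tier in self.tiers:
            if block_id in tier["data"]:
                return tier["name"]
        raise KeyError(block_id)
```

A real memory manager would of course account for block sizes, asynchronous migration, and coherency with other readers of the file system; the sketch only shows the spill/promote structure that the deeper hierarchy imposes.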
In the third stage of the project, we plan to collaborate with the Intel Lustre team to evaluate the novel Lustre and memory hierarchy designs explored in the DOE FastForward and DesignForward projects. As workloads we plan to use applications from several domains of interest to our DOE and DoD collaborators: SQL, graphs, machine learning, and linear algebra. Of specific interest are scalable general graph matching algorithms. We plan to release our software as open source and to contribute to both Lustre and Spark.


Conference Paper

2016

Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Ibrahim, Jay Srinivasan, "Scaling Spark on HPC Systems", High Performance and Distributed Computing (HPDC), February 5, 2016.