DOE Scientists Team up to Demonstrate Scientific Potential of Big Data Infrastructure
February 3, 2015
Jon Bashor, email@example.com, 510-486-5849
The pace of scientific discovery is increasingly constrained by scientists’ ability to process and manage the large datasets produced by the world’s most advanced telescopes and microscopes, large-scale supercomputer simulations of the universe and climate, and genomics sequencers reading the code that defines the living world. These growing sources of data offer new opportunities for discovery, but they present major challenges as researchers strive to get the science out of the data.
Over the past few months, groups of researchers supported by the Department of Energy (DOE) have taken on the challenge of demonstrating new approaches for collecting, moving, sharing and analyzing massive scientific datasets in areas ranging from understanding the makeup of our universe to designing new materials at the molecular scale.
Researchers at Lawrence Berkeley National Laboratory (Berkeley Lab) led four of the 11 Science Data Pilot Projects and provided support to five others. The multi-disciplinary teams tapped into some of the world’s leading research facilities, including light sources, supercomputers and particle accelerators at DOE national labs, as well as DOE’s ultra-fast network, ESnet. Berkeley Lab manages three of the facilities used by the projects: the National Energy Research Scientific Computing Center (NERSC), the Advanced Light Source (ALS) and ESnet.
The goal was to show what can be achieved when such facilities are deliberately linked to carry out specialized research projects: to demonstrate the potential of a focused science data infrastructure, which took up to several months to assemble, as well as the steps needed to turn such heroic one-off efforts into everyday practice.
“Science at all scales is increasingly driven by data, whether the results of experiments, supercomputer simulations or observational data from telescopes,” said Steven Binkley, director of DOE’s Office of Advanced Scientific Computing Research (ASCR), which co-sponsored the projects. “As each new generation of instruments and supercomputers comes online, we have to make sure that our scientists have the capabilities to get the science out of that data, and these projects illustrate the future directions.”
As the largest funder of basic physical science in the U.S., the DOE Office of Science supports tens of thousands of researchers at national laboratories, universities and other institutions. The Office of Science also operates unique user facilities for research. These projects brought together the people, facilities and ideas for maximizing scientific discovery.
Members of many of these Science Data Pilot Projects gave presentations on their work at the SC14 conference held Nov. 16-21 in New Orleans. Here is a description of the four projects led by staff at Berkeley Lab.
Two of the projects, X-rays and Supercomputing—Opening New Frontiers in Photon Science and Investigating Organic Photovoltaics in Real-Time with an ASCR Super-Facility, were led by Craig Tull, head of the Physics and X-Ray Science Computing Group in the Computational Research Division. Both projects focused on analyzing data from experiments on new materials with potential for improving energy efficiency, such as understanding the chemistry behind photosynthesis and analyzing new materials for batteries.
The first project demonstrated the ability to use a central scientific computing facility – NERSC – to serve data from multiple experimental facilities. Data from experiments at the ALS, the Advanced Photon Source (APS) at Argonne National Laboratory, the Linac Coherent Light Source at SLAC and the National Synchrotron Light Source at Brookhaven National Laboratory were moved to NERSC via ESnet. Although all four light sources are operated by DOE’s Office of Basic Energy Sciences, each facility generates data in a different format, which made the task of organizing and analyzing the data more difficult. Working closely with scientists and engineers at each institution, the team set up systems to transfer the data from each site to NERSC using an application called SPADE, also developed at Berkeley Lab.
Once the data arrived at NERSC, it was automatically or semi-automatically analyzed and visualized within the SPOT Suite framework developed by Tull’s team. A key capability within the SPOT framework is the ability to take advantage of analysis and processing algorithms developed by other groups. For example, researchers from the APS were able to adapt their software, called TomoPy, for use in the SPOT framework, which then made it possible to use TomoPy to reconstruct tomography data from both the ALS and APS. While beamlines at those two facilities do not yet use the same file formats, SPOT was able to use inter-conversion tools to handle the differences.
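The core idea behind such automatic background analysis can be illustrated with a minimal, hypothetical sketch: a loop that notices newly arrived data files and hands each one to an analysis step without the scientist having to intervene. None of the names below come from the actual SPOT Suite software; they are placeholders for the pattern it embodies.

```python
# Hypothetical sketch of a "watch and analyze" pipeline in the spirit of
# SPOT Suite; file patterns, function names and formats are illustrative.
from pathlib import Path


def analyze(path: Path) -> str:
    """Stand-in for a real reconstruction or visualization step."""
    return f"analyzed {path.name}"


def watch(incoming: Path, seen: set) -> list:
    """Process any newly arrived data files, skipping ones already handled."""
    results = []
    for f in sorted(incoming.glob("*.h5")):  # e.g. HDF5 files from a beamline
        if f.name not in seen:
            seen.add(f.name)
            results.append(analyze(f))
    return results


if __name__ == "__main__":
    import tempfile

    d = Path(tempfile.mkdtemp())
    (d / "scan001.h5").touch()      # simulate a file arriving from SPADE
    seen = set()
    print(watch(d, seen))           # the new file is analyzed
    print(watch(d, seen))           # nothing new, so nothing happens
```

In a production system the loop would run continuously in the background, which is what lets scientists keep working on the experiment while results accumulate.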
“When you’re working on an experiment, you don’t always have time to stop and start working on the data,” said Tull, “but SPOT Suite does this analysis automatically in the background.”
While the projects at the ALS and SLAC processed data in real time from ongoing experiments, the other two projects replayed real data in “simulated real time” to demonstrate the automation of their workflows with SPOT Suite. All of the projects are researching new, more efficient energy technologies. The goal is to reuse as many existing tools as possible, develop new software only where necessary, and make it easier for scientists to examine their data simultaneously.
“In the past, a scientist used to save his or her data on an external device, then take it back to a PC running specialized software and analyze the results,” Tull said. “With these projects, we demonstrated an entirely new model for doing science at larger scale more efficiently and effectively and with improved collaboration.”
Other participants were Eli Dart (ESnet), Dilworth Parkinson, Nicholas Sauter and David Skinner (NERSC), all of LBNL; Amber Boehnlein, SLAC; Francesco De Carlo, Ian Foster, Doga Gursoy, and Nicholas Schwarz of ANL; and Dantong Yu, BNL.
The second project, which involved experiments with organic photovoltaic (OPV) materials at the ALS, used multiple computing facilities, illustrating a concept known as a “super facility”: the seamless integration of multiple, complementary DOE Office of Science user facilities into a virtual facility offering fundamentally greater capability. The facilities were the ALS, NERSC, the Oak Ridge Leadership Computing Facility (OLCF) and ESnet. Enabled by ESnet’s connectivity between the ALS, NERSC and OLCF, and using specialized software including ANL’s Globus Online, the project let researchers in organic photovoltaics measure scattering patterns for their samples at the ALS and see real-time feedback on all their samples through the SPOT Suite application running at NERSC. They could also see near-real-time analysis of their samples running at the largest scale on the Titan supercomputer at OLCF. The demonstration showed that researchers will soon be able, for the first time, to understand their samples well enough during beamtime to adjust the experiment and maximize their scientific results.
In the experiments, a specialized printer produces organic photovoltaics, which show promise as less expensive, more flexible materials for converting sunlight to electricity; to date, the ALS is one of the few facilities able to both print and measure these materials. By capturing an image of the solution every second for five minutes, scientists could watch the structures crystallize as the material dried. The key information was contained in about 30 images from each five-minute run. In all, more than 36,000 images were captured, of which 865 were pulled out for analysis. The calculations were more demanding than expected and used half of Titan’s 18,688 graphics processing units.
Other participants were Shane Canon (NERSC), Eli Dart (ESnet), Alex Hexemer (ALS) and James Sethian, LBNL; Ian Foster of ANL; and Galen Shipman of ORNL.
EXDAC – EXtreme Data Analysis for Cosmology: In recent years astrophysics and cosmology have undergone a renaissance, transforming from data-starved to data-driven sciences. On one side, increasingly powerful Earth- and space-based telescopes are capturing images of the universe at a previously unattainable level of detail; on the other, advanced supercomputers create highly detailed simulations. Together, these massive amounts of data are leading to huge advances in our understanding of the universe, with the simulations helping to correct shortcomings in observed data, which in turn leads to more accurate and detailed simulations.
In the future, cosmologists hope to be able to perform joint analysis on the two kinds of data at the same time, a goal this project was aimed at advancing. A key step in this direction is streamlining the movement of data from a central repository to supercomputing centers where it can be analyzed. For this project, the team focused on observational data from the Dark Energy Survey, which is being stored at the National Center for Supercomputing Applications (NCSA) in Illinois. The current system uses many codes, is tuned for one specific machine, and small glitches can bog down the process, driving up the demand for compute time.
The team built a “data pipeline” for moving and processing the data, independent of the computing center it would run at. Using Docker, a software package that automates the deployment of applications inside lightweight containers, they built self-contained packages that included all the necessary applications for analysis. These containers could then be pushed out from NCSA to DOE supercomputers at Berkeley, Argonne, Brookhaven and Oak Ridge national labs, fired up on the various systems to pull the data they needed for processing, and then push the results back to NCSA over ESnet.
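A container of this kind, bundling the analysis codes together with everything they need to run, can be sketched as a Dockerfile. The base image, packages and script name below are illustrative, not the project’s actual software stack:

```dockerfile
# Illustrative sketch only: a self-contained analysis container in the
# spirit of the EXDAC pipeline; the real codes and dependencies differ.
FROM python:3.11-slim

# Bake the analysis dependencies into the image so every site runs
# exactly the same software stack, regardless of the host system
RUN pip install --no-cache-dir numpy scipy astropy

# Ship the analysis code itself inside the container (hypothetical script)
COPY analyze.py /app/analyze.py

# On any host, the container pulls its input data, runs, and writes results
ENTRYPOINT ["python", "/app/analyze.py"]
```

Because the container carries its own dependencies, the same image behaves identically at NCSA, Berkeley, Argonne, Brookhaven or Oak Ridge, which is what makes the pipeline independent of any one computing center.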
To process simulation data, the team built PDACS, the Portal for Data Analysis Services for Cosmological Simulations, which runs at both Berkeley and Argonne. Using an application originally developed for genomics and medical data, the team created an interface with drop-down menus and drag-and-drop capabilities for building customized workflows, allowing non-expert users to specify the kind of analysis they wanted to perform on each dataset.
“It made the analysis really easy, even though the users are moving around and analyzing huge amounts of data,” said Berkeley Lab’s Peter Nugent, who led this data pilot.
Other participants were Shane Canon (NERSC), Berkeley Lab; Salman Habib, ANL; Michael Ernst and Anže Slosar, BNL; and Bronson Messer, ORNL.
Virtual Data Facility—A Service Infrastructure to Enable Science and Engineering: Major facilities and science teams across the DOE laboratory system are increasingly dependent on the ability to efficiently capture, integrate and steward large volumes of diverse data. These data-intensive workloads are often composed as complex scientific workflows that require computational and data services across multiple facilities. This project was a multi-lab effort to create proof-of-concept implementations for some of the common challenges encountered across domains, including authentication, data replication, portable execution of analysis, data publishing and a framework for building user interfaces.
“This project demonstrated a few core services that illustrated how a Virtual Data Facility could build upon ASCR’s computational infrastructure to better meet the needs of the DOE experimental and observational facilities and research teams,” said project co-leader Shane Canon, who leads technology integration at the National Energy Research Scientific Computing Center at Berkeley Lab. “Berkeley Lab was responsible for creating a prototype replication service, which we built using a common web-service paradigm and leveraged existing capabilities like Globus and ESnet to facilitate fast and efficient data transfers.”
Data endpoints were established at Argonne, Brookhaven, Lawrence Berkeley, Oak Ridge and Pacific Northwest national laboratories, and the service demonstrated datasets being replicated automatically from one site to the other four. The prototype also included a metadata service that could be used to build a data catalog. While it was only intended as a proof of principle, the service followed many of the guiding principles that will be important in creating a virtual data facility and illustrated the potential value of such a facility.
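The pattern described above, a service that accepts a dataset at one site, fans it out to the other four, and records one catalog entry per replica, can be sketched in miniature. Everything here is hypothetical: the real prototype used a web-service interface with Globus transfers over ESnet, not an in-memory list.

```python
# Hypothetical miniature of a replication-with-catalog service; the real
# prototype moved data between five national labs via Globus over ESnet.

SITES = ["ANL", "BNL", "LBNL", "ORNL", "PNNL"]  # the five lab endpoints

catalog = []  # stands in for the prototype's metadata service / data catalog


def replicate(dataset: str, origin: str) -> list:
    """Replicate a dataset from its origin site to the other four,
    recording a catalog entry for each new copy."""
    targets = [s for s in SITES if s != origin]
    for site in targets:
        # a real service would launch a managed wide-area transfer here
        catalog.append({"dataset": dataset, "site": site, "source": origin})
    return targets


if __name__ == "__main__":
    copies = replicate("des_images_v1", origin="LBNL")
    print(copies)        # the four receiving sites
    print(len(catalog))  # one catalog entry per replica
```

The catalog entries are what would later let users discover where copies of a dataset live, which is the step toward a data catalog the prototype demonstrated.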
Participants were Canon (NERSC) and Brian Tierney (ESnet), both of Berkeley Lab; Dan Olson, ANL; Michael Ernst, BNL; Kerstin Kleese van Dam, Pacific Northwest National Laboratory; and Galen Shipman, ORNL.