Berkeley Lab’s ArrayUDF Tool Turns Large-scale Scientific Array Data Analysis Into a Cakewalk

January 2, 2018

Linda Vu, lvu@lbl.gov, +1 510.495.2402

From right to left: Suren Byna, John Wu and Bin Dong

A novel scalable framework developed by researchers in the Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Computational Research Division (CRD) and at UC Merced is improving scientific productivity by allowing researchers to run user-defined custom analysis operations on large arrays of data with massively parallel supercomputers, while leaving complex data management and performance optimization tasks up to the underlying system.

After running a series of performance evaluations, the CRD researchers also showed that their framework—ArrayUDF—performs tens to thousands of times faster than current state-of the-art big data management systems, such as Apache SPARK, SciDB and RasDaMan. ArrayUDF can also handle cases where existing data management systems fail due to massive memory requirements on each compute node. Their findings were published in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC).

Recently, astronomers in the Laser Interferometer Gravitational-Wave Observatory (LIGO) collaboration leveraged ArrayUDF to discover two colliding neutron stars, a never before seen cosmic event that is providing valuable insights into the origin of the universe’s heavy elements.

“By allowing users to focus on the logic of their applications instead of cumbersome data management tasks, we hope to significantly improve scientific productivity,” says Kesheng “John” Wu, who leads CRD’s Scientific Data Management Group and is a co-author on the paper. The group was created more than 25 years ago to develop data management tools so scientists could focus on their research, instead of worrying about managing their data.

According to Wu, running analysis operations on large multi-dimensional data structures is an essential part of the scientific process, allowing researchers to extract meaningful insights. For instance, an array of data collected year-round by instruments in a watershed could help scientists understand the seasonal trends of that particular ecosystem in terms of temperature, rainfall, pressure, carbon dioxide levels, etc. Although data management systems like Apache SPARK, SciDB and RasDaMan are widely used today, many of these systems are not optimized to work with large multi-dimensional array structures or have limitations that make them inefficient for scientific analysis tasks. What makes ArrayUDF different is that it was designed with the needs of scientific users in mind.

Arrays appear in nearly every computer program, but in scientific programs they are often very large and multi-dimensional. For instance, the elements might be organized in a cube and positions indicated by three indexes. A benefit of this structure is that any value can quickly be called up by giving its position in the array—users can access the millionth element just as fast as the first. Users can also conveniently describe their analysis operations in terms of array indexes and offsets in the indexes. And as scientists define their custom analysis operations, data management systems—like ArrayUDF, Apache SPARK, SciDB and RasDaMan—will distribute the large multi-dimensional array onto the parallel processors without user involvement. These automated data partitioning and movement features greatly improve user productivity.

UC Merced Computer science Professor Florin Rusu (left) and graduate studuent Weijie Zhao.

“Most real-world data analysis tasks, such as computing the moving average of a time-series or the vortices of a flow field require the values of not just a single element, but also many of its neighbors. If you are trying to understand turbulence in a combustion engine design, a key variable related to turbulence is vorticity, which defines the spinning motion in a given location. Typically, four neighbor cells are required for this computation in a two-dimensional array,” says Bin Dong, a scientist in CRD’s Scientific Data Management Group and lead author of the paper

According to Dong, what sets ArrayUDF apart from its competitors is the ability to express neighboring relationships among elements in an array. He notes that current state-of-the-art array data management systems only allow users to define an operation on a single data element and cannot easily express the neighboring relationship among the elements in an array.

“In the commercial space, users are mostly interested in running data-mining operations, so the available tools were not necessarily built with scientific datasets or analysis in mind,” says Wu. “In Berkeley Lab’s Scientific Data Management Group we have a lot of experience in dealing with large scientific datasets, and we applied that expertise in building ArrayUDF.”

Once researchers define their custom analysis operations, ArrayUDF will automatically determine how to distribute the data, how to parallelize the operations and how to efficiently run the operations to take advantage of the specialized features of supercomputers, such as the Department of Energy’s (DOE) National Energy Research Scientific Computing Center (NERSC) Edison and Cori systems. As NERSC is the primary computing center for the DOE Office of Science, advances in data analysis are likely to benefit many of the center’s 7,000 users.

When the team ran a series of performance evaluations on supercomputers at the NERSC, they found that ArrayUDFconsistently outperformed state-of-the-art implementations like Apache Spark, SciDB and RasDaMan. Among these data processing systems, Apache Spark was found to scale up the best and was able to solve larger problems than others. On the test problems Apache Spark could solve, the Berkeley Lab tool performed up to three orders of magnitude faster. On the same size machine, ArrayUDF is able to solve much larger problems where Apache Spark ran out of memory.

“If you are a scientific user, you understand two things very well; your data and the special operations that you want to do on that data. The cool thing about this tool is that it allows you to concentrate on your specialties, and ArrayUDF will do the rest of the heavy lifting,” says Suren Byna, a scientist in the Scientific Data Management Group and co-author of the paper.

ArrayUDF Points Astronomers to Colliding Neutron Stars

On August 17, 2017 the LIGO collaboration detected gravitational waves—literal ripples in the fabric of space-time. Astronomers around the world knew that these were the first signals of a momentous cosmic event, but they hadn’t pinpointed the source.

Ground-based telescopes scoured the skies looking for the luminous bursts of electromagnetic radiation that should accompany any cosmic event energetic enough to produce gravitational waves detectable on earth. But in a universe brimming with objects that produce all kinds of electromagnetic radiation, from low-energy radio waves to high-energy gamma rays, how would they find what they were looking for? The answer, it turns out, was databases.

Telescopes at observatories like Palomar in Southern California that scan the sky every night can take 40 to 80 large, high-dimensional images every hour to produce a database of images for any given point in the sky. All of this data is assembled into massive catalogs (e.g., the Sloan Digital Sky Survey and Palomar Transient Factory) that provide astronomers with reference images of each point in space. These references serve as a kind of celestial baseline, letting astronomers know what the sky looks like most of the time. When something unusual appears—a “transient”—scientists compare it to the references in their databases and determine whether they’re seeing something new or if it’s just a false alarm.

“In one night, you can have over 10,000 candidates,” says Florin Rusu, a Computer Science Professor at UC Merced and ArrayUDF collaborator. “We try to reduce this number to somewhere between 10 and 100 candidates. Our focus was identifying possible candidates by applying techniques from array databases.”

Because astronomers amass their celestial catalogs in multidimensional arrays, they could use ArrayUDF to easily search it. In this case, astronomers needed to compare images across multiple catalogs to eliminate false positives and arrive at a short list of possible candidates for the neutron star collision. To do so, they used ArrayUDF.

In essence, these techniques let scientists rapidly compare huge amounts of array data across multiple databases by minimizing data transfer, reducing network congestion and eliminating redundant processing.

“Our techniques let astronomers assess far more candidate images than would otherwise be possible,” Rusu said. “But our technique goes beyond any one domain. If you have large multidimensional data sets and you’re trying to figure out what is similar to what, you can use this.”

He notes that if not for the rapid comparisons that ArrayUDF enabled, astronomers might never have found the neutron star collision. But they did, and the astronomical community was able to solve a long-standing mystery about the origin of heavy elements (like gold and platinum) while also gleaning additional insight into gravitational waves. It’s a fundamental advance in astronomy that couldn’t have happened without fundamental advances in computer science.

Wu notes that ArrayUDF was developed as part of a larger DOE Advanced Scientific Computing Research (ASCR) effort called Scientific Data Services – Autonomous Data Management on Exascale Infrastructure. In a few years, DOE mission critical applications will be generating exabytes (1,000 petabytes) of data. To manage this deluge and accelerate scientific discovery, this ASCR effort is funding data management research to ensure that they can capture, store and access the most scientifically relevant features from these datasets. Over the years, Berkeley Lab’s Scientific Data Management (SDM) Group has delivered a number of tools towards this effort, including a revolutionary indexing tool called FastBit, which sped up database searches and was honored with an R&D 100 award in 2008. FastBit was also recognized as one of the 40 research milestones from Office of Science during the recent DOE’s 40^th Anniversary celebration.

In addition to Dong, Wu, Byna and Rusu, other members of the ArrayUDF development team are Berkeley Lab’s Jialin Liu and UC Merced’s Weijie Zhao.

NERCS is a DOE Office of Science User Facility, located at Berkeley Lab. ArrayUDF was developed with support from DOE’s Office of Science.

ArrayUDF source code is available at https://bitbucket.org/arrayudf/.

For more information about ArrayUDF and the LIGO discovery: http://news.ucmerced.edu/news/2017/computer-scientists-aid-major-astronomical-discovery

About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.