NetLogger Helps Supernova Factory Improve Data Analysis
May 12, 2005
The Nearby Supernova Factory (SNfactory) project, established at Berkeley Lab in 2002, aims to dramatically increase the discovery of nearby Type Ia supernovae by applying assembly-line efficiencies to the collection, analysis and retrieval of large amounts of astronomical data.
To date, the program has resulted in the discovery of about 150 Type Ia supernovae – about three times the total number reported before the project began. Type Ia supernovae are important celestial bodies because they are used as “standard candles” for gauging the expansion of the universe.
Contributing to the SNfactory's remarkable discovery rate is its custom-developed “data pipeline” software. The pipeline fills with up to 50 gigabytes (billion bytes) of data per night from wide-field cameras built and operated by the Jet Propulsion Laboratory's Near Earth Asteroid Tracking program (NEAT). NEAT uses remote telescopes in Southern California and Hawaii.
Around 25,000 new images are captured each day, and the goal is to complete all processing before the next day’s images arrive. Image data is copied in real time from the Mt. Palomar Observatory in Southern California to a mass storage system at NERSC. Then the image data is copied to a large shared disk array on a 344-node cluster called PDSF. Each image is 8 MB (uncompressed), and the processing of each image requires between 5 and 25 reference images, for a total disk space requirement of about 0.5 TB each day.
Supernovae are found by comparing recently acquired telescope images with older reference images. If there is a source of light in the new image that did not exist in the old image, it could be a supernova. Subtracting the reference image from the new image leaves only the light sources that have appeared since the reference was taken. This process is quite delicate: aligning the images, matching the point-spread functions, and matching the photometry and bias all require precise calibration.
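The comparison step can be illustrated with a minimal Python sketch. It is not the SNfactory pipeline code: it assumes the two images are already aligned, PSF-matched and photometrically scaled (the delicate part in practice), and the function names, toy pixel values and detection threshold are all hypothetical. The sign convention (new minus reference, so a new source shows up as a positive residual) is a choice made for the sketch.

```python
# Sketch only: assumes both images are already aligned, PSF-matched,
# and photometrically scaled, which is the hard part in practice.

def subtract_images(new, ref):
    """Return new - ref, pixel by pixel, for two equally sized 2-D lists."""
    return [[n - r for n, r in zip(new_row, ref_row)]
            for new_row, ref_row in zip(new, ref)]

def find_candidates(diff, threshold):
    """Return (row, col) positions whose residual exceeds the threshold."""
    return [(y, x)
            for y, row in enumerate(diff)
            for x, value in enumerate(row)
            if value > threshold]

# Toy 3x3 images: a steady source at (0, 0) and a new source at (1, 2).
reference = [[90, 10, 10],
             [10, 10, 10],
             [10, 10, 10]]
new_image = [[91, 11, 9],
             [10, 10, 75],
             [11, 10, 10]]

difference = subtract_images(new_image, reference)
candidates = find_candidates(difference, threshold=20)  # hypothetical cut
print(candidates)
```

The steady source at (0, 0) cancels to within noise, while the pixel at (1, 2) survives the subtraction and is flagged as a candidate.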
Because of the high demand put on all the resources in the pipeline, making sure that the data flow smoothly and can be analyzed quickly and correctly is critical to the overall success. While there are a number of tools for evaluating the performance of single systems, identifying the workflow bottlenecks in a distributed system such as the SNfactory requires a different type of application.
For the past 10 years, Brian Tierney and others in the Collaborative Computing Technologies Group have been developing the NetLogger toolkit as part of the Distributed Monitoring Framework project. NetLogger is a set of libraries and tools to support end-to-end monitoring of distributed applications. During the past few months, the team has been working closely with the SNfactory project to help debug and tune their application.
“NetLogger has been extremely useful in the debugging and commissioning of our data processing pipeline,” said Stephen Bailey, one of the lead developers on the SNfactory project. “It has helped us identify bugs and processing bottlenecks in order to improve our efficiency and data quality. It additionally has allowed real time monitoring of the data processing to quickly identify problems that need immediate attention. This debugging, commissioning, and monitoring would have taken much longer without NetLogger.”
Tierney and Bailey, along with Dan Gunter of the Collaborative Computing Technologies Group, have written a paper entitled “Scalable Analysis of Distributed Workflow Traces,” which will be presented at the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'05) to be held June 27-30 in Las Vegas. The paper can be found at <http://dsd.lbl.gov/publications/NetLogger-SNFactory.pdf>.
“The first problem the SNfactory scientists asked us to solve was to figure out why some of their workflows were failing without any error messages as to the cause,” Tierney said. “Even when error messages were generated, the SNfactory application produced thousands of log files, and it was very difficult to locate the log messages related to failed workflows. NetLogger was very useful for easily characterizing where the failures were occurring so they would know where to focus debugging efforts.”
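The idea of locating silent failures in a workflow trace can be sketched in a few lines of Python. This is not NetLogger's API or its actual log format: the three-field line layout, the `.start`/`.end` event suffixes, the workflow ids and the function name are all assumptions made for illustration. The sketch reports every workflow that has a stage which started but never logged a matching end.

```python
from collections import defaultdict

# Hypothetical log format for this sketch: "<timestamp> <workflow_id> <event>"
SAMPLE_LOG = """\
2005-05-01T02:00:00 wf-001 subtract.start
2005-05-01T02:00:05 wf-002 subtract.start
2005-05-01T02:00:09 wf-001 subtract.end
2005-05-01T02:00:10 wf-001 scan.start
2005-05-01T02:00:14 wf-001 scan.end
"""

def find_unfinished_stages(lines):
    """Map each workflow id to the stages that started but never ended."""
    open_stages = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _timestamp, workflow_id, event = parts
        stage, _, phase = event.rpartition(".")
        if phase == "start":
            open_stages[workflow_id].append(stage)
        elif phase == "end" and stage in open_stages[workflow_id]:
            open_stages[workflow_id].remove(stage)
    # Keep only workflows that still have at least one open stage.
    return {wf: stages for wf, stages in open_stages.items() if stages}

stalled = find_unfinished_stages(SAMPLE_LOG.splitlines())
print(stalled)
```

In the sample trace, workflow `wf-002` begins its `subtract` stage and then goes quiet, so it is the one reported. With thousands of log files, the same pass over merged, time-ordered events would point to the stage where each failed workflow stopped.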