Berkeley Lab Checkpoint Restart Saves Big Problems

February 9, 2009

A combustion researcher may run hundreds of hours of simulations on a supercomputer in search of the most efficient fuel-air mixture for a flame. But if the system crashes, all the data from the run might be lost and the researcher forced to start over.

The new version of the Berkeley Lab Checkpoint Restart (BLCR) software, released in January 2009, could mean that scientists running extensive calculations will be able to recover from such a crash – if they are running on a Linux cluster. This publicly available software preemptively saves the state of applications using the Message Passing Interface (MPI), the most widely used mechanism for communication among processors working concurrently on a single problem. Automatic checkpoints are taken every few hours to ensure that in case of a hardware malfunction, work can resume from the last checkpoint instead of the beginning.
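The idea of periodic checkpoints can be illustrated with a minimal Python sketch. It is not BLCR code and does not use BLCR's interface: BLCR saves the state of a whole running process at the system level, whereas this toy example saves only two variables from inside the application. The function name run_with_checkpoints, the fail_at parameter and the file name ckpt.pkl are hypothetical, chosen only for illustration.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving progress periodically.

    If a checkpoint file exists at `path`, the run resumes from it
    instead of from step 0. `fail_at` simulates a crash at that step
    (a hypothetical knob, for illustration only).
    Returns (result, number_of_steps_executed_in_this_call).
    """
    step, total = 0, 0
    if os.path.exists(path):
        # Resume from the last saved state rather than from the beginning.
        with open(path, "rb") as f:
            step, total = pickle.load(f)

    executed = 0
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        total += step
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            # Write to a temporary file, then rename it into place, so a
            # crash during the write cannot corrupt the previous checkpoint.
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, total), f)
            os.replace(tmp, path)
    return total, executed


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    try:
        run_with_checkpoints(100, 10, ckpt, fail_at=57)  # crashes at step 57
    except RuntimeError:
        pass
    # The second run picks up at step 50, the last checkpoint before the crash.
    print(run_with_checkpoints(100, 10, ckpt))
```

In this sketch the work done between the last checkpoint and the crash (steps 50 through 56) is repeated, which is the trade-off any periodic scheme makes: a shorter checkpoint interval loses less work but spends more time saving state.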

Developed by systems engineers in the Lawrence Berkeley National Laboratory’s (Berkeley Lab) Computational Research Division (CRD), BLCR was initially released to the public in November 2003. Since then, many developers from both academia and industry have integrated BLCR into their software packages, including the MVAPICH2, OpenMPI and Cray implementations of MPI; and the Cluster Resources batch system.

“BLCR benefits all system stakeholders – users, operators and owners – by reducing the productivity that is lost when failure occurs,” says Paul Hargrove, of CRD’s Future Technologies Group and one of BLCR’s developers.

According to Hargrove, there are currently other types of checkpoint/restart software for Linux clusters, but BLCR differs from others because it works with MPI. Climate modeling is one example of a complicated problem that utilizes multiple processors at once. To accurately predict and model climate conditions, scientists must take into account how the atmosphere interacts with land, ice and ocean surfaces.

On a supercomputer, multiple processors tackle parts of each problem and communicate their results through MPI. Then, all the results are combined to produce the big-picture model or prediction. Whereas most checkpoint/restart software can save state only before or after MPI communications are completed, BLCR automatically saves the application no matter what state it is in, even if communication is in progress. This feature allows for flexibility in scheduling and is extremely useful in the event of unexpected machine failure.

One beneficial application for BLCR is in “urgent computing,” which requires rededicating computing resources on short notice to solving problems of “great social importance,” like predicting the path of tornadoes, hurricanes and tsunamis. When an urgent computing request comes up, the system can now stop whatever it is doing, tackle the time-sensitive problem, and then resume work from the saved checkpoint.

“In the past, an interrupted workflow, either due to component failure or to take on an urgent request, would mean starting the interrupted jobs over from the beginning. Sometimes starting over would require days of redundant processing, but now, with BLCR, researchers can recover from an unexpected interruption in a few hours,” says Hargrove.

He notes that another common loss of utilization on a production system is “queue draining” before scheduled maintenance. Because no applications may be running at the time maintenance begins, it is typical for the software that schedules jobs to be put in a mode where the system will run only those applications that will be completed before maintenance occurs. Since there are not usually enough short-running jobs queued, this results in a system with lower-than-normal utilization for the day leading up to a scheduled downtime.

With BLCR, system administrators no longer have to drain the queues before maintenance. Now, they can checkpoint before the system goes down for maintenance and resume the jobs when the system is up again, hence improving the machine’s productivity. The same approach allows system administrators to implement separate job queues to run the largest jobs only during certain hours of the day to improve the system’s average turnaround time.
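A rough Python sketch can show why draining the queue costs utilization and how checkpointing recovers it. This is a simplified model under stated assumptions, not how any real batch system works: every admitted job starts at the beginning of the pre-maintenance window, nodes freed by a job that finishes early are not reused, and the job sizes are invented for illustration. The function name node_hours_used and its parameters are hypothetical.

```python
def node_hours_used(jobs, total_nodes, window_hours, can_checkpoint):
    """Node-hours of useful work done in the window before maintenance.

    jobs: list of (nodes, runtime_hours) tuples, considered in queue order.
    can_checkpoint: if False, only jobs that finish inside the window may
    start (drain mode); if True, longer jobs may also start and are
    checkpointed when the window closes.
    """
    free = total_nodes
    used = 0.0
    for nodes, runtime in jobs:
        if nodes > free:
            continue  # not enough free nodes for this job
        if runtime <= window_hours:
            # Short job: runs to completion before maintenance.
            free -= nodes
            used += nodes * runtime
        elif can_checkpoint:
            # Long job: runs until maintenance, then is checkpointed.
            free -= nodes
            used += nodes * window_hours
    return used


if __name__ == "__main__":
    queue = [(64, 48), (32, 6), (16, 30)]  # invented (nodes, hours) values
    capacity = 128 * 24                    # 128 nodes, 24 hours to maintenance
    for mode in (False, True):
        used = node_hours_used(queue, 128, 24, can_checkpoint=mode)
        print(mode, used / capacity)
```

With these invented numbers, drain mode admits only the 6-hour job and the machine is about 6 percent utilized over the window, while allowing checkpoints lets all three jobs run and raises utilization to roughly 69 percent.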

In 1997, when checkpoint/restart software was first implemented on the Cray T3E-900 machine at NERSC, it played an integral role in helping the system users and administrators meet their goals for high utilization and throughput. Inspired by the Cray T3E checkpoint/restart, Hargrove and his colleagues developed BLCR because no other checkpoint/restart software on the market met the needs of high performance computing applications on Linux clusters, which account for more than 75 percent of systems, according to the November 2008 TOP500 Supercomputer Sites list. While NERSC is currently the largest supercomputing center committed to using the BLCR software in production, Hargrove notes that other Department of Energy facilities and National Science Foundation TeraGrid centers have expressed interest in the technology.

“The experience at NERSC will most likely determine if or when BLCR will go into use at other large facilities,” says Hargrove.

The new layer that expands BLCR’s checkpoint footprint by allowing it to run simultaneously on thousands of compute nodes was developed by the Center of Excellence (COE), which was established when the contract for NERSC 5, or Franklin, was awarded to Cray in 2006. The COE’s main goal is to develop innovative software for production-level supercomputing. This is achieved by allowing Cray employees to tap into the vast production expertise of NERSC staff by working from the Berkeley Lab’s Oakland Science Facility for two years.

The production tools and software developed by the COE will utilize Cray's release and update process, thus allowing Cray XT sites worldwide to benefit from the COE collaboration. Brian Welty, Terry Mallberg and their Cray colleagues tuned BLCR for deployment on Cray systems as part of the COE.

In addition to Hargrove, other BLCR developers include Eric Roman, also of CRD’s Future Technologies Group, and Jason Duell, formerly of CRD.

About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.