Checkpointing/Restarting MPI Programs within the SSS Suite

The SciDAC Scalable Systems Software Suite (SSS) provides a "checkpoint manager" that allows MPI programs to be checkpointed and restarted, provided that the underlying system and MPI library support this functionality.

Requirements

At present, checkpoint/restart is only available on Linux systems which have Berkeley Lab Checkpoint/Restart (BLCR) installed. Additionally, Bamboo must currently be used as the SSS queue manager. The 'sss-cr' service (the checkpoint manager service) must be running on your system.

Finally, you must build your MPI applications with LAM/MPI. The MPICH MPI system does not yet support checkpointing with BLCR. The easiest way to make sure you are using the correct MPI implementation is to use the 'switcher' program provided by OSCAR:

    % switcher mpi = lam-7.0.6

Make sure to pick a version of LAM that supports checkpointing with BLCR.
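One way to check a LAM installation for BLCR support is to look at the module listing printed by LAM's 'laminfo' command. The sketch below assumes that checkpoint/restart modules appear on lines containing "SSI cr:" followed by the module name; the exact output format may differ between LAM versions, and the function name 'lam_has_blcr' is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a laminfo listing mentions the BLCR checkpoint module.
# Assumption: laminfo lists checkpoint/restart modules on lines containing
# "SSI cr:" followed by the module name (for example "blcr").
lam_has_blcr() {
    # Reads a laminfo listing on stdin; returns 0 if a BLCR cr module is listed.
    grep -Eq 'SSI cr:[[:space:]]*blcr'
}

# Usage on a real system (requires LAM/MPI in the PATH):
#   laminfo | lam_has_blcr && echo "BLCR support found"
```

If the check fails, the selected LAM version was probably built without BLCR support, and a different version should be chosen with 'switcher'.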

Submitting your job to the queue manager

To be checkpointable, a LAM/MPI program must be submitted with 'lampd':

    % qsub -l nodes=2,walltime=50:00 /usr/bin/lampd my_mpi_program [arguments]

Note that you must provide the full path of 'lampd' in this command.

You can also submit non-MPI serial jobs to the batch queue, but in this case, use the 'cr_run' command to start them:

    % qsub -l nodes=1,walltime=50:00 /usr/bin/cr_run my_serial_program [arguments]
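The two submission forms above differ only in the launcher program. As a rough sketch, a small helper can print the matching 'qsub' command line for review before it is run. The name 'submit_cmd' is hypothetical, and the launcher paths and walltime value are simply copied from the examples above, so they may need adjusting on your system.

```shell
#!/bin/sh
# Sketch: print the qsub command line for a checkpointable job.
# submit_cmd is a hypothetical helper; the launcher paths and the walltime
# value are copied from the examples above and may differ on your system.
submit_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: submit_cmd mpi|serial NODES PROGRAM [ARGS]" >&2
        return 2
    fi
    kind=$1
    nodes=$2
    shift 2
    case $kind in
        mpi)    launcher=/usr/bin/lampd ;;   # LAM/MPI programs
        serial) launcher=/usr/bin/cr_run ;;  # non-MPI serial programs
        *) echo "usage: submit_cmd mpi|serial NODES PROGRAM [ARGS]" >&2
           return 2 ;;
    esac
    # Only print the command; nothing is submitted by this function.
    echo "qsub -l nodes=$nodes,walltime=50:00 $launcher $*"
}

# Example: submit_cmd mpi 2 my_mpi_program
```

Printing the command rather than executing it keeps the helper safe to try on a machine without the queue manager installed.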

Once your job has been submitted to Bamboo using one of the above commands, it can be run normally. For instance, my system lacks a job scheduler, so I start submitted jobs manually (this must be done by root, or by another user listed in the QM_MANAGERS line of /opt/bamboo/etc/bamboo.cfg):

    # qrun -H $HOSTS -J 5

Here 'HOSTS' is a space-separated list of hostnames, and '5' is the job number returned by the initial 'qsub' command above.

Suspending/resuming a running program

To see what state your program is in (e.g., running, ready, suspended, etc.) according to the queue manager, enter

    % qstat

Once your MPI application is running, you can use the 'qsig' command to suspend it:

    % qsig --suspend 5

This causes a checkpoint to be taken and the program to be terminated, but Bamboo will remember it. If you now run 'qstat', for instance, your job will still be listed, but as 'suspended' rather than 'running'. Logically, your program is simply suspended (much as if it had received a SIGSTOP), but in fact it has completely exited, and its state is stored entirely on disk (so, for instance, it can be resumed later even if the machine is rebooted).

To resume your MPI application, use

    % qsig --resume 5
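Because suspend and resume differ only in the option passed to 'qsig', the two commands can be sketched as a single helper that validates its arguments first. The name 'job_ctl' is hypothetical; the sketch uses only the '--suspend' and '--resume' options shown above and prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: build a qsig command line for suspending or resuming a job.
# job_ctl is a hypothetical helper; only the --suspend and --resume
# options shown above are used.
job_ctl() {
    action=$1
    job=$2
    case $action in
        suspend|resume) ;;
        *) echo "usage: job_ctl suspend|resume JOBNUMBER" >&2
           return 2 ;;
    esac
    case $job in
        ''|*[!0-9]*)
            echo "job number must consist of digits only" >&2
            return 2 ;;
    esac
    # Print the command so it can be reviewed before being run.
    echo "qsig --$action $job"
}

# Example: job_ctl suspend 5
```

After reviewing the printed command, it can be run by hand; 'qstat' should then show the new state of the job.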

Taking periodic checkpoints

The SSS system does not yet support taking periodic checkpoints of an MPI application for backup purposes. This support will be added in the future.