Berkeley Lab Checkpoint/Restart (BLCR) User's Guide

About Berkeley Lab Checkpoint/Restart

Checkpoint/Restart allows you to save one or more processes to a file and later restart them from that file. There are three main uses for this:
  1. Scheduling: Checkpointing a program allows a program to be safely stopped at any point in its execution, so that some other program can run in its place. The original program can then be run again later.
  2. Process Migration: If a compute node appears to be likely to crash, or there is some other reason for shutting it down (routine maintenance, hardware upgrade, etc.), checkpoint/restart allows any processes running on it to be moved to a different node (or saved until the original node is available again).
  3. Failure recovery: A long-running program can be checkpointed periodically, so that if it crashes due to hardware, system software, or some other non-deterministic cause, it can be restarted from a point in its execution more recent than the beginning.
Berkeley Lab Checkpoint/Restart (BLCR) provides checkpoint/restart on Linux systems. BLCR can be used either with processes on a single computer, or with parallel jobs (such as MPI applications) which may be running across multiple machines on a cluster of Linux nodes.
Note: Checkpointing parallel jobs requires a library which has integrated BLCR support. At the present time, the only MPI implementations which support checkpoint/restart with BLCR are LAM/MPI and MVAPICH2 version 0.9.8 or newer.  However, work is underway to add support to other MPI implementations, so consult your MPI's documentation for the latest information.

Checkpoint/restarting within a BLCR-aware batch control system

One way to use BLCR is with a batch scheduler system (a.k.a. "job controller", "queue manager", etc.) that knows how to use the BLCR tools to checkpoint and restart the jobs under its control. You can simply tell such a system to "suspend" or "checkpoint" a job, and then later to "resume" or "restart" it.

Unfortunately BLCR has not yet been integrated with many batch systems. Currently the only system that supports BLCR with MPI jobs is the SciDAC Scalable Systems Software (SSS) Suite. If you are running on a system that uses the SSS Suite (this is the case with some versions of the OSCAR clustering tool kit), then refer to these instructions for using checkpoint/restart.

Support for serial jobs is available through SGE. See this report for more information.

As with MPI implementations, efforts are under way to integrate BLCR with additional batch systems, so check your batch system's documentation for the latest info.

The rest of this document assumes that your batch scheduler does not have built-in support for BLCR. In this case you will manually run the BLCR commands needed to checkpoint/restart your jobs.

Note: this does not mean that you cannot checkpoint/restart your applications if you use a batch system without built-in support for BLCR. It simply means that you have to do your checkpoints/restarts manually as described in the remainder of this document. To the batch system, a job that is checkpointed and terminated manually simply looks like a job that has "completed". A restart of an application looks like a "new" job.

Checkpointing Jobs with the BLCR command-line tools

Make sure BLCR is installed and loaded

This guide assumes that BLCR has already been successfully built, installed, and configured on your system (presumably by you or your system administrator). One easy way to test this is to use the 'lsmod' command to see if the BLCR kernel module is loaded on the node(s) that your program will run on:

    % /sbin/lsmod
    Module                  Size  Used by    Not tainted
    blcr                   47508   0
    blcr_vmadump           24744   1  blcr
    blcr_imports            7808   2  blcr,blcr_vmadump
    iptable_filter          2412   0  (autoclean) (unused)
    ip_tables              15864   1  [iptable_filter]

If you don't see the three modules that begin with 'blcr' in the output of 'lsmod', then BLCR is not yet running on your system. Consult the BLCR Administrators Guide for instructions on building and installing BLCR.
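
The check described above can also be scripted. The following is a minimal sketch, not part of BLCR: the function name 'check_blcr_modules' is made up for this example, and it assumes the three module names shown in the sample output. It reads 'lsmod'-style text on standard input and names any module that is missing.

```shell
#!/bin/sh
# check_blcr_modules (hypothetical helper, not shipped with BLCR)
# Read lsmod-style output on standard input and print "missing: NAME" for
# each of the three BLCR modules that is absent. Returns 0 only if all
# three are present.
check_blcr_modules() {
    listing=$(cat)
    status=0
    for mod in blcr blcr_vmadump blcr_imports; do
        # The module name is the first whitespace-delimited field of a line.
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "missing: $mod"
            status=1
        fi
    done
    return $status
}

# Typical use on a compute node:
#   /sbin/lsmod | check_blcr_modules && echo "BLCR modules are loaded"
```

Matching on the first field rather than using a plain 'grep blcr' avoids counting a line that only mentions 'blcr' in its "Used by" column.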

Make sure your environment is set up correctly

You must ensure that the BLCR commands, libraries and manual pages can be found in your shell.

Try running

    % cr_checkpoint --help
If 'cr_checkpoint' cannot be found, you need to modify your 'PATH' to include the directory where 'cr_checkpoint' lives. You will probably also want to modify your 'LD_LIBRARY_PATH' variable to contain the directory where 'libcr.so' lives, and add the BLCR man directory to your 'MANPATH'.

Setting up your environment with 'modules'

If your system uses the Environment Modules system to manage software packages, you may be able to get all of your needed environment settings simply by entering something like

    % module add blcr
However, there is no requirement that 'blcr' is the name of the module you'll need; your administrator may have given it a different name ('checkpoint', etc.). Or s/he may have neglected to add BLCR to the set of packages managed by modules, in which case you'll need to use the 'manual' technique below.

Manually setting up your environment

To manually set up your environment for BLCR, the first thing you need to know is where it has been installed. By default, BLCR installs into the '/usr/local' directory tree, but your system administrator may have put it elsewhere by passing '--prefix=PREFIX' when BLCR was built (where PREFIX can be any arbitrary directory). See your system documents, or try commands such as 'locate cr_checkpoint' or 'find'.
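
As an illustration of turning a search result into PREFIX, here is a small sketch. The function name 'blcr_prefix_from_binary' is invented for this example, and it assumes the usual layout in which 'cr_checkpoint' lives in 'PREFIX/bin'; the search roots in the comment are only examples.

```shell
#!/bin/sh
# blcr_prefix_from_binary PATH_TO_CR_CHECKPOINT (hypothetical helper)
# Given the full path of the cr_checkpoint binary, print the install
# prefix, assuming the binary sits in PREFIX/bin.
blcr_prefix_from_binary() {
    bindir=$(dirname "$1")      # e.g. /opt/blcr/bin
    dirname "$bindir"           # e.g. /opt/blcr
}

# Typical use, once 'find' or 'locate' has produced a path:
#   find /usr /opt -name cr_checkpoint -type f 2>/dev/null
#   blcr_prefix_from_binary /opt/blcr/bin/cr_checkpoint
```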

Once you have determined where BLCR is installed, enter the following commands (depending on which type of shell you are using), replacing PREFIX with the value specified for the '--prefix' option used when configuring BLCR.

To configure a bourne-type shell (such as 'bash' or 'ksh'):

    $ PATH=$PATH:PREFIX/bin
    $ MANPATH=$MANPATH:PREFIX/man
    $ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:PREFIX/lib
    $ export PATH MANPATH LD_LIBRARY_PATH

To configure a csh-type shell (such as 'csh' or 'tcsh'):

    % setenv PATH ${PATH}:PREFIX/bin
    % setenv MANPATH ${MANPATH}:PREFIX/man
    % setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:PREFIX/lib

The above examples set the PATH, MANPATH and LD_LIBRARY_PATH variables in your current shell only. To make the settings apply to future sessions and windows as well, which is strongly recommended, add the example commands to your shell's startup files. For a single user of BLCR, add the appropriate set of commands to the shell startup files in your home directory ('.bashrc' for bash, '.profile' for other bourne-type shells, or '.cshrc' for csh-type shells). For a system-wide installation, add the bourne shell commands to '/etc/bashrc' and '/etc/profile' and the csh commands to '/etc/cshrc'.
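
When these commands live in a startup file they may run more than once (for example in nested shells), and each run appends the same directory again. The sketch below is one possible way to avoid that in a bourne-type shell. The function name 'append_path' is made up for this example and is not part of BLCR. It also avoids a leading ':' when the variable starts out empty, which matters for 'LD_LIBRARY_PATH' because the dynamic linker treats an empty entry as the current directory. 'MANPATH' is left out of the sketch because 'man' implementations differ in how they treat an empty entry.

```shell
#!/bin/sh
# append_path NAME DIR (hypothetical helper)
# Append DIR to the colon-separated list in the variable called NAME,
# skipping the append if DIR is already listed and avoiding a leading ':'
# when the variable is empty or unset. NAME must be a plain variable name.
append_path() {
    name=$1
    dir=$2
    eval "current=\${$name-}"
    case ":$current:" in
        *":$dir:"*)
            ;;                               # already present, nothing to do
        *)
            if [ -n "$current" ]; then
                eval "$name=\$current:\$dir"
            else
                eval "$name=\$dir"
            fi
            ;;
    esac
    export "$name"
}

# Example lines for a startup file (replace PREFIX as described above):
#   append_path PATH PREFIX/bin
#   append_path LD_LIBRARY_PATH PREFIX/lib
```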

Checkpointing/restarting applications on a single machine

Types of applications supported

BLCR currently supports:

However, certain applications are not supported because they use resources not restored by BLCR:

Making an application checkpointable

To be checkpointed successfully with BLCR, an application must contain some library code that BLCR provides. There are several ways of ensuring this:
  1. Start your executable with the 'cr_run' command:
            % cr_run your_executable [arguments ]
    'cr_run' loads the BLCR library into your application at startup time. You do not need to modify an application to have it work with 'cr_run'.  However, 'cr_run' is limited to dynamically linked executables; statically linked executables will need to use one of the approaches listed below.
  2. Link your application with BLCR's 'libcr'. For instance, you could make a simple 'hello world' C program checkpointable via
            % gcc -o hello hello.c -LPREFIX/lib -lcr
    where PREFIX is the root of your BLCR install. Your application will now look for the BLCR library whenever it starts up, but note that this does not mean it will automatically be found: you will also need to set your 'LD_LIBRARY_PATH' environment variable to 'PREFIX/lib' if libcr is not installed into a standard system library directory (or read the 'ld' man page for information on '-rpath').
  3. Link your application with some library which uses BLCR. For instance, if your MPI library has been made BLCR-aware, it will cause libcr to be loaded, and so simply linking with the MPI library is enough to make your application checkpointable.
  4. Use run-time loading to dynamically link 'libcr' (see the 'dlopen' man page).  This mechanism can be used for building applications or libraries that must work both with and without BLCR present on a system.
  5. Force the 'libcr.so' dynamic library to be loaded at startup by adding its full pathname to the 'LD_PRELOAD' environment variable (or just the filename if the directory is listed in 'LD_LIBRARY_PATH'). In most cases, the pthread library will also be required. We do not recommend setting this in your environment in general (via 'export' or 'setenv'), since certain programs may interact poorly with the BLCR library logic. Instead, we recommend that you use a command like
            % env LD_PRELOAD=PREFIX/lib/libcr.so.0:libpthread.so.0 your_executable [arguments ]
    This is essentially how 'cr_run' works.

If your application does not link in BLCR's library via one of the mechanisms listed above, then any attempt to checkpoint it will fail gracefully. NEW: This behavior is a change from previous BLCR releases, in which this situation would cause the program to die unless you handled BLCR's real-time signal explicitly.

Checkpointing the process

To checkpoint a process, simply run
    % cr_checkpoint PID
where PID is the application's process ID.

By default, 'cr_checkpoint' saves a checkpoint, and then lets your application continue running, which is useful for saving the state of a process in case it later fails.  However, you may terminate the process after it has been checkpointed by passing the '--term' flag:

    % cr_checkpoint --term PID
This causes a 'SIGTERM' signal to be sent to the process at the end of the checkpoint. To send a different signal to your process at the end of the checkpoint, you can pass any arbitrary signal number using the '--signal=N' flag.

By default BLCR interprets the final argument (PID in the examples above) as the process id of a single (potentially multi-threaded) process to checkpoint.  However, there are three ways to request a checkpoint of multiple related processes (known as the scope of the checkpoint):

    % cr_checkpoint --pgid PGID
    % cr_checkpoint --sid SID
    % cr_checkpoint --tree PID

These three examples request checkpoints over the scope of a process group, session and process tree, respectively.  The PGID is a process group identifier and SID is a session identifier. Here we take the terms "process group" and "session" to mean the set of processes having the given pgid or sid.  In most cases the pgid or sid is just the pid of the process group leader or session leader. When in doubt, try using the '-j' option to 'ps' to show PGID and SID columns. The '--tree' flag to 'cr_checkpoint' requests a checkpoint of the process with the given pid, and all its descendants (excluding those whose parent has exited and which have thus become children of the 'init' process). This is the same as the grouping shown by the output of the 'pstree' command.
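
Besides 'ps -j', the same two identifiers can be read directly from '/proc' on Linux. The following is a minimal sketch under that assumption; the function name 'print_scope_ids' is made up for this example. It prints the pgid and sid of a given pid, which can then be passed to the '--pgid' or '--sid' options.

```shell
#!/bin/sh
# print_scope_ids PID (hypothetical helper, Linux only)
# Print "PGID SID" for PID by parsing /proc/PID/stat.
print_scope_ids() {
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # Field 2 (the command name) is parenthesised and may itself contain
    # spaces, so drop everything up to the last ") " before splitting.
    rest=${stat##*) }
    set -- $rest
    # Remaining fields are: state ppid pgrp session ...
    echo "$3 $4"
}

# Typical use:
#   print_scope_ids 15005
#   cr_checkpoint --pgid PGID      # with the first number printed above
```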

When checkpointing multiple processes using one of the scope arguments described above, all the pipes among the processes are saved and restored. Pipes to/from processes not within the checkpoint scope are not saved (these will be replaced at restart time by the correspondingly numbered file descriptors of the 'cr_restart' process, if any).  While 'cr_checkpoint' will accept a process group or session identifier as a scope argument, BLCR does not currently restore the pgid or sid of restarted processes.  Instead restored processes inherit the pgid and sid of the 'cr_restart' process.  This is considered a sane default because an unmodified parent (such as a shell) of 'cr_restart' would lose job control over the processes if these identifiers are restored.  A future BLCR release will include the ability to request restore of these identifiers.

Files that contain checkpoints are called context files. By default, they are named 'context.ID', where ID is the pid, pgid or sid that was checkpointed, and are stored in the current working directory of the 'cr_checkpoint' process. You may specify an alternate name and location of the context file via the '-f' option.

There are a number of other options that 'cr_checkpoint' provides. See the man page (or 'cr_checkpoint --help') for details.
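
For the failure-recovery use case, checkpoints are typically taken repeatedly. The sketch below shows one way to do this with only the '-f' option described above, writing each checkpoint to its own file so that an earlier one is never overwritten. The function name 'checkpoint_series', the file naming scheme 'context.PID.N' and the 'CR_CHECKPOINT' override variable are all assumptions of this example, not BLCR features. Your version of 'cr_checkpoint' may offer options of its own for handling existing files, so check the man page first.

```shell
#!/bin/sh
# checkpoint_series PID INTERVAL COUNT (hypothetical helper)
# Take up to COUNT checkpoints of PID, INTERVAL seconds apart, each in its
# own file context.PID.N in the current directory. Returns 0 when done or
# when the process has gone away, 1 if a checkpoint fails.
# CR_CHECKPOINT may name an alternative command (useful for dry runs).
checkpoint_series() {
    pid=$1
    interval=$2
    count=$3
    n=1
    while [ "$n" -le "$count" ]; do
        # Stop quietly once the process is gone (its /proc entry vanishes).
        [ -e "/proc/$pid" ] || return 0
        "${CR_CHECKPOINT:-cr_checkpoint}" -f "context.$pid.$n" "$pid" || return 1
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
        n=$((n + 1))
    done
    return 0
}

# Typical use: ten checkpoints of process 15005, one every 30 minutes:
#   checkpoint_series 15005 1800 10
```

Since context files are never removed automatically, a series like this should be paired with some cleanup of older files.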

Restarting the process

To successfully restart from a context file, certain conditions must be met:

Of these requirements, BLCR is only able to verify the availability of the PIDs and the existence and permissions of the executable, libraries and open files.  Failure to satisfy those constraints will lead to an explicit failure from BLCR.  Violation of the rules against modification to any files will not be detected by BLCR and the resulting effects on the restarted application are unpredictable.

You may restart a program on a different machine than the one it was checkpointed on if all of these conditions are met (they often are on cluster systems, especially if you are using a shared network filesystem), and the kernels are the same.  The restriction on executables and their shared libraries being the same can be a problem for systems using prelinking; see the BLCR FAQ for information on dealing with systems that prelink.

You can restart a process by using 'cr_restart' on its context file:

    % cr_restart context.15005
The original process will be restored, and resume running in the exact state it was in at checkpoint time. Note that this includes restoring its process ID, so you cannot restart a program unless the original copy of it has exited (otherwise 'cr_restart' will fail with a message that the PID is already in use). You may restart a process from a particular context file as many times as you wish. The context file is not automatically removed at any point, so you should delete it if/when it is no longer useful to you.
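
Because a restart fails while the original PID is still in use, it can be convenient to check for that first. The following is a minimal sketch, assuming Linux and a context file that still carries its default name 'context.ID'. The function name 'restart_from_context' and the 'CR_RESTART' override variable are made up for this example. For a checkpoint taken with '--pgid' or '--sid', the ID in the file name is the group or session id, which is usually but not always the leader's pid.

```shell
#!/bin/sh
# restart_from_context CONTEXT_FILE (hypothetical helper, Linux only)
# Refuse to call cr_restart while the PID encoded in the default file name
# (context.PID) is still in use. Returns 2 if no PID can be inferred,
# 1 if the PID is busy, otherwise the exit status of cr_restart.
restart_from_context() {
    ctx=$1
    pid=${ctx##*.}                   # context.15005 -> 15005
    case $pid in
        ''|*[!0-9]*)
            echo "cannot infer a PID from '$ctx'" >&2
            return 2
            ;;
    esac
    if [ -e "/proc/$pid" ]; then
        echo "PID $pid is still in use; wait for the original to exit" >&2
        return 1
    fi
    "${CR_RESTART:-cr_restart}" "$ctx"
}

# Typical use:
#   restart_from_context context.15005
```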

Checkpointing/restarting an MPI application

The best source of information on dealing with any BLCR-aware MPI implementation is the documentation provided with the MPI.  However, here are some hints that may be helpful.

Checkpoint/restart with LAM/MPI

For more information

For more information on Checkpoint/Restart for Linux, visit the project home page: http://ftg.lbl.gov/checkpoint, and/or check out our answers to Frequently Asked Questions about BLCR.  When those resources don't answer your questions, you may e-mail checkpoint@lbl.gov for help.