Note: checkpointing parallel jobs requires a library which has integrated BLCR support. At present, the only MPI implementation which supports checkpoint/restart with BLCR is the LAM/MPI library.
Unfortunately BLCR has not yet been integrated with many batch systems. Currently the only system that supports BLCR is the SciDAC Scalable Systems Software (SSS) Suite. If you are running on a system that uses the SSS Suite (this is the case with some versions of the OSCAR clustering toolkit), then refer to the SSS documentation for instructions on using checkpoint/restart with BLCR.
The rest of this document assumes that your batch scheduler does not have built-in support for BLCR. In this case you will manually run the BLCR commands needed to checkpoint/restart your jobs.
Note: this does not mean that you cannot checkpoint/restart your applications if you use a batch system without built-in support for BLCR. It simply means that you have to do your checkpoints/restarts manually. To the batch system, a job that is checkpointed and terminated manually simply looks like a job that has "completed". A restart of an application looks like a "new" job.
This guide assumes that BLCR has already been successfully built, installed, and configured on your system (presumably by you or your system administrator). One easy way to test this is to use the 'lsmod' command to see if the BLCR kernel module is loaded on the node(s) that your program will run on:
% /sbin/lsmod
Module                  Size  Used by    Not tainted
blcr                   46936   0
vmadump                16544   0  [blcr]
iptable_filter          2412   0  (autoclean) (unused)
ip_tables              15864   1  [iptable_filter]

If you don't see 'blcr' and 'vmadump' in the output of 'lsmod', then BLCR is not yet available on your system. Consult the BLCR Administrator's Guide for instructions on building and installing BLCR.
Try running

% cr_checkpoint --help

If 'cr_checkpoint' cannot be found, you need to modify your 'PATH' to include the directory where 'cr_checkpoint' lives. You will probably also want to modify your 'LD_LIBRARY_PATH' variable to contain the directory where 'libcr.so' lives, and add the BLCR man directory to your 'MANPATH'.
If your system uses the Environment Modules system to manage software packages, you may be able to get all of your needed environment settings simply by entering something like
% module add blcr

However, there is no requirement that 'blcr' is the name of the module you'll need; your administrator may have given it a different name ('checkpoint', etc.), or may have neglected to add BLCR to the set of packages managed by modules, in which case you'll need to use the 'manual' technique below.
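If you are unsure what the module is called on your system, listing the available modules may help. For example, in a bourne-type shell (a sketch only; note that 'module avail' prints to standard error, hence the redirection):

% module avail 2>&1 | grep -i -e blcr -e checkpoint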
To manually set up your environment for BLCR, the first thing you need to know is where it has been installed. By default, BLCR installs into the '/usr/local' directory tree, but your system administrator may have put it elsewhere by passing '--prefix=PREFIX' when BLCR was built (where PREFIX can be any arbitrary directory). See your system documents, or try commands such as 'locate cr_checkpoint' or 'find'.
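For example (a sketch only; the directories searched here are merely guesses, and 'locate' requires an up-to-date file database):

% locate cr_checkpoint
% find /usr /opt -name cr_checkpoint 2>/dev/null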
Once you have determined where BLCR is installed, enter the following commands (depending on which type of shell you are using), replacing PREFIX with the value specified for the '--prefix' option used when configuring BLCR.
To configure a bourne-type shell (such as 'bash' or 'ksh'):
PATH=$PATH:PREFIX/bin
MANPATH=$MANPATH:PREFIX/man
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:PREFIX/lib
export PATH MANPATH LD_LIBRARY_PATH
To configure a csh-type shell (such as 'csh' or 'tcsh'):
setenv PATH ${PATH}:PREFIX/bin
setenv MANPATH ${MANPATH}:PREFIX/man
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:PREFIX/lib
The above examples set the PATH, MANPATH, and LD_LIBRARY_PATH variables in your current session or window only. It is strongly recommended that you make these settings permanent, so that they also affect future sessions or windows. To do this, add the example commands to your shell's startup files.

For a single user of BLCR, add the appropriate set of commands to the shell startup files in your home directory ('.bashrc' for bash, '.profile' for other bourne-type shells, or '.cshrc' for csh-type shells). For a system-wide installation, add the bourne shell commands to '/etc/bashrc' and '/etc/profile', and the csh commands to '/etc/cshrc'.
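After opening a new session or window, you can confirm that the environment is set up correctly; for example:

% which cr_checkpoint
% man cr_checkpoint

If 'which' prints the full path to 'cr_checkpoint' and the man page displays, BLCR's commands are available in your environment.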
The simplest way to make an application checkpointable is to start it with the 'cr_run' utility:

% cr_run your_executable [arguments]

'cr_run' loads the BLCR library into your application at startup time. You do not need to modify an application to have it work with 'cr_run'.
Alternatively, you can link the BLCR library into your application when it is compiled:

% gcc -o hello hello.c -LPREFIX/lib -lcr

where PREFIX is the root of your BLCR install. Your application will now look for the BLCR library whenever it starts up, but note that this does not mean it will automatically be found: you will need to set your 'LD_LIBRARY_PATH' environment variable to 'PREFIX/lib' if libcr is not installed into a standard system library directory.
A third option is to load the BLCR library yourself at startup via the 'LD_PRELOAD' environment variable:

% env LD_PRELOAD=PREFIX/lib/libcr.so.0 your_executable [arguments]

This is essentially how 'cr_run' works.
If you do not start your program with 'cr_run' (or otherwise load the BLCR library as described above), it will simply die with an error if you try to checkpoint it. More specifically, it will receive a real-time signal (the exact one depends on your kernel and C library versions), which will cause your program to die by default, unless you handle the signal explicitly.
To checkpoint a running process, pass its process ID to 'cr_checkpoint':

% cr_checkpoint PID

where PID is the application's process ID.
By default, 'cr_checkpoint' saves a checkpoint, and then lets your application continue running. This is useful for backing up a process in case it fails later, for instance.
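For instance, the following bourne-shell loop (a sketch only; the PID '12345' and the one-hour interval are placeholders) takes a fresh backup checkpoint of a running process every hour, stopping once the process exits:

% while kill -0 12345 2>/dev/null; do cr_checkpoint 12345; sleep 3600; done

Each pass rewrites 'context.12345' (assuming the default context file naming described below), so only the most recent backup is kept.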
If you wish to stop the process after it has been checkpointed, pass the '--term' flag:
% cr_checkpoint --term PID

This causes a SIGTERM signal to be received by the process at the end of the checkpoint. If you have a reason to send a different signal to your process at the end of the checkpoint, you can pass an arbitrary signal number instead via the '--signal' flag.
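For instance, to have the process receive SIGKILL (signal 9) rather than SIGTERM at the end of the checkpoint, you would run something like the following (the PID is a placeholder; check 'cr_checkpoint --help' for the exact flag syntax on your version):

% cr_checkpoint --signal 9 12345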
Files that contain checkpoints are called context files. By default, they are named 'context.PID', where PID is the process ID that was checkpointed, and are stored in the current working directory that 'cr_checkpoint' was run in. You may specify the name and location of the context file via the '-f' option.
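For example, to write the checkpoint of process 12345 to a specific file on a shared filesystem (the path here is purely illustrative):

% cr_checkpoint -f /scratch/checkpoints/myapp.context 12345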
There are a number of other options that 'cr_checkpoint' provides. See the man page (or 'cr_checkpoint --help') for details.
You can restart a process by using 'cr_restart' on its context file:
% cr_restart context.15005

The original process will be restored, and resume running in the exact state it was in at checkpoint time. Note that this includes restoring its process ID, so you cannot restart a program unless the original copy of it has exited (otherwise 'cr_restart' will fail with a message that the PID is already in use).
You may restart a process from a particular context file as many times as you wish. The context file is not automatically removed at any point; delete it if/when it is no longer useful to you.
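Putting the pieces together, a typical manual checkpoint/restart cycle might look like the following sketch (the program name and the PID 15005 are illustrative):

% cr_run ./long_job &
% cr_checkpoint --term 15005
% cr_restart context.15005

The first command starts the job checkpointable, the second saves its state and terminates it, and the third (run later, perhaps after a reboot) resumes it from the saved state.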
To start a checkpointable LAM/MPI application, simply run it with the regular LAM 'mpirun' launcher:
% mpirun C hello_mpi
Note: you may need to start up the LAM environment first by running 'lamboot' before starting your application.
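For example (the boot schema file name 'lamhosts' is a placeholder; see the LAM/MPI documentation for its format):

% lamboot -v lamhosts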
To checkpoint the entire MPI application (across all nodes and processes), simply run
% cr_checkpoint 12305

where '12305' is the process ID of the 'mpirun' command. Do not pass the PID of your MPI executable: when 'mpirun' is checkpointed, it automatically takes care of transitively checkpointing all of the processes involved in the MPI job.
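If you do not know the PID of 'mpirun', one way to find it is with 'ps'; for example (a sketch; the exact output format varies between systems):

% ps -u $USER -o pid,cmd | grep mpirun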
To restart your MPI job, simply run 'cr_restart' on the 'mpirun' process's context file:
% cr_restart context.12305

All processes in the MPI job will be restarted as they were at checkpoint time.
If your program dies when you attempt to checkpoint it (because it was not started under BLCR), see the discussion of making an application checkpointable above for the various ways to fix this.
If you are unlucky enough that some other, unrelated process has grabbed the PID of your application, you must figure out some way to get rid of that process. If you own the process, you can of course simply kill it (or checkpoint it!). Otherwise, consider becoming root, or consulting your system administrator. BLCR will not kill another process for you (this 'feature' would raise certain security issues).
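To see whether a given PID is currently in use, and by whom, you can ask 'ps'; for example (the PID is illustrative):

% ps -p 15005 -o pid,user,cmd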
For more information on LAM/MPI, see the LAM/MPI Documentation.