Finally, you must build your MPI applications with LAM/MPI. The MPICH MPI system does not yet support checkpointing with BLCR. The easiest way to make sure you are using the correct MPI implementation is to use the 'switcher' program provided by OSCAR:
% switcher mpi = lam-7.0.6
Make sure to pick a version of LAM that has support for checkpointing with BLCR.
% qsub -l nodes=2,walltime=50:00 /usr/bin/lampd my_mpi_program [arguments]
Note that you must provide the full path of 'lampd' in this command.
You can also submit non-MPI serial jobs to the batch queue, but in this case, use the 'cr_run' command to start them:
% qsub -l nodes=1,walltime=50:00 /usr/bin/cr_run my_serial_program [arguments]
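The two submission forms above differ only in the launcher placed in front of the program. As a minimal sketch, the choice can be wrapped in a small shell function. The name 'build_qsub_cmd' and its argument order are hypothetical, not part of bamboo or BLCR; the function only prints the command line, using the 'qsub' options and launcher paths shown above, and does not submit anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of bamboo): print the qsub command line
# for a checkpointable job, without running it.
# Usage: build_qsub_cmd mpi|serial NODES WALLTIME PROGRAM [ARGS...]
build_qsub_cmd() {
    kind=$1; nodes=$2; walltime=$3
    shift 3
    case "$kind" in
        mpi)    launcher=/usr/bin/lampd ;;   # MPI jobs start under lampd (full path required)
        serial) launcher=/usr/bin/cr_run ;;  # serial jobs start under cr_run
        *)      echo "unknown job kind: $kind" >&2; return 1 ;;
    esac
    # Remaining arguments are the program and its own arguments.
    echo "qsub -l nodes=$nodes,walltime=$walltime $launcher $*"
}
```

For instance, 'build_qsub_cmd serial 1 50:00 my_serial_program' prints the serial submission line shown above (without the '[arguments]' placeholder), which can be reviewed before being run by hand.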
Once your job has been submitted to bamboo using one of the above commands, it can be run normally. For instance, my system lacks a job scheduler, and so I start submitted jobs manually (which must be done by root, or another user listed in the QM_MANAGERS line of /opt/bamboo/etc/bamboo.cfg):
# qrun -H $HOSTS -J 5
where 'HOSTS' is a space-separated list of hostnames, and '5' is the job number returned by the initial 'qsub' command above.
To see what state your program is in (e.g., running, ready, suspended, etc.) according to the queue manager, enter:
% qstat
Once your MPI application is running, you can use the 'qsig' command to cause it to be suspended:
% qsig --suspend 5
This causes a checkpoint to be taken and the program to be terminated, but bamboo will remember it. If you now run 'qstat', for instance, your job will still be listed, but as 'suspended' rather than 'running'. Logically, your program is simply suspended (much as if it had received a SIGSTOP), but in fact it has completely exited, and its state is stored entirely on disk (so, for instance, it can be resumed later even if the machine is rebooted).
To resume your MPI application, use:
% qsig --resume 5
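If you suspend and resume jobs often, a thin wrapper can guard against typos in the action or job number. The following is only a sketch: 'job_ctl' and the 'QSIG' variable are hypothetical names, and the only 'qsig' options used are the '--suspend' and '--resume' forms shown above. By default it prints the command instead of running it; setting QSIG=qsig would make it invoke the real command.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of bamboo) around the qsig calls shown above.
# Usage: job_ctl suspend|resume JOBNUMBER
# Dry run by default: prints the command unless QSIG is set (e.g. QSIG=qsig).
job_ctl() {
    action=$1; job=$2
    # Reject empty or non-numeric job numbers.
    case "$job" in
        ''|*[!0-9]*) echo "job number must be numeric: $job" >&2; return 1 ;;
    esac
    case "$action" in
        suspend|resume) ${QSIG:-echo qsig} "--$action" "$job" ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
}
```

Running 'job_ctl suspend 5' followed later by 'job_ctl resume 5' reproduces the checkpoint and restart cycle described above for job 5.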