5.2. Examples with Checkpointing

The examples that follow apply to both interactive and batch submissions. For brevity, only batch submissions are described. Usage of the LSF checkpoint and restart capabilities, which require echkpnt and erestart, is described as follows:

  • Serial 3D Fluent batch job under LSF with checkpoint/restart

    fluent 3d -g -i journal_file -scheduler=lsf -scheduler_opt='-k " /home/username 60"' -scheduler_opt='-a fluent'

    • In this example, the LSF -a fluent specification identifies which echkpnt/erestart combination to use, /home/username is the checkpoint directory, and the duration between automatic checkpoints is 60 minutes.

    The following commands can then be used:

    • bjobs -l <job_ID>

      • This command returns the job information about <job_ID> in the LSF system.

    • bchkpnt <job_ID>

      • This command forces Fluent to write a case file, a data file, and a restart journal file at the end of its current iteration.

      • The files are saved in a directory named <checkpoint_directory>/<job_ID>. The <checkpoint_directory> is defined through the -scheduler_opt= option in the original fluent command.

      • Fluent then continues to iterate.

    • bchkpnt -k <job_ID>

      • This command forces Fluent to write a case file, a data file, and a restart journal file at the end of its current iteration.

      • The files are saved in a directory named <checkpoint_directory>/<job_ID> and then Fluent exits. The <checkpoint_directory> is defined through the -scheduler_opt= option in the original fluent command.

    • brestart <checkpoint_directory> <job_ID>

      • This command starts a Fluent job using the latest case and data files in the <checkpoint_directory>/<job_ID> directory.

      • The restart journal file <checkpoint_directory>/<job_ID>/#restart.inp is used to instruct Fluent to read the latest case and data files in that directory and continue iterating.
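    The checkpoint file layout described above can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions: the function names checkpoint_dir_for and restart_journal_for are hypothetical (they are not part of Fluent or LSF), and the job ID 1234 is a placeholder. The script only prints paths and does not contact LSF.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Fluent or LSF): derive the paths
# described above from a checkpoint directory and an LSF job ID.

# Print the directory in which the checkpoint files for a job are saved.
checkpoint_dir_for() {
    # $1 = checkpoint directory passed with -k, $2 = LSF job ID
    printf '%s/%s\n' "$1" "$2"
}

# Print the path of the restart journal file read on restart.
restart_journal_for() {
    # The file name begins with '#', so it is kept inside quotes.
    printf '%s/#restart.inp\n' "$(checkpoint_dir_for "$1" "$2")"
}

# Placeholder values: the checkpoint directory from the example above
# and an assumed job ID of 1234.
checkpoint_dir_for /home/username 1234
restart_journal_for /home/username 1234
```

    With these placeholder values, the two calls print the checkpoint directory for the job followed by the path of its restart journal file.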

  • Parallel 3D Fluent batch job under LSF with checkpoint/restart, which specifies /home/username as the checkpoint directory, uses 4 processes, and reads a journal file called journal_file

    fluent 3d -t4 -g -i journal_file -scheduler=lsf -scheduler_opt='-k " /home/username"' -scheduler_opt='-a fluent'

    The following commands can then be used:

    • bjobs -l <job_ID>

      • This command returns the job information about <job_ID> in the LSF system.

    • bchkpnt <job_ID>

      • This command forces parallel Fluent to write a case file, a data file, and a restart journal file at the end of its current iteration.

      • The files are saved in a directory named <checkpoint_directory>/<job_ID>. The <checkpoint_directory> is defined through the -scheduler_opt= option in the original fluent command.

      • Parallel Fluent then continues to iterate.

    • bchkpnt -k <job_ID>

      • This command forces parallel Fluent to write a case file, a data file, and a restart journal file at the end of its current iteration.

      • The files are saved in a directory named <checkpoint_directory>/<job_ID>. The <checkpoint_directory> is defined through the -scheduler_opt= option in the original fluent command.

      • Parallel Fluent then exits.

    • brestart <checkpoint_directory> <job_ID>

      • This command starts a Fluent network parallel job using the latest case and data files in the <checkpoint_directory>/<job_ID> directory.

      • The restart journal file <checkpoint_directory>/<job_ID>/#restart.inp is used to instruct Fluent to read the latest case and data files in that directory and continue iterating.

      • The parallel job is restarted with the same number of processes as specified by the -t<x> option in the original fluent command (4 in the previous example).

    • bmig -m <host> 0

      • This command checkpoints all jobs belonging to the current user (a job ID of 0 selects all jobs) and moves them to host <host>.
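
The checkpoint, exit, and restart sequence described in the examples above can be sketched as a dry-run shell script. This is a minimal illustration, not a supported workflow: the function name print_restart_cycle and the sample values are assumptions, and the script prints the LSF commands rather than executing them, so it neither waits for a checkpoint to complete nor checks job state.

```shell
#!/bin/sh
# Dry-run sketch: prints LSF commands instead of executing them.
# The function name and the sample values are illustrative assumptions.

print_restart_cycle() {
    # $1 = checkpoint directory, $2 = LSF job ID
    ckpt_dir=$1
    job_id=$2
    # Inspect the job before acting on it.
    printf 'bjobs -l %s\n' "$job_id"
    # Checkpoint at the end of the current iteration, then exit.
    printf 'bchkpnt -k %s\n' "$job_id"
    # Restart from the latest files in <checkpoint_directory>/<job_ID>.
    printf 'brestart %s %s\n' "$ckpt_dir" "$job_id"
}

# Placeholder values: checkpoint directory from the examples above and
# an assumed job ID of 1234.
print_restart_cycle /home/username 1234
```

In practice, each command would be entered only after the previous one has taken effect. Printing the commands first makes it easy to review the job ID and checkpoint directory before running anything against the scheduler.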