2.2. Checkpoint Configuration

Checkpointing is configured using the min_cpu_interval field, which specifies the time interval between checkpoints. Choose a value long enough that checkpointing does not dominate the run: a value that is too low results in frequent checkpointing operations, and writing the .cas and .dat files can be computationally expensive.
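The trade-off behind min_cpu_interval can be estimated with a rough calculation. The following Python sketch is illustrative only and is not part of Fluent or SGE. It assumes that one checkpoint is written after each interval of computation and that every write takes the same amount of time. The function name checkpoint_overhead and the 120-second write time are hypothetical.

```python
def checkpoint_overhead(interval_s: float, write_s: float) -> float:
    """Estimate the fraction of run time spent writing checkpoints.

    Assumes each checkpoint takes write_s seconds and is written once per
    interval_s seconds of computation. Both inputs are illustrative.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if write_s < 0:
        raise ValueError("write_s must not be negative")
    return write_s / (interval_s + write_s)


if __name__ == "__main__":
    write_s = 120.0  # hypothetical time to write one .cas/.dat pair
    for interval_s in (300.0, 3600.0, 14400.0):
        share = checkpoint_overhead(interval_s, write_s)
        print(f"interval {interval_s:>7.0f} s -> {share:.1%} of run time")
```

Under these assumptions, a 300-second interval spends roughly 29% of the run writing files, while a 14400-second interval spends less than 1%. Measure the actual write time for your case before choosing a value.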

SGE requires checkpointing objects to perform checkpointing operations. Fluent provides a sample checkpointing object called sample_ckpt_obj.

Checkpoint configuration requires root or manager privileges. When creating new checkpointing objects for Fluent, keep the default values from the sample object provided by Fluent and change only the following values:

  • queue list (queue_list)

    The queue list should contain the queues that can be used with the checkpointing object.

  • checkpointing and migration commands (ckpt_command and migr_command)

    These values should only be changed when the executable files are not in the default location, in which case the full path should be specified. All the files (that is, ckpt_command.fluent and migr_command.fluent) should be located in a directory that is accessible from all machines where the Fluent simulation is running. When running Fluent 2024 R2, the default location for these files is path/ansys_inc/v242/fluent/fluent24.2.0/addons/sge, where path is the Fluent installation directory.

  • checkpointing directory (ckpt_dir)

    This value dictates where the checkpointing subdirectories are created, so users must have the correct permissions for this directory. The directory should also be visible to all machines where the Fluent simulation is running. The default value is NONE, in which case Fluent uses the current working directory as the checkpointing directory.

  • checkpointing modes (when)

    This value dictates when checkpoints are expected to be generated. Valid values of this parameter are composed of the letters s, m, and x, in any order:

    • Including s causes a job to be checkpointed, aborted, and migrated when the corresponding SGE execution daemon (sge_execd) is shut down.

    • Including m results in the periodic generation of checkpoints, at the interval set by the min_cpu_interval field of the queue (see qconf).

    • Including x causes a job to be checkpointed, aborted, and migrated when a job is suspended.


    Important:  The m mode must be set to permit interval checkpointing.
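The rules for the when value can be checked before a checkpointing object is created. The following Python sketch is illustrative only and is not part of Fluent or SGE. It covers only the three letters described in this section, and the function names parse_when and allows_interval_checkpointing are hypothetical.

```python
VALID_MODES = frozenset("smx")  # the letters described in this section


def parse_when(value: str) -> frozenset:
    """Return the set of checkpointing modes named in a when value.

    The letters may appear in any order. A repeated letter is treated
    as a single occurrence (an assumption of this sketch).
    """
    modes = value.strip()
    if not modes:
        raise ValueError("when value is empty")
    unknown = set(modes) - VALID_MODES
    if unknown:
        raise ValueError(f"unsupported mode letters: {sorted(unknown)}")
    return frozenset(modes)


def allows_interval_checkpointing(value: str) -> bool:
    """Report whether the m mode is present in a when value."""
    return "m" in parse_when(value)


if __name__ == "__main__":
    for candidate in ("sx", "xsm", "m"):
        ok = allows_interval_checkpointing(candidate)
        print(f"when={candidate!r}: interval checkpointing permitted = {ok}")
```

A value such as sx passes validation but does not permit interval checkpointing, because it does not include m.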