17.4.3. Error Handling

This section gives suggestions on what to do if you encounter problems during a parallel run of the CFX-Solver. Most failures occur during the initial phase of the run as a result of configuration problems.

17.4.3.1. Problems with Intel MPI

17.4.3.1.1. Semaphores and Shared-memory Segments

Semaphores are constructs used by Intel MPI to enable interprocess communication locally on a shared memory multiprocessor.

Each parallel process uses its own semaphore. Semaphores are created and used under a particular user-id, but shared-memory access via semaphores is managed by the operating system kernel.

When an Ansys CFX Intel MPI parallel job finishes, these semaphores are deleted by default. If the job terminates abnormally, for example because one of the processes was killed, the semaphores in use are not deleted and remain in an idle state. This consumes free semaphores on your system. The operating system permits only a limited number of semaphores; once this limit is reached, no new semaphores can be created, and software that depends on them may become unstable.

17.4.3.1.2. Typical Problems When You Run Out of Semaphores
  • You cannot start a new MPI run. Typical error message to console:

    p0_97244:  p4_error: semget failed for setnum=%d: 0
            An error has occurred in cfx5solve:
    
  • A new MPI job starts up, but kills an already running MPI job.

  • When two MPI jobs are running at the same time, for the same user-id, stopping one MPI job kills the second MPI job.

  • The solver hangs when starting a new job.

17.4.3.1.3. Checking How Many Semaphores Are in Use, and by Whom

UNIX provides the built-in command "ipcs" (or "ipcs -a" for full details). Use this command to see which semaphores and shared-memory segments are owned by your user-id. Typical output is as follows:

ipcs
Message Queues:
T      ID        KEY    MODE         OWNER    GROUP
q       0 0x416d02d8 --rw-------      root   system
Shared Memory:
T      ID        KEY    MODE         OWNER    GROUP
m     768    0x7a08e --rw-------      joe    all
m    1409    0x57463 --rw-------      moe    all
Semaphores:
T      ID        KEY    MODE         OWNER    GROUP
s       0 0x416d02d8 --ra-------      root   system
s       1 0x416d029e --ra-------      root   system
s      82    0x7a08e --ra-------      joe    all
s      83    0x7a08f --ra-------      joe    all
s      84    0x7a090 --ra-------      joe    all
s      85    0x7a091 --ra-------      joe    all
s     214    0x57463 --ra-------      moe    all
s     183    0x57464 --ra-------      moe    all
s     184    0x57465 --ra-------      moe    all
s     185    0x57466 --ra-------      moe    all

In this example, the user joe has one shared-memory segment and four semaphores left over from a previous CFX computation that aborted abnormally. These need to be deleted.

17.4.3.1.4. Deleting the Semaphores You Are Using

Once the system limit for semaphores is reached, new Intel MPI jobs cannot be started. Because MPI does not delete these semaphores when a simulation terminates abnormally, it is up to each user to delete such semaphores manually. It is also good practice to delete any remaining shared-memory segments owned by your account at the end of an aborted run, leaving the system free for other users' MPI runs. Continuing from the above example, the user "joe" would issue the UNIX command "ipcrm -m xxx -s yyy", where xxx is a shared-memory ID number and yyy is a semaphore ID number. For example:

ipcrm -s 82 -s 83 -s 84 -s 85 -m 768

This will delete the semaphores and the shared memory segment, leaving the system free for the next user. On some systems, the exact syntax of the ipcrm command may differ from that shown above.
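For example, on systems where ipcrm uses the older resource-keyword syntax rather than option flags, the equivalent clean-up might look like this (a sketch only; check the ipcrm man page on your system):

ipcrm shm 768
ipcrm sem 82
ipcrm sem 83
ipcrm sem 84
ipcrm sem 85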

17.4.3.1.5. Shared-memory Segment Size Problems

In some cases, it is possible to encounter the following message:

p0_2406: (140.296294) xx_shmalloc: returning NULL; requested 65576 bytes
p0_2406: (140.296337) p4_shmalloc returning NULL; request = 65576 bytes
You can increase the amount of memory by setting the environment 
variable P4_GLOBMEMSIZE (in bytes)
p0_2406:  p4_error: alloc_p4_msg failed: 0

This may occur if Ansys CFX attempts to allocate a shared-memory segment larger than the Intel MPI default of 4 MBytes. As mentioned in the error message, the default size can be adjusted by setting the environment variable P4_GLOBMEMSIZE. For example, to set this variable to 16 MBytes when running a C shell:

setenv P4_GLOBMEMSIZE 16777216
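If you are running a Bourne-compatible shell (for example, sh or bash) rather than a C shell, the equivalent setting would be:

export P4_GLOBMEMSIZE=16777216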

When increasing the value of this parameter, be careful not to exceed the maximum size allowed for a shared-memory segment. This limit is set by the operating system kernel rather than by Intel MPI.

17.4.3.1.6. Checking Semaphore ID and Shared Memory Segment Limits

In some cases, it may be necessary to increase the maximum number of semaphore IDs, or the maximum size of a shared-memory segment, on your system so that several users or jobs can run in parallel simultaneously. Once all available semaphores are in use, no further parallel MPI jobs can be started. You can find out the semaphore and shared-memory limits on your system with the following commands:

17.4.3.1.6.1. Linux

Semaphores: /sbin/sysctl -a | grep -i sem

Shared-memory segments: /sbin/sysctl -a | grep -i shm
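The exact parameter names and values vary between kernels, but as an illustration the output might look similar to the following, where the four fields of kernel.sem are the maximum semaphores per set, the system-wide maximum number of semaphores, the maximum operations per semop call, and the maximum number of semaphore sets, and kernel.shmmax is the maximum shared-memory segment size in bytes:

kernel.sem = 250 32000 32 128
kernel.shmmax = 68719476736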

17.4.3.1.7. Increasing the Maximum Number of Semaphores for Your System

In general, this is not a simple task, and it varies greatly between different computer vendors and operating systems. It may involve changing a system resource file and rebooting the computer, or it may involve making changes to the system kernel and recompilation of the system kernel. Contact your system administrator if you need to increase the number of semaphores on your system.
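As an illustration only, on many Linux systems the semaphore limits can be raised without rebuilding the kernel by adding a line such as the following to /etc/sysctl.conf and then running /sbin/sysctl -p (the values shown are examples rather than recommendations; consult your system administrator before changing them):

kernel.sem = 250 32000 100 128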

17.4.3.1.8. Max Locked Memory

On Linux systems, when using the Intel MPI 4.1.3 library in CFX, Intel MPI requires 32 MB of lockable memory for interprocess communication on all parallel hosts. When running distributed parallel with Intel MPI, the solver checks the ‘max locked memory’ limit and issues a warning if it is insufficient. The default hard limit is usually less than 32 MB, requiring you to change the limit in the system configuration file. Users may raise or lower the current limit up to the hard limit, but only the super-user may raise the hard limit. To check the soft and hard limits, use the following commands:

Shell           Command

Bourne shell    ulimit -a
                ulimit -Ha

C shell         limit
                limit -h
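To check only the locked-memory values from a Bourne-compatible shell, you can also use the following commands, which report the soft and hard 'max locked memory' limits in KB; for the 32 MB requirement mentioned above they should report at least 32768 (or unlimited):

ulimit -l
ulimit -Hl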

The soft and hard limits may be set to a value (in KB) or to ‘unlimited’ by adding the following lines to /etc/security/limits.conf:

* soft memlock unlimited
* hard memlock unlimited

Alternatively, this issue may be overcome by limiting the communication method between hosts to TCP, using the environment variable setting I_MPI_FABRICS=shm:tcp.
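For example, assuming a Bourne-compatible shell, the variable could be set before starting the solver as follows:

export I_MPI_FABRICS=shm:tcp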


Note:  This setting will prevent the use of any faster interconnects available on the system. The preferred solution is to increase the 'max locked memory' limit described above.


17.4.3.2. Problems with the Ansys CFX Executables

  • Check the Hosts File (hostinfo.ccl) for syntax errors.

  • Check that the specified executables for each host in the Hosts File exist, and that the files have execute permission set.

  • Check that the specified executables are correct for the specified hosts (operating system, 64-bit executables, and so on).

17.4.3.3. Problems with Ansys CFX Licenses

  • Check that you have the correct parallel licenses to run a parallel job on the assigned hosts.

  • Check that you have enough licenses for the specified maximum number of partitions.

17.4.3.4. Windows Problems

  • If you are having trouble with distributed parallel MPI, check that you have installed the MPI daemon.

  • Inconsistent host name resolution with multiple network adaptors:

    If there are problems with host name resolution, distributed parallel runs might fail. This might occur when using systems that have multiple network adaptors and IP addresses (for example, systems that are connected to both private and enterprise networks). In such cases, the system configuration should be reviewed. A remedy for this problem is to use IP addresses instead of host names.
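    As a hypothetical sketch (the IP addresses and definition file name are placeholders, and the exact command-line options depend on your installation), a distributed run could be started with the hosts specified by IP address rather than by name:

    cfx5solve -def model.def -par-dist "192.168.1.10,192.168.1.11"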

17.4.3.5. Linux Problems

  • CFX Distributed Parallel runs fail

    On some SLES machines (typically ones with more than one network card), the default configuration of /etc/hosts will cause CFX distributed parallel runs to fail. In such cases, the problem might be solved by editing the /etc/hosts file to remove all lines that contain redundant loopback addresses. Do not remove the line with the first loopback address, which is typically 127.0.0.1.
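    As an illustration (host names and addresses are hypothetical), a problematic /etc/hosts might contain a second loopback entry for the host itself; remove only the redundant 127.0.0.2 line, keeping the 127.0.0.1 line and the entry with the real address:

    127.0.0.1       localhost
    127.0.0.2       node1.example.com node1
    192.168.0.10    node1.example.com node1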

17.4.3.6. Convergence Problems

Occasionally you may find that jobs submitted in serial converge while the same jobs run in parallel fail. This can be due to differences in the internal structure of the multigrid solver: the partitioned mesh leads to coarse-mesh blocking that differs from that of the serial mesh, and if you have selected a timestep size close to the critical convergence limit, this can cause convergence problems.

Usually a reduction in the timestep size alleviates this problem.
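As a hypothetical illustration (the parameter values are placeholders; set them through CFX-Pre or the CCL for your own case), the physical timestep for a steady-state run is controlled in the solver CCL along the following lines:

SOLVER CONTROL:
  CONVERGENCE CONTROL:
    Timescale Control = Physical Timescale
    Physical Timescale = 0.005 [s]
  END
END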