This section gives suggestions on what to do if you encounter problems during a parallel run of the CFX-Solver. Most failures occur during the initial phase of the run as a result of configuration problems.
Semaphores are constructs used by Intel MPI to enable interprocess communication locally on a shared memory multiprocessor.
Each parallel process uses its own semaphore. Semaphores are created and used under a user-id, but shared-memory access via semaphores is managed by the operating system kernel.
When an Ansys CFX Intel MPI parallel job finishes, these semaphores are deleted by default. If the job terminates abnormally, for example if one of the processes is killed, the semaphores in use are not deleted and remain in an idle state, consuming free semaphores on your system. The operating system imposes a maximum on the number of semaphores; once this limit is reached, no new semaphores can be created, and software that relies on them may become unstable.
Typical symptoms include the following:

You cannot start a new MPI run. A typical error message written to the console is:

p0_97244: p4_error: semget failed for setnum=%d: 0
An error has occurred in cfx5solve:

A new MPI job starts up, but kills an already running MPI job.

When two MPI jobs are running at the same time for the same user-id, stopping one MPI job kills the second job.

The solver hangs when starting a new job.
There is a built-in UNIX command, "ipcs" (or "ipcs -a"), that you can use to see which semaphores are owned by your user-id. Typical output is as follows:

ipcs
Message Queues:
T  ID    KEY         MODE         OWNER  GROUP
q  0     0x416d02d8  --rw-------  root   system

Shared Memory:
T  ID    KEY         MODE         OWNER  GROUP
m  768   0x7a08e     --rw-------  joe    all
m  1409  0x57463     --rw-------  moe    all

Semaphores:
T  ID    KEY         MODE         OWNER  GROUP
s  0     0x416d02d8  --ra-------  root   system
s  1     0x416d029e  --ra-------  root   system
s  82    0x7a08e     --ra-------  joe    all
s  83    0x7a08f     --ra-------  joe    all
s  84    0x7a090     --ra-------  joe    all
s  85    0x7a091     --ra-------  joe    all
s  214   0x57463     --ra-------  moe    all
s  183   0x57464     --ra-------  moe    all
s  184   0x57465     --ra-------  moe    all
s  185   0x57466     --ra-------  moe    all
The user joe has one shared memory segment and four semaphores left over from a previous CFX computation that aborted abnormally. These need to be deleted.
Once the system limit on semaphores is reached, new Intel MPI jobs are not possible. Because MPI does not delete these semaphores when a simulation terminates abnormally, it is up to each user to delete all such semaphores manually. It is also a good idea to delete any remaining shared memory segments owned by your account at the end of an aborted run. This leaves the system free for other users to run their MPI jobs. Continuing from the example above, the user "joe" would issue the UNIX command "ipcrm -m xxx -s yyy", where xxx is a shared memory ID number and yyy is a semaphore ID number. For example:

ipcrm -s 82 -s 83 -s 84 -s 85 -m 768

This deletes the semaphores and the shared memory segment, leaving the system free for the next user. On some systems, the exact syntax of the ipcrm command may differ from that shown above.
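If a large number of semaphores are left over, one possible shortcut on Linux is to remove every semaphore owned by the current user with a pipeline such as the following. This is an illustration only: verify the column layout of your ipcs output first, because the pipeline assumes the semaphore ID appears in the second column.

ipcs -s | grep $USER | awk '{print $2}' | xargs -n 1 ipcrm -s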
In some cases, it is possible to encounter the following message:
p0_2406: (140.296294) xx_shmalloc: returning NULL; requested 65576 bytes
p0_2406: (140.296337) p4_shmalloc returning NULL; request = 65576 bytes
You can increase the amount of memory by setting the environment variable P4_GLOBMEMSIZE (in bytes)
p0_2406: p4_error: alloc_p4_msg failed: 0
This may occur if Ansys CFX attempts to allocate a shared-memory segment larger than Intel MPI allows by default (4 MB). As the error message indicates, the default can be adjusted by setting the environment variable P4_GLOBMEMSIZE (in bytes). For example, to set this variable to 16 MB when running a C shell:
setenv P4_GLOBMEMSIZE 16777216
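For a Bourne-compatible shell (sh, bash, or ksh), the equivalent setting would be:

export P4_GLOBMEMSIZE=16777216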
When increasing the value of this parameter, be careful not to exceed the maximum size allowed for a shared memory segment. This limit is set by the operating system kernel rather than by Intel MPI.
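On Linux, for example, the kernel limit on shared memory segment size can typically be checked with the following commands (paths and options differ between UNIX variants):

ipcs -lm
cat /proc/sys/kernel/shmmax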
In some cases, it may be necessary to increase the maximum number of semaphore IDs, or the maximum size of a shared memory segment, on your system so that several users or jobs can run in parallel simultaneously. Once all available semaphore IDs are in use, no further parallel MPI jobs can be started. You can find out the semaphore limits on your system with operating-system commands.
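On Linux, for example, the kernel semaphore limits can typically be inspected as follows (again, the exact commands and output differ between UNIX variants):

ipcs -ls
cat /proc/sys/kernel/sem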
In general, increasing these limits is not a simple task, and the procedure varies greatly between computer vendors and operating systems. It may involve changing a system resource file and rebooting the computer, or it may involve changing and recompiling the system kernel. Contact your system administrator if you need to increase the number of semaphores on your system.
On Linux systems, when using the Intel MPI 4.1.3 library in CFX, Intel MPI requires 32 MB of lockable memory for interprocess communication on all parallel hosts. When running distributed parallel with Intel MPI, the solver checks the ‘max locked memory’ limit and issues a warning if it is insufficient. The default hard limit is usually less than 32 MB, requiring you to change the limit in the system configuration file. Users may raise or lower the current limit up to the hard limit, but only the super-user may raise the hard limit. To check the soft and hard limits, use the following commands:
Shell | Soft limit | Hard limit
---|---|---
Bourne shell | ulimit -a | ulimit -Ha
C shell | limit | limit -h
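In a Bourne-compatible shell, for example, the locked-memory limit alone can be checked directly (values are reported in KB):

ulimit -l
ulimit -Hl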
The soft and hard limits may be set to a value (in KB) or to 'unlimited' by adding the following lines to /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
Alternatively, this issue may be overcome by limiting the communication method between hosts to TCP, using the environment variable setting I_MPI_FABRICS=shm:tcp.
Note: This setting will prevent the use of any faster interconnects available on the system. The preferred solution is to increase the 'max locked memory' limit described above.
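For example, the variable could be set in the shell before launching the solver; the first line below is for Bourne-compatible shells and the second is for C shells:

export I_MPI_FABRICS=shm:tcp
setenv I_MPI_FABRICS shm:tcp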
If a parallel run fails, check the following:

Check the Hosts File (hostinfo.ccl) for syntax errors.
Check that the specified executables for each host in the Hosts File exist, and that the files have execute permission set.
Check that the specified executables are correct for the specified hosts (operating system, 64-bit executables, and so on).
Check that you have the correct parallel licenses to run a parallel job on the assigned hosts.
Check that you have enough licenses for the specified maximum number of partitions.
If you are having trouble with distributed parallel MPI, check that you have installed the MPI daemon.
Inconsistent host name resolution with multiple network adaptors:
If there are problems with host name resolution, distributed parallel runs might fail. This might occur when using systems that have multiple network adaptors and IP addresses (for example, systems that are connected to both private and enterprise networks). In such cases, the system configuration should be reviewed. A remedy for this problem is to use IP addresses instead of host names.
CFX Distributed Parallel runs fail
On some SLES machines (typically ones with more than one network card), the default configuration of /etc/hosts will cause CFX distributed parallel runs to fail. In such cases, the problem might be solved by editing the /etc/hosts file to remove all lines that contain redundant loopback addresses. Do not remove the line with the first loopback address, which is typically 127.0.0.1.
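As an illustration only (the host and domain names below are placeholders), a problematic default /etc/hosts might look like this, with an extra loopback entry for the machine's own host name:

127.0.0.1   localhost
127.0.0.2   node1.example.com node1

Here the line beginning with 127.0.0.2 is the redundant loopback entry that should be removed, while the 127.0.0.1 line must be kept.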
Occasionally, you may find that a job submitted in serial converges while the same job run in parallel fails. This can be due to the different internal structure of the multigrid solver: the partitioned mesh leads to coarse mesh blocking that differs from the serial case, and if you have selected a timestep size close to the critical convergence limit, this can cause convergence problems.
Usually a reduction in the timestep size alleviates this problem.