4.6. Troubleshooting

This section describes problems that you may encounter while using distributed-memory parallel processing, as well as methods for overcoming them. Some of these problems are specific to a particular system, as noted.

4.6.1. Setup and Launch Issues

To aid in troubleshooting, you may need to view the actual MPI run command line. On Linux the command is mpirun, and you can view the command line by setting the ANS_SEE_RUN_COMMAND environment variable to 1. On Windows the command is mpiexec, and you can view the command line by setting the ANS_SEE_RUN and ANS_CMD_NODIAG environment variables to TRUE.
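
For example, assuming a bash shell on Linux and a Command Prompt window on Windows, the variables can be set as follows before launching the job:

Linux:

    export ANS_SEE_RUN_COMMAND=1

Windows:

    SET ANS_SEE_RUN=TRUE
    SET ANS_CMD_NODIAG=TRUE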

Job fails to launch

The first thing to check when a DMP job fails to launch is that the MPI software you wish to use is installed and properly configured (see Configuration Requirements for DMP Processing).

Next, if running across multiple machines, ensure that the working directory path is identical on all machines (or that you are using a shared network directory) and that you have permission to write files into the working directory used by each machine.

Finally, make sure that you are running the distributed solution on a homogeneous cluster. The OS level and processors must be identical on all nodes in the cluster. If they are not, you may encounter problems. For example, when running a DMP analysis across machines using Intel MPI, if the involved cluster nodes have different processor models, the program may hang (that is, fail to launch). In this situation, no data is written to any files and no error message is output.

Error executing Ansys. Refer to System-related Error Messages in the Mechanical APDL online help. If this was a DMP run, verify that your MPI software is installed correctly, check your environment settings, or check for an invalid command line option.

You may see this error if ANSYS Inc\v242\ansys\bin\<platform> (where <platform> is intel or winx64) is not in your PATH.

If you need more detailed debugging information, use the following:

  1. Open a Command Prompt window and set the following:

    SET ANS_SEE_RUN=TRUE
    SET ANS_CMD_NODIAG=TRUE
  2. Run the following command line:

    ansys242 -b -dis -i myinput.inp -o myoutput.out

A DMP analysis fails to launch when running from a fully-qualified pathname.

A DMP analysis will fail if the Ansys 2024 R2 installation path contains a space followed by a dash and %ANSYS242_DIR%\bin\<platform> (where <platform> is intel or winx64) is not in the system PATH. To avoid this failure, add %ANSYS242_DIR%\bin\<platform> to the system PATH and invoke ansys242 without the fully qualified pathname. For example, if your installation path is:

C:\Program Files\Ansys -Inc\v242\bin\<platform>

The following command to launch a DMP analysis will fail:

"C:\Program Files\Ansys -Inc\v242\bin\<platform>\ansys242.exe" -g 

However, if you add C:\Program Files\Ansys -Inc\v242\bin\<platform> to the system PATH, you can successfully launch a DMP analysis by using the following command:

ansys242 -g
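
As a sketch, assuming the winx64 platform directory and the example installation path above, the PATH can be extended for the current Command Prompt session as follows (use setx or the System Properties dialog to make the change permanent):

    SET PATH=C:\Program Files\Ansys -Inc\v242\bin\winx64;%PATH%
    ansys242 -g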

A DMP analysis fails to launch when using the Slurm job scheduler.

When using the Slurm job scheduler, depending on the cluster configuration for Slurm and the command line syntax for Mechanical APDL, the program may fail to launch and crash in the ansOpenMP library (libansOpenMP.so) or the Intel OpenMP library (libiomp5.so). This can be avoided by:

  • setting the KMP_AFFINITY environment variable to something other than "norespect" (for example, "disabled" or "none") before launching the simulation, or

  • switching to Open MPI (via -mpi openmpi), as shown in the sketch after this list.
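
For example, assuming a bash shell on the launch node and this guide's example input and output file names, either workaround might look like the following:

    # Workaround 1: relax the OpenMP affinity setting before launching
    export KMP_AFFINITY=none
    ansys242 -b -dis -i myinput.inp -o myoutput.out

    # Workaround 2: switch to Open MPI instead
    ansys242 -b -dis -mpi openmpi -i myinput.inp -o myoutput.out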

The required licmsgs.dat file, which contains licensing-related messages, was not found or could not be opened. The following path was determined using environment variable ANSYS242_DIR. This is a fatal error -- exiting.

Check the ANSYS242_DIR environment variable to make sure it is set properly. Note that for Windows HPC clusters, the ANSYS242_DIR environment variable should be set to \\HEADNODE\Ansys Inc\v242\ansys, and the ANSYSLIC_DIR environment variable should be set to \\HEADNODE\Ansys Inc\Shared Files\Licensing on all nodes.
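
As a quick check, assuming a Command Prompt window on each Windows node, you can verify the current values and, from an elevated prompt, set machine-level values with setx /M:

    echo %ANSYS242_DIR%
    echo %ANSYSLIC_DIR%
    setx ANSYS242_DIR "\\HEADNODE\Ansys Inc\v242\ansys" /M
    setx ANSYSLIC_DIR "\\HEADNODE\Ansys Inc\Shared Files\Licensing" /M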

Possible runtime error if you installed an older version of MS MPI

The supported version of MS MPI (listed in Table 4.2: Platforms and MPI Software) is automatically installed when you install Ansys 2024 R2. If you install an older version of MS MPI on your machine, be aware that it can supersede the supported version and may cause runtime errors. If this occurs, either uninstall the older version of MS MPI or reinstall the supported version so that it takes precedence.

4.6.2. Stability Issues

This section describes potential stability issues that you may encounter while running a DMP analysis.

Recovering from a Computer, Network, or Program Crash

When a distributed-memory parallel processing job crashes unexpectedly (for example, a segmentation violation, floating point exception, or out-of-disk-space error), an error message may fail to be fully communicated to the master process and written into the output file. If this happens, you can view all of the output and/or error files written by each of the worker processes (for example, Jobnamen.out and/or Jobnamen.err, where n is the process rank) in an attempt to learn why the job failed. In some rare cases, the job may hang. When this happens, you must manually kill the processes; the error files and output files written by all the processes will be incomplete but may still provide some useful information as to why the job failed.

Be sure to kill any lingering processes (Linux: type kill -9 from command level; Windows: use Task Manager) on all cores and start the job again.
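
For example, on Linux a cleanup might look like the following; process names vary by version and launch method, so this is only a sketch, and you should inspect the ps output before killing anything:

    ps -ef | grep ansys
    kill -9 <PID>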

Using an alternative MPI version

Ansys chooses the default MPI based on robustness and performance. On rare occasions, you may want to try an alternative MPI (specified by the command line option -mpi) as a workaround if unexpected issues arise (see the table below and Table 4.2: Platforms and MPI Software for supported MPI software versions).

Platform      Default MPI Software    Command Line Option to Specify Alternative MPI Software
Linux         Intel MPI 2021.11.0     -mpi intelmpi2018 (for Intel MPI 2018.3.222)
                                      -mpi openmpi (for Open MPI 4.0.5)
Windows 10    Intel MPI 2021.11.0     -mpi msmpi (for MS MPI 10.1.12)
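
For example, assuming an 8-core Linux run that would otherwise use the default Intel MPI, the following hypothetical command lines select the alternatives listed in the table:

    ansys242 -b -dis -np 8 -mpi intelmpi2018 -i myinput.inp -o myoutput.out
    ansys242 -b -dis -np 8 -mpi openmpi -i myinput.inp -o myoutput.out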

"Rename operation failed" Error in DMP Mode

When running in distributed memory parallel (DMP) mode using multiple nodes with Open MPI, you may encounter the following error message: Rename operation failed. If this happens, switch to Intel MPI or use a single compute node.

Job Fails with SIGTERM Signal (Linux Only)

Occasionally, when running on Linux, a simulation may fail with a message like the following:

MPI Application rank 2 killed before MPI_Finalize() with signal 15

forrtl: error (78): process killed (SIGTERM)

This typically occurs when computing the solution and means that the operating system has killed the Mechanical APDL process. The two most common causes are: (1) the program is using too much of the machine's hardware resources (typically memory), so the system kills the process, or (2) a user has manually killed the job (that is, with the kill -9 system command). Check the size of the job you are running relative to the amount of physical memory on the machine. Most often, decreasing the model size or finding a machine with more RAM will result in a successful run.
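
For example, on Linux you can compare the machine's physical memory against the job's needs before launching again; as a rough sketch, assuming the memory statistics reported in a previous run's output file (myoutput.out in this guide's examples) are available for reference:

    # Total and available physical memory, in GB
    free -g
    # Memory usage summary lines from a previous run
    grep -i memory myoutput.out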

4.6.3. Solution and Performance Issues

This section describes solution and performance issues that you may encounter while running a DMP analysis.

Poor Speedup or No Speedup

As more cores are utilized, the runtimes are generally expected to decrease. The biggest relative gains are typically achieved when using two cores compared to using a single core. When significant speedups are not seen as additional cores are used, the reasons may involve both hardware and software issues. These include, but are not limited to, the following situations.

Hardware

Oversubscribing hardware  —  In a multiuser environment, this could mean that more physical cores are being used by multiple simulations than are available on the machine. It could also mean that hyperthreading is activated. Hyperthreading typically involves enabling extra virtual cores, which can sometimes allow software programs to more effectively use the full processing power of the CPU. However, for compute-intensive programs such as Mechanical APDL, using these virtual cores rarely provides a significant reduction in runtime. Therefore, it is recommended that you not use hyperthreading; if hyperthreading is enabled, do not use more cores than the number of physical cores.
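
As a quick check on Linux, assuming the lscpu utility is available, you can compare the number of physical cores to the number of logical CPUs before choosing how many cores to request:

    lscpu | grep -E 'Socket|Core|Thread'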

Lack of memory bandwidth  —  On some systems, using most or all of the available cores can result in a lack of memory bandwidth. This lack of memory bandwidth can affect the overall scalability.

Slow interconnect speed  —  When running a DMP analysis across multiple machines, the speed of the interconnect (GigE, InfiniBand, etc.) can have a significant effect on performance. Slower interconnects cause each DMP process to spend extra time waiting for data to be transferred from one machine to another. This becomes especially important as more machines are involved in the simulation. See Interconnect Configuration at the beginning of this chapter for the recommended interconnect speed.

Software

Simulation includes non-supported features  —  Shared-memory and distributed-memory parallelism speed up certain compute-intensive operations in /PREP7, /SOLU, and /POST1. However, not all operations are parallelized. If a particular operation that is not parallelized dominates the simulation time, then using additional cores will not help achieve a faster runtime.

Simulation has too few DOF (degrees of freedom)  —  Some analyses (such as transient analyses) may require long compute times, not because the number of DOF is large, but because a large number of calculations are performed (that is, a very large number of time steps). Generally, if the number of DOF is relatively small, parallel processing will not significantly decrease the solution time. Consequently, for small models with many time steps, parallel performance may be poor because the model size is too small to fully utilize a large number of cores.

I/O cost dominates solution time  —  For some simulations, the amount of memory required to obtain a solution is greater than the physical memory (that is, RAM) available on the machine. In these cases, either virtual memory (that is, hard disk space) is used by the operating system to hold the data that would otherwise be stored in memory, or the equation solver writes extra files to the disk to store data. In both cases, the extra I/O done using the hard drive can significantly affect performance, making the I/O performance the main bottleneck to achieving optimal performance. In these cases, using additional cores will typically not result in a significant reduction in overall time to solution.

Large contact pairs  —  For simulations involving contact pairs with a large number of elements relative to the total number of elements in the entire model, the performance of distributed-memory parallelism is often negatively affected. These large contact pairs require a DMP analysis to do extra communication and often cause a load imbalance between each of the cores (that is, one core might have two times more computations to perform than another core). In some cases, using CNCHECK,TRIM can help trim any unnecessary contact/target elements from the larger contact pairs. In other cases, however, manual interaction will be required to reduce the number of elements involved in the larger contact pairs.

Different Results Relative to a Single Core

Distributed-memory parallel processing initially decomposes the model into domains. Typically, the number of domains matches the number of cores. Operational randomness and numerical round-off inherent to parallelism can cause slightly different results between runs on the same machine(s) using the same number of cores or different numbers of cores. This difference is often negligible. However, in some cases the difference is appreciable. This sort of behavior is most commonly seen on nonlinear static or transient analyses which are numerically unstable. The more numerically unstable the model is, the more likely the convergence pattern or final results will differ as the number of cores used in the simulation is changed.

Inexact Licensing Charge After Manually Killing a Worker Process from the Operating System

In an elastic licensing environment using DMP mode, terminating a worker process from the operating system (Linux: kill -9 command; Windows: Task Manager) can result in a license charge for a full hour of usage even if only a fraction of that hour was actually used. To avoid this inexact licensing charge, exit the program from within Mechanical APDL (/EXIT) or terminate the master process from the operating system.