5.2. Automatic Activation of Hybrid Parallel Processing

By default, the Mechanical APDL application will run with the requested number of distributed processes unless certain criteria are met during solution that indicate activating hybrid parallel would be beneficial. The automatic hybrid parallel feature is designed to improve performance by reducing the total number of MPI processes. After entering the solution module and issuing the SOLVE command, the application works behind the scenes to determine whether trading distributed processes for threads could improve performance. There are two key scenarios that can lead to the activation of automatic hybrid parallel, and both depend on the model being solved and the available hardware resources.

The first scenario depends on the available memory and the number of requested processes. The Mechanical APDL application estimates the amount of memory required to proceed with the solution. These memory heuristics enable the application to determine whether to run the direct sparse solver in the in-core or the out-of-core memory mode. In some cases, the memory estimate shows that reducing the number of distributed processes enables the solution to proceed in-core instead of out-of-core, the latter being substantially slower given its dependence on hard drive speed. When this occurs, the automatic hybrid parallel logic evenly reduces the number of distributed processes and activates threads on each process to avoid out-of-core execution. However, if the required memory far exceeds the available memory, the automatic hybrid parallel logic cannot help. In these cases, the run proceeds in out-of-core mode with the original number of distributed processes.
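The decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's actual logic: the function names, the linear memory model in estimate_incore_gb, and the choice to search every even reduction down to a single process are all assumptions made for the example. The only behavior taken from this section is that processes are traded evenly for threads, that an odd process count disables the memory heuristics, and that the run falls back to out-of-core mode with the original process count when no reduction fits.

```python
def estimate_incore_gb(procs):
    # Hypothetical memory model: a fixed cost plus a per-process overhead.
    # The real estimate is internal to the application.
    return 40.0 + 6.0 * procs


def choose_layout(requested_procs, available_gb, estimate=estimate_incore_gb):
    """Return (processes, threads_per_process, memory_mode)."""
    # No change is needed if the requested layout already fits in memory.
    if estimate(requested_procs) <= available_gb:
        return requested_procs, 1, "in-core"
    # Memory heuristics are disabled for an odd number of processes.
    if requested_procs % 2 == 1:
        return requested_procs, 1, "out-of-core"
    # Trade processes for threads evenly, keeping processes * threads constant.
    for threads in range(2, requested_procs + 1):
        if requested_procs % threads != 0:
            continue
        procs = requested_procs // threads
        if estimate(procs) <= available_gb:
            return procs, threads, "in-core"
    # Required memory far exceeds what is available: keep the original layout.
    return requested_procs, 1, "out-of-core"


print(choose_layout(16, 100.0))  # fewer processes, more threads
print(choose_layout(16, 30.0))   # nothing fits, stays out-of-core
```

With the assumed model, 16 processes need 136 GB, so on a machine with 100 GB the sketch settles on 8 processes with 2 threads each (88 GB). With only 30 GB available, no reduction fits and the original 16 processes run out-of-core.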

The second scenario depends on both the number and size of large, non-distributed domains in a model and the number of requested processes. Non-distributed domains can come from modeling features like fracture mechanics or contact. When a large model is solved with relatively few processes, each process should have a similar number of elements in its associated domain. However, as the number of requested processes increases, the number of elements in each non-distributed domain stops shrinking and reaches a constant value, because such a domain cannot be split further. This creates a performance bottleneck because the other domains will likely have far fewer elements. The deficit can vary, but four to ten times fewer elements is not unusual for high-core-count runs. The domains with fewer elements finish their computations first and must then wait for the slower, non-distributed domains to finish. As a result, scaling performance degrades as the core count increases. In these cases, the automatic hybrid parallel logic reduces the overall number of distributed processes, and thereby the number of distributed domains, so that additional threads can be activated on the large, non-distributed domains. By activating additional threads with these “big domain” heuristics, the computational rate for these big domains is improved, which in turn improves scaling performance.
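A small calculation makes the imbalance concrete. The sketch below assumes a single non-distributed domain that occupies one process, with the remaining elements split evenly across the other processes. The function name and the element counts are hypothetical and chosen only to illustrate the trend.

```python
def imbalance_ratio(total_elements, big_domain_elements, procs):
    """Elements in the big domain divided by elements in a typical other domain."""
    even_share = total_elements / procs
    # With few processes, an even split is at least as large as the big
    # domain, so the decomposition can stay balanced.
    if big_domain_elements <= even_share:
        return 1.0
    # Otherwise the big domain stays fixed while the others keep shrinking.
    others = (total_elements - big_domain_elements) / (procs - 1)
    return big_domain_elements / others


# Hypothetical model: 1,000,000 elements, one 50,000-element big domain.
for procs in (8, 64, 128, 192):
    ratio = imbalance_ratio(1_000_000, 50_000, procs)
    print(f"{procs:4d} processes: big domain is {ratio:.1f}x a typical domain")
```

Under these assumptions the domains are balanced at 8 processes, while at 128 processes the big domain holds roughly 6.7 times as many elements as the others, so most processes sit idle waiting for it.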

There are several caveats that prevent the automatic hybrid parallel logic from activating, or otherwise limit the extent to which it can activate threads. These are as follows:

For simplicity, memory heuristics are disabled when an odd number of distributed processes is requested.
“Big domain” heuristics are disabled in multi-node runs with a non-homogeneous number of processes per machine (that is, when the machines do not all run the same number of processes).
In multi-node runs, the number of threads activated on each big domain is limited to prevent hardware oversubscription (more processes and threads than available CPU cores).
In multi-node runs, the “big domain” heuristics are disabled if there are so many big domains that their placement spans multiple compute nodes.
When restarting or continuing an analysis, the number of processes and threads determined in the upstream analysis will be enforced in the downstream analysis.
- For example, an upstream modal analysis runs with -np 2, then is continued with -np 8 in a downstream harmonic analysis. The downstream analysis will revert to -np 2 and not activate threads. Threads can be activated manually if the downstream analysis is launched with the same number of processes as the upstream analysis.
Although hardware oversubscription should not occur, if the program does detect oversubscription, it issues a fatal error and stops the solution. Stopping the run is preferred over accepting the performance degradation caused by hardware oversubscription.
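The oversubscription limit in the caveats above comes down to simple arithmetic on a single compute node. The sketch below is a hypothetical illustration, not the program's internal check: it assumes every process starts with one thread and that the spare cores on a node are shared equally among the big domains placed on that node. Both function names are invented for the example.

```python
def is_oversubscribed(threads_per_process, cores_on_node):
    """True if the processes on a node use more threads than the node has cores."""
    return sum(threads_per_process) > cores_on_node


def max_threads_per_big_domain(cores_on_node, procs_on_node, big_domains_on_node):
    """Largest per-big-domain thread count that keeps the node within its cores."""
    spare_cores = cores_on_node - procs_on_node
    if big_domains_on_node == 0 or spare_cores <= 0:
        return 1  # nothing to give away; every process keeps one thread
    return 1 + spare_cores // big_domains_on_node


# Hypothetical node: 32 cores, 24 processes, 2 of which hold big domains.
limit = max_threads_per_big_domain(32, 24, 2)
layout = [limit, limit] + [1] * 22
print(limit, is_oversubscribed(layout, 32))
```

In this example each big domain can use up to 5 threads, which fills the node's 32 cores exactly. Raising either big domain to 6 threads would exceed the core count, which is the condition the program treats as a fatal error.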