When running a simulation, the solution time is typically dominated by three main parts: the time spent creating the element matrices and forming the global matrices or global systems of equations, the time spent solving the linear system of equations, and the time spent calculating derived quantities (such as stress and strain) and other requested results for each element.
Distributed-memory parallelism allows the entire solution phase to run in parallel, including the stiffness matrix generation, linear equation solving, and results calculations. As a result, a simulation using distributed-memory parallel processing usually achieves much faster solution times than a similar run performed using shared-memory parallel processing, particularly at higher core counts.
You can run a DMP solution over multiple cores on a single machine or on multiple machines (that is, a cluster). It automatically decomposes the model into smaller domains, transfers the domains to each core, solves each domain simultaneously, and creates a complete solution to the model. The memory and disk space required to complete the solution can also be distributed over multiple machines. By utilizing all of the resources of a cluster (computing power, memory, and I/O bandwidth), distributed-memory parallel processing can solve very large problems much more efficiently than the same simulation run on a single machine.
DMP Solution Behavior
Distributed-memory parallel processing works by launching multiple processes on either a single machine or on multiple machines (as specified by one of the following command line options: -np, -machines, or -mpifile). The machine that the distributed run is launched from is referred to as the head compute node, and the other machines are referred to as the compute nodes. The first process launched on the head compute node is referred to as the master process, and all other processes are referred to as worker processes.
Each DMP process is essentially running a shared-memory parallel (SMP) process. These processes are launched through the specified MPI software layer. The MPI software allows each DMP process to communicate, or exchange data, with the other processes involved in the distributed simulation.
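As a rough sketch, a DMP run might be launched from the command line along these lines (the executable name mapdl and the -i and -o input/output options are illustrative assumptions, and the exact machine-list syntax varies by version and platform; only the -np and -machines options come from the description above):

   mapdl -np 8 -i input.dat -o output.out
   (launches 8 DMP processes on the local machine)

   mapdl -machines host1:4:host2:4 -i input.dat -o output.out
   (launches 4 DMP processes on each of two compute nodes)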
DMP processing is not currently supported for all of the analysis types, elements, solution options, etc. that are available with SMP processing (see Supported Features). In some cases, the program stops the DMP analysis to avoid performing an unsupported action. If this occurs, you must launch an SMP analysis to perform the simulation. In other cases, the program automatically disables the distributed-memory parallel capability and performs the operation using shared-memory parallelism. This disabling of the distributed-memory parallel processing can happen at various levels in the program.
The master process reads the input commands and performs all of the pre- and postprocessing actions. Only certain commands (for example, the SOLVE command and supporting commands such as /SOLU, FINISH, /EOF, /EXIT, and so on) are communicated to the worker processes for execution. Therefore, outside of the SOLUTION processor (/SOLU), a DMP analysis behaves very similarly to an SMP analysis. The master process works on the entire model during these pre- and postprocessing steps and may use shared-memory parallelism to improve the performance of these operations. During this time, the worker processes wait to receive new commands from the master process.
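The following input fragment sketches this division of work (the /PREP7 and /POST1 blocks are generic placeholders for whatever pre- and postprocessing a model requires; only the behavior of /SOLU, SOLVE, and FINISH is taken from the description above):

   /PREP7        ! preprocessing: master process only; workers wait
   ! ... define element types, materials, and mesh ...
   FINISH
   /SOLU         ! enter the SOLUTION processor
   ! ... apply loads and boundary conditions ...
   SOLVE         ! communicated to the workers; all DMP processes become active
   FINISH        ! also combines the per-process results files
   /POST1        ! postprocessing: master process only
   ! ... review results ...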
Once the SOLVE command is issued, it is communicated to the worker processes and all DMP processes become active. At this time, the program decides which mode to use when computing the solution. In some cases, the solution proceeds using only a distributed-memory parallel (DMP) mode. In other cases, similar to pre- and postprocessing, the solution proceeds using only a shared-memory parallel (SMP) mode. In a few cases, a mixed mode may be used, which applies as much distributed-memory parallelism as possible for maximum performance. These three modes are described further below.
Pure DMP mode — All features in the simulation support distributed-memory parallelism, and it is used throughout the solution. This mode typically provides optimal performance.
Mixed mode — The simulation involves a particular set of computations that do not support DMP processing. Examples include certain equation solvers and remeshing due to mesh nonlinear adaptivity. In these cases, distributed-memory parallelism is used throughout the solution, except for the unsupported set of computations. When that step is reached, the worker processes simply wait while the master process uses shared-memory parallelism to perform the computations. After those computations are finished, the worker processes resume computing until the entire solution is completed.
Pure SMP mode — The simulation involves an analysis type or feature that does not support DMP processing. In this case, distributed-memory parallelism is disabled at the onset of the solution, and shared-memory parallelism is used instead. The worker processes are not involved at all in the solution but simply wait while the master process uses shared-memory parallelism to compute the entire solution.
When using shared-memory parallelism within a DMP run (in mixed mode or pure SMP mode, including all pre- and postprocessing operations), the master process will not use more cores on the head compute node than the total number of cores you specify for the DMP solution. This is done to avoid exceeding the requested CPU resources or the requested number of licenses.
The following table shows which steps, including specific equation solvers, can be run in parallel using SMP, DMP, and hybrid parallel processing.
Table 4.1: Parallel Capability in SMP, DMP, and Hybrid Parallel Processing
Solver/Feature | SMP | DMP and Hybrid
---|---|---
Sparse | Y | Y |
PCG | Y | Y |
ICCG | Y | Y [a] |
JCG | Y | Y [a] [b] |
QMR [c] | Y | Y [a] |
Block Lanczos eigensolver | Y | Y |
PCG Lanczos eigensolver | Y | Y |
Supernode eigensolver | Y | Y [a] |
Subspace eigensolver | Y | Y |
Unsymmetric eigensolver | Y | Y |
Damped eigensolver | Y | Y |
QRDAMP eigensolver | Y | Y |
Element formulation, results calculation | Y | Y |
Graphics and other pre- and postprocessing | Y | Y [a] |
[a] This solver/operation only runs in mixed mode.
[b] For static analyses and transient analyses using the full method (TRNOPT,FULL), the JCG equation solver runs in pure DMP mode only when the matrix is symmetric. Otherwise, it runs in SMP mode.
[c] The QMR solver only supports 1 core in SMP mode and in mixed mode.
The maximum number of cores allowed in a DMP analysis is currently set at 16384. Therefore, assuming the appropriate HPC licenses are available, you can run a DMP analysis using anywhere from 2 to 16384 cores. Parallel performance varies widely from model to model; for every model, there is a point beyond which using more cores does not significantly reduce the overall solution time. Consequently, most models run with DMP cannot efficiently make use of hundreds or thousands of cores.
Files generated by a DMP analysis are named Jobnamen.ext, where n is the process number. (See Differences in General Behavior for more information.) The master process is always numbered 0, and the worker processes are numbered 1, 2, and so on. When the solution is complete and you issue the FINISH command in the SOLUTION processor, the program combines all Jobnamen.rst files into a single Jobname.rst file located on the head compute node. Other files, such as .mode, .esav, and .emat, may also be combined upon finishing a distributed solution. (See Differences in Postprocessing for more information.)
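For example, in a hypothetical four-process DMP run with a jobname of beam, the per-process results files would be beam0.rst (master process), beam1.rst, beam2.rst, and beam3.rst; issuing FINISH in the SOLUTION processor combines them into a single beam.rst on the head compute node.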
The remaining sections explain how to configure your environment for distributed-memory parallel processing, how to run a DMP analysis, and what features and analysis types support distributed-memory parallelism. You should read these sections carefully and fully understand the process before attempting to run a distributed analysis. The proper configuration of your environment and the installation and configuration of the appropriate MPI software are critical to successfully running a distributed analysis.
Caution: In an elastic licensing environment using DMP mode, terminating a worker process from the operating system (Linux: kill -9 command; Windows: Task Manager) could result in a license charge for up to an hour of usage for any fraction of an hour of actual time used. To avoid the inexact licensing charge, terminate the process from within Mechanical APDL (/EXIT) or terminate the master process from the operating system.