Mechanical APDL uses the latest compilers and math libraries to achieve maximum per-core performance on virtually all processors that it supports. Even so, many simulations remain compute-bound, so the best way to speed up a simulation is often to use more processor cores through parallel processing. Another way to reduce simulation time is to use a GPU to accelerate some of the computations performed during the simulation.
Detailed information on parallel processing can be found in the Parallel Processing Guide. For the purpose of this discussion, some basic details are provided in the following sections.
Shared-memory parallel (SMP) and distributed-memory parallel (DMP) processing differ in their underlying memory model, and both terms can refer to hardware as well as software offerings. In terms of hardware, SMP systems share a single global memory image that is addressable by multiple processors. DMP systems, often referred to as clusters, involve multiple machines (that is, compute nodes) connected on a network, with each machine having its own memory address space. Communication between machines is handled by interconnects (for example, Gigabit Ethernet, InfiniBand).
In terms of software, the shared-memory parallel version of Mechanical APDL refers to running the program across multiple cores on an SMP system. The distributed-memory parallel version refers to running the program across multiple cores on either SMP or DMP systems.
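For example, the parallel mode and core count are typically chosen when the program is launched. The sketch below assumes a Linux-style batch launch; the executable name (ansys252) and file names are placeholders, and the exact options supported by your release are documented in the Parallel Processing Guide:

   # Shared-memory parallel (SMP) run on 4 cores of one machine
   ansys252 -b -smp -np 4 -i model.dat -o model.out

   # Distributed-memory parallel (DMP) run on 4 cores
   ansys252 -b -dis -np 4 -i model.dat -o model.out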
Distributed-memory parallel processing assumes that the physical memory for each process is separate from all other processes. This type of parallel processing requires some form of message passing software to exchange data between the processes. The prevalent software used for this communication is called MPI (Message Passing Interface). MPI software uses a standard set of routines to send and receive messages and synchronize processes. A major attraction of the DMP model is that very large parallel systems can be built using commodity-priced components. In addition, the DMP model often obtains better parallel efficiency than the SMP model.
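As a hedged sketch, the MPI library used for this message passing can usually be selected at launch time; the option value shown below (intelmpi) is illustrative, and the MPI libraries supported differ by platform and release:

   # DMP run on 8 cores, explicitly requesting the Intel MPI library
   ansys252 -b -dis -np 8 -mpi intelmpi -i model.dat -o model.out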
Mechanical APDL simulations are very computationally intensive. Most of the computations are performed within the solution phase of the analysis. During the solution, three major steps are performed:
1. Forming the element matrices and assembling them into a global system of equations
2. Solving the global system of equations
3. Using the global solution to derive the requested set of element and nodal results
Each of these three major steps involves many computations and, therefore, has many opportunities for exploiting multiple cores through the use of parallel processing.
All three steps of the solution phase can take advantage of SMP processing, including most of the equation solvers. However, the speedups obtained are limited by requirements for accessing globally shared data in memory, I/O operations, and memory bandwidth demands in computationally intensive solver operations.
The DMP version of Mechanical APDL parallelizes the entire solution phase, including the three steps listed above. However, the maximum speedup obtained is limited by issues similar to those affecting the SMP version (I/O, memory bandwidth), as well as by how well the computations are balanced among the processes, the speed at which messages are passed, and the amount of work that cannot be performed in parallel.
It is important to note that the SMP version of the program can only run on configurations that share a common address space; it cannot run across separate machines or even across nodes within a cluster. However, the DMP version of Mechanical APDL can run using multiple cores on a single machine (SMP hardware), and it can be run across multiple machines (that is, a cluster) using one or more cores on each of those machines (DMP hardware).
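To illustrate the distinction, the two launch lines below sketch a DMP run that stays on one machine and a DMP run spread across two cluster nodes. The host names, core counts, and -machines syntax are placeholders to be checked against your version's Parallel Processing Guide:

   # DMP run using 8 cores on a single machine
   ansys252 -b -dis -np 8 -i model.dat -o model.out

   # DMP run across two cluster nodes, using 4 cores on each
   ansys252 -b -dis -machines node1:4:node2:4 -i model.dat -o model.out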
The standard license allows SMP or DMP processing on up to 4 cores. To benefit from additional cores, you must acquire Ansys HPC licenses.
GPU hardware can be used to help reduce the overall time to solution by off-loading some of the major computations (required by certain equation solvers) from the CPU(s) to the GPU. These computations are often executed much faster on the highly parallel architecture of a GPU.
The use of GPU hardware is meant to be in addition to the existing CPU core(s), not a replacement for them. The CPU core(s) continue to perform all other computations in and around the equation solvers, including any shared-memory or distributed-memory parallel processing across multiple CPU cores. The main goal of the GPU accelerator capability is to use the GPU hardware to accelerate the solver computations and, therefore, reduce the time required to complete a simulation.
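For example, the GPU accelerator capability is requested at launch in addition to the CPU cores. The sketch below assumes an NVIDIA device; the option values are illustrative, and the devices and options supported by your release are listed in the GPU accelerator documentation:

   # DMP run on 4 CPU cores, accelerated by 1 GPU device
   ansys252 -b -dis -np 4 -acc nvidia -na 1 -i model.dat -o model.out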
GPUs have varying amounts of physical memory available for the simulation to use. The amount of available memory can limit the speedups achieved. When the memory required to perform the solver computations exceeds the available memory on the GPU, the use of the GPU is temporarily deactivated for those computations and the CPU core(s) are used instead.
The speedups achieved when using GPUs vary widely depending on the specific CPU and GPU hardware being used, as well as on the simulation characteristics. When older (and therefore typically slower) CPU cores are used, the GPU speedups are greater; conversely, when newer (and typically faster) CPUs are used, the GPU speedups are smaller. The speedups also depend on the analysis type, element types, equation solver, and model size (number of DOFs). Ultimately, it all comes down to how much time is spent performing computations on the GPU versus on the CPU: the more computations performed on the GPU, the greater the opportunity for speedup. With the sparse direct solver, bulkier 3D models and/or higher-order elements generally result in more solver computations being off-loaded to the GPU. With the PCG iterative solver, lower Lev_Diff values (see the PCGOPT command) result in more solver computations being off-loaded to the GPU.
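As a minimal input-file sketch (assuming the PCG solver is appropriate for the model), the difficulty level can be set explicitly with PCGOPT; a low Lev_Diff value such as 1 keeps more of the solver computations in a form that can be off-loaded to the GPU, though the best setting is model-dependent:

   /SOLU
   EQSLV,PCG      ! select the PCG iterative solver
   PCGOPT,1       ! Lev_Diff = 1: more solver work can be off-loaded to the GPU
   SOLVE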