5.4. Software Issues for Scalability

This section addresses key software aspects that affect the parallel performance of the program.

5.4.1. Program Architecture

It should be expected that only the computations performed in parallel will speed up when more processing cores are used. In the Mechanical APDL program, some computations before and after solution (for example, /PREP7 or /POST1) are set up to use some shared-memory parallelism; however, the bulk of the parallel computations are performed during solution (specifically within the SOLVE command). Therefore, only the solution time would be expected to decrease significantly as more processing cores are used. Moreover, if a significant portion of the analysis time is spent anywhere outside solution, then adding more cores would not be expected to significantly decrease the overall analysis time (that is, the parallel efficiency of the program would be greatly diminished in this case).

As described in Measuring Scalability, the "Elapsed time spent computing solution" shown in the output file gives an indication of the amount of wall clock time spent actually computing the solution. If this time dominates the overall runtime, then it should be expected that parallel processing will help this model run faster as more cores are used. However, if this time is only a fraction of the overall runtime, then parallel processing should not be expected to help this model run significantly faster.

5.4.2. Distributed-Memory Parallel Processing

5.4.2.1. Contact Elements

Distributed-memory parallel (DMP) solutions are designed to balance the number of elements, nodes, and degrees of freedom across the processes so that each process has roughly the same amount of work. However, this becomes a challenge when contact elements are present in the model. Contact elements often need to perform more computations at the element level (for example, contact searching and penetration detection) than other types of elements. This can hurt DMP scalability by giving one process much more work (that is, more computations to perform) than the other processes, ultimately impeding the load balancing. This is especially true for large models in which contact/target elements make up a relatively high percentage of the total number of elements. When this is the case, the best approach is to try to limit the scope of the contact to only what is necessary.

In some cases, the CNCHECK,TRIM command can trim unnecessary contact/target elements from the larger contact pairs and thereby improve performance in a DMP run.
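A minimal sketch follows, assuming the contact pairs have already been defined; the processor in which CNCHECK,TRIM is issued here is an assumption, so consult the CNCHECK command documentation for your release.

   /PREP7
   ! ...contact and target elements are assumed to be defined already...
   CNCHECK,TRIM        ! remove unneeded contact/target elements from the larger pairs
   FINISH
   /SOLU
   SOLVE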

In most other cases, you can use the command CNCHECK,SPLIT or CNCHECK,DMP to split larger contact pairs into several sub-pairs so that the smaller contact sub-pairs can be distributed to different processes. The contact pair splitting logic greatly improves load balance between processes, and the time spent on stiffness matrix generation becomes much more scalable, particularly at higher core counts. Large bonded and small-sliding contact pairs realize the most benefit from contact pair splitting. For more information, see Solving Large Contact Models in a Distributed-Memory Parallel Environment in the Contact Technology Guide.
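As an illustrative sketch (the contact pair definitions and all other solution settings are assumed to exist already), the splitting can be requested before solving. Broadly, CNCHECK,SPLIT splits the pairs in the database, while CNCHECK,DMP is intended to perform the splitting only for the DMP solution; see the Contact Technology Guide reference above for the exact behavior of each option and the processor in which it must be issued.

   /PREP7
   CNCHECK,DMP         ! split large contact pairs into sub-pairs for the DMP solution
   FINISH
   /SOLU
   SOLVE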

Alternatively, hybrid parallel processing may improve scalability for large models with high element load balance ratios.
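For reference, a hybrid run combines MPI processes with shared-memory threads within each process. The launch line below is only a sketch: the executable name (ansys242) is hypothetical, and the -nt option for threads per process is an assumption that should be verified against the Parallel Processing Guide for your release.

   ansys242 -b -dis -np 4 -nt 2 -i input.dat -o output.out

Here, 4 MPI processes would each run 2 threads, using 8 cores in total.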

5.4.2.2. Using the Distributed PCG Solver

One issue to consider when using the PCG solver is that higher level of difficulty values (Lev_Diff on the PCGOPT command) can hurt DMP scalability. Higher values are often used for models that have difficulty converging within the PCG solver, and they are typically necessary to obtain optimal performance when using a limited number of cores. However, when using a higher number of cores (for example, more than 16), it may be wise to lower the level of difficulty value by 1 (if possible) in order to improve the overall solver performance. Lower level of difficulty values scale better than higher ones; thus, the optimal Lev_Diff value at a few cores will not necessarily be the optimal Lev_Diff value at a high number of cores.
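For example, if a model converges well with a level of difficulty of 3 on a small number of cores, a sketch like the following (illustrative values only) could be tried when running on a large number of cores:

   /SOLU
   EQSLV,PCG           ! select the PCG iterative solver
   PCGOPT,2            ! lower Lev_Diff (for example, from 3 to 2) for the high core count run
   SOLVE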

5.4.2.3. Using the Distributed Sparse Solver

When using the distributed sparse solver, you should always consider which memory mode is being used. For optimal scalability, the in-core memory mode should be used whenever possible because it avoids writing the large matrix factor file. When running in the out-of-core memory mode, each distributed process must create and access its own set of solver files, which can cause a performance bottleneck as the processes compete for access to the hard drive(s). Because a hard drive can only service one I/O request at a time, this file access within the solver becomes a large sequential block in an otherwise parallel code.

Fortunately, the memory required to run in-core is divided among the compute nodes used for a DMP simulation. A model that is too large to run in-core on a single compute node may easily run in-core with the distributed sparse solver when 4 or 8 compute nodes are used. In a DMP simulation, the in-core mode is selected automatically in most cases whenever the available physical memory on each node is sufficient. If a very large model requires out-of-core factorization even when several compute nodes are used, local I/O on each node helps the I/O time scale as more compute nodes are added.
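The memory mode is normally selected automatically; however, when you know that each compute node has enough physical memory, the in-core mode can be requested explicitly. A minimal sketch:

   /SOLU
   EQSLV,SPARSE        ! select the sparse direct equation solver
   DSPOPTION,,INCORE   ! request the in-core memory mode for the distributed sparse solver
   SOLVE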

5.4.2.4. Combining Files

After a parallel solution successfully completes, certain local files written by each process are automatically combined into single, global files. These include the .rst (or .rth), .esav, .emat, .mode, .ist, .mlv, and .seld files. This step can be costly due to the large amount of I/O and MPI communication involved, and in some cases it can become a performance bottleneck because it involves serial operations.

Automatic file combination is performed when the FINISH command is executed upon leaving the solution processor. If any of these global files are not needed to perform downstream operations, you can often reduce the overall solution time by suppressing the file combination for each individual file type that is not needed (see the DMPOPTION command for more details). In addition, reducing the amount of data written to the results file (see OUTRES command) can also help improve the performance of this step by reducing the amount of I/O and MPI communication required to combine the local results files into a single, global results file.
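For example, if the combined .emat and .esav files are not needed for any downstream operations, a sketch like the following (issued before solving; the OUTRES settings are illustrative and assume only the final nodal solution is of interest) suppresses their combination and reduces the results file output:

   /SOLU
   DMPOPTION,EMAT,NO   ! do not combine the local .emat files into a global file
   DMPOPTION,ESAV,NO   ! do not combine the local .esav files into a global file
   OUTRES,ALL,NONE     ! suppress all results file output...
   OUTRES,NSOL,LAST    ! ...except the nodal DOF solution at the last substep
   SOLVE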

5.4.3. GPU Accelerator Capability

Similar to the expectations described in Program Architecture, the GPU accelerator capability will typically accelerate the computations only during solution. Thus, if the solution time is only a fraction of the overall runtime, then the GPU accelerator capability is not expected to help the model run significantly faster.

Also, different amounts of speedup are expected depending on the equation solver used as well as on various model features (for example, geometry, element types, and analysis options). All of these factors affect how many computations are off-loaded onto the GPU for acceleration; the more solver computations the GPU can accelerate, the greater the potential speedup.
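For reference, the GPU accelerator capability is typically enabled at launch time. The line below is only a sketch: the executable name is hypothetical, and the exact form of the -acc option (and any option controlling the number of GPU devices) should be verified against the documentation for your release and GPU vendor.

   ansys242 -b -dis -np 8 -acc nvidia -i input.dat -o output.out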