Mechanical APDL offers two types of linear equation solvers: direct and iterative. There are SMP and DMP differences for each of these solver types. This section describes the important details for each solver type and presents, in tabular form, a summary of solver memory requirements. Recommendations are given for managing memory use to maximize performance.
All of the solvers covered in this chapter have heuristics which automatically select certain defaults in an attempt to optimize performance for a given set of hardware and model conditions. For the majority of analyses, the best options are chosen. However, in some cases performance can be improved by understanding how the solvers work, the resource requirements for your particular analysis, and the hardware resources that are available to the program. Each of the equation solvers discussed has one or more option commands that can be used to control the behavior, and ultimately the performance, of the solver.
The sparse solver is the default solver for virtually all analyses. It is the most robust solver available, but it is also compute- and I/O-intensive. The sparse solver is designed to run in different modes of operation, depending on the amount of memory available. Slight variations in memory requirements can occur when running different versions of the software, even for the same model run on the same hardware, because of changes in heuristics. It is important to understand that the solver's mode of operation can have a significant effect on runtime as summarized here.
Memory usage for a direct sparse solver is determined by several steps. The matrix that is input to the sparse solver is assembled entirely in memory before being written to the .full file. The sparse solver then reads the .full file, processes the matrix, factors the matrix, and computes the solution. Direct method solvers factor the input matrix into the product of a lower and upper triangular matrix in order to solve the system of equations. For symmetric input matrices (most matrices created in Mechanical APDL are symmetric), only the lower triangular factor is required since it is equivalent to the transpose of the upper triangular factor. Still, the process of factorization produces matrix factors which are 10 to 20 times larger than the input matrix. The calculation of this factor is computationally intensive. In contrast, the solution of the triangular systems is I/O or memory access-dominated with few computations required.
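As a conceptual sketch of the algebra behind these steps (standard for any symmetric direct solver and ignoring pivoting and diagonal scaling), the solution of K u = f proceeds as:

    K = L L^T        (factor the assembled matrix; compute-intensive)
    L y = f          (forward substitution)
    L^T u = y        (back substitution; the two triangular solves are I/O- or memory-access-bound)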
The following are rough estimates for the amount of memory needed for each step when using the sparse solver for most 3D analyses. For non-symmetric matrices or for complex-valued matrices (as found in harmonic analyses), these estimates approximately double.
The amount of memory needed to assemble the matrix in memory is approximately 1 GB per million DOFs.
The amount of memory needed to hold the factored matrix in memory is approximately 10 to 20 GB per million DOFs.
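For example, applying these rough estimates to a hypothetical 5 million DOF symmetric model gives approximately:

    assembly:       5 MDOFs x 1 GB/MDOFs        ≈ 5 GB
    matrix factor:  5 MDOFs x (10-20) GB/MDOFs  ≈ 50-100 GB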
It is important to note that the shared-memory version of the sparse solver is not the same as the distributed-memory version of the sparse solver. While the fundamental steps of these solvers are the same, they are actually two independent solvers, and there are subtle differences in their modes of operation. These differences will be explained in the following sections. Table 4.1: Direct Sparse Solver Memory and Disk Estimates summarizes the direct solver memory requirements.
Table 4.1: Direct Sparse Solver Memory and Disk Estimates
Memory Mode | Memory Usage Estimate | I/O File Size Estimate |
---|---|---|
Sparse Direct Solver (Shared Memory) | | |
Out-of-core | 1 GB/MDOFs | 10 GB/MDOFs |
In-core | 10 GB/MDOFs | 1 GB/MDOFs; 10 GB/MDOFs if workspace is saved to Jobname.DSPsymb |
Sparse Direct Solver (Distributed Memory, Using p Cores) | | |
Out-of-core | 1 GB/MDOFs on head compute node; 0.7 GB/MDOFs on all other compute nodes | 10 GB/MDOFs * 1/p (matrix factor is stored on disk, spread evenly over the p cores) |
In-core | 10 GB/MDOFs * 1/p (matrix factor is stored in memory, spread evenly in-core over the p cores); an additional 1.5 GB/MDOFs is required on the head compute node to store the input matrix | 1 GB/MDOFs * 1/p; 10 GB/MDOFs * 1/p if workspace is saved to Jobname.DSPsymb |
For out-of-core factorization, the factored matrix is held on disk.
If sufficient memory is available for the assembly process, it is almost always more than enough to run the sparse solver factorization in out-of-core mode. This mode uses some additional memory to make sure that the largest of all frontal matrices (dense matrix structures within the large factored matrix) can be held completely in memory. This approach attempts to achieve an optimal balance between memory usage and I/O. For larger jobs, the program will typically run the sparse solver using out-of-core memory mode (by default) unless a specific memory mode is defined.
The distributed memory sparse solver can also be run in the out-of-core mode. It is important to note that when running this solver in out-of-core mode, the additional memory allocated to make sure each individual frontal matrix is computed in memory is allocated on every process. Therefore, as more distributed processes are used (that is, as the solver runs on more cores), the memory usage for each process does not decrease; it stays roughly constant, while the total memory summed over all processes actually increases (see Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). Keep in mind, however, that the computations do scale in this memory mode as more cores are used.
In-core factorization requires that the factored matrix be held in memory and, thus, often requires 10 to 20 times more memory than out-of-core factorization. However, larger memory systems are commonplace today, and users of these systems will benefit from in-core factorization. A model with 1 million DOFs can, in many cases, be factored using 10 GB of memory—easily achieved on desktop systems with 16 GB of memory. Users can run in-core using several different methods. The simplest way to set up an in-core run is to use the BCSOPTION,,INCORE command (or DSPOPTION,,INCORE for the distributed memory sparse solver). This option tells the sparse solver to try allocating a block of memory sufficient to run using the in-core memory mode after solver preprocessing of the input matrix has determined this value. However, this method requires preprocessing of the input matrix using an initial allocation of memory to the sparse solver.
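A minimal input fragment illustrating this method is shown below (only the INCORE keyword cited above is specified; all other fields are left at their defaults):

! Shared-memory sparse solver: request the in-core memory mode
BCSOPTION,,INCORE
! Distributed-memory sparse solver: the equivalent request
DSPOPTION,,INCORE
SOLVE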
Another way to get in-core performance with the sparse solver is to start the sparse solver with enough memory to run in-core. Users can start Mechanical APDL with an initial large -m allocation (see Specifying Memory Allocation) such that the largest block available when the sparse solver begins is large enough to run the solver using the in-core memory mode. This method will typically obtain enough memory to run the solver factorization step with a lower peak memory usage than the simpler method described above, but it requires prior knowledge of how much memory to allocate in order to run the sparse solver using the in-core memory mode.
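For example, a batch launch on Linux might look like the line below; the executable name is release-dependent (shown here as a placeholder), and the -m value (initial workspace in MB) is purely illustrative and must be sized for the model at hand:

ansysXXX -b -np 4 -m 12000 -i model.dat -o model.out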
The in-core factorization should be used only when the computer system has enough memory to easily factor the matrix in-core. Users should avoid using all of the available system memory or extending into virtual memory to obtain an in-core factorization. However, users who have long-running simulations should understand how to use the in-core factorization to improve elapsed time performance.
The BCSOPTION command controls the shared-memory sparse solver memory modes and also enables performance debug summaries. See the documentation on this command for usage details. Sparse solver memory usage statistics are usually printed in the output file and can be used to determine the memory requirements for a given model, as well as the memory obtained from a given run. Following is an example output.
Memory allocated for solver      =   1536.42 MB
Memory required for in-core      =  10391.28 MB
Memory required for out-of-core  =   1191.67 MB
This sparse solver run required 10391 MB to run in-core and 1192 MB to run in out-of-core mode. "Memory allocated for solver" indicates that the amount of memory used for this run was just above the out-of-core memory requirement, so this job will use the out-of-core memory mode.
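These statistics are printed by default; a more detailed performance summary can be requested through the Solve_Info field of the same command. A sketch is shown below (the PERFORMANCE keyword and its field position are assumptions based on the BCSOPTION documentation and should be verified for your release):

! Request a detailed sparse solver performance summary in the output file
BCSOPTION,,,,,,PERFORMANCE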
The DSPOPTION command controls the distributed-memory sparse solver memory modes and also enables performance debug summaries. Similar memory usage statistics for this solver are printed in the output file for each distributed process. When running the distributed sparse solver using multiple cores on a single node, or when running on a cluster with a slow I/O configuration, using the in-core mode can significantly improve overall solver performance because the costly I/O time is avoided.
The memory required per core to run in out-of-core mode approaches a constant value as the number of cores increases because each core in the distributed sparse solver has to store a minimum amount of information to carry out factorization in the optimal manner. The more cores that are used, the more total memory is needed for out-of-core performance (it increases slightly at 32 or more cores).
In contrast to the out-of-core mode, the memory required per core to run in the in-core mode decreases as more processes are used with the distributed memory sparse solver (see the left-hand side of Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). This is because the portion of the total matrix stored and factored on each core becomes smaller and smaller. The total memory needed will increase slightly as the number of cores increases.
At some point, as the number of processes increases (usually between 8 and 32), the total memory usage for these two modes approaches the same value (see the right-hand side of Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). When the out-of-core mode memory requirement matches the in-core requirement, the solver automatically runs in-core. This is an important effect; it shows that when a job is spread out across enough machines, the distributed memory sparse solver can effectively use the memory of the cluster to automatically run a very large job in-core.
Sparse solver partial pivoting is an important detail that may inhibit in-core factorization. Pivoting in direct solvers refers to a dynamic reordering of rows and columns to maintain numerical stability. This reordering is based on a test of the size of the diagonal (called the pivot) in the current matrix factor column during factorization.
Pivoting is not required for most analysis types, but it is enabled when certain element types and options are used (for example, pure Lagrange contact and mixed u-P formulation). When pivoting is enabled, the size of the matrix factor cannot be known before the factorization; thus, the in-core memory requirement cannot be accurately computed. As a result, it is generally recommended that pivoting-enabled factorizations with the sparse solver use the out-of-core memory mode.
The iterative solvers offer a powerful alternative to more expensive sparse direct methods. They do not require a costly matrix factorization of the assembled matrix, and they always run in memory and do only minimal I/O. However, iterative solvers proceed from an initial random guess to the solution by an iterative process and are dependent on matrix properties that can cause the iterative solver to fail to converge in some cases. Hence, the iterative solvers are not the default solvers in Mechanical APDL.
The most important factor determining the effectiveness of the iterative solvers for a simulation is the preconditioning step. The preconditioned conjugate gradient (PCG) iterative solver uses two different proprietary preconditioners which have been specifically developed for a wide range of element types. The newer node-based preconditioner (added at Release 10.0) requires more memory and is invoked at higher levels of difficulty, but it is especially effective for problems with poor element aspect ratios.
The specific preconditioner option can be specified using the Lev_Diff argument on the PCGOPT command. Lev_Diff = 1 selects the original element-based preconditioner for the PCG solver, and Lev_Diff values of 2, 3, and 4 select the newer node-based preconditioner with differing levels of difficulty. Finally, Lev_Diff = 5 uses a preconditioner that requires a complete factorization of the assembled global matrix. This last option (which is discussed in PCG Lanczos Solver) is mainly used for the PCG Lanczos solver (LANPCG) and is only recommended for smaller problems where there is sufficient memory to use this option. The program uses heuristics to choose the default preconditioner option and, in most cases, makes the best choice. However, in cases where the program automatically selects a high level of difficulty and the user is running on a system with limited memory, it may be necessary to reduce memory requirements by manually specifying a lower level of difficulty (via the PCGOPT command). This is because peak memory usage for the PCG solvers often occurs during preconditioner construction.
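For example, to override the automatic choice (the value shown is arbitrary; all other PCGOPT fields remain at their defaults):

! Select the node-based preconditioner at level of difficulty 2
PCGOPT,2
! Return control of the preconditioner choice to the program
PCGOPT,AUTO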
The basic memory formula for iterative solvers is 1 GB per million DOFs. Using a higher level of difficulty preconditioner raises this amount, and higher-order elements also increase the basic memory requirement. An important memory saving feature for the PCG solvers is implemented for several key element types. This option, invoked via the MSAVE command, avoids the need to assemble the global matrix by computing the matrix/vector multiplications required for each PCG iteration at the element level. The MSAVE option can save up to 70 percent of the memory requirement for the PCG solver if the majority of the elements in a model are elements that support this feature. MSAVE is automatically turned on for some linear static analyses when SOLID186 and/or SOLID187 elements that meet the MSAVE criteria are present. It is turned on because it often reduces the overall solution time in addition to reducing the memory usage. It is most effective for these analyses when dominated by SOLID186 elements using reduced integration, or by SOLID187 elements. For large deflection nonlinear analyses, the MSAVE option is not on by default since it increases solution time substantially compared to using the assembled matrix for this analysis type; however, it can still be turned on manually to achieve considerable memory savings.
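The MSAVE feature is switched explicitly as sketched below, for example to turn it on manually in a large-deflection nonlinear run where it is off by default (accepting the longer solution time in exchange for the memory savings):

! Compute element-level matrix/vector products instead of assembling the global matrix
MSAVE,ON
! Explicitly disable the memory-saving option
MSAVE,OFF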
The total memory usage of the DMP version of the iterative solver is higher than that of the corresponding SMP version due to some duplicated data structures required on each process in the DMP version. However, the total memory requirement scales across the processes, so memory use per process decreases as the number of processes increases. The preconditioner requires an additional data structure that is stored and used only by the master process, so the memory required for the master process is larger than that of all other processes. Table 4.2: Iterative PCG Solver Memory and Disk Estimates summarizes the memory requirements for iterative solvers.
The table shows that for very large models running in DMP mode, the most significant term becomes the 300 MB/MDOFs requirement for the master process. This term does not scale (reduce) as more cores are used. A 10 MDOFs model using the iterative solver would require 3 GB of memory for this part of PCG solver memory, in addition to 12 GB distributed evenly across the nodes in the cluster. A 100 MDOFs model would require 30 GB of memory in addition to 120 GB of memory divided evenly among the nodes of the cluster.
Table 4.2: Iterative PCG Solver Memory and Disk Estimates
PCG Solver Memory and Disk Estimates (Shared Memory)
PCG Solver Memory and Disk Estimates (Distributed Memory, Using p Cores)
Finding the natural frequencies and mode shapes of a structure is one of the most computationally demanding tasks. Specific equation solvers, called eigensolvers, are used to solve for the natural frequencies and mode shapes. Mechanical APDL offers three eigensolvers for modal analyses of undamped systems: the sparse solver-based Block Lanczos solver, the PCG Lanczos solver, and the Supernode solver.
The memory requirements for the two Lanczos-based eigensolvers are related to the memory requirements for the sparse and PCG solvers used in each method, as described above. However, there is additional memory required to store the mass matrices as well as blocks of vectors used in the Lanczos iterations. For the Block Lanczos solver, I/O is a critical factor in determining performance. For the PCG Lanczos solver, the choice of the preconditioner is an important factor.
The Block Lanczos solver (MODOPT,LANB) uses the sparse direct solver. However, in addition to requiring a minimum of one matrix factorization, the Block Lanczos algorithm also computes blocks of vectors that are stored in files during the Block Lanczos iterations. The size of these files grows as more modes are computed. Each Block Lanczos iteration requires multiple solves using the large matrix factor file (or in-memory factor if the in-core memory mode is used) and one in-memory block of vectors. The larger the BlockSize (input on the MODOPT command), the fewer block solves are required, reducing the I/O cost of the solves.
If the amount of memory allocated for the solver is less than the recommended amount for a Block Lanczos run, the block size used internally for the Lanczos iterations is automatically reduced. Smaller block sizes require more block solves, the most expensive part of the Lanczos algorithm in terms of I/O performance. Typically, the default block size of 8 is optimal. On machines with limited physical memory where the I/O cost in Block Lanczos is very high (for example, machines without enough physical memory to run the Block Lanczos eigensolver using the in-core memory mode), forcing a larger BlockSize (such as 12 or 16) on the MODOPT command can reduce the amount of I/O and, thus, improve overall performance.
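For example, a Block Lanczos modal solution that forces a larger block size might be set up as follows; the mode count and frequency range are illustrative, and the argument position used for BlockSize follows the MODOPT documentation and should be verified for your release:

ANTYPE,MODAL
! Block Lanczos, 40 modes in 0-1000 Hz, BlockSize forced to 16 to reduce I/O
MODOPT,LANB,40,0,1000,,,,16
SOLVE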
Finally, multiple matrix factorizations may be required for a Block Lanczos run. (See the following table for Block Lanczos memory requirements.) The algorithm decides dynamically whether to refactor using a new shift point or to continue Lanczos iterations using the current shift point. This decision is influenced by the measured speed of matrix factorization versus the rate of convergence for the requested modes and the cost of each Lanczos iteration. This means that performance characteristics can change when hardware is changed, when the memory mode is changed from out-of-core to in-core (or vice versa), or when shared memory parallelism is used.
Table 4.3: Block Lanczos Eigensolver Memory and Disk Estimates
Memory Mode | Memory Usage Estimate | I/O File Size Estimate |
---|---|---|
Out-of-core | 1.5 GB/MDOFs | 15-20 GB/MDOFs |
In-core | 15-20 GB/MDOFs | ~1.5 GB/MDOFs |
The PCG Lanczos solver (MODOPT,LANPCG) represents a breakthrough in modal analysis capability because it allows users to extend the maximum size of models used in modal analyses well beyond the capacity of direct solver-based eigensolvers. The PCG Lanczos eigensolver works with the PCG options command (PCGOPT) as well as with the memory saving feature (MSAVE). Both shared-memory parallel performance and distributed-memory parallel performance can be obtained by using this eigensolver.
Controlling PCG Lanczos Parameters
The PCG Lanczos eigensolver can be controlled using several options on the PCGOPT command. The first of these options is the Level of Difficulty value (Lev_Diff). In most cases, choosing a value of AUTO (which is the default) for Lev_Diff is sufficient to obtain an efficient solution time. However, in some cases you may find that manually adjusting the Lev_Diff value further reduces the total solution time. Setting the Lev_Diff value equal to 1 uses less memory compared to other Lev_Diff values; however, the solution time is longer in most cases. Setting higher Lev_Diff values (for example, 3 or 4) can help for problems that cause the PCG solver to have some difficulty in converging. This typically occurs when elements are poorly shaped or are very elongated (that is, have high aspect ratios).
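A short sketch combining the two commands (the mode count and the manually chosen level of difficulty are illustrative values only):

ANTYPE,MODAL
! PCG Lanczos eigensolver, 50 modes requested
MODOPT,LANPCG,50
! Raise the level of difficulty for meshes with poorly shaped or elongated elements
PCGOPT,3
SOLVE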
A Lev_Diff value of 5 causes a fundamental change to the equation solver used by the PCG Lanczos eigensolver. This Lev_Diff value makes the PCG Lanczos eigensolver behave more like the Block Lanczos eigensolver by replacing the PCG iterative solver with a direct solver similar to the sparse direct solver. As with the Block Lanczos eigensolver, the numeric factorization step can be done either in an in-core memory mode or in an out-of-core memory mode. The Memory field on the PCGOPT command allows the user to force one of these two modes or to let the program decide which mode to use. By default, only a single matrix factorization is done by this solver unless the Sturm check option on the PCGOPT command is enabled, which results in one additional matrix factorization.
Due to the amount of computer resources needed by the direct solver, choosing a Lev_Diff value of 5 essentially eliminates the reduction in computer resources obtained by using the PCG Lanczos eigensolver instead of the Block Lanczos eigensolver. Thus, this option is generally only recommended over Lev_Diff values 1 through 4 for problems that have fewer than one million degrees of freedom, though its efficiency is highly dependent on several factors such as the number of modes requested and I/O performance. Lev_Diff = 5 is more efficient than other Lev_Diff values when more modes are requested, so larger numbers of requested modes may increase the problem size for which a value of 5 should be used. The Lev_Diff value of 5 requires a costly factorization step which can be computed using an in-core memory mode or an out-of-core memory mode; when this option runs in the out-of-core memory mode on a machine with slow I/O performance, the problem size for which a value of 5 should be used decreases.
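A sketch of a Lev_Diff = 5 setup is shown below; the field positions used for the Sturm check (StrmCk) and memory mode (Memory) arguments, and the INCORE keyword, are assumptions based on the PCGOPT documentation and should be verified for your release:

ANTYPE,MODAL
MODOPT,LANPCG,100
! Lev_Diff = 5: factorization-based preconditioner; Sturm check on; in-core factorization requested
PCGOPT,5,,,ON,,INCORE
SOLVE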
Using Lev_Diff = 5 with PCG Lanczos in DMP Analyses
The Lev_Diff value of 5 is supported in DMP analyses. When used with the PCG Lanczos eigensolver, Lev_Diff = 5 causes this eigensolver to run in a completely distributed fashion, similar to Block Lanczos in DMP mode for a modal analysis.
The Lev_Diff = 5 setting can require a large amount of memory or disk I/O compared to Lev_Diff values of 1 through 4 because this setting uses a direct solver approach (that is, a matrix factorization) within the Lanczos algorithm. However, by running in a distributed fashion it can spread these resource requirements over multiple machines, thereby helping to achieve significant speedup and extending the class of problems for which the PCG Lanczos eigensolver is a good candidate. If Lev_Diff = 5 is specified, choosing the option to perform a Sturm check (via the PCGOPT command) does not require additional resources (for example, additional memory usage or disk space). A Sturm check does require one additional factorization to guarantee that no modes were skipped in the specified frequency range, and so it does require more computations to perform this extra factorization. However, since the Lev_Diff = 5 setting already performs a matrix factorization for the Lanczos procedure, no extra memory or disk space is required.
Table 4.4: PCG Lanczos Memory and Disk Estimates
PCG Lanczos Solver (Shared Memory)
PCG Lanczos Solver (Distributed Memory, Using p Cores)
The Supernode eigensolver (MODOPT,SNODE) is designed to efficiently solve modal analyses in which a high number of modes is requested. For this class of problems, this solver often does less computation and uses considerably less computer resources than the Block Lanczos eigensolver. By utilizing fewer resources than Block Lanczos, the Supernode eigensolver becomes an ideal choice when solving this sort of analysis on the typical desktop machine, which can often have limited memory and slow I/O performance.
The MODOPT command allows you to specify how many frequencies are desired and the range within which those frequencies lie. With other eigensolvers, the number of modes requested affects the performance of the solver, and the frequency range is essentially optional; asking for more modes increases the solution time, while the frequency range generally decides which computed frequencies are output. The Supernode eigensolver behaves in exactly the opposite way with regard to the MODOPT command input: it computes all of the frequencies within the requested range regardless of the number of modes the user requests. For maximum efficiency, it is highly recommended that you input a range that only covers the spectrum of frequencies between the first and last mode of interest. The number of modes requested on the MODOPT command then determines how many of the computed modes are output.
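For instance, to compute every mode between 0 and 5000 Hz and write out the first 200 of them (both numbers purely illustrative):

ANTYPE,MODAL
! Supernode eigensolver: the frequency range drives the computation;
! the requested mode count only limits how many computed modes are output
MODOPT,SNODE,200,0,5000
SOLVE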
The Supernode eigensolver benefits from shared-memory parallelism. Also, for users who want full control of this modal solver, the SNOPTION command gives you control over several important parameters that affect the accuracy and efficiency of the Supernode eigensolver.
Controlling Supernode Parameters
The Supernode eigensolver computes approximate eigenvalues. Typically, this should not be an issue as the lowest modes in the system (which are often used to compute the resonant frequencies) are computed very accurately (<< 1% difference compared to the same analysis performed with the Block Lanczos eigensolver). However, the accuracy drifts somewhat with the higher modes. For the highest requested modes in the system, the difference (compared to Block Lanczos) is often a few percent, and so it may be desirable in certain cases to tighten the accuracy of the solver. This can be done using the range factor (RangeFact) field on the SNOPTION command. Higher values of RangeFact lead to more accurate solutions at the cost of extra memory and computations.
When computing the final mode shapes, the Supernode eigensolver often does the bulk of its I/O transfer to and from disk. While the amount of I/O transfer is often significantly less than that done in a similar run using Block Lanczos, it can be desirable to further minimize this I/O, thereby maximizing the Supernode solver efficiency. You can do this by using the block size (BlockSize) field on the SNOPTION command. Larger values of BlockSize will reduce the amount of I/O transfer done by holding more data in memory, which generally speeds up the overall solution time. However, this is only recommended when there is enough physical memory to do so.
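As a closing sketch, both Supernode controls can be set before the modal solve; the values are illustrative, and the RangeFact and BlockSize field positions follow the SNOPTION documentation and should be verified for your release:

! Tighten accuracy (RangeFact) and reduce mode-shape recovery I/O (BlockSize)
SNOPTION,3,80
MODOPT,SNODE,500,0,10000
SOLVE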