Mechanical APDL offers two types of linear equation solvers: direct and iterative. There are SMP and DMP differences for each of these solver types. This section describes the important details for each solver type and presents, in tabular form, a summary of solver memory requirements. Recommendations are given for managing memory use to maximize performance.
All of the solvers covered in this chapter have heuristics which automatically select certain defaults in an attempt to optimize performance for a given set of hardware and model conditions. For the majority of analyses, the best options are chosen. However, in some cases performance can be improved by understanding how the solvers work, the resource requirements for your particular analysis, and the hardware resources that are available to the program. Each of the equation solvers discussed has one or more option commands that can be used to control the behavior, and ultimately the performance, of the solver.
The sparse solver is the default solver for virtually all analyses. It is the most robust solver available, but it is also compute- and I/O-intensive. The sparse solver is designed to run in different modes of operation, depending on the amount of memory available. Slight variations in memory requirements can occur when running different versions of the software, even for the same model run on the same hardware, because of changes in heuristics. It is important to understand that the solver's mode of operation can have a significant effect on runtime as summarized here.
Memory usage for a direct sparse solver is determined by several steps. The matrix that is input to the sparse solver is assembled entirely in memory before being written to the .full file. The sparse solver then reads the .full file, processes the matrix, factors the matrix, and computes the solution. Direct method solvers factor the input matrix into the product of a lower and upper triangular matrix in order to solve the system of equations. For symmetric input matrices (most matrices created in Mechanical APDL are symmetric), only the lower triangular factor is required since it is equivalent to the transpose of the upper triangular factor. Still, the process of factorization produces matrix factors which are 10 to 20 times larger than the input matrix. The calculation of this factor is computationally intensive. In contrast, the solution of the triangular systems is I/O or memory access-dominated with few computations required.
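As a conceptual sketch of the algebra behind these steps (standard for any symmetric direct solver and ignoring pivoting and diagonal scaling), the solution of K u = f proceeds as:

    K = L L^T        (factor the assembled matrix; compute-intensive)
    L y = f          (forward substitution)
    L^T u = y        (back substitution; the two triangular solves are I/O- or memory-access-bound)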
The following are rough estimates for the amount of memory needed for each step when using the sparse solver for most 3D analyses. For non-symmetric matrices or for complex-valued matrices (as found in harmonic analyses), these estimates approximately double.
The amount of memory needed to assemble the matrix in memory is approximately 1 GB per million DOFs.
The amount of memory needed to hold the factored matrix in memory is approximately 10 to 20 GB per million DOFs.
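For example, applying these rough estimates to a hypothetical 5 million DOF symmetric model gives approximately:

    assembly:       5 MDOFs x 1 GB/MDOFs        ≈ 5 GB
    matrix factor:  5 MDOFs x (10-20) GB/MDOFs  ≈ 50-100 GB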
It is important to note that the shared-memory version of the sparse solver is not the same as the distributed-memory version of the sparse solver. While the fundamental steps of these solvers are the same, they are actually two independent solvers, and there are subtle differences in their modes of operation. These differences will be explained in the following sections. Table 4.1: Direct Sparse Solver Memory and Disk Estimates summarizes the direct solver memory requirements.
Table 4.1: Direct Sparse Solver Memory and Disk Estimates
Memory Mode | Memory Usage Estimate | I/O File Size Estimate |
---|---|---|
Sparse Direct Solver (Shared Memory) | | |
Out-of-core | 1 GB/MDOFs | 10 GB/MDOFs |
In-core | 10 GB/MDOFs | 1 GB/MDOFs; 10 GB/MDOFs if workspace is saved to Jobname.DSPsymb |
Sparse Direct Solver (Distributed Memory, Using p Cores) | | |
Out-of-core | 1 GB/MDOFs on head compute node; 0.7 GB/MDOFs on all other compute nodes | 10 GB/MDOFs * 1/p (matrix factor is stored on disk, spread evenly over the p cores) |
In-core | 10 GB/MDOFs * 1/p (matrix factor is stored in memory, spread evenly in-core over the p cores); an additional 1.5 GB/MDOFs is required on the head compute node to store the input matrix | 1 GB/MDOFs * 1/p; 10 GB/MDOFs * 1/p if workspace is saved to Jobname.DSPsymb |
For out-of-core factorization, the factored matrix is held on disk.
If sufficient memory is available for the assembly process, it is almost always more than enough to run the sparse solver factorization in out-of-core mode. This mode uses some additional memory to make sure that the largest of all frontal matrices (dense matrix structures within the large factored matrix) can be held completely in memory. This approach attempts to achieve an optimal balance between memory usage and I/O. For larger jobs, the program will typically run the sparse solver using out-of-core memory mode (by default) unless a specific memory mode is defined.
The distributed memory sparse solver can also be run in the out-of-core mode. It is important to note that when running this solver in out-of-core mode, the additional memory allocated to make sure each individual frontal matrix is computed in memory is allocated on every process. Therefore, as more distributed processes are used (that is, as the solver runs on more cores), the memory usage for each process does not decrease; it stays roughly constant, while the total memory summed over all processes actually increases (see Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). Keep in mind, however, that the computations do scale in this memory mode as more cores are used.
In-core factorization requires that the factored matrix be held in memory and, thus, often requires 10 to 20 times more memory than out-of-core factorization. However, larger memory systems are commonplace today, and users of these systems will benefit from in-core factorization. A model with 1 million DOFs can, in many cases, be factored using 10 GB of memory—easily achieved on desktop systems with 16 GB of memory. Users can run in-core using several different methods. The simplest way to set up an in-core run is to use the BCSOPTION,,INCORE command (or DSPOPTION,,INCORE for the distributed memory sparse solver). This option tells the sparse solver to try allocating a block of memory sufficient to run using the in-core memory mode after solver preprocessing of the input matrix has determined this value. However, this method requires preprocessing of the input matrix using an initial allocation of memory to the sparse solver.
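A minimal input fragment illustrating this method is shown below (only the INCORE keyword cited above is specified; all other fields are left at their defaults):

! Shared-memory sparse solver: request the in-core memory mode
BCSOPTION,,INCORE
! Distributed-memory sparse solver: the equivalent request
DSPOPTION,,INCORE
SOLVE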
Another way to get in-core performance with the sparse solver is to start the sparse solver with enough memory to run in-core. Users can start Mechanical APDL with an initial large -m allocation (see Specifying Memory Allocation) such that the largest block available when the sparse solver begins is large enough to run the solver using the in-core memory mode. This method will typically obtain enough memory to run the solver factorization step with a lower peak memory usage than the simpler method described above, but it requires prior knowledge of how much memory to allocate in order to run the sparse solver using the in-core memory mode.
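For example, a batch launch on Linux might look like the line below; the executable name is release-dependent (shown here as a placeholder), and the -m value (initial workspace in MB) is purely illustrative and must be sized for the model at hand:

ansysXXX -b -np 4 -m 12000 -i model.dat -o model.out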
The in-core factorization should be used only when the computer system has enough memory to easily factor the matrix in-core. Users should avoid using all of the available system memory or extending into virtual memory to obtain an in-core factorization. However, users who have long-running simulations should understand how to use the in-core factorization to improve elapsed time performance.
The BCSOPTION command controls the shared-memory sparse solver memory modes and also enables performance debug summaries. See the documentation on this command for usage details. Sparse solver memory usage statistics are usually printed in the output file and can be used to determine the memory requirements for a given model, as well as the memory obtained from a given run. Following is an example output.
Memory allocated for solver      =   1536.42 MB
Memory required for in-core      =  10391.28 MB
Memory required for out-of-core  =   1191.67 MB
This sparse solver run required 10391 MB to run in-core and 1192 MB to run in out-of-core mode. "Memory allocated for solver" indicates that the amount of memory used for this run was just above the out-of-core memory requirement, so this job will use the out-of-core memory mode.
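These statistics are printed by default; a more detailed performance summary can be requested through the Solve_Info field of the same command. A sketch is shown below (the PERFORMANCE keyword and its field position are assumptions based on the BCSOPTION documentation and should be verified for your release):

! Request a detailed sparse solver performance summary in the output file
BCSOPTION,,,,,,PERFORMANCE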
The DSPOPTION command controls the distributed-memory sparse solver memory modes and also enables performance debug summaries. Similar memory usage statistics for this solver are printed in the output file for each distributed process. When running the distributed sparse solver using multiple cores on a single node, or when running on a cluster with a slow I/O configuration, using the in-core mode can significantly improve overall solver performance because the costly I/O time is avoided.
The memory required per core to run in out-of-core mode approaches a constant value as the number of cores increases because each core in the distributed sparse solver has to store a minimum amount of information to carry out factorization in the optimal manner. The more cores that are used, the more total memory is needed for out-of-core performance (it increases slightly at 32 or more cores).
In contrast to the out-of-core mode, the memory required per core to run in the in-core mode decreases as more processes are used with the distributed memory sparse solver (see the left-hand side of Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). This is because the portion of the total matrix stored and factored on each core becomes smaller and smaller. The total memory needed will increase slightly as the number of cores increases.
At some point, as the number of processes increases (usually between 8 and 32), the total memory usage for these two modes approaches the same value (see the right-hand side of Figure 4.1: In-core vs. Out-of-core Memory Usage for Distributed Memory Sparse Solver). When the out-of-core mode memory requirement matches the in-core requirement, the solver automatically runs in-core. This is an important effect; it shows that when a job is spread out across enough machines, the distributed memory sparse solver can effectively use the memory of the cluster to automatically run a very large job in-core.
Sparse solver partial pivoting is an important detail that may inhibit in-core factorization. Pivoting in direct solvers refers to a dynamic reordering of rows and columns to maintain numerical stability. This reordering is based on a test of the size of the diagonal (called the pivot) in the current matrix factor column during factorization.
Pivoting is not required for most analysis types, but it is enabled when certain element types and options are used (for example, pure Lagrange contact and mixed u-P formulation). When pivoting is enabled, the size of the matrix factor cannot be known before the factorization; thus, the in-core memory requirement cannot be accurately computed. As a result, it is generally recommended that pivoting-enabled factorizations with the sparse solver use the out-of-core memory mode.
The iterative solvers offer a powerful alternative to more expensive sparse direct methods. They do not require a costly matrix factorization of the assembled matrix, and they always run in memory and do only minimal I/O. However, iterative solvers proceed from an initial random guess to the solution by an iterative process and are dependent on matrix properties that can cause the iterative solver to fail to converge in some cases. Hence, the iterative solvers are not the default solvers in Mechanical APDL.
The most important factor determining the effectiveness of the iterative solvers for a simulation is the preconditioning step. The preconditioned conjugate gradient (PCG) iterative solver uses two different proprietary preconditioners which have been specifically developed for a wide range of element types. The newer node-based preconditioner (added at Release 10.0) requires more memory and is invoked at higher levels of difficulty, but it is especially effective for problems with poor element aspect ratios.
The specific preconditioner option can be specified using the Lev_Diff argument on the PCGOPT command. Lev_Diff = 1 selects the original element-based preconditioner for the PCG solver, and Lev_Diff values of 2, 3, and 4 select the newer node-based preconditioner with differing levels of difficulty. Finally, Lev_Diff = 5 uses a preconditioner that requires a complete factorization of the assembled global matrix. This last option (which is discussed in PCG Lanczos Solver) is mainly used for the PCG Lanczos solver (LANPCG) and is only recommended for smaller problems where there is sufficient memory to use this option. The program uses heuristics to choose the default preconditioner option and, in most cases, makes the best choice. However, in cases where the program automatically selects a high level of difficulty and the user is running on a system with limited memory, it may be necessary to reduce memory requirements by manually specifying a lower level of difficulty (via the PCGOPT command). This is because peak memory usage for the PCG solvers often occurs during preconditioner construction.
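For example, to override the automatic choice (the value shown is arbitrary; all other PCGOPT fields remain at their defaults):

! Select the node-based preconditioner at level of difficulty 2
PCGOPT,2
! Return control of the preconditioner choice to the program
PCGOPT,AUTO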
The basic memory formula for iterative solvers is 1 GB per million DOFs. Using a higher level of difficulty preconditioner raises this amount, and higher-order elements also increase the basic memory requirement. An important memory saving feature for the PCG solvers is implemented for several key element types. This option, invoked via the MSAVE command, avoids the need to assemble the global matrix by computing the matrix/vector multiplications required for each PCG iteration at the element level. The MSAVE option can save up to 70 percent of the memory requirement for the PCG solver if the majority of the elements in a model are elements that support this feature. MSAVE is automatically turned on for some linear static analyses when SOLID186 and/or SOLID187 elements that meet the MSAVE criteria are present. It is turned on because it often reduces the overall solution time in addition to reducing the memory usage. It is most effective for these analyses when dominated by SOLID186 elements using reduced integration, or by SOLID187 elements. For large deflection nonlinear analyses, the MSAVE option is not on by default since it increases solution time substantially compared to using the assembled matrix for this analysis type; however, it can still be turned on manually to achieve considerable memory savings.
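The MSAVE feature is switched explicitly as sketched below, for example to turn it on manually in a large-deflection nonlinear run where it is off by default (accepting the longer solution time in exchange for the memory savings):

! Compute element-level matrix/vector products instead of assembling the global matrix
MSAVE,ON
! Explicitly disable the memory-saving option
MSAVE,OFF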
The total memory usage of the DMP version of the iterative solver is higher than that of the corresponding SMP version due to some duplicated data structures required on each process in the DMP version. However, the total memory requirement scales across the processes, so memory use per process decreases as the number of processes increases. The preconditioner requires an additional data structure that is stored and used only by the master process, so the memory required for the master process is larger than that of all other processes. Table 4.2: Iterative PCG Solver Memory and Disk Estimates summarizes the memory requirements for iterative solvers.
The table shows that for very large models running in DMP mode, the most significant term becomes the 300 MB/MDOFs requirement for the master process. This term does not scale (reduce) as more cores are used. A 10 MDOFs model using the iterative solver would require 3 GB of memory for this part of PCG solver memory, in addition to 12 GB distributed evenly across the nodes in the cluster. A 100 MDOFs model would require 30 GB of memory in addition to 120 GB of memory divided evenly among the nodes of the cluster.
Table 4.2: Iterative PCG Solver Memory and Disk Estimates
PCG Solver Memory and Disk Estimates (Shared Memory)
PCG Solver Memory and Disk Estimates (Distributed Memory, Using p Cores)
Finding the natural frequencies and mode shapes of a structure is one of the most computationally demanding tasks. Specific equation solvers, called eigensolvers, are used to solve for the natural frequencies and mode shapes. Mechanical APDL offers three eigensolvers for modal analyses of undamped systems: the sparse solver-based Block Lanczos solver, the PCG Lanczos solver, and the Supernode solver.
The memory requirements for the two Lanczos-based eigensolvers are related to the memory requirements for the sparse and PCG solvers used in each method, as described above. However, there is additional memory required to store the mass matrices as well as blocks of vectors used in the Lanczos iterations. For the Block Lanczos solver, I/O is a critical factor in determining performance. For the PCG Lanczos solver, the choice of the preconditioner is an important factor.
The Block Lanczos solver (MODOPT,LANB) uses the sparse direct solver. However, in addition to requiring a minimum of one matrix factorization, the Block Lanczos algorithm also computes blocks of vectors that are stored in files during the Block Lanczos iterations. The size of these files grows as more modes are computed. Each Block Lanczos iteration requires multiple solves using the large matrix factor file (or in-memory factor if the in-core memory mode is used) and one in-memory block of vectors. The larger the BlockSize (input on the MODOPT command), the fewer block solves are required, reducing the I/O cost of the solves.
If the amount of memory allocated for the solver is less than the recommended amount for a Block Lanczos run, the block size used internally for the Lanczos iterations is automatically reduced. Smaller block sizes require more block solves, the most expensive part of the Lanczos algorithm in terms of I/O performance. Typically, the default block size of 8 is optimal. On machines with limited physical memory where the I/O cost in Block Lanczos is very high (for example, machines without enough physical memory to run the Block Lanczos eigensolver using the in-core memory mode), forcing a larger BlockSize (such as 12 or 16) on the MODOPT command can reduce the amount of I/O and, thus, improve overall performance.
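For example, a Block Lanczos modal solution that forces a larger block size might be set up as follows; the mode count and frequency range are illustrative, and the argument position used for BlockSize follows the MODOPT documentation and should be verified for your release:

ANTYPE,MODAL
! Block Lanczos, 40 modes in 0-1000 Hz, BlockSize forced to 16 to reduce I/O
MODOPT,LANB,40,0,1000,,,,16
SOLVE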
Finally, multiple matrix factorizations may be required for a Block Lanczos run. (See the following table for Block Lanczos memory requirements.) The algorithm decides dynamically whether to refactor using a new shift point or to continue Lanczos iterations using the current shift point. This decision is influenced by the measured speed of matrix factorization versus the rate of convergence for the requested modes and the cost of each Lanczos iteration. This means that performance characteristics can change when hardware is changed, when the memory mode is changed from out-of-core to in-core (or vice versa), or when shared memory parallelism is used.
Table 4.3: Block Lanczos Eigensolver Memory and Disk Estimates
Memory Mode | Memory Usage Estimate | I/O File Size Estimate |
---|---|---|
Out-of-core | 1.5 GB/MDOFs | 15-20 GB/MDOFs |
In-core | 15-20 GB/MDOFs | ~1.5 GB/MDOFs |
The PCG Lanczos solver (MODOPT,LANPCG) represents a breakthrough in modal analysis capability because it allows users to extend the maximum size of models used in modal analyses well beyond the capacity of direct solver-based eigensolvers. The PCG Lanczos eigensolver works with the PCG options command (PCGOPT) as well as with the memory saving feature (MSAVE). Both shared-memory parallel performance and distributed-memory parallel performance can be obtained by using this eigensolver.
Controlling PCG Lanczos Parameters
The PCG Lanczos eigensolver can be controlled using several options on the PCGOPT command. The first of these options is the Level of Difficulty value (Lev_Diff). In most cases, choosing a value of AUTO (which is the default) for Lev_Diff is sufficient to obtain an efficient solution time. However, in some cases you may find that manually adjusting the Lev_Diff value further reduces the total solution time. Setting the Lev_Diff value equal to 1 uses less memory compared to other Lev_Diff values; however, the solution time is longer in most cases. Setting higher Lev_Diff values (for example, 3 or 4) can help for problems that cause the PCG solver to have some difficulty in converging. This typically occurs when elements are poorly shaped or are very elongated (that is, have high aspect ratios).
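A short sketch combining the two commands (the mode count and the manually chosen level of difficulty are illustrative values only):

ANTYPE,MODAL
! PCG Lanczos eigensolver, 50 modes requested
MODOPT,LANPCG,50
! Raise the level of difficulty for meshes with poorly shaped or elongated elements
PCGOPT,3
SOLVE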
A Lev_Diff value of 5 causes a fundamental change to the equation solver used by the PCG Lanczos eigensolver. This Lev_Diff value makes the PCG Lanczos eigensolver behave more like the Block Lanczos eigensolver by replacing the PCG iterative solver with a direct solver similar to the sparse direct solver. As with the Block Lanczos eigensolver, the numeric factorization step can be done either in an in-core memory mode or in an out-of-core memory mode. The Memory field on the PCGOPT command allows the user to force one of these two modes or to let the program decide which mode to use. By default, only a single matrix factorization is done by this solver unless the Sturm check option on the PCGOPT command is enabled, which results in one additional matrix factorization.
Due to the amount of computer resources needed by the direct solver, choosing a Lev_Diff value of 5 essentially eliminates the reduction in computer resources obtained by using the PCG Lanczos eigensolver instead of the Block Lanczos eigensolver. Thus, this option is generally only recommended over Lev_Diff values 1 through 4 for problems that have fewer than one million degrees of freedom, though its efficiency is highly dependent on several factors such as the number of modes requested and I/O performance. Lev_Diff = 5 is more efficient than other Lev_Diff values when more modes are requested, so larger numbers of requested modes may increase the problem size for which a value of 5 should be used. The Lev_Diff value of 5 requires a costly factorization step which can be computed using an in-core memory mode or an out-of-core memory mode; when this option runs in the out-of-core memory mode on a machine with slow I/O performance, the problem size for which a value of 5 should be used decreases.
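A sketch of a Lev_Diff = 5 setup is shown below; the field positions used for the Sturm check (StrmCk) and memory mode (Memory) arguments, and the INCORE keyword, are assumptions based on the PCGOPT documentation and should be verified for your release:

ANTYPE,MODAL
MODOPT,LANPCG,100
! Lev_Diff = 5: factorization-based preconditioner; Sturm check on; in-core factorization requested
PCGOPT,5,,,ON,,INCORE
SOLVE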
Using Lev_Diff = 5 with PCG Lanczos in DMP Analyses
The Lev_Diff value of 5 is supported in DMP analyses. When used with the PCG Lanczos eigensolver, Lev_Diff = 5 causes this eigensolver to run in a completely distributed fashion, similar to Block Lanczos in DMP mode for a modal analysis.
The Lev_Diff = 5 setting can require a large amount of memory or disk I/O compared to Lev_Diff values of 1 through 4 because this setting uses a direct solver approach (that is, a matrix factorization) within the Lanczos algorithm. However, by running in a distributed fashion it can spread these resource requirements over multiple machines, thereby helping to achieve significant speedup and extending the class of problems for which the PCG Lanczos eigensolver is a good candidate. If Lev_Diff = 5 is specified, choosing the option to perform a Sturm check (via the PCGOPT command) does not require additional resources (for example, additional memory usage or disk space). A Sturm check does require one additional factorization to guarantee that no modes were skipped in the specified frequency range, and so it does require more computations to perform this extra factorization. However, since the Lev_Diff = 5 setting already performs a matrix factorization for the Lanczos procedure, no extra memory or disk space is required.
Table 4.4: PCG Lanczos Memory and Disk Estimates
PCG Lanczos Solver (Shared Memory)
PCG Lanczos Solver (Distributed Memory, Using p Cores)
The Supernode eigensolver (MODOPT,SNODE) is designed to efficiently solve modal analyses in which a high number of modes is requested. For this class of problems, this solver often does less computation and uses considerably less computer resources than the Block Lanczos eigensolver. By utilizing fewer resources than Block Lanczos, the Supernode eigensolver becomes an ideal choice when solving this sort of analysis on the typical desktop machine, which can often have limited memory and slow I/O performance.
The MODOPT command allows you to specify how many frequencies are desired and the range within which those frequencies lie. With other eigensolvers, the number of modes requested affects the performance of the solver, and the frequency range is essentially optional; asking for more modes increases the solution time, while the frequency range generally decides which computed frequencies are output. The Supernode eigensolver behaves in exactly the opposite way with regard to the MODOPT command input: it computes all of the frequencies within the requested range regardless of the number of modes the user requests. For maximum efficiency, it is highly recommended that you input a range that only covers the spectrum of frequencies between the first and last mode of interest. The number of modes requested on the MODOPT command then determines how many of the computed modes are output.
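For instance, to compute every mode between 0 and 5000 Hz and write out the first 200 of them (both numbers purely illustrative):

ANTYPE,MODAL
! Supernode eigensolver: the frequency range drives the computation;
! the requested mode count only limits how many computed modes are output
MODOPT,SNODE,200,0,5000
SOLVE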
The Supernode eigensolver benefits from shared-memory parallelism. Also, for users who want full control of this modal solver, the SNOPTION command gives you control over several important parameters that affect the accuracy and efficiency of the Supernode eigensolver.
Controlling Supernode Parameters
The Supernode eigensolver computes approximate eigenvalues. Typically, this should not be an issue as the lowest modes in the system (which are often used to compute the resonant frequencies) are computed very accurately (<< 1% difference compared to the same analysis performed with the Block Lanczos eigensolver). However, the accuracy drifts somewhat with the higher modes. For the highest requested modes in the system, the difference (compared to Block Lanczos) is often a few percent, and so it may be desirable in certain cases to tighten the accuracy of the solver. This can be done using the range factor (RangeFact) field on the SNOPTION command. Higher values of RangeFact lead to more accurate solutions at the cost of extra memory and computations.
When computing the final mode shapes, the Supernode eigensolver often does the bulk of its I/O transfer to and from disk. While the amount of I/O transfer is often significantly less than that done in a similar run using Block Lanczos, it can be desirable to further minimize this I/O, thereby maximizing the Supernode solver efficiency. You can do this by using the block size (BlockSize) field on the SNOPTION command. Larger values of BlockSize will reduce the amount of I/O transfer done by holding more data in memory, which generally speeds up the overall solution time. However, this is only recommended when there is enough physical memory to do so.
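As a closing sketch, both Supernode controls can be set before the modal solve; the values are illustrative, and the RangeFact and BlockSize field positions follow the SNOPTION documentation and should be verified for your release:

! Tighten accuracy (RangeFact) and reduce mode-shape recovery I/O (BlockSize)
SNOPTION,3,80
MODOPT,SNODE,500,0,10000
SOLVE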