The PCG solver performance summary information is not written to the output file. Instead, a separate file named Jobname.pcs is always written when the PCG solver is used. This file contains useful information about the computational costs of the iterative PCG solver. Iterative solver computations typically involve sparse matrix operations rather than the dense block kernels that dominate the sparse solver factorizations. Thus, for the iterative solver, performance metrics reflect measures of memory bandwidth rather than peak processor speeds. The information in Jobname.pcs is also useful for identifying which preconditioner option was chosen for a given simulation and allows users to try other options to eliminate performance bottlenecks. Example 6.3: Jobname.pcs Output File shows a typical Jobname.pcs file.
The memory information in Example 6.3: Jobname.pcs Output File shows that this 2.67 million DOF PCG solver run requires only 1.5 GB of memory (A). This model uses SOLID186 (Structural Solid) elements which, by default, use the MSAVE,ON feature for this static analysis. The MSAVE feature uses an "implicit" matrix-vector multiplication algorithm that avoids using a large "explicitly" assembled stiffness matrix. (See the MSAVE command description for more information.) The PCS file reports the number of elements assembled and the number that use the memory-saving option (B).
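For reference, a minimal input sketch that selects the PCG solver with the default tolerance and explicitly requests the memory-saving option (MSAVE,ON is already the default for this element type in a static analysis) might look like:

   ET,1,SOLID186       ! structural solid elements that support MSAVE,ON
   EQSLV,PCG,1.0E-8    ! select the PCG solver with the default tolerance
   MSAVE,ON            ! use the implicit matrix-vector multiplication algorithm
   SOLVE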
The PCS file also reports the number of iterations (C) and which preconditioner was used by means of the level of difficulty (D). By default, the level of difficulty is set automatically, but it can be user-controlled by the Lev_Diff option on the PCGOPT command. As the value of Lev_Diff increases, more expensive preconditioner options are used that often increase memory requirements and computations. However, increasing Lev_Diff also reduces the number of iterations required to reach convergence for the given tolerance.
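As a minimal sketch, the level of difficulty can be fixed before solving (the value 2 here is purely illustrative):

   PCGOPT,2            ! force level of difficulty 2 instead of the automatic setting
   SOLVE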
As a rule of thumb, when using the default tolerance of 1.0e-8 and a level of difficulty of 1 (Lev_Diff = 1), a static or full transient analysis with the PCG solver that requires more than 2000 iterations per equilibrium iteration probably reflects an inefficient use of the iterative solver. In this scenario, raising the level of difficulty to bring the number of iterations closer to the 300-750 range will usually result in the most efficient solution. If increasing the level of difficulty does not significantly drop the number of iterations, then the PCG solver is probably not an efficient option, and the model may require the sparse direct solver for a faster solution time, as shown in the sketch below.
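Switching to the sparse direct solver is a one-line change in the input (a sketch, assuming a static or full transient analysis):

   EQSLV,SPARSE        ! fall back to the sparse direct solver
   SOLVE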
The key is to find the best preconditioner option, using Lev_Diff, that balances the cost per iteration against the total number of iterations. Simply reducing the number of iterations with an increased Lev_Diff does not always achieve the expected end result of lower elapsed time to solution, because the cost per iteration may increase too much for a given model. Another factor that can complicate this decision is parallel processing. For both shared memory and distributed memory processing, using more cores to help with the computations reduces the cost per iteration, which typically shifts the optimal Lev_Diff value slightly lower. In addition, lower Lev_Diff values generally scale better in the preconditioner computations when parallel processing is used. Therefore, when using 16 or more cores, it is recommended that you decrease by one the optimal Lev_Diff value found when using one core, in an attempt to achieve better scalability and improve overall solver performance.
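As an illustrative sketch, if Lev_Diff = 3 were found optimal in a single-core run, a 16-core run might instead use:

   ! run launched with 16 cores (for example, -np 16 on the command line)
   PCGOPT,2            ! one level below the single-core optimum of 3
   SOLVE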
The CPU performance reported in the PCS file is divided into matrix multiplication using the stiffness matrix (E) and the various compute kernels of the preconditioner (F). The Mflop rates reported in the PCS file are normally much lower than those reported for the sparse solver matrix factorization kernels, but they provide a good measure for comparing the relative memory bandwidth performance of different hardware systems.
The I/O reported (G) in the PCS file is much less than that required for matrix factorization in the sparse solver. This I/O occurs only during solver preprocessing, before the iterative solution, and is generally not a performance factor for the PCG solver. The one exception to this rule is when the Lev_Diff = 5 option on the PCGOPT command is specified and the factored matrix used for this preconditioner is out-of-core. Normally, this option is only used for the iterative PCG Lanczos eigensolver, and only for smaller problems (under 1 million DOFs) where the factored matrix (or matrices) usually fits in memory.
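A minimal sketch of such a modal run (the number of modes requested is illustrative):

   /SOLU
   ANTYPE,MODAL
   MODOPT,LANPCG,10    ! PCG Lanczos eigensolver, requesting 10 modes
   PCGOPT,5            ! level of difficulty 5: factored-matrix preconditioner
   SOLVE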
This example shows a model that performed quite well with the PCG solver. Considering that it converged in about 1000 iterations, the Lev_Diff value of 1 is probably optimal for this model (especially at higher core counts). However, in this case it might be worthwhile to try Lev_Diff = 2 to see if it improves the solver performance. Using more than one core would also certainly help to reduce the time to solution.
Example 6.3: Jobname.pcs Output File
Number of cores used: 1
Degrees of Freedom: 2671929
DOF Constraints: 34652
Elements: 211280                           <---B
   Assembled: 0                            <---B (MSAVE,ON does not apply)
   Implicit: 211280                        <---B (MSAVE,ON applies)
Nodes: 890643
Number of Load Cases: 1

Nonzeros in Upper Triangular part of Global Stiffness Matrix : 0
Nonzeros in Preconditioner: 46325031
   *** Precond Reorder: MLD ***
   Nonzeros in V: 30862944
   Nonzeros in factor: 10118229
   Equations in factor: 25806
*** Level of Difficulty: 1   (internal 0) ***     <---D (Preconditioner)

Total Operation Count: 2.07558e+12
Total Iterations In PCG: 1042              <---C (Convergence)
Average Iterations Per Load Case: 1042.0
Input PCG Error Tolerance: 1e-08
Achieved PCG Error Tolerance: 9.93822e-09

DETAILS OF PCG SOLVER SETUP TIME(secs)        Cpu       Wall
     Gather Finite Element Data               0.30      0.29
     Element Matrix Assembly                  6.89      6.80

DETAILS OF PCG SOLVER SOLUTION TIME(secs)     Cpu       Wall
     Preconditioner Construction              6.73      6.87
     Preconditioner Factoring                 1.32      1.32
     Apply Boundary Conditions                0.24      0.25
     Preconditioned CG Iterations           636.98    636.86
        Multiply With A                     470.76    470.64   <---E (Matrix Mult. Time)
           Multiply With A22                470.76    470.64
        Solve With Precond                  137.10    137.01   <---F (Preconditioner Time)
           Solve With Bd                     26.63     26.42
           Multiply With V                   89.47     89.37
           Direct Solve                      14.87     14.94
******************************************************************************
TOTAL PCG SOLVER SOLUTION CP TIME      =   645.89 secs
TOTAL PCG SOLVER SOLUTION ELAPSED TIME =   648.25 secs
******************************************************************************
Total Memory Usage at CG     :   1514.76 MB   <---A (Memory)
PCG Memory Usage at CG       :    523.01 MB
Memory Usage for MSAVE Data  :    150.90 MB
*** Memory Saving Mode Activated : Jacobians Precomputed ***
******************************************************************************
Multiply with A MFLOP Rate       :   3911.13 MFlops
Solve With Precond MFLOP Rate    :   1476.93 MFlops
Precond Factoring MFLOP Rate     :      0.00 MFlops
******************************************************************************
Total amount of I/O read    :   1873.54 MB   <---G
Total amount of I/O written :   1362.51 MB   <---G
******************************************************************************