The PCG solver performance summary information is not written to the output file. Instead, a separate file named Jobname.pcs is always written when the PCG solver is used. This file contains useful information about the computational costs of the iterative PCG solver. Iterative solver computations typically involve sparse matrix operations rather than the dense block kernels that dominate the sparse solver factorizations. Thus, for the iterative solver, performance metrics reflect measures of memory bandwidth rather than peak processor speeds. The information in Jobname.pcs is also useful for identifying which preconditioner option was chosen for a given simulation and allows users to try other options to eliminate performance bottlenecks. Example 6.3: Jobname.pcs Output File shows a typical Jobname.pcs file.
The memory information in Example 6.3: Jobname.pcs Output File shows that this 2.67 million DOF PCG solver run requires only 1.5 GB of memory (A). This model uses SOLID186 (Structural Solid) elements which, by default, use the MSAVE,ON feature for this static analysis. The MSAVE feature uses an "implicit" matrix-vector multiplication algorithm that avoids using a large "explicitly" assembled stiffness matrix. (See the MSAVE command description for more information.) The PCS file reports the number of elements assembled and the number that use the memory-saving option (B).
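For reference, a minimal input sketch that selects the PCG solver with the default tolerance and explicitly requests the memory-saving option (MSAVE,ON is already the default for this element type in a static analysis) might look like:

   ET,1,SOLID186       ! structural solid elements that support MSAVE,ON
   EQSLV,PCG,1.0E-8    ! select the PCG solver with the default tolerance
   MSAVE,ON            ! use the implicit matrix-vector multiplication algorithm
   SOLVE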
The PCS file also reports the number of iterations (C) and which preconditioner was used by means of the level of difficulty (D). By default, the level of difficulty is set automatically, but it can be user-controlled by the Lev_Diff option on the PCGOPT command. As the value of Lev_Diff increases, more expensive preconditioner options are used that often increase memory requirements and computations. However, increasing Lev_Diff also reduces the number of iterations required to reach convergence for the given tolerance.
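As a minimal sketch, the level of difficulty can be fixed before solving (the value 2 here is purely illustrative):

   PCGOPT,2            ! force level of difficulty 2 instead of the automatic setting
   SOLVE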
As a rule of thumb, when using the default tolerance of 1.0e-8 and a level of difficulty of 1 (Lev_Diff = 1), a static or full transient analysis with the PCG solver that requires more than 2000 iterations per equilibrium iteration probably reflects an inefficient use of the iterative solver. In this scenario, raising the level of difficulty to bring the number of iterations closer to the 300-750 range will usually result in the most efficient solution. If increasing the level of difficulty does not significantly drop the number of iterations, then the PCG solver is probably not an efficient option, and the model may require the sparse direct solver for a faster solution time, as shown in the sketch below.
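Switching to the sparse direct solver is a one-line change in the input (a sketch, assuming a static or full transient analysis):

   EQSLV,SPARSE        ! fall back to the sparse direct solver
   SOLVE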
The key is to find the best preconditioner option, using Lev_Diff, that balances the cost per iteration against the total number of iterations. Simply reducing the number of iterations with an increased Lev_Diff does not always achieve the expected end result of lower elapsed time to solution, because the cost per iteration may increase too much for a given model. Another factor that can complicate this decision is parallel processing. For both shared memory and distributed memory processing, using more cores to help with the computations reduces the cost per iteration, which typically shifts the optimal Lev_Diff value slightly lower. In addition, lower Lev_Diff values generally scale better in the preconditioner computations when parallel processing is used. Therefore, when using 16 or more cores, it is recommended that you decrease by one the optimal Lev_Diff value found when using one core, in an attempt to achieve better scalability and improve overall solver performance.
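As an illustrative sketch, if Lev_Diff = 3 were found optimal in a single-core run, a 16-core run might instead use:

   ! run launched with 16 cores (for example, -np 16 on the command line)
   PCGOPT,2            ! one level below the single-core optimum of 3
   SOLVE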
The CPU performance reported in the PCS file is divided into matrix multiplication using the stiffness matrix (E) and the various compute kernels of the preconditioner (F). The Mflop rates reported in the PCS file are normally much lower than those reported for the sparse solver matrix factorization kernels, but they provide a good measure for comparing the relative memory bandwidth performance of different hardware systems.
The I/O reported (G) in the PCS file is much less than that required for matrix factorization in the sparse solver. This I/O occurs only during solver preprocessing, before the iterative solution, and is generally not a performance factor for the PCG solver. The one exception to this rule is when the Lev_Diff = 5 option on the PCGOPT command is specified and the factored matrix used for this preconditioner is out-of-core. Normally, this option is only used for the iterative PCG Lanczos eigensolver, and only for smaller problems (under 1 million DOFs) where the factored matrix (or matrices) usually fits in memory.
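A minimal sketch of such a modal run (the number of modes requested is illustrative):

   /SOLU
   ANTYPE,MODAL
   MODOPT,LANPCG,10    ! PCG Lanczos eigensolver, requesting 10 modes
   PCGOPT,5            ! level of difficulty 5: factored-matrix preconditioner
   SOLVE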
This example shows a model that performed quite well with the PCG solver. Considering that it converged in about 1000 iterations, the Lev_Diff value of 1 is probably optimal for this model (especially at higher core counts). However, in this case it might be worthwhile to try Lev_Diff = 2 to see if it improves the solver performance. Using more than one core would also certainly help to reduce the time to solution.
Example 6.3: Jobname.pcs Output File
Number of cores used: 1
Degrees of Freedom: 2671929
DOF Constraints: 34652
Elements: 211280                           <---B
   Assembled: 0                            <---B (MSAVE,ON does not apply)
   Implicit: 211280                        <---B (MSAVE,ON applies)
Nodes: 890643
Number of Load Cases: 1

Nonzeros in Upper Triangular part of Global Stiffness Matrix : 0
Nonzeros in Preconditioner: 46325031
   *** Precond Reorder: MLD ***
   Nonzeros in V: 30862944
   Nonzeros in factor: 10118229
   Equations in factor: 25806
*** Level of Difficulty: 1   (internal 0) ***     <---D (Preconditioner)

Total Operation Count: 2.07558e+12
Total Iterations In PCG: 1042              <---C (Convergence)
Average Iterations Per Load Case: 1042.0
Input PCG Error Tolerance: 1e-08
Achieved PCG Error Tolerance: 9.93822e-09

DETAILS OF PCG SOLVER SETUP TIME(secs)        Cpu       Wall
     Gather Finite Element Data               0.30      0.29
     Element Matrix Assembly                  6.89      6.80

DETAILS OF PCG SOLVER SOLUTION TIME(secs)     Cpu       Wall
     Preconditioner Construction              6.73      6.87
     Preconditioner Factoring                 1.32      1.32
     Apply Boundary Conditions                0.24      0.25
     Preconditioned CG Iterations           636.98    636.86
        Multiply With A                     470.76    470.64   <---E (Matrix Mult. Time)
           Multiply With A22                470.76    470.64
        Solve With Precond                  137.10    137.01   <---F (Preconditioner Time)
           Solve With Bd                     26.63     26.42
           Multiply With V                   89.47     89.37
           Direct Solve                      14.87     14.94
******************************************************************************
TOTAL PCG SOLVER SOLUTION CP TIME      =   645.89 secs
TOTAL PCG SOLVER SOLUTION ELAPSED TIME =   648.25 secs
******************************************************************************
Total Memory Usage at CG     :   1514.76 MB   <---A (Memory)
PCG Memory Usage at CG       :    523.01 MB
Memory Usage for MSAVE Data  :    150.90 MB
*** Memory Saving Mode Activated : Jacobians Precomputed ***
******************************************************************************
Multiply with A MFLOP Rate       :   3911.13 MFlops
Solve With Precond MFLOP Rate    :   1476.93 MFlops
Precond Factoring MFLOP Rate     :      0.00 MFlops
******************************************************************************
Total amount of I/O read    :   1873.54 MB   <---G
Total amount of I/O written :   1362.51 MB   <---G
******************************************************************************