6.3. Sparse Solver Performance Output

Performance information for the sparse solver is printed by default to the file Jobname.DSP. Use the command DSPOPTION,,,,,,PERFORMANCE to print this same information, along with additional memory usage information, to the standard output file.

For jobs that call the sparse solver multiple times (nonlinear, transient, etc.), a useful technique for studying performance output is to add the command NCNV,1,,n, where n specifies a fixed number of cumulative iterations. The job runs up to n cumulative iterations and then stops. For most nonlinear jobs, 3 to 5 calls to the sparse solver are sufficient to understand memory usage and performance for a long run. The NCNV command can then be removed, and the entire job can be run using memory settings determined from the test run.

Example 6.1: Sparse Solver Performance Summary shows an example of the output from the sparse solver performance summary. The times reported in this summary include both CPU time and wall clock time. In general, CPU time reports only the time that a processor spends on the user's application, leaving out system time and I/O wait time. When using a single core, the CPU time is a subset of the wall time. However, when using multiple cores, some systems accumulate the CPU times from all cores, so the CPU time reported by the program can exceed the wall time. The most meaningful performance measure is therefore wall clock time, because it accurately measures total elapsed time. Wall times are reported in the second column of numbers in the sparse solver performance summary.
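The CPU-versus-wall relationship can be sketched numerically. The helper below and the multicore figures in it are illustrative assumptions, not output from the program:

```python
# Illustrative sketch (hypothetical numbers): how summed per-core CPU time
# relates to wall clock time for a parallel solver run.

def parallel_efficiency(total_cpu_s: float, wall_s: float, cores: int) -> float:
    """Fraction of the available core-seconds actually spent computing."""
    return total_cpu_s / (cores * wall_s)

# Single core: CPU time is a subset of wall time (system/I/O time excluded),
# using the factorization times from Example 6.1.
assert parallel_efficiency(2051.5, 2060.4, cores=1) < 1.0

# Multiple cores: CPU time accumulated across cores can exceed wall time.
cpu, wall, cores = 7600.0, 2100.0, 4   # hypothetical 4-core run
print(cpu > wall)                       # True: summed CPU exceeds wall
print(round(parallel_efficiency(cpu, wall, cores), 2))
```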

When comparing CPU and wall times for a single core, a wall time that greatly exceeds the CPU time usually indicates that a large amount of elapsed time was spent doing actual I/O. When this is the case, determining why the I/O occurred (that is, looking at the memory mode and comparing the size of the matrix factor to physical memory) can often lead to dramatic performance improvements.

The most important performance information from the sparse solver performance summary is matrix factorization time and rate, solve time and rate, and the file I/O statistics (items marked A, B, and C in Example 6.1: Sparse Solver Performance Summary).

Matrix factorization time and rate measure the performance of the computationally intensive matrix factorization. The factorization rate provides the best single measure of peak obtainable speed for most hardware systems because the bulk of the factorization computations use highly tuned math library routines. The rate is reported in units of Mflops (millions of floating point operations per second) and is computed as an accurate count of the total number of floating point operations required for factorization (also reported in Example 6.1: Sparse Solver Performance Summary), in millions, divided by the total elapsed time for the matrix factorization. While the factorization is typically dominated by a single math library routine, the total elapsed time is measured from the start of factorization until the finish, so the compute rate includes all overhead (including any I/O required) for factorization.
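As a quick check, the reported wall-clock factorization rate can be reproduced from the other numbers in Example 6.1; the small discrepancy comes from the flop count being printed in rounded form:

```python
# Sketch of how the factorization rate in Example 6.1 is derived:
# rate (Mflops) = flops for factor / elapsed wall time / 1e6.
flops_factor = 2.2771e13      # "no. of floating point ops for factor"
wall_factor_s = 2060.382486   # wall time for numeric factor (sec)

rate_mflops = flops_factor / wall_factor_s / 1e6
print(round(rate_mflops, 1))  # ~11052 Mflops (summary reports 11052.06)
```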

On modern hardware, factorization rates typically range from 10 Gflops to over 30 Gflops on a single core. For parallel factorization, compute rates can now exceed 300 Gflops using the fastest multicore processors. Factorization rates alone do not determine which computer system will be fastest for a given simulation, but they do provide a meaningful and accurate comparison of processor peak performance. I/O performance and memory size are also important factors in determining overall system performance.

Sparse solver I/O performance can be measured from the forward/backward solve required for each call to the solver; it is reported in the output in MB/sec. When the sparse solver runs in-core, the effective I/O rate is really a measure of memory bandwidth, and rates of 5000 MB/sec or higher will be observed on most modern processors. When out-of-core factorization is used and the system buffer cache is large enough to hold the matrix factor file in memory, the effective I/O rate will be 2000+ MB/sec. This high rate does not indicate disk speed; rather, it indicates that the system is effectively using memory to cache the I/O requests to the large matrix factor file. Typical effective I/O performance for a single drive ranges from 50 to 150 MB/sec. Higher performance, over 100 MB/sec and up to 500 MB/sec, can be obtained from RAID0 drives in Windows (or from multiple-drive, striped disk arrays on high-end Linux servers). With experience, a glance at the effective I/O rate will reveal whether a sparse solver analysis ran in-core, out-of-core using the system buffer cache, or truly out-of-core to disk using either a single drive or a multiple-drive fast RAID system.
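That "glance at the effective I/O rate" can be expressed as a simple heuristic. The thresholds below are taken from the ranges quoted above (the single-drive and RAID ranges overlap, so the 150 MB/sec cutoff is a judgment call, not a solver rule):

```python
# Hedged heuristic: classify the likely memory/I/O mode from the
# effective I/O rate (MB/sec) for solves. Thresholds are assumptions
# based on the typical ranges quoted in the text, not solver output.

def classify_io_rate(mb_per_sec: float) -> str:
    if mb_per_sec >= 5000:
        return "in-core (rate reflects memory bandwidth)"
    if mb_per_sec >= 2000:
        return "out-of-core, factor file cached by system buffer cache"
    if mb_per_sec > 150:
        return "out-of-core to a striped/RAID disk array"
    return "out-of-core to a single disk"

# Wall-clock effective I/O rate from Example 6.1:
print(classify_io_rate(279.27))   # out-of-core to a striped/RAID disk array
```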

The I/O statistics reported in the sparse solver summary list each file used by the sparse solver and show the unit number for each file. The most important file used by the sparse solver is the matrix factor file, Jobname.DSPtri (see D in Example 6.1: Sparse Solver Performance Summary). In the example, the DSPtri file is 18904 MB; it is written once and read twice, for a total of 56712 MB of data transfer. If the size of the DSPtri file exceeds or is close to the physical memory of the system, it is usually best to run the sparse solver in the out-of-core memory mode, saving the extra memory for the system buffer cache. Use the in-core memory mode only when the memory required fits comfortably within the available physical memory.
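The arithmetic behind the i/o stats table can be sketched as follows, assuming 8-byte (double precision) words and 1 MB = 2^20 bytes; these conventions reproduce the reported figures to within rounding:

```python
# Sketch of the i/o stats arithmetic in Example 6.1 (assumed conventions:
# 8-byte words, 1 MB = 2**20 bytes).
words_factor = 2477769015          # file length of the .DSPtri file in words
bytes_per_word = 8

file_mb = words_factor * bytes_per_word / 2**20
print(round(file_mb))              # ~18904 MB, as reported

# Written once and read twice => three passes over the factor file.
transferred_mb = 3 * file_mb
print(round(transferred_mb))       # ~56711 MB (the summary rounds to 56712)
```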

This example shows a well-balanced system (high factor mflops and adequate I/O rate from multiple disks in a RAID0 configuration). Additional performance gains could be achieved by using the in-core memory mode on a machine with more physical memory and by using more than one core.

Example 6.1: Sparse Solver Performance Summary

number of equations                     =         1058610
no. of nonzeroes in lower triangle of a =        25884795
no. of nonzeroes in the factor l        =      2477769015.
ratio of nonzeroes in factor (min/max)  =           1.000
number of super nodes                   =           12936
maximum order of a front matrix         =           22893
maximum size of a front matrix          =       262056171
maximum size of a front trapezoid       =         1463072
no. of floating point ops for factor    =      2.2771D+13
no. of floating point ops for solve     =      8.0370D+09
ratio of flops for factor (min/max)     =           1.000
actual no. of nonzeroes in the factor l =      2477769015.
negative pivot monitoring activated
number of negative pivots encountered   =              0.
factorization panel size                =             128
number of cores used                    =               1
time (cpu & wall) for structure input   =        3.030000        3.025097
time (cpu & wall) for ordering          =       22.860000       22.786610
time (cpu & wall) for symbolic factor   =        0.400000        0.395507
time (cpu & wall) for value input       =        3.310000        3.304702
time (cpu & wall) for numeric factor    =     2051.500000     2060.382486  <---A (Factor)
computational rate (mflops) for factor  =    11099.910696    11052.058028  <---A (Factor)
time (cpu & wall) for numeric solve     =       14.050000      109.646978  <---B (Solve)
computational rate (mflops) for solve   =      572.028766       73.298912  <---B (Solve)
effective I/O rate (MB/sec) for solve   =     2179.429565      279.268850  <---C (I/O)

i/o stats: unit-core           file length              amount transferred
                           words     mbytes            words     mbytes
             ----          -----     ------            -----     ------
          90-   0    2477769015.    18904. MB    7433307045.    56712. MB  <---D (File)

          -------     ----------     --------      ----------    --------
          Totals:    2477769015.    18904. MB    7433307045.    56712. MB


  Sparse Matrix Solver      CPU Time (sec) =       2097.670
  Sparse Matrix Solver  ELAPSED Time (sec) =       2204.057
  Sparse Matrix Solver   Memory Used ( MB) =       2357.033