6.4. Distributed Sparse Solver Performance Output

Similar to the shared-memory sparse solver, performance information for the distributed-memory version of the sparse solver is written by default to the file Jobname.DSP in distributed-memory parallel (DMP) runs. The DSPOPTION,,,,,,PERFORMANCE command can also be used to print this same information, along with additional solver information, to the standard output file.
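
For example, a minimal solution-phase input fragment that requests this additional output might look like the following (shown only as a sketch; the /SOLU, EQSLV, and SOLVE lines are assumed context for where DSPOPTION is issued):

/SOLU
EQSLV,SPARSE                  ! sparse direct solver (the distributed sparse solver in DMP runs)
DSPOPTION,,,,,,PERFORMANCE    ! print the performance summary and additional solver information
SOLVE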

Example 6.2: Distributed Sparse Solver Performance Summary for 4 Processes and 1 GPU shows an example of the performance summary output from the distributed-memory sparse solver. Most of this information is identical in format and content to what is described in Sparse Solver Performance Output. However, a few items are unique to the distributed-memory sparse solver or are presented differently, and these are discussed below.

Example 6.2: Distributed Sparse Solver Performance Summary for 4 Processes and 1 GPU

number of equations                     =          774144
no. of nonzeroes in lower triangle of a =        30765222
no. of nonzeroes in the factor l        =      1594957656
ratio of nonzeroes in factor (min/max)  =          0.4749 
number of super nodes                   =           29048
maximum order of a front matrix         =           18240
maximum size of a front matrix          =       166357920
maximum size of a front trapezoid       =       162874176
no. of floating point ops for factor    =      1.1858D+13
no. of floating point ops for solve     =      6.2736D+09
ratio of flops for factor (min/max)     =          0.2607
negative pivot monitoring activated
number of negative pivots encountered   =               0
factorization panel size                =             128
number of cores used                    =               4
GPU acceleration activated                                <---C
percentage of GPU accelerated flops     =         99.6603 <---C
time (cpu & wall) for structure input   =        0.250000        0.252863
time (cpu & wall) for ordering          =        4.230000        4.608260
time (cpu & wall) for value input       =        0.280000        0.278143
time (cpu & wall) for matrix distrib.   =        1.060000        1.063457
time (cpu & wall) for numeric factor    =      188.690000      405.961216
computational rate (mflops) for factor  =    62841.718360    29208.711028
time (cpu & wall) for numeric solve     =       90.560000      318.875234
computational rate (mflops) for solve   =       69.275294       19.674060
effective I/O rate (MB/sec) for solve   =      263.938866       74.958169

i/o stats: unit-Core          file length             amount transferred
                            words       mbytes          words       mbytes
              ----     ----------     --------     ----------     --------
           90-   0     332627968.     2538. MB     868420752.     6626. MB <---A
           90-   1     287965184.     2197. MB     778928589.     5943. MB <---A
           90-   2     397901824.     3036. MB    1091969277.     8331. MB <---A
           90-   3     594935808.     4539. MB    1617254575.    12339. MB <---A
           93-   0      58916864.      450. MB     274514418.     2094. MB
           93-   1      66748416.      509. MB     453739218.     3462. MB
           93-   2      64585728.      493. MB     910612446.     6947. MB
           93-   3     219578368.     1675. MB     968105082.     7386. MB
           94-   0      10027008.       76. MB      20018352.      153. MB
           94-   1      10256384.       78. MB      20455440.      156. MB
           94-   2      10584064.       81. MB      21149352.      161. MB
           94-   3      11272192.       86. MB      22481832.      172. MB

           -------     ----------     --------     ----------     --------
           Totals:    2065399808.    15758. MB    7047649333.    53769. MB


  Memory allocated on core    0        =    1116.498 MB <---B
  Memory allocated on core    1        =     800.319 MB <---B
  Memory allocated on core    2        =    1119.103 MB <---B
  Memory allocated on core    3        =    1520.533 MB <---B
  Total Memory allocated by all cores  =    4556.453 MB


  DSP Matrix Solver         CPU Time (sec) =        285.240
  DSP Matrix Solver     ELAPSED Time (sec) =        739.525
  DSP Matrix Solver      Memory Used ( MB) =       1116.498
   

The I/O statistics for the distributed-memory sparse solver are presented slightly differently than for the shared-memory sparse solver. The first column shows the file unit followed by the core to which that file belongs (the "unit-Core" heading). With this solver, the matrix factor file is JobnameN.DSPtri, or unit 90. The items marked (A) in the example above show the range of sizes for this important file, which also gives some indication of the computational load balance within the solver.

The memory statistics printed at the bottom of this output (the items marked (B)) show how much memory each core used to run the distributed-memory sparse solver. If multiple cores are used on any node of a cluster (or on a workstation), the sum of the memory usage and disk usage for all cores on that node/workstation should be compared against the physical memory and hard drive capacities of the node/workstation. If a single node on a cluster has slow I/O performance or cannot buffer the solver files in memory, it will drag down the performance of all cores, since the solver runs only as fast as its slowest core.
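
For instance, if all four processes in Example 6.2 were run on a single workstation, that machine would need roughly 4.5 GB of memory for the solver (the sum of the values marked (B)) plus nearly 15.8 GB of disk space for the solver files, of which about 12.3 GB belongs to the unit 90 factor files alone.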

The items marked (C) show that GPU acceleration was enabled and used for this model. In this example, over 99% of the matrix factorization flops were accelerated on the GPU hardware. However, the numeric solve time is nearly as large as the numeric factorization time. This is because the job is heavily I/O bound, as evidenced by the effective I/O rate of only about 75 MB/sec during the numeric solve and by the large difference between the CPU and elapsed times for the numeric factorization. Due to this high I/O cost, the overall benefit of using a GPU to accelerate the factorization computations was significantly diminished.
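
The computational rates reported in the summary follow directly from these numbers: the 1.1858D+13 factorization flops divided by 188.69 seconds of CPU time gives roughly 62,842 Mflops, while the same flop count divided by 405.96 seconds of elapsed time gives only about 29,209 Mflops, suggesting that well over half of the elapsed factorization time was spent waiting on I/O rather than computing.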

This example shows a very unbalanced system with high factorization speed but very poor I/O performance. Significant performance gains could be achieved simply by running on a machine with more physical memory. Alternatively, improving the disk configuration, for example by using multiple SSDs in a RAID0 array, would also be beneficial.
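
If a machine with enough physical memory is available, the solver's in-core memory mode can also be requested explicitly so that the matrix factor is held in memory rather than written to the unit 90 files. A minimal sketch, using the INCORE memory option of the DSPOPTION command (see the DSPOPTION command description for its requirements):

DSPOPTION,,INCORE             ! attempt to keep the matrix factor in memory instead of on disk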