Similar to the shared memory sparse solver, performance information for the distributed memory sparse solver is printed by default to the file Jobname.DSP. The command DSPOPTION,,,,,,PERFORMANCE can also be used to print this same information, along with additional solver information, to the standard output file.
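For reference, a minimal solution-phase input sketch that requests this performance summary might look like the following; it assumes a DMP run and that the model and analysis settings are defined elsewhere, and is illustrative only:

   /SOLU
   EQSLV,SPARSE                  ! sparse direct solver (runs as the distributed solver in a DMP session)
   DSPOPTION,,,,,,PERFORMANCE    ! also print the solver performance summary to the standard output file
   SOLVE
   FINISH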
Example 6.2: Distributed Sparse Solver Performance Summary for 4 Processes and 1 GPU shows an example of the performance summary output from the distributed memory sparse solver. Most of this performance information is identical in format and content to what is described in Sparse Solver Performance Output. However, a few items are unique to the distributed memory sparse solver or are presented differently; these are discussed next.
Example 6.2: Distributed Sparse Solver Performance Summary for 4 Processes and 1 GPU
 number of equations                     =          774144
 no. of nonzeroes in lower triangle of a =        30765222
 no. of nonzeroes in the factor l        =      1594957656
 ratio of nonzeroes in factor (min/max)  =          0.4749
 number of super nodes                   =           29048
 maximum order of a front matrix         =           18240
 maximum size of a front matrix          =       166357920
 maximum size of a front trapezoid       =       162874176
 no. of floating point ops for factor    =      1.1858D+13
 no. of floating point ops for solve     =      6.2736D+09
 ratio of flops for factor (min/max)     =          0.2607
 negative pivot monitoring activated
 number of negative pivots encountered   =               0
 factorization panel size                =             128
 number of cores used                    =               4
 GPU acceleration activated                                              <---C
 percentage of GPU accelerated flops     =         99.6603               <---C

 time (cpu & wall) for structure input   =        0.250000        0.252863
 time (cpu & wall) for ordering          =        4.230000        4.608260
 time (cpu & wall) for value input       =        0.280000        0.278143
 time (cpu & wall) for matrix distrib.   =        1.060000        1.063457
 time (cpu & wall) for numeric factor    =      188.690000      405.961216
 computational rate (mflops) for factor  =    62841.718360    29208.711028
 time (cpu & wall) for numeric solve     =       90.560000      318.875234
 computational rate (mflops) for solve   =       69.275294       19.674060
 effective I/O rate (MB/sec) for solve   =      263.938866       74.958169

 i/o stats: unit-Core    file length            amount transferred
                         words       mbytes        words       mbytes
              ----     ----------   --------   ----------     --------
              90- 0    332627968.   2538. MB    868420752.    6626. MB   <---A
              90- 1    287965184.   2197. MB    778928589.    5943. MB   <---A
              90- 2    397901824.   3036. MB   1091969277.    8331. MB   <---A
              90- 3    594935808.   4539. MB   1617254575.   12339. MB   <---A
              93- 0     58916864.    450. MB    274514418.    2094. MB
              93- 1     66748416.    509. MB    453739218.    3462. MB
              93- 2     64585728.    493. MB    910612446.    6947. MB
              93- 3    219578368.   1675. MB    968105082.    7386. MB
              94- 0     10027008.     76. MB     20018352.     153. MB
              94- 1     10256384.     78. MB     20455440.     156. MB
              94- 2     10584064.     81. MB     21149352.     161. MB
              94- 3     11272192.     86. MB     22481832.     172. MB

              -------  ----------   --------   ----------     --------
              Totals:  2065399808. 15758. MB   7047649333.   53769. MB

 Memory allocated on core    0           =     1116.498 MB   <---B
 Memory allocated on core    1           =      800.319 MB   <---B
 Memory allocated on core    2           =     1119.103 MB   <---B
 Memory allocated on core    3           =     1520.533 MB   <---B
 Total Memory allocated by all cores     =     4556.453 MB

 DSP Matrix Solver CPU Time (sec)        =      285.240
 DSP Matrix Solver ELAPSED Time (sec)    =      739.525
 DSP Matrix Solver Memory Used ( MB)     =     1116.498
The I/O statistics for the distributed memory sparse solver are
presented a little differently than they are for the shared memory sparse solver. The
first column shows the file unit followed by the core to which that file belongs
(the "unit-Core" heading). With this solver, the matrix factor file is
JobnameN.DSPtri, or unit 90.
Items marked (A) in the above example show the range of sizes for this important file.
This also gives some indication of the computational load balance within the solver. The
memory statistics printed at the bottom of this output (items marked (B)) indicate
approximately how much memory each core used to run the distributed memory sparse
solver. If multiple cores are used on any node of a cluster (or on a workstation), the
memory and disk usage of all cores on that node/workstation should be summed when
comparing the solver requirements to the physical memory and hard drive capacities of
that node/workstation. If a single node of a cluster has slow I/O performance or cannot
buffer the solver files in memory, it drags down the performance of all cores because
the overall solver is only as fast as the slowest core.
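For example, if all four processes in Example 6.2 ran on the same node, the per-core memory values marked (B) and the unit 90 file lengths marked (A) would be summed before comparing against that node's physical memory and disk capacity. A purely illustrative command-input sketch of this bookkeeping (the parameter names are not part of the solver output):

   ! per-core solver memory from Example 6.2 (items marked B), in MB
   mem_node  = 1116.498 + 800.319 + 1119.103 + 1520.533   ! = 4556.453 MB of memory on this node
   ! per-core factor file (unit 90) lengths from Example 6.2 (items marked A), in MB
   disk_node = 2538 + 2197 + 3036 + 4539                  ! = 12310 MB of disk for the factor alone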
Items marked (C) show that GPU acceleration was enabled and used for this model. In this example, over 99% of the matrix factorization flops were accelerated on the GPU hardware. However, the numeric solve time is almost as great as the numeric factorization time. This is because the job is heavily I/O bound, as evidenced by the slow 75 MB/sec effective I/O rate in the numeric solve computations and by the large difference between the CPU and elapsed times for the numeric factorization computations. Due to this high I/O cost, the overall benefit of using a GPU to accelerate the factorization computations was greatly diminished.
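As a rough illustration of how I/O bound this run is, the gap between wall (elapsed) and CPU time approximates the time each phase spent waiting on disk rather than computing; this is a simplification that ignores other sources of parallel overhead:

   ! times from Example 6.2, in seconds
   fact_wait  = 405.961216 - 188.690000   ! ~217 s of factorization wall time not spent computing
   solve_wait = 318.875234 -  90.560000   ! ~228 s of solve wall time not spent computing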
This example shows a very unbalanced system with high factorization speed but very poor I/O performance. Significant performance improvements could be achieved simply by running on a machine with more physical memory. Alternatively, improving the disk configuration, for example by using multiple SSDs in a RAID0 array, would also be beneficial.