6.8. Identifying CPU, I/O, and Memory Performance

Table 6.1: Obtaining Performance Statistics from Solvers summarizes the information in the previous sections for the most commonly used solver choices. CPU and I/O performance are best measured using the sparse solver statistics. The memory and file size information from the sparse solver is important because it sets the boundaries that indicate which problems can run efficiently in the in-core memory mode and which problems should use the out-of-core memory mode.
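For example, a minimal command sketch that requests the sparse solver performance summary and, optionally, forces a memory mode (the INCORE value and its field position are assumptions here; see the DSPOPTION command description for the exact arguments):

DSPOPTION,,,,,,PERFORMANCE          ! print the detailed performance summary (also written to Jobname.DSP)
! DSPOPTION,,INCORE,,,,PERFORMANCE  ! assumed alternative: also force the in-core memory mode
SOLVE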

The expected results summarized in the table below are for current systems. Continued improvement in processor performance is expected, although processor clock speeds have plateaued due to power requirements and heating concerns. I/O performance is also expected to improve as inexpensive RAID0 configurations see wider use and as SSD technology improves. The sparse solver effective I/O rate statistic can show whether a given system is achieving in-core performance, in-memory buffer cache performance, RAID0 speed, or is limited by single-disk speed.

PCG solver statistics are mostly used to tune preconditioner options, but they also provide a measure of memory bandwidth. The computations in the iterative solver place a high demand on memory bandwidth, so comparing iterative solver performance across systems is a good way to gauge the effect of memory bandwidth on processor performance.
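As an illustration, simply selecting the PCG solver is enough to produce the Jobname.pcs statistics file (a sketch assuming a static analysis; the tolerance shown is the commonly used default, and the MSAVE line is optional):

EQSLV,PCG,1.0E-8    ! select the PCG iterative solver; Jobname.pcs is written automatically
! MSAVE,ON          ! optional: implicit (MSAVE,ON) element handling to reduce memory use, where supported
SOLVE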

Table 6.1: Obtaining Performance Statistics from Solvers

Solver: Sparse
Performance Stats Source: DSPOPTION,,,,,,PERFORMANCE command or Jobname.DSP file
Expected Results on a Balanced System:

10-30 Gflops factor rate for 1 core

50-150 MB/sec effective I/O rate for single conventional drive

150-500 MB/sec effective I/O rate for Windows 64-bit RAID0, 4 conventional drives, or striped Linux configuration

250-1500 MB/sec using high-end SSDs in a RAID0 configuration

5000-10000 MB/sec effective I/O rate for in-core or when system cache is larger than file size

Note: I/O performance can significantly degrade if many MPI processes are writing to the same disk resource. In-core memory mode or use of SSDs is recommended.

Solver: PCG
Performance Stats Source: Jobname.pcs - always written by the PCG iterative solver
Expected Results on a Balanced System:

Total iterations in the hundreds for well-conditioned problems. Over 2000 iterations indicates a difficult PCG job; expect slower solution times.

Level of Difficulty - 1 or 2 typical. Higher level reduces total iterations but increases memory and CPU cost per iteration.

Elements: Assembled indicates elements that do not use the MSAVE,ON feature. Implicit indicates MSAVE,ON elements (reduces memory use for the PCG solver).

Solver: LANPCG (PCG Lanczos)
Performance Stats Source: Jobname.pcs - always written by the PCG Lanczos eigensolver
Expected Results on a Balanced System: same as PCG above, plus:

Number of Load Steps - typically 2 to 3 times the number of modes desired

Average iterations per load case - a few hundred or fewer is desired.

Level of Difficulty - 2-4 is best for very large models. Level 5 uses a direct factorization and is best only for smaller jobs where the system can handle the factorization cost well (see the example after this table).
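A corresponding modal setup sketch for the PCG Lanczos eigensolver (the mode count and Level of Difficulty shown are illustrative assumptions):

/SOLU
ANTYPE,MODAL
MODOPT,LANPCG,20    ! assumed request for 20 modes with the PCG Lanczos eigensolver
PCGOPT,2            ! assumed Level of Difficulty of 2 (see the guidance in the table above)
SOLVE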