17.4.4. Measuring Parallel Performance

In general, there are two reasons why you might want to run an Ansys CFX job in parallel:

  • Getting results faster by combining the processing power of two or more processors. The performance of this setup is measured by the system speedup.

  • Performing simulations that require more memory than can be provided by a single machine. The performance of this feature is best measured using the memory efficiency.

In a real-world computing environment, parallel performance is not always easy to measure. Measured execution times depend on the relative speed of the processors, which in turn is limited by each processor's architecture and its load. On heterogeneous networks, parallel performance is determined directly by the speed of the slowest processor.

17.4.4.1. Wall Clock Performance

To measure wall clock performance, two parameters can be defined, the speedup $S$ and the efficiency $E$:

$$S = \frac{T_s}{T_p} \tag{17–1}$$

$$E = \frac{S}{N} \tag{17–2}$$

where the subscripts $s$ and $p$ refer to the sequential and parallel wall clock execution times respectively, and $N$ is the number of partitions. The best wall clock performance increase that can be expected is a linear speedup ($S = N$), and this corresponds to an efficiency of 100%.
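As a minimal sketch of these two measures, the following Python snippet takes speedup as the ratio of sequential to parallel wall clock time and efficiency as speedup divided by the number of partitions. The function names and the timings are illustrative assumptions, not part of Ansys CFX.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: sequential wall clock time divided by parallel wall clock time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("wall clock times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_partitions: int) -> float:
    """Efficiency: speedup divided by the number of partitions (1.0 means 100%)."""
    if n_partitions < 1:
        raise ValueError("number of partitions must be at least 1")
    return speedup(t_serial, t_parallel) / n_partitions


# Hypothetical timings in seconds for the same case run serially and on 8 partitions.
s = speedup(1000.0, 160.0)
e = efficiency(1000.0, 160.0, 8)
print(f"speedup = {s:.2f}, efficiency = {e:.1%}")
```

With these assumed timings the speedup is 6.25 on 8 partitions, so the run falls short of linear speedup and the efficiency is below 100%.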

Determining performance is not always straightforward, because only part of a CFX-Solver run is actually performed in parallel. Reading and distributing the CFX-Solver input file data, and collecting and writing the results file data, are highly I/O dependent and not parallelized. These stages therefore rely on high disk speeds and fast network communication, and should be excluded from any parallel performance evaluation. The number of seconds taken by the parallelized calculation stage of CFX-Solver is reported on the following line of the CFX-Solver Output file:

CFD Solver wall clock seconds: 2.5113E+04

This line appears at the end of the outer loop iterations (steady-state run) or timesteps (transient run). The number is accurate only if no backup or transient results files were written during this stage of the calculation, because file I/O is a serial process.
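The timing can be pulled out of the output text with a short script. The sketch below assumes only the line format shown above; the function name and the surrounding sample lines are hypothetical, and a real CFX-Solver Output file contains much more text.

```python
import re

# Matches the timing line shown above and captures the number in
# plain or scientific notation (for example 2.5113E+04).
_WALL_CLOCK = re.compile(
    r"CFD Solver wall clock seconds:\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)"
)


def solver_wall_clock_seconds(output_text: str) -> list[float]:
    """Return every 'CFD Solver wall clock seconds' value found in the text."""
    return [float(value) for value in _WALL_CLOCK.findall(output_text)]


# Hypothetical excerpt; only the timing line follows the documented format.
sample = """\
(hypothetical preceding output)
CFD Solver wall clock seconds: 2.5113E+04
(hypothetical following output)
"""

print(solver_wall_clock_seconds(sample))
```

Applying the same function to the output of a serial run and of a parallel run of the same case gives the two wall clock times needed for the speedup and efficiency measures.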

17.4.4.2. Memory Efficiency

The memory efficiency

$$E_m = \frac{M_s}{M_p} \tag{17–3}$$

where

$$M_p = \sum_{i=1}^{N} M_i \tag{17–4}$$

can be used to compare the total memory required by all processes of a parallel run, $M_p$, with the memory required by the equivalent serial run, $M_s$. Here $M_i$ is the memory required by process $i$ of the $N$ parallel processes. The memory efficiency for Ansys CFX will always be smaller than 100%, but should generally be in the range of 80-95% for meshes with reasonable partitions. With respect to workstation clusters, it is important to know that all processes (operating on both leader and follower machines) use a nearly identical amount of memory.
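The comparison can be sketched in Python as the serial memory requirement divided by the summed memory of all parallel processes. The function name, the units, and the figures below are illustrative assumptions; actual values would come from the memory information reported for each run.

```python
def memory_efficiency(serial_memory: float, process_memory: list[float]) -> float:
    """Serial memory divided by the total memory of all parallel processes."""
    if not process_memory:
        raise ValueError("at least one parallel process is required")
    total_parallel = sum(process_memory)
    if serial_memory <= 0 or total_parallel <= 0:
        raise ValueError("memory values must be positive")
    return serial_memory / total_parallel


# Hypothetical figures in MB: one serial run and four parallel processes
# that each use a nearly identical amount of memory.
serial_mb = 800.0
per_process_mb = [250.0, 250.0, 250.0, 250.0]

print(f"memory efficiency = {memory_efficiency(serial_mb, per_process_mb):.0%}")
```

In this assumed case the four processes together need 1000 MB against 800 MB for the serial run, which gives a memory efficiency of 80%.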

Detailed information about the memory requirements of a parallel run is written to the CFX-Solver Output file.

17.4.4.3. Visualizing Mesh Partitions

Partitioning information is appended to the results file during a parallel run of the CFX-Solver. You can view the mesh partitions in CFD-Post by loading the results file and plotting the variable Real partition number on a locator.