43.9. Checking and Improving Parallel Performance

Fluent offers several tools to help you optimize the performance of your parallel computations. You can check the utilization of your hardware using the parallel check feature. To determine how well the parallel solver is working, you can measure computation and communication times, and the overall parallel efficiency, using the performance meter. You can also control the amount of communication between compute nodes in order to optimize the parallel solver, and take advantage of the automatic load balancing feature of Ansys Fluent.

Information about checking and improving parallel performance is provided in the following sections:

43.9.1. Parallel Check

You can use the Check command in the Parallel ribbon tab to check various factors that affect parallel performance. Checks are performed to identify the following issues:

  • CPU cores are overloaded

  • CPU clock is throttled

  • System memory usage is too high

  • A faster interconnect is available

  • Partitions are imbalanced (if a valid mesh is loaded)

43.9.2. Checking Parallel Performance

The performance meter allows you to report the wall-clock time elapsed during a computation, as well as message-passing statistics. The performance meter is always enabled, so you can simply display the statistics after the computation is completed. To view the current statistics, click Usage in the Parallel ribbon tab (Timer group box).

 Parallel → Timer → Usage

Performance statistics will be displayed in the console.

To clear the performance meter so that past statistics are excluded from future reports, click Reset in the Parallel ribbon tab (Timer group box).

 Parallel → Timer → Reset
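If you are working in batch mode or from a journal file, the equivalent text commands parallel/timer/usage and parallel/timer/reset can be used instead of the ribbon controls (consult the text command list for your release to confirm their availability).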

The following example demonstrates how the current parallel statistics are displayed in the console:

Performance Timer for 472 iterations on 4 compute nodes
  Average wall-clock time per iteration:                0.823 sec
  Global reductions per iteration:                         93 ops
  Global reductions time per iteration:                 0.000 sec (0.0%)
  Message count per iteration:                           3842 messages
  Data transfer per iteration:                         17.226 MB
  LE solves per iteration:                                  7 solves
  LE wall-clock time per iteration:                     0.242 sec (29.4%)
  LE global solves per iteration:                           3 solves
  LE global wall-clock time per iteration:              0.001 sec (0.1%)
  LE global matrix maximum size:                           24
  AMG cycles per iteration:                             8.017 cycles
  Relaxation sweeps per iteration:                        374 sweeps
  Relaxation exchanges per iteration:                       0 exchanges
  LE early protections (stall) per iteration:           0.000 times
  LE early protections (divergence) per iteration:      0.000 times
  Total SVARS touched:                                    385
  Time-step updates per iteration:                       0.11 updates
  Time-step wall-clock time per iteration:              0.003 sec (0.4%)

  Total wall-clock time:                              388.669 sec


Simulation wall-clock time for 472 iterations             426 sec

A description of the parallel statistics is as follows:

  • Average wall-clock time per iteration describes the average real (wall clock) time per iteration.

  • Global reductions per iteration describes the number of global reduction operations (such as variable summations over all processes). This requires communication among all processes.

    A global reduction is a collective operation over all processes for the given job that reduces a vector quantity (the length given by the number of processes or nodes) to a scalar quantity (for example, taking the sum or maximum of a particular quantity). The number of global reductions cannot be calculated from any other readily known quantities. The number is generally dependent on the algorithm being used and the problem being solved.

  • Global reductions time per iteration describes the time per iteration for the global reduction operations.

  • Message count per iteration describes the number of messages sent between all processes per iteration. This is important with regard to communication latency, especially on high-latency interconnects.

    A message is defined as a single point-to-point, send-and-receive operation between any two processes. This excludes global, collective operations such as global reductions. In terms of domain decomposition, a message is passed from the process governing one subdomain to a process governing another (usually adjacent) subdomain.

    The message count per iteration is usually dependent on the algorithm being used and the problem being solved. The message count that is reported is a total over all processes.

    The message count provides some insight into the impact of communication latency on parallel performance. A higher message count indicates that parallel performance may be more adversely affected if a high-latency interconnect is being used; for example, Ethernet has a higher latency than InfiniBand, so a high message count degrades performance more severely over Ethernet than over InfiniBand. (A rough way to estimate this cost is sketched after this list.)

    To check the latency of the overall cluster interconnect, refer to Checking Latency and Bandwidth.

  • Data transfer per iteration describes the amount of data communicated between processors per iteration. This is important with respect to interconnect bandwidth.

    Data transfer per iteration is usually dependent on the algorithm being used and the problem being solved. This number generally increases with increases in problem size, number of partitions, and physics complexity.

    The data transfer per iteration may provide some insight into the impact of communication bandwidth (speed) on parallel performance. The precise impact is often difficult to quantify because it depends on many factors, including the ratio of data transfer to computation and the ratio of communication bandwidth to CPU speed. Data transfer is measured in bytes (see the sketch after this list for a rough estimate of its cost).

    To check the bandwidth of the overall cluster interconnect, refer to Checking Latency and Bandwidth.

  • LE solves per iteration describes the number of linear systems being solved per iteration. This number is dependent on the physics (non-reacting versus reacting flow) and the algorithms (pressure-based versus density-based solver), but is independent of mesh size. For the pressure-based solver, this is usually the number of transport equations being solved (mass, momentum, energy, and so on).

  • LE wall-clock time per iteration describes the wall-clock time per iteration spent in the linear equation solvers (that is, the multigrid solver).

  • LE global solves per iteration describes the number of solutions on the coarsest level of the AMG solver where the entire linear system has been pushed to a single processor (n0). The system is pushed to a single processor to reduce the computation time during the solution on that level. Scaling generally is not adversely affected because the number of unknowns is small on the coarser levels.

  • LE global wall-clock time per iteration describes the time (wall-clock) per iteration for the linear equation global solutions.

  • AMG cycles per iteration describes the average number of multigrid cycles (V, W, flexible, and so on) per iteration.

  • Relaxation sweeps per iteration describes the number of relaxation sweeps (or iterative solutions) on all levels for all equations per iteration. A relaxation sweep is usually one iteration of Gauss-Seidel or ILU.

  • Relaxation exchanges per iteration describes the number of solution communications between processors during the relaxation process in AMG. This number may be less than the number of sweeps because the linear system on coarser levels may be shifted to a single node/process.

  • Time-step updates per iteration describes the number of sub-iterations on the time step per iteration.

  • Time-step wall-clock time per iteration describes the time per sub-iteration.

  • Total wall-clock time describes the total wall-clock time that elapses during the simulation, accounting for all solver-related timings, including communication.

  • Simulation wall-clock time describes the same time as Total wall-clock time, plus the time attributed to postprocessing operations (such as animations, monitors, and file input/output) and any time related to user interaction.
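The message count and data transfer statistics above can be combined with the measured latency and bandwidth of your interconnect (see Checking Latency and Bandwidth) to form a rough, first-order estimate of the communication cost per iteration. The following sketch is only an illustration of that reasoning and is not an Ansys Fluent utility; the message count, data transfer, and per-iteration time are taken from the example report above, and the latency and bandwidth values are of the order measured in the examples of the next section.

  # Rough first-order estimate of per-iteration communication cost (illustrative only).
  messages_per_iter = 3842        # total point-to-point messages per iteration
  data_per_iter_mb  = 17.226      # total data transferred per iteration (MB)
  latency_us        = 48.0        # typical node-to-node latency (microseconds)
  bandwidth_mb_s    = 111.8       # typical node-to-node bandwidth (MB/s)

  latency_cost   = messages_per_iter * latency_us * 1.0e-6   # seconds per iteration
  bandwidth_cost = data_per_iter_mb / bandwidth_mb_s         # seconds per iteration

  print(f"Latency-bound cost per iteration:   {latency_cost:.3f} s")    # ~0.184 s
  print(f"Bandwidth-bound cost per iteration: {bandwidth_cost:.3f} s")  # ~0.154 s
  # Compare these estimates with the 0.823 s average wall-clock time per iteration
  # to judge whether latency or bandwidth is more likely to limit scalability.

In practice the actual cost differs from these estimates, because communication can overlap with computation and not every message crosses the slowest link, but they are useful for judging whether a faster interconnect is likely to help.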

The most relevant quantity is the Total wall-clock time. This quantity can be used to gauge parallel performance (speedup and efficiency) by comparing it to the corresponding time from a serial analysis.
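As a worked illustration of this comparison, suppose the same case required 1400 seconds of total wall-clock time when run in serial (a hypothetical value chosen only for this example); combining it with the 388.669 seconds reported above on 4 compute nodes gives the speedup and parallel efficiency as follows.

  # Speedup and parallel efficiency from total wall-clock times (illustrative only).
  t_serial   = 1400.0     # hypothetical total wall-clock time of the serial run (s)
  t_parallel = 388.669    # total wall-clock time on 4 compute nodes (from the report above)
  n_nodes    = 4

  speedup    = t_serial / t_parallel      # ~3.60
  efficiency = speedup / n_nodes          # ~0.90 (90%)

  print(f"Speedup:    {speedup:.2f}x")
  print(f"Efficiency: {efficiency:.0%}")

Note also that in the example report the solver accounts for 388.669 of the 426 seconds of simulation wall-clock time (roughly 91%); the remainder is attributed to postprocessing, monitors, file input/output, and user interaction.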

43.9.2.1. Checking Latency and Bandwidth

You can check the latency and bandwidth of the overall cluster interconnect, to help identify any issues affecting Ansys Fluent scalability, by clicking Latency and Bandwidth in the Parallel ribbon tab (Network group box).

 Parallel → Network → Latency

Depending on the number of machines and processors being used, a table showing the measured latency between each pair of nodes will be displayed in the console. The table also summarizes the minimum and maximum latency between any two nodes.

Consider the following example when checking for latency:

Latency (usec) with 1000 samples [1.83128 sec]
 ------------------------------------------
 ID       n0     n1     n2    n3    n4    n5
 ------------------------------------------
 n0            48.0   48.2  48.2  48.3   *50 
 n1    48.0           48.2  48.3  48.3   *48 
 n2    48.2    48.2         48.8  49.1   *53 
 n3    48.2    48.3    *49        48.6  48.5 
 n4    48.3    48.3   49.1  48.6         *50 
 n5    49.7    48.5    *53  48.5  49.7      
 
 ------------------------------------------
 Min: 47.9956 [n0<-->n1]
 Max: 52.6836 [n5<-->n2]
 ------------------------------------------

Important:  In the above table, the asterisk (*) marks the maximum value in that row. The smaller the latency, the better.


Six processes (n0 to n5) are spawned. The latency between n0 and n1 is 48.0 usec. Similarly, the latency between n1 and n2 is 48.2 usec. The minimum latency occurs between n0 and n1 and the maximum latency occurs between n2 and n5, as noted in the table. Checking the latency is particularly useful when you are not seeing the expected speedup on a cluster.

 Parallel → Network → Bandwidth

In addition to checking latency, you can check bandwidth. A table showing the amount of data that can be communicated per second between each pair of nodes is displayed in the console. The table also summarizes the minimum and maximum bandwidth between any two nodes.

Consider the following example when checking for bandwidth:

 Bandwidth (MB/s) with 5 messages of size 4MB [4.36388 sec]
 --------------------------------------------
 ID      n0     n1     n2     n3     n4     n5
 --------------------------------------------
 n0          111.8    *55  111.8   97.5  101.3 
 n1   111.8          69.2   98.7  111.7    *51 
 n2    54.7   69.2          72.9  104.8    *45 
 n3   111.8   98.7   72.9          64.0    *45 
 n4    97.6  111.7  104.8    *64          76.9 
 n5   101.2   50.9   45.5    *45    76.9    
 
 --------------------------------------------
 Min: 45.1039 [n5<-->n3]
 Max: 111.847 [n0<-->n3]
 --------------------------------------------

Important:  In the above table, the asterisk (*) marks the minimum value in that row. The larger the bandwidth, the better.


The bandwidth between n0 and n1 is 111.8 MB/s. Similarly, the bandwidth between n1 and n2 is 69.2 MB/s. The minimum bandwidth occurs between n3 and n5 and the maximum occurs between n0 and n3, as noted in the table. Checking the bandwidth is particularly useful when you are not seeing good scalability with relatively large cases.

43.9.3. Optimizing the Parallel Solver

43.9.3.1. Increasing the Report Interval

In Ansys Fluent, you can reduce communication and improve parallel performance by increasing the report interval for residual printing/plotting or other solution monitoring reports. You can modify the value for Reporting Interval in the Run Calculation Task Page.

 Solution → Run Calculation → Calculate...


Important:  Note that you will be unable to interrupt iterations until the end of each report interval.


43.9.3.2. Accelerating View Factor Calculations Using General Purpose Graphics Processing Units (GPGPUs)

When using the Surface to Surface (S2S) radiation model, you can accelerate the calculation of the view factors with the following methods:

  • the Hemicube method with the Cluster to Cluster basis

  • the Ray Tracing method

For the accelerated hemicube method, a combination of MPI/OpenMP/OpenCL models is used to speed up the view factor computations. Irrespective of the number of MPI processes launched, only one MPI process per machine is used for computing view factors. On each machine, that MPI process spawns several OpenMP threads that actually compute the view factors. The number of accelerated hemicube processes will be the same as the number of Ansys Fluent processes. If OpenCL-capable GPUs are available, then a portion of the view factor computations is done on the GPUs in offload mode using OpenCL, to further speed up the computation.

To use the accelerated hemicube method, set up the View Factors and Clustering dialog box and then enter the following text command to compute, write, and read a file with the surface cluster information and view factors: /define/models/radiation/s2s-parameters/compute-clusters-and-vf-accelerated. You will be prompted for the name of the file, the number of GPUs to be used, and the ratio of the workload on 1 GPU versus 1 CPU OpenMP thread (this ratio is based on the time consumed by the GPU and the CPU). At the end of the view factor computations, a recommendation is printed for the GPU/CPU workload ratio to use in future computations.

Note that for the accelerated hemicube method, the OpenCL library should be accessible through the appropriate environment variable (LD_LIBRARY_PATH on lnamd64 or %path% on win64) in order to use the GPU for view factor computations. By default on lnamd64, /usr/lib64 is searched, but if the library is installed in another location, then that location should be specified in the LD_LIBRARY_PATH variable.
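For example, on lnamd64 you might prepend the OpenCL installation directory to the variable before launching Ansys Fluent (the directory shown is only a placeholder for your actual installation): export LD_LIBRARY_PATH=/path/to/opencl/lib64:$LD_LIBRARY_PATH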

When using the accelerated ray tracing method with a GPU in offload mode, you must have one NVIDIA GPU (and only one) available on node-0. The NVIDIA OptiX library is used for tracing the rays. Currently the Ampere GPUs are not supported with this method.

To use the accelerated ray tracing method, set up the View Factors and Clustering dialog box and then enter the following text command to compute, write, and read a file with the surface cluster information and view factors: /define/models/radiation/s2s-parameters/compute-clusters-and-vf-accelerated. You will be prompted for the name of the file.


Note:  CFF (.h5) format is not supported with the accelerated view factor calculations.


43.9.3.3. Accelerating Discrete Ordinates (DO) Radiation Calculations

The accelerated discrete ordinates (DO) radiation solver is computationally faster than the standard DO solver. Note that even though the accelerated DO solver may take more iterations to converge, the overall simulation time is shorter. If NVIDIA GPGPUs are set up in offload mode for the Fluent session, the accelerated DO solver will accelerate the DO computations by using the GPGPUs; in the absence of GPGPUs, this solver can still be used with the CPU cores to accelerate the DO computations.

After you have selected the DO model in the Radiation Model dialog box, you can enable the accelerated DO solver by using the following text command:

define/models/radiation/do-acceleration?

Note that the accelerated DO solver uses the first-order upwind scheme (and ignores whatever selection you have made for the Discrete Ordinates spatial discretization scheme in the Solution Methods task page), along with an explicit relaxation of 1.0.

If you plan to use GPGPUs with the accelerated DO solver, it is recommended that you start NVIDIA's Multi-Process Service (MPS) before launching Ansys Fluent by running the following command: nvidia-cuda-mps-control -d. MPS is known to improve the robustness and performance of GPGPU computations with multiple Fluent processes. Then you must specify how many GPGPUs are to be used per machine when you launch Fluent: you can use the Solver GPGPUs per Machine (Offload Mode) setting in Fluent Launcher or the -gpgpu=ngpgpus command line option. For details, see the Fluent Launcher and command line startup documentation.
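For example, a hypothetical Linux launch that requests 2 GPGPUs per machine in offload mode might look like fluent 3ddp -t16 -gpgpu=2, where 3ddp and -t16 are placeholders for your own solver version and process count.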

Note that using GPGPUs requires HPC licenses. Licensing details can be found in HPC Licensing in the Ansys, Inc. Licensing Guide.

Once the Fluent session is running, you can view and/or select the available GPGPUs on the system using the following TUI commands:

parallel/gpgpu/show

Displays the available GPGPUs on the system.

parallel/gpgpu/select

Selects the GPGPUs to use in offload mode. Note that you can only select up to the number of GPGPUs that you specified on the command line or in Fluent Launcher when starting the session.

Currently the Ampere GPUs are not supported with the accelerated radiation solver on Windows platforms.

43.9.3.3.1. Accelerated DO Model Limitations

The accelerated DO solver is incompatible with some models and settings. When necessary, Fluent will automatically revert to the standard DO solver when the calculation is started and print a message about the conflict. The known incompatibilities are as follows:

  • Shells

  • Axisymmetric cases

  • Eulerian multiphase

  • UDF-specified DOM sources

  • UDF-specified emissivity weighting factor

  • Porosity scaling of radiation

  • Pixelated interior

  • Non-isotropic scattering phase function

  • Non-participating medium

  • Non-conformal interfaces

  • Only the first-order upwind scheme with an explicit relaxation of 1.0 is supported.

43.9.4. Clearing the Linux File Cache Buffers

Processing performance can decrease significantly when the file cache buffers of a Linux machine are full. While this is true for serial runs as well, it is more often a concern when solving large cases in parallel, particularly when using AMD processors. If you see a performance decrease even though the case/machine setup has not changed, that is an indication that the file cache buffers may be to blame.

The filling of the file cache buffers can happen over a period of time as a result of input-output activity. Even after the Ansys Fluent session is exited, by default the operating system does not free up the file cache buffers immediately (unless the operating system is unable to satisfy a malloc subroutine request with available free memory). During memory allocation for a parallel case, this can result in the allocation of memory from a different NUMA domain, and consequently can have a significant impact on performance.

To resolve this issue on Linux machines, you must first ensure that all of the relevant machines are idle (so that you do not adversely affect any jobs that are running). Then you can clear the file cache buffers by performing one of the following actions:

  • Include the -cflush option when launching Ansys Fluent from the command line.

    This option ensures that the file cache buffers are flushed in a separate operation. While this process may take a few minutes to complete (depending on the total memory of the system), it does not require you to have root privileges. (A sample launch command is shown after this list.)

or

  • Enter the (drop-cache) Scheme command (either in the Ansys Fluent console or through your journal file) after launching Fluent but before you read the case file.

    This command will instantaneously clear the pagecaches, dentries, and inodes.


    Important:  Note that in order to use the (drop-cache) command, you must have sudo administrative privileges for the /sbin/sysctl vm.drop_caches=3 command.
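For example, the first option might be invoked from the command line as fluent 3ddp -t16 -cflush, where the solver version (3ddp) and process count (-t16) are placeholders for your own settings.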