31.3. Distributed-Memory Parallel (DMP) Analysis for Ansys Polyflow

For large 3D simulations, the direct AMF and MUMPS solvers in shared memory require a large amount of core memory. While the AMF solver cannot run in distributed memory, the MUMPS linear solver can run with the Message Passing Interface (MPI) in distributed memory; it is the only linear solver offering this capability in 2024 R1. The leader host reads the files (mesh, data, restart, p3rc, and so on), builds the matrix of the system of equations, and manages I/O, while all hosts, including the leader, collaborate in solving the system with the MUMPS linear solver.

Although Ansys Polyflow can run with MPI on a single machine, there is no real benefit since all memory is still allocated on that machine. However, if several machines are used, the memory footprint is split between them.
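The leader/worker split described above can be pictured with a short, hypothetical mpi4py sketch. This is not Polyflow code: the toy matrix, the broadcast, and the serial solve are placeholders standing in for the real file reading, data distribution, and distributed MUMPS factorization.

  # A minimal, hypothetical mpi4py sketch of the leader/worker split: rank 0
  # plays the leader host that reads the input and builds the system, then
  # hands data to the other ranks. The toy matrix, the broadcast, and the
  # serial solve are placeholders; they are NOT how Polyflow/MUMPS work
  # internally.
  from mpi4py import MPI
  import numpy as np

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()

  if rank == 0:
      # Leader host: read the problem files (mesh, data, restart, ...) and
      # build the matrix of the system of equations; a toy system stands in.
      rng = np.random.default_rng(0)
      A = rng.random((8, 8)) + 8.0 * np.eye(8)   # diagonally dominant toy matrix
      b = rng.random(8)
      problem = (A, b)
  else:
      problem = None

  # The leader distributes the data; a broadcast is only the simplest stand-in
  # for the more selective distribution a distributed direct solver performs.
  A, b = comm.bcast(problem, root=0)

  # Here all ranks, leader included, would cooperate in the distributed
  # factorization and solve (MUMPS with MPI in the real code path).
  if rank == 0:
      x = np.linalg.solve(A, b)                  # serial stand-in for the solve
      print("residual norm:", np.linalg.norm(A @ x - b))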

Table 31.1: Memory Usage and Elapsed Time for Factorization for AMF and MUMPS Solvers With Shared and Distributed Memory compares different configurations for a 3D POMPOM viscoelastic model with 3 modes and a free surface (DOF: 206884, average frontal width: 1996). There are five configurations depending on the solver type (AMF or MUMPS), the number of machines involved in the simulation (M), the total number of MPI processes (N), the number of MPI processes per machine (PN), and the number of OMP threads (TH) (see the sketch after this list):

  1. AMF Solver (runs 1-4).

  2. MUMPS solver in shared memory (runs 5-6).

  3. MUMPS solver in distributed memory on 1 machine (runs 7-9).

  4. MUMPS solver in distributed memory on 2 machines (runs 10-12).

  5. MUMPS solver in distributed memory on 4 machines (runs 13-15).
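As a hedged illustration (not Polyflow syntax), the bookkeeping behind these parameters can be checked with a few lines of Python: for every distributed-memory run in Table 31.1, the total number of MPI processes N equals the number of machines M times the number of processes per machine PN, and each process uses TH OMP threads.

  # Run configurations copied from Table 31.1 (distributed-memory runs only):
  # (run, M machines, N MPI processes, PN processes/machine, TH OMP threads).
  runs = [
      (7, 1, 2, 2, 8), (8, 1, 4, 4, 4), (9, 1, 8, 8, 2),
      (10, 2, 2, 1, 4), (11, 2, 4, 2, 4), (12, 2, 8, 4, 2),
      (13, 4, 4, 1, 2), (14, 4, 4, 1, 4), (15, 4, 4, 1, 8),
  ]

  for run, m, n, pn, th in runs:
      assert n == m * pn, f"run {run}: N must equal M * PN"
      print(f"run {run:2d}: {m} machine(s), {n} MPI processes "
            f"({pn}/machine), {th} OMP threads each "
            f"-> about {pn * th} busy cores per machine")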

With MPI, the OMP threads are used to parallelize the construction of the matrix of the system of equations and the linear algebra operations in the solver.

The solver memory reported for the distributed-memory runs is the value for the most memory-consuming process.

The table also compares the elapsed time for the factorization of one iteration and the associated number of floating-point operations per second (Gflops/s).

Table 31.1: Memory Usage and Elapsed Time for Factorization for AMF and MUMPS Solvers With Shared and Distributed Memory

Run  Solver  M  N  PN  TH  Memory     Precision        Solver       Elapsed Time for  Gflops/s
                           (Shared/   (Single/Double)  Memory (GB)  Factorization/
                           Distrib.)                                Iteration
---  ------  -  -  --  --  ---------  ---------------  -----------  ----------------  --------
1    AMF     -  -  -   8   Shared     Single           9.43         42.7              200
2    AMF     -  -  -   8   Shared     Double           12.54        42.5              200
3    AMF     -  -  -   4   Shared     Single           8.71         65.5              130
4    AMF     -  -  -   4   Shared     Double           11.11        68.5              125
5    MUMPS   -  -  -   8   Shared     -                12.79        23.5              345
6    MUMPS   -  -  -   4   Shared     -                12.26        36.7              220
7    MUMPS   1  2  2   8   Distrib.   -                7.86         83.6              151
8    MUMPS   1  4  4   4   Distrib.   -                5.29         47.4              171
9    MUMPS   1  8  8   2   Distrib.   -                2.98         27.5              294
10   MUMPS   2  2  1   4   Distrib.   -                7.73         83.8              96
11   MUMPS   2  4  2   4   Distrib.   -                5.29         84.9              95
12   MUMPS   2  8  4   2   Distrib.   -                2.85         90.6              89
13   MUMPS   4  4  1   2   Distrib.   -                5.42         103.2             78
14   MUMPS   4  4  1   4   Distrib.   -                5.29         108.3             75
15   MUMPS   4  4  1   8   Distrib.   -                5.29         108.2             75

In shared memory, MUMPS (runs 5-6) is faster than the AMF solver (runs 1-4). However, this advantage comes at the cost of a larger memory footprint: MUMPS stores the factorized matrix in double precision, while AMF stores it in single precision by default. If AMF also stores the factorized matrix in double precision, the memory footprints are nearly the same (runs 2 and 5).

With distributed memory on a single machine (runs 7-9), the elapsed time for factorization is slightly affected by data transfer between MPI processes. This configuration does not reduce the global memory use per machine, so shared memory remains the best configuration on a single machine.

In distributed memory on several machines (runs 10-15), the data transfer required by a direct solver becomes significant and the benefit of the OMP threads (-th) decreases.
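To make these observations concrete, the following short Python sketch (not part of Polyflow) recomputes the comparison for a few representative rows of Table 31.1; the memory values are the peak per process (or the whole solver memory for the shared-memory run) and the time unit is assumed to be seconds.

  # Representative rows copied from Table 31.1:
  # run: (description, solver memory in GB, factorization time/iteration, Gflops/s)
  table = {
      5:  ("MUMPS, shared memory, 8 OMP threads",        12.79,  23.5, 345),
      9:  ("MUMPS, 1 machine, 8 MPI procs x 2 threads",   2.98,  27.5, 294),
      12: ("MUMPS, 2 machines, 8 MPI procs x 2 threads",  2.85,  90.6,  89),
      13: ("MUMPS, 4 machines, 4 MPI procs x 2 threads",  5.42, 103.2,  78),
  }

  ref_mem, ref_time = table[5][1], table[5][2]   # shared-memory MUMPS baseline
  for run, (desc, mem, time, gflops) in table.items():
      print(f"run {run:2d}: {desc}")
      print(f"  solver memory {mem:5.2f} GB ({mem / ref_mem:.2f}x baseline), "
            f"factorization {time:5.1f} ({time / ref_time:.2f}x baseline), "
            f"{gflops} Gflops/s")

Running this shows the trade-off discussed above: on one machine the per-process memory drops to roughly a quarter of the shared-memory footprint (run 9) at a comparable factorization time, while on two or four machines the factorization takes roughly four times longer (runs 12-13).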