To submit a job on a cluster, the MPI parameters must be evaluated on the basis of the cluster features and the problem size.
The relevant cluster features are the following:
N : number of nodes on the cluster.
C : number of cores per node.
M : Memory per node.
The main feature of the problem is its size:
S : Estimated memory footprint of the problem.
The customer needs are:
n : number of nodes used.
p : number of MPI processes.
q : number of MPI processes per node.
c : number of cores used per MPI process.
Rules to evaluate the customer needs:
Satisfy the memory footprint of nodes:
n ≥ S/M. Also take into account the additional memory needed on the leader node.
Since the main goal is to reduce the memory footprint per node, set the number of MPI processes per node to 1:
q = 1.
For good performance of each MPI process, the number of cores used per MPI process should not exceed 8, as additional cores usually bring little further performance gain:
c = 8.
The number of MPI processes:
p = n × q.
The total number of cores used, p × c, must not exceed the number of licenses available for the calculation.
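As a minimal sketch, these rules can be collected into a small helper. The function name mpi_parameters, the argument max_licensed_cores and the default c_max = 8 are illustrative placeholders introduced here, not part of the product.

```python
import math

def mpi_parameters(S, M, C, max_licensed_cores, c_max=8):
    """Illustrative sketch of the rules above: suggest MPI parameters
    from the problem size S and the per-node memory M (same units)."""
    # Satisfy the memory footprint: n >= S/M, rounded up
    # (the leader node may need some extra memory on top of this).
    n = math.ceil(S / M)
    # One MPI process per node to minimize the memory footprint per node.
    q = 1
    # At most 8 cores per MPI process, and never more than the node offers.
    c = min(c_max, C)
    # Total number of MPI processes.
    p = n * q
    # The total core count p * c must stay within the license budget.
    if p * c > max_licensed_cores:
        c = max(1, max_licensed_cores // p)
    return n, q, c, p
```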
To illustrate this, let’s consider the flow of a generalized-Newtonian fluid: 6.3×10⁶ variables and 622×10⁶ coefficients in the matrix. MUMPS estimated the memory required for the factorized matrix at 115 GB. Let’s assume that 4 nodes are required to run in distributed memory.
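Applying the sketch above to this case, and assuming purely for illustration nodes with 32 GB of memory, 16 cores per node and a budget of 32 licensed cores, gives the 4 nodes mentioned above:

```python
# Hypothetical cluster values chosen for illustration only:
# M = 32 GB per node, C = 16 cores per node, 32 licensed cores.
n, q, c, p = mpi_parameters(S=115, M=32, C=16, max_licensed_cores=32)
# -> n = 4 nodes, q = 1 process per node, c = 8 cores per process,
#    p = 4 MPI processes, i.e. p * c = 32 cores in total.
```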
In Figure 31.3, the number of nodes has been fixed to 4. The figure shows the elapsed time for the matrix building, the matrix analysis, the MUMPS factorization and the whole calculation, together with the speedup for the MUMPS factorization and for the whole calculation, as a function of the number of cores per task (1, 2, 4, 8, 16).
The elapsed time of the matrix building decreases as the number of cores per task increases. However, too many cores per task bring diminishing returns: here, 16 cores per task show no significant benefit with respect to 8.
The matrix analysis is affected neither by the number of nodes nor by the number of cores per task.
For the factorization, the speedup reaches 6.3 with 16 cores per task with respect to the run with 1 core per task.
Finally, the speedup for the whole calculation shows a significant reduction of the elapsed time up to 16 cores per task.
Figure 31.3: Elapsed time (solid lines, left axis) of matrix building, matrix analysis, factorization and whole calculation; speedup (dashed lines, right axis) for the MUMPS factorization and the whole calculation, with 4 nodes, as a function of the number of cores per task (1, 2, 4, 8, 16).