When running a simulation, the solution time is typically dominated by three parts: the time to create the element matrices and form the global matrices, the time to solve the linear system of equations, and the time to calculate derived quantities (such as stress and strain) and other requested results for each element.
Shared-memory parallel (SMP) processing runs a solution over multiple cores on a single machine, which can reduce each of the three main parts of the overall solution time. However, this approach is often limited by memory bandwidth; you typically see very little reduction in solution time beyond four cores.
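The flattening beyond four cores can be illustrated with a toy timing model. The following is a minimal sketch, not a description of how the program measures or predicts performance: the function name `smp_speedup`, the 30% compute-bound share, and the saturation point of four cores are all hypothetical values chosen only to show the shape of the curve. The model assumes the compute-bound share of the work scales with every added core, while the memory-bound share stops scaling once the memory bus is saturated.

```python
def smp_speedup(cores, compute_fraction=0.3, saturation_cores=4):
    """Estimated speedup on `cores` cores under a simple toy model.

    compute_fraction: share of single-core time that is compute-bound and
        scales with every added core (hypothetical value).
    saturation_cores: core count at which memory bandwidth is exhausted
        (hypothetical value); the memory-bound share stops scaling there.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    memory_fraction = 1.0 - compute_fraction
    # Compute-bound work keeps dividing across all cores.
    compute_time = compute_fraction / cores
    # Memory-bound work only divides until the bandwidth limit is reached.
    memory_time = memory_fraction / min(cores, saturation_cores)
    return 1.0 / (compute_time + memory_time)


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores -> estimated speedup {smp_speedup(n):.2f}x")
```

With these assumed values the estimate grows linearly up to four cores and then rises only slowly, because the memory-bound term no longer shrinks. Real behavior depends on the model, the solver, and the hardware.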
The main program functions that run in parallel on shared-memory hardware are:
Solvers such as the Sparse, PCG, ICCG, Block Lanczos, PCG Lanczos, Supernode, and Subspace solvers, which run over multiple cores that share the same memory address space. These solvers typically have limited scalability when used with shared-memory parallelism; in general, very little reduction in time occurs when using more than four cores.
Forming element matrices and load vectors.
Computing derived quantities and other requested results for each element.
Pre- and postprocessing functions such as graphics, selecting, sorting, and other data- and compute-intensive operations.
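Element matrix formation parallelizes naturally because each element matrix depends only on that element's own data. The sketch below illustrates the pattern with Python's standard `concurrent.futures.ThreadPoolExecutor`; it is an assumption-laden illustration, not the program's implementation. The functions `element_stiffness` and `assemble`, the two-node bar element, and the dense list-of-lists global matrix are all hypothetical simplifications. Note also that CPython threads share memory but do not run pure-Python arithmetic in parallel, so this shows the structure of the approach rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def element_stiffness(elem):
    """Return (node_i, node_j, 2x2 stiffness) for a two-node bar element."""
    node_i, node_j, length, ea = elem
    k = ea / length  # axial stiffness EA/L
    return node_i, node_j, [[k, -k], [-k, k]]


def assemble(elements, n_nodes, workers=4):
    """Form element matrices concurrently, then assemble them serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each element matrix is independent, so the map can run concurrently.
        results = list(pool.map(element_stiffness, elements))

    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    # Assembly writes into shared storage, so it is kept serial here
    # to avoid two workers updating the same entry at once.
    for node_i, node_j, ke in results:
        dofs = (node_i, node_j)
        for a in range(2):
            for b in range(2):
                K[dofs[a]][dofs[b]] += ke[a][b]
    return K


# Three bar elements in a chain of four nodes: (node_i, node_j, length, EA)
elements = [(0, 1, 1.0, 2.0), (1, 2, 1.0, 2.0), (2, 3, 1.0, 2.0)]
K = assemble(elements, n_nodes=4)
```

Keeping the assembly step serial is a deliberate simplification. Interior nodes receive contributions from more than one element, so a parallel assembly would need locking, coloring, or per-thread partial matrices to stay correct.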