Large Scale DSO Known Issues/Troubleshooting

Node Order

For Large Scale DSO jobs submitted from the job submission panel using RSM, localhost must be the first node in the resource selection node list; otherwise, the Large Scale DSO solve with RSM will fail. Set the order in the Submit Job To window, on the Compute Resources tab:

Machine        Tasks
localhost      1
othermachine   1

Cluster Configuration Shared Drive Requirement

All input files (project, etc.) must be present on a shared drive that is accessible from every node of the cluster.
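
As a quick pre-submission sanity check, you can verify that the shared path is visible from each node. The following is a minimal sketch, not part of LS-DSO; the node names and project path are placeholders, and it assumes passwordless ssh access to the nodes.

    import subprocess

    # Hypothetical node names and shared project path; substitute your own.
    NODES = ["node01", "node02", "node03"]
    SHARED_PATH = "/shared/projects/my_design.aedt"

    def path_visible(node, path):
        """Return True if `path` exists when checked from `node` over ssh."""
        result = subprocess.run(["ssh", node, "test", "-e", path],
                                capture_output=True)
        return result.returncode == 0

    for node in NODES:
        status = "OK" if path_visible(node, SHARED_PATH) else "MISSING"
        print(f"{node}: {SHARED_PATH} {status}")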

Parallel Task Limitation for LS-DSO Parametric Variations

There are some limitations when running short parametric runs in parallel, using a subset of cores, and/or running across multiple machines instead of just one. LS-DSO helps achieve close to linear scalability for parametric runs in many scenarios. It does have some overheads, but they are typically small in comparison to the total run time. However, in particular scenarios, a few factors can make them prominent.

The startup of each task involves copying the project and launching ansysedt to solve a subset of the parametric table. Typically, one task solves several variations, and this startup cost is negligible in comparison to the total time. Consider, however, a case where each task solves only a single variation and the actual solve time of one variation is relatively small, say 3-5 minutes. These factors make the startup cost significant. There is also some variance in the solve times of different variations. When all variations are solved in parallel, the total time is determined by the slowest variation, even though the intuitive expectation might be the average time. That can make the perceived overhead worse in this case.
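
The effect is easy to see with a small back-of-the-envelope model. The sketch below is purely illustrative (the startup cost and solve times are made-up numbers, not measurements): with one variation per task, the wall-clock time is set by the slowest variation plus the startup cost, not by the average.

    import random

    random.seed(0)

    N_VARIATIONS = 26     # one variation per task, all solved in parallel
    STARTUP_MIN = 1.0     # assumed per-task startup (project copy + ansysedt launch)

    # Illustrative per-variation solve times in the 3-5 minute range.
    solve_times = [random.uniform(3.0, 5.0) for _ in range(N_VARIATIONS)]

    naive_estimate = STARTUP_MIN + sum(solve_times) / N_VARIATIONS  # average-based intuition
    actual_makespan = STARTUP_MIN + max(solve_times)                # set by the slowest variation

    print(f"average-based estimate: {naive_estimate:.1f} min")
    print(f"actual makespan:        {actual_makespan:.1f} min")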

Another important factor is the number of parallel tasks on each machine relative to the total number of cores on the machine. Each task effectively runs its own ansysedt instance and solves a subset of the parametric table on a copy of the project. The solve process involves significant disk I/O. If you run too many tasks on a single machine, they may end up competing with each other for a single disk, which can slow them down. Contention for memory access may also become a factor.
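
One practical mitigation is to cap the number of tasks per machine and spread the remainder over additional machines. The helper below is a hypothetical sketch of that arithmetic, not an LS-DSO setting; the cap of 13 matches the example in the next paragraph.

    import math

    def machines_needed(total_tasks, max_tasks_per_machine):
        """Minimum number of machines so no machine exceeds the task cap."""
        return math.ceil(total_tasks / max_tasks_per_machine)

    # 26 tasks capped at 13 per machine -> 2 machines, 13 tasks each.
    print(machines_needed(26, 13))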

In one case, 26 tasks were run on a single 26-core machine, which caused significant slowdowns. Spreading the tasks across machines improved scalability. As an example, with 32-core machines in a Linux cloud, we ran this job once with all 26 tasks on a single machine and again across two machines with 13 tasks per machine. The single-machine job finished in about 14 minutes, while the two-machine job finished in about 8.5 minutes. This indicates that spreading tasks across machines helps with scalability by minimizing file read/write conflicts.

Job Restart

There is no provision for stopping and restarting a job. A new job does not reuse previously solved results; it always solves all rows in the table. After an abort or failure, a resubmitted job restarts from the beginning, unless a new parametric table containing only the unsolved rows is created.
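
As a sketch of that workaround, the script below filters an exported parametric table down to its unsolved rows. It assumes the table has been exported to CSV with a "solved" status column; the file names and column name are assumptions, not an LS-DSO format.

    import csv

    SRC = "parametric_table.csv"   # assumed export of the full table
    DST = "unsolved_rows.csv"      # new table containing only unsolved rows

    with open(SRC, newline="") as src, open(DST, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep rows whose assumed "solved" column is not "yes".
            if row.get("solved", "").strip().lower() != "yes":
                writer.writerow(row)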

Linux-Only Issues