Large Scale DSO Theory

The parametric analysis command in Desktop computes simulation results as a function of model parameters, such as geometry dimensions, material properties, and excitations. A parametric analysis either runs on a local machine, where a single engine analyzes each variation serially, or is distributed across machines through a DSO license. Desktop's DSO analysis runs multiple engines in parallel, generating results in a shorter time. In the Regular DSO algorithm, the parametric analysis job (Desktop) runs on the master node, which in turn launches one or more distributed-parallel engines on each machine allocated to the job. Desktop distributes parametric variations among these engines running in parallel. As variations are solved, progress messages and variation results are sent back to Desktop, which persists them into the common results database on the master node, as illustrated below:
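The Regular DSO flow described above can be pictured as a master-side loop that farms variations out to parallel engines and funnels every result into one central store. The following is a minimal Python sketch, not the product's implementation: `solve_variation`, `run_regular_dso`, and the in-memory `results_db` are illustrative stand-ins for the engine solve and the master node's results database, and threads stand in for engines that in reality are separate processes on other machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def solve_variation(index):
    # Stand-in for a distributed-parallel engine solving one parametric
    # variation and reporting its result back to Desktop.
    return index, f"result-{index}"

def run_regular_dso(num_variations, num_engines):
    """Master-side loop: distribute variations among parallel engines and
    persist each result into the common results store as it arrives."""
    results_db = {}  # stand-in for the common results database on the master node
    with ThreadPoolExecutor(max_workers=num_engines) as pool:
        futures = [pool.submit(solve_variation, i) for i in range(num_variations)]
        for future in as_completed(futures):
            index, result = future.result()
            results_db[index] = result  # every result passes through this one point
    return results_db
```

Note that every progress message and result passes through the single master-side writer; that centralization is what the next section identifies as the bottleneck.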

A single engine can now span multiple machines with auto multi-level DSO.

Regular DSO Bottleneck

As the illustration above shows, Regular DSO's speedup is limited by the resources of the centralized Desktop, which acts as a bottleneck. It has been observed that, as the number of engines and the number of variations increase, DSO becomes unreliable beyond a certain point; the term 'large-scale parallel' marks this tipping point. For a given model, a 'large-scale parallel' job denotes scenarios, in terms of the number of distributed-parallel engines and the number of parametric variations, where Regular DSO runs into centralized bottlenecks that result in progressively smaller speedups, unreliability, or both.

With compute resources now economical to acquire and quick to provision, product designers have access to large compute clusters to run their simulations, and they are throwing ever larger numbers of compute resources at simulation jobs in order to obtain results faster. Parametric DSO needs to meet this challenge and target linear speedup for 'large-scale parallel' jobs. The Large Scale DSO feature targets 100% reliability and linear speedup for large-scale parallel DSO jobs.

Key Algorithms/Concepts for Large Scale DSO

Redistribution Feature

A Large Scale Distributed Solve Operation can submit a parametric setup to be solved on multiple machines, and each machine may launch multiple EM-Desktop processes to solve its assigned variations (design points). Variations are initially distributed equally among the tasks (EM-Desktop processes), regardless of each machine's hardware and each variation's complexity. As a result, some tasks finish earlier than others; in extreme cases, the slowest task may run hours behind the fastest.
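The initial, complexity-blind split can be sketched as a plain round-robin partition. This is an illustrative Python sketch under that assumption; `distribute_equally` is a hypothetical name, not the scheduler's actual interface.

```python
def distribute_equally(variation_ids, num_tasks):
    """Static split used at job start: each task receives roughly the same
    number of variations, ignoring hardware speed and per-variation cost."""
    queues = [[] for _ in range(num_tasks)]
    for i, variation_id in enumerate(variation_ids):
        queues[i % num_tasks].append(variation_id)
    return queues
```

For example, `distribute_equally(range(10), 3)` hands the three tasks 4, 3, and 3 variations; if the four variations on the first task happen to be the expensive ones, that task lags far behind the others, which is the imbalance redistribution addresses.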

Redistribution occurs when a task finishes its own assignment: the task calls back to its L2 to ask for a new assignment, L2 forwards the request to L1, and L1 forwards it to L0. L0 may pick one of the slow tasks and remove some of its variations, or it may return that there is no more assignment. When a task is picked to give up unsolved variations, L0 calls an L1 (possibly a different one), that L1 forwards the call to the selected L2, and that L2 requests the task's EM-Desktop to remove some variations from its queue. The EM-Desktop returns the removed variation indexes, or an error code if it fails. If an error is returned, L0 marks the selected task as having failed to respond to the remove request, to avoid picking it again. L0 then returns the result through L1 and L2 back to the requesting EM-Desktop.
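The L0-side decision described above can be sketched as follows. This is a simplified, hypothetical model: the names (`request_work`, `_remove_from`), the choice of the largest backlog as "slowest", and the half-queue split are all assumptions, and the L1/L2 relay hops are collapsed into direct calls.

```python
class L0Coordinator:
    """Sketch of the top-level (L0) redistribution logic, assuming L0 can
    see each task's queue of unsolved variation ids."""

    def __init__(self, task_queues):
        self.queues = task_queues   # task id -> list of unsolved variation ids
        self.unresponsive = set()   # tasks that failed a remove request

    def request_work(self, requester_id):
        """Handle a finished task's request for a new assignment."""
        # Candidate donors: responsive tasks that still have work to spare.
        candidates = [(tid, q) for tid, q in self.queues.items()
                      if tid != requester_id
                      and tid not in self.unresponsive
                      and len(q) > 1]
        if not candidates:
            return []               # no more assignment: the requester can exit
        slowest, queue = max(candidates, key=lambda c: len(c[1]))
        removed = self._remove_from(slowest, len(queue) // 2)
        if removed is None:
            # Remove request failed: mark the task so it is not picked again,
            # then try another donor.
            self.unresponsive.add(slowest)
            return self.request_work(requester_id)
        self.queues[requester_id].extend(removed)
        return removed

    def _remove_from(self, task_id, count):
        # Stand-in for the L1 -> L2 -> EM-Desktop relay that removes unsolved
        # variations from a task's queue; returns their indexes, or None on error.
        queue = self.queues[task_id]
        removed, self.queues[task_id] = queue[-count:], queue[:-count]
        return removed
```

In this sketch an idle task always steals the back half of the slowest queue, so work migrates toward free capacity without any task ever sitting idle while unsolved variations remain elsewhere.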

The requesting Ansys Electronics Desktop enqueues the new assignment and starts solving those variations. If the returned data indicates an error, it makes another call to L2 to request a new assignment. If no more variations are returned, the Ansys Electronics Desktop finishes the simulation and exits.
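The task-side behavior amounts to a simple loop: solve the local queue, ask upward for more when it empties, and exit once nothing comes back. A hedged Python sketch, where `request_more` is a hypothetical callback standing in for the L2/L1/L0 chain (error-code retries are omitted for brevity):

```python
def em_desktop_loop(task_id, initial_queue, request_more):
    """Task-side sketch: solve the assigned variations, then keep asking
    the coordinator for new assignments until none remain."""
    solved = []
    queue = list(initial_queue)
    while queue:
        solved.append(queue.pop(0))              # solve one variation
        if not queue:
            # Queue drained: ask (via L2) for a new assignment and enqueue it.
            queue = list(request_more(task_id))
    return solved                                # no more variations: finish and exit
```

Because every task keeps pulling work until the global pool is empty, the finish times of all tasks converge instead of being fixed by the initial equal split.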