Large Scale DSO Job Monitoring
The Monitor Job window allows you to monitor the progress and status of Large Scale DSO jobs, including information on variations solved so far, variations currently solving, and the number of variations remaining.
Additional Resources for Large Scale DSO Monitoring
Large Scale DSO does not provide detailed intra-variation monitoring, because such monitoring increases network traffic for large-scale jobs. Additional monitoring resources include:
- Cluster Monitoring Tools – Standard cluster monitoring tools are ideal for job-neutral resource monitoring as they use negligible network bandwidth.
- Detailed Monitoring of Analysis of a Variation – For detailed monitoring, you may want to examine a job's log files. Large Scale DSO writes detailed logs about the machines where engines are running and the local storage location of per-engine distributed databases. You can log in to individual machines for deeper probing of each distributed engine.
The following logs are available:
- Per-Node Logs – There is one desktopjob.log file per node assigned to the job. This log contains information about the node, such as its name, local storage folder, and the number of engines started on it. It is located in <workdir>/<jobid>/r<nodeIndex>. For example, <workdir>/<jobid>/r0 contains the desktopjob.log corresponding to the engines running on the first node of the job, while <workdir>/<jobid>/r2 contains the log corresponding to the engines running on the third node.
- Per-Engine Logs – There is one desktopjob.log file per distributed engine. It is located in <workdir>/<jobid>/r<nodeIndex>/r<taskIndex>. For example, <workdir>/<jobid>/r0/r0 contains the log corresponding to the first engine running on the first node, while <workdir>/<jobid>/r1/r2 contains the log corresponding to the third engine running on the second node. Engine-specific information (such as the local storage location of the engine) is logged here.
- Parametric Analysis Log – This log file is located in <workdir>/<jobid>/r<nodeIndex>/r<taskIndex> and corresponds to Desktop's local-machine parametric batchsolve. It is available only at the end of the analysis and contains information about the variations solved by this engine, along with any info, warning, or error messages.
- Root Log – This is the top-level desktopjob.log file that logs job distribution information such as hierarchical activation and the list of nodes assigned to this job.
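The directory layout described above can be walked with a short script to collect the desktopjob.log files for a job. The following Python sketch is illustrative only and is not part of the product. The function names (node_log_path, engine_log_path, find_job_logs) are hypothetical, and it assumes that the root log sits directly in <workdir>/<jobid>, which the descriptions above do not state explicitly.

```python
import re
from pathlib import Path

LOG_NAME = "desktopjob.log"
# Matches the r<index> folder names used for nodes and engines.
_INDEXED_DIR = re.compile(r"^r(\d+)$")


def node_log_path(job_dir, node_index):
    """Expected per-node log path: <workdir>/<jobid>/r<nodeIndex>."""
    return Path(job_dir) / f"r{node_index}" / LOG_NAME


def engine_log_path(job_dir, node_index, task_index):
    """Expected per-engine log path: <workdir>/<jobid>/r<nodeIndex>/r<taskIndex>."""
    return Path(job_dir) / f"r{node_index}" / f"r{task_index}" / LOG_NAME


def _indexed_subdirs(parent):
    """Return (index, path) pairs for subfolders named r<index>, sorted by index."""
    found = []
    for child in parent.iterdir():
        match = _INDEXED_DIR.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])


def find_job_logs(job_dir):
    """Collect the desktopjob.log files found under <workdir>/<jobid>.

    Assumption: the root log is stored directly in the job folder.
    """
    job_dir = Path(job_dir)
    logs = {"root": None, "nodes": {}, "engines": {}}
    root_log = job_dir / LOG_NAME
    if root_log.is_file():
        logs["root"] = root_log
    for node_index, node_dir in _indexed_subdirs(job_dir):
        node_log = node_dir / LOG_NAME
        if node_log.is_file():
            logs["nodes"][node_index] = node_log
        for task_index, task_dir in _indexed_subdirs(node_dir):
            engine_log = task_dir / LOG_NAME
            if engine_log.is_file():
                logs["engines"][(node_index, task_index)] = engine_log
    return logs
```

Calling find_job_logs on the job folder returns the root log (if present), the per-node logs keyed by node index, and the per-engine logs keyed by (node index, task index). For example, the key (1, 2) corresponds to the third engine running on the second node.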
For a complete discussion of methods for aborting jobs or specific tasks, see Aborting a Large Scale DSO Simulation under Large Scale DSO for Parametric Analysis.