Large Scale DSO Job Monitoring
Large scale DSO avoids detailed intra-variation monitoring because it increases network traffic for large-scale jobs. Large scale DSO jobs are monitored as follows:
- Cluster Monitoring Tools: The resource usage (CPU, Memory, Network) of large scale DSO jobs is monitored using standard cluster monitoring tools. Such job-neutral resource monitoring is ideal because it consumes negligible network bandwidth, CPU, and memory.
- Detailed Monitoring of Analysis of a Variation: For any detailed monitoring, you must examine the information provided in the job's log files. Specifically, the large-scale DSO job writes detailed logs that identify the machines where engines are running and the local storage location of the per-engine distributed database. With this information, you can log in to individual machines to probe each distributed engine more deeply.
The following logs are available:
- Per-node Logs:
There is one 'desktopjob.log' file per node assigned to the job. This log contains information about the node, such as its name, local storage folder, and the number of engines started on it. It is located in <workdir>/<jobid>/r<nodeIndex>. For example, <workdir>/<jobid>/r0 has the desktopjob.log corresponding to the engines running on the first node of the job, while <workdir>/<jobid>/r2 has the log corresponding to the engines running on the third node.
- Per-engine Logs:
There is one desktopjob.log file per distributed engine. It is located in <workdir>/<jobid>/r<nodeIndex>/r<taskIndex>. For example, <workdir>/<jobid>/r0/r0 has the log corresponding to the first engine running on the first node, while <workdir>/<jobid>/r1/r2 has the log corresponding to the third engine running on the second node. Engine-specific information (such as the local storage of this engine) is logged here.
- Parametric Analysis Log:
This log file is located in the <workdir>/<jobid>/r<nodeIndex>/r<taskIndex> folder and corresponds to Desktop's local-machine parametric 'batchsolve'. It is available only at the end of the analysis and contains information about the variations solved by this engine, along with any info, warning, or error messages.
- Root Log:
This is the top-level log that records job distribution information, such as hierarchical activation and the list of nodes assigned to this job.
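
The per-node and per-engine folder layout described above can be walked with a short script. The following Python sketch is illustrative only and is not part of the product: the function name find_job_logs is hypothetical, and it assumes only what is stated above, namely that each r<nodeIndex> and r<taskIndex> folder may contain a file named desktopjob.log and that indices are zero-based. It does not attempt to locate the parametric analysis log or the root log, because their file names are not given here.

```python
import re
from pathlib import Path

# Matches folder names of the form r<index>, e.g. "r0" or "r12".
_INDEX_DIR = re.compile(r"^r(\d+)$")


def find_job_logs(job_dir):
    """Collect desktopjob.log files under <workdir>/<jobid>.

    Hypothetical helper. Returns two dicts:
      node_logs:   {nodeIndex: path to the per-node log}
      engine_logs: {(nodeIndex, taskIndex): path to the per-engine log}
    Indices are zero-based, so (1, 2) is the third engine on the second node.
    """
    job_dir = Path(job_dir)
    node_logs = {}
    engine_logs = {}
    for node_dir in sorted(job_dir.iterdir()):
        node_match = _INDEX_DIR.match(node_dir.name)
        if not (node_dir.is_dir() and node_match):
            continue
        node_index = int(node_match.group(1))

        # Per-node log: <jobdir>/r<nodeIndex>/desktopjob.log
        node_log = node_dir / "desktopjob.log"
        if node_log.is_file():
            node_logs[node_index] = node_log

        # Per-engine logs: <jobdir>/r<nodeIndex>/r<taskIndex>/desktopjob.log
        for task_dir in sorted(node_dir.iterdir()):
            task_match = _INDEX_DIR.match(task_dir.name)
            if not (task_dir.is_dir() and task_match):
                continue
            engine_log = task_dir / "desktopjob.log"
            if engine_log.is_file():
                engine_logs[(node_index, int(task_match.group(1)))] = engine_log
    return node_logs, engine_logs
```

Calling find_job_logs on the <workdir>/<jobid> folder returns the log paths keyed by index, which can then be opened to read the node name or the local storage location of an engine before logging in to that machine.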
For a complete discussion of methods for aborting jobs or specific tasks, see Aborting a Large Scale DSO Simulation under Large Scale DSO for Parametric Analysis.