Large Scale DSO Job Monitoring
Large scale DSO avoids detailed intra-variation monitoring because such monitoring increases network traffic for large scale jobs. Large scale DSO jobs are monitored as follows:
- Cluster monitoring tools – The resource usage (CPU, memory, network) of large scale DSO jobs is monitored using standard cluster monitoring tools. Such job-neutral resource monitoring is ideal because it consumes negligible network bandwidth, CPU, and memory.
- Detailed monitoring of a variation's analysis – For any detailed monitoring, you must examine the information provided in the job's log files. Specifically, the large scale DSO job writes detailed logs that identify the machines where engines are running and the local storage location of the per-engine distributed database. With this information, you can log in to individual machines to probe each distributed engine more deeply. The following logs are available:
- Per-node logs – There is one desktopjob.log file per node assigned to the job. This log contains information about the node, such as its name, local storage folder, and the number of engines started on it. It is located in <workdir>/<jobid>/r<nodeIndex>. For example, <workdir>/<jobid>/r0 has the desktopjob.log corresponding to the engines running on the first node of the job, while <workdir>/<jobid>/r2 has the log corresponding to the engines running on the third node.
- Per-engine logs – There is one desktopjob.log file per distributed engine. It is located in <workdir>/<jobid>/r<nodeIndex>/r<coreIndex>. For example, <workdir>/<jobid>/r0/r0 has logs corresponding to the first engine running on the first node, while <workdir>/<jobid>/r1/r2 has logs corresponding to the third engine running on the second node. Engine-unique information (such as local storage of this engine) is logged here.
- Parametric analysis log – This log file is located in the <workdir>/<jobid>/r<nodeIndex>/r<coreIndex> folder and corresponds to Desktop's local-machine parametric -batchsolve. It is available only at the end of the analysis and contains information about the variations solved by this engine, along with any info/warning/error messages.
- Root desktopjob.log – This is the top-level log. It records job distribution information such as hierarchical activation and the list of nodes assigned to this job.
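The per-node and per-engine folder layout described above can be walked with a short script to list which logs exist for a job. The following is a minimal sketch, not a tool shipped with the product: the function names find_job_logs and _ranked_subdirs are hypothetical, and the script assumes only the r<nodeIndex>/r<coreIndex> layout and the desktopjob.log file name stated above. It does not attempt to locate the root log or the parametric analysis log, because their exact file paths are not given here.

```python
import re
from pathlib import Path

# Folder names of the form r<index>, as in r<nodeIndex> and r<coreIndex>.
RANK_DIR = re.compile(r"^r(\d+)$")
LOG_NAME = "desktopjob.log"


def _ranked_subdirs(parent):
    """Return (index, path) pairs for child folders named r<index>, sorted by index."""
    found = []
    for child in parent.iterdir():
        match = RANK_DIR.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])


def find_job_logs(job_dir):
    """Collect per-node and per-engine desktopjob.log files under <workdir>/<jobid>.

    Returns two dicts:
      node_logs   maps nodeIndex -> path of that node's log
      engine_logs maps (nodeIndex, coreIndex) -> path of that engine's log
    Folders without a desktopjob.log are skipped.
    """
    job_dir = Path(job_dir)
    node_logs = {}
    engine_logs = {}
    for node_index, node_dir in _ranked_subdirs(job_dir):
        node_log = node_dir / LOG_NAME
        if node_log.is_file():
            node_logs[node_index] = node_log
        for core_index, engine_dir in _ranked_subdirs(node_dir):
            engine_log = engine_dir / LOG_NAME
            if engine_log.is_file():
                engine_logs[(node_index, core_index)] = engine_log
    return node_logs, engine_logs


if __name__ == "__main__":
    import sys

    # Usage: python find_logs.py <workdir>/<jobid>
    nodes, engines = find_job_logs(sys.argv[1])
    for index, path in nodes.items():
        print(f"node {index}: {path}")
    for (node_index, core_index), path in engines.items():
        print(f"node {node_index}, engine {core_index}: {path}")
```

Indices are zero-based, so the key (1, 2) refers to the third engine on the second node, matching the r1/r2 example above.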