Monitoring Slurm Cluster Jobs
To check whether a Slurm cluster is ready or to get information on batch jobs that have been submitted to a Slurm cluster, you can issue Slurm commands at a command prompt on the Slurm Controller machine. For an autoscaling cluster, the Slurm Controller runs on the cluster head node. See Connecting to an Autoscaling Cluster Head Node. For standard HPC clusters, the Slurm Controller may be on a separate virtual desktop. See Connecting to a Linux Virtual Machine Using SSH.
The two commands used to get cluster and job information, sinfo and squeue, are described below. For a full list of Slurm commands, go to https://slurm.schedmd.com/quickstart.html.
Slurm Cluster State (sinfo)
To check whether a Slurm cluster is fully defined and ready, use the sinfo command.
If the cluster is ready, the number of nodes reported by sinfo matches the number you requested. If it is not ready yet, wait a few minutes and run sinfo again; any delay is typically only one to two minutes.
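As a sketch, the readiness check can be scripted by counting the nodes sinfo reports in a given state. The snippet below simulates sinfo output with a here-doc so it is self-contained; on a real cluster, replace the here-doc with the actual command shown in the comment. The node counts are illustrative, not from a real cluster.

```shell
# Count the nodes sinfo reports as idle.
# Simulated output for illustration; on a real cluster use:
#   sinfo -h -o "%T %D"
# (-h suppresses the header, %T is the node state, %D the node count)
sinfo_output=$(cat <<'EOF'
idle 4
allocated 2
EOF
)

idle_nodes=$(echo "$sinfo_output" | awk '$1 == "idle" {sum += $2} END {print sum+0}')
echo "idle nodes: $idle_nodes"
# prints: idle nodes: 4
```

If the idle count stays below what you requested after a few minutes, the cluster is likely still provisioning nodes.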
For more information about the sinfo command, go to https://slurm.schedmd.com/sinfo.html.
Slurm Job State (squeue)
A batch job goes through several states in the course of its execution. To view the current state of a batch job managed by Slurm, use the squeue command.
In the command output, the job state is indicated in the 'ST' column.
Typical job states are described below.
State (ST) | Meaning | Description |
---|---|---|
PD | Pending | The job is queued and waiting for resource allocation. |
R | Running | The job is allocated to a node and is running. |
CG | Completing | The job is finished but some processes on some nodes may still be active. |
CD | Completed | The job has completed successfully. |
F | Failed | The job has failed. |
TO | Timeout | The job reached its time limit and was terminated by Slurm. |
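The states above can also be read from a script by parsing the ST column of squeue output. The following sketch simulates squeue output with a here-doc so it is self-contained; the job ID and values are hypothetical, and on a real cluster you would use the command shown in the comment instead.

```shell
# Extract the state (ST column) of one job from squeue output.
# Simulated output for illustration; on a real cluster use:
#   squeue -h -j <jobid> -o "%t"
# (-h suppresses the header, %t prints the compact state code)
squeue_output=$(cat <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234 debug myjob alice R 5:02 2 node[01-02]
EOF
)

job_state=$(echo "$squeue_output" | awk 'NR > 1 && $1 == "1234" {print $5}')
echo "job 1234 state: $job_state"
# prints: job 1234 state: R
```

A state of R here means the job is running; PD would mean it is still waiting for resources.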
For more information about the squeue command and a complete list of possible job states, go to https://slurm.schedmd.com/squeue.html.