Monitoring Slurm Cluster Jobs
To check whether a Slurm cluster is ready or to get information about batch jobs that have been submitted to a Slurm cluster, you can connect to the cluster head node and issue Slurm commands at a command prompt.
The two commands used to get cluster and job information, sinfo and squeue, are described below. For a full list of Slurm commands, go to https://slurm.schedmd.com/quickstart.html.
Slurm Cluster State (sinfo)
To check whether a Slurm cluster is fully defined and ready, use the sinfo command.
If the cluster is ready, the number of nodes reported by sinfo equals the number of nodes you requested. If the numbers do not match, wait a few minutes and run sinfo again. Any delay is typically only 1 to 2 minutes.
For more information about the sinfo command, go to https://slurm.schedmd.com/sinfo.html.
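
The readiness check described above can be scripted. The following is a minimal Python sketch, not an official tool: it assumes it runs on the cluster head node, that sinfo is on the PATH, and that nodes in the idle, mix, or alloc state count as ready. The function names and the USABLE_STATES set are illustrative choices, not part of Slurm. The options used are -h (no header) and -o "%D %t" (node count and compact node state).

```python
from __future__ import annotations

import subprocess

# Assumption: these compact node states are treated as "ready" for this check.
# A state carrying a flag suffix (for example "idle*", a node that is not
# responding) is deliberately not matched.
USABLE_STATES = {"idle", "mix", "alloc"}


def read_sinfo() -> str:
    """Run sinfo and return one '<node count> <state>' line per group of nodes."""
    result = subprocess.run(
        ["sinfo", "-h", "-o", "%D %t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def count_nodes_by_state(sinfo_output: str) -> dict[str, int]:
    """Sum the node counts for each state found in the sinfo output."""
    counts: dict[str, int] = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        state = fields[1].lower()
        counts[state] = counts.get(state, 0) + int(fields[0])
    return counts


def cluster_is_ready(sinfo_output: str, expected_nodes: int) -> bool:
    """Return True when at least expected_nodes nodes are in a usable state."""
    counts = count_nodes_by_state(sinfo_output)
    usable = sum(n for state, n in counts.items() if state in USABLE_STATES)
    return usable >= expected_nodes
```

On the head node, a call such as cluster_is_ready(read_sinfo(), expected_nodes=4) would report whether a four-node request has been fulfilled. Note that a node belonging to more than one partition may be reported, and therefore counted, more than once.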
Slurm Job State (squeue)
A batch job goes through several states in the course of its execution. To view the current state of a batch job managed by Slurm, use the squeue command.
In the command output, the job state is indicated in the 'ST' column.
Typical job states are described below.
ST value | State | Description |
---|---|---|
PD | Pending | The job is queued. It is waiting for resource allocation. |
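R | Running | The job has been allocated resources and is running. |
CG | Completing | The job is in the process of completing. Some processes on some nodes may still be active. |
CD | Completed | The job has terminated all processes on all nodes with an exit code of zero. |
F | Failed | The job has terminated with a non-zero exit code or another failure condition. |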
TO | Timeout | The job has reached its runtime limit and has been terminated by Slurm. |
For more information about the squeue command and a complete list of possible job states, go to https://slurm.schedmd.com/squeue.html.
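
To read job states from a script rather than by eye, the squeue output can be parsed and the ST codes translated using the table above. The following Python sketch assumes it runs on the head node with squeue on the PATH. The function names and the STATE_NAMES mapping are illustrative and cover only the states listed in this section. The options used are -h (no header), -t all (include jobs in every state), and -o "%i %t" (job ID and compact state).

```python
from __future__ import annotations

import subprocess
from collections import Counter

# ST codes from the table above; any other code is passed through unchanged.
STATE_NAMES = {
    "PD": "Pending",
    "R": "Running",
    "CG": "Completing",
    "CD": "Completed",
    "F": "Failed",
    "TO": "Timeout",
}


def read_squeue() -> str:
    """Run squeue and return one '<job id> <ST code>' line per job."""
    result = subprocess.run(
        ["squeue", "-h", "-t", "all", "-o", "%i %t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_job_states(squeue_output: str) -> dict[str, str]:
    """Map each job ID to a readable state name."""
    jobs: dict[str, str] = {}
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, code = fields
        jobs[job_id] = STATE_NAMES.get(code, code)
    return jobs


def summarize(jobs: dict[str, str]) -> Counter:
    """Count how many jobs are in each state."""
    return Counter(jobs.values())
```

For example, summarize(parse_job_states(read_squeue())) gives a per-state job count. Finished jobs generally remain visible to squeue only for a short time, depending on the site configuration, so states such as Completed, Failed, and Timeout may not appear for older jobs.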