- Mon 02 September 2024
- Neuron
- #ml-code, #llm, #ml-acceleration, #slurm, #hpc-concepts
Slurm Cluster usage tips for quick debug and testing
The following is a cheatsheet for Slurm commands used for running tests, debugging, and reserving compute nodes.
To start working with an idle node, you can allocate one exclusively using `salloc --exclusive -N 1`. This command requests a full node for your use, allowing direct work without interference. Once your allocation is granted, you can see which node you received by checking your job in `squeue`. Instead of SSHing manually into the node, it is cleaner and simpler to use `srun -N 1 --pty --exclusive bash` to start an interactive bash session directly on the node. This avoids common SSH permission errors.
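A minimal sketch of this flow is shown below; the node name reported by `squeue` will of course differ on your cluster.

```bash
# Request one full node exclusively; blocks until the allocation is granted
salloc --exclusive -N 1

# From inside the allocation (or another terminal), check which node you got
squeue -u "$USER"

# Start an interactive bash session on the allocated node (no manual SSH needed)
srun -N 1 --pty --exclusive bash
```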
If you want to verify which nodes are idle before requesting one, you can run `sinfo -N -r -l`. This will list nodes and their states, helping you identify available resources. After allocation, if you need specific information about your node, you can retrieve detailed status using `scontrol show nodes`.
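For example (the node name below is only a placeholder):

```bash
# List responding nodes, one line per node, with their current state
sinfo -N -r -l

# Detailed status for all nodes, or for a single node
scontrol show nodes
scontrol show node node001
```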
There may be cases where you need a specific node due to setup constraints or to implement a particular network topology. In that case, instead of asking for a number of nodes, you can use `--nodelist` or `-w` followed by the desired node name when calling `salloc` or `sbatch`. Keep in mind that if a node is already exclusively allocated, your request will hang until the node becomes free.
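For instance, assuming hypothetical node names `node001` and `node002`:

```bash
# Allocate a specific node interactively
salloc --exclusive -w node001

# Pin a batch job to particular nodes
sbatch --nodelist=node001,node002 your_script.sh
```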
Sometimes, you may want to occupy a node for an extended period to perform manual debugging. A simple method is to submit a job that “sleeps” for a long time using `sbatch --nodes=1 --wrap "sleep 36000"`. This holds the node for ten hours (36000 seconds), giving you ample time to test and debug.
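To keep the whole node to yourself for that window, you can combine this with the exclusive flag (a sketch):

```bash
# Hold one full node for 10 hours (36000 s) of manual debugging
sbatch --nodes=1 --exclusive --wrap "sleep 36000"
```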
Caution with `salloc`
While using `salloc` for direct work is convenient, it is generally better to run structured jobs with `sbatch` whenever possible. Scheduled jobs are managed more fairly and usually have better priority than interactive ones. If you install custom dependencies or system-level libraries during your session, make sure they are isolated and do not impact other users. Using shared locations like `/shared_inference` can make it easier to move across nodes without repeated installations.
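One way to keep such installations isolated yet reusable across nodes is a virtual environment on the shared filesystem; the directory and requirements file below are only examples.

```bash
# One-time setup on shared storage (example path)
python3 -m venv /shared_inference/envs/debug-env
source /shared_inference/envs/debug-env/bin/activate
pip install -r requirements.txt   # placeholder for your actual dependencies
```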
If you have an active reservation on the cluster, you can submit jobs against it with `sbatch --reservation=dedicated_two your_script.sh`. You can check available reservations with `scontrol show res` and adjust your job submissions accordingly.
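For example, with `your_script.sh` as a placeholder and the reservation name taken from the command above:

```bash
# Inspect the reservations currently defined on the cluster
scontrol show res

# Submit a job against a specific reservation
sbatch --reservation=dedicated_two your_script.sh
```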
Finally, if your workflow involves setting up environments or installing packages before running experiments, one efficient approach is to create a batch script that performs all setup steps and then blocks by running `/bin/bash`. This allows you to SSH or `srun` into the node and continue your work interactively without leaving stray processes.
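A minimal sketch of such a script is below; the setup steps are placeholders, and if `/bin/bash` exits immediately without a TTY on your cluster, `sleep infinity` is a common alternative for keeping the allocation alive.

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --job-name=setup-and-hold

# Placeholder setup steps
pip install --user -r requirements.txt

# Keep the allocation alive so you can srun/SSH into the node afterwards
/bin/bash
```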
If you found this useful, please cite this post using
Senthilkumar Gopal. (Sep 2024). Slurm Cluster usage tips for quick debug and testing. sengopal.me. https://sengopal.me/posts/slurm-cluster-usage-tips-for-quick-debug-and-testing
or
@article{gopal2024slurmclusterusagetipsforquickdebugandtesting,
  title   = {Slurm Cluster usage tips for quick debug and testing},
  author  = {Senthilkumar Gopal},
  journal = {sengopal.me},
  year    = {2024},
  month   = {Sep},
  url     = {https://sengopal.me/posts/slurm-cluster-usage-tips-for-quick-debug-and-testing}
}