- Mon 02 September 2024
- Neuron
- #ml-code, #llm, #ml-acceleration, #slurm, #hpc-concepts
Slurm Cluster usage tips for quick debug and testing
The following is a cheatsheet for Slurm commands used for running tests, debugging, and reserving compute nodes.
To start working with an idle node, you can allocate one exclusively using `salloc --exclusive -N 1`. This command requests a full node for your use, allowing direct work without interference. Once your allocation is granted, you can see which node you received by checking your job in `squeue`. Instead of SSHing manually into the node, it is cleaner and simpler to use `srun -N 1 --pty --exclusive bash` to start an interactive bash session directly on the node. This avoids common SSH permission errors.
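A minimal sketch of this flow is shown below; the node name reported by `squeue` will of course differ on your cluster.

```bash
# Request one full node exclusively; blocks until the allocation is granted
salloc --exclusive -N 1

# From inside the allocation (or another terminal), check which node you got
squeue -u "$USER"

# Start an interactive bash session on the allocated node (no manual SSH needed)
srun -N 1 --pty --exclusive bash
```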
If you want to verify which nodes are idle before requesting one, you can run `sinfo -N -r -l`. This will list nodes and their states, helping you identify available resources. After allocation, if you need specific information about your node, you can retrieve detailed status using `scontrol show nodes`.
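For example (the node name below is only a placeholder):

```bash
# List responding nodes, one line per node, with their current state
sinfo -N -r -l

# Detailed status for all nodes, or for a single node
scontrol show nodes
scontrol show node node001
```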
There may be cases where you need a specific node due to setup constraints or to implement a particular network topology. In that case, instead of asking for a number of nodes, you can use `--nodelist` or `-w` followed by the desired node name when calling `salloc` or `sbatch`. Keep in mind that if a node is already exclusively allocated, your request will hang until the node becomes free.
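For instance, assuming hypothetical node names `node001` and `node002`:

```bash
# Allocate a specific node interactively
salloc --exclusive -w node001

# Pin a batch job to particular nodes
sbatch --nodelist=node001,node002 your_script.sh
```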
Sometimes, you may want to occupy a node for an extended period to perform manual debugging. A simple method is to submit a job that “sleeps” for a long time using `sbatch --nodes=1 --wrap "sleep 36000"`. This holds the node for ten hours (36000 seconds), giving you ample time to test and debug.
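To keep the whole node to yourself for that window, you can combine this with the exclusive flag (a sketch):

```bash
# Hold one full node for 10 hours (36000 s) of manual debugging
sbatch --nodes=1 --exclusive --wrap "sleep 36000"
```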
Caution with `salloc`
While using `salloc` for direct work is convenient, it is generally better to run structured jobs with `sbatch` whenever possible. Scheduled jobs are managed more fairly and usually have better priority than interactive ones. If you install custom dependencies or system-level libraries during your session, make sure they are isolated and do not impact other users. Using shared locations like `/shared_inference` can make it easier to move across nodes without repeated installations.
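One way to keep such installations isolated yet reusable across nodes is a virtual environment on the shared filesystem; the directory and requirements file below are only examples.

```bash
# One-time setup on shared storage (example path)
python3 -m venv /shared_inference/envs/debug-env
source /shared_inference/envs/debug-env/bin/activate
pip install -r requirements.txt   # placeholder for your actual dependencies
```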
If you have an active reservation on the cluster, you can submit jobs against it with `sbatch --reservation=dedicated_two your_script.sh`. You can check available reservations with `scontrol show res` and adjust your job submissions accordingly.
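For example, with `your_script.sh` as a placeholder and the reservation name taken from the command above:

```bash
# Inspect the reservations currently defined on the cluster
scontrol show res

# Submit a job against a specific reservation
sbatch --reservation=dedicated_two your_script.sh
```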
Finally, if your workflow involves setting up environments or installing packages before running experiments, one efficient approach is to create a batch script that performs all setup steps and then blocks by running `/bin/bash`. This allows you to SSH or `srun` into the node and continue your work interactively without leaving stray processes.
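A minimal sketch of such a script is below; the setup steps are placeholders, and if `/bin/bash` exits immediately without a TTY on your cluster, `sleep infinity` is a common alternative for keeping the allocation alive.

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --job-name=setup-and-hold

# Placeholder setup steps
pip install --user -r requirements.txt

# Keep the allocation alive so you can srun/SSH into the node afterwards
/bin/bash
```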
If you found this useful, please cite this post using
Senthilkumar Gopal. (Sep 2024). Slurm Cluster usage tips for quick debug and testing. sengopal.me. https://sengopal.me/posts/slurm-cluster-usage-tips-for-quick-debug-and-testing
or
@article{gopal2024slurmclusterusagetipsforquickdebugandtesting,
  title   = {Slurm Cluster usage tips for quick debug and testing},
  author  = {Senthilkumar Gopal},
  journal = {sengopal.me},
  year    = {2024},
  month   = {Sep},
  url     = {https://sengopal.me/posts/slurm-cluster-usage-tips-for-quick-debug-and-testing}
}