SLURM Workload Manager
Slurm is the software that manages jobs that are submitted to the compute nodes. Information about how to submit and manage jobs is listed below.
BisonNet Partitions (Queues)
There are four job submission partitions available to all users, based on the expected run time of a job. Each partition has a per-user limit on the number of cores which can be used at one time.
short (the default partition) - Max time = 1 day, Max # of cores = 40
medium - Max time = 7 days, Max # of cores = 20
long - Max time = 30 days, Max # of cores = 15
lowpriority - Max time = 30 days, Max # of cores = 30
Users are limited to two GPUs at a time across all partitions.
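The time limits above can be turned into a small helper that picks the shortest partition able to fit a job. This is only a sketch: the function name pick_partition is hypothetical, and it takes the expected run time in whole days. It never selects lowpriority, which would have to be requested explicitly.

```shell
# Hypothetical helper: print the shortest partition whose time limit
# fits a job expected to run for the given number of whole days.
pick_partition() {
    days=$1
    if [ "$days" -le 1 ]; then
        echo short
    elif [ "$days" -le 7 ]; then
        echo medium
    elif [ "$days" -le 30 ]; then
        echo long
    else
        echo "no partition allows more than 30 days" >&2
        return 1
    fi
}

pick_partition 3    # prints: medium
```

The result can be passed straight to the -p option, for example sbatch -p "$(pick_partition 3)" myscript.sh.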
Job Scheduling
When all cluster cores are occupied, pending jobs are prioritized using a fairshare algorithm which incorporates job age, job size, and user resource consumption history. Pending jobs may preempt (suspend) a running job based on partition priority. Short jobs may preempt medium and long jobs, while medium jobs may preempt long jobs. All other jobs may preempt lowpriority jobs.
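The preemption order described above can be written down as a ranking, where a pending job may preempt a running job in any lower-ranked partition. The following is a sketch of those rules only, not the scheduler's actual configuration. The function names are hypothetical, and it assumes that jobs in the same partition do not preempt one another.

```shell
# Hypothetical helper: rank partitions by preemption priority.
# A higher rank may preempt a lower rank.
partition_rank() {
    case $1 in
        short)       echo 3 ;;
        medium)      echo 2 ;;
        long)        echo 1 ;;
        lowpriority) echo 0 ;;
        *)           echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Succeeds (exit status 0) if a pending job in partition $1 may
# preempt a running job in partition $2.
may_preempt() {
    pending=$(partition_rank "$1") || return 2
    running=$(partition_rank "$2") || return 2
    [ "$pending" -gt "$running" ]
}

if may_preempt short long; then
    echo "short may preempt long"
fi
```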
SLURM Commands
General commands
Get documentation on a command:
man <command>
Try the following commands:
man sbatch
man squeue
man scancel
Submitting jobs
The following example script demonstrates how to specify some important parameters for a job. It is necessary to specify the number of cores (if running multi-threaded or parallelized jobs), GPUs (if using GPUs), and memory (if the job is expected to use more than 8 GB per core); otherwise your job may fail or not run as expected. You are not required to use all of these options, however, and should only use the ones you need.
This script performs a simple task: it generates a file of random numbers and then sorts it.
#!/bin/bash
#SBATCH -p short # partition (queue)
#SBATCH -N 1 # (leave at 1 unless using multi-node specific code)
#SBATCH -n 1 # number of cores
#SBATCH --mem-per-cpu=8192 # memory per core
#SBATCH --job-name="myjob" # job name
#SBATCH -o slurm.%N.%j.stdout.txt # STDOUT
#SBATCH -e slurm.%N.%j.stderr.txt # STDERR
#SBATCH --mail-user=username@bucknell.edu # address to email
#SBATCH --mail-type=ALL # mail events (NONE, BEGIN, END, FAIL, ALL)

for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done

sort -n SomeRandomNumbers.txt
If you require one or more GPUs, add a line similar to the following:
#SBATCH --gres=gpu:1 # number of GPUs
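For a multi-threaded job, a variant of the example script might request several cores and hand the allocation to the program. The sketch below assumes GNU sort (for its --parallel option) and follows the convention used on this page that -n is the number of cores. The 2048 MB memory request, the 1000-number input, and the file name SortedNumbers.txt are illustrative only.

```shell
#!/bin/bash
#SBATCH -p short              # partition (queue)
#SBATCH -N 1                  # keep all cores on one node
#SBATCH -n 4                  # number of cores
#SBATCH --mem-per-cpu=2048    # memory per core, in MB

# Slurm sets SLURM_NTASKS when -n is given; fall back to 1 so the
# script also runs outside a job.
threads=${SLURM_NTASKS:-1}

# Generate the numbers in a scratch file.
numbers=$(mktemp)
for i in $(seq 1 1000); do
    echo "$RANDOM"
done > "$numbers"

# Sort with as many threads as cores were requested.
sort -n --parallel="$threads" "$numbers" > SortedNumbers.txt
rm -f "$numbers"
```

Reading the core count from the environment keeps the #SBATCH request and the program's thread count from drifting apart when one of them is edited.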
Now you can submit your job with the command:
sbatch myscript.sh
To validate your script and get an estimate of when the job would start, without actually submitting it, use:
sbatch --test-only myscript.sh
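When a job is submitted from another script, it is often useful to capture the job ID for later squeue or sacct calls. One way to do this, sketched below, is the --parsable option of sbatch, which prints the job ID, optionally followed by a semicolon and the cluster name. The wrapper name submit_job is hypothetical.

```shell
# Hypothetical wrapper: submit a script and print only the job ID.
# With --parsable, sbatch prints "jobid" or "jobid;cluster".
submit_job() {
    out=$(sbatch --parsable "$@") || return 1
    echo "${out%%;*}"    # drop the optional ";cluster" suffix
}

# Usage (on a login node):
#   jobid=$(submit_job myscript.sh)
#   squeue -j "$jobid"
```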
Interactive Jobs
You may run an interactive session on a compute node with a command such as:
srun -n 2 -p short --pty /bin/bash
The --pty flag indicates an interactive terminal, and /bin/bash denotes the shell to be run. Other options (such as -n or -p) may be specified as in a submission script. Please note that interactive jobs are subject to the same partition-based time/core limits as batch jobs.
Information on jobs
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in the short partition for a user:
squeue -u <username> -p short
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
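The squeue commands above can also be combined into a quick summary. The sketch below counts a user's jobs by state; the function name job_state_counts is hypothetical, and it relies on the squeue options -h (no header) and -o "%T" (print only the job state).

```shell
# Hypothetical helper: count a user's jobs by state.
# -h drops the header line; -o "%T" prints only the job state.
job_state_counts() {
    squeue -h -u "$1" -o "%T" | sort | uniq -c
}

# Usage:
#   job_state_counts "$USER"
```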
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
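MaxRSS is typically reported with a unit suffix (for example 2097152K), which makes it awkward to compare against a --mem-per-cpu request given in MB. The helper below is a sketch of one way to convert it. The function name maxrss_to_mb is hypothetical, and treating a value with no suffix as plain bytes is an assumption.

```shell
# Hypothetical helper: convert a MaxRSS value such as "2097152K",
# "512M" or "1.5G" to megabytes.
maxrss_to_mb() {
    echo "$1" | awk '{
        value = $1 + 0                    # leading numeric part
        unit  = substr($1, length($1))    # last character
        if      (unit == "K") mb = value / 1024
        else if (unit == "M") mb = value
        else if (unit == "G") mb = value * 1024
        else                  mb = value / 1048576   # assumed: plain bytes
        printf "%.1f\n", mb
    }'
}

maxrss_to_mb 2097152K   # prints: 2048.0
```

If the converted value is well below the memory requested, the request can probably be lowered for similar jobs in the future.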
Controlling jobs
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
To hold a pending job so that it does not start:
scontrol hold <jobid>
To release a held job so that it can be scheduled again:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
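Because scancel acts immediately, it can help to see which jobs a bulk cancel would affect before running it. The sketch below lists a user's pending job IDs with squeue and only then calls scancel. The function names are hypothetical, and a job that starts running between the two calls will not be cancelled.

```shell
# Hypothetical helper: list the pending job IDs that
# "scancel -t PENDING -u <username>" would cancel.
pending_job_ids() {
    squeue -h -u "$1" -t PENDING -o "%i"
}

# Show the user's pending jobs, then cancel them.
cancel_pending() {
    ids=$(pending_job_ids "$1")
    if [ -z "$ids" ]; then
        echo "no pending jobs for $1"
        return 0
    fi
    # $ids is left unquoted so the IDs print on one line.
    echo "cancelling pending jobs:" $ids
    scancel -t PENDING -u "$1"
}

# Usage:
#   cancel_pending "$USER"
```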
Further Information
More detailed information on Slurm commands/options can be found at: https://slurm.schedmd.com/