Slurm is the software that manages jobs that are submitted to the compute nodes. Information about how to submit and manage jobs is listed below.
There are four job submission partitions available to all users, based on the expected run time of a job. Each partition has a per-user limit on the number of cores which can be used at one time.
short (the default partition) - Max time = 1 day, Max # of cores = 40
medium - Max time = 7 days, Max # of cores = 20
long - Max time = 30 days, Max # of cores = 15
lowpriority - Max time = 30 days, Max # of cores = 30
There is a separate partition for GPU access.
gpu - Max time = 7 days, Max # of GPUs = 4
Additionally, a few people/departments have purchased compute nodes for their own use, so there are some additional partitions/queues that are limited to those specific groups.
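You can list the partitions you have access to, along with their configured time limits and current node states, using the sinfo command:
sinfo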
When all cluster cores are occupied, pending jobs are prioritized using a fairshare algorithm that incorporates job age, job size, and user resource consumption history. Pending jobs may preempt (suspend) running jobs based on partition priority. Short jobs may preempt medium and long jobs, while medium jobs may preempt long jobs. All other jobs may preempt lowpriority jobs.
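To see the priority values Slurm has assigned to your pending jobs under this algorithm, you can use the sprio command:
sprio -u <username>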
Get documentation on a command:
man <command>
Try the following commands:
man sbatch
man squeue
man scancel
The following example script demonstrates how to specify some important parameters for a job. It is necessary to specify the number of cores (if running multi-threaded/parallelized jobs), GPUs (if using GPUs), and memory (if the job is expected to use more than 8 GB/core), or your job may fail or not run as expected. Please note, however, that you are not required to use all of these options and should only use them if necessary.
This script performs a simple task: it generates a file of random numbers and then sorts it.
#!/bin/bash
#SBATCH -p short # partition (queue)
#SBATCH -N 1 # (leave at 1 unless using multi-node specific code)
#SBATCH -n 1 # number of cores
#SBATCH --mem=8192 # total memory in MB (8192 MB = 8 GB)
#SBATCH --job-name="myjob" # job name
#SBATCH -o slurm.%N.%j.stdout.txt # STDOUT
#SBATCH -e slurm.%N.%j.stderr.txt # STDERR
#SBATCH --mail-user=username@bucknell.edu # address to email
#SBATCH --mail-type=ALL # mail events (NONE, BEGIN, END, FAIL, ALL)
# Generate 100,000 random numbers, one per line
for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done

# Sort the numbers; output goes to the STDOUT file defined above
sort -n SomeRandomNumbers.txt
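If your software is multi-threaded, a sketch like the following requests multiple cores on one node and passes the allocation to the program (my_threaded_program is a hypothetical placeholder, and the OMP_NUM_THREADS line applies to OpenMP-based software):
#!/bin/bash
#SBATCH -p short # partition (queue)
#SBATCH -N 1 # single node
#SBATCH -c 8 # 8 cores for one multi-threaded task

# Slurm sets SLURM_CPUS_PER_TASK to match the -c request, so the
# program starts only as many threads as were actually allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program # hypothetical program name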
If you require one or more GPUs, add a line similar to the following and change the partition:
#SBATCH -p gpu # gpu partition
#SBATCH --gres=gpu:1 # number of GPUs
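Putting these options together, a minimal GPU batch script might look like the following (the nvidia-smi call simply reports which GPU was assigned and assumes the NVIDIA driver is available on the GPU nodes):
#!/bin/bash
#SBATCH -p gpu # gpu partition
#SBATCH --gres=gpu:1 # number of GPUs
#SBATCH -n 1 # number of cores
#SBATCH -o slurm.%N.%j.stdout.txt # STDOUT

nvidia-smi # show the GPU allocated to this job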
Now you can submit your job with the command:
sbatch myscript.sh
If you want to validate your script and find out when your job is estimated to run, use the following (note that this does not actually submit the job):
sbatch --test-only myscript.sh
You may run an interactive session on a compute node with a command such as:
srun -N 1 -n 2 -p short --pty /bin/bash
The --pty flag requests an interactive terminal, and /bin/bash denotes the shell to be run. Other options (such as -n or -p) may be specified just as in a submission script. Please note that interactive jobs are subject to the same partition-based time/core limits as batch jobs.
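The same options work for GPU jobs; for example, to start an interactive shell with one GPU:
srun -N 1 -n 2 -p gpu --gres=gpu:1 --pty /bin/bash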
List all current jobs for a user:
squeue -u <username>
List all running jobs for a user:
squeue -u <username> -t RUNNING
List all pending jobs for a user:
squeue -u <username> -t PENDING
List all current jobs in the short partition for a user:
squeue -u <username> -p short
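squeue can also report Slurm's current estimate of when pending jobs will start:
squeue -u <username> --start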
List detailed information for a job (useful for troubleshooting):
scontrol show jobid -dd <jobid>
Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
To get statistics on completed jobs by jobID:
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed
To view the same information for all jobs of a user:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
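sacct accepts additional fields and a start-time filter; for example, to also show job state and exit code for all of a user's jobs since a given date:
sacct -u <username> -S <YYYY-MM-DD> --format=JobID,JobName,State,ExitCode,MaxRSS,Elapsed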
To cancel one job:
scancel <jobid>
To cancel all the jobs for a user:
scancel -u <username>
To cancel all the pending jobs for a user:
scancel -t PENDING -u <username>
To cancel one or more jobs by name:
scancel --name myJobName
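scancel filters can be combined; for example, to cancel all of a user's jobs in the short partition:
scancel -u <username> -p short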
To pause a particular job:
scontrol hold <jobid>
To resume (release) a held job:
scontrol release <jobid>
To requeue (cancel and rerun) a particular job:
scontrol requeue <jobid>
Requesting more resources (e.g., CPU cores or memory) than required for your job prevents other people from using those resources. Some software, for example, is not capable of taking advantage of multiple cores, so you should request only one core. You can check the efficiency of a job after it completes using the seff command. When running this command, take note of the CPU Efficiency and Memory Efficiency fields. If the percentage used is very low, consider reducing your resource requests (i.e., decrease the CPU cores or memory requested in your batch script).
seff <jobid>
More detailed information on Slurm commands/options can be found at: https://slurm.schedmd.com/