Running jobs on Grex

Why run jobs in batch mode?


There are many reasons for running jobs in batch mode on a cluster. The scheduler provides fair access for users' computations, traffic control that prevents resource congestion and waste, enforcement of organizational priorities, and a better understanding of the workload, utilization and resource needs for future capacity planning. After a long time as PBS/Moab users, we switched to the SLURM batch system in December 2019 as part of the Linux/SLURM update project.

Accounting groups


Users belong to “accounting groups”, led by their Principal Investigators (PIs). The accounting groups on Grex match CCDB roles. By default, a user’s jobs are assigned to their primary (default) accounting group. A user may belong to more than one accounting group; in that case, the SLURM option --account= can be passed to sbatch or salloc to select a non-default account to run jobs under.

For example, a user who belongs to two accounting groups: def-sponsor1 and def-sponsor2 (sponsored by two PIs), can specify which one to use:

#SBATCH --account=def-sponsor1

or

#SBATCH --account=def-sponsor2

QOSs


QOS stands for Quality of Service. It is a mechanism for modifying the scheduler’s limits, hardware access policies, job priorities and job accounting/billing. Presently, QOSs may be used internally by our scheduler machinery, but they should not be specified by users. Jobs that specify an explicit --qos= will be rejected by the SLURM job submission wrapper.

Limits and policies


  • In order to prevent monopolization of the entire cluster by a single user or a single accounting group, we enforce a MAXPS-like limit; for non-RAC accounts it is set to 4 M CPU-minutes and 400 CPU cores. Accounts that have an allocation (RAC) on Grex get higher MAXPS and maximum CPU core limits so that they can reach their usage targets.

  • To see these limits, one could run the command:

sacctmgr show assoc where user=$USER

or with a specific format for the output:

export SACCTMGR_FORMAT="cluster%9,user%12,account%20,share%5,qos%24,maxjobs%15,grptres%12,grptresrunmin%20"
sacctmgr show assoc where user=$USER format=$SACCTMGR_FORMAT
  • Partitions for preemptible jobs running on contributed hardware might be further limited, so that they cannot occupy the whole contributed hardware.

  • When Grex is underutilized but the queue holds jobs that could run if not for the above-mentioned limits, we might relax the limits as a temporary “bonus”.
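
As a rough illustration of how such a limit behaves, the sketch below computes the CPU-minutes a single job would request. It assumes the limit counts CPU cores multiplied by requested wall time in minutes, which is how MAXPS-like limits commonly work but is not spelled out above. The helper name cpu_minutes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: CPU-minutes a job would count against a MAXPS-like
# limit, ASSUMING the charge is (CPU cores) x (requested wall time in minutes).
cpu_minutes() {
    local cores=$1 days=$2 hours=$3
    echo $(( cores * (days * 24 + hours) * 60 ))
}

# Limit for non-RAC accounts quoted in the text: 4 M CPU-minutes.
LIMIT=4000000

# Example: a 400-core job requesting 7 days of wall time.
request=$(cpu_minutes 400 7 0)
if (( request > LIMIT )); then
    echo "request of ${request} CPU-minutes exceeds the ${LIMIT} limit"
else
    echo "request of ${request} CPU-minutes fits within the ${LIMIT} limit"
fi
```

Under that assumption, 400 cores for 7 days amounts to 4,032,000 CPU-minutes, slightly over the 4 M figure, so shortening the wall time or reducing the core count would be needed for such a job to fit.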

SLURM commands


Naturally, SLURM provides a command line user interface. Some of the most useful commands are listed below.

Exploring the system


The SLURM command that shows the state of nodes and partitions is sinfo:

Examples:

  • sinfo to list the state of all the nodes (idle, down, allocated, mixed) and partitions.
  • sinfo --state=idle to list all idle nodes.
  • sinfo -p compute to list information about a given partition (compute in this case).
  • sinfo -p skylake --state=idle to list idle nodes on a given partition (skylake in this case).
  • sinfo -R to list the nodes that are down, drained or failing, together with the reason.
  • sinfo -R -N -o"%.12N %15T [ %50E ]"|uniq -c to list those nodes and print the output in a specific format.
  • sinfo -s --format="# %20P %12A %.12l %.11L %.6a %.22C %.10m" to show all the partitions and their characteristics.
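
The sinfo output can also be post-processed with standard shell tools. The following is a minimal sketch that sums idle node counts per partition; the function name count_idle_nodes is hypothetical, and the sinfo call is guarded so the script does nothing on machines where SLURM is not installed.

```shell
#!/bin/bash
# Sum node counts per partition from lines of the form "<partition> <nodes>",
# the layout produced by: sinfo -h --state=idle -o "%P %D"
# (%P is the partition name, with "*" appended to the default partition;
#  %D is the number of nodes).
count_idle_nodes() {
    awk '{ sub(/\*$/, "", $1); n[$1] += $2 }
         END { for (p in n) print p, n[p] }' | sort
}

# On a login node (only runs if sinfo is available):
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h --state=idle -o "%P %D" | count_idle_nodes
fi
```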

Submitting jobs


Batch jobs are submitted as follows:

sbatch [options] myfile.job

Interactive jobs are submitted in the same way, but they do not need a job script because they give you an interactive session:

salloc [options]

For more information about the usage of sbatch and salloc commands, please visit the dedicated sections: interactive and batch jobs.

The sbatch command returns a number called the JobID, which SLURM uses to identify the job in the queuing system.
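
When submitting from a script, it is often handy to capture that JobID for later use with other commands. A minimal sketch, assuming the default sbatch message of the form "Submitted batch job 12345"; the function name extract_jobid is hypothetical:

```shell
#!/bin/bash
# Extract the numeric JobID from sbatch's default message,
# which has the form "Submitted batch job 12345".
extract_jobid() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on a login node (myfile.job is the placeholder name used above):
#   jobid=$(sbatch myfile.job | extract_jobid)
#   squeue -j "$jobid"
# sbatch also has a --parsable option that prints the JobID without the text.

# Demonstration on a sample message:
echo "Submitted batch job 12345" | extract_jobid
```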

The options specify the resources (wall time and trackable resources such as cores, nodes and GPUs), the accounts, QOSs and partitions under which the jobs should run, and various other settings, such as whether a job needs an X11 GUI (--x11), where its output should go, and whether email should be sent when the job changes state.

Resources (trackable resources, or TRES, in SLURM speak) are CPU time, memory, and GPU time.

Generic resources can be software licenses, etc. There are also options to control job placement such as partitions and QOSs. Here is a list of the most frequently used sbatch options:

  • --ntasks= (specifies number of tasks (MPI processes) per job)
  • --nodes= (specifies number of nodes (servers) per job)
  • --ntasks-per-node= (specifies number of tasks (MPI processes) per node)
  • --cpus-per-task= (specifies number of CPU cores (threads) per task)
  • --mem-per-cpu= (specifies memory per allocated CPU core)
  • --mem= (specifies the memory per node)
  • --gpus= (specifies number of GPUs per job. There are also --gpus-per-XXX and --XXX-per-gpu)
  • --time= (specifies wall time in the format D-HH:MM, that is days-hours:minutes)
  • --qos= (specifies a QOS by its name (Should not be used on Grex!))
  • --partition= (specifies a partition by its name (Can be very useful on Grex!))

Examples of using some of these options with sbatch and salloc are shown below:

sbatch --nodes=1 --ntasks-per-node=1 --cpus-per-task=12 --mem=40G --time=0-48:00 gaussian.job

and

salloc --nodes=1 --ntasks-per-node=4 --mem-per-cpu=4000M --x11 --partition=compute

And so on. The options for batch jobs can be given either on the command line or (perhaps better) in special comments in the job file, like:

#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=1 
#SBATCH --cpus-per-task=12
#SBATCH --mem=40G
#SBATCH --time=0-48:00
#SBATCH --partition=compute

Refer to the subsections on batch jobs and interactive jobs for more information, examples of job scripts, and instructions on how to submit jobs.
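
To show where the special comments sit in a complete file, here is a minimal job script sketch built from the same options. The workload line is a placeholder rather than a real program, and the fallbacks are there only so that the script also runs outside the scheduler.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=40G
#SBATCH --time=0-48:00
#SBATCH --partition=compute

# SLURM sets SLURM_CPUS_PER_TASK inside a job when --cpus-per-task is given;
# fall back to 1 so the script also runs outside the scheduler.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${SLURM_JOB_ID:-not-in-slurm} using ${OMP_NUM_THREADS} threads"

# Placeholder for the real workload (hypothetical program name):
# ./my_threaded_program input.dat
```

The #SBATCH lines must come before the first executable command, since sbatch stops reading directives once it reaches one.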

Monitoring jobs


Checking on the queued jobs:

  • squeue -u someuser (to see all queued jobs of the user someuser)
  • squeue -u someuser -t R (to see all running jobs of the user someuser)
  • squeue -u someuser -t PD (to see all pending jobs of the user someuser)
  • squeue -A def-sponsor1 (to see all queued jobs for the accounting group def-sponsor1)

Without the above parameters, squeue returns all the jobs in the system. There is a shortcut, sq, for squeue -u $USER.
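
For a quick overview, the squeue output can be reduced to a count of jobs per state. A minimal sketch; the function name count_by_state is hypothetical, and the squeue call is guarded so the script does nothing where SLURM is not installed.

```shell
#!/bin/bash
# Count jobs per state from one state name per line,
# the layout produced by: squeue -h -u "$USER" -o "%T"
# (-h suppresses the header; %T prints the state, e.g. RUNNING or PENDING).
count_by_state() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# On a login node (only runs if squeue is available):
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%T" | count_by_state
fi
```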

Canceling jobs:

  • scancel JobID (to cancel a job JobID)
  • echo "Deleting all the jobs by $USER" && scancel -u $USER (to cancel all your jobs at once, whether running or pending).
  • echo "Deleting all the pending jobs by $USER" && scancel -u $USER --state=pending (to cancel all your pending jobs at once).

Hold and release queued jobs:

To put one or more jobs on hold, use:

  • scontrol hold JobID (to put on hold the job JobID).
  • scontrol hold JobID01,JobID02,JobID03 (to put on hold the jobs JobID01,JobID02,JobID03).

To release them, use:

  • scontrol release JobID (to release the job JobID).
  • scontrol release JobID01,JobID02,JobID03 (to release the jobs JobID01,JobID02,JobID03).

Checking job efficiency:

The seff command is a wrapper around sacct that gives a friendly summary, such as the actual utilization of wall time and memory:

  • seff JobID
  • seff -d JobID

Note that the output from the seff command is not accurate if the job was not successful.

Checking resource limits and usage for past and current jobs:

  • sacct -j {JobID} -l
  • sacct -u $USER -S {STARTDATE} -E {ENDDATE} -l --parsable2
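
The long listing (-l) is very wide, so it can help to select a few fields instead. The sketch below builds such a sacct command for a date range using the long option names --starttime and --endtime. The helper name sacct_range and the chosen field list are illustrative assumptions, not a Grex convention.

```shell
#!/bin/bash
# Hypothetical helper: build a sacct command for a date range.
# Dates are expected as YYYY-MM-DD, one of the formats sacct accepts.
sacct_range() {
    local start=$1 end=$2
    local re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
    if [[ ! $start =~ $re || ! $end =~ $re ]]; then
        echo "usage: sacct_range YYYY-MM-DD YYYY-MM-DD" >&2
        return 1
    fi
    echo "sacct -u ${USER:-someuser} --starttime=$start --endtime=$end" \
         "--format=JobID,JobName,State,Elapsed,AllocCPUS,ReqMem,MaxRSS --parsable2"
}

# Print the command for January 2024; run the printed line on a login node.
sacct_range 2024-01-01 2024-01-31
```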

Getting info on accounts and priorities


Fairshare and accounting information for the accounting group def-someprofessor:

sshare -l -A def-someprofessor
sshare -l -A def-someprofessor --format="Account,EffectvUsage,LevelFS"

sshare -a -l -A def-someprofessor
sshare -a -l -A def-someprofessor --format="Account,User,EffectvUsage,LevelFS"

Fairshare and accounting information for a user:

sshare -l -U --user $USER
sshare -l -U --user $USER --format="Cluster,User,EffectvUsage,FairShare,LevelFS"

Limits and settings for an account:

sacctmgr list assoc account=def-someprofessor format=account,user,qos



Since HPC technology is widely used by most universities and national labs, searching the web for your SLURM question will likely return a few useful links to their HPC/ARC documentation. Please remember to adapt your scripts to the available hardware and software, as some SLURM directives and modules are specific to a given cluster.