Partitions#
The current Grex system spans more than one kind of computer hardware: Intel Cascade Lake and AMD Genoa CPUs, regular and large (double) memory per node, and several kinds of GPUs (NVidia V100, A30 and L40s). A large fraction of the system is also made of researcher-contributed nodes. This makes Grex a very heterogeneous HPC system. With SLURM as the scheduler, this requires partitioning: a “partition” is a set of compute nodes grouped by a characteristic, usually by the kind of hardware the nodes have, and sometimes by who “owns” the hardware as well. The following SLURM settings are important to know:
There is no fully automatic selection of partitions, other than the default skylake partition that most users get for short jobs. For members of contributors’ groups, the default partition is their contributed nodes. Thus, in many cases users have to specify the partition manually when submitting their jobs!
A job cannot run across several partitions at the same time, but it is possible to specify more than one partition, as in --partition=skylake,largemem, so that the scheduler directs the job to the first partition that becomes available (see the example script after this list).
Jobs will be rejected by the SLURM scheduler if the partition’s hardware and the requested resources do not match (for example, asking for GPUs on the CPU-only skylake or largemem partitions is not possible). So, in some cases, explicitly adding the --partition= flag to the SLURM job submission is needed.
Jobs that request GPU-containing partitions (like agro-b or lgpu) have to use GPUs (with a corresponding TRES flag like --gpus=), otherwise they will be rejected; this is to prevent clogging up the expensive GPU nodes with CPU-only jobs!
Memory per node generally differs between partitions with different hardware. For the special case of --mem=0, SLURM sets a partition-specific amount of memory, based on the lowest available memory per node in that partition.
On the special partition test, oversubscription is enabled in SLURM, to facilitate better turnaround of interactive OOD jobs.
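As a sketch of how these settings fit together, here is a minimal CPU job script; the account name def-someuser, the executable name and the resource amounts are placeholders, not recommendations:

```bash
#!/bin/bash
#SBATCH --account=def-someuser        # placeholder account name
#SBATCH --partition=skylake,largemem  # scheduler sends the job to whichever partition frees up first
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=3500M           # per-CPU memory stays portable across partitions with different Mem/Node
#SBATCH --time=0-02:00

./my_program                          # placeholder executable
```

Submit the script with sbatch, and check with squeue -u $USER which partition the job was eventually routed to.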
Currently, the following partitions are available on Grex:
General purpose CPU partitions#
| Partition | Nodes | CPUs/Node | CPUs | Mem/Node | Notes |
|---|---|---|---|---|---|
| skylake | 42 | 52 | 2184 | 187 GB | CascadeLakeRefresh |
| largemem | 12 | 40 | 480 | 380 GB | CascadeLake |
| genoa | 27 | 192 | 5184 | 750 GB | AMD EPYC 9654 |
| genlm | 3 | 192 | 576 | 1500 GB | AMD EPYC 9654 |
| test | 1 | 18 | 36 | 512 GB | CascadeLake |
All CPU partitions support a common subset of the AVX512 instruction set. However, the AMD EPYC CPUs are of the Zen4 architecture, which has an extended set of AVX512 instructions compared to Cascade Lake. Thus, host-optimized code compiled on genoa or genlm nodes may fail with an ‘illegal instruction’ error on the skylake and largemem nodes.
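If a binary has to run on both the Cascade Lake and the Genoa partitions, one way to avoid the ‘illegal instruction’ problem is to target the common instruction set explicitly instead of the build host. A minimal sketch, assuming GCC and a placeholder source file myprog.c:

```bash
# Target the Cascade Lake AVX512 subset, which the Genoa CPUs also support.
gcc -O2 -march=cascadelake -o myprog myprog.c

# By contrast, -march=native on a genoa/genlm node enables Zen4-only extensions,
# and the resulting binary may fail with "illegal instruction" on skylake/largemem:
# gcc -O2 -march=native -o myprog myprog.c
```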
General purpose GPU partitions#
| Partition | Nodes | GPU type | CPUs/Node | Mem/Node | Notes |
|---|---|---|---|---|---|
| gpu | 2 | 4 x V100/32GB | 32 | 187 GB | Intel AVX512 CPU, NVLink |
| lgpu | 2 | 2 x L40s/48GB | 64 | 380 GB | AMD AVX512 CPU |
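For GPU jobs, the partition and the GPU request have to be consistent, as noted above. A minimal sketch of a job script asking for one V100 on the gpu partition; the account name and resource amounts are placeholders:

```bash
#!/bin/bash
#SBATCH --account=def-someuser   # placeholder account name
#SBATCH --partition=gpu          # V100 partition; use lgpu for the L40s nodes
#SBATCH --gpus=1                 # GPU request is mandatory on GPU partitions
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=0-06:00

nvidia-smi                       # quick check that the GPU is visible inside the job
```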
Contributed CPU partitions#
| Partition | Nodes | CPU type | CPUs/Node | Mem/Node | Notes |
|---|---|---|---|---|---|
| mcordcpu 1 | 5 | AMD EPYC 9634 84-Core | 168 | 1500 GB | - |
| chrim | 4 | AMD EPYC 9654 96-Core | 192 | 750 GB | - |
| chrimlm | 1 | AMD EPYC 9654 96-Core | 192 | 1500 GB | - |
| hsc 2 | 1 | AMD EPYC 9654 96-Core | 192 | 1500 GB | - |
| pgs 3 | 1 | AMD EPYC 9655 96-Core | 192 | 750 GB | - |
Contributed GPU partitions#
| Partition | Nodes | GPU type | CPUs/Node | Mem/Node | Notes |
|---|---|---|---|---|---|
| stamps 4 | 3 | 4 x V100/16GB | 32 | 187 GB | AVX512 CPU, NVLink |
| livi 5 | 1 | 16 x V100/32GB | 48 | 1500 GB | NVSwitch, AVX512 CPU |
| agro 6 | 2 | 2 x A30/24GB | 24 | 250 GB | AMD AVX2 CPU |
| mcordgpu 7 | 2 | 4 x A30/24GB | 32 | 512 GB | AMD AVX2 CPU |
Note that the newer GPU nodes with NVidia A30 GPUs have an older CPU architecture that supports only up to the AVX2 instruction set. Host-optimized code compiled on a Cascade Lake or AMD Genoa CPU, which support AVX512, will throw ‘illegal instruction’ errors on the AVX2 GPU nodes.
Preemptible partitions#
The following preemptible partitions are set up for general use of the contributed nodes:
| Partition | Contributed by |
|---|---|
| stamps-b | Prof. R. Stamps |
| livi-b | Prof. L. Livi |
| agro-b | Faculty of Agriculture |
| genoacpu-b | Spans all contributed AMD Genoa CPU nodes |
| mcordgpu-b | Prof. M. Cunha Cordeiro |
The following partitions (skylake, largemem, test, gpu, lgpu) are generally accessible. The other partitions (stamps, livi, agro, mcordcpu and mcordgpu, chrim and chrimlm) are open only to the contributors’ groups.
On the contributed resources, the owners’ group has preferential access. However, users belonging to other groups can submit jobs to one of the preemptible partitions (ending with -b) to run on the contributed hardware as long as it is unused, on the condition that their jobs can be preempted (that is, killed) should the owners’ jobs need the hardware. There is a minimum runtime guaranteed to preemptible jobs, which is currently 1 hour. The maximum wall time is set per preemptible partition (and can be seen in the output of the sinfo command). To get a global overview of all partitions on Grex, run the custom script partition-list from your terminal.
Note that the owners’ partitions and the corresponding preemptible partitions overlap! This means that an owners’ group should not submit jobs to both its contributed partition and the corresponding preemptible partition, otherwise its jobs may preempt its other jobs!
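Before submitting to a preemptible partition, its time limit and available resources can be checked with sinfo; the following sketch lists the -b partitions with their wall time limits, node counts and generic resources (GPUs):

```bash
# Columns: partition, time limit, node count, generic resources (GRES).
sinfo --partition=stamps-b,livi-b,agro-b,genoacpu-b,mcordgpu-b --format="%P %l %D %G"

# Submitting to a GPU-containing -b partition still requires a GPU request, e.g.
# (job.sh is a placeholder script name):
# sbatch --partition=stamps-b --gpus=1 --time=0-03:00 job.sh
```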
mcordcpu: CPU nodes contributed by Prof. Marcos Cunha Cordeiro. ↩︎
hsc: GPU node contributed by Prof. Harmeet Singh Chawla. ↩︎
pgs: GPU node contributed by Dr. Britt Drogemoller and Prof. Galen Wright. ↩︎
stamps: GPU nodes contributed by Prof. R. Stamps. ↩︎
livi: GPU node contributed by Prof. L. Livi. ↩︎
agro: GPU node contributed by the Faculty of Agriculture. ↩︎
mcordgpu: GPU nodes contributed by Prof. Marcos Cunha Cordeiro. ↩︎