Grex upgrade: SISF2023 - Aug 26 - Sep 6, 2024

Please review this brief summary of the Grex upgrades and changes made during the outage of Aug 26 - Sep 6, 2024.

Operating System#


Grex is now running a new version of Linux (Alma Linux 8.10). All compute and login nodes are upgraded to Alma Linux.

Login nodes#


  • The login node yak was upgraded to Alma Linux. It is the only login node available for now.

  • The alias grex is now redirected to yak.

The login nodes bison and tatanka are offline.

To connect to Grex, use one of the following:

ssh -XY username@grex.hpc.umanitoba.ca

or

ssh -XY username@yak.hpc.umanitoba.ca

  • OOD (Open OnDemand) is down during the outage.

Partitions#


After the Grex upgrade, the following new partitions were added:

  • genoa: 27 nodes, 750 GB of usable memory per node, 192 CPUs per node, about 4000M of memory per core, 5184 CPUs in total.
  • genlm: 3 nodes, 1.5 TB of usable memory per node, 192 CPUs per node, about 8000M of memory per core, 576 CPUs in total.
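As an illustration, an interactive job on the new genoa partition could be requested as follows (the core count, memory and time values here are hypothetical; the per-core memory matches the approximate figure listed above):

```bash
# Hypothetical example: request an interactive job on the genoa partition
# with 4 tasks and ~4000M of memory per core for one hour.
salloc --partition=genoa --ntasks=4 --mem-per-cpu=4000M --time=1:00:00
```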

The following partitions are the same as before the outage:

  • skylake: 42 nodes, 192 GB of usable memory per node, 52 CPUs per node, about 3600M of memory per core, 2184 CPUs in total.
  • largemem: 12 nodes, 384 GB of usable memory per node, 40 CPUs per node, about 9600M of memory per core, 480 CPUs in total.
  • gpu
  • test

The contributed partitions are the same as before the outage:

  • stamps and stamps-b
  • livi and livi-b
  • agro and agro-b
  • mcordgpu and mcordgpu-b
  • mcordcpu and mcordcpu-b

The partition testgenoa has been removed and its nodes assigned to the new partition genoa (see above).

Legacy nodes#


As of Aug 29, 2024, the legacy nodes (bison, tatanka and the compute partition) are decommissioned.

Storage#


The storage servers for /home and /project are online. Users can access and/or transfer data as needed.

Software Stacks#


Grex is now running one operating system (Alma Linux):

  • Alma Linux: this OS runs on the new login node yak and on zebu (which serves as a host for OOD). All partitions are running Alma Linux.

  • The new software stack SBEnv is set as the default.

SBEnv:#


This is a new software stack that is meant to be used on yak and all modern partitions on Grex. SBEnv stands for Simplified Build Environment.

SBEnv already provides:

  • different compiler versions for Intel and GCC suites: gcc/13.2.0; intel/2019.5, intel/2023.2, intel-one/2024.1; intelmpi/2019.8; intelmpi/2021.10
  • a new AOCC compiler suite for new AMD nodes (genoa, genlm and mcordcpu-b partitions): aocc/4.2.0
  • the AOCL math libraries (AMD BLAS/LAPACK/ScaLAPACK): aocl/4.2.0 and aocl/4.2.0-64
  • OpenMPI (openmpi/4.1.6 is the default version to be used in most cases)
  • some commercial software (ORCA, Gaussian, Matlab)
  • some software restricted to particular groups, like STATA, VASP and ADF.
  • some tools and popular dependencies.

We will continue to add more programs as they are requested by users. If you cannot find the program or module you want to use, please send us a request via support@tech.alliancecan.ca and we will install it for you.

If you have compiled your programs locally, you may have to re-compile them using the compilers available under this software stack.
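For example, a locally built C program could be rebuilt with a compiler from the new stack (gcc/13.2.0 is listed above; the source file name here is hypothetical):

```bash
# Load a compiler from the SBEnv stack, then rebuild.
# myprog.c is a hypothetical source file used for illustration.
module load gcc/13.2.0
gcc -O2 -o myprog myprog.c
```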

Since this is a new software stack managed by a different package manager, module names may have changed compared to the old software stack. For example, modules that previously had uofm in their name no longer show it: instead of uofm/adf, the module name is now adf.

The best way to find modules and see how to load them is to run the usual command:

module spider <name of the program>

CCEnv#


This environment corresponds to the software stack from the Alliance, the same one used on national systems like cedar, graham, beluga and narval. It can be used on yak and on all partitions.

To use it on Grex, first load the following modules in this order:

module load CCEnv
module load arch/avx512
module load StdEnv/2023

Then use module spider to search for other modules under this environment.

Note that on the AMD Genoa based partitions, CCEnv environments earlier than StdEnv/2023 will likely not work. We recommend using the latest StdEnv on the new AMD hardware.

Scheduler#


  • No major changes to the scheduler: we are still running the same version as before the outage.
  • If no partition is specified, the default is skylake.

NEW (another significant change, introduced on Jun 19, 2024): for users that have more than one account (that is, who work with more than one research group), SLURM on Grex will no longer try to guess which of the accounts is the default. Instead, sbatch and salloc will ask you to provide the --account= option explicitly, list the possible accounts, and stop. If you are a member of more than one group, always specify the account you intend to use for the job. To see your accounting groups, run the command sshare -U from your terminal.
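For instance, a minimal job script that specifies both the partition and the account explicitly might look like this (the account name def-someuser, the resource values and the program are hypothetical):

```bash
#!/bin/bash
#SBATCH --partition=skylake       # default partition; change to genoa, largemem, etc.
#SBATCH --account=def-someuser    # hypothetical account name; list yours with: sshare -U
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=3600M
#SBATCH --time=0-01:00

# Load the software environment, then run the program (myprog is hypothetical).
module load gcc/13.2.0
./myprog
```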

Workflow summary#


As a summary of the changes, there is only one workflow on Grex:

  • Connect via yak.hpc.umanitoba.ca or grex.hpc.umanitoba.ca.
  • Use the new environment SBEnv for modules and/or compile your programs using the compilers available under this environment.
  • Submit your jobs to skylake, genoa, largemem or any other partition. For a complete list of partitions, run the command partition-list from your terminal.
  • Optionally, use CCEnv as shown above.