Grex: High Performance Computing Cluster at University of Manitoba

Introduction#


Grex is a UManitoba High Performance Computing (HPC) system, first put in production in early 2011 as part of the WestGrid consortium. “Grex” is the Latin word for “herd” (or perhaps “flock”). The names of the Grex login nodes (bison, tatanka, zebu, yak) also refer to various kinds of bovine animals.

Please note that bison and tatanka were decommissioned during the August/September 2024 outage and are no longer available. For more information, visit the updates page.

Since being defunded by WestGrid on April 2, 2018, Grex has been available only to users affiliated with the University of Manitoba and their collaborators.

If you are a new Grex user, proceed to the quick start guide and documentation right away.

Hardware#


  • The original Grex was an SGI Altix machine, with 312 compute nodes (Xeon 5560, 12 CPU cores and 48 GB of RAM per node) and a QDR 40 Gb/s InfiniBand network.

    The SGI Altix nodes were decommissioned in September 2024.

  • In 2017, a new Seagate Storage Building Blocks (SBB) based Lustre filesystem with 418 TB of usable space was added to Grex.

  • In 2020 and 2021, the University added 57 Intel CascadeLake CPU nodes, a few GPU nodes, a new NVME storage for home directories, and EDR InfiniBand interconnect.

  • In March 2023, a new 1 PB storage system was added to Grex; it is called the /project filesystem.

  • In January 2024, /project was extended by another 1 PB.

  • In September 2024, 30 new AMD Genoa nodes were added.

The current computing hardware available for general use is as follows:

Login nodes#


As of September 14, 2022, Grex uses the UManitoba network; the old WestGrid and BCNET network that had been in use for about 11 years has been decommissioned. DNS names now use hpc.umanitoba.ca instead of the previous westgrid.ca domain.

On Grex, there are multiple login nodes:

  • Yak: yak.hpc.umanitoba.ca (note that this node's CPU architecture is avx512).
  • Grex: grex.hpc.umanitoba.ca is now an alias for the Yak login node above.
  • Zebu: https://zebu.hpc.umanitoba.ca (used only for OpenOnDemand; requires VPN when accessed from outside the campus network).

To log in to Grex in text (shell) mode, connect to grex.hpc.umanitoba.ca or yak.hpc.umanitoba.ca using a secure shell (SSH) client.
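For example, from a terminal on Linux, macOS, or Windows (PowerShell or WSL), a session can be started as sketched below; myuser is a placeholder for your actual Grex username:

```bash
# Connect to Grex with SSH; replace "myuser" with your own username.
ssh myuser@grex.hpc.umanitoba.ca

# Or target the Yak login node explicitly, with X11 forwarding for GUI tools:
ssh -Y myuser@yak.hpc.umanitoba.ca
```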

CPU nodes#


In addition to the original nodes, newer Intel (Skylake/Cascade Lake) and AMD nodes have been added to Grex:

| Hardware | Number of nodes | CPUs/Node | Mem/Node | Network |
|---|---|---|---|---|
| Intel CPU [1] | 2 | 40 | 384 GB | EDR 100 Gb/s IB interconnect |
| Intel 6230R | 42 | 52 | 188 GB | EDR 100 Gb/s IB interconnect |
| AMD EPYC 9654 | 27 | 192 | 750 GB | HDR 200 Gb/s IB interconnect |
| AMD EPYC 9654 | 3 | 192 | 1500 GB | HDR 200 Gb/s IB interconnect |
| AMD EPYC 9634 | 15 | 168 | 1500 GB | HDR 100 Gb/s IB interconnect |

GPU nodes#


There are also several researcher-contributed nodes (CPU and GPU) on Grex, which make it a “community cluster”. The researcher-contributed nodes are available to other users on an opportunistic basis; jobs from the owner groups will preempt other users’ workloads. An example SLURM request for a GPU is sketched after the table below.

| Hardware | Number of nodes | GPUs/Node | CPUs/Node | Mem/Node |
|---|---|---|---|---|
| GPU | 2 | 4 | 32 | 192 GB |
| 4 x V100-32 GB [2] | 2 | 4 | 32 | 187 GB |
| 4 x V100-16 GB [3] | 3 | 4 | 32 | 187 GB |
| 16 x V100-32 GB [4] | 1 | 16 | 48 | 1500 GB |
| AMD [A30] [5] | 2 | 2 | 18 | 500 GB |
| NVIDIA AMD [A30] [6] | 2 | 4 | 32 | 500 GB |
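As a minimal sketch, a SLURM batch script requesting a single GPU might look like the following; the account and partition options are omitted because their names are site-specific, and the resource values are placeholders rather than Grex-specific recommendations:

```bash
#!/bin/bash
# Minimal GPU job sketch; adjust resources and walltime for your workload.
#SBATCH --job-name=gpu-test
#SBATCH --gpus=1              # request one GPU (older SLURM syntax: --gres=gpu:1)
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=01:00:00

nvidia-smi                    # report which GPU was allocated to the job
```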

Storage#


Grex’s compute nodes have access to several shared filesystems; the main ones are listed below.

| File system | Type | Total space | Quota per user |
|---|---|---|---|
| /home | NFSv4/RDMA | 15 TB | 100 GB |
| /project | Lustre | 2 PB | Allocated per group |
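As a rough sketch (the exact quota-reporting setup depends on how the filesystems are configured, and def-myprof below is a hypothetical group name), per-user and per-group usage can typically be checked like this:

```bash
# Check your /home usage against the per-user quota (standard Linux quota tool).
quota -s

# Check the Lustre /project quota for your group; replace "def-myprof"
# with your actual group/allocation name.
lfs quota -g def-myprof /project
```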

In addition to the shared filesystems, the compute nodes have their own local disks that can be used as temporary storage when running jobs.
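For example, a job can stage its working files on node-local disk and copy results back at the end. The sketch below assumes a SLURM-provided local scratch variable such as $SLURM_TMPDIR; the actual variable or path on Grex may differ, so treat it as an assumption to verify:

```bash
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=8G

# Assumes $SLURM_TMPDIR points at node-local scratch; verify on Grex.
cp input.dat "$SLURM_TMPDIR"/
cd "$SLURM_TMPDIR"

./my_program input.dat > output.dat   # my_program is a placeholder executable

# Copy results back to the submission directory on the shared filesystem.
cp output.dat "$SLURM_SUBMIT_DIR"/
```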

Software#


Grex is a traditional HPC machine running Linux, with SLURM as the resource management (batch scheduling) system. On Grex, we provide several software stacks.
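As an illustrative sketch (the module names below are examples, not a definitive list of what is installed on Grex), software is typically accessed through environment modules:

```bash
# List currently loaded modules.
module list

# Search the available software stacks (the "spider" subcommand assumes
# the Lmod module system); "gcc" is just an example package name.
module spider gcc

# Load a module; the exact name and version on Grex may differ.
module load gcc
```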

Web portals and GUI#


In addition to the traditional bash mode (connecting via ssh), users have access to:

  • OpenOnDemand: on Grex, it is possible to use OpenOnDemand (OOD for short) to log in and run batch or GUI applications (VNC desktops, MATLAB, GaussView, Jupyter, …). For more information, please refer to the page: OpenOnDemand


WestGrid ceased operations on April 1, 2022. The former WestGrid institutions are now re-organized into two consortia: the BC DRI Group and the Prairies DRI Group.

  1. CPU nodes contributed by Prof. Marcos Cordeiro (Department of Agriculture).
  2. GPU nodes available for all users (general purpose).
  3. GPU nodes contributed by Prof. R. Stamps (Department of Physics and Astronomy).
  4. NVSwitch server contributed by Prof. L. Livi (Department of Computer Science).
  5. GPU nodes contributed by Faculty of Agriculture.
  6. GPU nodes contributed by Prof. Marcos Cordeiro (Department of Agriculture).