Introduction#
Getting Singularity or Apptainer, and Podman#
As always, we get them through modules. Some systems may have Singularity/Apptainer already in the system PATH, or in an unusual place such as somewhere on CVMFS.
which singularity
module spider apptainer
module spider singularity
Assuming we have found any of the above, run module load singularity (or load whichever module we found). Then, execute it.
singularity version
Later we will also need Podman loaded for running OpenWebUI.
module load podman
podman version
podman ps
Fallback: use the image from /home/shared/sing on MagicCastle, or /global/software/sing-may2026 on Grex.
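The lookup above can be scripted so that the same commands work whichever runtime a system provides. This is a minimal sketch; the helper name first_available is ours and not part of any module system:

```shell
# Print the first command from the argument list that is found on PATH.
# Returns non-zero if none of them is available.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Prefer apptainer, fall back to singularity.
if RUNTIME=$(first_available apptainer singularity); then
    echo "Using container runtime: $RUNTIME"
else
    echo "No container runtime on PATH; try 'module spider apptainer' first." >&2
fi
```

If nothing is found, load a module first and run the check again.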
Running a local LLM interactively#
We will need some model files. They can be obtained with wget, but we have pre-loaded them into the ../models/ directories of our usual workshop-2026 locations. Let's create a directory $HOME/models and make symbolic links to the pre-downloaded models now.
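Since the same link is made once per model, the step can also be written as a loop over every .gguf file in the shared directory. This is a minimal sketch; the function name link_models is ours, and the shared path in the example comment is the Grex location used in this section:

```shell
# Link every .gguf file from a shared directory into a personal one.
# Usage: link_models SHARED_DIR TARGET_DIR
link_models() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    for model in "$src_dir"/*.gguf; do
        # If the glob matched nothing, the pattern stays literal: skip it.
        [ -e "$model" ] || continue
        # -f and -n make the call safe to repeat: an existing link is replaced.
        ln -sfn "$model" "$dest_dir/$(basename "$model")"
    done
}

# Example (not run here):
# link_models /global/software/ws-may2026/models "$HOME/models"
```

The explicit per-model commands for each system follow.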
mkdir ~/models
cd ~/models
# wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ln -s /global/software/ws-may2026/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf $HOME/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# wget https://huggingface.co/bartowski/Qwen2.5.1-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf
ln -s /global/software/ws-may2026/models/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf $HOME/models/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf
# wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
ln -s /global/software/ws-may2026/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf $HOME/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
ls -al
du -h
cd
On MagicCastle it would be:
mkdir ~/models
cd ~/models
# wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ln -s /home/shared/ws-may2026/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf $HOME/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# wget https://huggingface.co/bartowski/Qwen2.5.1-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf
ln -s /home/shared/ws-may2026/models/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf $HOME/models/Qwen2.5.1-Coder-7B-Instruct-Q4_K_M.gguf
ls -al
du -h
cd
Now we need the Llama.cpp executables. They are either available as a module, or can be run from a Singularity/Apptainer image.
- On Grex, use the module called llama-cpp.
- On MagicCastle, pull both llama-server and llama-full images. Use CUDA-13 for newer H100 GPUs!
# Grex instructions
#
module spider llama-cpp
module load cuda/12.9 arch/avx2 gcc/14.3
module load llama-cpp
which llama-cli
#
which llama-server
#or
# MC instructions
#
module load apptainer/1.4.5
#apptainer pull docker://ghcr.io/ggml-org/llama.cpp:server-cuda
ln -s /home/shared/sing/llama.cpp_server-cuda.sif ./llama.cpp_server-cuda.sif
#apptainer pull docker://ghcr.io/ggml-org/llama.cpp:full-cuda
ln -s /home/shared/sing/llama.cpp_full-cuda.sif ./llama.cpp_full-cuda.sif
ls *.sif
Get an interactive job on a GPU compute node with salloc#
On Grex, use the Workshop’s GPU reservation ws_gpu !
salloc --account=def-gshamov --mem-per-cpu=16gb --cpus-per-task=1 --gpus=1 --reservation=ws_gpu --partition=livi-b,gpu-b,stamps-b,agro-b,mcordgpu-b
(On MC, use a whole GPU VM with salloc --gpus=1 --partition=gpu-node --mem=0 instead. Do not run that on Grex.)
Wait for it to give you an interactive prompt. Check your node name and whether you have GPUs!
hostname
nvidia-smi
Now we have the code, the model and the GPU, and can try loading the model into Llama.cpp and issuing a simple prompt.
llama-cli -c 4096 -ngl 60 --temp 0.7 -n 256 -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -p "Hello, how are you?"
(On MC, replace llama-cli at the start of the line above with apptainer exec ./llama.cpp_full-cuda.sif /app/llama-cli. Do not run it that way on Grex.)
Does it answer? Try the other models too.
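The difference between the two systems (a module on Grex, a container image on MagicCastle) can be hidden behind a small helper. This is a minimal sketch; the function name llama_prefix is ours, and the default image name and the /app/ path are the ones used in this section. Some Apptainer installations also need the --nv flag after exec for GPU access; whether that applies here depends on the site configuration.

```shell
# Print the command prefix needed to run a llama.cpp tool.
# Usage: llama_prefix TOOL [SIF_IMAGE]
llama_prefix() {
    tool=$1
    sif=${2:-./llama.cpp_full-cuda.sif}
    if command -v "$tool" >/dev/null 2>&1; then
        # Module case: the tool is already on PATH.
        printf '%s\n' "$tool"
    elif [ -e "$sif" ]; then
        # Container case: run the tool from inside the image.
        printf 'apptainer exec %s /app/%s\n' "$sif" "$tool"
    else
        echo "Neither $tool nor $sif found" >&2
        return 1
    fi
}

# Example (not run here):
# $(llama_prefix llama-cli) -c 4096 -ngl 60 --temp 0.7 -n 256 \
#     -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -p "Hello, how are you?"
```

The unquoted $(...) in the example relies on word splitting, so it only works for image paths without spaces.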
Running the local inference engine Llama.cpp as a SLURM job#
We need to run a persistent server job that does not quit. We send llama-server into the background with & and then call wait, so that the job script stays alive for as long as the server runs.
!!Do not forget to use scancel JobID when done to prevent wasting a GPU!!
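The & and wait pattern can be tried without a GPU. In this minimal sketch (the function name run_and_wait is ours) a short sleep stands in for the server. Without the wait, a batch script would reach its end straight away and the job would finish, taking the background process with it.

```shell
# Start a command in the background, then block until it exits.
# Returns the exit status of the background command.
run_and_wait() {
    "$@" &
    child=$!
    echo "Started PID $child in the background" >&2
    wait "$child"
}

# Stand-in for a long-running server: sleeps briefly, then exits.
run_and_wait sleep 1
echo "Background command finished with status $?"
```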
Let's use the vi editor and save the following job script as server.slurm:
#!/bin/bash
#SBATCH --job-name=llama-server
#SBATCH --partition=gpu,livi-b,agro-b,mcordgpu-b
#SBATCH --nodes=1
#SBATCH --gpus=1 --reservation=ws_gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 --mem-per-cpu=8gb
#SBATCH --output=llama-%j.out
#SBATCH --error=llama-%j.err
MODEL_PATH=~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
PORT=8295 # !!!ADD YOUR user Number here prefixed by 8!!!
module load cuda/12.9.1 arch/avx2 gcc/14.3
module load llama-cpp
# Run the server
llama-server \
-m "$MODEL_PATH" \
--host 0.0.0.0 \
--port "$PORT" \
-c 4096 \
--n-gpu-layers 99 \
-np 2 \
-n 256 --temp 0.7 \
> server.log 2>&1 &
echo "Server started on port $PORT. Use SSH tunnel to access."
wait
Then submit the script with the sbatch server.slurm command.
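The PORT value in the script follows the convention "user number prefixed by 8". If your workshop account name ends in that number, the port can be derived instead of typed. This is a sketch under that assumption; the function name port_for_user and the account name pattern (for example user295) are ours:

```shell
# Build a port number by prefixing the digits of a username with 8.
# ASSUMPTION: workshop accounts look like "user295"; adjust for other schemes.
port_for_user() {
    digits=$(printf '%s' "$1" | tr -cd '0-9')
    if [ -z "$digits" ]; then
        echo "No digits in username '$1'" >&2
        return 1
    fi
    port="8$digits"
    # Stay within the unprivileged TCP port range.
    if [ "$port" -lt 1024 ] || [ "$port" -gt 65535 ]; then
        echo "Port $port is outside 1024-65535" >&2
        return 1
    fi
    printf '%s\n' "$port"
}

# Example: port_for_user user295 prints 8295
```

In the job script this would replace the fixed value, for example PORT=$(port_for_user "$USER").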
Now we need to connect to the server from a desktop session. On Grex, let's use Open OnDemand (OOD)!
- Navigate to https://ood.hpc.umanitoba.ca ,
- pick Desktop, genoa partition, add 8 CPUs , submit and connect
- get a Terminal
Then, open an SSH tunnel to the node your Llama server is running on. In the terminal:
sq
# note the hostname for the Llama-server job
export MYPORT=8295 # add your user number here; use the node instead of n999 below
ssh -fNL $MYPORT:n999:$MYPORT n999.local
# no output expected. Do not close the terminal
Use the Firefox container and SSH port forwarding as follows.
# singularity pull docker://linuxserver/firefox
ln -s /home/software/sing-may2026/firefox_latest.sif ./firefox_latest.sif
singularity exec firefox_latest.sif firefox
Then, navigate the browser to http://localhost:MYPORT (with MYPORT replaced by your port number). You should be able to chat with the model! Kill Firefox when done.
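The tunnel can also be checked from the terminal before opening a browser. This is a minimal sketch with two helpers of our own naming: tunnel_spec builds the LOCAL_PORT:HOST:REMOTE_PORT argument for ssh -L, and wait_for retries a command until it succeeds. The example assumes the server answers on its /health endpoint, as llama-server normally does once the model is loaded.

```shell
# Build the ssh -L forwarding spec: LOCAL_PORT:REMOTE_HOST:REMOTE_PORT.
tunnel_spec() {
    node=$1
    port=$2
    printf '%s:%s:%s\n' "$port" "$node" "$port"
}

# Retry a command until it succeeds, at most TRIES times, one second apart.
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example (not run here; n999 stands for your node, as in the text):
# ssh -fNL "$(tunnel_spec n999 "$MYPORT")" n999.local
# wait_for 30 curl -sf "http://localhost:$MYPORT/health" && echo "server is ready"
```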
Next exercise: Connecting OpenWebUI to LLama#
We need Podman to run the OpenWebUI service right in our Desktop session. We will use Firefox to connect to it on port 8080, which is hardcoded in OpenWebUI.
Load Podman now, and create a local directory for the data:
module load podman
podman version
podman ps
mkdir -p open-webui
Then, use the Podman command below to fetch and run OpenWebUI.
podman run -d \
-p 8080:8080 \
-e WEBUI_AUTH=False \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Then, start Firefox again and point it to http://localhost:8080. Configure OpenWebUI to see our Llama at http://n999:MYPORT in the Web UI: Admin - Connections - Add Connection.
Using OpenWebUI to make a RAG?#
TBD! For now, just explore the UI in Firefox at http://localhost:8080. A “Knowledge base” can be added by uploading some Markdown files and using the local built-in embedding models in OpenWebUI.