LoRA walkthrough

High Performance Computing Workshop - May 21-23, 2025


This is an example of training LoRA in a batch job. The example uses a script from the Hugging Face Diffusers package. We assume the participant has already tried simple text-to-image generation with SD 1.5 as per the 02-text-to-image.ipynb notebook.

Pull Hugging Face Diffusers, copy data


We will download the Hugging Face Diffusers source code from their GitHub repository using Git. This is needed to use their example Python training script rather than developing our own from scratch. We will also copy the datasets for training from the pythonai subdirectory of our Workshop materials.

pwd
# on Grex, let's work from the Project filesystem rather than Home!
# assuming the current directory is under your project as per above
cp -r /global/software/ws-may2025/pythonai .
cd ./pythonai
pwd && ls 
# should see a path under your project (e.g. /home/your_user/projects/def-your_project/your_user/pythonai) ; dataset1 dataset2 notebooks
# on MC
cp -r /home/shared/pythonai ~/scratch/
cd ~/scratch/pythonai
pwd && ls
# should see ~/scratch/pythonai ; dataset1 dataset2 notebooks
# clone the HF repository and scripts
git clone https://github.com/huggingface/diffusers.git
ls
# should see a diffusers directory added

We will more or less follow the Diffusers source installation instructions, in an interactive job.

Start an interactive job on a GPU node


We will use the workshop reservation ws_gpu and (any of) the reserved GPU partitions. We will need one GPU. Please add --account= if you have more than one active account.

salloc --time=0-2:00 --partition=agro-b,mcordgpu-b,stamps-b --gpus=1 --cpus-per-gpu=6 --mem=60gb  --reservation=ws_gpu

We should see the GPU information from nvidia-smi there. We will get either a V100 or an A30.
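
For example, once the interactive job starts on the GPU node:

# show the allocated GPU; the table should list one V100 or A30
nvidia-smi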

Create a virtualenv and install packages


We will need to load CUDA and Python 3.12. To this end, use module spider python/3.12 and pick the version that has a CUDA dependency.
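
For example, on either cluster:

# list the available python/3.12 builds and what must be loaded first
module spider python/3.12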

# first, load the modules
#on Grex
module purge
module load SBEnv
# loading according to spider, need the CUDA version!
module load cuda/12.4.1 arch/avx2 gcc/13.2.0 python/3.12
python --version
# on MC
module load StdEnv/2023 arch/avx2 cuda python/3.12 
python --version

Now that the modules are loaded, we can create a new virtualenv here and call it hf.

virtualenv hf
source hf/bin/activate
#installing packages
pip install torch arrow transformers datasets peft accelerate
pip install torchvision
pip install git+https://github.com/huggingface/diffusers
python -c "import torch"
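# optional sanity check, our addition: should print True on a GPU node
python -c "import torch; print(torch.cuda.is_available())"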
deactivate

We now have our Diffusers virtual environment, and hopefully it is working! We will use it in interactive and batch jobs from now on.

Run a test LoRA text-to-image training on dataset1


# now activate the environment
source hf/bin/activate
# change directory to the Examples we cloned
cd diffusers/examples/text_to_image
pwd
ls 
#must have train_text_to_image_lora.py amongst the files there
# this is what we are going to use! Note the path to dataset1: ../../../dataset1/train
python ./train_text_to_image_lora.py \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_data_dir ../../../dataset1/train \
  --output_dir ../../../lora-sd-output \
  --resolution 256 \
  --train_batch_size 1 \
  --max_train_steps 200

Note that we use resolution 256 on dataset1.

Debug it, try to make it run and deliver the new LoRA weights file under the output directory.
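
If the run succeeds, the new weights should appear under the output directory; with recent Diffusers versions the file is typically named pytorch_lora_weights.safetensors:

ls ../../../lora-sd-output
# expect the trained LoRA weights, e.g. pytorch_lora_weights.safetensors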

Run production batch jobs on dataset2


Now that we are sure the LoRA training environment is good, let's try to run it as a production batch job. We will change to the larger dataset2 and use the following script (also provided under pythonai).

Always use the same modules and virtualenv! The job script below is for Grex.

If needed, change directory to the project filesystem on Grex.

# must be in /home/your_user/projects/def-your_project/your_user/pythonai
sbatch trainingjob.sh
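
You can monitor the job with the usual Slurm commands, for example:

# check whether the job is queued or running
squeue -u $USER
# once it runs, follow the log (slurm-<jobid>.out by default)
tail -f slurm-*.out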

The trainingjob.sh script looks as follows:

#!/bin/bash

#SBATCH --reservation=ws_gpu
#SBATCH --partition=agro-b,mcordgpu-b
#SBATCH --cpus-per-task=6
#SBATCH --gpus=1
#SBATCH --mem=40000
#SBATCH --time=0-3:0:00

# Load the requested software stack
module purge
module load SBEnv
module load cuda/12.4.1 arch/avx2 gcc/13.2.0
module load python/3.12


# now activate the environment
source hf/bin/activate

echo "Starting run at: `date`"

python ./diffusers/examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_data_dir dataset2/train \
  --output_dir lora-sd-output2 \
  --resolution 512 \
  --train_batch_size 1 \
  --max_train_steps 1200

echo "Program finished with exit code ${?} at: `date`"


Use SD 1.5 with added LoRA-optimized weights

Start a Jupyter job on Grex OOD or Magic Castle.

In the notebooks folder, open 03-text-to-image-lora.ipynb, correct the path to the updated LoRA weights, and run the inference again. Also try merging the two LoRA weights trained on dataset1 and dataset2; a command-line sketch of the same steps follows below.
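
For reference, here is a minimal command-line sketch of that inference and merge, assuming the two output directories produced above; the adapter names, blend weights, and prompt are placeholders to adapt:

source hf/bin/activate
python - <<'EOF'
import torch
from diffusers import StableDiffusionPipeline

# load the base SD 1.5 model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# load the two LoRA weight sets trained above
pipe.load_lora_weights("lora-sd-output", adapter_name="ds1")
pipe.load_lora_weights("lora-sd-output2", adapter_name="ds2")
# blend the adapters; 0.5/0.5 is an arbitrary starting point
pipe.set_adapters(["ds1", "ds2"], adapter_weights=[0.5, 0.5])
# placeholder prompt: use one that matches your training data
image = pipe("a test prompt", num_inference_steps=30).images[0]
image.save("lora-merged-test.png")
EOF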