Fine-tuning a Language Model Using LoRA

deep learning
python
In this notebook I want to compare fine-tuning a pretrained model with and without using LoRA.
Author

Vishal Bakshi

Published

September 1, 2023

Background

In this notebook I compare fine-tuning a pretrained model with and without LoRA. This exercise is a fastai community study group homework assignment.

Here is a comparison of the full fine-tuning (Full FT) and LoRA fine-tuning (LoRA FT) runs on the EleutherAI/pythia-70m model, using the roneneldan/TinyStoriesInstruct dataset (which comes from the TinyStories paper):

| Type | Parameters | Training Set | Validation Set | Perplexity | Batch Size | Epochs | Train Steps | Train Time (Minutes) |
|---|---|---|---|---|---|---|---|---|
| Full FT | 70.4M | 240k | 60k | 8.51 | 16 | 3 | 22500 | 100 |
| LoRA FT | 98k | 256k | 64k | 12.68 | 16 | 4 | 32000 | 120 |

Resources

Plan of Attack

In my first iteration of this exercise (see below) I manually ran multiple trainings with different models, dataset sizes and training arguments. The code was flexible and easy to update, but through that process I re-ran a lot of cells with different values and lost track of exactly what order I had run things in. In this second iteration, I’ll create a helper function get_trainer which takes various arguments (model, bs, tokz, train_ds, etc.) and returns a HuggingFace Trainer. This will help clear up some of the redundancy in my code and make it a bit cleaner to read.

# all the imports
!pip install peft accelerate evaluate -Uqq
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer, AutoModelForCausalLM, pipeline
from peft import LoraConfig, get_peft_model, TaskType
from evaluate import load
import math

get_trainer Helper Function

This function prepares and returns a Trainer object for a given model, tokenizer (and tokenize function), training/validation datasets, learning rate, batch size and number of epochs:

def get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr, bs, epochs):
    # get tokenized datasets
    train_tok_ds = train_ds.map(tok_func, batched=True)
    eval_tok_ds = eval_ds.map(tok_func, batched=True)
    
    # sanity check: sometimes the tokenized datasets end up a different size than expected, so print the training set to confirm
    print(train_tok_ds)
    
    # with mlm=False the collator builds causal-LM labels from the input_ids; without it the model doesn't return a loss
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokz, mlm=False)
    
    # define training arguments
    training_args = TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="epoch",
        learning_rate=lr,
        lr_scheduler_type = "cosine",
        weight_decay=0.1,
        per_device_train_batch_size=bs, 
        per_device_eval_batch_size=bs,
        num_train_epochs=epochs,
        report_to='none',
        fp16=True,
        logging_steps=10,
        save_strategy="no"
    )
    
    # define Trainer
    trainer = Trainer(model, training_args, train_dataset=train_tok_ds, eval_dataset=eval_tok_ds,
                  tokenizer=tokz, data_collator=data_collator)
    
    return trainer

Load the Dataset

As recommended in the study group, I’ll use the TinyStoriesInstruct dataset, which comes from the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

ds = load_dataset("roneneldan/TinyStoriesInstruct")
ds
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 21755681
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 218380
    })
})

Full Fine-Tuning with EleutherAI/pythia-70m

First, I’ll fully fine-tune an existing pretrained model, EleutherAI/pythia-70m, on a subset of the TinyStoriesInstruct dataset. I chose this model because larger models were giving me CUDA out-of-memory errors even with small dataset and batch sizes.
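For reference, two TrainingArguments options that often help with out-of-memory errors are gradient accumulation and gradient checkpointing. I didn’t end up needing them with pythia-70m, but a rough sketch would look like this:

# a sketch (not used in this notebook): trade a smaller per-step batch and some extra
# compute for lower GPU memory use
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,   # smaller per-device batch
    gradient_accumulation_steps=4,   # effective batch size of 16
    gradient_checkpointing=True,     # recompute activations instead of storing them
    fp16=True,
)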

model_nm = 'EleutherAI/pythia-70m'
tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.add_special_tokens({'pad_token': '[PAD]'})
def tok_func(x): return tokz(x["text"])
model = AutoModelForCausalLM.from_pretrained(model_nm)
model
GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 512)
    (layers): ModuleList(
      (0-5): 6 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(in_features=512, out_features=1536, bias=True)
          (dense): Linear(in_features=512, out_features=512, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
          (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=512, out_features=50304, bias=False)
)
model.num_parameters()
70426624

I first trained the model on a very small subset (1000 rows) for both full fine-tuning and LoRA to make sure everything worked, then slowly increased the training and validation set sizes until I hit the CUDA out-of-memory error.

For small datasets, I noticed that the validation loss started increasing after 3 epochs, so I’ve kept the number of epochs at 3. With larger datasets I could try increasing the number of epochs and see if the model still overfits.
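One alternative I didn’t use here: transformers ships an EarlyStoppingCallback that stops training once eval_loss stops improving, though it requires checkpoint saving and best-model tracking to be turned on (which conflicts with the save_strategy="no" workaround I describe below). A minimal sketch, reusing the tokenized datasets and collator from get_trainer:

# a sketch (not used in this notebook): stop training when eval_loss stops improving
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",               # early stopping needs checkpoints...
    load_best_model_at_end=True,         # ...and best-model tracking
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=10,                 # upper bound; training may stop earlier
)

trainer = Trainer(model, training_args, train_dataset=train_tok_ds, eval_dataset=eval_tok_ds,
                  tokenizer=tokz, data_collator=data_collator,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=1)])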

I couldn’t figure out how to compute perplexity during training: I was getting a “Sizes of tensors must match except in dimension 0” error whenever I passed a function to compute_metrics, so I calculate perplexity at the end of training instead.
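Since perplexity is just the exponential of the cross-entropy loss, one workaround (a sketch, not something I ran here) is to recover per-epoch perplexities after training from the eval_loss entries the Trainer already logs:

# a sketch: per-epoch perplexity from the losses the Trainer logs with evaluation_strategy="epoch"
import math

def epoch_perplexities(trainer):
    for entry in trainer.state.log_history:
        if "eval_loss" in entry:
            print(f"epoch {entry['epoch']}: perplexity {math.exp(entry['eval_loss']):.2f}")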

When I tried to train the model with 240k, 220k or 200k training samples, I got the following error after 1.60, 1.75 and 1.92 epochs respectively:

RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/9: file write failed

I set the save_strategy argument in TrainingArguments to "no" and this resolved the error. However, if I want checkpoints during training in the future, I’ll have to find a different fix.
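If I do want checkpoints later, one thing to try (assuming the write failure was the notebook’s disk filling up with checkpoints, which I haven’t verified) is keeping saving on but capping how many checkpoints are retained:

# a sketch: keep checkpointing enabled but only retain the most recent checkpoint
training_args = TrainingArguments(
    output_dir="outputs",
    save_strategy="epoch",
    save_total_limit=1,   # older checkpoints are deleted as new ones are written
    # ... remaining arguments as in get_trainer above
)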

train_ds = ds['train'].select(range(240000))
eval_ds = ds['validation'].select(range(60000))
trainer = get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr=5e-4, bs=16, epochs=3)
trainer.train()
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[22500/22500 1:40:16, Epoch 3/3]
Epoch Training Loss Validation Loss
1 2.385700 2.407521
2 2.098300 2.192903
3 1.841100 2.141196

TrainOutput(global_step=22500, training_loss=2.1849648211161297, metrics={'train_runtime': 6016.472, 'train_samples_per_second': 119.671, 'train_steps_per_second': 3.74, 'total_flos': 1.64194783592448e+16, 'train_loss': 2.1849648211161297, 'epoch': 3.0})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
[1875/1875 03:25]
Perplexity: 8.51
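As a quick check, this matches taking the exponential of the final validation loss from the table above:

import math
math.exp(2.141196)  # ≈ 8.51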

I’ll generate some text from the pretrained model and fully fine-tuned model to see how they compare:

prompt = "Once upon a time,"
generator = pipeline('text-generation', model=model_nm, tokenizer=tokz)
generator(prompt, max_length = 100, repetition_penalty=1.2)

Generated text:

‘Once upon a time, thefirst two are not in agreement. The second is to be expected; and it would have been an easy task for them if they had known that he was going on their way from home as soon after leaving his house at night or when there were no other guests than himself who wanted him back with all of her belongings before returning into town again by midnight (and then later). But this one has never seen such things since I've lived here.”’

generator = pipeline('text-generation', model=trainer.model.to('cpu'), tokenizer=tokz)
generator(prompt, max_length = 100, repetition_penalty=1.2)

Generated text:

‘Once upon a time, there was an old man. He had a big mustache and he loved to wear it every day. One morning when the sun came out, his eyes lit up with joy! 0He wanted to go outside but couldn't find anything else. So he decided to take off his hat and coat so that no one could see him. The old man smiled at Jimmy's face and said “I'm glad you like it”. Jimmy was happy again and thanked the old man’

The pretrained model, as is, does not generate text that resembles a story at all. The fully fine-tuned model’s generated text is somewhat coherent and resembles a story, although elements of it still don’t make sense.

Fine-Tuning EleutherAI/pythia-70m with LoRA

Since the LoRA model has far fewer trainable parameters, I can increase the training dataset size. I’ll also see if I can train for more epochs without overfitting, since I’m using more data.
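I’m using LoraConfig’s defaults below; written out explicitly (a sketch based on peft’s defaults at the time, which line up with the lora_A/lora_B shapes in the model printout further down), the configuration would look roughly like this:

# a sketch of the default LoRA configuration, spelled out explicitly
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=8,                        # scaling applied to the LoRA update
    lora_dropout=0.0,                    # shows up as Identity() in the model printout
    target_modules=["query_key_value"],  # the GPT-NeoX attention projection
)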

train_ds = ds['train'].select(range(256000))
eval_ds = ds['validation'].select(range(64000))
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = AutoModelForCausalLM.from_pretrained(model_nm)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
trainable params: 98,304 || all params: 70,524,928 || trainable%: 0.13938901149959346
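The 98,304 figure checks out: each of the 6 GPT-NeoX layers gets a LoRA A matrix (512×8) and a B matrix (8×1536) on its query_key_value projection.

# sanity check on the trainable parameter count
n_layers, d_model, r, d_qkv = 6, 512, 8, 1536
n_layers * (d_model * r + r * d_qkv)  # 98304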
model
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPTNeoXForCausalLM(
      (gpt_neox): GPTNeoXModel(
        (embed_in): Embedding(50304, 512)
        (layers): ModuleList(
          (0-5): 6 x GPTNeoXLayer(
            (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (attention): GPTNeoXAttention(
              (rotary_emb): RotaryEmbedding()
              (query_key_value): Linear(
                in_features=512, out_features=1536, bias=True
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=512, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1536, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (dense): Linear(in_features=512, out_features=512, bias=True)
            )
            (mlp): GPTNeoXMLP(
              (dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
              (dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
              (act): GELUActivation()
            )
          )
        )
        (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      )
      (embed_out): Linear(in_features=512, out_features=50304, bias=False)
    )
  )
)
trainer = get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr=5e-4, bs=16, epochs=4)
trainer.train()
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
[32000/32000 2:00:46, Epoch 4/4]
Epoch Training Loss Validation Loss
1 2.616000 2.614058
2 2.575500 2.570585
3 2.605000 2.547680
4 2.493900 2.540338

TrainOutput(global_step=32000, training_loss=2.621225409567356, metrics={'train_runtime': 7252.3347, 'train_samples_per_second': 141.196, 'train_steps_per_second': 4.412, 'total_flos': 2.33350953959424e+16, 'train_loss': 2.621225409567356, 'epoch': 4.0})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
[2000/2000 03:32]
Perplexity: 12.68
prompt = "Once upon a time,"
generator = pipeline('text-generation', model=trainer.model.to('cpu'), tokenizer=tokz)
generator(prompt, max_length = 100, repetition_penalty=1.2)

Generated text:

“Once upon a time, there was an old man who lived in the park. He had many friends and loved to play with him every day at his house all night long! One morning he decided that it would be best for everyone else because they were so happy together as each other on their own one by another’s side of town hall or doorstep…so when something unexpected happened she started playing outside - her mommy said no but could help herself out here until someone came up close enough.. She”

The generated text resembles a story and is somewhat coherent for the first couple of sentences before it stops making sense in the second half.

Here is a comparison of the full fine-tuning (Full FT) and LoRA fine-tuning (LoRA FT) runs on the EleutherAI/pythia-70m model:

| Type | Parameters | Training Set | Validation Set | Perplexity | Batch Size | Epochs | Train Steps | Train Time (Minutes) |
|---|---|---|---|---|---|---|---|---|
| Full FT | 70.4M | 240k | 60k | 8.51 | 16 | 3 | 22500 | 100 |
| LoRA FT | 98k | 256k | 64k | 12.68 | 16 | 4 | 32000 | 120 |

Generating Text from the Pre-Trained TinyStories Model

The authors of the paper that this dataset comes from released their fine-tuned model on HuggingFace, so I’ll use it to generate text and see how a state-of-the-art TinyStories model performs:

model_nm = "EleutherAI/gpt-neo-125M"
tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.add_special_tokens({'pad_token': '[PAD]'})
def tok_func(x): return tokz(x["text"])
generator = pipeline('text-generation', model='roneneldan/TinyStories-33M', tokenizer=tokz)
generator(prompt, max_length = 100, repetition_penalty=1.2)

Generated text:

‘Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and pick flowers. One day, she found an ancient book on her porch. It had lots of pictures inside that looked very old.opened the book and saw many words written around it. But then, she heard a loud noise coming from the house next door. She went to investigate and found out that someone had broken into their home. ran back to’

The model is so good! It can hold a consistent, coherent theme in story format for multiple sentences.

Final Thoughts

I’m happy to have gotten this all to work, as that alone was a big step in my learning process. This is the first time I have trained a causal language model using HuggingFace. One thought to close out this exercise: would restructuring the data help? Currently the dataset has text values like “Summary:” and “Features:”, which are the prompts the TinyStories paper authors used to generate stories with GPT-3.5 and GPT-4. Perhaps removing these prompts from the dataset and keeping only the story text would improve the model (a rough sketch of that idea follows below). I’ll explore this in a future exercise.
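As a rough sketch of that idea (hypothetical; the exact prompt labels and row layout of TinyStoriesInstruct would need checking), prompt rows could be filtered out before tokenizing:

# hypothetical sketch: drop rows that start with the instruction-style prompt labels
# mentioned above (there may be other labels that need handling too)
prompt_prefixes = ("Summary:", "Features:")
story_only_ds = ds["train"].filter(lambda x: not x["text"].startswith(prompt_prefixes))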

I hope you enjoyed this blog post!