# all the imports
!pip install peft accelerate evaluate -Uqq
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer, AutoModelForCausalLM, pipeline
from peft import LoraConfig, get_peft_model, TaskType
from evaluate import load
import math
Fine-tuning a Language Model Using LoRA
Background
In this notebook I want to compare the differences between fine-tuning a pretrained model with and without using LoRA. This exercise is a fastai community study group homework assignment.
Here is a comparison of the full fine-tuning (Full FT) vs. LoRA fine-tuning (LoRA FT) process on the EleutherAI/pythia-70m model using the roneneldan/TinyStoriesInstruct dataset (which comes from the TinyStories paper):
Type | Trainable Parameters | Training Set (rows) | Validation Set (rows) | Perplexity | Batch Size | Epochs | Train Steps | Train Time (Minutes) |
---|---|---|---|---|---|---|---|---|
Full FT | 70.4M | 240k | 60k | 8.51 | 16 | 3 | 22500 | 100 |
LoRA FT | 98k | 256k | 64k | 12.68 | 16 | 4 | 32000 | 120 |
Resources
- I’ll use a small subset of the roneneldan/TinyStoriesInstruct dataset from HuggingFace for both trainings, since when I use the full dataset I get CUDA out-of-memory errors.
- I’m referencing the following to patch together the code in this notebook:
  - Jeremy Howard’s Getting started with NLP for absolute beginners for the fundamental setup of data, model, and tokenizer.
  - HuggingFace’s Causal language modeling tutorial for updating the tokenizer with a pad token, the data_collator, and the training arguments.
  - This forum response that shows how to select a subset of a dataset with a given set of indexes.
  - The TinyStories authors’ hyperparameters as listed on their 33M parameter model page.
  - HuggingFace’s LoRA Conceptual Guide for steps on how to implement LoRA using peft.
  - This blog post, which walks through an example LoRA training.
  - This forum response by Sylvain Gugger, which says to set save_strategy to "no" to stop the Trainer from creating checkpoints, as I was running into errors around this.
Plan of Attack
In my first iteration of this exercise (see below) I manually ran multiple different trainings with different models, dataset sizes and training arguments. The code was flexible and easy to update, but through that process I re-ran a lot of cells with different values and lost track of exactly what order I had run things in. In this second iteration, I’ll create a helper function get_trainer which takes various arguments (model, bs, tokz, train_ds, etc.) and returns a HuggingFace Trainer. This will help clear up some of the redundancy in my code and make it a bit cleaner to read.
get_trainer Helper Function
This function prepares and returns a Trainer object for a given model, tokenizer (and tokenize function), training/validation datasets, learning rate, batch size, and number of epochs:
def get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr, bs, epochs):
    # get tokenized datasets
    train_tok_ds = train_ds.map(tok_func, batched=True)
    eval_tok_ds = eval_ds.map(tok_func, batched=True)

    # sometimes for whatever reason the datasets are not the right size so checking it here
    print(train_tok_ds)

    # with mlm=False the collator copies input_ids into labels (the model shifts them internally),
    # which is what makes the model return a loss during training
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokz, mlm=False)

    # define training arguments
    training_args = TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="epoch",
        learning_rate=lr,
        lr_scheduler_type="cosine",
        weight_decay=0.1,
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs,
        num_train_epochs=epochs,
        report_to='none',
        fp16=True,
        logging_steps=10,
        save_strategy="no"
    )

    # define Trainer
    trainer = Trainer(model, training_args, train_dataset=train_tok_ds, eval_dataset=eval_tok_ds,
                      tokenizer=tokz, data_collator=data_collator)

    return trainer
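For reference, here is the shape of a call to this helper as it will be used later in the notebook (the argument values are the ones from the full fine-tuning run below):

```python
# example usage of get_trainer, with the values used in the full fine-tuning run below
trainer = get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr=5e-4, bs=16, epochs=3)
trainer.train()
```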
Load the Dataset
As recommended in the study group, I’ll use the TinyStoriesInstruct dataset, which comes from the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
ds = load_dataset("roneneldan/TinyStoriesInstruct")
ds
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 21755681
})
validation: Dataset({
features: ['text'],
num_rows: 218380
})
})
Full Fine-Tuning with EleutherAI/pythia-70m
First, I’ll fully fine-tune the pretrained EleutherAI/pythia-70m model on a subset of the TinyStoriesInstruct dataset. I chose this model because larger models were giving me CUDA out-of-memory errors even for small datasets and batch sizes.
model_nm = 'EleutherAI/pythia-70m'
tokz = AutoTokenizer.from_pretrained(model_nm)
tokz.add_special_tokens({'pad_token': '[PAD]'})
def tok_func(x): return tokz(x["text"])

model = AutoModelForCausalLM.from_pretrained(model_nm)
model
GPTNeoXForCausalLM(
(gpt_neox): GPTNeoXModel(
(embed_in): Embedding(50304, 512)
(layers): ModuleList(
(0-5): 6 x GPTNeoXLayer(
(input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(attention): GPTNeoXAttention(
(rotary_emb): RotaryEmbedding()
(query_key_value): Linear(in_features=512, out_features=1536, bias=True)
(dense): Linear(in_features=512, out_features=512, bias=True)
)
(mlp): GPTNeoXMLP(
(dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
(dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
(act): GELUActivation()
)
)
)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(embed_out): Linear(in_features=512, out_features=50304, bias=False)
)
model.num_parameters()
70426624
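As a quick sanity check, this 70.4M figure can be reproduced from the architecture printed above: two untied 50304×512 embedding matrices plus six identical transformer layers and a final LayerNorm.

```python
# rough parameter count derived from the printed architecture above
embed = 2 * 50304 * 512                  # embed_in + embed_out (untied)
per_layer = (
    (512 * 1536 + 1536)                  # attention query_key_value (weight + bias)
    + (512 * 512 + 512)                  # attention dense
    + (512 * 2048 + 2048)                # mlp dense_h_to_4h
    + (2048 * 512 + 512)                 # mlp dense_4h_to_h
    + 2 * (512 + 512)                    # two LayerNorms (weight + bias each)
)
final_ln = 512 + 512                     # final_layer_norm
print(embed + 6 * per_layer + final_ln)  # 70426624
```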
I first trained the model on a very small subset (1,000 rows) for both full fine-tuning and LoRA to make sure everything worked, then slowly increased the training and validation sizes until I hit a CUDA out-of-memory error.
For small datasets, I noticed that the validation loss started increasing after 3 epochs, so I’ve kept the number of epochs at 3. With larger datasets I could try increasing the number of epochs and see if the model still overfits.
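A sketch of that kind of smoke test, reusing get_trainer on a tiny slice of the data (the 1,000-row training slice matches what I described above; the 200-row validation slice and the single epoch are just illustrative values):

```python
# quick smoke test on a tiny slice of the dataset before scaling up;
# the validation size and epoch count here are illustrative, not the final settings
tiny_train = ds['train'].select(range(1000))
tiny_eval = ds['validation'].select(range(200))
smoke_trainer = get_trainer(model, tokz, tok_func, tiny_train, tiny_eval, lr=5e-4, bs=16, epochs=1)
smoke_trainer.train()
```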
I couldn’t figure out how to compute perplexity during training. I was getting a "Sizes of tensors must match except in dimension 0" error when passing any function to compute_metrics, so I calculate perplexity at the end of training instead.
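For context, perplexity is just the exponential of the average cross-entropy loss, so it can be recovered from the eval_loss that trainer.evaluate() reports; a minimal sketch:

```python
import math

# perplexity = exp(average cross-entropy loss)
def perplexity(eval_loss):
    return math.exp(eval_loss)

# e.g. the final validation loss of the full fine-tuning run below
print(f"{perplexity(2.141196):.2f}")  # 8.51
```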
When I tried to train the model with 240k, 220k or 200k training samples, I got the following error after 1.60, 1.75 and 1.92 epochs respectively:
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/9: file write failed
I set the save_strategy argument in the TrainingArguments to "no" and this resolved the error. However, if I want checkpoints during training in the future, I’ll have to figure out how to resolve this error differently.
train_ds = ds['train'].select(range(240000))
eval_ds = ds['validation'].select(range(60000))

trainer = get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr=5e-4, bs=16, epochs=3)
trainer.train()
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 2.385700 | 2.407521 |
2 | 2.098300 | 2.192903 |
3 | 1.841100 | 2.141196 |
TrainOutput(global_step=22500, training_loss=2.1849648211161297, metrics={'train_runtime': 6016.472, 'train_samples_per_second': 119.671, 'train_steps_per_second': 3.74, 'total_flos': 1.64194783592448e+16, 'train_loss': 2.1849648211161297, 'epoch': 3.0})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.51
I’ll generate some text from the pretrained model and fully fine-tuned model to see how they compare:
= "Once upon a time,"
prompt = pipeline('text-generation', model=model_nm, tokenizer=tokz)
generator = 100, repetition_penalty=1.2) generator(prompt, max_length
Generated text:
‘Once upon a time, thefirst two are not in agreement. The second is to be expected; and it would have been an easy task for them if they had known that he was going on their way from home as soon after leaving his house at night or when there were no other guests than himself who wanted him back with all of her belongings before returning into town again by midnight (and then later). But this one has never seen such things since I've lived here.”’
generator = pipeline('text-generation', model=trainer.model.to('cpu'), tokenizer=tokz)
generator(prompt, max_length=100, repetition_penalty=1.2)
Generated text:
‘Once upon a time, there was an old man. He had a big mustache and he loved to wear it every day. One morning when the sun came out, his eyes lit up with joy! 0He wanted to go outside but couldn't find anything else. So he decided to take off his hat and coat so that no one could see him. The old man smiled at Jimmy's face and said “I'm glad you like it”. Jimmy was happy again and thanked the old man’
The pretrained model, as is, does not generate text that resembles a story whatsoever. The fully fine-tuned model’s generated text is somewhat coherent and resembles a story, although elements of it still don’t make sense.
Fine-Tuning EleutherAI/pythia-70m with LoRA
Since the LoRA model has far fewer trainable parameters, I can increase the dataset size for the training. I’ll also see if I can train for more epochs without overfitting since I’m using more data.
train_ds = ds['train'].select(range(256000))
eval_ds = ds['validation'].select(range(64000))

lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = AutoModelForCausalLM.from_pretrained(model_nm)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
trainable params: 98,304 || all params: 70,524,928 || trainable%: 0.13938901149959346
model
PeftModelForCausalLM(
(base_model): LoraModel(
(model): GPTNeoXForCausalLM(
(gpt_neox): GPTNeoXModel(
(embed_in): Embedding(50304, 512)
(layers): ModuleList(
(0-5): 6 x GPTNeoXLayer(
(input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(attention): GPTNeoXAttention(
(rotary_emb): RotaryEmbedding()
(query_key_value): Linear(
in_features=512, out_features=1536, bias=True
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=512, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=1536, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
(dense): Linear(in_features=512, out_features=512, bias=True)
)
(mlp): GPTNeoXMLP(
(dense_h_to_4h): Linear(in_features=512, out_features=2048, bias=True)
(dense_4h_to_h): Linear(in_features=2048, out_features=512, bias=True)
(act): GELUActivation()
)
)
)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(embed_out): Linear(in_features=512, out_features=50304, bias=False)
)
)
)
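The 98,304 trainable parameters reported by print_trainable_parameters line up with the structure printed above: with the default LoRA rank r=8, peft adds a lora_A (512×8) and lora_B (8×1536) pair to each of the six query_key_value projections.

```python
# LoRA trainable parameter count implied by the printed structure above
r = 8                            # default rank used here (see lora_A out_features above)
per_layer = 512 * r + r * 1536   # lora_A + lora_B weights for one query_key_value projection
print(6 * per_layer)             # 98304, matching print_trainable_parameters()
```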
trainer = get_trainer(model, tokz, tok_func, train_ds, eval_ds, lr=5e-4, bs=16, epochs=4)
trainer.train()
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 2.616000 | 2.614058 |
2 | 2.575500 | 2.570585 |
3 | 2.605000 | 2.547680 |
4 | 2.493900 | 2.540338 |
TrainOutput(global_step=32000, training_loss=2.621225409567356, metrics={'train_runtime': 7252.3347, 'train_samples_per_second': 141.196, 'train_steps_per_second': 4.412, 'total_flos': 2.33350953959424e+16, 'train_loss': 2.621225409567356, 'epoch': 4.0})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 12.68
= "Once upon a time,"
prompt = pipeline('text-generation', model=trainer.model.to('cpu'), tokenizer=tokz)
generator = 100, repetition_penalty=1.2) generator(prompt, max_length
Generated text:
“Once upon a time, there was an old man who lived in the park. He had many friends and loved to play with him every day at his house all night long! One morning he decided that it would be best for everyone else because they were so happy together as each other on their own one by another’s side of town hall or doorstep…so when something unexpected happened she started playing outside - her mommy said no but could help herself out here until someone came up close enough.. She”
The generated text resembles a story and is a bit coherent for the first couple of sentences before it stops making sense in the second half.
Here is a comparison of the full fine-tuning (Full FT) vs. LoRA fine-tuning (LoRA FT) process on the EleutherAI/pythia-70m model:
Type | Trainable Parameters | Training Set (rows) | Validation Set (rows) | Perplexity | Batch Size | Epochs | Train Steps | Train Time (Minutes) |
---|---|---|---|---|---|---|---|---|
Full FT | 70.4M | 240k | 60k | 8.51 | 16 | 3 | 22500 | 100 |
LoRA FT | 98k | 256k | 64k | 12.68 | 16 | 4 | 32000 | 120 |
Generating Text from the Pre-Trained TinyStories Model
The authors of the paper that this dataset comes from released their fine-tuned model on HuggingFace, so I’ll use it to generate text and see how a state-of-the-art TinyStories model performs:
= "EleutherAI/gpt-neo-125M"
model_nm = AutoTokenizer.from_pretrained(model_nm)
tokz 'pad_token': '[PAD]'})
tokz.add_special_tokens({def tok_func(x): return tokz(x["text"])
= pipeline('text-generation', model='roneneldan/TinyStories-33M', tokenizer=tokz)
generator = 100, repetition_penalty=1.2) generator(prompt, max_length
Generated text:
‘Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine and pick flowers. One day, she found an ancient book on her porch. It had lots of pictures inside that looked very old.opened the book and saw many words written around it. But then, she heard a loud noise coming from the house next door. She went to investigate and found out that someone had broken into their home. ran back to’
The model is so good! It can hold a consistent, coherent theme in story format for multiple sentences.
Final Thoughts
I’m happy to have got this all to work, as that alone was a big step in my learning process. This is the first time I have trained a causal language model using HuggingFace. One thought to close out this exercise: Would restructuring the data help? Currently the dataset has text values like “Summary:” and “Features:”, which are the prompts used by the TinyStories paper authors to generate stories using GPT-3.5 and 4. Perhaps removing these prompts from the dataset and keeping only the story text would help improve the model. I’ll explore this in a future exercise.
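As a rough sketch of what that restructuring might look like, this keeps only rows that don't start with one of the instruction labels; the exact label strings are an assumption based on the description above, not a verified schema of the dataset:

```python
# hypothetical sketch: drop instruction-prompt rows and keep only story-like text;
# the label list here is an assumption, not a verified schema of TinyStoriesInstruct
prompt_labels = ("Summary:", "Features:")
story_only = ds["train"].filter(
    lambda x: not x["text"].lstrip().startswith(prompt_labels)
)
print(story_only)
```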
I hope you enjoyed this blog post!