Fine-tuning TinyStories-8M on the financial_phrasebank Dataset
python
LLM
TinySentiment
In this blog post I fine-tune the smaller TinyStories-8M model on the financial_phrasebank dataset and achieve 86% accuracy on the test set and 85% accuracy on the validation set.
Author
Vishal Bakshi
Published
August 19, 2024
Background
In a previous blog post I fine-tuned the TinyStories-33M model on the financial_phrasebank dataset and achieved ~85% accuracy on the validation set and ~80% accuracy on the test set.
In this notebook, I’ll fine-tune the much smaller TinyStories-8M model and see how it performs. I expect it to perform worse. In future notebooks, I’ll also fine-tune the 3M and 1M TinyStories models. I also suspect these models might perform better on a simpler, synthetically generated version of this dataset, which I plan to explore in a future notebook.
Much of the code in this section is boilerplate, tokenizing the dataset and splitting it into training, validation and test sets.
Show load_dataset
dataset = load_dataset(
    "financial_phrasebank",
    "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)

dataset = dataset.rename_columns({'label': 'labels', 'sentence': 'input'})
tokz.add_special_tokens({'pad_token': '[PAD]'})
tokz.padding_side = "left"  # https://github.com/huggingface/transformers/issues/16595 and https://www.kaggle.com/code/baekseungyun/gpt-2-with-huggingface-pytorch

tok_ds = dataset.map(tok_func, batched=True)
tok_ds[0]['input']
'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .'
tok_ds[0]['input_ids'][100:110] # first 100 elements are 50257 ('[PAD]')
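The tokenization helper tok_func and the train/validation/test split aren’t shown above. Here is a minimal sketch of what they might look like; the max length, split fractions, and the dds name are my assumptions, not the actual values used:

from datasets import DatasetDict

def tok_func(batch):
    # hypothetical: pad/truncate each sentence to a fixed length
    # (tokz.padding_side = "left" above puts the [PAD] tokens first)
    return tokz(batch["input"], padding="max_length", truncation=True, max_length=128)

# financial_phrasebank only ships a train split, so carve out validation and test sets
splits = tok_ds.train_test_split(test_size=0.2, seed=42)
val_test = splits["test"].train_test_split(test_size=0.5, seed=42)
dds = DatasetDict({"train": splits["train"],
                   "validation": val_test["train"],
                   "test": val_test["test"]})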
Much of the code in this section is either helper functions (like get_acc, MetricCallback, or results_to_dataframe) or boilerplate code to prepare a Hugging Face Trainer:
# thanks Claude
class MetricCallback(TrainerCallback):
    def __init__(self):
        self.metrics = []
        self.current_epoch_metrics = {}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.current_epoch_metrics.update(logs)

    def on_epoch_end(self, args, state, control, **kwargs):
        if hasattr(state, 'log_history') and state.log_history:
            # Get the last logged learning rate
            last_lr = state.log_history[-1].get('learning_rate', None)
        else:
            last_lr = None
        self.metrics.append({
            "epoch": state.epoch,
            "learning_rate": last_lr,
            **self.current_epoch_metrics
        })
        self.current_epoch_metrics = {}  # Reset for next epoch

    def on_train_end(self, args, state, control, **kwargs):
        # Capture final metrics after the last epoch
        if self.current_epoch_metrics:
            self.metrics.append({
                "epoch": state.num_train_epochs,
                "learning_rate": self.metrics[-1].get('learning_rate') if self.metrics else None,
                **self.current_epoch_metrics
            })
Show function to convert results dict into DataFrame
def results_to_dataframe(results, model_name):
    rows = []
    for result in results:
        initial_lr = result['learning_rate']
        for metric in result['metrics']:
            row = {
                'model_name': model_name,
                'initial_learning_rate': initial_lr,
                'current_learning_rate': metric.get('learning_rate'),
            }
            row.update(metric)
            rows.append(row)
    df = pd.DataFrame(rows)
    # Ensure specific columns are at the beginning
    first_columns = ['model_name', 'initial_learning_rate', 'current_learning_rate', 'epoch']
    other_columns = [col for col in df.columns if col not in first_columns]
    df = df[first_columns + other_columns]
    return df
Show function to make confusion matrix
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    cm = confusion_matrix(
        y_true=df['label_text'],
        y_pred=df['pred_text'],
        labels=['negative', 'neutral', 'positive'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive'])
    fig, ax = plt.subplots(figsize=(4, 4))
    disp.plot(ax=ax, text_kw={'fontsize': 12}, cmap='Blues', colorbar=False);

    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(16)
    ax.yaxis.label.set_fontsize(16)

    # make tick labels larger
    ax.tick_params(axis='y', labelsize=14)
    ax.tick_params(axis='x', labelsize=14)
Show function to generate a prediction
def get_prediction(model, text, tokz):
    # Determine the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the appropriate device
    model = model.to(device)

    # Tokenize the input text
    inputs = tokz(text, return_tensors="pt", truncation=True, padding=True)

    # Move input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the model's prediction
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = model(**inputs)

    # Ensure logits are on CPU for numpy operations
    logits = outputs.logits.detach().cpu()

    # Get probabilities
    probs = torch.softmax(logits, dim=-1)

    # Get the predicted class
    p_class = torch.argmax(probs, dim=-1).item()

    # Get the probability for the predicted class
    p = probs[0][p_class].item()

    labels = {0: "negative", 1: "neutral", 2: "positive"}
    print(f"Probability: {p:.2f}")
    print(f"Predicted label: {labels[p_class]}")
    return p_class, p
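One helper that isn’t shown above is get_acc, which I’m assuming is the compute_metrics function passed to the Trainer. A minimal sketch (the exact implementation is an assumption on my part):

import numpy as np

def get_acc(eval_pred):
    # hypothetical sketch of the hidden get_acc helper (used as compute_metrics)
    preds, labels = eval_pred
    if isinstance(preds, tuple):  # some model outputs include more than just logits
        preds = preds[0]
    return {"accuracy": (np.argmax(preds, axis=-1) == labels).mean()}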
While there are other hyperparameters to tune (warmup_ratio, weight_decay), I’ll focus this notebook on fine-tuning with different learning rates. I’ll start with the same learning rates that I used for the 33M model:
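The Trainer setup itself isn’t shown in this post. As a hedged sketch of what the sweep might look like (the train_tinystories helper name, batch size, and exact list of learning rates are my assumptions; the three epochs and the roneneldan/TinyStories-8M checkpoint come from elsewhere in the post):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def train_tinystories(lr, seed=42):
    # hypothetical helper: build and fine-tune one TinyStories-8M classifier
    model = AutoModelForSequenceClassification.from_pretrained(
        "roneneldan/TinyStories-8M", num_labels=3)
    model.resize_token_embeddings(len(tokz))       # account for the added [PAD] token
    model.config.pad_token_id = tokz.pad_token_id  # needed for GPT-style classification
    args = TrainingArguments(
        output_dir="outputs", learning_rate=lr, num_train_epochs=3,
        per_device_train_batch_size=16, evaluation_strategy="epoch",
        logging_strategy="epoch", seed=seed, report_to="none")
    trainer = Trainer(model=model, args=args,
                      train_dataset=dds["train"], eval_dataset=dds["validation"],
                      tokenizer=tokz, compute_metrics=get_acc,
                      callbacks=[MetricCallback()])
    trainer.train()
    return trainer

# assumed sweep values (so that indices 4 and 7 line up with the 8e-5 and 5e-4 results below)
lrs = [1e-6, 5e-6, 1e-5, 5e-5, 8e-5, 1e-4, 3e-4, 5e-4]
trainers = [train_tinystories(lr) for lr in lrs]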
The model with the highest validation set accuracy (trainers[4]) actually has a higher test accuracy than the 33M model (81% vs. 79%), a result that I was not expecting!
test_df, acc = get_test_df(trainers[4])
acc
/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0.8133333333333334
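The get_test_df helper is also hidden. A rough sketch of what it might do (the dds["test"] split name and exact column construction are assumptions; the label_text and pred_text columns match what make_cm expects):

def get_test_df(trainer):
    # hypothetical sketch of the hidden get_test_df helper
    preds = trainer.predict(dds["test"]).predictions
    if isinstance(preds, tuple):  # keep only the logits if extra outputs are returned
        preds = preds[0]
    labels = {0: "negative", 1: "neutral", 2: "positive"}
    df = pd.DataFrame({
        "input": dds["test"]["input"],
        "label_text": [labels[i] for i in dds["test"]["labels"]],
        "pred_text": [labels[i] for i in preds.argmax(axis=-1)],
    })
    acc = (df["label_text"] == df["pred_text"]).mean()
    return df, acc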
This 8M-parameter fine-tuned model predicts neutral sentences best (117/125), followed by positive sentences (39/54) and, lastly, negative sentences (27/46). It’s interesting to note that this mirrors the dataset itself: neutral sentences are the majority, followed by positive, with negative the least represented.
make_cm(test_df)
As the learning rate increases (starting at 1e-6), the validation set accuracy increases until it peaks at a learning rate of 8e-5.
Show plotting code
final_epoch_metrics = metrics_df.query("epoch == 3")

plt.scatter(final_epoch_metrics['initial_learning_rate'], final_epoch_metrics['eval_accuracy'])
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Validation Set Accuracy')
plt.title('Learning Rate vs. Final Epoch Validation Accuracy');
I’ll test the model (run a “sanity check”) on three made-up sentences. I don’t want to put too much weight on these results since they are cherry-picked sentences, but this model gets only one of them right, predicting all three as negative.
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.72
Predicted label: negative
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.62
Predicted label: negative
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.68
Predicted label: negative
Highest Test Set Accuracy
Show accuracy calculation loop
test_dfs = []
accs = []

for t in trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)
The learning rate with the highest test set accuracy (83%) is 5e-4. Interestingly, this was also the best learning rate for the 33M model.
This model gets 121/125 neutral predictions correct, followed by 40/54 positive predictions and 25/46 negative predictions.
make_cm(test_dfs[7])
Interestingly, this model also gets just 1/3 of my “sanity check” sentences correct, this time predicting all three as positive.
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.65
Predicted label: positive
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.64
Predicted label: positive
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.63
Predicted label: positive
Training with the Best Learning Rates 10 Times
Since different learning rates produced the highest validation set accuracy (8e-5) and the highest test set accuracy (5e-4), I’ll train 10 models with each learning rate to see if the results are consistent.
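The repeated-training code isn’t shown. Using the hypothetical train_tinystories helper sketched earlier, it might look like the snippet below; whether the seed was actually varied per run is unclear (TrainingArguments defaults to seed=42, and a fixed seed would make near-identical results unsurprising):

# re-run the same setup 10 times at the best validation-set learning rate (8e-5)
best_trainers = [train_tinystories(lr=8e-5, seed=s) for s in range(10)]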
Similar to the 33M model, 9 out of the 10 training runs resulted in the exact same final validation set accuracy. I’m not sure why this behavior persists; I’ll have to look at my Trainer setup and see if there’s something awry.
test_dfs = []
accs = []

for t in best_trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)
Similarly, 9 out of the 10 training runs resulted in the same test set accuracy. One of the models resulted in an 86% test set accuracy! This is higher than the 33M model’s best validation set accuracy.
accs = pd.Series(accs)
accs.value_counts()
0.813333 9
0.862222 1
Name: count, dtype: int64
For what it’s worth (not much), the best model (85% validation set accuracy and 86% test set accuracy) gets 2/3 of my sanity check sentences right.
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(best_trainers[0].model, text, tokz)
Probability: 0.72
Predicted label: positive
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(best_trainers[0].model, text, tokz)
Probability: 0.53
Predicted label: negative
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(best_trainers[0].model, text, tokz)
A similar pattern shows up for the 10 training runs at the 5e-4 learning rate: 9 of the 10 runs end at the same final validation set accuracy (~79%):

count    10.000000
mean      0.784804
std       0.013951
min       0.745098
25%       0.789216
50%       0.789216
75%       0.789216
max       0.789216
Name: eval_accuracy, dtype: float64
Show accuracy calculation loop
test_dfs2 = []
accs2 = []

for t in best_trainers2:
    test_df, acc = get_test_df(t)
    test_dfs2.append(test_df)
    accs2.append(acc)
The most common test set accuracy (81%) was lower than the single-run result (83%) for this learning rate (5e-4):
accs2 = pd.Series(accs2)
accs2.value_counts()
0.813333 9
0.862222 1
Name: count, dtype: int64
If I use the model with the best test set accuracy (86%), it gets the sentiment of all three of my sanity check sentences correct:
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.48
Predicted label: positive
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.54
Predicted label: negative
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.92
Predicted label: neutral
Final Thoughts
I’ll summarize my results so far, highlighting that the 8M model achieved a 7% higher test accuracy and a validation set accuracy only 1% lower than the 33M model:
| Arch | Fine-tuning Learning Rate | Best Val Acc | Best Test Acc |
|------|---------------------------|--------------|---------------|
| TinyStories-33M | 5e-4 | 86% | 79% |
| TinyStories-8M | 8e-05 | 85% | 86% |
| TinyStories-8M | 5e-4 | 79% | 86% |
These are quite rough, quick-and-dirty experiments meant to give me more practice fine-tuning language models with Hugging Face. That being said, there’s something to be said for being able to relatively easily achieve a decent validation and test set accuracy on the financial_phrasebank dataset using tiny models, something that I was not expecting!
I’m excited to continue this fine-tuning series with the 3M and 1M TinyStories models. After I finish this first round of fine-tuning, I’ll do a more thorough hyperparameter sweep (especially for the number of epochs) and see if I can squeeze a few more percentage points of accuracy out of these models. Finally, I’ll experiment with creating synthetically generated low-reading-grade-level versions of the financial_phrasebank dataset and see if fine-tuning these small models on that dataset achieves better results.
I hope you enjoyed this notebook! Follow me on Twitter @vishal_learner.