Fine-tuning TinyStories-33M on the financial_phrasebank Dataset
python
LLM
TinySentiment
In this blog post I fine-tune the TinyStories-33M model on the financial_phrasebank dataset and achieve 79%+ accuracy on the validation and test sets.
Author
Vishal Bakshi
Published
August 13, 2024
Background
In this notebook, I’ll fine-tune different TinyStories base models on the financial_phrasebank dataset to perform sentiment classification on financial news text. In a previous blog post I showed that TinyStories-Instruct-33M does not follow even simple instructions (e.g., “What is the color of an apple?”) that deviate from its training data, which motivated me to fine-tune these models.
The TinyStories paper doesn’t include any hyperparameters (of particular interest are the learning rate and batch size) used to train their models so I’ll experiment with different values.
Show load_dataset
```python
dataset = load_dataset(
    "financial_phrasebank",
    "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)

# # Source: https://huggingface.co/blog/synthetic-data-save-costs
# # create a new column with the numeric label verbalised as label_text (e.g. "positive" instead of "0")
# label_map = {
#     i: label_text
#     for i, label_text in enumerate(dataset.features["label"].names)
# }

# def add_label_text(example):
#     example["labels"] = label_map[example["label"]]
#     return example

# dataset = dataset.map(add_label_text)
```
Initial Fine-Tune
Tokenize the Dataset
The HuggingFace Trainer, if I’m not mistaken, expects the target variable to be titled `labels` and the independent variable to be titled `input` for classification tasks:
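Roughly, the tokenization and renaming step looks something like the sketch below (the arguments here are illustrative, not necessarily the exact code I ran):

```python
from transformers import AutoTokenizer

model_nm = "roneneldan/TinyStories-33M"
tokz = AutoTokenizer.from_pretrained(model_nm)

# GPT-Neo tokenizers have no pad token by default; adding one here would be
# consistent with the embedding resize to 50258 entries further down (a guess)
tokz.add_special_tokens({"pad_token": "[PAD]"})

def tok_func(batch):
    # tokenize the financial news sentences; padding is handled later by the data collator
    return tokz(batch["sentence"], truncation=True)

tok_ds = dataset.map(tok_func, batched=True)
tok_ds = tok_ds.rename_columns({"sentence": "input", "label": "labels"})
```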
The financial_phrasebank dataset doesn’t have a default test split, so I’ll use 225 sentences (~10%) as the test set. I’ll split the remaining data into an 80/20 train/validation set.
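One way to carve out those splits with the `datasets` library (the seed here is arbitrary and illustrative):

```python
# hold out 225 sentences (~10%) as a test set
splits = tok_ds.train_test_split(test_size=225, seed=42)
test_ds = splits["test"]

# split the remaining data 80/20 into train/validation sets
train_valid = splits["train"].train_test_split(test_size=0.2, seed=42)
train_ds = train_valid["train"]
eval_ds = train_valid["test"]
```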
```python
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=3)  # 3 labels for 3 classes

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokz,
    compute_metrics=get_acc,
    callbacks=[metric_callback]
)
```
Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-33M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
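The `args`, `get_acc`, and `metric_callback` objects passed to the `Trainer` above are defined elsewhere in the notebook. As a rough sketch, `args` and `get_acc` might look like this (the hyperparameter values are placeholders, not necessarily the ones I used):

```python
import numpy as np
from transformers import TrainingArguments

def get_acc(eval_pred):
    """Compute accuracy from the (logits, labels) tuple the Trainer passes in."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="outputs",
    learning_rate=1e-4,              # placeholder; the learning rate sweep below varies this
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,
    evaluation_strategy="epoch",
    report_to="none",
)
```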
model.resize_token_embeddings(len(tokz)) # do this otherwise I get an "index out of range" error
Embedding(50258, 768)
model.config.pad_token_id = model.config.eos_token_id # do this otherwise I get an error about padding tokens
trainer.train();
[153/153 00:34, Epoch 3/3]
| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.809300      | 0.450100        | 0.833333 |
| 2     | 0.192700      | 0.429838        | 0.889706 |
| 3     | 0.010500      | 0.751890        | 0.879902 |
The metric_callback stores the loss, accuracy, and runtime (among other metrics) for each epoch.
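The callback itself is defined earlier in the notebook; roughly, it's a small `TrainerCallback` along these lines (a sketch, not the verbatim implementation):

```python
from transformers import TrainerCallback

class MetricCallback(TrainerCallback):
    """Collect the metrics dict produced after each evaluation pass."""
    def __init__(self):
        self.metrics = []
        self.last_lr = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        # remember the most recently logged learning rate
        if logs and "learning_rate" in logs:
            self.last_lr = logs["learning_rate"]

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is not None:
            entry = dict(metrics)            # eval_loss, eval_accuracy, eval_runtime, ...
            entry["epoch"] = state.epoch
            entry["learning_rate"] = self.last_lr
            self.metrics.append(entry)

metric_callback = MetricCallback()
```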
Show function to convert results dict into DataFrame
```python
def results_to_dataframe(results, model_name):
    rows = []
    for result in results:
        initial_lr = result['learning_rate']
        for metric in result['metrics']:
            row = {
                'model_name': model_name,
                'initial_learning_rate': initial_lr,
                'current_learning_rate': metric.get('learning_rate'),
            }
            row.update(metric)
            rows.append(row)

    df = pd.DataFrame(rows)

    # Ensure specific columns are at the beginning
    first_columns = ['model_name', 'initial_learning_rate', 'current_learning_rate', 'epoch']
    other_columns = [col for col in df.columns if col not in first_columns]
    df = df[first_columns + other_columns]

    return df
```
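For example (illustrative usage; `results` here is a list of `{'learning_rate': ..., 'metrics': [...]}` dicts collected from training runs), this produces the `metrics_df` used for plotting further below:

```python
metrics_df = results_to_dataframe(results, model_name="TinyStories-33M")
metrics_df.head()
```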
|   | input | labels | predicted |
|---|-------|--------|-----------|
| 0 | Indigo and Somoncom serve 377,000 subscribers ... | 1 | 1 |
| 1 | The sellers were EOSS Innovationsmanagement an... | 1 | 1 |
| 2 | UPM-Kymmene said its has ` not indicated any i... | 1 | 1 |
| 3 | These financing arrangements will enable the c... | 2 | 1 |
| 4 | Fortum expects its annual capital expenditure ... | 1 | 1 |
This training run produced a fine-tuned TinyStories-33M model that predicted the sentiment of 85% of the 225 test sentences correctly.
```python
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    cm = confusion_matrix(
        y_true=df['label_text'],
        y_pred=df['pred_text'],
        labels=['negative', 'neutral', 'positive'])

    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive'])

    fig, ax = plt.subplots(figsize=(4,4))
    disp.plot(ax=ax, text_kw={'fontsize': 12}, cmap='Blues', colorbar=False);

    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(16)
    ax.yaxis.label.set_fontsize(16)

    # make tick labels larger
    ax.tick_params(axis='y', labelsize=14)
    ax.tick_params(axis='x', labelsize=14)
```
The model predicted neutral sentences with the highest accuracy (122/125), followed by negative sentences (32/46) and finally positive sentences (37/54).
make_cm(test_df)
As a final sanity check, I’ll prompt the model with a made-up financial news sentence and see if it correctly classifies it (as positive, negative, or neutral):
Show function to generate a prediction
```python
def get_prediction(model, text, tokz):
    # Determine the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the appropriate device
    model = model.to(device)

    # Tokenize the input text
    inputs = tokz(text, return_tensors="pt", truncation=True, padding=True)

    # Move input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the model's prediction
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = model(**inputs)

    # Ensure logits are on CPU for numpy operations
    logits = outputs.logits.detach().cpu()

    # Get probabilities
    probs = torch.softmax(logits, dim=-1)

    # Get the predicted class
    p_class = torch.argmax(probs, dim=-1).item()

    # Get the probability for the predicted class
    p = probs[0][p_class].item()

    labels = {0: "negative", 1: "neutral", 2: "positive"}

    print(f"Probability: {p:.2f}")
    print(f"Predicted label: {labels[p_class]}")

    return p_class, p
```
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(model, text, tokz)
Probability: 0.60
Predicted label: positive
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(model, text, tokz)
Probability: 1.00
Predicted label: negative
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(model, text, tokz)
Probability: 0.93
Predicted label: neutral
Learning Rate Sweep
Let’s see if I can beat the validation accuracy of 89% and test set accuracy of 85% by using different learning rates. I’ll wrap up some of the training code (there’s so much of it!!) in helper functions so I can loop through my learning rates. I’ve also created a helper function to get the test set accuracy.
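In outline, the sweep and the test-set helper look something like the sketch below (`get_trainer` is a hypothetical stand-in for the wrapped-up training code, and the learning rate grid shown is illustrative):

```python
import pandas as pd

def get_test_df(trainer):
    """Hypothetical helper: run a trained model over the 225 held-out test sentences."""
    preds = trainer.predict(test_ds)
    labels = {0: "negative", 1: "neutral", 2: "positive"}
    df = pd.DataFrame({
        "input": test_ds["input"],
        "labels": preds.label_ids,
        "predicted": preds.predictions.argmax(axis=-1),
    })
    df["label_text"] = df["labels"].map(labels)
    df["pred_text"] = df["predicted"].map(labels)
    acc = (df["labels"] == df["predicted"]).mean()
    return df, acc

# illustrative grid of 10 learning rates
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 5e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]

trainers, results = [], []
for lr in learning_rates:
    trainer, callback = get_trainer(lr)   # hypothetical: fresh model, args and callback per run
    trainer.train()
    trainers.append(trainer)
    results.append({"learning_rate": lr, "metrics": callback.metrics})
```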
The corresponding model has a 79% accuracy on the test set.
```python
test_df, acc = get_test_df(trainers[7])
acc
```
0.7911111111111111
Similar to before, the model does well at predicting neutral sentences (119/125). This time, it is more accurate on positive sentences (41/54) than on negative sentences, where it gets less than 50% correct (18/46).
make_cm(test_df)
Interestingly enough, this model classifies my made-up “positive” sentence incorrectly (while getting the “neutral” and “negative” made-up ones correct):
text ="The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.49
Predicted label: neutral
text ="The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.36
Predicted label: negative
text ="The net sales stayed the as the same quarter last year"_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.54
Predicted label: neutral
Relationship between Learning Rate and Validation Accuracy
The validation set accuracy starts low, increases to a peak at lr=0.0005 and then decreases again. I’ve used a log-scale on the x-axis to more easily view all of the data.
Show plotting code
```python
final_epoch_metrics = metrics_df.query("epoch == 3")

plt.scatter(final_epoch_metrics['initial_learning_rate'], final_epoch_metrics['eval_accuracy']);
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Validation Set Accuracy')
plt.title('Learning Rate vs. Final Epoch Validation Accuracy');
```
Learning Rate with Highest Test Set Accuracy
```python
test_dfs = []
accs = []

for t in trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)
```
The learning rate of 0.0005 also had the highest test set accuracy.
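Pairing each run's learning rate with its test-set accuracy makes that easy to check (a small illustrative snippet, not my original code):

```python
# line up the sweep's learning rates with the accuracies collected above
acc_df = pd.DataFrame({"learning_rate": learning_rates, "test_accuracy": accs})
acc_df.sort_values("test_accuracy", ascending=False)
```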
I would say this model’s training is quite consistent! Almost too consistent? Something feels weird about it getting the same accuracy for 9 out of the 10 runs.
test_df.to_csv("TinyStories-33M_test set predictions_LR5e-4.csv", index=False)
Final Thoughts
I entered this experiment expecting TinyStories-33M to perform poorly on sentiment classification, and am surprised (even shocked?) that it’s achieving 80%+ accuracy consistently. Granted, I have a small test set (and validation set), but these results are promising.
I also didn’t do an exhaustive hyperparameter search (weight decay, learning rate warmup, number of epochs) so maybe I could have increased the performance of the model. I’ll leave that for a future exercise.
For now, in my next notebook/blog post related to this project (that I’m calling TinySentiment), I’ll fine-tune three smaller base models (TinyStories-8M, TinyStories-3M, and TinyStories-1M) on the financial_phrasebank dataset and compare their results with the 33M model.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.