Fine-tuning TinyStories-8M on the financial_phrasebank Dataset

python
LLM
TinySentiment
In this blog post I fine-tune the smaller TinyStories-8M model on the financial_phrasebank dataset, achieving 86% accuracy on the test set and 85% accuracy on the validation set.
Author

Vishal Bakshi

Published

August 19, 2024

Background

In a previous blog post I fine-tuned the TinyStories-33M model on the financial_phrasebank dataset and achieved ~85% accuracy on the validation set and ~80% accuracy on the test set.

In this notebook, I’ll fine-tune the much smaller TinyStories-8M model and see how it performs; I expect it to perform worse. In future notebooks, I’ll also fine-tune the 3M and 1M TinyStories models. I also suspect these models might perform better on a (synthetically generated) simpler version of this dataset, which I plan to explore in a future notebook.


Show imports and setup
#!pip install accelerate evaluate datasets -Uqq
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer, TrainerCallback
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F

import gc
def report_gpu():
    # print current GPU processes, then release cached GPU memory between training runs
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()
    
#model_nm = "roneneldan/TinyStories-33M"
#model_nm = "roneneldan/TinyStories-1M"
#model_nm = "roneneldan/TinyStories-3M"
model_nm = "roneneldan/TinyStories-8M"

tokz = AutoTokenizer.from_pretrained(model_nm)
# tokenize the "input" column, padding/truncating each batch to a uniform length
def tok_func(x): return tokz(x["input"], padding=True, truncation=True)


Preparing Datasets

Much of the code in this section is boilerplate for tokenizing the dataset and splitting it into training, validation, and test sets.

Show load_dataset
dataset = load_dataset(
    "financial_phrasebank", "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)

dataset = dataset.rename_columns({'label':'labels', 'sentence': 'input'})
tokz.add_special_tokens({'pad_token': '[PAD]'})
tokz.padding_side = "left" # https://github.com/huggingface/transformers/issues/16595 and https://www.kaggle.com/code/baekseungyun/gpt-2-with-huggingface-pytorch
tok_ds = dataset.map(tok_func, batched=True)
tok_ds[0]['input']
'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .'
tok_ds[0]['input_ids'][100:110] # first 100 elements are 50257 ('[PAD]')
[50257, 50257, 50257, 50257, 50257, 50257, 4821, 284, 17113, 837]
tokz.decode(50257), tokz.decode(4821), tokz.decode(284), tokz.decode(17113)
('[PAD]', 'According', ' to', ' Gran')
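The padding positions are masked out via the attention mask, so they don’t affect the model’s predictions. A quick check of this (my own addition, not part of the original run) would look like:

tok_ds[0]['attention_mask'][100:110] # expect 0s at the [PAD] positions, then 1s once the real tokens start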
tok_ds[0]['labels']
1
split_dataset = tok_ds.train_test_split(test_size=225/2264, seed=42)

training_split = split_dataset['train'].train_test_split(test_size=0.2, seed=42)

train_ds = training_split['train']
eval_ds = training_split['test']
test_ds = split_dataset['test']

train_ds, eval_ds, test_ds
(Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 1631
 }),
 Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 408
 }),
 Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 225
 }))
train_ds[0]['input']
'The result will also be burdened by increased fixed costs associated with operations in China , and restructuring costs in Japan .'
train_ds[0]['labels']
0

The dataset distributions show a predominance of neutral (1) sentences:

train_ds.to_pandas()['labels'].value_counts() / len(train_ds)
labels
1    0.622318
2    0.251993
0    0.125690
Name: count, dtype: float64
eval_ds.to_pandas()['labels'].value_counts() / len(eval_ds)
labels
1    0.615196
2    0.257353
0    0.127451
Name: count, dtype: float64
test_ds.to_pandas()['labels'].value_counts() / len(test_ds)
labels
1    0.555556
2    0.240000
0    0.204444
Name: count, dtype: float64

Prepare for Training

Much of the code in this section is either helper functions (like get_acc, MetricCallback, or results_to_dataframe) or boilerplate code to prepare a HuggingFace trainer:

def get_acc(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).astype(np.float32).mean().item()}
Show MetricCallback code
# thanks Claude

class MetricCallback(TrainerCallback):
    def __init__(self):
        self.metrics = []
        self.current_epoch_metrics = {}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.current_epoch_metrics.update(logs)

    def on_epoch_end(self, args, state, control, **kwargs):
        if hasattr(state, 'log_history') and state.log_history:
            # Get the last logged learning rate
            last_lr = state.log_history[-1].get('learning_rate', None)
        else:
            last_lr = None

        self.metrics.append({
            "epoch": state.epoch,
            "learning_rate": last_lr,
            **self.current_epoch_metrics
        })
        self.current_epoch_metrics = {}  # Reset for next epoch

    def on_train_end(self, args, state, control, **kwargs):
        # Capture final metrics after the last epoch
        if self.current_epoch_metrics:
            self.metrics.append({
                "epoch": state.num_train_epochs,
                "learning_rate": self.metrics[-1].get('learning_rate') if self.metrics else None,
                **self.current_epoch_metrics
            })
Show function to convert results dict into DataFrame
def results_to_dataframe(results, model_name):
    rows = []
    for result in results:
        initial_lr = result['learning_rate']
        for metric in result['metrics']:
            row = {
                'model_name': model_name,
                'initial_learning_rate': initial_lr,
                'current_learning_rate': metric.get('learning_rate'),
            }
            row.update(metric)
            rows.append(row)
    
    df = pd.DataFrame(rows)
    
    # Ensure specific columns are at the beginning
    first_columns = ['model_name', 'initial_learning_rate', 'current_learning_rate', 'epoch']
    other_columns = [col for col in df.columns if col not in first_columns]
    df = df[first_columns + other_columns]
    
    return df
Show function to make confusion matrix
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    
    cm = confusion_matrix(y_true=df['label_text'], y_pred=df['pred_text'], labels=['negative', 'neutral', 'positive'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive'])
    
    fig, ax = plt.subplots(figsize=(4,4))
    disp.plot(ax=ax,text_kw={'fontsize': 12}, cmap='Blues', colorbar=False);
    
    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(16)
    ax.yaxis.label.set_fontsize(16)
    
    # make tick labels larger
    ax.tick_params(axis='y', labelsize=14)
    ax.tick_params(axis='x', labelsize=14)
Show function to generate a prediction
def get_prediction(model, text, tokz):
    # Determine the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the appropriate device
    model = model.to(device)

    # Tokenize the input text
    inputs = tokz(text, return_tensors="pt", truncation=True, padding=True)

    # Move input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the model's prediction
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = model(**inputs)

    # Ensure logits are on CPU for numpy operations
    logits = outputs.logits.detach().cpu()

    # Get probabilities
    probs = torch.softmax(logits, dim=-1)

    # Get the predicted class
    p_class = torch.argmax(probs, dim=-1).item()

    # Get the probability for the predicted class
    p = probs[0][p_class].item()

    labels = {0: "negative", 1: "neutral", 2: "positive"}
    
    print(f"Probability: {p:.2f}")
    print(f"Predicted label: {labels[p_class]}")
    return p_class, p
Show function to prep trainer
def get_trainer(lr, bs=16):

    args = TrainingArguments(
        'outputs',
        learning_rate=lr,
        warmup_ratio=0.1,
        lr_scheduler_type='cosine',
        fp16=True,
        eval_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs*2,
        num_train_epochs=3,
        weight_decay=0.01,
        report_to='none')
    
    model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=3) # 3 labels for 3 classes
    model.resize_token_embeddings(len(tokz)) # resize embeddings to account for the added [PAD] token
    model.config.pad_token_id = model.config.eos_token_id
    
    trainer = Trainer(model, args, train_dataset=train_ds, eval_dataset=eval_ds, 
                  tokenizer=tokz, compute_metrics=get_acc, callbacks=[metric_callback])
    
    return trainer, args
Show function to get test set accuracy
def get_test_df(trainer):
    test_df = test_ds.to_pandas()[['input', 'labels']]
    
    preds = trainer.predict(test_ds).predictions.astype(float)
    probs = F.softmax(torch.tensor(preds), dim=1)
    predicted_classes = torch.argmax(probs, dim=1).numpy()

    test_df['predicted'] = predicted_classes
    
    test_df['match'] = test_df['labels'] == test_df['predicted']
    acc = test_df['match'].mean()
    
    label_map = {i: label_text for i, label_text in enumerate(test_ds.features["labels"].names)}
    test_df['label_text'] = test_df['labels'].apply(lambda x: label_map[x])
    test_df['pred_text'] = test_df['predicted'].apply(lambda x: label_map[x])
    
    return test_df, acc

Training: Learning Rate Sweep

While there are other hyperparameters to tune (warmup_ratio, weight_decay), I’ll focus this notebook on fine-tuning with different learning rates. I’ll start with the same learning rates that I used for the 33M model:

Show training loop
metrics = []
trainers = []
learning_rates = [1e-6, 1e-5, 3e-5, 5e-5, 8e-5, 1e-4, 3e-4, 5e-4, 8e-4, 1e-3, 1e-2, 1e-1]

for lr in learning_rates:
    print(f"Learning Rate: {lr}")
    
    metric_callback = MetricCallback()
    
    trainer, args = get_trainer(lr, bs=64)

    trainer.train()

    metrics.append({
        "learning_rate": lr,
        "metrics": metric_callback.metrics
        })
    
    trainers.append(trainer) 
    
    # clean up
    report_gpu()
    !rm -r /kaggle/working/outputs
metrics_df = results_to_dataframe(metrics, model_name="TinyStories-8M")
metrics_df = metrics_df.query('current_learning_rate.notna()')

Results

Highest Validation Set Accuracy

The highest validation set accuracy (82%) was obtained with a learning rate of 8e-5.

metrics_df.query('eval_accuracy == eval_accuracy.max()')
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
19 TinyStories-8M 0.00008 0.0 3.0 0.0 0.2323 632259.5 0.44796 0.823529 0.4288 951.429 4.664 12.5028 391.353 3.119 2.427998e+13 0.49546
learning_rates[4]
8e-05

This model actually has a higher test accuracy than the 33M model (81% > 79%)—a result that I was not expecting!

test_df, acc = get_test_df(trainers[4])
acc
0.8133333333333334

This fine-tuned 8M-parameter model predicts neutral sentences best (117/125), followed by positive sentences (39/54) and, lastly, negative sentences (27/46). Interestingly, this ordering mirrors the dataset’s class distribution: neutral sentences are the majority, followed by positive, with negative the least represented.

make_cm(test_df)

As the learning rate increases (starting at 1e-6) the validation set accuracy increases until it reaches a peak at a learning rate of 8e-5.

Show plotting code
final_epoch_metrics = metrics_df.query("epoch == 3")
plt.scatter(final_epoch_metrics['initial_learning_rate'], final_epoch_metrics['eval_accuracy']);
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Validation Set Accuracy')
plt.title('Learning Rate vs. Final Epoch Validation Accuracy');

I’ll test the model (run a “sanity check”) on three made-up sentences. I don’t want to read too much into these results since they are cherry-picked sentences, but this model predicts all three as negative and therefore gets only one of them right.

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.72
Predicted label: negative
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.62
Predicted label: negative
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.68
Predicted label: negative

Highest Test Set Accuracy

Show accuracy calculation loop
test_dfs = []
accs = []
for t in trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)

The learning rate with the highest test set accuracy (83%) is 5e-4. Interestingly, this was also the best learning rate for the 33M model.

accs
[0.5733333333333334,
 0.6844444444444444,
 0.7333333333333333,
 0.7822222222222223,
 0.8133333333333334,
 0.8177777777777778,
 0.7333333333333333,
 0.8266666666666667,
 0.6755555555555556,
 0.6888888888888889,
 0.5555555555555556,
 0.5555555555555556]
learning_rates[7]
0.0005
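
The index above was picked out by inspecting the list; a small sketch (not in the original notebook) to recover it programmatically with the numpy import from earlier:

best_idx = int(np.argmax(accs))  # 7 for the run above
print(best_idx, learning_rates[best_idx], accs[best_idx])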

This learning rate had a validation set accuracy of about 79%.

final_epoch_metrics.query("initial_learning_rate == 0.0005")
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
31 TinyStories-8M 0.0005 0.0 3.0 0.0 0.3891 264843.125 0.543861 0.789216 0.4398 927.714 4.548 12.8595 380.498 3.033 2.427998e+13 0.846817

This model gets 121/125 neutral predictions correct, followed by 40/54 positive predictions and 25/46 negative predictions.

make_cm(test_dfs[7])

Like the previous model, it gets only one of my three “sanity check” sentences right, this time predicting all three as positive.

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.65
Predicted label: positive
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.64
Predicted label: positive
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(trainers[7].model, text, tokz)
Probability: 0.63
Predicted label: positive

Training with the Best Learning Rates 10 Times

Since different learning rates produced the highest validation set accuracy (8e-5) and the highest test set accuracy (5e-4), I’ll train 10 models at each learning rate to see if the results are consistent.

LR = 8e-5 (Highest Validation Set Accuracy)

learning_rates[4]
8e-05
Show training loop
best_metrics = []
best_trainers = []
lr = learning_rates[4]

for i in range(10):
    
    metric_callback = MetricCallback()
    trainer, args = get_trainer(lr=lr, bs=64)
    trainer.train()

    best_metrics.append({
        "learning_rate": lr,
        "metrics": metric_callback.metrics
        })
    
    best_trainers.append(trainer) 
    
    # clean up
    report_gpu()
    !rm -r /kaggle/working/outputs
best_metrics_df = results_to_dataframe(best_metrics, model_name="TinyStories-8M")
best_metrics_df = best_metrics_df.query('current_learning_rate.notna()')
best_metrics_df.head(3)
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
1 TinyStories-8M 0.00008 0.000068 1.0 0.000068 0.9671 313453.1875 0.689426 0.725490 0.4443 918.268 4.501 NaN NaN NaN NaN NaN
2 TinyStories-8M 0.00008 0.000024 2.0 0.000024 0.4975 948153.7500 0.490408 0.799020 0.4386 930.177 4.560 NaN NaN NaN NaN NaN
3 TinyStories-8M 0.00008 0.000000 3.0 0.000000 0.2880 589967.1875 0.427528 0.848039 0.4433 920.448 4.512 12.7188 384.706 3.066 2.427998e+13 0.584194

Similar to the 33M model, 9 out of the 10 training runs resulted in the exact same final validation set accuracy. I’m not sure why this behavior persists; I’ll have to look at my Trainer setup and see if something is awry.
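
One candidate explanation (an assumption on my part that I haven’t verified here): TrainingArguments uses a fixed default seed of 42, so repeated runs differ only through non-deterministic GPU operations. A sketch of how I might vary the seed per run to test this, assuming get_trainer were extended with a hypothetical seed parameter that it forwards to TrainingArguments(seed=seed):

# hypothetical variation of the training loop above: vary the seed on each run
for i in range(10):
    metric_callback = MetricCallback()
    trainer, args = get_trainer(lr=lr, bs=64, seed=42 + i)  # `seed` is a hypothetical extra argument
    trainer.train()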

final_accs = best_metrics_df.query("epoch == 3")['eval_accuracy']
final_accs.describe()
count    10.000000
mean      0.825980
std       0.007751
min       0.823529
25%       0.823529
50%       0.823529
75%       0.823529
max       0.848039
Name: eval_accuracy, dtype: float64
final_accs.value_counts()
eval_accuracy
0.823529    9
0.848039    1
Name: count, dtype: int64
Show accuracy calculation loop
test_dfs = []
accs = []
for t in best_trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)

Similarly, 9 out of the 10 training runs resulted in the same test set accuracy. One model reached an 86% test set accuracy, which is even higher than the 33M model’s best validation set accuracy.

accs = pd.Series(accs)
accs.value_counts()
0.813333    9
0.862222    1
Name: count, dtype: int64

For what it’s worth (not much), the best model (85% validation set and 86% test set accuracy) gets two of my three sanity check sentences right.

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(best_trainers[0].model, text, tokz)
Probability: 0.72
Predicted label: positive
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(best_trainers[0].model, text, tokz)
Probability: 0.53
Predicted label: negative
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(best_trainers[0].model, text, tokz)
Probability: 0.59
Predicted label: positive

LR = 5e-4 (Highest Test Set Accuracy)

learning_rates[7] == 5e-4
True
Show training loop
best_metrics2 = []
best_trainers2 = []
lr = learning_rates[7]

for i in range(10):
    
    metric_callback = MetricCallback()
    trainer, args = get_trainer(lr=lr, bs=64)
    trainer.train()

    best_metrics2.append({
        "learning_rate": lr,
        "metrics": metric_callback.metrics
        })
    
    best_trainers2.append(trainer) 
    
    # clean up
    report_gpu()
    !rm -r /kaggle/working/outputs
best_metrics_df2 = results_to_dataframe(best_metrics2, model_name="TinyStories-8M")
best_metrics_df2 = best_metrics_df2.query('current_learning_rate.notna()')
best_metrics_df2.head(3)
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
1 TinyStories-8M 0.0005 0.000423 1.0 0.000423 1.6448 869443.6875 0.980524 0.502451 0.4321 944.212 4.628 NaN NaN NaN NaN NaN
2 TinyStories-8M 0.0005 0.000152 2.0 0.000152 0.7600 829335.1250 0.678928 0.725490 0.4317 945.012 4.632 NaN NaN NaN NaN NaN
3 TinyStories-8M 0.0005 0.000000 3.0 0.000000 0.5742 317736.5625 0.598565 0.745098 0.4347 938.664 4.601 12.7041 385.151 3.07 2.427998e+13 0.993002

I achieve the same validation set accuracy (79%) 9 out of 10 times:

final_accs2 = best_metrics_df2.query("epoch == 3")['eval_accuracy']
final_accs2.describe()
count    10.000000
mean      0.784804
std       0.013951
min       0.745098
25%       0.789216
50%       0.789216
75%       0.789216
max       0.789216
Name: eval_accuracy, dtype: float64
Show accuracy calculation loop
test_dfs2 = []
accs2 = []
for t in best_trainers2:
    test_df, acc = get_test_df(t)
    test_dfs2.append(test_df)
    accs2.append(acc)

The most common test set accuracy (81%) was lower than the 83% from the earlier single run at this learning rate (5e-4):

accs2 = pd.Series(accs2)
accs2.value_counts()
0.813333    9
0.862222    1
Name: count, dtype: int64

If I use the model with the best test set accuracy (86%), it gets the sentiment of all three of my sanity check sentences correct:

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.48
Predicted label: positive
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.54
Predicted label: negative
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(best_trainers2[0].model, text, tokz)
Probability: 0.92
Predicted label: neutral

Final Thoughts

I’ll summarize my results so far, highlighting that the 8M model achieved a test accuracy 7 percentage points higher, and a validation set accuracy only 1 point lower, than the 33M model:

Arch Fine-tuning Learning Rate Best Val Acc Best Test Acc
TinyStories-33M 5e-4 86% 79%
TinyStories-8M 8e-05 85% 86%
TinyStories-8M 5e-4 79% 86%

These are quick-and-dirty experiments meant to give me more practice fine-tuning language models with HuggingFace. That said, it’s notable that tiny models can relatively easily achieve decent validation and test set accuracy on the financial_phrasebank dataset, something I was not expecting!

I’m excited to continue this fine-tuning series with the 3M and 1M TinyStories models. After I finish this first round of fine-tuning, I’ll do a more thorough hyperparameter sweep (especially over the number of epochs) and see if I can squeeze a few more percentage points of accuracy out of these models. Finally, I’ll experiment with creating synthetically generated, low-reading-grade-level versions of the financial_phrasebank dataset and see if fine-tuning these small models on that data achieves better results.

I hope you enjoyed this notebook! Follow me on Twitter @vishal_learner.