Fine-tuning TinyStories-3M on the financial_phrasebank Dataset

Categories: python, LLM, TinySentiment
In this blog post I fine-tune the TinyStories-3M model on the financial_phrasebank dataset and achieve 74%+ accuracy on the validation and test sets.
Author: Vishal Bakshi

Published: August 22, 2024

Background

In a previous blog post I fine-tuned the TinyStories-33M and TinyStories-8M models on the financial_phrasebank dataset and achieved the following results:

| Arch | Fine-tuning Learning Rate | Best Val Acc | Best Test Acc |
|---|---|---|---|
| TinyStories-33M | 5e-04 | 86% | 79% |
| TinyStories-8M | 8e-05 | 85% | 86% |
| TinyStories-8M | 5e-04 | 79% | 86% |

In this notebook, I’ll fine-tune the smaller TinyStories-3M model and see how it performs. I also suspect these smaller models might perform better on a (synthetically generated) simpler version of this dataset, which I plan to explore in a future notebook.


Show imports and setup
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer, TrainerCallback
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F

import gc
def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()
    
#model_nm = "roneneldan/TinyStories-33M"
#model_nm = "roneneldan/TinyStories-1M"
model_nm = "roneneldan/TinyStories-3M"
#model_nm = "roneneldan/TinyStories-8M"

tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x): return tokz(x["input"], padding=True, truncation=True)


Preparing Datasets

Much of the code in this section is boilerplate: tokenizing the dataset and splitting it into training, validation, and test sets.

Show load_dataset
dataset = load_dataset(
    "financial_phrasebank", "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)

dataset = dataset.rename_columns({'label':'labels', 'sentence': 'input'})
tokz.add_special_tokens({'pad_token': '[PAD]'})
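# note: left padding (set below) keeps the real tokens at the end of each sequence, which is what GPT-style models use for the final-token classification step (see linked issues)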
tokz.padding_side = "left" # https://github.com/huggingface/transformers/issues/16595 and https://www.kaggle.com/code/baekseungyun/gpt-2-with-huggingface-pytorch
tok_ds = dataset.map(tok_func, batched=True)
tok_ds[0]['input']
'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .'
tok_ds[0]['input_ids'][100:110] # first 100 elements are 50257 ('[PAD]')
[50257, 50257, 50257, 50257, 50257, 50257, 4821, 284, 17113, 837]
tokz.decode(50257), tokz.decode(4821), tokz.decode(284), tokz.decode(17113)
('[PAD]', 'According', ' to', ' Gran')
tok_ds[0]['labels']
1
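The integer labels come from the dataset’s ClassLabel feature. A quick sketch (my addition, not part of the original notebook) to confirm how the integers map to sentiment classes:

# label ids map to sentiment names via the dataset's ClassLabel feature
label_names = tok_ds.features['labels'].names
dict(enumerate(label_names))  # expected: {0: 'negative', 1: 'neutral', 2: 'positive'}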
split_dataset = tok_ds.train_test_split(test_size=225/2264, seed=42)

training_split = split_dataset['train'].train_test_split(test_size=0.2, seed=42)

train_ds = training_split['train']
eval_ds = training_split['test']
test_ds = split_dataset['test']

train_ds, eval_ds, test_ds
(Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 1631
 }),
 Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 408
 }),
 Dataset({
     features: ['input', 'labels', 'input_ids', 'attention_mask'],
     num_rows: 225
 }))
train_ds[0]['input']
'The result will also be burdened by increased fixed costs associated with operations in China , and restructuring costs in Japan .'
train_ds[0]['labels']
0

The label distributions across the three splits show a predominance of neutral (1) sentences:

train_ds.to_pandas()['labels'].value_counts() / len(train_ds)
labels
1    0.622318
2    0.251993
0    0.125690
Name: count, dtype: float64
eval_ds.to_pandas()['labels'].value_counts() / len(eval_ds)
labels
1    0.615196
2    0.257353
0    0.127451
Name: count, dtype: float64
test_ds.to_pandas()['labels'].value_counts() / len(test_ds)
labels
1    0.555556
2    0.240000
0    0.204444
Name: count, dtype: float64

Prepare for Training

Much of the code in this section is either helper functions (like get_acc, MetricCallback, or results_to_dataframe) or boilerplate code to prepare a HuggingFace trainer:

Show get_acc function
def get_acc(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).astype(np.float32).mean().item()}
Show MetricCallback function
# thanks Claude

class MetricCallback(TrainerCallback):
    def __init__(self):
        self.metrics = []
        self.current_epoch_metrics = {}

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            self.current_epoch_metrics.update(logs)

    def on_epoch_end(self, args, state, control, **kwargs):
        if hasattr(state, 'log_history') and state.log_history:
            # Get the last logged learning rate
            last_lr = state.log_history[-1].get('learning_rate', None)
        else:
            last_lr = None

        self.metrics.append({
            "epoch": state.epoch,
            "learning_rate": last_lr,
            **self.current_epoch_metrics
        })
        self.current_epoch_metrics = {}  # Reset for next epoch

    def on_train_end(self, args, state, control, **kwargs):
        # Capture final metrics after the last epoch
        if self.current_epoch_metrics:
            self.metrics.append({
                "epoch": state.num_train_epochs,
                "learning_rate": self.metrics[-1].get('learning_rate') if self.metrics else None,
                **self.current_epoch_metrics
            })
Show results_to_dataframe function
def results_to_dataframe(results, model_name):
    rows = []
    for result in results:
        initial_lr = result['learning_rate']
        for metric in result['metrics']:
            row = {
                'model_name': model_name,
                'initial_learning_rate': initial_lr,
                'current_learning_rate': metric.get('learning_rate'),
            }
            row.update(metric)
            rows.append(row)
    
    df = pd.DataFrame(rows)
    
    # Ensure specific columns are at the beginning
    first_columns = ['model_name', 'initial_learning_rate', 'current_learning_rate', 'epoch']
    other_columns = [col for col in df.columns if col not in first_columns]
    df = df[first_columns + other_columns]
    
    return df
Show make_cm function
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    
    cm = confusion_matrix(y_true=df['label_text'], y_pred=df['pred_text'], labels=['negative', 'neutral', 'positive'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive'])
    
    fig, ax = plt.subplots(figsize=(4,4))
    disp.plot(ax=ax,text_kw={'fontsize': 12}, cmap='Blues', colorbar=False);
    
    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(16)
    ax.yaxis.label.set_fontsize(16)
    
    # make tick labels larger
    ax.tick_params(axis='y', labelsize=14)
    ax.tick_params(axis='x', labelsize=14)
Show get_prediction function
def get_prediction(model, text, tokz):
    # Determine the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the appropriate device
    model = model.to(device)

    # Tokenize the input text
    inputs = tokz(text, return_tensors="pt", truncation=True, padding=True)

    # Move input tensors to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get the model's prediction
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = model(**inputs)

    # Ensure logits are on CPU for numpy operations
    logits = outputs.logits.detach().cpu()

    # Get probabilities
    probs = torch.softmax(logits, dim=-1)

    # Get the predicted class
    p_class = torch.argmax(probs, dim=-1).item()

    # Get the probability for the predicted class
    p = probs[0][p_class].item()

    labels = {0: "negative", 1: "neutral", 2: "positive"}
    
    print(f"Probability: {p:.2f}")
    print(f"Predicted label: {labels[p_class]}")
    return p_class, p
Show get_trainer function
def get_trainer(lr, bs=16):

    args = TrainingArguments(
        'outputs',
        learning_rate=lr,
        warmup_ratio=0.1,
        lr_scheduler_type='cosine',
        fp16=True,
        eval_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs*2,
        num_train_epochs=3,
        weight_decay=0.01,
        report_to='none')
    
    model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=3) # 3 labels for 3 classes
    model.resize_token_embeddings(len(tokz))
    model.config.pad_token_id = model.config.eos_token_id
    
    trainer = Trainer(model, args, train_dataset=train_ds, eval_dataset=eval_ds, 
                  tokenizer=tokz, compute_metrics=get_acc, callbacks=[metric_callback])
    
    return trainer, args
Show get_test_df function
def get_test_df(trainer):
    test_df = test_ds.to_pandas()[['input', 'labels']]
    
    preds = trainer.predict(test_ds).predictions.astype(float)
    probs = F.softmax(torch.tensor(preds), dim=1)
    predicted_classes = torch.argmax(probs, dim=1).numpy()

    test_df['predicted'] = predicted_classes
    
    test_df['match'] = test_df['labels'] == test_df['predicted']
    acc = test_df['match'].mean()
    
    label_map = {i: label_text for i, label_text in enumerate(test_ds.features["labels"].names)}
    test_df['label_text'] = test_df['labels'].apply(lambda x: label_map[x])
    test_df['pred_text'] = test_df['predicted'].apply(lambda x: label_map[x])
    
    return test_df, acc

Training: Learning Rate Sweep

While there are other hyperparameters to tune (epochs, warmup_ratio, weight_decay), I’ll focus this notebook on fine-tuning with different learning rates, starting with the same ones I used for the 33M and 8M models:

Show training loop
metrics = []
trainers = []
learning_rates = [1e-6, 1e-5, 3e-5, 5e-5, 8e-5, 1e-4, 3e-4, 5e-4, 8e-4, 1e-3, 1e-2, 1e-1]

for lr in learning_rates:
    print(f"Learning Rate: {lr}")
    
    metric_callback = MetricCallback()
    
    trainer, args = get_trainer(lr, bs=64)

    trainer.train()

    metrics.append({
        "learning_rate": lr,
        "metrics": metric_callback.metrics
        })
    
    trainers.append(trainer) 
    
    # clean up
    report_gpu()
    !rm -r /kaggle/working/outputs
metrics_df = results_to_dataframe(metrics, model_name="TinyStories-3M")
metrics_df = metrics_df.query('current_learning_rate.notna()')

Results

Highest Validation Set Accuracy

The highest validation set accuracy (73%) was obtained with a learning rate of 8e-5.

metrics_df.query('eval_accuracy == eval_accuracy.max()')
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
19 TinyStories-3M 0.00008 0.0 3.0 0.0 0.5798 620476.375 0.653758 0.732843 0.3241 1259.023 6.172 8.6789 563.784 4.494 6.090918e+12 0.757333
learning_rates[4] == 0.00008
True

This model achieved a 67% test set accuracy.

test_df, acc = get_test_df(trainers[4])
acc
/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
0.6666666666666666

This fine-tuned 3M-parameter model predicts neutral sentences best (116/125), followed by positive sentences (25/54) and, lastly, negative sentences (22/46). I’ll reiterate that the dataset contains a majority of neutral sentences, followed by positive, with negative being the least represented sentiment.

make_cm(test_df)
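The counts above can also be read directly off test_df rather than the confusion matrix plot; a small sketch (my addition) that groups correct predictions by true class:

# per-class correct predictions vs. class totals (matches the confusion matrix diagonal)
test_df.groupby('label_text')['match'].agg(['sum', 'count'])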

As the learning rate increases (starting at 1e-6), the validation set accuracy increases until it peaks at 8e-5, with a smaller second peak at 1e-3.

Show plotting code
final_epoch_metrics = metrics_df.query("epoch == 3")
plt.scatter(final_epoch_metrics['initial_learning_rate'], final_epoch_metrics['eval_accuracy']);
plt.xscale('log')
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Validation Set Accuracy')
plt.title('Learning Rate vs. Final Epoch Validation Accuracy');

I’ll test the model (run a “sanity check”) on three made-up sentences. I don’t want to read too much into these results since the sentences are cherry-picked, but this model gets 2/3 right.

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.55
Predicted label: positive
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.53
Predicted label: positive
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(trainers[4].model, text, tokz)
Probability: 0.74
Predicted label: neutral

Highest Test Set Accuracy

test_dfs = []
accs = []
for t in trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)

8e-5 is also the learning rate with the highest test set accuracy (67%).

accs
[0.52,
 0.6177777777777778,
 0.6355555555555555,
 0.6488888888888888,
 0.6666666666666666,
 0.6266666666666667,
 0.6311111111111111,
 0.6133333333333333,
 0.6133333333333333,
 0.6577777777777778,
 0.5555555555555556,
 0.5555555555555556]
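As a quick check (my addition), pairing each swept learning rate with its test set accuracy confirms the maximum sits at 8e-5:

# index the test accuracies by their learning rate and locate the best one
lr_accs = pd.Series(accs, index=learning_rates)
lr_accs.idxmax(), lr_accs.max()  # expected: (8e-05, ~0.667)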

Training with the Best Learning Rate 10 Times

I’ll train 10 models for the best-performing learning rate (8e-5) to see if the results are consistent.

LR = 8e-5 (Highest Validation and Test Set Accuracy)

learning_rates[4]
8e-05

To prevent the runs from all producing identical losses and accuracies, I’ll set a different random seed for each iteration.

Show set_seed function
import random
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
Show training loop
best_metrics = []
best_trainers = []
lr = learning_rates[4]

for i in range(10):
    set_seed(42 + i)  # Use a different seed for each run
    metric_callback = MetricCallback()
    trainer, args = get_trainer(lr=lr, bs=64)
    trainer.train()

    best_metrics.append({
        "learning_rate": lr,
        "metrics": metric_callback.metrics
        })
    
    best_trainers.append(trainer) 
    
    # clean up
    report_gpu()
    !rm -r /kaggle/working/outputs
best_metrics_df = results_to_dataframe(best_metrics, model_name="TinyStories-3M")
best_metrics_df = best_metrics_df.query('current_learning_rate.notna()')
best_metrics_df.head(3)
model_name initial_learning_rate current_learning_rate epoch learning_rate loss grad_norm eval_loss eval_accuracy eval_runtime eval_samples_per_second eval_steps_per_second train_runtime train_samples_per_second train_steps_per_second total_flos train_loss
1 TinyStories-3M 0.00008 0.000068 1.0 0.000068 0.9378 679908.0000 0.809436 0.656863 0.3223 1265.836 6.205 NaN NaN NaN NaN NaN
2 TinyStories-3M 0.00008 0.000024 2.0 0.000024 0.7057 660703.8750 0.647278 0.715686 0.3267 1248.882 6.122 NaN NaN NaN NaN NaN
3 TinyStories-3M 0.00008 0.000000 3.0 0.000000 0.5673 626431.0625 0.636278 0.713235 0.3271 1247.404 6.115 8.6262 567.227 4.521 6.090918e+12 0.736961

Across the 10 runs, the minimum validation accuracy achieved by this learning rate was about 71% and the maximum was 78%, a considerably wide range.

final_accs = best_metrics_df.query("epoch == 3")['eval_accuracy']
final_accs.describe()
count    10.000000
mean      0.743627
std       0.018421
min       0.713235
25%       0.737132
50%       0.745098
75%       0.747549
max       0.784314
Name: eval_accuracy, dtype: float64
final_accs.value_counts()
eval_accuracy
0.747549    2
0.745098    2
0.742647    1
0.713235    1
0.725490    1
0.784314    1
0.735294    1
0.750000    1
Name: count, dtype: int64
test_dfs = []
accs = []
for t in best_trainers:
    test_df, acc = get_test_df(t)
    test_dfs.append(test_df)
    accs.append(acc)

The minimum test set accuracy is 64% and the maximum is about 74% for this learning rate (8e-5).

accs = pd.Series(accs)
accs.value_counts()
0.684444    2
0.720000    2
0.693333    1
0.640000    1
0.715556    1
0.671111    1
0.737778    1
0.702222    1
Name: count, dtype: int64
accs.describe()
count    10.000000
mean      0.696889
std       0.028558
min       0.640000
25%       0.684444
50%       0.697778
75%       0.718889
max       0.737778
dtype: float64
accs
0    0.640000
1    0.693333
2    0.715556
3    0.684444
4    0.671111
5    0.737778
6    0.720000
7    0.720000
8    0.702222
9    0.684444
dtype: float64
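Putting the two spreads side by side (a sketch I’m adding here), validation accuracy averages roughly 74% and test accuracy roughly 70% across the 10 runs:

# side-by-side summary of validation vs. test accuracy across the 10 runs
pd.DataFrame({'val_acc': final_accs.reset_index(drop=True), 'test_acc': accs}).describe().loc[['mean', 'std', 'min', 'max']]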

The model with the best test set accuracy (about 74%) also gets 2/3 of my sanity-check sentences correct.

text = "The net sales went up from USD $3.4M to USD $5.6M since the same quarter last year"

_ = get_prediction(best_trainers[5].model, text, tokz)
Probability: 0.74
Predicted label: positive
text = "The net sales went down from USD $8.9M to USD $1.2M since the same quarter last year"

_ = get_prediction(best_trainers[5].model, text, tokz)
Probability: 0.73
Predicted label: positive
text = "The net sales stayed the as the same quarter last year"

_ = get_prediction(best_trainers[5].model, text, tokz)
Probability: 0.47
Predicted label: neutral

Final Thoughts

I’ll summarize my results so far for the 33M, 8M and now 3M TinyStories fine-tuned models:

| Arch | Fine-tuning Learning Rate | Best Val Acc | Best Test Acc |
|---|---|---|---|
| TinyStories-33M | 5e-04 | 86% | 79% |
| TinyStories-8M | 8e-05 | 85% | 86% |
| TinyStories-8M | 5e-04 | 79% | 86% |
| TinyStories-3M | 8e-05 | 78% | 74% |

I’ll fine-tune the final TinyStories architecture, the 1M-parameter model, in the next notebook in this series.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.