Using HuggingFace Transformers for Tabular Titanic Data

deep learning
python
In this notebook I run a fun experiment to see how well an NLP model can make predictions from tabular data.
Author: Vishal Bakshi

Published: August 23, 2023

Background

In this blog post, I’ll run a fun little experiment that uses the code Jeremy Howard wrote in Getting started with NLP for absolute beginners to train an NLP classifier to predict whether or not a passenger on the Titanic survived.

I’ll start by acknowledging the obvious—that training an NLP model for tabular data that doesn’t contain much natural language is probably not going to give great results. However, it gives me an opportunity to use a simple dataset (that I’ve worked with before and am familiar with) to train a model following a process that is new to me (using the HuggingFace library). With that disclaimer out of the way, let’s jump in!

Plan of Attack

Jeremy’s example uses tabular data with columns containing natural language and some additional data to predict values between 0 and 1 (0 means the two phrases are not similar in meaning, 1 means they are similar). Fundamentally, my dataset works in the same way—I have a bunch of columns describing features of the passengers and then a value of 0 (died) or 1 (survived) that I’m trying to predict.

Preparing the Data

The data preparation step will be similar—I will concatenate multiple columns with a separator between each term.

Training Process

I’ll use the same model (and thus tokenizer) as Jeremy did, so the training setup will be much of the same.

Metrics

Jeremy used Pearson’s correlation coefficient (as specified by the Kaggle competition the dataset came from). In my case, I’ll need to figure out how to pass accuracy to the HuggingFace Trainer.

Load and Prep the Data

I’ll start by using the boilerplate code Jeremy has provided to get data from Kaggle.

from pathlib import Path

# Paste the contents of your kaggle.json API token here (only needed outside Kaggle)
creds = ''

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle: path = Path("../input/titanic")
else:
  path = Path('titanic')
  if not path.exists():
    import zipfile, kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 1.45MB/s]
# load the training data and look at it
import torch, numpy as np, pandas as pd

df = pd.read_csv(path/'train.csv')
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 12 columns

The only data cleaning I’ll do is fill missing values with the mode of each column:

modes = df.mode().iloc[0]
df.fillna(modes, inplace=True)
df.isna().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
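
To make the mode-filling step concrete, here is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its values are illustrative assumptions, not Titanic data). `DataFrame.mode()` returns a DataFrame because a column can have several tied modes, which is why `.iloc[0]` is used to keep the first mode of each column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic set
toy = pd.DataFrame({
    "Age": [22.0, None, 22.0, 35.0],
    "Cabin": ["C85", None, "C85", None],
})

# mode() returns one row per tied mode; row 0 holds the first mode of each column
toy_modes = toy.mode().iloc[0]

# fillna with a Series fills each column using the value at the matching label
filled = toy.fillna(toy_modes)
print(filled)
```

A side effect worth keeping in mind is that every passenger without a recorded cabin receives the same, most common cabin value, so that value appears in many of the input strings later on.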

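I’ll also cast my dependent variable to float to resolve an error I got during training ("mse_cuda" not implemented for 'Long'). Since the model is loaded with num_labels=1 later on, Transformers treats the task as regression and uses an MSE loss, which needs floating-point labels.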

df['Survived'] = df['Survived'].astype(float)

Next, I’ll create an input column that concatenates each passenger’s features into a single string for the model:

df['input'] = 'Pclass: ' + df.Pclass.apply(str) +\
 '; Name: ' + df.Name + '; Sex: ' + df.Sex + '; Age: ' + df.Age.apply(str) +\
  '; SibSp: ' + df.SibSp.apply(str) + '; Parch: ' + df.Parch.apply(str) +\
  '; Ticket: ' + df.Ticket + '; Fare: ' + df.Fare.apply(str) + \
  '; Cabin: ' + df.Cabin + '; Embarked: ' + df.Embarked
df['input'][0]
'Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S'
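
The same "column: value" pattern can also be written as a small reusable helper. This is only a sketch: `row_to_text` and `FIELDS` are hypothetical names introduced for illustration, and the passenger is passed as a plain dict built from the values shown above.

```python
# Column order used in the concatenated input above
FIELDS = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch",
          "Ticket", "Fare", "Cabin", "Embarked"]

def row_to_text(row, fields=FIELDS, sep="; "):
    """Join 'column: value' pairs into one string (hypothetical helper)."""
    return sep.join(f"{name}: {row[name]}" for name in fields)

# First passenger, with the mode-filled Cabin value shown in the output above
passenger = {"Pclass": 3, "Name": "Braund, Mr. Owen Harris", "Sex": "male",
             "Age": 22.0, "SibSp": 1, "Parch": 0, "Ticket": "A/5 21171",
             "Fare": 7.25, "Cabin": "B96 B98", "Embarked": "S"}

text = row_to_text(passenger)
print(text)
```

Because a DataFrame row also supports `row[name]` lookups, a helper like this could be applied with `df.apply(row_to_text, axis=1)`, which makes it easier to add or reorder columns than the chained string concatenation.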

Tokenization

! pip install datasets transformers[sentencepiece] accelerate -U
from datasets import Dataset,DatasetDict

I’ll remove 100 rows of data to serve as a test set for final predictions after the model is trained.

# create a random sample of 100 passengers
eval_df = df.sample(100)
eval_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | input |
| 709 | 710 | 1.0 | 3 | Moubarek, Master. Halim Gonios ("William George") | male | 24.0 | 1 | 1 | 2661 | 15.2458 | B96 B98 | C | Pclass: 3; Name: Moubarek, Master. Halim Gonio... |
| 439 | 440 | 0.0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5000 | B96 B98 | S | Pclass: 2; Name: Kvillner, Mr. Johan Henrik Jo... |
| 840 | 841 | 0.0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.9250 | B96 B98 | S | Pclass: 3; Name: Alhomaki, Mr. Ilmari Rudolf; ... |
| 720 | 721 | 1.0 | 2 | Harper, Miss. Annie Jessie "Nina" | female | 6.0 | 0 | 1 | 248727 | 33.0000 | B96 B98 | S | Pclass: 2; Name: Harper, Miss. Annie Jessie "N... |
| 39 | 40 | 1.0 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | B96 B98 | C | Pclass: 3; Name: Nicola-Yarred, Miss. Jamila; ... |

I’ll drop these 100 rows from the original DataFrame, which I’ll then use for the training and validation sets.

df = df.drop(eval_df.index)
df.shape
(791, 13)
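
Because `df.sample(100)` is unseeded, the held-out rows (and any accuracy computed on them) change from run to run. A minimal sketch of a reproducible variant follows; `random_state=42` and the stand-in `demo` frame are assumptions for illustration, not what was used in this notebook.

```python
import pandas as pd

# Stand-in frame with the same number of rows as train.csv (891); values are dummies
demo = pd.DataFrame({"PassengerId": range(1, 892)})

# random_state pins the sample so the same 100 rows are drawn on every run
held_out = demo.sample(100, random_state=42)

# Dropping by index leaves the remaining rows for training and validation
remaining = demo.drop(held_out.index)

print(held_out.shape, remaining.shape)
```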
ds = Dataset.from_pandas(df)
eval_ds = Dataset.from_pandas(eval_df)
ds
Dataset({
    features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
    num_rows: 791
})
eval_ds
Dataset({
    features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
    num_rows: 100
})

I’ll use the same model as in Jeremy’s example:

model_nm = 'microsoft/deberta-v3-small'
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

I’ll check the tokenizer:

tokz.tokenize("We are about to tokenize this dataset!")
['▁We', '▁are', '▁about', '▁to', '▁token', 'ize', '▁this', '▁dataset', '!']
# function to tokenize inputs
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
eval_ds = eval_ds.map(tok_func, batched=True)
row = tok_ds[0]
row['input'], row['input_ids']
('Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S',
 [1, 916, 4478, 294, 404, 346, 5445, 294, 24448, 407, 261, 945, 260, 12980, 6452, 346,
  23165, 294, 2844, 346, 5166, 294, 1460, 260, 693, 346, 42209, 32154, 294, 376, 346,
  916, 22702, 294, 767, 346, 14169, 294, 336, 320, 524, 1259, 30877, 346, 40557, 294,
  574, 260, 1883, 346, 22936, 294, 736, 8971, 736, 8454, 346, 77030, 569, 294, 662, 2])

I’ll look up the vocabulary index of a few tokens from the input to check that they appear in the input_ids column:

tokz.vocab['▁P']
916
tokz.vocab['▁3']
404
tokz.vocab['▁Name']
5445

Transformers expects the dependent variable to be named labels:

tok_ds = tok_ds.rename_columns({'Survived':'labels'})
tok_ds
Dataset({
    features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 791
})

Preparing Training and Validation Sets

Since I cut into my training and validation set by pulling out a test set, I’ll use a smaller split for the validation set.

dds = tok_ds.train_test_split(0.15, seed=42)
dds
DatasetDict({
    train: Dataset({
        features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 672
    })
    test: Dataset({
        features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 119
    })
})
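
The split sizes above follow from the 791 rows that remained after the test set was removed. As a sketch of the arithmetic, assuming `train_test_split` rounds the test share up and the train share down (the convention scikit-learn uses, which `datasets` appears to follow):

```python
import math

n_rows = 791      # rows left after removing the 100-row test set
test_frac = 0.15  # fraction passed to train_test_split

# Assumed rounding: test size rounds up, train size rounds down
n_valid = math.ceil(test_frac * n_rows)
n_train = math.floor((1 - test_frac) * n_rows)

print(n_train, n_valid)  # 672 119
```

With only 119 validation rows, each misclassified passenger moves validation accuracy by a little under one percentage point.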

Creating an Accuracy Function

Since my dependent variable is binary (0 or 1), I’ll create an accuracy function that does the following:

  • If a prediction is greater than 0.5, classify it as 1; otherwise, classify it as 0.
  • Compare the predicted classes to the labels and take the mean of the resulting boolean array, which is the fraction of correctly predicted values.
def calculate_accuracy(preds, labels):
  return torch.tensor(((preds>0.5)==labels)).float().mean().item()

# Transformers want a dictionary for the metric
def acc_d(eval_pred): return {'accuracy': calculate_accuracy(*eval_pred) }
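
One thing to watch with a metric like this is array shape. A single-output model returns predictions of shape `(N, 1)` (as in the `(100, 1)` shape printed further down), while labels are typically a flat `(N,)` array, and comparing those two shapes directly broadcasts to an `(N, N)` grid. The sketch below is a defensive variant, not the function used above: `accuracy_flat` is a hypothetical name, and the arrays are made-up values.

```python
import numpy as np

def accuracy_flat(preds, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 labels."""
    preds = np.asarray(preds).reshape(-1)    # (N, 1) -> (N,)
    labels = np.asarray(labels).reshape(-1)  # keep labels flat as well
    return float(((preds > threshold) == (labels == 1)).mean())

# Made-up example: column-vector predictions, flat labels
preds = np.array([[0.54], [0.17], [0.15], [0.52]])
labels = np.array([1.0, 0.0, 1.0, 1.0])

print(accuracy_flat(preds, labels))     # 0.75
print(((preds > 0.5) == labels).shape)  # (4, 4) when nothing is flattened
```

Flattening first is the same idea as the `preds.squeeze(1)` call used for the final test-set accuracy.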

Training the Model

I’ll use the same code as is shown in Jeremy’s notebook for preparing the Trainer:

from transformers import TrainingArguments,Trainer
bs = 128
epochs = 4

I’ll use the same learning rate as the example to start with:

lr = 8e-5
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=acc_d)
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainer.train();
[24/24 00:08, Epoch 4/4]
| Epoch | Training Loss | Validation Loss | Accuracy |
| 1 | No log | 0.253301 | 0.462185 |
| 2 | No log | 0.246423 | 0.537815 |
| 3 | No log | 0.223734 | 0.537815 |
| 4 | No log | 0.216874 | 0.747899 |

I trained the model a few times and noticed that the accuracy varied significantly between runs. In some runs it was stuck at around 0.56; in others it climbed from 0.4 to 0.5 to 0.6. In this final run, it jumped from 0.54 to 0.75 in the last epoch. I think this means that the combination of data and hyperparameters makes training unstable for this model.
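
Part of the run-to-run variation may simply come from how short training is. A back-of-the-envelope sketch using the batch size, epoch count and training-set size from above shows where the `[24/24]` in the progress bar comes from, assuming a single device and no gradient accumulation. The warmup rounding is an assumption about how `Trainer` converts `warmup_ratio` into steps.

```python
import math

n_train = 672   # rows in dds['train']
bs = 128        # per-device train batch size (single device assumed)
epochs = 4
warmup_ratio = 0.1

steps_per_epoch = math.ceil(n_train / bs)  # the last batch is partial
total_steps = steps_per_epoch * epochs
# Assumption: warmup steps are the ratio times total steps, rounded up
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 6 24 3
```

With only 24 weight updates, a few of which are warmup, small differences in initialization or data order can plausibly change where the model ends up. The "No log" entries in the Training Loss column are expected as well, since the default logging interval (`logging_steps=500`) is never reached.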

Let’s look at some of the predictions on the test set:

preds = trainer.predict(eval_ds).predictions.astype(float)
preds[:10], preds.shape
(array([[0.53808594],
        [0.16943359],
        [0.14575195],
        [0.52392578],
        [0.50390625],
        [0.52539062],
        [0.52734375],
        [0.14221191],
        [0.52001953],
        [0.53320312]]),
 (100, 1))

I’ll calculate the accuracy for the test set:

torch.tensor((preds.squeeze(1)>0.5) == eval_df['Survived'].values).float().mean().item()
0.8100000023841858

Not bad! I get 81% accuracy on my test set. For comparison, the linear model, neural net, deep neural net, and fastai tabular_learner models from my earlier work on this dataset achieved about 83% accuracy on their validation sets.
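
To put the 81% in context, here is a rough sketch of the sampling uncertainty on a 100-row test set, using the normal approximation to a binomial proportion. This is a standard approximation added for illustration, not something computed in the notebook.

```python
import math

acc = 0.81   # test accuracy reported above
n = 100      # size of the held-out test set

# Standard error of a proportion and an approximate 95% interval
se = math.sqrt(acc * (1 - acc) / n)
low, high = acc - 1.96 * se, acc + 1.96 * se

print(f"{acc:.2f} +/- {1.96 * se:.3f} -> [{low:.3f}, {high:.3f}]")
```

The interval spans roughly 73% to 89%, so a two-point gap between 81% and 83% is well within what a test set of this size can resolve.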

Final Thoughts

Overall I found this exercise enjoyable. I learned a little more about using HuggingFace Transformers, and I better understand what Jeremy did in his example notebook. I am not confident in this model or approach: training was unstable (accuracy varied widely between runs), and this dataset is not really meant for an NLP model. I also had fewer rows than the example that Jeremy showed. That being said, my model wasn’t a complete dud, as it correctly predicted survival for 81% of the passengers in my test set.