Using HuggingFace Transformers for Tabular Titanic Data

deep learning

python

In this notebook I run a fun experiment to see how well an NLP model can predict tabular data.

Author

Vishal Bakshi

Published

August 23, 2023

Background

In this blog post, I’ll run a fun little experiment that uses the code Jeremy Howard wrote in Getting started with NLP for absolute beginners to train an NLP classifier to predict whether or not a passenger on the Titanic survived.

I’ll start by acknowledging the obvious—that training an NLP model for tabular data that doesn’t contain much natural language is probably not going to give great results. However, it gives me an opportunity to use a simple dataset (that I’ve worked with before and am familiar with) to train a model following a process that is new to me (using the HuggingFace library). With that disclaimer out of the way, let’s jump in!

Plan of Attack

Jeremy’s example uses tabular data with columns containing natural language and some additional data to predict values between 0 and 1 (0 means the two phrases are not similar in meaning, 1 means they are similar). Fundamentally, my dataset works in the same way—I have a bunch of columns describing features of the passengers and then a value of 0 (died) or 1 (survived) that I’m trying to predict.

Preparing the Data

The data preparation step will be similar—I will concatenate multiple columns with a separator between each term.

Training Process

I’ll use the same model (and thus tokenizer) as Jeremy did, so the training setup will be much the same.

Metrics

Jeremy used Pearson’s correlation coefficient (as specified by the Kaggle competition the dataset came from). In my case, I’ll need to figure out how to pass accuracy to the HuggingFace Trainer.

Load and Prep the Data

I’ll start by using the boilerplate code Jeremy has provided to get data from Kaggle.

from pathlib import Path

# Paste the contents of your kaggle.json here (left blank on purpose)
creds = ''

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle: path = Path("../input/titanic")
else:
  path = Path('titanic')
  if not path.exists():
    import zipfile, kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading titanic.zip to /content

100%|██████████| 34.1k/34.1k [00:00<00:00, 1.45MB/s]

# load the training data and look at it
import torch, numpy as np, pandas as pd

df = pd.read_csv(path/'train.csv')
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

The only data cleaning I’ll do is fill missing values with the mode of each column:

modes = df.mode().iloc[0]
df.fillna(modes, inplace=True)
df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
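To make the effect of this fill concrete, here is a minimal sketch on a made-up frame (the `toy` values are illustrative, not Titanic rows). `mode()` returns the most frequent value per column and `.iloc[0]` takes the first one when there are ties, so every missing entry in a column receives the same value. That is why the filled Cabin column shows the same cabin for every passenger whose cabin was missing.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 22.0, 35.0],
    "Cabin": ["C85", None, None, "C85"],
})

# mode() ignores missing values; row 0 holds the most frequent value per column
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)

print(modes.to_dict())
print(filled)
```

A single repeated value is a blunt fill for a column like Cabin, but it keeps the preparation step short.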

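I’ll also set my dependent variable as a float to resolve an error I got during training ("mse_cuda" not implemented for 'Long'). With a single output label the model computes a mean squared error loss, which needs floating-point targets.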

df['Survived'] = df['Survived'].astype(float)

Next, I’ll create an input column that holds the text passed to the model:

df['input'] = 'Pclass: ' + df.Pclass.apply(str) +\
 '; Name: ' + df.Name + '; Sex: ' + df.Sex + '; Age: ' + df.Age.apply(str) +\
  '; SibSp: ' + df.SibSp.apply(str) + '; Parch: ' + df.Parch.apply(str) +\
  '; Ticket: ' + df.Ticket + '; Fare: ' + df.Fare.apply(str) + \
  '; Cabin: ' + df.Cabin + '; Embarked: ' + df.Embarked

df['input'][0]

'Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S'
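The same string can be built with a small helper that loops over column names, which may be easier to maintain than chained concatenation. This is only a sketch: `row_to_text` is a hypothetical helper, and the single row is copied from the output above.

```python
import pandas as pd

def row_to_text(row, cols):
    # Join "column: value" pairs with "; " so each passenger becomes one string
    return "; ".join(f"{c}: {row[c]}" for c in cols)

# One row copied from the table above (Cabin already filled with the mode)
toy = pd.DataFrame([{
    "Pclass": 3, "Name": "Braund, Mr. Owen Harris", "Sex": "male", "Age": 22.0,
    "SibSp": 1, "Parch": 0, "Ticket": "A/5 21171", "Fare": 7.25,
    "Cabin": "B96 B98", "Embarked": "S",
}])

cols = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch",
        "Ticket", "Fare", "Cabin", "Embarked"]
toy["input"] = [row_to_text(r, cols) for r in toy.to_dict("records")]
print(toy["input"][0])
```

Adding or dropping a feature then only means editing the `cols` list.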

Tokenization

! pip install datasets transformers[sentencepiece] accelerate -U

from datasets import Dataset,DatasetDict

I’ll remove 100 rows of data to serve as a test set for final predictions after the model is trained.

# create a random sample of 100 passengers
eval_df = df.sample(100)

eval_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	input
709	710	1.0	3	Moubarek, Master. Halim Gonios ("William George")	male	24.0	1	1	2661	15.2458	B96 B98	C	Pclass: 3; Name: Moubarek, Master. Halim Gonio...
439	440	0.0	2	Kvillner, Mr. Johan Henrik Johannesson	male	31.0	0	0	C.A. 18723	10.5000	B96 B98	S	Pclass: 2; Name: Kvillner, Mr. Johan Henrik Jo...
840	841	0.0	3	Alhomaki, Mr. Ilmari Rudolf	male	20.0	0	0	SOTON/O2 3101287	7.9250	B96 B98	S	Pclass: 3; Name: Alhomaki, Mr. Ilmari Rudolf; ...
720	721	1.0	2	Harper, Miss. Annie Jessie "Nina"	female	6.0	0	1	248727	33.0000	B96 B98	S	Pclass: 2; Name: Harper, Miss. Annie Jessie "N...
39	40	1.0	3	Nicola-Yarred, Miss. Jamila	female	14.0	1	0	2651	11.2417	B96 B98	C	Pclass: 3; Name: Nicola-Yarred, Miss. Jamila; ...

I’ll remove these 100 rows from the original DataFrame, which I will then use for the training and validation sets.

df = df.drop(eval_df.index)

df.shape

(791, 13)
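A quick way to convince yourself that the sample-then-drop pattern leaves two disjoint pieces is to try it on a tiny frame. In this sketch the `toy` frame is made up, and `random_state=0` is an addition for reproducibility (the sampling above is unseeded).

```python
import pandas as pd

# Ten made-up passengers
toy = pd.DataFrame({"PassengerId": range(1, 11)})

# Hold out 3 rows, then drop them from the rest by index
holdout = toy.sample(3, random_state=0)
remaining = toy.drop(holdout.index)

# The two pieces share no rows and together cover the original frame
print(len(holdout), len(remaining))
print(set(holdout.index).isdisjoint(remaining.index))
```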

ds = Dataset.from_pandas(df)

eval_ds = Dataset.from_pandas(eval_df)

ds

Dataset({
    features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
    num_rows: 791
})

eval_ds

Dataset({
    features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
    num_rows: 100
})

I’ll use the same model as in Jeremy’s example:

model_nm = 'microsoft/deberta-v3-small'

from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

I’ll check the tokenizer:

tokz.tokenize("We are about to tokenize this dataset!")

['▁We', '▁are', '▁about', '▁to', '▁token', 'ize', '▁this', '▁dataset', '!']

# function to tokenize inputs
def tok_func(x): return tokz(x["input"])

tok_ds = ds.map(tok_func, batched=True)

eval_ds = eval_ds.map(tok_func, batched=True)

row = tok_ds[0]
row['input'], row['input_ids']

('Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S',
 [1, 916, 4478, 294, 404, 346, 5445, 294, 24448, 407, 261, 945, 260, 12980, 6452, 346,
  23165, 294, 2844, 346, 5166, 294, 1460, 260, 693, 346, 42209, 32154, 294, 376, 346, 916,
  22702, 294, 767, 346, 14169, 294, 336, 320, 524, 1259, 30877, 346, 40557, 294, 574, 260,
  1883, 346, 22936, 294, 736, 8971, 736, 8454, 346, 77030, 569, 294, 662, 2])

I’ll look at the index for some of the words in the input to check that they are present in the input_ids column:

tokz.vocab['▁P']

tokz.vocab['▁3']

tokz.vocab['▁Name']

Transformers expects the dependent variable to be named labels:

tok_ds = tok_ds.rename_columns({'Survived':'labels'})

tok_ds

Dataset({
    features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 791
})

Preparing Training and Validation Sets

Since I cut into my training and validation set by pulling out a test set, I’ll use a smaller split for the validation set.

dds = tok_ds.train_test_split(0.15, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 672
    })
    test: Dataset({
        features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 119
    })
})
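The 672/119 split sizes can be checked by hand. This sketch assumes the validation size is rounded up and the remainder goes to training, which is how scikit-learn's `train_test_split` handles a fractional test size; I have not verified the rounding inside `datasets` itself.

```python
import math

n_rows = 791       # rows left after removing the 100-row test set
valid_frac = 0.15

# Assumption: validation count is rounded up, the rest is used for training
n_valid = math.ceil(n_rows * valid_frac)
n_train = n_rows - n_valid
print(n_train, n_valid)  # 672 119
```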

Creating an Accuracy Function

Since my dependent variable is binary (0 or 1), I’ll create an accuracy function that does the following:

If a prediction is greater than 0.5, classify it as 1; otherwise, classify it as 0.
Compare the predictions to the labels and take the mean of the resulting boolean array, which is the fraction of correctly predicted values.

def calculate_accuracy(preds, labels):
  return torch.tensor(((preds>0.5)==labels)).float().mean().item()

# Transformers want a dictionary for the metric
def acc_d(eval_pred): return {'accuracy': calculate_accuracy(*eval_pred) }

Training the Model

I’ll use the same code as is shown in Jeremy’s notebook for preparing the Trainer:

from transformers import TrainingArguments,Trainer

bs = 128
epochs = 4

I’ll use the same learning rate as the example to start with:

lr = 8e-5

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=acc_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

trainer.train();

[24/24 00:08, Epoch 4/4]

Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.253301	0.462185
2	No log	0.246423	0.537815
3	No log	0.223734	0.537815
4	No log	0.216874	0.747899
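The 24 steps in the progress bar follow from the dataset and batch size. A rough check, assuming a single device and no gradient accumulation:

```python
import math

n_train = 672   # training rows after the split
bs = 128        # per-device train batch size
epochs = 4

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(n_train / bs)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 6 24
```

With so few steps, the training loss column likely reads "No log" because the Trainer's default logging interval (500 steps, as far as I know) is never reached.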

I trained the model a few times and noticed that the accuracy varied significantly between runs. In some runs it was stuck at around 0.56; in others it went from 0.4 to 0.5 to 0.6. In this final run, it jumped from 0.54 to 0.75 in the last epoch. I think this means that the combination of data and hyperparameters makes training unstable for this model.

Let’s look at some of the predictions on the test set:

preds = trainer.predict(eval_ds).predictions.astype(float)
preds[:10], preds.shape

(array([[0.53808594],
        [0.16943359],
        [0.14575195],
        [0.52392578],
        [0.50390625],
        [0.52539062],
        [0.52734375],
        [0.14221191],
        [0.52001953],
        [0.53320312]]),
 (100, 1))
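To see how these raw outputs turn into class predictions, here is a small sketch that applies the same 0.5 threshold to the ten values printed above (copied by hand, flattened to one dimension):

```python
import numpy as np

# First ten predictions copied from the output above
first_ten = np.array([0.53808594, 0.16943359, 0.14575195, 0.52392578, 0.50390625,
                      0.52539062, 0.52734375, 0.14221191, 0.52001953, 0.53320312])

# Same rule as the accuracy function: above 0.5 counts as "survived"
predicted_class = (first_ten > 0.5).astype(int)
print(predicted_class)  # [1 0 0 1 1 1 1 0 1 1]

# How far above the threshold are the positive predictions?
print(first_ten[first_ten > 0.5].max())
```

In this sample, every positive prediction sits only slightly above 0.5, so small shifts in the model's output could flip several of them.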

I’ll calculate the accuracy for the test set:

torch.tensor((preds.squeeze(1)>0.5) == eval_df['Survived'].values).float().mean().item()

0.8100000023841858
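The `squeeze(1)` matters here: `preds` has shape (100, 1) while the labels have shape (100,), and comparing them directly would broadcast to a 100 × 100 grid instead of comparing row by row. A small sketch with made-up values (not the model's output):

```python
import numpy as np

# Made-up values with the same shapes as preds and the labels, just shorter
preds = np.array([[0.7], [0.2], [0.6], [0.1]])   # shape (4, 1)
labels = np.array([1.0, 0.0, 0.0, 0.0])          # shape (4,)

# Without squeezing, every prediction is compared against every label
broadcast = (preds > 0.5) == labels
# With squeezing, each prediction is compared against its own label
elementwise = (preds.squeeze(1) > 0.5) == labels

print(broadcast.shape, broadcast.mean())      # (4, 4) 0.5
print(elementwise.shape, elementwise.mean())  # (4,) 0.75
```

The broadcast version still returns a number between 0 and 1, so the mistake is easy to miss without checking shapes.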

Not bad! I get 81% accuracy on my test set. For comparison, the linear, neural net, deep neural net and fastai tabular_learner models achieved an accuracy of about 83% on their validation sets.

Final Thoughts

Overall, I found this exercise enjoyable. I learned a little more about using HuggingFace Transformers and better understand what Jeremy did in his example notebook. I am not confident in this model or approach: the training was unstable (accuracy varied widely across runs), and this dataset is not really meant for an NLP model. I also had fewer rows than the example that Jeremy showed. That being said, my model wasn’t a complete dud, as it correctly predicted most of the survival outcomes in my test set.