from pathlib import Path
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
Using HuggingFace Transformers for Tabular Titanic Data
Background
In this blog post, I’ll run a fun little experiment that uses the code Jeremy Howard wrote in Getting started with NLP for absolute beginners to train an NLP classifier to predict whether or not a passenger on the Titanic survived.
I’ll start by acknowledging the obvious—that training an NLP model for tabular data that doesn’t contain much natural language is probably not going to give great results. However, it gives me an opportunity to use a simple dataset (that I’ve worked with before and am familiar with) to train a model following a process that is new to me (using the HuggingFace library). With that disclaimer out of the way, let’s jump in!
Plan of Attack
Jeremy’s example uses tabular data with columns containing natural language and some additional data to predict values between 0 and 1 (0 means the two phrases are not similar in meaning, 1 means they are similar). Fundamentally, my dataset works in the same way: I have a bunch of columns describing features of the passengers and then a value of 0 (died) or 1 (survived) that I’m trying to predict.
Preparing the Data
The data preparation step will be similar—I will concatenate multiple columns with a separator between each term.
Training Process
I’ll use the same model (and thus tokenizer) as Jeremy did, so the training setup will be much the same.
Metrics
Jeremy used Pearson’s correlation coefficient (as specified by the Kaggle competition the dataset came from). In my case, I’ll need to figure out how to pass accuracy to the HuggingFace Trainer.
Load and Prep the Data
I’ll start by using the boilerplate code Jeremy has provided to get data from Kaggle.
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle: path = Path("../input/titanic")
else:
    path = Path('titanic')
    if not path.exists():
        import zipfile, kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 1.45MB/s]
# load the training data and look at it
import torch, numpy as np, pandas as pd
df = pd.read_csv(path/'train.csv')
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
The only data cleaning I’ll do is fill missing values with the mode of each column:
modes = df.mode().iloc[0]
df.fillna(modes, inplace=True)
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
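To make the mode-filling step concrete, here is a minimal sketch on a made-up four-row table (the toy DataFrame and its values are illustrative, not Titanic data). df.mode() returns the most frequent value per column, .iloc[0] takes the first row of that result, and fillna then applies it column by column.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing Age and two missing Cabins.
toy = pd.DataFrame({
    "Age": [22.0, None, 22.0, 35.0],
    "Cabin": ["C85", None, None, "C85"],
})

# Most frequent value of each column; missing values are ignored by mode().
modes = toy.mode().iloc[0]

# Fill each column's gaps with that column's mode.
filled = toy.fillna(modes)
print(filled)
```

One side effect worth keeping in mind: a mostly empty column such as Cabin ends up with the same filler value in most rows, which is why B96 B98 shows up so often in the outputs further down.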
I’ll also cast my dependent variable to float to resolve an error I got during training ("mse_cuda" not implemented for 'Long'):
df['Survived'] = df['Survived'].astype(float)
Next, I’ll create an input column that holds the text passed to the model:
df['input'] = 'Pclass: ' + df.Pclass.apply(str) +\
'; Name: ' + df.Name + '; Sex: ' + df.Sex + '; Age: ' + df.Age.apply(str) +\
'; SibSp: ' + df.SibSp.apply(str) + '; Parch: ' + df.Parch.apply(str) +\
'; Ticket: ' + df.Ticket + '; Fare: ' + df.Fare.apply(str) + \
'; Cabin: ' + df.Cabin + '; Embarked: ' + df.Embarked
df['input'][0]
'Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S'
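The same "name: value" concatenation can be written as a small helper, which may be easier to extend if columns change. This is only a sketch: row_to_text and feature_columns are hypothetical names of mine, and the row is typed in by hand from the first passenger shown above.

```python
# Columns in the order used for the input string above.
feature_columns = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch",
                   "Ticket", "Fare", "Cabin", "Embarked"]

def row_to_text(row, columns):
    """Join 'column: value' pairs with '; ' as the separator."""
    return "; ".join(f"{col}: {row[col]}" for col in columns)

# First passenger, after missing values were filled with the column modes.
first_row = {
    "Pclass": 3, "Name": "Braund, Mr. Owen Harris", "Sex": "male",
    "Age": 22.0, "SibSp": 1, "Parch": 0, "Ticket": "A/5 21171",
    "Fare": 7.25, "Cabin": "B96 B98", "Embarked": "S",
}

text = row_to_text(first_row, feature_columns)
print(text)
```

With a DataFrame, the helper could be applied row by row, for example with df.apply(lambda r: row_to_text(r, feature_columns), axis=1), though I have not compared its speed with the column-wise version.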
Tokenization
! pip install datasets transformers[sentencepiece] accelerate -U
from datasets import Dataset,DatasetDict
I’ll remove 100 rows of data to serve as a test set for final predictions after the model is trained.
# create a random sample of 100 passengers
eval_df = df.sample(100)
eval_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | input |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
709 | 710 | 1.0 | 3 | Moubarek, Master. Halim Gonios ("William George") | male | 24.0 | 1 | 1 | 2661 | 15.2458 | B96 B98 | C | Pclass: 3; Name: Moubarek, Master. Halim Gonio... |
439 | 440 | 0.0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5000 | B96 B98 | S | Pclass: 2; Name: Kvillner, Mr. Johan Henrik Jo... |
840 | 841 | 0.0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.9250 | B96 B98 | S | Pclass: 3; Name: Alhomaki, Mr. Ilmari Rudolf; ... |
720 | 721 | 1.0 | 2 | Harper, Miss. Annie Jessie "Nina" | female | 6.0 | 0 | 1 | 248727 | 33.0000 | B96 B98 | S | Pclass: 2; Name: Harper, Miss. Annie Jessie "N... |
39 | 40 | 1.0 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | B96 B98 | C | Pclass: 3; Name: Nicola-Yarred, Miss. Jamila; ... |
I’ll remove these 100 rows from the original DataFrame, which I will use for training and validation sets.
df = df.drop(eval_df.index)
df.shape
(791, 13)
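Because df.sample(100) draws a different set of passengers on every run, the held-out rows (and so the final test accuracy) will differ between runs. A minimal sketch of a repeatable hold-out, assuming a fixed random_state is acceptable; the seed value 42 and the stand-in DataFrame are my assumptions, not part of the original notebook:

```python
import pandas as pd

# Stand-in for the training data: 891 rows with a default integer index.
toy_df = pd.DataFrame({"PassengerId": range(1, 892)})

# A fixed random_state makes the same 100 rows come out on every run.
holdout_df = toy_df.sample(100, random_state=42)

# Dropping by index removes exactly the sampled rows from the remainder.
remaining_df = toy_df.drop(holdout_df.index)

print(holdout_df.shape, remaining_df.shape)
```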
ds = Dataset.from_pandas(df)
eval_ds = Dataset.from_pandas(eval_df)
ds
Dataset({
features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
num_rows: 791
})
eval_ds
Dataset({
features: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__'],
num_rows: 100
})
I’ll use the same model as in Jeremy’s example:
model_nm = 'microsoft/deberta-v3-small'
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I’ll check the tokenizer:
tokz.tokenize("We are about to tokenize this dataset!")
['▁We', '▁are', '▁about', '▁to', '▁token', 'ize', '▁this', '▁dataset', '!']
# function to tokenize inputs
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
eval_ds = eval_ds.map(tok_func, batched=True)
row = tok_ds[0]
row['input'], row['input_ids']
('Pclass: 3; Name: Braund, Mr. Owen Harris; Sex: male; Age: 22.0; SibSp: 1; Parch: 0; Ticket: A/5 21171; Fare: 7.25; Cabin: B96 B98; Embarked: S',
 [1, 916, 4478, 294, 404, 346, 5445, 294, 24448, 407, 261, 945, 260, 12980, 6452, 346,
  23165, 294, 2844, 346, 5166, 294, 1460, 260, 693, 346, 42209, 32154, 294, 376, 346,
  916, 22702, 294, 767, 346, 14169, 294, 336, 320, 524, 1259, 30877, 346, 40557, 294,
  574, 260, 1883, 346, 22936, 294, 736, 8971, 736, 8454, 346, 77030, 569, 294, 662, 2])
I’ll look up the vocabulary index of a few tokens from the input to check that they appear in the input_ids column:
tokz.vocab['▁P']
916
tokz.vocab['▁3']
404
tokz.vocab['▁Name']
5445
Transformers expects the dependent variable to be named labels:
tok_ds = tok_ds.rename_columns({'Survived':'labels'})
tok_ds
Dataset({
features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 791
})
Preparing Training and Validation Sets
Since I cut into my training and validation set by pulling out a test set, I’ll use a smaller split for the validation set.
dds = tok_ds.train_test_split(0.15, seed=42)
dds
DatasetDict({
train: Dataset({
features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 672
})
test: Dataset({
features: ['PassengerId', 'labels', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 119
})
})
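The split sizes follow from applying the 0.15 fraction to 791 rows. A quick check of the arithmetic, assuming the test size is rounded up and the train size rounded down (that rounding rule is my reading of the numbers above, not something the notebook states):

```python
import math

n_rows = 791          # rows left after removing the 100-row test set
test_fraction = 0.15  # fraction passed to train_test_split

# 0.15 * 791 = 118.65, rounded up for the validation ("test") split.
n_test = math.ceil(test_fraction * n_rows)

# 0.85 * 791 = 672.35, rounded down for the training split.
n_train = math.floor((1 - test_fraction) * n_rows)

print(n_train, n_test)
```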
Creating an Accuracy Function
Since my dependent variable is binary (0 or 1), I’ll create an accuracy function that does the following:
- If a prediction is greater than 0.5, classify it as 1; otherwise, classify it as 0.
- Compare the predictions to the labels and take the mean of the resulting boolean array, which is the fraction of correctly predicted values.
def calculate_accuracy(preds, labels):
return torch.tensor(((preds>0.5)==labels)).float().mean().item()
# Transformers want a dictionary for the metric
def acc_d(eval_pred): return {'accuracy': calculate_accuracy(*eval_pred) }
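One thing I would double-check here is array shapes: with a single output the predictions can come back with shape (N, 1) while the labels have shape (N,), and comparing those directly in NumPy broadcasts to an (N, N) array. A minimal sketch of a shape-safe variant, assuming NumPy inputs; the name calculate_accuracy_flat and the threshold parameter are hypothetical, not from the notebook:

```python
import numpy as np

def calculate_accuracy_flat(preds, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 labels."""
    # Flatten both arrays so (N, 1) predictions line up with (N,) labels.
    preds = np.asarray(preds, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    predicted_class = preds > threshold
    return float((predicted_class == labels.astype(bool)).mean())

# Made-up predictions and labels, shaped like a single-output model's.
preds = np.array([[0.9], [0.2], [0.6], [0.4]])
labels = np.array([1.0, 0.0, 0.0, 0.0])

# Without flattening, the comparison broadcasts to a 4x4 array.
print(((preds > 0.5) == labels).shape)         # (4, 4)
print(calculate_accuracy_flat(preds, labels))  # 0.75
```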
Training the Model
I’ll use the same code as is shown in Jeremy’s notebook for preparing the Trainer:
from transformers import TrainingArguments,Trainer
bs = 128
epochs = 4
I’ll use the same learning rate as the example to start with:
lr = 8e-5
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=acc_d)
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
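To get a feel for what warmup_ratio=0.1 and the cosine scheduler mean at this scale, here is a rough sketch of the learning rate per optimizer step. It assumes a single device, no gradient accumulation, and that warmup steps are rounded up; cosine_with_warmup is my own illustrative function, not the library's implementation.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps):
    """Learning-rate multiplier: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

train_rows, bs, epochs, lr = 672, 128, 4, 8e-5

steps_per_epoch = math.ceil(train_rows / bs)   # batches per epoch
total_steps = steps_per_epoch * epochs         # optimizer steps overall
warmup_steps = math.ceil(total_steps * 0.1)    # assumed rounding rule

schedule = [lr * cosine_with_warmup(s, total_steps, warmup_steps)
            for s in range(total_steps)]
print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the whole run is only a couple of dozen optimizer steps, so a logging interval larger than that would never fire. That would be consistent with the No log entries in the Training Loss column below, though I have not verified the default interval here.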
trainer.train();
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 0.253301 | 0.462185 |
2 | No log | 0.246423 | 0.537815 |
3 | No log | 0.223734 | 0.537815 |
4 | No log | 0.216874 | 0.747899 |
I trained the model a few times and noticed that the accuracy varied significantly between runs. In some runs, it was stuck at around 0.56; in others, it went from 0.4 to 0.5 to 0.6. In this final run, it jumped from 0.54 to 0.75 in the last epoch. I think this means that the combination of data and hyperparameters makes training unstable for this model.
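One way to sanity-check the accuracy column is to convert each value back into a count of correct predictions, assuming the validation split has 119 rows as shown earlier. The reported values below are copied from the training table; everything else is just arithmetic.

```python
n_val = 119  # rows in the validation ("test") split

# Accuracy per epoch, copied from the training output above.
reported = {1: 0.462185, 2: 0.537815, 3: 0.537815, 4: 0.747899}

# Nearest whole number of correct predictions for each epoch.
correct = {epoch: round(acc * n_val) for epoch, acc in reported.items()}

# Each count reproduces the reported accuracy to six decimal places.
for epoch, k in correct.items():
    assert abs(k / n_val - reported[epoch]) < 1e-6

print(correct)  # {1: 55, 2: 64, 3: 64, 4: 89}
```

The counts for epoch 1 and epochs 2 and 3 add up to exactly 119. One possible reading is that the model predicted a single class for every row in those epochs (one class in epoch 1, the other in epochs 2 and 3) and only started separating passengers in epoch 4. That is my interpretation of the numbers, not something the notebook confirms.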
Let’s look at some of the predictions on the test set:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds[:10], preds.shape
(array([[0.53808594],
[0.16943359],
[0.14575195],
[0.52392578],
[0.50390625],
[0.52539062],
[0.52734375],
[0.14221191],
[0.52001953],
[0.53320312]]),
(100, 1))
I’ll calculate the accuracy for the test set:
torch.tensor((preds.squeeze(1)>0.5) == eval_df['Survived'].values).float().mean().item()
0.8100000023841858
Not bad! I get 81% accuracy on my test set. The linear, neural net, deep neural net and fastai tabular_learner models achieved an accuracy of about 83% on their validation sets.
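With only 100 test rows, the gap between 81% and 83% is small compared with the sampling noise. A rough sketch using the normal approximation to a binomial proportion (a standard back-of-the-envelope estimate, not something computed in the original notebook):

```python
import math

acc = 0.81   # test accuracy reported above
n = 100      # number of test rows

# Standard error of a proportion under the normal approximation.
se = math.sqrt(acc * (1 - acc) / n)

# Approximate 95% interval: accuracy plus or minus 1.96 standard errors.
low, high = acc - 1.96 * se, acc + 1.96 * se
print(round(se, 3), round(low, 3), round(high, 3))
```

The interval spans roughly 0.73 to 0.89, so on this evidence alone I would not read much into a two-point difference.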
Final Thoughts
Overall, I found this exercise enjoyable. I learned a little bit more about using HuggingFace Transformers, and I better understand what Jeremy did in his example notebook. I am not confident in this model or approach, as I did notice the training was unstable (highly varying accuracy across training runs), and this dataset is not really meant for an NLP model. I also had fewer rows than the example that Jeremy showed. That being said, my model wasn’t a complete dud, as it predicted survival correctly for most of the passengers in my test set.