from fastai.tabular.all import *
Training a Multi-Target Regression Deep Learning Model with fastai
Background
In this blog post I will use fastai to train a model that predicts more than one target for the Kaggle Titanic dataset.
I’ve referenced the notebook Multi-target: Road to the Top, Part 4 by Jeremy Howard as well as a derivative notebook Small models + Multi-targets by Kaggle user Archie Tram (in which he creates a test DataLoader
to get predictions from the model).
Plan of Attack
Creating DataLoaders
In Jeremy’s notebook, he is classifying images of plants with two targets: disease and variety of plant.
He creates his DataLoaders
object as follows:
= DataBlock(
dls =(ImageBlock,CategoryBlock,CategoryBlock),
blocks=1,
n_inp=get_image_files,
get_items= [parent_label,get_variety],
get_y =RandomSplitter(0.2, seed=42),
splitter=Resize(192, method='squish'),
item_tfms=aug_transforms(size=128, min_scale=0.75)
batch_tfms ).dataloaders(trn_path)
There are three blocks
: 1 input ImageBlock
and 2 output CategoryBlock
s. The model gets the outputs with parent_label
(for the disease) and a custom function get_variety
(which grabs the variety
column value of the given image from a DataFrame
).
In my use case, I will have to follow a similar approach, albeit catered to tabular data.
Calculating Losses
Jeremy calculates loss as the sum of the following:
- Cross-Entropy loss of the disease inputs
- Cross-Entropy loss of the variety inputs
I’ll follow a similar approach, except if I use continuous variables as targets I’ll use MSE instead of Cross-Entropy.
Calculating Metrics
Similar to the loss calculation, I’ll combine the calculation of the metric for each of the two targets. For continuous variables, I’ll use RMSE.
Training a Multi-Target Model
With a rough plan outlined, I’ll start the training process with loading and cleaning the Titanic dataset.
Load and Clean Data
from pathlib import Path
= Path('~/.kaggle/kaggle.json').expanduser()
cred_path if not cred_path.exists():
=True)
cred_path.parent.mkdir(exist_ok
cred_path.write_text(creds)0o600) cred_path.chmod(
import os
= os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
iskaggle if iskaggle: path = Path("../input/titanic")
else:
= Path('titanic')
path if not path.exists():
import zipfile, kaggle
str(path))
kaggle.api.competition_download_cli(f'{path}.zip').extractall(path) zipfile.ZipFile(
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 6.18MB/s]
# load the training data and look at it
= pd.read_csv(path/'train.csv')
df df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# feature engineering
def add_features(df):
'LogFare'] = np.log1p(df['Fare'])
df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
df['Family'] = df.SibSp+df.Parch
df['Alone'] = df.Family == 0
df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df.Title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master")) df[
# add the features to our dataframe
add_features(df)
# view the topmost row of the modes DataFrame
= df.mode().iloc[0]
modes modes
PassengerId 1
Survived 0.0
Pclass 3.0
Name Abbing, Mr. Anthony
Sex male
Age 24.0
SibSp 0.0
Parch 0.0
Ticket 1601
Fare 8.05
Cabin B96 B98
Embarked S
LogFare 2.202765
Deck ABC
Family 0.0
Alone True
TicketFreq 1.0
Title Mr
Name: 0, dtype: object
# fill missing data with the column's mode
=True) df.fillna(modes, inplace
# check that we no longer have missing data
sum() df.isna().
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
LogFare 0
Deck 0
Family 0
Alone 0
TicketFreq 0
Title 0
dtype: int64
# create training and validation index lists
= RandomSplitter(seed=42)(df) splits
Create DataLoaders
I’ll take most of the code from the Why you should use a framework notebook by Jeremy, with the following changes:
- Remove
"Age"
fromcont_names
and move it toy_names
along with"Survived"
which will be our two targets. - Set
n_out=2
for theRegressionBlock
.
I’ll treat both targets as a regression, as I wasn’t able to provide two DataBlock
s for y_block
.
Since I’ve filled in missing values manually, I have removed the FillMissing
item from procs
.
# create dataloaders object
= TabularPandas(
dls
df,=splits,
splits=[Categorify, Normalize],
procs=["Sex", "Pclass", "Embarked", "Deck", "Title"],
cat_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
cont_names=["Age", "Survived"],
y_names=RegressionBlock(n_out=2)
y_block=".") ).dataloaders(path
dls.show_batch()
Sex | Pclass | Embarked | Deck | Title | SibSp | Parch | LogFare | Alone | TicketFreq | Family | Age | Survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 1 | S | ABC | Mr | 1.000000e+00 | -9.897945e-09 | 3.970292 | 2.458140e-08 | 2.0 | 1.000000e+00 | 42.0 | 0.0 |
1 | male | 3 | S | ABC | Mr | 1.689237e-09 | -9.897945e-09 | 2.230014 | 1.000000e+00 | 1.0 | -1.856774e-08 | 18.0 | 0.0 |
2 | male | 2 | S | ABC | Mr | 1.000000e+00 | 2.000000e+00 | 3.358638 | 2.458140e-08 | 3.0 | 3.000000e+00 | 36.0 | 0.0 |
3 | male | 3 | C | ABC | Mr | 1.000000e+00 | 1.000000e+00 | 2.107689 | 2.458140e-08 | 1.0 | 2.000000e+00 | 17.0 | 0.0 |
4 | male | 3 | S | ABC | Mr | 1.689237e-09 | -9.897945e-09 | 2.351375 | 1.000000e+00 | 1.0 | -1.856774e-08 | 28.0 | 0.0 |
5 | female | 3 | S | ABC | Mrs | 1.000000e+00 | 4.000000e+00 | 3.363842 | 2.458140e-08 | 6.0 | 5.000000e+00 | 45.0 | 0.0 |
6 | male | 3 | S | ABC | Mr | 1.689237e-09 | -9.897945e-09 | 2.324836 | 1.000000e+00 | 1.0 | -1.856774e-08 | 23.0 | 0.0 |
7 | female | 3 | C | ABC | Mrs | 1.689237e-09 | -9.897945e-09 | 2.107178 | 1.000000e+00 | 1.0 | -1.856774e-08 | 24.0 | 1.0 |
8 | male | 3 | S | ABC | Mr | 1.689237e-09 | -9.897945e-09 | 2.188856 | 1.000000e+00 | 1.0 | -1.856774e-08 | 39.0 | 1.0 |
9 | female | 2 | C | ABC | Mrs | 1.000000e+00 | -9.897945e-09 | 3.436269 | 2.458140e-08 | 2.0 | 1.000000e+00 | 14.0 | 1.0 |
Create Loss Function
If I understand correctly, we will get 2 columns of predictions, and two variables of targets to compute the loss with:
def age_loss(pred, yb): return F.mse_loss(pred[:,0], yb[:,0])
def survived_loss(pred, yb): return F.mse_loss(pred[:,1], yb[:,1])
def combine_loss(pred, yb): return age_loss(pred, yb) + survived_loss(pred, yb)
Create Metric Function
I’ll create an RMSE function for each target variable:
def age_rmse(pred, yb): return torch.sqrt(F.mse_loss(pred[:,0], yb[:,0]))
def survived_rmse(pred, yb): return torch.sqrt(F.mse_loss(pred[:,1], yb[:,1]))
= (age_rmse, survived_rmse) rmse_metrics
= tabular_learner(dls, loss_func=combine_loss, metrics=rmse_metrics, layers=[10,10], n_out=2) learn
Most times that I ran the learning rate finder, the loss was steadily increasing from the get-go. I randomly came across the following learning rate regime which looks more stable, so I’ll use the given value.
=(slide, valley)) learn.lr_find(suggest_funcs
SuggestedLRs(slide=6.309573450380412e-07, valley=0.14454397559165955)
20, lr=0.1) learn.fit(
epoch | train_loss | valid_loss | age_rmse | survived_rmse | time |
---|---|---|---|---|---|
0 | 766.320923 | 554.657837 | 23.431234 | 0.721841 | 00:00 |
1 | 460.486603 | 170.207932 | 13.030014 | 0.590790 | 00:00 |
2 | 335.931213 | 132.264999 | 11.456180 | 0.649899 | 00:00 |
3 | 265.317535 | 116.719322 | 10.778342 | 0.477045 | 00:00 |
4 | 221.392242 | 121.840828 | 11.004195 | 0.441827 | 00:00 |
5 | 192.420349 | 132.113815 | 11.457218 | 0.472019 | 00:00 |
6 | 173.592255 | 120.654694 | 10.943729 | 0.462033 | 00:00 |
7 | 159.223709 | 113.375626 | 10.612040 | 0.519316 | 00:00 |
8 | 148.853653 | 114.346222 | 10.654099 | 0.484549 | 00:00 |
9 | 140.409439 | 109.572639 | 10.437387 | 0.467927 | 00:00 |
10 | 133.942352 | 114.497719 | 10.642965 | 0.590436 | 00:00 |
11 | 129.807709 | 110.892578 | 10.500125 | 0.455730 | 00:00 |
12 | 125.972458 | 112.508110 | 10.570338 | 0.451019 | 00:00 |
13 | 122.350586 | 126.790512 | 11.167099 | 0.512433 | 00:00 |
14 | 119.345764 | 112.307846 | 10.571351 | 0.579465 | 00:00 |
15 | 117.329689 | 113.805359 | 10.628425 | 0.484336 | 00:00 |
16 | 116.328194 | 115.227859 | 10.696632 | 0.475317 | 00:00 |
17 | 115.390640 | 115.162354 | 10.710686 | 0.500142 | 00:00 |
18 | 116.044281 | 125.941689 | 11.149549 | 0.558260 | 00:00 |
19 | 115.501900 | 116.436340 | 10.739085 | 0.500779 | 00:00 |
After a few epochs, the RMSE values stop improving. The validation loss also fluctuates throughout the training after decreasing for the first three epochs.
Comparing Predictions to Actuals
Based on how the training went, I’m not expecting this model to be able to predict Age
and Survived
very well. I’ll use the validation set to get predictions and then calculate accuracy for Survived
and correlation between actuals vs. predictions for Age
.
= learn.get_preds(dl=dls.valid) preds, targ
# Survived accuracy
1] == (preds[:,1]>0.5)).float().mean() (targ[:,
tensor(0.6348)
def corr(x,y): return np.corrcoef(x,y)[0][1]
# Age plot
= plt.subplots(1)
fig, ax
'equal')
ax.axis(f'Predicted Age vs Actual; r: {corr(preds[:,0], targ[:,0]):.2f}')
plt.title(0], targ[:,0]); ax.scatter(preds[:,
The model achieved shoddy accuracy (63%) and an uninspiring correlation between predicted and actual age. The model did particularly poorly in predicting ages above 40.
Comparing to Single-Target Models
I’m curious to see how the model performs when I train it for single targets. I’ll train one regression model for Age
, another separate regression model for Survived
, and see how their results compare to the combined two-target model.
Single Target: Age
# create dataloaders object
= TabularPandas(
age_dls
df,=splits,
splits=[Categorify, Normalize],
procs=["Sex", "Pclass", "Embarked", "Deck", "Title"],
cat_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
cont_names="Age",
y_names=RegressionBlock()
y_block=".") ).dataloaders(path
= tabular_learner(age_dls, metrics=rmse, layers=[10,10]) age_learn
I ran the learning rate finder 10 times and got similar charts each time, which tells me that something about this model is more stable than my two-target model.
=(slide, valley)) age_learn.lr_find(suggest_funcs
SuggestedLRs(slide=6.309573450380412e-07, valley=0.0831763744354248)
16, lr=0.1) age_learn.fit(
epoch | train_loss | valid_loss | _rmse | time |
---|---|---|---|---|
0 | 781.124268 | 233.326263 | 15.275021 | 00:00 |
1 | 454.851532 | 408.981842 | 20.223301 | 00:00 |
2 | 328.806274 | 116.149773 | 10.777281 | 00:00 |
3 | 263.302643 | 119.088097 | 10.912749 | 00:00 |
4 | 219.239166 | 127.125175 | 11.274981 | 00:00 |
5 | 190.565811 | 111.707756 | 10.569189 | 00:00 |
6 | 171.005737 | 113.618858 | 10.659215 | 00:00 |
7 | 157.105713 | 109.284859 | 10.453939 | 00:00 |
8 | 146.396072 | 118.541183 | 10.887661 | 00:00 |
9 | 138.696716 | 107.435219 | 10.365096 | 00:00 |
10 | 132.795654 | 109.071220 | 10.443716 | 00:00 |
11 | 128.642639 | 112.930344 | 10.626869 | 00:00 |
12 | 124.508675 | 107.584816 | 10.372310 | 00:00 |
13 | 121.428909 | 113.099953 | 10.634846 | 00:00 |
14 | 119.856216 | 114.224464 | 10.687585 | 00:00 |
15 | 118.349365 | 109.042511 | 10.442342 | 00:00 |
The validation loss also fluctuates in this model’s training. The RMSE metric also does not really improve after the first couple of epochs. Similar to last time, I’ll plot the predicted age vs actual and calculate the correlation between the two:
= age_learn.get_preds(dl=age_dls.valid) age_preds, age_targ
# Age plot
= plt.subplots(1)
fig, ax
'equal')
ax.axis(f'Predicted Age vs Actual; r: {corr(age_preds[:,0], age_targ[:,0]):.2f}')
plt.title(0], age_targ[:,0]); ax.scatter(age_preds[:,
Surprisingly, the single target Age model does not perform much better than my two-target model. I get a similar correlation, and this model also fails to predict ages above around 40.
Single Target: Survived
In Jeremy’s “Why you should use a framework” notebook, he achieves about an 83% accuracy. I’ll use this as a benchmark to compare my model with.
# create dataloaders object
= TabularPandas(
survived_dls
df,=splits,
splits=[Categorify, Normalize],
procs=["Sex", "Pclass", "Embarked", "Deck", "Title"],
cat_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
cont_names="Survived",
y_names=RegressionBlock()
y_block=".") ).dataloaders(path
= tabular_learner(survived_dls, metrics=rmse, layers=[10,10]) survived_learn
=(slide, valley)) survived_learn.lr_find(suggest_funcs
SuggestedLRs(slide=0.05754399299621582, valley=0.0063095735386013985)
16, lr=0.02) survived_learn.fit(
epoch | train_loss | valid_loss | _rmse | time |
---|---|---|---|---|
0 | 0.250865 | 0.240620 | 0.490530 | 00:00 |
1 | 0.200100 | 0.214276 | 0.462899 | 00:00 |
2 | 0.177398 | 0.150440 | 0.387866 | 00:00 |
3 | 0.163052 | 0.135140 | 0.367615 | 00:00 |
4 | 0.153141 | 0.131269 | 0.362311 | 00:00 |
5 | 0.147007 | 0.133025 | 0.364726 | 00:00 |
6 | 0.143294 | 0.132439 | 0.363922 | 00:00 |
7 | 0.138928 | 0.131754 | 0.362979 | 00:00 |
8 | 0.135169 | 0.128147 | 0.357976 | 00:00 |
9 | 0.133087 | 0.125253 | 0.353910 | 00:00 |
10 | 0.130366 | 0.126195 | 0.355240 | 00:00 |
11 | 0.128971 | 0.130248 | 0.360899 | 00:00 |
12 | 0.127474 | 0.128108 | 0.357922 | 00:00 |
13 | 0.126128 | 0.124583 | 0.352963 | 00:00 |
14 | 0.125103 | 0.125416 | 0.354142 | 00:00 |
15 | 0.123530 | 0.129710 | 0.360152 | 00:00 |
= survived_learn.get_preds(dl=survived_dls.valid) survived_preds, survived_targ
== (survived_preds>0.5)).float().mean() (survived_targ
tensor(0.8258)
I get an accuracy of around 83% as well.
Final Thoughts
Here are my takeaways from this experiment:
- A single-target regression model predicts
Survived
better thanAge
. - A two-target regression model (
Survived
andAge
) predictsSurvived
significantly worse than a single-target model (Survived
only). Something about introducing an output forAge
decreases the model’s performance when predicting survival rate. - A two-target regression model (
Survived
andAge
) predictsAge
with about the same correlation as a single-target model (Age
only).
Something about this dataset (and how the model learns from it) makes Age
a poor target for prediction. Perhaps it’s the distribution of ages in the dataset, or the relationship with other columns, that makes it harder for the model to predict it accurately.
I’m happy and proud that I was able to run this experiment after failing to overcome some errors the first couple of times I tried to train a two-target model earlier this week.
I hope you enjoyed this blog post!