Training a Multi-Target Regression Deep Learning Model with fastai

deep learning
python
In this notebook I train single and multi-target regression tabular deep learning models using fastai, and compare the results.
Author

Vishal Bakshi

Published

August 19, 2023

Background

In this blog post I will use fastai to train a model that predicts more than one target for the Kaggle Titanic dataset.

I’ve referenced the notebook Multi-target: Road to the Top, Part 4 by Jeremy Howard as well as a derivative notebook Small models + Multi-targets by Kaggle user Archie Tram (in which he creates a test DataLoader to get predictions from the model).

Plan of Attack

Creating DataLoaders

In Jeremy’s notebook, he is classifying images of plants with two targets: disease and variety of plant.

He creates his DataLoaders object as follows:

dls = DataBlock(
    blocks=(ImageBlock,CategoryBlock,CategoryBlock),
    n_inp=1,
    get_items=get_image_files,
    get_y = [parent_label,get_variety],
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75)
).dataloaders(trn_path)

There are three blocks: 1 input ImageBlock and 2 output CategoryBlocks. The model gets the outputs with parent_label (for the disease) and a custom function get_variety (which grabs the variety column value of the given image from a DataFrame).

In my use case, I will have to follow a similar approach, albeit catered to tabular data.

Calculating Losses

Jeremy calculates loss as the sum of the following:

  • Cross-Entropy loss of the disease inputs
  • Cross-Entropy loss of the variety inputs

I’ll follow a similar approach, except if I use continuous variables as targets I’ll use MSE instead of Cross-Entropy.

Calculating Metrics

Similar to the loss calculation, I’ll combine the calculation of the metric for each of the two targets. For continuous variables, I’ll use RMSE.

Training a Multi-Target Model

With a rough plan outlined, I’ll start the training process with loading and cleaning the Titanic dataset.

Load and Clean Data

from fastai.tabular.all import *
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle: path = Path("../input/titanic")
else:
  path = Path('titanic')
  if not path.exists():
    import zipfile, kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 6.18MB/s]
# load the training data and look at it
df = pd.read_csv(path/'train.csv')
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

# feature engineering
def add_features(df):
  df['LogFare'] = np.log1p(df['Fare'])
  df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
  df['Family'] = df.SibSp+df.Parch
  df['Alone'] = df.Family == 0
  df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
  df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
  df['Title'] = df.Title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master"))
# add the features to our dataframe
add_features(df)
# view the topmost row of the modes DataFrame
modes = df.mode().iloc[0]
modes
PassengerId                      1
Survived                       0.0
Pclass                         3.0
Name           Abbing, Mr. Anthony
Sex                           male
Age                           24.0
SibSp                          0.0
Parch                          0.0
Ticket                        1601
Fare                          8.05
Cabin                      B96 B98
Embarked                         S
LogFare                   2.202765
Deck                           ABC
Family                         0.0
Alone                         True
TicketFreq                     1.0
Title                           Mr
Name: 0, dtype: object
# fill missing data with the column's mode
df.fillna(modes, inplace=True)
# check that we no longer have missing data
df.isna().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
LogFare        0
Deck           0
Family         0
Alone          0
TicketFreq     0
Title          0
dtype: int64
# create training and validation index lists
splits = RandomSplitter(seed=42)(df)

Create DataLoaders

I’ll take most of the code from the Why you should use a framework notebook by Jeremy, with the following changes:

  • Remove "Age" from cont_names and move it to y_names along with "Survived" which will be our two targets.
  • Set n_out=2 for the RegressionBlock.

I’ll treat both targets as a regression, as I wasn’t able to provide two DataBlocks for y_block.

Since I’ve filled in missing values manually, I have removed the FillMissing item from procs.

# create dataloaders object
dls = TabularPandas(
    df,
    splits=splits,
    procs=[Categorify, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names=["Age", "Survived"],
    y_block=RegressionBlock(n_out=2)
).dataloaders(path=".")
dls.show_batch()
Sex Pclass Embarked Deck Title SibSp Parch LogFare Alone TicketFreq Family Age Survived
0 male 1 S ABC Mr 1.000000e+00 -9.897945e-09 3.970292 2.458140e-08 2.0 1.000000e+00 42.0 0.0
1 male 3 S ABC Mr 1.689237e-09 -9.897945e-09 2.230014 1.000000e+00 1.0 -1.856774e-08 18.0 0.0
2 male 2 S ABC Mr 1.000000e+00 2.000000e+00 3.358638 2.458140e-08 3.0 3.000000e+00 36.0 0.0
3 male 3 C ABC Mr 1.000000e+00 1.000000e+00 2.107689 2.458140e-08 1.0 2.000000e+00 17.0 0.0
4 male 3 S ABC Mr 1.689237e-09 -9.897945e-09 2.351375 1.000000e+00 1.0 -1.856774e-08 28.0 0.0
5 female 3 S ABC Mrs 1.000000e+00 4.000000e+00 3.363842 2.458140e-08 6.0 5.000000e+00 45.0 0.0
6 male 3 S ABC Mr 1.689237e-09 -9.897945e-09 2.324836 1.000000e+00 1.0 -1.856774e-08 23.0 0.0
7 female 3 C ABC Mrs 1.689237e-09 -9.897945e-09 2.107178 1.000000e+00 1.0 -1.856774e-08 24.0 1.0
8 male 3 S ABC Mr 1.689237e-09 -9.897945e-09 2.188856 1.000000e+00 1.0 -1.856774e-08 39.0 1.0
9 female 2 C ABC Mrs 1.000000e+00 -9.897945e-09 3.436269 2.458140e-08 2.0 1.000000e+00 14.0 1.0

Create Loss Function

If I understand correctly, we will get 2 columns of predictions, and two variables of targets to compute the loss with:

def age_loss(pred, yb): return F.mse_loss(pred[:,0], yb[:,0])
def survived_loss(pred, yb): return F.mse_loss(pred[:,1], yb[:,1])

def combine_loss(pred, yb): return age_loss(pred, yb) + survived_loss(pred, yb)

Create Metric Function

I’ll create an RMSE function for each target variable:

def age_rmse(pred, yb): return torch.sqrt(F.mse_loss(pred[:,0], yb[:,0]))
def survived_rmse(pred, yb): return torch.sqrt(F.mse_loss(pred[:,1], yb[:,1]))

rmse_metrics = (age_rmse, survived_rmse)
learn = tabular_learner(dls, loss_func=combine_loss, metrics=rmse_metrics, layers=[10,10], n_out=2)

Most times that I ran the learning rate finder, the loss was steadily increasing from the get-go. I randomly came across the following learning rate regime which looks more stable, so I’ll use the given value.

learn.lr_find(suggest_funcs=(slide, valley))
SuggestedLRs(slide=6.309573450380412e-07, valley=0.14454397559165955)

learn.fit(20, lr=0.1)
epoch train_loss valid_loss age_rmse survived_rmse time
0 766.320923 554.657837 23.431234 0.721841 00:00
1 460.486603 170.207932 13.030014 0.590790 00:00
2 335.931213 132.264999 11.456180 0.649899 00:00
3 265.317535 116.719322 10.778342 0.477045 00:00
4 221.392242 121.840828 11.004195 0.441827 00:00
5 192.420349 132.113815 11.457218 0.472019 00:00
6 173.592255 120.654694 10.943729 0.462033 00:00
7 159.223709 113.375626 10.612040 0.519316 00:00
8 148.853653 114.346222 10.654099 0.484549 00:00
9 140.409439 109.572639 10.437387 0.467927 00:00
10 133.942352 114.497719 10.642965 0.590436 00:00
11 129.807709 110.892578 10.500125 0.455730 00:00
12 125.972458 112.508110 10.570338 0.451019 00:00
13 122.350586 126.790512 11.167099 0.512433 00:00
14 119.345764 112.307846 10.571351 0.579465 00:00
15 117.329689 113.805359 10.628425 0.484336 00:00
16 116.328194 115.227859 10.696632 0.475317 00:00
17 115.390640 115.162354 10.710686 0.500142 00:00
18 116.044281 125.941689 11.149549 0.558260 00:00
19 115.501900 116.436340 10.739085 0.500779 00:00

After a few epochs, the RMSE values stop improving. The validation loss also fluctuates throughout the training after decreasing for the first three epochs.

Comparing Predictions to Actuals

Based on how the training went, I’m not expecting this model to be able to predict Age and Survived very well. I’ll use the validation set to get predictions and then calculate accuracy for Survived and correlation between actuals vs. predictions for Age.

preds, targ = learn.get_preds(dl=dls.valid)
# Survived accuracy
(targ[:,1] == (preds[:,1]>0.5)).float().mean()
tensor(0.6348)
def corr(x,y): return np.corrcoef(x,y)[0][1]
# Age plot
fig, ax = plt.subplots(1)

ax.axis('equal')
plt.title(f'Predicted Age vs Actual; r: {corr(preds[:,0], targ[:,0]):.2f}')
ax.scatter(preds[:,0], targ[:,0]);

The model achieved shoddy accuracy (63%) and an uninspiring correlation between predicted and actual age. The model did particularly poorly in predicting ages above 40.

Comparing to Single-Target Models

I’m curious to see how the model performs when I train it for single targets. I’ll train one regression model for Age, another separate regression model for Survived, and see how their results compare to the combined two-target model.

Single Target: Age

# create dataloaders object
age_dls = TabularPandas(
    df,
    splits=splits,
    procs=[Categorify, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Age",
    y_block=RegressionBlock()
).dataloaders(path=".")
age_learn = tabular_learner(age_dls, metrics=rmse, layers=[10,10])

I ran the learning rate finder 10 times and got similar charts each time, which tells me that something about this model is more stable than my two-target model.

age_learn.lr_find(suggest_funcs=(slide, valley))
SuggestedLRs(slide=6.309573450380412e-07, valley=0.0831763744354248)

age_learn.fit(16, lr=0.1)
epoch train_loss valid_loss _rmse time
0 781.124268 233.326263 15.275021 00:00
1 454.851532 408.981842 20.223301 00:00
2 328.806274 116.149773 10.777281 00:00
3 263.302643 119.088097 10.912749 00:00
4 219.239166 127.125175 11.274981 00:00
5 190.565811 111.707756 10.569189 00:00
6 171.005737 113.618858 10.659215 00:00
7 157.105713 109.284859 10.453939 00:00
8 146.396072 118.541183 10.887661 00:00
9 138.696716 107.435219 10.365096 00:00
10 132.795654 109.071220 10.443716 00:00
11 128.642639 112.930344 10.626869 00:00
12 124.508675 107.584816 10.372310 00:00
13 121.428909 113.099953 10.634846 00:00
14 119.856216 114.224464 10.687585 00:00
15 118.349365 109.042511 10.442342 00:00

The validation loss also fluctuates in this model’s training. The RMSE metric also does not really improve after the first couple of epochs. Similar to last time, I’ll plot the predicted age vs actual and calculate the correlation between the two:

age_preds, age_targ = age_learn.get_preds(dl=age_dls.valid)
# Age plot
fig, ax = plt.subplots(1)

ax.axis('equal')
plt.title(f'Predicted Age vs Actual; r: {corr(age_preds[:,0], age_targ[:,0]):.2f}')
ax.scatter(age_preds[:,0], age_targ[:,0]);

Surprisingly, the single target Age model does not perform much better than my two-target model. I get a similar correlation, and this model also fails to predict ages above around 40.

Single Target: Survived

In Jeremy’s “Why you should use a framework” notebook, he achieves about an 83% accuracy. I’ll use this as a benchmark to compare my model with.

# create dataloaders object
survived_dls = TabularPandas(
    df,
    splits=splits,
    procs=[Categorify, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Survived",
    y_block=RegressionBlock()
).dataloaders(path=".")
survived_learn = tabular_learner(survived_dls, metrics=rmse, layers=[10,10])
survived_learn.lr_find(suggest_funcs=(slide, valley))
SuggestedLRs(slide=0.05754399299621582, valley=0.0063095735386013985)

survived_learn.fit(16, lr=0.02)
epoch train_loss valid_loss _rmse time
0 0.250865 0.240620 0.490530 00:00
1 0.200100 0.214276 0.462899 00:00
2 0.177398 0.150440 0.387866 00:00
3 0.163052 0.135140 0.367615 00:00
4 0.153141 0.131269 0.362311 00:00
5 0.147007 0.133025 0.364726 00:00
6 0.143294 0.132439 0.363922 00:00
7 0.138928 0.131754 0.362979 00:00
8 0.135169 0.128147 0.357976 00:00
9 0.133087 0.125253 0.353910 00:00
10 0.130366 0.126195 0.355240 00:00
11 0.128971 0.130248 0.360899 00:00
12 0.127474 0.128108 0.357922 00:00
13 0.126128 0.124583 0.352963 00:00
14 0.125103 0.125416 0.354142 00:00
15 0.123530 0.129710 0.360152 00:00
survived_preds, survived_targ = survived_learn.get_preds(dl=survived_dls.valid)
(survived_targ == (survived_preds>0.5)).float().mean()
tensor(0.8258)

I get an accuracy of around 83% as well.

Final Thoughts

Here are my takeaways from this experiment:

  • A single-target regression model predicts Survived better than Age.
  • A two-target regression model (Survived and Age) predicts Survived significantly worse than a single-target model (Survived only). Something about introducing an output for Age decreases the model’s performance when predicting survival rate.
  • A two-target regression model (Survived and Age) predicts Age with about the same correlation as a single-target model (Age only).

Something about this dataset (and how the model learns from it) makes Age a poor target for prediction. Perhaps it’s the distribution of ages in the dataset, or the relationship with other columns, that makes it harder for the model to predict it accurately.

I’m happy and proud that I was able to run this experiment after failing to overcome some errors the first couple of times I tried to train a two-target model earlier this week.

I hope you enjoyed this blog post!