Training Collaborative Filtering Models on MovieLens 100k with Different Weight Decay Values

machine learning
fastai
python
In this notebook I explore the question—how does the wd (weight decay) parameter affect model performance and weight distributions? I use the MovieLens 100k subset as the dataset.
Author

Vishal Bakshi

Published

June 3, 2024

Background

In fastai Part 1 Lesson 7 (Chapter 8 of the textbook), we are introduced to the concept of weight decay, where the sum of the squared weights multiplied by a factor wd is added to the loss. In practice, instead of adding wd times the sum of squared weights to the loss (which would produce large loss values and destabilize training), we add the derivative of that penalty to the parameters’ gradients. In pseudocode:

for param in model.parameters():
  param.grad += 2 * wd * param

By increasing the loss with the sum of squared weights, we force the model to keep its weights small as it minimizes the loss, which helps it generalize better (large weights result in a neural net function with sharp peaks and troughs that can overfit to the training data).
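As a quick sanity check of that equivalence, here is a minimal sketch (the variable names are just illustrative) showing that the gradient of the wd * sum(w**2) penalty is exactly 2 * wd * w:

import torch

# the gradient of wd * (w**2).sum() with respect to w is 2 * wd * w,
# which is why adding 2 * wd * param to each gradient has the same
# effect as adding the penalty term to the loss
wd = 0.1
w = torch.randn(5, requires_grad=True)
penalty = wd * (w ** 2).sum()
penalty.backward()
print(torch.allclose(w.grad, 2 * wd * w.detach()))  # True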

In this notebook I’ll train on the MovieLens 100k dataset using different values of weight decay and then plot both the resulting training and validation loss curves and the weight distributions to see how weight decay affects the training process and the resulting model.

Create DataLoaders and Learner

I’ll re-use the code from Chapter 8 to create the DataLoaders and the Learner.

from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
user title rating
0 294 Money Talks (1997) 3
1 486 Crossing Guard, The (1995) 4
2 13 High Noon (1952) 5
3 878 My Life as a Dog (Mitt liv som hund) (1985) 5
4 433 Star Wars (1977) 5
5 847 Streetcar Named Desire, A (1951) 3
6 833 Akira (1988) 4
7 503 Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995) 3
8 5 Last of the Mohicans, The (1992) 1
9 868 Stand by Me (1986) 4

In the text, the following Learner resulted in the lowest validation loss (0.821688), so I’ll just use that configuration in this experiment.

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

I’ll do an initial training run to wrap my head around how I’m going to collect data during the experiment.

learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.882671 0.955993 00:12
1 0.654915 0.892832 00:12
2 0.525330 0.872967 00:12
3 0.436863 0.858952 00:12
4 0.438653 0.854505 00:12

Planning the Experiment

I’ll repeat the same training process (5 epochs, 5e-3 learning rate) with different weight decay wd values (from 0.0001 to 1.0). Once the experiment is done, I want to visualize the following for all wd values:

  • training loss curve.
  • validation loss curve.
  • distribution of model parameters.

I’ll reference the plot_loss source code for how to capture those values from the Learner’s Recorder object.

But first, I’ll plot the losses using that plot_loss function to have a baseline reference:

learn.recorder.plot_loss();

Plotting Loss Values

The following line in plot_loss plots the training loss:

ax.plot(list(range(skip_start, len(self.losses))), self.losses[skip_start:], label='train')

I’ll see if I can recreate that plot:

ax=plt.gca()
ax.plot(list(range(5, len(learn.recorder.losses))), learn.recorder.losses[5:], label='train');

Cool, that looks the same! Now for the validation loss:

ax=plt.gca()
ax.plot(list(range(5, len(learn.recorder.losses))), learn.recorder.losses[5:], label='train');
idx = (np.array(learn.recorder.iters)<5).sum()
valid_col = learn.recorder.metric_names.index('valid_loss') - 1
ax.plot(learn.recorder.iters[idx:], L(learn.recorder.values[idx:]).itemgot(valid_col), label='valid')
ax.legend();

Nice, I’m able to recreate plot_loss using the raw data recorded in the Recorder.
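Since I’ll be plotting these for many models, this recreation could be wrapped into a small helper (a sketch; the function name is my own, and it assumes the fastai star imports above already provide plt, np, and L):

def plot_recorder_losses(recorder, ax=None, skip_start=5, label_prefix=''):
    "Recreate Recorder.plot_loss from the raw values stored in a Recorder."
    if ax is None: ax = plt.gca()
    # training loss: one value per batch, skipping the first few noisy steps
    ax.plot(list(range(skip_start, len(recorder.losses))),
            recorder.losses[skip_start:], label=f'{label_prefix}train')
    # validation loss: one value per epoch, recorded at the iteration counts in recorder.iters
    idx = (np.array(recorder.iters) < skip_start).sum()
    valid_col = recorder.metric_names.index('valid_loss') - 1
    ax.plot(recorder.iters[idx:], L(recorder.values[idx:]).itemgot(valid_col),
            label=f'{label_prefix}valid')
    ax.legend()
    return ax

plot_recorder_losses(learn.recorder);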

Plotting Distribution of Weights

Next, I want to be able to visualize (and calculate summary statistics for) the weights of the model.

The model has four sets of parameters: the user and item weights, and the user and item biases.

learn.model
EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
for p in learn.model.parameters():
  print(p.shape)
torch.Size([944, 50])
torch.Size([1665, 50])
torch.Size([944, 1])
torch.Size([1665, 1])
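For context, these four parameter sets are combined along the lines of the book’s DotProductBias model; here’s a rough sketch of the forward computation (my own simplification, not fastai’s exact source):

def dot_bias_forward(x, u_weight, i_weight, u_bias, i_bias, y_range=(0, 5.5)):
    # x[:, 0] holds user indices, x[:, 1] holds item (movie) indices
    users, items = u_weight[x[:, 0]], i_weight[x[:, 1]]
    res = (users * items).sum(dim=1, keepdim=True)   # dot product of latent factors
    res += u_bias[x[:, 0]] + i_bias[x[:, 1]]         # add per-user and per-item biases
    # squash predictions into y_range with a scaled sigmoid
    return torch.sigmoid(res) * (y_range[1] - y_range[0]) + y_range[0]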

Let’s take a look at the first set of parameters, which are the u_weight (user weight) parameters. This is the embedding with 50 latent factors for each user.

pd.Series(first(learn.model.parameters()).flatten().detach().numpy()).hist();

Next, let’s look at the bias parameters for the users (u_bias):

import itertools
# islice skips the first two parameter tensors (u_weight, i_weight) to reach u_bias
u_bias_params = next(itertools.islice(learn.model.parameters(), 2, None))
u_bias_params.shape
torch.Size([944, 1])
pd.Series(u_bias_params.flatten().detach().numpy()).hist();

Looking at all of the parameters:

torch.nn.utils.parameters_to_vector(learn.model.parameters()).shape, 944*50 + 944*1 + 1665*50 + 1665*1
(torch.Size([133059]), 133059)
pd.Series(torch.nn.utils.parameters_to_vector(learn.model.parameters()).detach().numpy()).hist();
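Since I also want summary statistics for the weights, here’s the kind of quick summary I have in mind (a sketch over the flattened parameter vector):

all_params = torch.nn.utils.parameters_to_vector(learn.model.parameters()).detach()
summary = {
    'mean': all_params.mean().item(),
    'std': all_params.std().item(),
    'mean_abs': all_params.abs().mean().item(),
    'median_abs': all_params.abs().median().item(),
    'min': all_params.min().item(),
    'max': all_params.max().item(),
}
summary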

Now I know how to access and plot the training loss, validation loss and model weights.

Estimating Runtime and Storage Size

Each model takes about 1 minute to train on Colab (10-14 seconds per epoch), so 360 models works out to roughly 6 hours of training. I tried training on Paperspace CPUs but it was taking 30+ seconds per epoch. I have a 2015 Mac so I can’t install fastai locally (if I’m interpreting this note correctly).

Originally I was going to use torch.arange to create the weight decay values (without subtracting 1e-6 from the upper bound as I eventually did below), but it accumulated floating point error. The following tensor should stop at 0.0990 but goes up to 0.1000 because of that error.

torch.arange(0.01, 0.1, 0.001)
tensor([0.0100, 0.0110, 0.0120, 0.0130, 0.0140, 0.0150, 0.0160, 0.0170, 0.0180,
        0.0190, 0.0200, 0.0210, 0.0220, 0.0230, 0.0240, 0.0250, 0.0260, 0.0270,
        0.0280, 0.0290, 0.0300, 0.0310, 0.0320, 0.0330, 0.0340, 0.0350, 0.0360,
        0.0370, 0.0380, 0.0390, 0.0400, 0.0410, 0.0420, 0.0430, 0.0440, 0.0450,
        0.0460, 0.0470, 0.0480, 0.0490, 0.0500, 0.0510, 0.0520, 0.0530, 0.0540,
        0.0550, 0.0560, 0.0570, 0.0580, 0.0590, 0.0600, 0.0610, 0.0620, 0.0630,
        0.0640, 0.0650, 0.0660, 0.0670, 0.0680, 0.0690, 0.0700, 0.0710, 0.0720,
        0.0730, 0.0740, 0.0750, 0.0760, 0.0770, 0.0780, 0.0790, 0.0800, 0.0810,
        0.0820, 0.0830, 0.0840, 0.0850, 0.0860, 0.0870, 0.0880, 0.0890, 0.0900,
        0.0910, 0.0920, 0.0930, 0.0940, 0.0950, 0.0960, 0.0970, 0.0980, 0.0990,
        0.1000])

The following arange call results in the correct number of values:

torch.arange(0.0001, 0.001-1e-6, 0.00001).shape, \
torch.arange(0.001, 0.01-1e-6, 0.0001).shape, \
torch.arange(0.01, 0.10-1e-6, 0.001).shape, \
torch.arange(0.1, 1.0-1e-6, 0.01).shape
(torch.Size([90]), torch.Size([90]), torch.Size([90]), torch.Size([90]))
wds = torch.cat([
    torch.arange(0.0001, 0.001-1e-6, 0.00001),
    torch.arange(0.001, 0.01-1e-6, 0.0001),
    torch.arange(0.01, 0.10-1e-6, 0.001),
    torch.arange(0.1, 1.0-1e-6, 0.01)])
wds.shape
torch.Size([360])
pd.Series(wds).plot();
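An alternative that sidesteps the floating point accumulation entirely is torch.linspace, which takes an explicit number of steps rather than a step size; a sketch (not what I used above):

wds_alt = torch.cat([
    torch.linspace(0.0001, 0.001, 91)[:-1],  # 90 values per decade, endpoint dropped to match arange
    torch.linspace(0.001, 0.01, 91)[:-1],
    torch.linspace(0.01, 0.1, 91)[:-1],
    torch.linspace(0.1, 1.0, 91)[:-1]])
wds_alt.shape  # torch.Size([360])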

Next, I’ll estimate the disk space required to store the training loss, validation loss and weights for all 360 models (1 per wd value).

With ChatGPT’s help:

import pickle
import io

def get_pickled_object_size(obj):
    # Create a BytesIO buffer
    buffer = io.BytesIO()
    # Pickle the object into the buffer
    pickle.dump(obj, buffer)
    # Get the size of the buffer
    size = buffer.getbuffer().nbytes
    return size
res = {
    'training_loss': learn.recorder.losses,
    'iters': learn.recorder.iters,
    'valid_cols': learn.recorder.metric_names.index('valid_loss') - 1,
    'validation_loss': L(learn.recorder.values[(np.array(learn.recorder.iters)<5).sum():]).itemgot(learn.recorder.metric_names.index('valid_loss') - 1),
    'weights': torch.nn.utils.parameters_to_vector(learn.model.parameters())
}
size = get_pickled_object_size(res)
print(f'The size of the pickled object for one model is {size/1e6} MB.')
print(f'The total size of the pickled objects for 360 models will be ~{360*size/1e6} MB.')
The size of the pickled object for one model is 2.345662 MB.
The total size of the pickled objects for 360 models will be ~844.43832 MB.

That’s quite a bit, and it seems unnecessarily large. learn.recorder.losses is a list of tensors:

learn.recorder.losses[:5]
[TensorBase(1.8254),
 TensorBase(1.8496),
 TensorBase(1.9542),
 TensorBase(1.8464),
 TensorBase(1.9200)]

Perhaps I can convert that to a list of floats and see if it reduces the storage size:

res = {
    'training_loss': [t.item() for t in learn.recorder.losses],
    'iters': learn.recorder.iters,
    'valid_cols': learn.recorder.metric_names.index('valid_loss') - 1,
    'validation_loss': L(learn.recorder.values[(np.array(learn.recorder.iters)<5).sum():]).itemgot(learn.recorder.metric_names.index('valid_loss') - 1),
    'weights': torch.nn.utils.parameters_to_vector(learn.model.parameters())
}

size = get_pickled_object_size(res)
print(f'The size of the pickled object for one model is {size/1e6} MB.')
print(f'The total size of the pickled objects for 360 models will be ~{360*size/1e6} MB.')
The size of the pickled object for one model is 0.589096 MB.
The total size of the pickled objects for 360 models will be ~212.07456 MB.

That’s better! Most of the size is coming from my weights.

size = get_pickled_object_size(torch.nn.utils.parameters_to_vector(learn.model.parameters()))
print(f'The size of the pickled object for one model is {size/1e6} MB.')
print(f'The total size of the pickled objects for 360 models will be ~{360*size/1e6} MB.')
The size of the pickled object for one model is 0.532648 MB.
The total size of the pickled objects for 360 models will be ~191.75328 MB.

With ChatGPT’s help: there are 133059 float32 elements in the weights tensor. Each float32 element is 4 bytes, and 4 × 133059 = 532236 bytes, which is ~0.5 MB. I’ll keep them as float32s and accept the larger disk space (since I’m keeping this data only temporarily).
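That back-of-the-envelope number can be checked directly on the tensor with the standard element_size and nelement methods:

weights = torch.nn.utils.parameters_to_vector(learn.model.parameters())
# bytes per element, number of elements, total bytes
weights.element_size(), weights.nelement(), weights.element_size() * weights.nelement()
# (4, 133059, 532236)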

Running the Experiment

Now I can run the experiment and collect the data. I’ll do a quick run with 3 models and save the Learner objects to make sure I’m collecting the data correctly.

res = {'training_loss': [], 'iters': [], 'validation_loss': [], 'weights': []}
learners = []
count = 0

for wd in wds:
  if count == 3: break
  learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
  learn.fit_one_cycle(5, 5e-3, wd=wd.item())
  learn.recorder.plot_loss();
  learners.append(learn)

  res['training_loss'].append([t.item() for t in learn.recorder.losses])
  res['iters'].append(learn.recorder.iters)
  res['validation_loss'].append(L(learn.recorder.values[(np.array(learn.recorder.iters)<5).sum():]).itemgot(learn.recorder.metric_names.index('valid_loss') - 1))
  res['weights'].append(torch.nn.utils.parameters_to_vector(learn.model.parameters()))
  count += 1
epoch train_loss valid_loss time
0 0.867206 0.947485 00:12
1 0.612190 0.929238 00:12
2 0.401040 0.964420 00:13
3 0.298337 0.977934 00:12
4 0.293123 0.979958 00:12
epoch train_loss valid_loss time
0 0.879830 0.946919 00:12
1 0.544347 0.930262 00:12
2 0.416691 0.962877 00:12
3 0.298801 0.978486 00:12
4 0.269458 0.979087 00:11
epoch train_loss valid_loss time
0 0.867352 0.940780 00:11
1 0.624375 0.930750 00:12
2 0.405802 0.975494 00:12
3 0.325339 0.991441 00:12
4 0.300811 0.993090 00:12

Good to know—it plots on the same chart! I could just use this functionality instead of saving the training and validation loss values, but I want to keep those to do additional visualizations.

Next, I’ll check that the recorded values are the same as the Learner object.

len(res['training_loss'])
3
res['training_loss'][0] == [t.item() for t in learners[0].recorder.losses], \
res['iters'][1] == learners[1].recorder.iters, \
res['validation_loss'][2] == L(learners[2].recorder.values[(np.array(learners[2].recorder.iters)<5).sum():]).itemgot(learners[2].recorder.metric_names.index('valid_loss') - 1), \
(res['weights'][2] == torch.nn.utils.parameters_to_vector(learners[2].model.parameters())).sum()
(True, True, True, tensor(133059))

Testing that I can pickle and unpickle the results:

save_pickle('wd_res_test.pkl', res)
res = load_pickle('/content/wd_res_test.pkl')

res['training_loss'][0] == [t.item() for t in learners[0].recorder.losses], \
res['iters'][1] == learners[1].recorder.iters, \
res['validation_loss'][2] == L(learners[2].recorder.values[(np.array(learners[2].recorder.iters)<5).sum():]).itemgot(learners[2].recorder.metric_names.index('valid_loss') - 1), \
(res['weights'][2] == torch.nn.utils.parameters_to_vector(learners[2].model.parameters())).sum()
(True, True, True, tensor(133059))

Nice!

However, I noticed that some epochs took up to 13 seconds. I wonder if the training time is affected by the weight decay value? I’ll train with a much larger weight decay and see if it speeds up the training.

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.9)
epoch train_loss valid_loss time
0 0.990363 1.074203 00:12
1 1.003346 1.067053 00:12
2 0.976281 1.029760 00:12
3 0.978760 1.001970 00:12
4 0.912565 0.993564 00:12

Nope, seems like the training time per epoch just varies.

res = {'wds': wds, 'training_loss': [], 'iters': [], 'validation_loss': [], 'weights': []}

for wd in wds:
  learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
  with learn.no_logging(), learn.no_bar():
    learn.fit_one_cycle(5, 5e-3, wd=wd.item())

  res['training_loss'].append([t.item() for t in learn.recorder.losses])
  res['iters'].append(learn.recorder.iters)
  res['validation_loss'].append(L(learn.recorder.values[(np.array(learn.recorder.iters)<5).sum():]).itemgot(learn.recorder.metric_names.index('valid_loss') - 1))
  res['weights'].append(torch.nn.utils.parameters_to_vector(learn.model.parameters()))

save_pickle('wd_res.pkl', res)

Analyzing Training Results

matplotlib Animations

I’ll create an animation with matplotlib which highlights the training loss curve for each weight decay value and prints the weight decay value on the plot.

To start, I’ll make sure that all 360 of my training loss lists are of the same length (6250 steps).

wd_res = load_pickle('/content/wd_res.pkl')
trn_loss_t = torch.stack([torch.tensor(el) for el in wd_res['training_loss']])
trn_loss_t.shape # looks good
torch.Size([360, 6250])

Next, I’ll create my x values for the plot as a range from 0 to 6249.

x = list(range(len(wd_res['training_loss'][0])))
len(x), x[:5], x[-5:]
(6250, [0, 1, 2, 3, 4], [6245, 6246, 6247, 6248, 6249])

With ChatGPT’s help, I used the following code to create an animated GIF which highlights the training and validation loss curves for the given weight decay value.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Prepare Data
x = list(range(len(wd_res['training_loss'][0])))

data = wd_res['training_loss']

labels = wd_res['wds']

# Create Static Plot
fig, ax = plt.subplots()
ax.set_xlabel('Step')
ax.set_ylabel('Training Loss')

lines = [ax.plot(x, y, alpha=0.0, color='#ff0088', linewidth=2)[0] for y in data]  # Initial lines with lower opacity
text = ax.text(0.5, 1.05, "", transform=ax.transAxes, ha="center")

# Define Animation Function
def animate(i):
    # Reset all lines to low opacity
    for line in lines:
        line.set_alpha(0.0)

    # Highlight the current line
    lines[i].set_alpha(1.0)

    # Update the text
    text.set_text(f"weight decay = {labels[i].item():.5f}")

    return lines + [text]

# Create Animation
ani = FuncAnimation(fig, animate, frames=len(data), interval=150, blit=True)

# Save as GIF
gif_writer = PillowWriter(fps=6)
ani.save('training_loss.gif', writer=gif_writer)

Here’s the GIF for training loss. The loss curve slowly moves upward with the minimum loss value increasing as weight decay increases. It really starts picking up at a weight decay value of around 0.1.

Here’s the GIF for the validation loss. Within each training run the loss curve first decreases and then increases; at a weight decay value of around 0.03 it switches to decreasing monotonically, and the minimum validation loss keeps getting lower. The curve starts shifting upward again at a weight decay value of around 0.3.
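To complement the animations, a static summary makes those turning points easier to pin down; here’s a sketch that pulls the minimum training and validation loss for each weight decay value out of the wd_res dictionary built above:

min_trn = [min(l) for l in wd_res['training_loss']]
min_val = [min(l) for l in wd_res['validation_loss']]

fig, ax = plt.subplots()
ax.plot(wd_res['wds'], min_trn, label='min training loss')
ax.plot(wd_res['wds'], min_val, label='min validation loss')
ax.set_xscale('log')  # the wd values span four orders of magnitude
ax.set_xlabel('weight decay')
ax.set_ylabel('loss')
ax.legend();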

Next, I’ll animate the distribution of the weights (as weight decay value changes), modifying the above code accordingly:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Prepare Data
data = wd_res['weights']
labels = wd_res['wds']

# Create Static Plot
fig, ax = plt.subplots()
ax.set_xlabel('Weights')
ax.set_ylabel('Count')

# Define Animation Function
def animate(i):
    ax.clear()  # Clear previous histogram
    n, bins, patches = ax.hist(data[i].detach().numpy())
    ax.set_title(f"weight decay = {labels[i].item():.5f}")
    ax.set_xlabel('Weights')
    ax.set_ylabel('Count')
    ax.set_xlim(-0.5, 0.5)  # Set x-axis limits to keep consistent across frames
    ax.set_ylim(0, 75000)  # Set y-axis limits to keep consistent across frames
    return patches

# Create Animation
ani = FuncAnimation(fig, animate, frames=len(data), interval=150, blit=True)

# Save as GIF
gif_writer = PillowWriter(fps=6)
ani.save('weights.gif', writer=gif_writer)

Here’s the resulting GIF showing the changing distribution of weights. Note that the range of weights starts out extending beyond -0.5 to +0.5, and ends up within -0.2 to +0.2 as the weight decay value increases to 1.0.

Final Thoughts

As always, I love experimenting with (relatively) simple concepts and observing how they affect training performance and results. Recapping the trends I observed:

  • Minimum training loss increases as weight decay increases.
  • Minimum validation loss decreases up to a point (for this dataset and model, a weight decay value of ~0.2) and then increases as weight decay increases.
  • The mean and median absolute values of the final weights decrease (exponentially) as weight decay increases (see the sketch below).
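As a rough sketch of how that last point could be computed from the saved results (assuming the wd_res dictionary from the experiment is still in memory):

mean_abs = [w.detach().abs().mean().item() for w in wd_res['weights']]
median_abs = [w.detach().abs().median().item() for w in wd_res['weights']]

fig, ax = plt.subplots()
ax.plot(wd_res['wds'], mean_abs, label='mean |weight|')
ax.plot(wd_res['wds'], median_abs, label='median |weight|')
ax.set_xscale('log')
ax.set_xlabel('weight decay')
ax.set_ylabel('absolute value of final weights')
ax.legend();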

I also enjoyed creating animated GIFs as a visual tool to illustrate the distributions and trends of weights and losses as weight decay increases. The resulting GIFs could be improved by tweaking frame rate and other parameters, but I’m happy with the overall trends they convey, which match the static plots I created.

As always, I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.