Training Collaborative Filtering Models on MovieLens 100k with Different y_range Values

machine learning
fastai
python
In this notebook I explore the question: how does the y_range parameter affect model performance and prediction distributions? I use the MovieLens 100k subset as the dataset.
Author

Vishal Bakshi

Published

May 23, 2024

Background

In Chapter 8 of the fastai textbook we train a collaborative filtering model that predicts movie ratings for users (for movies they haven't watched yet). It's one way of asking the question: would this user like this movie, given their interests and the movie's characteristics? The users' "interests" and the movies' "characteristics" are the latent factors that we train. The predicted rating is the dot product between the user and movie latent factors (plus a bias for each user and movie). This dot product passes through the sigmoid_range function, which squashes its input into a given output range. In the textbook, the range we use is 0 to 5.5. We use 5.5 because the sigmoid function never quite reaches 1, so 5 * sigmoid would never reach 5 (the maximum movie rating). Overshooting by 0.5 solves this issue: 5.5 * sigmoid can comfortably output a 5.
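To make that concrete, here's a quick sketch of sigmoid_range (this mirrors fastai's definition, which is just a sigmoid scaled and shifted into the target range; the example values are my own):

import torch

def sigmoid_range(x, low, high):
    "Scale sigmoid's (0, 1) output into the (low, high) range."
    return torch.sigmoid(x) * (high - low) + low

x = torch.tensor([-4., 0., 2., 4.])
print(sigmoid_range(x, 0, 5))    # tops out around 4.91 at x=4 and never quite reaches 5
print(sigmoid_range(x, 0, 5.5))  # the same activations now reach past 5 (about 5.40 at x=4)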

In this blog post I’ll explore the question: how does model performance vary as y_range varies?

Here is a summary of my results: different statistics (rows) for different y_range values (columns), computed on the validation set (20k samples):

Statistic            None    (0, 5.5)  (0.5, 5.5)  (0.75, 5.25)  (1, 5)   (-2, 8)
Median Prediction    3.49    3.53      3.54        3.54          3.53     3.55
Mean Prediction      3.43    3.48      3.50        3.50          3.49     3.51
Kurtosis             0.78    -0.06     -0.10       -0.11         -0.14    0.12
Skew                 -0.59   -0.39     -0.34       -0.36         -0.38    -0.32
Anderson-Darling     71.7    42.4      33.4        37.0          42.0     21.8
% preds outside 1-5  0.93%   0.23%     0.21%       0.07%         0.00%    1.19%

And for the training set (80k samples):

Statistic            None    (0, 5.5)  (0.5, 5.5)  (0.75, 5.25)  (1, 5)   (-2, 8)
Median Prediction    3.50    3.60      3.60        3.60          3.59     3.63
Mean Prediction      3.44    3.49      3.50        3.50          3.50     3.52
Kurtosis             0.42    0.15      0.003       0.003         -0.06    0.20
Skew                 -0.49   -0.61     -0.53       -0.56         -0.56    -0.50
Anderson-Darling     228.9   490.4     388.1       444.8         467.6    350.4
% preds outside 1-5  0.68%   0.31%     0.23%       0.05%         0.00%    1.85%

Training without y_range

It makes sense to first explore the loss and output distribution when training a collaborative filtering model on the MovieLens 100k subset without setting y_range. I'll reuse the code from the textbook to prepare the data and DataLoaders, and train with a weight decay of 0.1:

from scipy.stats import anderson
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
user title rating
0 815 Groundhog Day (1993) 4
1 357 Phantom, The (1996) 3
2 246 Blown Away (1994) 3
3 311 Casablanca (1942) 4
4 457 Immortal Beloved (1994) 4
5 241 Titanic (1997) 4
6 525 Independence Day (ID4) (1996) 4
7 394 Cape Fear (1991) 4
8 109 Dante's Peak (1997) 3
9 334 Wolf (1994) 2
learn = collab_learner(dls, n_factors=50, y_range=None)
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 1.257861 1.301678 00:13
1 1.071218 1.113060 00:12
2 0.989054 1.017373 00:11
3 0.856945 0.928325 00:12
4 0.848923 0.905493 00:12

I want to see the distribution of predictions for the training and validation set and understand how they vary. I’ll create a helper function for that.

def plot_preds(preds, title):
  # convert predictions (a 1-d tensor from learn.get_preds) into a pandas Series
  preds = pd.Series(preds)
  preds.hist();
  plt.title(f'{title} preds distribution')
  # summary statistics of the prediction distribution
  print('median:', preds.median())
  print('mean:', preds.mean())
  print('kurtosis: ', preds.kurtosis())
  print('skew: ', preds.skew())

  # Anderson-Darling test for normality: the larger the statistic relative to
  # the critical values, the stronger the evidence against normality
  result = anderson(preds, dist='norm')
  print(f'Statistic: {result.statistic}')
  print(f'Critical values: {result.critical_values}')
  print(f'Significance levels: {result.significance_level}')

  # fraction of predictions falling outside the valid 1-5 rating range
  cond = (preds < 1) | (preds > 5)
  print(f'% of preds outside of 1-5 range: {100*cond.sum()/cond.count():.2f}%')
preds, targ = learn.get_preds(dl=dls.valid)
# check loss---should be close to 0.905493
MSELossFlat()(preds, targ)
TensorBase(0.9055)
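As a small aside (my note, not from the textbook): the reported valid_loss is MSE, so its square root gives a more interpretable "typical error in stars":

# RMSE: on average, predictions are off by roughly 0.95 stars
MSELossFlat()(preds, targ).sqrt()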
plot_preds(learn.get_preds(dl=dls.valid)[0], 'valid')
median: 3.4890234
mean: 3.4260304
kurtosis:  0.7783028
skew:  -0.58709365
Statistic: 71.65442338831053
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.93%

The validation set predictions are slightly skewed left with a median rating of about 3.5. Based on the Anderson-Darling statistic (which is significantly larger than the most stringent critical value of 1.092), these 20k samples don’t come from a normal distribution. Less than 1% of the values fall outside of the expected rating range of 1 to 5.
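As a side note on how I'm reading that output: the Anderson-Darling test rejects normality at a given significance level whenever the statistic exceeds the corresponding critical value. Here's a small sketch of that comparison (the helper name is my own):

# compare an Anderson-Darling statistic to its critical values and report
# whether normality is rejected at each significance level
def reject_normality(result):
    for crit, sig in zip(result.critical_values, result.significance_level):
        verdict = 'reject' if result.statistic > crit else 'fail to reject'
        print(f'{sig}% level: statistic {result.statistic:.1f} vs critical value {crit} -> {verdict} normality')

# e.g. reject_normality(anderson(pd.Series(learn.get_preds(dl=dls.valid)[0]), dist='norm'))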

plot_preds(learn.get_preds(dl=dls.train)[0], 'train')
median: 3.4968839
mean: 3.435657
kurtosis:  0.41849822
skew:  -0.49159753
Statistic: 228.91494857503858
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.68%

The training set predictions are similarly distributed, with a slightly larger peak resulting in a slightly larger median rating, still around 3.5.

In general, a larger share of the validation predictions fall outside the realistic rating range (1 to 5) than the training predictions. That said, the model is doing pretty well at predicting values within the desired range, with less than 1% falling outside it.

Training with y_range=(0, 5.5)

learn2 = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn2.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.882406 0.942118 00:13
1 0.650510 0.887792 00:12
2 0.542655 0.862130 00:12
3 0.440741 0.848899 00:12
4 0.442999 0.842771 00:12

Using a y_range of 0 to 5.5 resulted in a ~7% lower loss.
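A quick sanity check on that percentage, using the final-epoch validation losses from the two runs above:

# relative improvement in final validation loss vs. the no-y_range baseline
baseline, with_y_range = 0.905493, 0.842771
print(f'{100 * (baseline - with_y_range) / baseline:.1f}% lower')  # ~6.9%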

plot_preds(learn2.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5321503
mean: 3.4844122
kurtosis:  -0.055667587
skew:  -0.3875332
Statistic: 42.351315721156425
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.23%

This distribution is still not normal, but its Anderson-Darling statistic is about half of what it was when y_range was None. The kurtosis is closer to 0 as well. The key point: only about a quarter as many values as before fall outside the 1-5 rating range (0.23% vs 0.93%).

plot_preds(learn2.get_preds(dl=dls.train)[0], 'train')
median: 3.5977917
mean: 3.4933543
kurtosis:  0.14653848
skew:  -0.6128638
Statistic: 490.3643317096139
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.31%

The training predictions are more skewed than the validation predictions.

Training with y_range=(0.5, 5.5)

ratings['rating'].min(), ratings['rating'].max()
(1, 5)

I can't find it anymore, but there was a fastai forums post where someone asked why the lower bound of y_range wasn't 0.5 (that is, 0.5 below the minimum rating of 1, mirroring how the upper bound of 5.5 is 0.5 above the maximum rating of 5). I'll see if training with y_range=(0.5, 5.5) improves the loss or changes the distribution of predictions.

learn3 = collab_learner(dls, n_factors=50, y_range=(0.5, 5.5))
learn3.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.841459 0.931699 00:14
1 0.652540 0.878996 00:12
2 0.530454 0.865976 00:12
3 0.448474 0.856127 00:13
4 0.423248 0.852660 00:12

That actually worsened the loss, increasing it by about 1%. I’ll look at the training and validation prediction distributions:

plot_preds(learn3.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5413134
mean: 3.5004866
kurtosis:  -0.102446005
skew:  -0.3400191
Statistic: 33.359148298073706
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.21%

The median and mean predictions are a bit higher, and about the same proportion of predictions fall outside the acceptable range. The distribution is similarly not normal, but has the lowest Anderson-Darling statistic so far.

plot_preds(learn3.get_preds(dl=dls.train)[0], 'train')
median: 3.6018043
mean: 3.5078757
kurtosis:  0.0025408994
skew:  -0.5326863
Statistic: 388.08379248825077
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.23%

The median and mean of the training predictions are also slightly larger, but the distribution is mostly the same as with y_range=(0, 5.5) (although the kurtosis is much closer to 0).

Training with y_range=(0.75, 5.25)

I’m curious if a “tighter” range changes the results.

learn4 = collab_learner(dls, n_factors=50, y_range=(0.75, 5.25))
learn4.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.891943 0.931708 00:12
1 0.676929 0.879818 00:13
2 0.531733 0.866186 00:12
3 0.459268 0.852890 00:13
4 0.454604 0.848512 00:12

This results in the second-best loss value thus far.

plot_preds(learn4.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5425978
mean: 3.5018144
kurtosis:  -0.113488525
skew:  -0.36151984
Statistic: 36.919869251567434
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.07%

plot_preds(learn4.get_preds(dl=dls.train)[0], 'train')
median: 3.5998974
mean: 3.503809
kurtosis:  0.0028086598
skew:  -0.5607914
Statistic: 444.8073482159525
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.05%

The training and validation predictions have the lowest share of predictions falling outside the acceptable range so far. This makes sense: with the output squashed into (0.75, 5.25), predictions can only dip slightly below 1 or above 5, and the raw activations have to be much more extreme to get there than with a y_range of (0.5, 5.5) or (0, 5.5).
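As a rough back-of-the-envelope illustration (reusing the sigmoid_range form from earlier; the helper name is my own), this computes how negative the raw activation has to be before the output drops below a rating of 1 for each lower bound:

import torch

# solve sigmoid(x) * (high - low) + low = 1 for x: the raw activation at which
# the squashed prediction falls to exactly 1
def activation_for_rating_1(low, high):
    p = (1 - low) / (high - low)  # sigmoid output required to land exactly on 1
    return torch.logit(torch.tensor(p)).item()

for low, high in [(0, 5.5), (0.5, 5.5), (0.75, 5.25)]:
    print((low, high), round(activation_for_rating_1(low, high), 2))
# (0, 5.5)     -1.50  -> easiest to dip below a rating of 1
# (0.5, 5.5)   -2.20
# (0.75, 5.25) -2.83  -> needs a much more negative activation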

Training with y_range=(1, 5)

Just to cover my bases, I'll train with a y_range that's not recommended: 1 to 5. With this range, the sigmoid can never output a rating of exactly 1 or 5.

learn5 = collab_learner(dls, n_factors=50, y_range=(1, 5))
learn5.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.890540 0.942143 00:13
1 0.675952 0.874900 00:12
2 0.560956 0.855053 00:14
3 0.500103 0.847492 00:17
4 0.492499 0.844006 00:14

Surprisingly, this supplanted y_range=(0.75, 5.25) as the second-best loss after 5 epochs. I wonder if that's because the overall range is narrower?

plot_preds(learn5.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5273356
mean: 3.489109
kurtosis:  -0.14329968
skew:  -0.37828833
Statistic: 42.07929809941925
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.00%

As expected, 0.00% of the ratings fall outside of the minimum of 1 and maximum of 5.

plot_preds(learn5.get_preds(dl=dls.train)[0], 'train')
median: 3.5868726
mean: 3.4960902
kurtosis:  -0.0628498
skew:  -0.55758834
Statistic: 467.5922112545086
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 0.00%

Training with y_range=(-2, 8)

As a last fun experiment, I’ll use a much-wider-than-needed y_range and see how that affects the loss as well as the prediction distributions.

learn6 = collab_learner(dls, n_factors=50, y_range=(-2, 8))
learn6.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.806267 0.923924 00:13
1 0.556011 0.928603 00:14
2 0.437159 0.907485 00:13
3 0.346756 0.900347 00:12
4 0.331412 0.895803 00:13

Interestingly, the training loss is significantly lower than in any of the other training runs. The validation loss, however, is about 6% higher than the lowest validation loss achieved so far. I'm curious to see how the distributions compare.
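To make the train/valid gap easier to compare across runs, here's a small sketch that tabulates the final-epoch losses copied from the training tables above (variable names are my own):

import pandas as pd

# final-epoch train/valid losses from the runs above; the gap column shows how
# much lower the training loss is than the validation loss for each y_range
runs = {
    'None':         (0.848923, 0.905493),
    '(0, 5.5)':     (0.442999, 0.842771),
    '(0.5, 5.5)':   (0.423248, 0.852660),
    '(0.75, 5.25)': (0.454604, 0.848512),
    '(1, 5)':       (0.492499, 0.844006),
    '(-2, 8)':      (0.331412, 0.895803),
}
df = pd.DataFrame(runs, index=['train_loss', 'valid_loss']).T
df['gap'] = df['valid_loss'] - df['train_loss']
print(df.round(3))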

plot_preds(learn6.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5484176
mean: 3.5100946
kurtosis:  0.11679816
skew:  -0.32186633
Statistic: 21.7676292314718
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 1.19%

About 6 times as many predictions fall outside of the acceptable range (~1.2% vs ~0.2%), which makes sense since the y_range is wider. The overall distribution is similar to the other validation predictions, although this one (still far from normal) has the lowest Anderson-Darling statistic of any run.

plot_preds(learn6.get_preds(dl=dls.train)[0], 'train')
median: 3.632931
mean: 3.5240762
kurtosis:  0.015062247
skew:  -0.50895566
Statistic: 350.41688774364593
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15.  10.   5.   2.5  1. ]
% of preds outside of 1-5 range: 1.85%

The training prediction distribution looks funkier than its validation counterpart (more than 10x the Anderson-Darling statistic), has a slightly larger median, and has almost 9 times as many values outside the acceptable range as the tighter y_range runs.

Final Thoughts

I ended up enjoying this experiment more than I expected to. It was helpful to see intuitive results being validated by the actual prediction distributions (for example, y_range=(1, 5) had 0 predictions outside of that range while y_range=(-2, 8) had the most).

There were some surprises along the way: a y_range of (-2, 8) had the lowest training loss (I'm not sure what to make of that), a y_range of (1, 5) resulted in the second-best validation loss (perhaps because there is a smaller range to predict within?), and although none of the distributions were normal, there were varying degrees of non-normality.

As part of the fastai Part 1 Lesson 7 homework, I’ll be training models on the full MovieLens dataset (~25M rows) so it’ll be fun to experiment with y_range values and see if I get different results.

I hope you enjoyed this blog post!