Training Collaborative Filtering Models on MovieLens 100k with Different y_range Values

How does the `y_range` parameter affect model performance and prediction distributions? I use the MovieLens 100k subset as the dataset.
Background
In Chapter 8 of the fastai text we train a collaborative filtering model that predicts movie ratings for users (who have not watched those movies yet). It's one way of asking the question: would this user like this movie, given their interests and the movie's characteristics? The users' "interests" and the movies' "characteristics" are the latent factors that we train. The ratings (predictions) are the dot product between the user and movie latent factors. This dot product passes through the `sigmoid_range` function, which squeezes the input values into output values within a given range. In the textbook, the range we use is 0 to 5.5. We use 5.5 because the sigmoid function never reaches 1, so 5 * sigmoid would never reach 5 (the maximum movie rating). Overshooting by 0.5 solves this issue: 5.5 times sigmoid can reach an output of 5 comfortably.
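To make the squeezing concrete, here is a minimal sketch of what `sigmoid_range` does (paraphrasing the idea rather than quoting the library verbatim):

```python
import torch

def sigmoid_range(x, low, high):
    "Scaled sigmoid whose output lies strictly between `low` and `high`."
    return torch.sigmoid(x) * (high - low) + low

# With low=0 and high=5.5, a large activation gets close to (but never exactly
# reaches) 5.5, so the maximum rating of 5 is comfortably reachable.
sigmoid_range(torch.tensor([-5., 0., 5.]), 0, 5.5)  # ≈ tensor([0.04, 2.75, 5.46])
```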
In this blog post I'll explore the question: how does model performance vary as `y_range` varies?
Here is a summary of my results, with different statistics (rows) for different `y_range` values (columns), on the validation set (20k samples):
Statistic | None | (0, 5.5) | (0.5, 5.5) | (0.75, 5.25) | (1,5) | (-2,8) |
---|---|---|---|---|---|---|
Median Prediction | 3.49 | 3.53 | 3.54 | 3.54 | 3.53 | 3.55 |
Mean Prediction | 3.43 | 3.48 | 3.5 | 3.5 | 3.49 | 3.51 |
Kurtosis | 0.78 | -0.06 | -0.1 | -0.11 | -0.14 | 0.12 |
Skew | -0.59 | -0.39 | -0.34 | -0.36 | -0.38 | -0.32 |
Anderson-Darling | 71.7 | 42.4 | 33.4 | 37 | 42 | 21.8 |
% preds outside 1-5 | 0.93% | 0.23% | 0.21% | 0.07% | 0.00% | 1.19% |
And for the training set (80k samples):
Statistic | None | (0, 5.5) | (0.5, 5.5) | (0.75, 5.25) | (1,5) | (-2,8) |
---|---|---|---|---|---|---|
Median Prediction | 3.50 | 3.60 | 3.60 | 3.60 | 3.59 | 3.63 |
Mean Prediction | 3.44 | 3.49 | 3.50 | 3.50 | 3.50 | 3.52 |
Kurtosis | 0.42 | 0.15 | 0.003 | 0.003 | -0.06 | 0.2 |
Skew | -0.49 | -0.61 | -0.53 | -0.56 | -0.56 | -0.5 |
Anderson-Darling | 228.9 | 490.4 | 388.1 | 444.8 | 467.6 | 350.4 |
% preds outside 1-5 | 0.68% | 0.31% | 0.23% | 0.05% | 0.00% | 1.85% |
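A minimal sketch (my own, not from the original notebook) of how a summary table like the ones above could be assembled programmatically, assuming the learners trained later in the post (`learn`, `learn2`, and so on):

```python
import pandas as pd
from scipy.stats import anderson

def summarize(preds):
    "Compute the summary statistics used in the tables above for one set of predictions."
    s = pd.Series(preds.numpy().squeeze())
    return {
        'Median Prediction': s.median(),
        'Mean Prediction': s.mean(),
        'Kurtosis': s.kurtosis(),
        'Skew': s.skew(),
        'Anderson-Darling': anderson(s, dist='norm').statistic,
        '% preds outside 1-5': 100 * ((s < 1) | (s > 5)).mean(),
    }

# hypothetical labels; extend with learn3..learn6 once they are trained
runs = {'None': learn, '(0, 5.5)': learn2}
pd.DataFrame({name: summarize(l.get_preds(dl=dls.valid)[0]) for name, l in runs.items()})
```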
Training without y_range
I think it makes sense to first explore the loss and output distribution when I don't set `y_range` while training a collaborative filtering model on the 100k subset of MovieLens. I'll reuse the code from the text to prepare the data and `DataLoaders`, and use a weight decay of 0.1:
from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies)
ratings.head()
 | user | movie | rating | timestamp | title |
---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
 | user | title | rating |
---|---|---|---|
0 | 815 | Groundhog Day (1993) | 4 |
1 | 357 | Phantom, The (1996) | 3 |
2 | 246 | Blown Away (1994) | 3 |
3 | 311 | Casablanca (1942) | 4 |
4 | 457 | Immortal Beloved (1994) | 4 |
5 | 241 | Titanic (1997) | 4 |
6 | 525 | Independence Day (ID4) (1996) | 4 |
7 | 394 | Cape Fear (1991) | 4 |
8 | 109 | Dante's Peak (1997) | 3 |
9 | 334 | Wolf (1994) | 2 |
learn = collab_learner(dls, n_factors=50, y_range=None)
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.257861 | 1.301678 | 00:13 |
1 | 1.071218 | 1.113060 | 00:12 |
2 | 0.989054 | 1.017373 | 00:11 |
3 | 0.856945 | 0.928325 | 00:12 |
4 | 0.848923 | 0.905493 | 00:12 |
I want to see the distribution of predictions for the training and validation set and understand how they vary. I’ll create a helper function for that.
from scipy.stats import anderson

def plot_preds(preds, title):
    preds = pd.Series(preds)
    preds.hist();
    plt.title(f'{title} preds distribution')
    print('median:', preds.median())
    print('mean:', preds.mean())
    print('kurtosis: ', preds.kurtosis())
    print('skew: ', preds.skew())
    result = anderson(preds, dist='norm')
    print(f'Statistic: {result.statistic}')
    print(f'Critical values: {result.critical_values}')
    print(f'Significance levels: {result.significance_level}')
    cond = (preds < 1) | (preds > 5)
    print(f'% of preds outside of 1-5 range: {100*cond.sum()/cond.count():.2f}%')
preds, targ = learn.get_preds(dl=dls.valid)
# check loss---should be close to 0.905493
MSELossFlat()(preds, targ)

TensorBase(0.9055)

plot_preds(learn.get_preds(dl=dls.valid)[0], 'valid')
median: 3.4890234
mean: 3.4260304
kurtosis: 0.7783028
skew: -0.58709365
Statistic: 71.65442338831053
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.93%
The validation set predictions are slightly skewed left with a median rating of about 3.5. Based on the Anderson-Darling statistic (which is significantly larger than the most stringent critical value of 1.092), these 20k samples don’t come from a normal distribution. Less than 1% of the values fall outside of the expected rating range of 1 to 5.
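As a side note on reading that output: scipy's `anderson` test rejects normality at a given significance level whenever the statistic exceeds the corresponding critical value. A small snippet (my own, not from the original notebook) that spells this out:

```python
from scipy.stats import anderson

# compare the Anderson-Darling statistic to each critical value at its significance level
result = anderson(learn.get_preds(dl=dls.valid)[0].numpy().squeeze(), dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = 'reject normality' if result.statistic > crit else 'cannot reject'
    print(f'{sig:>4}% level: statistic {result.statistic:.1f} vs critical value {crit} -> {verdict}')
```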
plot_preds(learn.get_preds(dl=dls.train)[0], 'train')
median: 3.4968839
mean: 3.435657
kurtosis: 0.41849822
skew: -0.49159753
Statistic: 228.91494857503858
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.68%
The training set predictions are similarly distributed, with a slightly larger peak resulting in a slightly larger median rating, still around 3.5.
In general, there are more values outside of the realistic rating range (1 to 5) in the validation predictions than in the training predictions. Still, the model does a good job of predicting values within the desired range, with less than 1% falling outside it.
Training with y_range=(0, 5.5)
learn2 = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn2.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.882406 | 0.942118 | 00:13 |
1 | 0.650510 | 0.887792 | 00:12 |
2 | 0.542655 | 0.862130 | 00:12 |
3 | 0.440741 | 0.848899 | 00:12 |
4 | 0.442999 | 0.842771 | 00:12 |
Using a `y_range` of 0 to 5.5 resulted in a ~7% lower validation loss.
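As a quick sanity check of that figure, using the final validation losses from the two training tables above:

```python
# relative improvement of the (0, 5.5) run's validation loss over the no-y_range baseline
baseline, with_range = 0.905493, 0.842771
print(f'{100 * (baseline - with_range) / baseline:.1f}% lower')  # ~6.9%
```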
plot_preds(learn2.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5321503
mean: 3.4844122
kurtosis: -0.055667587
skew: -0.3875332
Statistic: 42.351315721156425
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.23%
This distribution is still not normal, but its Anderson-Darling statistic (42.4) is much lower than when `y_range` was `None` (71.7), and the kurtosis is closer to 0 as well. The key point is that only about a quarter as many predictions as before fall outside of the 1-5 rating range.
plot_preds(learn2.get_preds(dl=dls.train)[0], 'train')
median: 3.5977917
mean: 3.4933543
kurtosis: 0.14653848
skew: -0.6128638
Statistic: 490.3643317096139
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.31%
The training predictions are more skewed than the validation predictions.
Training with y_range=(0.5, 5.5)
ratings['rating'].min(), ratings['rating'].max()

(1, 5)
I can't find it anymore, but there was a fastai forums post where someone questioned why the lower bound in `y_range` wasn't 0.5 (0.5 less than the minimum rating of 1, mirroring how the upper bound of 5.5 is 0.5 more than the maximum rating of 5). I'll see if training with `y_range=(0.5, 5.5)` improves the loss or changes the distribution of predictions.
learn3 = collab_learner(dls, n_factors=50, y_range=(0.5, 5.5))
learn3.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.841459 | 0.931699 | 00:14 |
1 | 0.652540 | 0.878996 | 00:12 |
2 | 0.530454 | 0.865976 | 00:12 |
3 | 0.448474 | 0.856127 | 00:13 |
4 | 0.423248 | 0.852660 | 00:12 |
That actually worsened the loss, increasing it by about 1%. I’ll look at the training and validation prediction distributions:
plot_preds(learn3.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5413134
mean: 3.5004866
kurtosis: -0.102446005
skew: -0.3400191
Statistic: 33.359148298073706
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.21%
The median and mean ratings are a bit higher, and about the same proportion of predictions falls outside the acceptable range. The distribution is similarly non-normal but has the lowest Anderson-Darling statistic so far.
plot_preds(learn3.get_preds(dl=dls.train)[0], 'train')
median: 3.6018043
mean: 3.5078757
kurtosis: 0.0025408994
skew: -0.5326863
Statistic: 388.08379248825077
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.23%
The median and mean of the training predictions are also a tiny bit larger, but the distribution is mostly the same as with `y_range=(0, 5.5)` (although the kurtosis is much smaller).
Training with y_range=(0.75, 5.25)
I’m curious if a “tighter” range changes the results.
learn4 = collab_learner(dls, n_factors=50, y_range=(0.75, 5.25))
learn4.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.891943 | 0.931708 | 00:12 |
1 | 0.676929 | 0.879818 | 00:13 |
2 | 0.531733 | 0.866186 | 00:12 |
3 | 0.459268 | 0.852890 | 00:13 |
4 | 0.454604 | 0.848512 | 00:12 |
This results in the second-best loss value thus far.
plot_preds(learn4.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5425978
mean: 3.5018144
kurtosis: -0.113488525
skew: -0.36151984
Statistic: 36.919869251567434
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.07%
plot_preds(learn4.get_preds(dl=dls.train)[0], 'train')
median: 3.5998974
mean: 3.503809
kurtosis: 0.0028086598
skew: -0.5607914
Statistic: 444.8073482159525
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.05%
The training and validation predictions have the lowest share of predictions falling outside the acceptable range. This makes sense: with this tighter range, the scaled sigmoid can stray at most 0.25 below 1 or above 5, less than with a `y_range` of (0.5, 5.5) or (0, 5.5).
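A quick illustrative sketch (mine, not from the original notebook) of how far outside the 1-5 rating range the scaled sigmoid can possibly stray for each `y_range`:

```python
# sigmoid_range(x, lo, hi) is bounded by (lo, hi), so the worst-case overshoot
# outside the 1-5 rating range is determined by the bounds alone.
for lo, hi in [(0, 5.5), (0.5, 5.5), (0.75, 5.25), (1, 5), (-2, 8)]:
    below = max(0, 1 - lo)   # how far below 1 a prediction could go
    above = max(0, hi - 5)   # how far above 5 a prediction could go
    print(f'y_range=({lo}, {hi}): up to {below} below 1, up to {above} above 5')
```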
Training with y_range=(1, 5)
Just to cover my bases, I'll train with a `y_range` that isn't recommended: 1 to 5. With this range, the scaled sigmoid will never output ratings of exactly 1 or 5.
learn5 = collab_learner(dls, n_factors=50, y_range=(1, 5))
learn5.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.890540 | 0.942143 | 00:13 |
1 | 0.675952 | 0.874900 | 00:12 |
2 | 0.560956 | 0.855053 | 00:14 |
3 | 0.500103 | 0.847492 | 00:17 |
4 | 0.492499 | 0.844006 | 00:14 |
Surprisingly, this supplanted `y_range=(0.75, 5.25)` as the run with the second-best loss after 5 epochs. I wonder if that is because the overall range is narrower?
plot_preds(learn5.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5273356
mean: 3.489109
kurtosis: -0.14329968
skew: -0.37828833
Statistic: 42.07929809941925
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.00%
As expected, 0.00% of the ratings fall outside of the minimum of 1 and maximum of 5.
plot_preds(learn5.get_preds(dl=dls.train)[0], 'train')
median: 3.5868726
mean: 3.4960902
kurtosis: -0.0628498
skew: -0.55758834
Statistic: 467.5922112545086
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 0.00%
Training with y_range=(-2, 8)
As a last fun experiment, I'll use a much-wider-than-needed `y_range` and see how that affects the loss as well as the prediction distributions.
learn6 = collab_learner(dls, n_factors=50, y_range=(-2, 8))
learn6.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.806267 | 0.923924 | 00:13 |
1 | 0.556011 | 0.928603 | 00:14 |
2 | 0.437159 | 0.907485 | 00:13 |
3 | 0.346756 | 0.900347 | 00:12 |
4 | 0.331412 | 0.895803 | 00:13 |
Interestingly, the training loss is significantly lower than in any of the other training runs. The validation loss is about 6% higher than the lowest validation loss achieved so far. I'm curious to see how the distributions compare.
plot_preds(learn6.get_preds(dl=dls.valid)[0], 'valid')
median: 3.5484176
mean: 3.5100946
kurtosis: 0.11679816
skew: -0.32186633
Statistic: 21.7676292314718
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 1.19%
About five to six times as many predictions fall outside of the acceptable range (~1.2% vs. ~0.2%), which makes sense since the `y_range` is wider. The overall distribution is similar to the other validation predictions, although this distribution (still far from normal) has the lowest Anderson-Darling statistic.
plot_preds(learn6.get_preds(dl=dls.train)[0], 'train')
median: 3.632931
mean: 3.5240762
kurtosis: 0.015062247
skew: -0.50895566
Statistic: 350.41688774364593
Critical values: [0.576 0.656 0.787 0.918 1.092]
Significance levels: [15. 10. 5. 2.5 1. ]
% of preds outside of 1-5 range: 1.85%
The training prediction distribution looks funkier than the validation one (more than 10x the Anderson-Darling statistic), has a slightly larger median, and has the largest share of values falling outside the acceptable range of any run (1.85%, roughly 9 times the share of most of the other configurations).
Final Thoughts
I ended up enjoying this experiment more than I expected to. It was helpful to see intuitive results validated by looking at the actual prediction distributions (for example, `y_range=(1, 5)` had zero predictions outside of that range, while `y_range=(-2, 8)` had the most).

There were some surprises along the way: a `y_range` of (-2, 8) had the lowest training loss (I'm not sure what to make of that), a `y_range` of (1, 5) resulted in the second-best validation loss (perhaps because there is a smaller range to predict within?), and although none of the distributions were normal, there were varying degrees of non-normality.

As part of the fastai Part 1 Lesson 7 homework, I'll be training models on the full MovieLens dataset (~25M rows), so it'll be fun to experiment with `y_range` values and see if I get different results.
I hope you enjoyed this blog post!