Training a Collaborative Filtering Model Using Cross Entropy Loss

Background

In this notebook, I’ll work through the following prompt given in the “Further Research” section of Chapter 8 (Collaborative Filtering):

> Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter.

```python
from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)
```
Visual Inspection
I’ll start by visually inspecting the DotProductBias model from the chapter, which outputs one prediction, and the DotProductBiasCE model that I’ve written to output 5 predictions (so that it works with cross-entropy loss).
Creating DataLoaders
Since I want to use cross-entropy loss, I’ll need to specify that the outputs (or targets) are discrete categories and not continuous numbers. To do this, I’ll use TabularDataLoaders.
```python
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies)
ratings.head()
```
| | user | movie | rating | timestamp | title |
---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
```python
dls = TabularDataLoaders.from_df(
    ratings[['user', 'title', 'rating']],
    procs=[Categorify],
    cat_names=['user','title'],
    y_names=['rating'],
    y_block=CategoryBlock)
```
TabularDataLoaders will provide three elements for each training item: the categorical inputs, the continuous inputs, and the outputs (or targets, or dependent variable):
```python
b = dls.one_batch()
len(b), b[0].shape, b[1].shape, b[2].shape
```

    (3, torch.Size([64, 2]), torch.Size([64, 0]), torch.Size([64, 1]))
This is important to note before I create the model, since the forward pass will receive two values: x_cat (categorical inputs) and x_cont (continuous inputs).

The TabularDataLoaders also has a vocabulary: the five possible values for the dependent variable (1 through 5).
```python
dls.vocab
```

    [1, 2, 3, 4, 5]
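One subtlety worth noting (this check is my own addition, not from the book): if I understand fastai’s encoding correctly, the targets in each batch are stored as indices into this vocab (0 through 4), not as the raw ratings (1 through 5). A quick sketch to confirm, reusing the batch b from above:

```python
# Targets are class indices into dls.vocab, not raw ratings.
targs = b[2].squeeze()                            # encoded targets, values in 0-4
decoded = [dls.vocab[int(i)] for i in targs[:5]]  # map indices back to ratings 1-5
targs[:5], decoded
```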
Creating the New Model
I’ll start by creating the original DotProductBias model for reference:
```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)    # latent factors per user
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)  # latent factors per movie
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)  # dot product per row
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)         # squash into y_range
```
The biggest change in the model’s behavior, needed to allow the use of cross-entropy loss, is to make it output 5 activations instead of 1. I’ll do so by passing the dot product through an nn.Linear layer in the forward pass that projects that single value into 5 dimensions (one for each rating, 1 through 5).
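As a standalone illustration of that projection (the input values here are made up for the example): nn.Linear(1, 5) maps each scalar score to 5 activations, one per rating class.

```python
import torch
import torch.nn as nn

proj = nn.Linear(1, 5)                 # scalar in, 5 activations out
scores = torch.tensor([[3.2], [1.7]])  # two hypothetical dot-product scores
proj(scores).shape                     # torch.Size([2, 5])
```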
The second change in the model’s behavior is to accept two inputs in the forward pass: x_cat for categorical variables and x_cont for continuous variables, which is how TabularDataLoaders prepares the data. In the case of the MovieLens 100k subset there are no continuous variables; the only variables of interest are the categoricals users and movies, one column for each in the input x_cat.
```python
class DotProductBiasCE(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        self.linear = nn.Linear(1, 5)  # project the scalar score to 5 activations

    def forward(self, x_cat, x_cont):
        x = x_cat  # no continuous variables in this dataset; x_cont is unused
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        res = sigmoid_range(res, *self.y_range)
        return self.linear(res)
```
```python
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])

n_users, n_movies
```

    (944, 1665)

```python
model = DotProductBiasCE(n_users, n_movies, 50)
model(x_cat=b[0], x_cont=b[1]).shape
```

    torch.Size([64, 5])
Training the Model
I’ll use the same hyperparameters (5 epochs, a learning rate of 5e-3, and a weight decay of 0.1) as the best training run in the text. Of course, this is a different model, so these values may not be optimal.
```python
model = DotProductBiasCE(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

learn.fit_one_cycle(5, 5e-3, wd=0.1)
```
| epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
| 0 | 1.426626 | 1.445451 | 0.337150 | 00:15 |
| 1 | 1.281385 | 1.425528 | 0.339900 | 00:15 |
| 2 | 1.171326 | 1.431534 | 0.364500 | 00:16 |
| 3 | 1.105676 | 1.438475 | 0.361650 | 00:16 |
| 4 | 1.127306 | 1.438119 | 0.363450 | 00:14 |
The model’s not great (although it is better than guessing ratings uniformly at random, which would yield an accuracy of 20%), and it’s difficult to compare it with the DotProductBias model, since that model measured RMSE rather than accuracy.
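As an aside (my own addition, not from the chapter), an even tougher naive baseline is to always predict the most common rating; that constant predictor’s accuracy is simply the most common rating’s share of the data:

```python
# Accuracy of a constant "predict the most common rating" baseline.
ratings['rating'].value_counts(normalize=True).max()
```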
I’ll take a look at the predictions and see how they compare to the actual ratings.
Looking at the confusion matrix below, here are some observations:

- The model did not predict any 5s.
- The best predicted rating was a 4 (with 4327/6692, or 65%, correct predictions).
- Most of the model’s predictions are 3s or 4s.
```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
```
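For a quick numeric view of the same thing (this snippet is my own, not from the chapter), the per-class prediction counts can be tallied directly from the predicted probabilities:

```python
import torch  # already available via the fastai star imports

preds, _ = learn.get_preds(dl=dls.valid)
pred_idx = preds.argmax(dim=1)         # predicted class index per validation row
torch.bincount(pred_idx, minlength=5)  # prediction counts for ratings 1 through 5
```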
Improving the Model
Looking at these results, I’m starting to think that using sigmoid_range in this model is causing it to predict values in the middle of the range (2-4) and making it harder for it to predict ratings at the edges (1 and 5). I’ll remove y_range and sigmoid_range from the model and train it again to see if it makes a difference.
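To illustrate that intuition with a standalone sketch (the input values are made up): sigmoid_range squashes scores toward the interior of the range, so only very large raw scores land anywhere near the endpoints.

```python
import torch
from fastai.layers import sigmoid_range

xs = torch.tensor([-4., -2., 0., 2., 4.])
sigmoid_range(xs, 0, 5.5)  # endpoints 0 and 5.5 are only approached asymptotically
```

With that in mind, here’s the model without the squashing: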
```python
class DotProductBiasCE(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.linear = nn.Linear(1, 5)

    def forward(self, x_cat, x_cont):
        x = x_cat
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return self.linear(res)  # no sigmoid_range: raw score goes straight to the 5-way projection
```

```python
model = DotProductBiasCE(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

learn.fit_one_cycle(5, 5e-3, wd=0.1)
```
| epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
| 0 | 1.392305 | 1.434367 | 0.353000 | 00:14 |
| 1 | 1.238957 | 1.465206 | 0.349450 | 00:15 |
| 2 | 1.122690 | 1.507049 | 0.354700 | 00:15 |
| 3 | 1.053334 | 1.523502 | 0.361450 | 00:14 |
| 4 | 1.038323 | 1.527965 | 0.364200 | 00:15 |
The resulting accuracy is about the same as before. Let’s look at the confusion matrix:
```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
```
The model is now predicting 5s, although now it’s not predicting any 1s or 2s! I’ll see if there’s a better learning rate for this architecture:
```python
model = DotProductBiasCE(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

learn.lr_find()
```

    SuggestedLRs(valley=0.005248074419796467)
I’ll try a learning rate of 0.1, which is 20x larger than 5e-3.
```python
model = DotProductBiasCE(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

learn.fit_one_cycle(5, 0.1, wd=0.1)
```
| epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
| 0 | 1.402151 | 1.460189 | 0.337900 | 00:14 |
| 1 | 1.410952 | 1.438032 | 0.348000 | 00:13 |
| 2 | 1.353970 | 1.407190 | 0.370100 | 00:13 |
| 3 | 1.232812 | 1.351564 | 0.401050 | 00:13 |
| 4 | 1.070482 | 1.324203 | 0.415400 | 00:14 |
The higher learning rate improved the accuracy by about 5%. Looking at the confusion matrix, here are some observations: the model is predicting 1s and 2s better than before, and 4 is still the best predicted rating (62% of all actual 4s are predicted as 4s). However, the model is still predominantly predicting 3s and 4s.
```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
```
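To put numbers on “best predicted rating” (my own addition; ClassificationInterpretation exposes the raw counts via confusion_matrix()), per-rating recall can be read off the matrix’s diagonal:

```python
# Per-rating recall: fraction of each actual rating that was predicted correctly.
cm = interp.confusion_matrix()
recall = cm.diagonal() / cm.sum(axis=1)
for rating, r in zip(dls.vocab, recall):
    print(f"rating {rating}: {r:.0%}")
```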
I’ll also check that fastai is automatically applying softmax so that the final activations add up to 1.00:
```python
probs, _ = learn.get_preds(dl=dls.valid)
probs.sum(dim=1)
```

    tensor([1.0000, 1.0000, 1.0000, ..., 1.0000, 1.0000, 1.0000])

```python
probs.sum(dim=1).sum()  # should equal 20k, the number of validation rows
```

    tensor(20000.)
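Where does that softmax come from? My understanding is that get_preds applies the activation associated with the loss function, which for CrossEntropyLossFlat is softmax. A hand-rolled check of the same idea, reusing the batch b from earlier:

```python
import torch.nn.functional as F

logits = model(b[0], b[1])               # raw 5-way activations from the model
manual_probs = F.softmax(logits, dim=1)  # normalize each row to sum to 1
manual_probs.sum(dim=1)[:3]              # tensor([1., 1., 1.])
```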
Final Thoughts
I’ll recap this exercise by displaying the visual comparison between DotProductBias and the final DotProductBiasCE (without y_range and sigmoid_range).
My main takeaway from this exercise is that what works for one architecture may not necessarily work for another. In this example, when the dot product was passed through the sigmoid function before the linear layer, the model did not predict any 5s, even though there were many actual ratings of 5 in the dataset.
Another takeaway is that I wasn’t able to compare two models that used different metrics (RMSE vs. accuracy). So I’m limited in my ability to say which model performed “better”. I asked Claude for ideas on how to compare these two models and it came up with the following:
- Convert RMSE to accuracy (sketched after this list):
  - Round the continuous predictions to the nearest integer
  - Calculate accuracy using the rounded predictions
  - Compare this accuracy to the categorical model’s
- Convert accuracy to an RMSE-like metric:
  - Calculate the average error for the categorical predictions
  - Compare this to the continuous model’s RMSE
- Use normalized metrics:
  - Normalize RMSE: RMSE / (max_rating - min_rating)
  - Normalize accuracy: (accuracy - random_guess_accuracy) / (1 - random_guess_accuracy)
  - Compare the normalized values
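Here’s a minimal sketch of the first idea. Note that learn_reg is a hypothetical Learner wrapping the regression-style DotProductBias model; I haven’t trained one in this notebook:

```python
# Hypothetical: convert a regression model's continuous predictions
# to an accuracy by snapping them to the nearest valid rating.
preds, targs = learn_reg.get_preds()  # learn_reg is assumed, not defined here
rounded = preds.round().clamp(1, 5)   # nearest integer rating in 1-5
(rounded == targs).float().mean()     # fraction of exact matches
```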
I’ll poke around online (I’ve also asked about this on Twitter and the fastai forums) to see if there are thoughts on or examples of such comparisons, and then follow up with some exploration in a future blog post.
I hope you enjoyed this exercise! Follow me on Twitter @vishal_learner.