Training Models on the MovieLens 25M Dataset

Categories: deep learning, machine learning, fastai, python
In this notebook I train models using 5 different architectures on the 25 million rating MovieLens dataset and compare performance and results.
Author

Vishal Bakshi

Published

July 13, 2024

Background

In this notebook I’ll work through the following prompt from the “Further Research” section of Chapter 8 (Collaborative Filtering) from the fastai textbook:

Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book’s website and the fast.ai forums for ideas. Note that there are more columns in the full dataset–see if you can use those too (the next chapter might give you ideas).

Here’s a summary of my results in this notebook:

| Arch | Metric | Metric Value |
|---|---|---|
| DotProductBias | MSE | 0.654875 |
| DotProductBiasCE | Accuracy | 35% |
| Random Forest (baseline) | Accuracy | 29% |
| Random Forest (additional columns) | Accuracy | 30% |
| Neural Net | Accuracy | 38% |

Load the Data

The data is formatted slightly differently from the 100k subset (the main difference is that the columns are labeled differently).

from fastai.collab import *
from fastai.tabular.all import *
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
path = Path('/content/drive/MyDrive/movielens25m')
path.ls()
(#12) [Path('/content/drive/MyDrive/movielens25m/tags.csv'),Path('/content/drive/MyDrive/movielens25m/ratings.csv'),Path('/content/drive/MyDrive/movielens25m/movies.csv'),Path('/content/drive/MyDrive/movielens25m/genome-tags.csv'),Path('/content/drive/MyDrive/movielens25m/genome-scores.csv'),Path('/content/drive/MyDrive/movielens25m/links.csv'),Path('/content/drive/MyDrive/movielens25m/README.txt'),Path('/content/drive/MyDrive/movielens25m/rf_baseline_vars.pkl'),Path('/content/drive/MyDrive/movielens25m/rf_additional_vars.pkl'),Path('/content/drive/MyDrive/movielens25m/to_nn.pkl')...]
ratings = pd.read_csv(path/'ratings.csv')
ratings.head()
userId movieId rating timestamp
0 1 296 5.0 1147880044
1 1 306 3.5 1147868817
2 1 307 5.0 1147868828
3 1 665 5.0 1147878820
4 1 899 3.5 1147868510
ratings['movieId'].unique().shape, ratings['userId'].unique().shape
((59047,), (162541,))
movies = pd.read_csv(path/'movies.csv')
movies.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
movies['movieId'].unique().shape
(62423,)
ratings = ratings.merge(movies[['movieId', 'title']])
ratings.head()
userId movieId rating timestamp title
0 1 296 5.0 1147880044 Pulp Fiction (1994)
1 3 296 5.0 1439474476 Pulp Fiction (1994)
2 4 296 4.0 1573938898 Pulp Fiction (1994)
3 5 296 4.0 830786155 Pulp Fiction (1994)
4 7 296 4.0 835444730 Pulp Fiction (1994)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=1024)
dls.show_batch()
userId title rating
0 18382 Goldfinger (1964) 3.0
1 47473 Eyes Wide Shut (1999) 0.5
2 132661 Garden State (2004) 3.0
3 68944 X-Men Origins: Wolverine (2009) 0.5
4 126422 Animal Kingdom (2010) 3.5
5 122810 Hotel Rwanda (2004) 3.5
6 8458 Sherlock Holmes (2009) 4.0
7 21172 Indiana Jones and the Temple of Doom (1984) 4.0
8 94712 Dark Knight, The (2008) 3.5
9 88335 Chicken Run (2000) 2.0
dls.classes.keys()
dict_keys(['userId', 'title'])
n_users = len(dls.classes['userId'])
n_movies = len(dls.classes['title'])

n_users, n_movies
(162542, 58959)

Training Using Different Approaches

DotProductBias with Embeddings

The first architecture I’ll use is the DotProductBias with Embeddings:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])

        return sigmoid_range(res, *self.y_range)
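
The y_range=(0, 5.5) works together with sigmoid_range, which squashes the raw dot-product score into that interval (the textbook uses a ceiling a bit above 5 so the model can actually output a full 5.0). For reference, here’s a minimal equivalent of fastai’s sigmoid_range helper:

import torch

# minimal equivalent of fastai's sigmoid_range: map x into the interval (lo, hi)
def sigmoid_range(x, lo, hi):
    return torch.sigmoid(x) * (hi - lo) + lo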

I’ll use the same number of epochs, learning rate and weight decay as the textbook training example:

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.718568 0.750686 1:31:29
1 0.721771 0.743393 1:50:58
2 0.675583 0.713021 1:51:44
3 0.627697 0.671975 1:48:30
4 0.608647 0.654875 1:41:17

When using the 100k subset the lowest validation MSE I got was about 0.836. A validation MSE of 0.654875 is about a 22% reduction.
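
For reference, that reduction works out to:

(0.836 - 0.654875) / 0.836  # ≈ 0.22, i.e. about a 22% reduction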

After rounding the predictions to the nearest 0.5, the model has a validation accuracy of about 29%. Yikes! That’s terrible.

preds, targs = learn.get_preds(dl=dls.valid)
preds
tensor([[3.1825],
        [3.1959],
        [3.6061],
        ...,
        [1.6408],
        [3.3054],
        [3.4723]])
rounded_preds = (preds / 0.5).round() * 0.5
rounded_preds
tensor([[3.0000],
        [3.0000],
        [3.5000],
        ...,
        [1.5000],
        [3.5000],
        [3.5000]])
targs
tensor([[3.0000],
        [5.0000],
        [3.5000],
        ...,
        [1.0000],
        [2.0000],
        [3.5000]])
(rounded_preds == targs).float().mean()
tensor(0.2931)

If I round to the nearest integer, the validation accuracy increases to about 36%. Still not great.

(preds.round(decimals=0) == targs).float().mean()
tensor(0.3581)

Plotting predictions versus the targets shows the weak relationship between the two:

def plot_preds_v_targs(preds, targs):
  plt.figure(figsize=(10, 6))
  plt.scatter(targs.detach().numpy().squeeze(), preds.detach().numpy().squeeze(), alpha=0.5)
  plt.xlabel('Targets')
  plt.ylabel('Predictions')
  plt.title('Predictions vs Targets')
  plt.show()
plot_preds_v_targs(preds, targs)

Here’s the distribution of the ratings targets for the ~5M validation records:

plt.hist(targs.detach().numpy().squeeze());

There are considerably fewer predictions less than 3 and greater than 4:

plt.hist(preds.detach().numpy().squeeze());

Let’s hope for better luck with other architectures!

DotProductBiasCE (for Cross Entropy Loss)

I’ll use the same architecture that I created for another Further Research prompt, with the slight modification that instead of projecting the dot product onto 5 ratings I’ll project it onto 10 ratings (as there are ten 0.5-increment ratings in the dataset: 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0).

class DotProductBiasCE(Module):
  def __init__(self, n_users, n_movies, n_factors):
    self.user_factors = Embedding(n_users, n_factors)
    self.user_bias = Embedding(n_users, 1)
    self.movie_factors = Embedding(n_movies, n_factors)
    self.movie_bias = Embedding(n_movies, 1)
    self.linear = nn.Linear(1, 10)

  def forward(self, x_cat, x_cont):
    x = x_cat
    users = self.user_factors(x[:,0])
    movies = self.movie_factors(x[:,1])
    res = (users * movies).sum(dim=1, keepdim=True)
    res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
    return self.linear(res)

I’ll use the same training setup as I did with the 100k subset, but with a larger batch size (otherwise training takes much longer). Note that using the same learning rate for a batch size of 1024 as for a batch size of 64 will likely not result in optimal training.
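
If I wanted to compensate for the larger batch size, one common heuristic (which I’m noting here but did not apply) is to scale the learning rate roughly linearly with the batch size, or to simply re-run learn.lr_find() after building the Learner. A quick sketch with purely illustrative numbers:

# illustrative only: linear-scaling heuristic for adjusting a learning rate when
# the batch size changes (the numbers below are hypothetical, not the 100k run's)
def scale_lr(base_lr, base_bs, new_bs):
    return base_lr * new_bs / base_bs

scale_lr(5e-3, 64, 1024)  # 0.08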

dls = TabularDataLoaders.from_df(
    ratings[['userId', 'title', 'rating']],
    procs=[Categorify],
    cat_names=['userId','title'],
    y_names=['rating'],
    y_block=CategoryBlock,
    bs=1024)
b = dls.one_batch()
len(b), b[0].shape, b[1].shape, b[2].shape
(3, torch.Size([1024, 2]), torch.Size([1024, 0]), torch.Size([1024, 1]))
dls.vocab
[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
dls.show_batch()
userId title rating
0 64415 Jumpin' Jack Flash (1986) 2.0
1 10508 Lord of the Rings: The Return of the King, The (2003) 4.5
2 126649 Frances Ha (2012) 4.0
3 119566 Elizabeth (1998) 3.0
4 77160 Snake Eyes (1998) 5.0
5 99259 Untouchables, The (1987) 3.5
6 3726 Myth of Fingerprints, The (1997) 2.0
7 100959 Meet the Parents (2000) 3.5
8 134993 Nightmare on Elm Street, A (1984) 1.0
9 117798 Doubt (2008) 4.0
n_users = len(dls.classes['userId'])
n_movies = len(dls.classes['title'])

n_users, n_movies
(162542, 58959)

Training with Cross Entropy Loss on the 25M dataset resulted in a model with about 35% validation accuracy, roughly 6 percentage points lower than the 41% achieved on the 100k subset. The model is not showing signs of overfitting, so I could have trained it for more epochs and potentially gained more accuracy.

model = DotProductBiasCE(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 0.1, wd=0.1)
epoch train_loss valid_loss accuracy time
0 1.919933 1.924016 0.288326 1:04:57
1 1.914961 1.927970 0.284413 1:33:05
2 1.900328 1.901067 0.294077 1:19:56
3 1.837524 1.847121 0.313432 1:40:26
4 1.704779 1.740781 0.354360 1:18:08

The ratings this model predicted most accurately were 3.0 and 4.0; it performed quite badly on the rest and, in particular, never predicted any of the half-star ratings (0.5, 1.5, 2.5, 3.5, or 4.5).

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6))

preds, targs = learn.get_preds(dl=learn.dls.valid)

By far the most common predicted rating was 4.0 (index 7 in the vocab). Again, note the gaps between the bars where the half-star ratings are absent from the model’s predictions.

plt.hist(preds.argmax(dim=1));
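
As a quick sanity check of those gaps, I can map the predicted class indices back to the rating vocab (using the preds from the get_preds cell above):

# which rating values does the model actually predict?
sorted(dls.vocab[int(i)] for i in preds.argmax(dim=1).unique())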

While the most common target is also 4.0, note that its frequency is only about half of what it is in the prediction distribution.

plt.hist(targs.squeeze());

Random Forest (baseline)

As an additional exercise I’ll train a random forest on the userId, title, and rating fields. In the following section, I’ll add some of the additional fields available and see whether that improves the forest’s performance. I’ll follow the approach given in Chapter 9 of the fastai textbook.

Setup

I’ll start by creating a TabularPandas object with a random split:

splits = RandomSplitter(seed=42)(range_of(ratings))
len(splits), len(splits[0]), len(splits[1])
(2, 20000076, 5000019)
to = TabularPandas(
    ratings[['userId', 'title', 'rating']],
    procs=[Categorify, FillMissing],
    cat_names=['userId', 'title'],
    cont_names=None,
    y_names='rating',
    y_block=CategoryBlock,
    splits=splits)
len(to.train), len(to.valid)
(20000076, 5000019)
to.show(3)
userId title rating
8613915 11056 Divergent (2014) 2.0
20221395 128803 Town, The (2010) 4.0
21140474 56442 Jack Ryan: Shadow Recruit (2014) 3.0
to.items.head(3) # coded values
userId title rating
8613915 11056 13670 3
20221395 128803 53868 7
21140474 56442 24434 5
to.vocab # 10 possible ratings
[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
# defining variables
xs,y = to.train.xs, to.train.y
valid_xs,valid_y = to.valid.xs, to.valid.y

xs.shape, y.shape, valid_xs.shape, valid_y.shape
((20000076, 2), (20000076,), (5000019, 2), (5000019,))
#save_pickle(path/'rf_baseline_vars.pkl', (xs, y, valid_xs, valid_y))

I’ll create helper functions to calculate accuracy of the model:

def acc(pred,y): return (pred == y).mean()
def m_acc(m, xs, y): return acc(m.predict(xs), y)
from sklearn.ensemble import RandomForestClassifier

def rf(xs, y, n_estimators=4, max_samples=10_000, max_features=0.5,
       min_samples_leaf=5, **kwargs):
  # fit a random forest classifier; any extra keyword arguments are passed through
  return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators, max_samples=max_samples,
              max_features=max_features, min_samples_leaf=min_samples_leaf,
              oob_score=True, **kwargs).fit(xs, y)

Training Results

Since training on the full data will likely take a while, I’ll first fit a random forest with 4 trees and a max of ten thousand samples per tree, which takes about a minute to train and reaches a 22% validation accuracy. Not bad!

m = rf(xs, y, n_estimators=4, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.22450199689241182, 0.22447194700660136)
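
Since rf sets oob_score=True, the fitted forest also carries an out-of-bag accuracy estimate, which is a quick sanity check that doesn’t require scoring the validation set:

m.oob_score_  # out-of-bag accuracy estimate computed during fit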

Doubling the number of trees (to 8) increases the validation accuracy to 24% (+2%).

m = rf(xs, y, n_estimators=8, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.24051508604267305, 0.24045988625243225)

Tripling the number of trees (to 12) gives a smaller boost (1%) to the validation accuracy (25%).

m = rf(xs, y, n_estimators=12, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.25057739780588834, 0.25026804898141386)

As the number of samples increases by 10_000 (while keeping n_estimators=4) the validation accuracy increases by about 0.2-0.5% each time.

for samples in [20_000, 30_000, 40_000]:
  m = rf(xs, y, n_estimators=4, max_samples=samples)
  print(f'samples: {samples}; train acc: {m_acc(m, xs, y)}; valid acc: {m_acc(m, valid_xs, valid_y)}')
samples: 20000; train acc: 0.23020222523154413; valid acc: 0.22950612787671407
samples: 30000; train acc: 0.23159146995241417; valid acc: 0.2313785207616211
samples: 40000; train acc: 0.23347261280407133; valid acc: 0.23300171459348454

Next, I’ll train a random forest using the parameters given in Chapter 9 of the text (40 trees, 200_000 samples), keeping in mind that those were chosen for a roughly 400k-row dataset, so they’re not tuned for 25M rows. I’ll then double n_estimators and max_samples to see which combination works best. I’m not using a for-loop like above since my Colab instance kept crashing, so I’ll fit the different random forests in individual cells.

40 trees and 200_000 samples results in a validation accuracy of 28% (+3% from the previous best achieved by 12 trees and 10_000 samples).

trees = 40
samples = 200_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 200000; trees: 40; train acc: 0.29; valid acc: 0.28

Doubling the number of trees from 40 to 80 results in a 29% validation accuracy (+1%). It took about 35 minutes to train and predict.

trees = 80
samples = 200_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 200000; trees: 80; train acc: 0.29; valid acc: 0.29

Doubling the number of samples used to 400_000 while keeping the number of trees at 40 achieves the same validation accuracy (29%) and took about 21 minutes for training and inference.

trees = 40
samples = 400_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 400000; trees: 40; train acc: 0.30; valid acc: 0.29

Doubling the number of trees to 80 with 400_000 samples achieves the same accuracy of 29% (while taking 42 minutes for training and inference).

trees = 80
samples = 400_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 400000; trees: 80; train acc: 0.30; valid acc: 0.29

I’ll increase the number of samples substantially, to 2_000_000, keeping the number of trees at 40 since adding trees hasn’t improved validation accuracy much. This again results in a validation accuracy of 29%, so it doesn’t seem like increasing the number of trees or samples will meaningfully change the validation accuracy.

trees = 40
samples = 2_000_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 2000000; trees: 40; train acc: 0.33; valid acc: 0.29
def rf_feat_importance(m, df):
  return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                      ).sort_values('imp', ascending=False)

With this data and model, userId is almost twice as important as the movie title.

rf_feat_importance(m, xs)
cols imp
0 userId 0.640407
1 title 0.359593

Random Forest (with additional data)

There are a few additional columns available that may improve the performance of the random forest:

  • ratings.csv has a timestamp column.
    • This is easy to incorporate as there is one value per rating.
  • movies.csv has a genres column.
    • There are multiple pipe-separated genres per movie, so I’ll pick the first listed genre for each movie.
  • tags.csv has tags associated with each movie by users.
    • There can be multiple tags per movie, so I’ll pick the most common tag for each movie.
  • genome-scores.csv has genome-tags associated with each movie.
    • There are multiple genome-tags per movie, so I’ll pick the genome-tag with the highest score for each movie.
# do str.split on `movies` before merging with `ratings` since it has fewer rows
movies['genres'] = movies['genres'].str.split('|', n=1).str[0]
ratings = ratings.merge(movies)
ratings.head() # peep the new `genres` column with a single genre
userId movieId rating timestamp title genres
0 1 296 5.0 1147880044 Pulp Fiction (1994) Comedy
1 3 296 5.0 1439474476 Pulp Fiction (1994) Comedy
2 4 296 4.0 1573938898 Pulp Fiction (1994) Comedy
3 5 296 4.0 830786155 Pulp Fiction (1994) Comedy
4 7 296 4.0 835444730 Pulp Fiction (1994) Comedy
ratings.shape
(25000095, 6)

Only a fraction (~10%) of all of the ratings.csv userIds are captured in tags.csv, whereas about 75% of the movieIds are captured in tags.csv. I’ll pick the most frequent tag for each movie and merge with the ratings data.

tags = pd.read_csv(path/'tags.csv')
ratings['userId'].unique().shape, tags['userId'].unique().shape
((162541,), (14592,))
ratings['movieId'].unique().shape, tags['movieId'].unique().shape
((59047,), (45251,))

There are about 8k tags that differ only by capitalization, so I’ll set all tags to lower case:

tags['tag'].unique().shape, tags['tag'].str.lower().unique().shape
((73051,), (65465,))
tags['tag'] = tags['tag'].str.lower()
# thanks Claude 3.5 Sonnet
# this was MUCH faster than using groupby + agg
most_common_tags = (
    tags.groupby(['movieId', 'tag'])
    .size()
    .reset_index(name='count')
    .sort_values(['movieId', 'count'], ascending=[True, False])
    .drop_duplicates('movieId')
    .drop('count', axis=1)
)

most_common_tags.head()
movieId tag
78 1 pixar
155 2 robin williams
168 3 fishing
185 4 chick flick
207 5 steve martin
ratings.shape
(25000095, 6)
ratings = ratings.merge(most_common_tags, on=['movieId'], how='left')
print(ratings.shape)
ratings.head()
(25000095, 7)
userId movieId rating timestamp title genres tag
0 1 296 5.0 1147880044 Pulp Fiction (1994) Comedy quentin tarantino
1 3 296 5.0 1439474476 Pulp Fiction (1994) Comedy quentin tarantino
2 4 296 4.0 1573938898 Pulp Fiction (1994) Comedy quentin tarantino
3 5 296 4.0 830786155 Pulp Fiction (1994) Comedy quentin tarantino
4 7 296 4.0 835444730 Pulp Fiction (1994) Comedy quentin tarantino

Only a very small fraction of ratings are without a tag:

ratings['tag'].isna().sum()
83417
genome_scores = pd.read_csv(path/'genome-scores.csv')
genome_tags = pd.read_csv(path/'genome-tags.csv')
genome_scores.head()
movieId tagId relevance
0 1 1 0.02875
1 1 2 0.02375
2 1 3 0.06250
3 1 4 0.07575
4 1 5 0.14075
genome_tags.head()
tagId tag
0 1 007
1 2 007 (series)
2 3 18th century
3 4 1920s
4 5 1930s

I’ll use the most relevant genome tag for each movie:

most_common_genomes = (
    genome_scores.sort_values(['movieId', 'relevance'], ascending=[True, False])
    .drop_duplicates('movieId')
)

most_common_genomes.head()
movieId tagId relevance
1035 1 1036 0.99925
1156 2 29 0.97600
3156 3 901 0.97525
4499 4 1116 0.97525
5412 5 901 0.96025
genome_scores['movieId'].unique().shape, most_common_genomes.shape
((13816,), (13816, 3))

With that sorted, I’ll get the actual tag text:

most_common_genomes = most_common_genomes.merge(genome_tags)
most_common_genomes.shape
(13816, 4)
most_common_genomes = most_common_genomes.rename(columns={'tag': 'genome_tag'})
most_common_genomes.head()
movieId tagId relevance genome_tag
0 1 1036 0.99925 toys
1 1920 1036 0.99575 toys
2 3114 1036 0.99850 toys
3 78499 1036 0.99875 toys
4 81981 1036 0.84225 toys
most_common_genomes['movieId'].unique().shape
(13816,)

and add it to the main DataFrame:

ratings.shape
(25000095, 7)
ratings = ratings.merge(most_common_genomes[['movieId', 'genome_tag']], how='left')
ratings.head()
userId movieId rating timestamp title genres tag genome_tag
0 1 296 5.0 1147880044 Pulp Fiction (1994) Comedy quentin tarantino hit men
1 3 296 5.0 1439474476 Pulp Fiction (1994) Comedy quentin tarantino hit men
2 4 296 4.0 1573938898 Pulp Fiction (1994) Comedy quentin tarantino hit men
3 5 296 4.0 830786155 Pulp Fiction (1994) Comedy quentin tarantino hit men
4 7 296 4.0 835444730 Pulp Fiction (1994) Comedy quentin tarantino hit men

With the DataFrame established, I’ll create a TabularPandas object using the same seed (42) as before for splits.

splits = RandomSplitter(seed=42)(range_of(ratings))
len(splits), len(splits[0]), len(splits[1])
(2, 20000076, 5000019)
to = TabularPandas(
    ratings[['userId', 'title', 'timestamp', 'genres', 'tag', 'genome_tag', 'rating']],
    procs=[Categorify, FillMissing],
    cat_names=['userId', 'title', 'genres', 'tag', 'genome_tag'],
    cont_names=['timestamp'],
    y_names='rating',
    y_block=CategoryBlock,
    splits=splits)
len(to.train), len(to.valid)
(20000076, 5000019)
to.show(3)
userId title genres tag genome_tag timestamp rating
8613915 11056 Divergent (2014) Adventure dystopia vampire human love 1422282990 2.0
20221395 128803 Town, The (2010) Crime ben affleck crime 1312046717 4.0
21140474 56442 Jack Ryan: Shadow Recruit (2014) Action cia tom clancy 1491438306 3.0
to.items.head(3) # coded values
userId title timestamp genres tag genome_tag rating
8613915 11056 13670 1422282990 3 4536 791 3
20221395 128803 53868 1312046717 7 1983 183 7
21140474 56442 24434 1491438306 2 3241 769 5
xs,y = to.train.xs, to.train.y
valid_xs,valid_y = to.valid.xs, to.valid.y

xs.shape, y.shape, valid_xs.shape, valid_y.shape
((20000076, 6), (20000076,), (5000019, 6), (5000019,))

Using the additional columns has increased the validation accuracy from 22% (when using only userId and title) to 23.7% (when adding the timestamp, genres, tag, and genome_tag columns).

m = rf(xs, y, n_estimators=4, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.23697224950545187, 0.23664510074861717)

Doubling the number of trees to 8 yields a validation accuracy of 25% (a ~1% increase from before).

m = rf(xs, y, n_estimators=8, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.2500907496551513, 0.24987865046112825)

Tripling the number of trees to 12 yields a validation accuracy of 26% (a ~1% increase from before).

m = rf(xs, y, n_estimators=12, max_samples=10_000)
m_acc(m, xs, y), m_acc(m, valid_xs, valid_y)
(0.25923971488908343, 0.2590440156327406)

For every 10_000 sample increase (with 4 trees) the validation accuracy improves between 0.02-0.6%.

for samples in [20_000, 30_000, 40_000]:
  m = rf(xs, y, n_estimators=4, max_samples=samples)
  print(f'samples: {samples}; train acc: {m_acc(m, xs, y)}; valid acc: {m_acc(m, valid_xs, valid_y)}')
samples: 20000; train acc: 0.24337092519048428; valid acc: 0.24275807751930542
samples: 30000; train acc: 0.2467036625260824; valid acc: 0.24597866528107193
samples: 40000; train acc: 0.24714531084781877; valid acc: 0.24619486445951505

Increasing the number of trees to 40 and the number of samples to 200_000 results in a validation accuracy of 30% (the baseline random forest validation accuracy was 28%).

trees = 40
samples = 200_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 200000; trees: 40; train acc: 0.30; valid acc: 0.30

Doubling the number of trees to 80 results in a validation accuracy of 30% (the baseline was 29%).

trees = 80
samples = 200_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 200000; trees: 80; train acc: 0.31; valid acc: 0.30

Doubling the number of samples to 400_000 with 40 trees gets a validation accuracy of 30% (baseline was 29%).

trees = 40
samples = 400_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 400000; trees: 40; train acc: 0.31; valid acc: 0.30

Doubling the number of trees to 80 with 400_000 samples does not improve on the 30% validation accuracy.

trees = 80
samples = 400_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 400000; trees: 80; train acc: 0.32; valid acc: 0.30

Even after increasing the number of samples to 2M, the validation accuracy stays at 30%.

trees = 40
samples = 2_000_000
m = rf(xs, y, n_estimators=trees, max_samples=samples)
print(f'samples: {samples}; trees: {trees}; train acc: {m_acc(m, xs, y):.2f}; valid acc: {m_acc(m, valid_xs, valid_y):.2f}')
samples: 2000000; trees: 40; train acc: 0.36; valid acc: 0.30

It’s interesting (and a bit concerning) to note that timestamp is the most important feature for this model: it’s about six times as important as the movie title.

rf_feat_importance(m, xs)
cols imp
5 timestamp 0.443988
0 userId 0.346306
1 title 0.073165
3 tag 0.060231
4 genome_tag 0.056996
2 genres 0.019314

I’ll follow fastai’s Chapter 9 approach to determine which columns’ values differ the most between the training and validation sets.

df_dom = pd.concat([xs, valid_xs])
is_valid = np.array([0]*len(xs) + [1]*len(valid_xs))
m = rf(df_dom, is_valid, n_estimators=40, max_samples=400_000)

timestamp is the most important feature when distinguishing between the training and validation sets. I’ll remove it to see if it improves training.

rf_feat_importance(m, df_dom)
cols imp
5 timestamp 0.313090
0 userId 0.311026
1 title 0.131196
3 tag 0.106029
4 genome_tag 0.099509
2 genres 0.039150
m = rf(xs.drop('timestamp', axis=1), y, n_estimators=40, max_samples=400_000)

After removing timestamp, userId surges to the top, being 6 times as important as the 2nd-most important feature (title). The validation accuracy stays at 30%.

rf_feat_importance(m, xs.drop('timestamp', axis=1))
cols imp
0 userId 0.674746
1 title 0.113734
3 tag 0.096440
4 genome_tag 0.088569
2 genres 0.026510
m_acc(m, valid_xs.drop('timestamp', axis=1), valid_y)
0.2971844706990113
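
Another option I didn’t pursue would be to keep the time signal but express it as coarser calendar features instead of the raw Unix epoch value, using fastai’s add_datepart. A sketch (not run here; rated_at is just a column name I made up):

# sketch only: expand the raw Unix timestamp into calendar features (year, month,
# day of week, etc.) so the forest can pick up seasonality rather than the raw
# epoch value
df_dates = ratings[['timestamp']].copy()
df_dates['rated_at'] = pd.to_datetime(df_dates['timestamp'], unit='s')
df_dates = add_datepart(df_dates, 'rated_at')
df_dates.head()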

Neural Net (tabular_learner)

Next, I’ll train a neural net on this data and see how it performs. I’d also like to use the embeddings from the neural net later on to train a new random forest.

df_nn = ratings.drop(['timestamp', 'movieId'], axis=1)
df_nn.head()
userId rating title genres tag genome_tag
0 1 5.0 Pulp Fiction (1994) Comedy quentin tarantino hit men
1 3 5.0 Pulp Fiction (1994) Comedy quentin tarantino hit men
2 4 4.0 Pulp Fiction (1994) Comedy quentin tarantino hit men
3 5 4.0 Pulp Fiction (1994) Comedy quentin tarantino hit men
4 7 4.0 Pulp Fiction (1994) Comedy quentin tarantino hit men
procs_nn = [Categorify, FillMissing]
cat_nn = ['userId', 'genres', 'tag', 'genome_tag']
cont_nn = None
splits = RandomSplitter(seed=42)(range_of(df_nn))
len(splits), len(splits[0]), len(splits[1])
(2, 20000076, 5000019)
to_nn = TabularPandas(
    df_nn,
    procs_nn,
    cat_names=cat_nn,
    cont_names=None,
    splits=splits,
    y_names='rating',
    y_block=CategoryBlock)
dls = to_nn.dataloaders(1024)
dls.show_batch()
userId genres tag genome_tag rating
0 149149 Documentary boxing documentary 4.0
1 5588 Action conspiracy conspiracy 5.0
2 67949 Adventure animation computer animation 3.0
3 61069 Action satire satire 0.5
4 43620 Comedy 1930s 1930s 3.0
5 161519 Action natalie portman hit men 4.0
6 155206 Crime tom hanks oscar (best directing) 4.0
7 91508 Drama england skinhead 4.0
8 76347 Drama clint eastwood western 5.0
9 30987 Crime serial killer oscar (best directing) 5.0
learn = tabular_learner(dls, layers=[1000, 500], loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.lr_find()
SuggestedLRs(valley=0.0003311311302240938)

The GPU RAM stayed at around 3.5/15.0 GB during training, so if I wanted to train again I’d use a larger batch size. The neural net achieved a 38% validation accuracy, better than the random forest (30%) and the DotProductBiasCE model (35%).

learn.fit_one_cycle(5, 2e-2)
epoch train_loss valid_loss accuracy time
0 1.576747 1.615499 0.359292 19:06
1 1.557086 1.605259 0.362625 18:48
2 1.514029 1.582966 0.369193 19:03
3 1.422227 1.559873 0.377868 18:50
4 1.328115 1.570342 0.382583 18:48

The neural net is much better at predicting a diverse range of ratings (whereas DotProductBiasCE did not predict any ratings ending in .5). Like the DotProductBiasCE model, the neural net predicted 4.0 most accurately. Unlike DotProductBiasCE, the diagonal of the confusion matrix is somewhat darker than the off-diagonal cells, showing that the model does a better job of correctly predicting ratings (as reflected in the higher overall accuracy).

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6))

preds, targs = learn.get_preds(dl=learn.dls.valid)

The distribution of predictions has no gaps—the model predicts all possible ratings.

plt.figure(figsize=(10, 6))
plt.hist(preds.argmax(dim=1).squeeze(), alpha=0.7, color='blue', label='Predictions');
plt.hist(targs.squeeze(), alpha=0.5, color='red', label='Targets');
plt.legend()
plt.xlabel('Rating Index')
plt.ylabel('Frequency')
plt.show()

Random Forest with Neural Net Embeddings

The final approach I was hoping to implement was using the embeddings from the neural net to create additional columns and using them to train a random forest, as done in this Medium post.

However, I ran into some RAM and disk space issues. To illustrate, here are the four Embeddings the model learned, which together output 955 values per row.

learn.model.embeds
ModuleList(
  (0): Embedding(162542, 600)
  (1): Embedding(21, 9)
  (2): Embedding(9856, 276)
  (3): Embedding(841, 70)
)

The smallest data type I could use without getting an error when passing a column of xs through an Embedding was int32, which takes 4 bytes per element. With 20M rows in the training set, outputting 955 values of 4 bytes each per row would take about 76 GB of storage.

20e6*4*955/1e9
76.4
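
For reference, here’s a minimal sketch of how the embedding features could be extracted in chunks and written straight to disk (I did not run this end-to-end given the constraints below; it assumes the learn from the tabular_learner above and the integer-coded categorical columns in to_nn.train.xs):

import numpy as np
import torch

def embed_features_chunked(learn, xs, out_path, chunk_size=100_000):
    # look up the learned embeddings for each row in chunks and write them to a
    # memory-mapped float32 array so they never all sit in RAM at once
    embeds = learn.model.embeds                        # ModuleList of Embedding layers
    device = next(learn.model.parameters()).device
    n_out = sum(e.embedding_dim for e in embeds)       # 955 for this model
    out = np.memmap(out_path, dtype='float32', mode='w+', shape=(len(xs), n_out))
    with torch.no_grad():
        for start in range(0, len(xs), chunk_size):
            chunk = torch.tensor(xs.iloc[start:start+chunk_size].values,
                                 dtype=torch.long, device=device)
            feats = torch.cat([e(chunk[:, i]) for i, e in enumerate(embeds)], dim=1)
            out[start:start+len(chunk)] = feats.cpu().numpy()
    out.flush()
    return out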

Another 5M rows of the validation set (each with 955 Embedding outputs) would bump that up to nearly 100 GB. Kaggle provides 73GB of disk space and 30GB of RAM. Google Colab provides 107GB of disk space and 13GB of RAM.

Handling these sorts of RAM/disk space issues is something I want to learn about and experiment with in the future, after which I can return to this scale of a dataset in an attempt to train it.

Final Thoughts

Training models on the full MovieLens 25M rating dataset using Kaggle and/or Google Colab was tough. Training a single model took up to 9 hours (in the case of DotProductBias), so just getting through five different architectures required patience (on top of a lot of runtime crashes due to RAM maxing out). However, it was still fun to see how these different architectures behaved and performed during training. Here’s a summary of my results:

| Arch | Metric | Metric Value |
|---|---|---|
| DotProductBias | MSE | 0.654875 |
| DotProductBiasCE | Accuracy | 35% |
| Random Forest (baseline) | Accuracy | 29% |
| Random Forest (additional columns) | Accuracy | 30% |
| Neural Net | Accuracy | 38% |

In the fastai textbook, the MSE of DotProductBias on the 100k rating subset was about 0.8, and in my experiments with DotProductBiasCE (Cross-Entropy Loss) on that subset the accuracy was about 40%. The 25M DotProductBias beat the 100k subset’s MSE, and the 25M neural net was competitive with the 100k subset’s accuracy.

There are still performance gains I didn’t pursue (for example, training the neural net for more epochs or trying different hidden layer sizes), but I’m satisfied with the breadth of my experiments and the learning experience.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.