Recap: My First Live Kaggle Competition
Background
After submitting predictions for a couple of closed Kaggle competitions as part of the fastai course, I participated in my first live Kaggle competition, Multi-Class Prediction of Obesity Risk. I ended up in the bottom 17% of the Private Leaderboard, ranking 2960 out of 3587. My Private ranking was 281 spots lower than my Public ranking (yikes!).
You can see my live competition notebook here and my post-competition notebook here.
In this blog post, I’ll recap my experience in this competition, and what I took away from it.
My main goal right now is learning, learning, learning. That being said, getting 10+ upvotes and 200+ views (probably 50 of them were mine) on my first bronze notebook felt AWESOME. I am already excited to try to get another bronze notebook. I also would like to get into the top 50% on the final Private Leaderboard for any live competition (including the playground series) this year.
Live Competition Approach
I decided to strictly follow the fastai textbook’s Chapter 9 approach to tabular data prediction using a Random Forest, Neural Net, and ensemble of both. I wanted to understand how each step affected the Public score.
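As a rough illustration of the Random Forest + neural net ensemble idea (this is not the code from my notebook), here is a minimal sketch that averages the predicted class probabilities of two models. The data is synthetic, and scikit-learn's MLPClassifier is swapped in as a stand-in for the fastai tabular learner, so every name and setting below is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic multi-class data standing in for the competition dataset (assumption).
X, y = make_classification(n_samples=2000, n_features=16, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Random Forest half of the ensemble.
rf = RandomForestClassifier(n_estimators=40, min_samples_leaf=5, random_state=42)
rf.fit(X_train, y_train)

# Neural net half: MLPClassifier is a stand-in, not the fastai model.
nn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=42)
nn.fit(X_train, y_train)

# Both models order their probability columns by sorted class label,
# so the two arrays can be averaged column by column.
rf_probs = rf.predict_proba(X_valid)
nn_probs = nn.predict_proba(X_valid)
ens_probs = (rf_probs + nn_probs) / 2
ens_preds = rf.classes_[ens_probs.argmax(axis=1)]

for name, preds in [("rf", rf.predict(X_valid)),
                    ("nn", nn.predict(X_valid)),
                    ("ensemble", ens_preds)]:
    print(name, (preds == y_valid).mean())
```

Printing all three validation accuracies side by side makes it easy to see whether the ensemble actually helps or just drags the stronger model down.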
In the end, I went through 9 versions of my notebook, summarized in the table below:
Version | Description | Private Score | Public Score |
---|---|---|---|
1 | quick and dirty RF | 0.89378 | 0.89585 |
2 | ordered ordinal columns | 0.89495 | 0.89559 |
3 | high importance cols only | 0.89342 | 0.8945 |
4* | – | – | – |
5 | rf with id col removed | 0.89116 | 0.88728 |
6 | neural net | 0.8675 | 0.86452 |
7^^ | rf nn ensemble | 0.8861 | 0.88656 |
8** | increase number of trees | – | – |
9^^ | embedding-rf nn ensemble | 0.88538 | 0.89053 |
*In version 4, I was planning to remove redundant features, but none of them were redundant so I didn’t re-train my Random Forest and didn’t submit predictions.
**In version 8, I was planning on increasing the number of trees in my Random Forest but that didn’t improve the validation set accuracy so I didn’t submit any predictions.
^^Selected for final leaderboard score (0.88610)
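To make the "ordered ordinal columns" step in version 2 concrete, here is a small sketch of the idea in pandas: tell the categorical column what the natural order of its levels is, so the integer codes a tree splits on follow that order. The column name snack_freq and its levels are hypothetical, chosen just for the example.

```python
import pandas as pd

# Hypothetical ordinal column; the name and levels are made up for illustration.
df = pd.DataFrame({"snack_freq": ["Sometimes", "no", "Always", "Frequently", "Sometimes"]})

# Declare the natural order of the levels, lowest to highest.
levels = ["no", "Sometimes", "Frequently", "Always"]
df["snack_freq"] = pd.Categorical(df["snack_freq"], categories=levels, ordered=True)

# The integer codes now follow the declared order, so a split such as
# "code <= 1" means "no or Sometimes" rather than an arbitrary grouping.
df["snack_freq_code"] = df["snack_freq"].cat.codes
print(df)
```

Without the explicit ordering, pandas sorts the levels alphabetically, which gives codes that have nothing to do with how often something happens.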
Why Did I Score So Low?
It’s interesting to note that the highest Private scores were my Random Forest with ordinal columns set (0.89495) and my quick and dirty Random Forest (0.89378).
I think the chapter’s strategies did not improve my score because they were not solely meant to improve accuracy; they were also meant to simplify the model and dataset for better interpretability. However, this dataset was small to begin with (17 independent variables), and removing low-importance variables and the id column only brought that down to 12 columns. Contrast that with the textbook, where we went from about 70 columns to 20.
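For anyone unfamiliar with the "keep only the high importance columns" step, here is a minimal sketch using scikit-learn's built-in feature importances. The data is synthetic, the column names are placeholders, and the 0.05 threshold is an arbitrary value picked for this example rather than the one from my notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 17 columns, only some of them informative (assumption).
X, y = make_classification(n_samples=1000, n_features=17, n_informative=6,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
cols = [f"col_{i}" for i in range(X.shape[1])]  # placeholder column names
df = pd.DataFrame(X, columns=cols)

rf = RandomForestClassifier(n_estimators=40, min_samples_leaf=5, random_state=0)
rf.fit(df, y)

# Importances sum to 1 across columns; sort them from most to least important.
fi = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(fi)

# Keep only the columns above an (arbitrary) importance threshold.
threshold = 0.05
to_keep = fi[fi > threshold].index.tolist()
print(f"keeping {len(to_keep)} of {len(cols)} columns")
```

The next step would be to retrain on df[to_keep] and compare validation accuracy against the full model before committing to the smaller column set.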
Jeremy has also mentioned throughout the course that getting that final 1% or 2% of accuracy for a Kaggle score is generally when you have to start fussing with the details. Simplifying and understanding a model (how different columns and rows of the dataset are used/affected by it) is a different problem to solve than getting 2% more to win a Kaggle competition.
I was surprised that the neural nets performed so poorly, even in the public score. I was banking on the ensemble being more flexible than the Random Forest, and expected it to result in a higher final Private score.
I haven’t looked at anyone else’s notebooks yet (I plan on doing that next), but I did see quite a few XGBoost-related notebook titles, and my understanding is that XGBoost often performs better than Random Forests on tabular data. That’s something I’ll practice in the next tabular competition I join.
Post-Competition Analysis
After the competition was over, I decided to dig deeper into Random Forests, exploring differences in validation accuracy due to changes in parameters like n_jobs and n_estimators.
I ended up modeling and analyzing 960 Random Forests. You can see my whole process detailed in this notebook.
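The general shape of that kind of sweep looks something like the sketch below: loop over every combination of a few parameter values, train a Random Forest for each, and record the validation accuracy. This is a scaled-down illustration on synthetic data with a made-up grid of 8 combinations, not the actual grid that produced my 960 models.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); 1200 rows end up in the training set.
X, y = make_classification(n_samples=1500, n_features=16, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A tiny illustrative grid; the values are placeholders.
grid = {
    "n_estimators": [20, 40],
    "max_samples": [500, 1000],
    "min_samples_leaf": [2, 10],
}

rows = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    rf = RandomForestClassifier(random_state=42, **params)
    rf.fit(X_train, y_train)
    # score() returns mean accuracy on the validation set.
    rows.append({**params, "valid_acc": rf.score(X_valid, y_valid)})

# One row per forest, best validation accuracy first.
results = pd.DataFrame(rows).sort_values("valid_acc", ascending=False)
print(results)
```

Collecting the results in a DataFrame makes it straightforward to group by a single parameter afterwards and see how much it moves the validation accuracy.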
I chose 15 models out of 960 to submit to Kaggle post-competition in order to see their Private score, answering the question—should I have focused on tuning Random Forests instead of tuning an ensemble with a neural net?
The following table lists out my results. Here is a definition of the parameters I experimented with:
n_jobs = the number of processors used (None = 1 and -1 = all)
n_estimators = the number of trees included in the Random Forest
max_samples = the number of randomly selected rows used to train each tree
max_features = the number (or, as a float like 0.5, the fraction) of randomly selected columns considered at each split
min_samples_leaf = the minimum number of samples allowed in a leaf node
oob_score = whether or not to compute an out-of-bag (OOB) score for the fitted forest
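Here is how those six parameters map onto scikit-learn's RandomForestClassifier, as a minimal sketch on synthetic data. The values are illustrative; in particular max_samples is set to 1000 here because an integer max_samples cannot exceed the number of training rows, and this toy dataset only has 3000.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=3000, n_features=16, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=1)

rf = RandomForestClassifier(
    n_jobs=None,         # None means 1 job; -1 would use all processors
    n_estimators=100,    # number of trees
    max_samples=1000,    # rows drawn (with replacement) to train each tree
    max_features=0.5,    # fraction of columns considered at each split
    min_samples_leaf=2,  # smallest allowed leaf
    oob_score=True,      # also compute accuracy on each tree's unused rows
    random_state=1,
)
rf.fit(X, y)

print(len(rf.estimators_))  # one fitted tree per n_estimators
print(rf.oob_score_)        # out-of-bag accuracy estimate
```

Note that oob_score only adds an extra evaluation after fitting; it does not change how the trees themselves are built.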
n_jobs | n_estimators | max_samples | max_features | min_samples_leaf | oob_score | Private score |
---|---|---|---|---|---|---|
None | 100 | 10000 | 0.5 | 2 | False | 0.89875* |
None | 100 | 15000 | 0.5 | 2 | True | 0.89839 |
None | 60 | 10000 | 0.5 | 2 | True | 0.89803 |
None | 80 | 15000 | 0.5 | 2 | True | 0.89748 |
None | 100 | 15000 | 0.5 | 5 | False | 0.8973 |
None | 60 | 5000 | None | 5 | True | 0.89297 |
-1 | 40 | 10000 | 0.5 | 10 | False | 0.89207 |
-1 | 80 | 10000 | 0.5 | 10 | False | 0.89197 |
None | 20 | 15000 | None | 10 | True | 0.89143 |
None | 80 | 10000 | None | 10 | False | 0.89143 |
None | 60 | 1000 | None | 10 | False | 0.87165 |
None | 20 | 1000 | None | 10 | False | 0.87039 |
-1 | 40 | 1000 | None | 10 | False | 0.86994 |
-1 | 60 | 1000 | None | 10 | True | 0.86768 |
None | 20 | 1000 | None | 10 | True | 0.86109 |
*Top 65% in final leaderboard
The best post-competition result with a single Random Forest I was able to get was a Private score of 0.89875 which would have landed me in the top 65% of the final leaderboard. Not the top 50% result I’m looking for this year, but significantly better than the top 83% result I got.
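For reference, the "top X%" figures in this post are just rank divided by the number of teams. A quick sketch of that arithmetic for my live result (the function name is my own, made up for this example):

```python
def top_percent(rank: int, n_teams: int) -> float:
    """Return the leaderboard position as a 'top X%' figure."""
    return 100 * rank / n_teams

# Live competition result: rank 2960 out of 3587 teams.
live = top_percent(2960, 3587)
print(f"top {live:.1f}%")  # top 82.5%
```

Rounded to the nearest whole number that is the top 83%, or equivalently the bottom 17%.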
It’s tough to tell which parameters contributed to better Private scores; I would need a larger sample to work with. Still, it’s interesting to note that no model with a max_samples value of 1000 cracked the top 10 of the 15 models listed here. Similarly, no model with an n_estimators value of 20 or a min_samples_leaf value of 10 got into the top 5. I had expected that setting max_samples “too high” or min_samples_leaf “too low” would overfit the Random Forest, but that does not seem to be the case, at least not for this competition with this test set.
I’ll also note that all of my submissions with a Private score of 0.89 or greater (live and post-competition) were single Random Forests, and all but one of them (live version 5, with a Public score of 0.88728) also had a Public score greater than 0.89.
I certainly don’t feel like I can make any solid claims with this analysis about Random Forests in general, but I can say that tuning Random Forest parameters is worth exploring in a Kaggle tabular competition.
Final Thoughts
Almost every time I code something, I keep at the forefront of my mind the saying “make it work, make it right, make it fast” by Kent Beck. After this competition, I feel I landed somewhere between making it work and making it right. What is “right” when it comes to a competition? Well, at this stage of my machine learning journey, I would like to rank in the top 50% in the final Leaderboard. I was in the top 83% during the live competition, and in the top 65% post-competition, so I’m moving in the right direction.
I also want to take all of my learnings with a grain of salt. This was one (relatively small) dataset that certainly had its own problems (as I detailed in my live competition notebook) with one (relatively small) test set. Just because my neural net didn’t perform very well doesn’t indict all neural nets. Similarly, just because my single Random Forests performed well doesn’t mean they always will. Also, just because the textbook’s Chapter 9 approach to tabular prediction didn’t result in a top Kaggle competition Private score doesn’t mean it’s not immensely valuable to data science in production.
At the end of this year, after hopefully competing in at least a couple more live competitions, I will look back at this experience as a necessary but insufficient step towards having good intuition about machine learning.
As always, I hope you enjoyed this blog post!