Recap: HMS HBAC Kaggle Competition

fastai
kaggle competition
deep learning
A recap of what I did, and how I did it, in the Harvard Medical School Harmful Brain Activity Classification Kaggle Competition.
Author

Vishal Bakshi

Published

April 16, 2024

Background

In this notebook, I’ll recap my experience participating in the Harvard Medical School Harmful Brain Activity Classification Kaggle Research Competition. I finished in 2666th place (out of 2767 teams). I fell short of my 2024 goal to place in the top 50%, but I’m happy that I rose 11 spots in the final ranking (compared to the public score rankings) to fall just outside the bottom 100.

Here’s how I’ll approach my recap:

  1. I’ll summarize my overall process, linking to some example public notebooks I published.
  2. I’ll then analyze my submission results, commenting on any patterns that I see.
  3. I’ll re-envision my process: what would I do differently if I did this competition again?

Overall Process

Notebooks

For the initial small model experiments, I created the following sequence of notebooks for each family (convnext, swinv2, vit):

  1. A Part 1 [Train] notebook where I train 50+ models and document their hyperparameters and final error rates in this gist.
  2. A Part 1 [Analysis] notebook where I analyze the results of the previous notebook and pick the top models for submission.
  3. A Part 2 [Train] notebook where I re-train those top models and export them for submission.
  4. A Part 2 [Submit] notebook, with internet access disabled, where I load my models, calculate predictions, and export them to submission.csv.
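
Here’s a rough sketch of what the Part 2 [Train] to Part 2 [Submit] handoff looks like with fastai. The filenames, column names, and variables below are placeholders rather than my exact code:

```python
from fastai.vision.all import *

# Part 2 [Train]: learn is an already fine-tuned fastai Learner.
# export() pickles the model plus its transforms so the offline notebook can reload it.
learn.export('convnext_part2.pkl')  # placeholder filename

# Part 2 [Submit] (internet disabled): reload the exported Learner from an attached dataset,
# predict on the test items, and write submission.csv.
learn = load_learner('convnext_part2.pkl')
test_dl = learn.dls.test_dl(test_items)      # test_items: placeholder list of test spectrogram images
preds, _ = learn.get_preds(dl=test_dl)

sub = pd.DataFrame(preds.numpy(), columns=learn.dls.vocab)  # one probability column per class
sub.insert(0, 'eeg_id', test_ids)                           # test_ids: placeholder identifier list
sub.to_csv('submission.csv', index=False)
```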

Models

I generally followed this approach for each family of models (convnext, swinv2 and vit):

  1. Train many (50+) small models.
  2. Pick the top 5 best-performing models (with the lowest TTA Validation Error Rate; see the sketch right after this list).
  3. Submit those 5 models individually, and as 3- or 5-model ensembles where each model is weighted twice.
  4. Pick the top 3 performing models (with the lowest Kaggle Public Score).
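
The TTA Validation Error Rate in step 2 comes from test-time augmentation: averaging predictions over several augmented copies of each validation image. A minimal sketch of how that number can be computed with fastai, assuming `learn` is a trained Learner (this mirrors my setup only roughly):

```python
from fastai.vision.all import *

# learn: a trained fastai Learner with a labeled validation set.
# tta() averages predictions over augmented versions of each validation item.
preds, targs = learn.tta()           # defaults to the validation DataLoader
tta_err = error_rate(preds, targs)   # 1 - accuracy, the number used to rank models
print(f'TTA validation error rate: {tta_err:.4f}')
```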

After I had the top 3 models for each family:

  1. Submit 9-model ensembles where each model is weighted twice.
  2. Pick the top 5 best-performing models.
  3. Submit those 5 models as ensembles with each model weighted twice.
  4. Pick the top 3 best-performing models.
  5. Submit those 3 models as ensembles with each model weighted twice.

In some cases I also submitted ensembles of just the top 2 models; in fact, my best-performing submission was one of these.
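
A minimal sketch of the kind of ensembling described above, operating on each model’s already-computed class probabilities (the function name and weights are illustrative, not my exact submission code):

```python
import numpy as np

def ensemble(preds_list, weights=None):
    """Weighted average of per-model class probabilities.

    preds_list: list of (n_samples, n_classes) arrays, one per model.
    weights:    one weight per model (e.g. 2 for each model, per the description above);
                equal weights reduce to a plain mean.
    """
    preds = np.stack(preds_list)                    # (n_models, n_samples, n_classes)
    w = np.ones(len(preds_list)) if weights is None else np.asarray(weights, dtype=float)
    avg = np.tensordot(w, preds, axes=1) / w.sum()  # weighted mean over models
    return avg / avg.sum(axis=1, keepdims=True)     # re-normalize rows to sum to 1

# e.g. a 3-model ensemble with each model weighted twice:
# blended = ensemble([p_convnext, p_swinv2, p_vit], weights=[2, 2, 2])
```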

Data

Each group of models was trained on one of the following datasets that I created, heavily referencing others’ public competition notebooks:

Submission Results

I submitted 107 times (more than twice as many submissions as anyone who finished below me). My first and worst submission (a quick-and-dirty ResNet18) resulted in a Public/Private score of 26.6/26.2. My penultimate and best submission (two swinv2 models trained on different datasets) got a Public/Private score of 1.15/1.17. Data on my full submission results are in this gist. Some key takeaways:

  • For 99 out of 107 submissions (including my best submission), the Private score was worse than the Public score, meaning that, overall, my approach didn’t generalize well to the full test dataset.
  • The best performing ensembles (outside of my best 2-model submission) were those with 4 models, yielding a median 1.39 Private score across 10 submissions.
  • The 400x300 log-norm stacked train spectrograms performed the best, with a median Private score of 1.2 across 10 submissions.
  • Using MixUp improved the median Private score by about 20% (see the sketch below).
  • The swinv2 models, even though they didn’t always have the best final validation error rate during training, ultimately performed the best.
  • In some cases, models from the same family performed better when ensembled with models from other families.

I’m hesitant to take these patterns too seriously, as my overall Private scores were pretty terrible.
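
For reference, here’s roughly what enabling MixUp looks like in fastai; the DataLoaders construction, column names, and hyperparameters below are placeholders rather than my exact training code:

```python
from fastai.vision.all import *

# train_df, path: placeholder metadata DataFrame and image folder for one of the spectrogram datasets.
dls = ImageDataLoaders.from_df(train_df, path, fn_col='fname', label_col='label',
                               item_tfms=Resize(400), bs=64)

# MixUp blends random pairs of images (and their labels) within each batch.
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate, cbs=MixUp())
learn.fine_tune(5)
```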

What Would I Do Differently?

There are quite a few things I would do differently:

  1. Iterate faster. I tried my best to emulate Jeremy Howard’s approach in his Road to the Top series, but I still spent a lot of time training early on since I was using the full dataset. For example, each of the 50+ convnext_small_in22k models took 30-40 minutes to train; training on a subset of the data could have trimmed that time down considerably (see the sketch after this list).
  2. Start earlier. I joined the competition 2 months late, with 1 month to go. I would have spent one of those months trying out more variations (and subsets) of the data and another month experimenting with different architectures and augmentations.
  3. Don’t waste submissions. There are of course physical constraints (time, energy, resources), but I could probably have squeezed in 20 more submissions if I had managed my time more efficiently. I lost a few days here and there waiting to see how my models performed in submissions instead of immediately starting to train the next batch of experiments while the last batch was being submitted. In some cases, I had to wait until all 10+ submissions in a round had completed before picking the top models to train next. On the other hand, I could have avoided wasting 3-4 days pursuing the wrong approach if I had stopped for a day to collect my thoughts and reflect on my strategy more often.
  4. Submit more models. My best-performing models didn’t always have the best error rate during training, so there’s a chance that models I didn’t submit (because they ranked 6th or 7th or 12th based on training results) might have performed well in submissions.
  5. Set aside a test set. Most of my models performed worse on the Private test set. I also couldn’t submit all the models I wanted because I was running out of days and submissions. If I had set aside a test set, I could have seen how more models performed on data not used during training or validation.
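
A minimal sketch of what points 1 and 5 could look like together: carve out a patient-level test split that never touches training or validation, then sample a fraction of the rest for fast iteration (the column names and fractions are placeholders):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)

# train.csv: the competition's training metadata (one row per labeled segment).
df = pd.read_csv('train.csv')

# 1. Hold out ~10% of patients as a test set never used for training or validation.
patients = df['patient_id'].unique()
test_patients = rng.choice(patients, size=int(0.1 * len(patients)), replace=False)
test_df = df[df['patient_id'].isin(test_patients)]
trainval_df = df[~df['patient_id'].isin(test_patients)]

# 2. For fast iteration, train on a random subset of the remaining data
#    (e.g. 25% instead of the full dataset), scaling up only for the final models.
small_df = trainval_df.sample(frac=0.25, random_state=42)
```

Splitting by patient rather than by individual row keeps segments from the same recording out of more than one split.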

Final Thoughts

I absolutely enjoyed the thrill of competing in such a challenging Kaggle competition and am absolutely unsatisfied with my final ranking.

I gained a lot of experience working through first-time challenges: how to upload a model or dataset to Kaggle, how to run inference on a model with internet access disabled, how to manually document hyperparameters and results for 173 training runs, how to train in 6-hour increments without losing progress (the Paperspace Free Tier time limit), and how to best keep track of notebooks (so many notebooks!). After resting for a few days to collect my thoughts (and energy), I have begun participating in the BirdCLEF 2024 and Automated Essay Scoring 2.0 competitions (both started two weeks ago and have multiple months remaining). I am already improving my approach because of my experience in this competition.
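
For the 6-hour-increment training in particular, the core trick is just saving and reloading model and optimizer state between Paperspace sessions; a rough sketch with fastai (checkpoint names and epoch counts are placeholders):

```python
# End of a session: save the model and optimizer state so training can resume later.
learn.save('swinv2_checkpoint')        # writes models/swinv2_checkpoint.pth under learn.path

# Start of the next session: rebuild the same Learner, then restore the checkpoint.
learn = learn.load('swinv2_checkpoint')
learn.fit_one_cycle(2, lr_max=1e-4)    # continue training for a couple more epochs
```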

As always, I hope you enjoyed this blog post!