Paddy Doctor Kaggle Competition - Part 3

deep learning
kaggle competition
paddy doctor
In this notebook I work through Jeremy Howard’s Live Coding 10 video in which he continues working on the Paddy Doctor Disease Classification Kaggle Competition.

Vishal Bakshi


February 5, 2024


In the fastai course Part 1 Lesson 6 video Jeremy Howard walked through the notebooks First Steps: Road to the Top, Part 1 and Small models: Road to the Top, Part 2 where he builds increasingly accurate solutions to the Paddy Doctor: Paddy Disease Classification Kaggle Competition. In the video, Jeremy referenced a series of walkthrough videos that he made while working through the four-notebook series for this competition. I’m excited to watch these walkthroughs to better understand how to approach a Kaggle competition from the perspective of a former #1 Kaggle grandmaster.

In this blog post series, I’ll walk through the code Jeremy shared in each of the 6 Live Coding videos focused on this competition, submitting predictions to Kaggle along the way. My last two blog posts in this series reference Jeremy’s Scaling Up: Road to the Top, Part 3 notebook to improve my large model ensemble predictions. Here are the links to each of the blog posts in this series:

Link to the Live Coding 10 video


!pip install -qq timm==0.6.13
import timm
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *
from fastai.vision.all import *

comp = 'paddy-disease-classification'
path = setup_comp(comp, install='fastai')
/opt/conda/lib/python3.10/site-packages/scipy/ UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
(#4) [Path('../input/paddy-disease-classification/sample_submission.csv'),Path('../input/paddy-disease-classification/train_images'),Path('../input/paddy-disease-classification/train.csv'),Path('../input/paddy-disease-classification/test_images')]
trn_path = path/'train_images'

The best models for fine tuning image recognition

All deep learning models will return a set of probabilities. That’s what their final layer returns and we decode them using argmax across them. There’s nothing to stop you from using those probabilities directly.
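
To make the point about probabilities concrete, here is a minimal pure-Python sketch (no fastai involved). It turns a final layer's raw outputs into probabilities, decodes them with argmax, and keeps the probabilities around for later use. The class names and scores are made up for illustration, and the helper functions are hand-written rather than library calls.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical class names and raw model outputs for a single image
vocab = ['blast', 'hispa', 'normal']
logits = [2.0, 0.5, 1.0]

probs = softmax(logits)
pred_idx = argmax(probs)

print(vocab[pred_idx])               # the decoded label
print([round(p, 3) for p in probs])  # the probabilities are still available
```

Keeping the probabilities, rather than only the decoded label, is what makes it possible to average the outputs of several models later on.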

The images in the Paddy Classification competition are kind of like the natural images you see in ImageNet, but ImageNet doesn’t have any categories about diseases; its categories are about what the main object in the image is, such as different types of grass or fields. So the task is a bit different from ImageNet, which is what most of our pretrained models are trained on. Nearly all of the images in this competition are the same shape and size.

There are two key dimensions that really seem to impact how well a model can be fine-tuned:

  • How similar is your dataset to the dataset used for the pretrained model?
      • If it’s similar (like PETS to ImageNet), then the critical factor is how well the fine-tuning of the model maintains the pretrained weights. They’re probably not going to change very much, and you can take advantage of really big, accurate models that have learned to do almost the exact same thing that you are trying to do.
      • If it’s not similar (like Planets to ImageNet), a lot of the weights of a pretrained model are going to be useless for fine-tuning because they’ve learned specific features (like what text looks like, what eyeballs look like, what fur looks like), none of which are going to be useful at all.
  • How big is your dataset?
      • On a big dataset, you’ve got time and epochs to take advantage of having lots of parameters in the model and to learn to use them effectively. If you don’t have much data, you don’t have much ability to do that.

Jeremy and Thomas Capelle analyzed which models are the best for fine-tuning, and Jeremy published the results in this notebook. They used YAML files for Weights and Biases to define the different models and parameters that they wanted to test. You can use the wandb web GUI to view the training results. This gist has the results.

You can export a pandas.DataFrame to a StringIO() object which essentially stores the data as a string.

from io import StringIO
strm = StringIO()
df.to_csv(strm, index=False)
txt = strm.getvalue()

You can also create a gist programmatically:

import ghapi.core as gh
g = gh.GhApi()
gist = g.create_gist('name', txt, filename='name.csv', public=True)

# view URL

The vit family of models is particularly good at rapidly identifying features of data types it hasn’t seen before (like medical imaging or satellite imagery). They also have a good error rate with low memory usage. The swin family, also a transformers-based model like vit, was the most accurate for fine-tuning the Planets dataset. For the Planets dataset, the really big, slow models don’t necessarily have better error rates, which makes sense: if they have heaps of parameters but are trying to learn something they’ve never seen before, it’s unlikely that we will be able to take advantage of those parameters.

For some models (like vit_small_patch16_224) you can only use 224x224 image sizes, while with others (like convnext_tiny) you can use any sized images.

Jeremy ran the vision model fine-tuning on 3 RTX GPUs for about 12 hours. They didn’t try all combinations of all parameters. Thomas ran a learning rate sweep to get a sense of what learning rates work well, and then they tried a couple of learning rates, a couple of the best resize methods and a couple of the best pooling types across a few broadly different kinds of models across the two different datasets. In every single case, the same learning rate, resize method and pooling method was the best.

Applying learning on Paddy notebook with small models

Let’s try out some of these models for the paddy classification task to identify which ones’ larger versions we should try training next. We use a fixed validation seed (seed=42) so that the same validation set is created each time we run train. The final patch size in a convnext model is 32x32, so you generally want both sides of the image to be sized in multiples of 32. The correct dimensions for Resize are 640 by 480.
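
The multiple-of-32 guideline is easy to check before kicking off a long training run. The sketch below is plain Python; the helper names `is_multiple_of` and `round_down_to_multiple` are hypothetical and not part of fastai or timm. It tests the sizes used with convnext in this notebook.

```python
def is_multiple_of(size, base=32):
    """True if every side in `size` is divisible by `base`."""
    sides = size if isinstance(size, (tuple, list)) else (size, size)
    return all(side % base == 0 for side in sides)

def round_down_to_multiple(side, base=32):
    """Largest multiple of `base` that is not bigger than `side`."""
    return (side // base) * base

# Sizes passed to Resize / aug_transforms for convnext in this notebook
for size in [(640, 480), (288, 224), (320, 240)]:
    print(size, is_multiple_of(size))

# 240 is not divisible by 32; the nearest smaller multiple is 224
print(round_down_to_multiple(240))
```

Note that (320, 240) does not satisfy the guideline on its second side, while (288, 224) does.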

def train(arch, item, batch, accum=False):
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(4) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
    return error_rate(*learn.tta(dl=dls.valid))
arch = 'convnext_small_in22k'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75))
Downloading: "" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch train_loss valid_loss error_rate time
0 1.069485 0.617368 0.193657 01:13
epoch train_loss valid_loss error_rate time
0 0.513920 0.278679 0.093224 01:31
1 0.369457 0.236429 0.076886 01:33
2 0.344589 0.229747 0.074003 01:30
3 0.259019 0.175089 0.050457 01:28
4 0.224322 0.149210 0.041326 01:28
5 0.176708 0.155431 0.047573 01:28
6 0.128338 0.155574 0.040846 01:28
7 0.096755 0.103420 0.026430 01:28
8 0.083143 0.086435 0.025469 01:28
9 0.053020 0.089935 0.021624 01:28
10 0.038454 0.082519 0.021624 01:31
11 0.041188 0.081926 0.019222 01:32
train(arch, item=Resize(480), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.092957 0.656337 0.207593 01:12
epoch train_loss valid_loss error_rate time
0 0.534311 0.283374 0.097069 01:29
1 0.404589 0.271343 0.091783 01:30
2 0.366122 0.263794 0.077367 01:28
3 0.291584 0.194437 0.056223 01:26
4 0.245451 0.202364 0.058145 01:26
5 0.176800 0.145820 0.043248 01:27
6 0.141820 0.128727 0.038443 01:26
7 0.105305 0.103860 0.029313 01:26
8 0.082278 0.099908 0.024988 01:26
9 0.061129 0.090908 0.020183 01:26
10 0.049765 0.085010 0.017780 01:26
11 0.042815 0.082840 0.018260 01:26
train(arch, item=Resize((640,480)), batch=aug_transforms(size=(288,224), min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.074075 0.577121 0.189332 01:27
epoch train_loss valid_loss error_rate time
0 0.515035 0.284355 0.092263 01:48
1 0.400951 0.292205 0.091783 01:48
2 0.322861 0.263579 0.079769 01:48
3 0.302507 0.182555 0.056223 01:48
4 0.240202 0.166032 0.049015 01:48
5 0.181676 0.171471 0.046132 01:48
6 0.128153 0.124866 0.036040 01:47
7 0.105105 0.111518 0.028352 01:48
8 0.073392 0.093408 0.024988 01:48
9 0.051107 0.083389 0.024027 01:48
10 0.042867 0.083621 0.023066 01:48
11 0.038255 0.084581 0.022585 01:48
train(arch, item=Resize((640,480)), batch=aug_transforms(size=(320,240), min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.043271 0.641115 0.211917 01:40
epoch train_loss valid_loss error_rate time
0 0.481680 0.278677 0.089380 02:02
1 0.364523 0.263106 0.082653 02:02
2 0.349608 0.226119 0.063431 02:02
3 0.297600 0.197567 0.056223 02:01
4 0.221989 0.189447 0.058145 02:01
5 0.160790 0.156223 0.037482 02:02
6 0.120237 0.125078 0.037963 02:02
7 0.092999 0.136008 0.035079 02:01
8 0.070052 0.101822 0.027391 02:01
9 0.051421 0.095571 0.024507 02:01
10 0.037683 0.093875 0.023066 02:01
11 0.040058 0.093482 0.023066 02:01
arch = 'vit_small_patch16_224'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.241114 0.609537 0.202787 01:00
epoch train_loss valid_loss error_rate time
0 0.603312 0.330619 0.102355 01:05
1 0.454617 0.272407 0.090822 01:05
2 0.432220 0.399525 0.128784 01:05
3 0.343562 0.381830 0.123018 01:05
4 0.276432 0.273114 0.068717 01:06
5 0.229089 0.318629 0.077847 01:05
6 0.167870 0.146931 0.033157 01:05
7 0.117221 0.128760 0.037963 01:05
8 0.090773 0.112749 0.031235 01:05
9 0.073209 0.105501 0.028352 01:05
10 0.060867 0.107474 0.027871 01:05
11 0.061845 0.104577 0.028832 01:05
train(arch, item=Resize(480), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.264427 0.745677 0.241711 00:57
epoch train_loss valid_loss error_rate time
0 0.636773 0.356237 0.111485 01:03
1 0.512687 0.324432 0.112926 01:03
2 0.445590 0.373493 0.122537 01:03
3 0.386593 0.335397 0.106679 01:03
4 0.314561 0.262394 0.074003 01:03
5 0.236516 0.197571 0.060067 01:03
6 0.197938 0.153093 0.040846 01:03
7 0.159178 0.132239 0.038924 01:03
8 0.109954 0.117727 0.029313 01:03
9 0.084283 0.104230 0.025469 01:03
10 0.073850 0.100741 0.024988 01:03
11 0.064490 0.098695 0.024988 01:03
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.313841 0.846934 0.269582 01:04
epoch train_loss valid_loss error_rate time
0 0.678171 0.413112 0.135031 01:11
1 0.497201 0.349746 0.111004 01:10
2 0.411814 0.311638 0.098991 01:10
3 0.410544 0.440684 0.128784 01:10
4 0.309415 0.252958 0.070159 01:10
5 0.241980 0.270128 0.073042 01:10
6 0.186923 0.202601 0.056223 01:10
7 0.130820 0.165027 0.043729 01:10
8 0.092804 0.121890 0.030274 01:10
9 0.072829 0.123613 0.029313 01:10
10 0.069157 0.110147 0.029793 01:10
11 0.054325 0.108744 0.026430 01:09
arch = 'swinv2_base_window12_192_22k'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=192, min_scale=0.75))
/opt/conda/lib/python3.10/site-packages/torch/ UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /usr/local/src/pytorch/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Downloading: "" to /root/.cache/torch/hub/checkpoints/swinv2_base_patch4_window12_192_22k.pth
epoch train_loss valid_loss error_rate time
0 1.036088 0.583672 0.193176 02:14
epoch train_loss valid_loss error_rate time
0 0.509049 0.234983 0.078328 02:38
1 0.385443 0.205435 0.070159 02:38
2 0.334598 0.355438 0.089380 02:38
3 0.285663 0.368389 0.106679 02:39
4 0.238095 0.159115 0.045651 02:38
5 0.183420 0.140284 0.041326 02:38
6 0.141127 0.129525 0.036040 02:38
7 0.103826 0.111331 0.029313 02:38
8 0.077789 0.109304 0.027391 02:38
9 0.053972 0.096646 0.022585 02:38
10 0.041229 0.088552 0.021624 02:38
11 0.034090 0.088425 0.021144 02:38
train(arch, item=Resize(480), batch=aug_transforms(size=192, min_scale=0.75), accum=True)
epoch train_loss valid_loss error_rate time
0 1.345018 0.807008 0.224892 02:29
epoch train_loss valid_loss error_rate time
0 0.566454 0.335172 0.117251 03:16
1 0.569964 0.336681 0.125901 03:17
2 0.562002 0.343439 0.118212 03:17
3 0.469339 0.393603 0.124459 03:17
4 0.297434 0.332929 0.090822 03:17
5 0.269842 0.198136 0.051898 03:17
6 0.186959 0.181704 0.054781 03:17
7 0.134943 0.134798 0.036040 03:17
8 0.113144 0.102160 0.030274 03:17
9 0.085017 0.104802 0.025469 03:17
10 0.048129 0.101891 0.022105 03:17
11 0.057491 0.094901 0.022585 03:17
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=192, min_scale=0.75), accum=True)
epoch train_loss valid_loss error_rate time
0 1.316884 1.035790 0.263335 02:35
epoch train_loss valid_loss error_rate time
0 0.617098 0.291554 0.094666 03:22
1 0.603711 0.409637 0.126862 03:23
2 0.573029 0.425025 0.127823 03:23
3 0.401325 0.402042 0.117732 03:23
4 0.340665 0.308467 0.089380 03:23
5 0.236972 0.177212 0.046132 03:23
6 0.212541 0.151314 0.041807 03:23
7 0.099307 0.110350 0.026430 03:23
8 0.054712 0.108030 0.022105 03:23
9 0.051622 0.100666 0.020183 03:22
10 0.032429 0.102271 0.022105 03:22
11 0.031421 0.097009 0.022105 03:22
arch = 'swin_small_patch4_window7_224'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75))
Downloading: "" to /root/.cache/torch/hub/checkpoints/swin_small_patch4_window7_224.pth
epoch train_loss valid_loss error_rate time
0 1.424551 0.834437 0.278712 01:35
epoch train_loss valid_loss error_rate time
0 0.659169 0.377870 0.125420 01:48
1 0.487998 0.293272 0.092263 01:48
2 0.439836 0.344214 0.101874 01:49
3 0.337822 0.243527 0.074964 01:48
4 0.262154 0.199788 0.065353 01:49
5 0.206655 0.129096 0.038924 01:48
6 0.179885 0.116743 0.031716 01:48
7 0.118040 0.118282 0.035079 01:48
8 0.092112 0.114298 0.028832 01:48
9 0.078792 0.105398 0.025949 01:48
10 0.064473 0.097622 0.024027 01:48
11 0.057387 0.097082 0.024027 01:49
train(arch, item=Resize(480), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.420280 0.869214 0.276790 01:34
epoch train_loss valid_loss error_rate time
0 0.727566 0.402595 0.133109 01:47
1 0.549589 0.400400 0.129265 01:48
2 0.440090 0.304687 0.101394 01:48
3 0.397689 0.340592 0.112926 01:48
4 0.288660 0.184638 0.057184 01:48
5 0.246669 0.180551 0.049976 01:47
6 0.189145 0.161568 0.043729 01:48
7 0.151034 0.160868 0.039885 01:48
8 0.110399 0.115093 0.026910 01:48
9 0.084655 0.098188 0.025469 01:48
10 0.070253 0.093308 0.023066 01:48
11 0.064076 0.095348 0.024027 01:48
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.479291 1.005589 0.330610 01:41
epoch train_loss valid_loss error_rate time
0 0.758326 0.441894 0.145123 01:55
1 0.548370 0.436102 0.139356 01:54
2 0.444455 0.361651 0.104277 01:55
3 0.370136 0.280115 0.088419 01:55
4 0.269262 0.184901 0.059106 01:54
5 0.242950 0.177827 0.054781 01:55
6 0.171754 0.153312 0.039404 01:55
7 0.128885 0.118345 0.030754 01:54
8 0.098144 0.103212 0.025949 01:54
9 0.078017 0.098263 0.024988 01:54
10 0.062568 0.092275 0.021624 01:54
11 0.055316 0.091669 0.021624 01:54

I’ll summarize the training run parameters and resulting TTA error rates on the validation set in the following table. I have grouped this table by architecture and sorted each group by ascending TTA Error Rate (First Run).

Architecture | item_tfms | batch_tfms | TTA Error Rate (First Run) | Time per epoch, mm:ss (First Run) | TTA Error Rate (Second Run)
--- | --- | --- | --- | --- | ---
convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(288,224), min_scale=0.75) | 0.0178* | 01:51 | 0.0187
convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(320,240), min_scale=0.75) | 0.0202 | 02:07 | 0.0226
convnext_small_in22k | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0211 | 01:30 | 0.0211
convnext_small_in22k | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 01:29 | 0.0202
vit_small_patch16_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 00:44 | 0.0250
vit_small_patch16_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 00:47 | 0.0245
vit_small_patch16_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0226 | 00:50 | 0.0221
swinv2_base_window12_192_22k | Resize(480, method='squish') | aug_transforms(size=192, min_scale=0.75) | 0.0163* | 02:30 | 0.0173
swinv2_base_window12_192_22k | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=192, min_scale=0.75) | 0.0187 | 03:27 | 0.0192
swinv2_base_window12_192_22k | Resize(480) | aug_transforms(size=192, min_scale=0.75) | 0.0197 | 03:22 | 0.0183
swin_small_patch4_window7_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 01:48 | 0.0207
swin_small_patch4_window7_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0207 | 01:47 | 0.0231
swin_small_patch4_window7_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0221 | 01:54 | 0.0183

* = lowest error rate for the architecture

Preparing an Ensemble for Kaggle Submission

I’ll retrain and create an ensemble of the top 3 models based on TTA Error Rate (First Run):

  • swinv2_base_window12_192_22k (0.0163)
  • convnext_small_in22k (0.0178)
  • vit_small_patch16_224 (0.0202)

The swin_small_patch4_window7_224 models did not outperform the quicker/smaller vit model so I won’t use them in this submission.

Later on in the video, Jeremy walks through an example of how he trained large versions of the small models he tested. In this section, he used the following training function, which I’ll use here for these small models, to prepare my submission predictions. Note that Jeremy has removed seed=42 since in the ensemble for submission we want to use different validation sets when training each model (whereas before we wanted to use the same validation set to better compare the performance between models). I’ve also changed a couple of things (I’m not exporting the models, and I’m using a smaller batch size).

# store the tta predictions in a list
tta_res = []
# run this once and re-use for all trainings
tst_files = get_image_files(path/'test_images')
def train(arch, item, batch, accum=False):
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(2) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
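    # TTA predictions using test dataset
    tst_dl = dls.test_dl(tst_files)
    # learn.tta returns (predictions, targets); targets are None for the test set
    tta_res.append(learn.tta(dl=tst_dl))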
arch = 'swinv2_base_window12_192_22k'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=192, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.093678 0.758215 0.250360 02:12
epoch train_loss valid_loss error_rate time
0 0.472679 0.248962 0.077847 02:36
1 0.383199 0.263211 0.081211 02:36
2 0.360025 0.292500 0.103316 02:36
3 0.305790 0.223976 0.066314 02:36
4 0.232600 0.209275 0.058145 02:36
5 0.185068 0.171094 0.043729 02:36
6 0.134446 0.165977 0.039885 02:36
7 0.108682 0.135310 0.031716 02:36
8 0.074768 0.124852 0.026430 02:36
9 0.052246 0.107549 0.024027 02:36
10 0.040028 0.102177 0.023546 02:36
11 0.038975 0.102109 0.022585 02:36
len(tta_res), len(tta_res[0][0])
(1, 3469)
arch = 'convnext_small_in22k'
train(arch, item=Resize((640,480)), batch=aug_transforms(size=(288,224), min_scale=0.75))
Downloading: "" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch train_loss valid_loss error_rate time
0 1.088028 0.659407 0.192696 01:26
epoch train_loss valid_loss error_rate time
0 0.488645 0.251234 0.082172 01:45
1 0.394844 0.260079 0.086497 01:45
2 0.341203 0.206835 0.065834 01:46
3 0.294899 0.183829 0.057665 01:45
4 0.224933 0.172018 0.045651 01:45
5 0.179294 0.139805 0.037482 01:46
6 0.131405 0.104101 0.027871 01:45
7 0.094273 0.112815 0.031235 01:45
8 0.064216 0.106544 0.029313 01:46
9 0.045855 0.091775 0.021144 01:45
10 0.039155 0.086264 0.021624 01:45
11 0.027725 0.083699 0.020183 01:45
len(tta_res), len(tta_res[0][0]), len(tta_res[1][0])
(2, 3469, 3469)
arch = 'vit_small_patch16_224'
train(arch, item=Resize(480), batch=aug_transforms(size=224, min_scale=0.75))
epoch train_loss valid_loss error_rate time
0 1.258543 0.658905 0.220087 00:56
epoch train_loss valid_loss error_rate time
0 0.630974 0.367167 0.113407 01:02
1 0.496218 0.381497 0.124940 01:03
2 0.424657 0.341580 0.111004 01:02
3 0.381134 0.273908 0.087458 01:02
4 0.326845 0.227150 0.072561 01:02
5 0.253998 0.209598 0.062951 01:02
6 0.179893 0.189200 0.046612 01:02
7 0.146728 0.211501 0.045651 01:02
8 0.113472 0.159040 0.036040 01:02
9 0.076088 0.145309 0.033157 01:02
10 0.068731 0.140491 0.031716 01:02
11 0.059864 0.140173 0.030754 01:02
len(tta_res), len(tta_res[0][0]), len(tta_res[1][0]), len(tta_res[2][0])
(3, 3469, 3469, 3469)

Before I stack the predictions and prepare them for the submission, I’ll save the list of predictions:

save_pickle('/kaggle/working/tta_res.pkl', tta_res)

Next, I’ll take a quick detour and follow the steps Jeremy shares in Live Coding 11.

First, he takes the first item from each element of tta_res (the predictions) and stores them in a variable called tta_prs. Each result returned by learn.tta has a second item of None, which represents the targets (which we don’t have for the test set), so we need to pick out just the first item (the predictions).

Zipping the items in tta_res creates a list of two tuples: a tuple with the three sets of predictions (the first item in each element of tta_res) and a tuple with three Nones (the second item of each element of tta_res).

Here’s a toy example to illustrate:

list(zip(*[[(1), None],[(2), None]]))
[(1, 2), (None, None)]

The first function returns the first element of an iterable object.

Signature: first(x, f=None, negate=False, **kwargs)
def first(x, f=None, negate=False, **kwargs):
    "First element of `x`, optionally filtered by `f`, or None if missing"
    x = iter(x)
    if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
    return next(x, None)
File:      /opt/conda/lib/python3.10/site-packages/fastcore/
Type:      function
first(list(zip(*[[(1), None],[(2), None]])))
(1, 2)

The second element of the zipped tta_res list is a tuple of Nones.

(None, None, None)

I’ll now apply this code to tta_res:

tta_prs = first(zip(*tta_res))

Next, in order to take the mean value of the predictions, we stack them into a tensor:

t_tta = torch.stack(tta_prs)
t_tta.shape
torch.Size([3, 3469, 10])

Then, we take the mean of the three predictions for each of the 10 classes for each image.

avg_pr = t_tta.mean(0)
avg_pr.shape
torch.Size([3469, 10])

We then get the index of the largest probability out of the 10 classes for each image, which is the “prediction” that the model has made for the image.

idxs = avg_pr.argmax(dim=1)
idxs
tensor([7, 8, 3,  ..., 8, 1, 5])
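
The stack, mean and argmax steps above can be sketched without PyTorch to see what happens to the numbers. This is only an illustration with made-up probabilities for 3 models, 2 images and 3 classes; the real tensor in this notebook has shape [3, 3469, 10].

```python
# Made-up probabilities: model_preds[model][image][class]
model_preds = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # model 1
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],  # model 2
    [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],  # model 3
]

n_models = len(model_preds)
n_images = len(model_preds[0])
n_classes = len(model_preds[0][0])

# Mean over the model dimension, analogous to t_tta.mean(0)
avg_pr = [
    [sum(m[i][c] for m in model_preds) / n_models for c in range(n_classes)]
    for i in range(n_images)
]

# Index of the largest averaged probability per image,
# analogous to avg_pr.argmax(dim=1)
idxs = [max(range(n_classes), key=lambda c: row[c]) for row in avg_pr]
print(idxs)
```

In this toy data, model 1 on its own would pick class 1 for the second image, but the averaged probabilities favor class 2, which is the kind of correction an ensemble is meant to provide.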

Finally, we convert those indexes to strings of disease names using the vocab and prepare the submission file:

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=Resize(224))
mapping = dict(enumerate(dls.vocab))
ss = pd.read_csv(path/'sample_submission.csv')
results = pd.Series(idxs.numpy(), name='idxs').map(mapping)
ss.label = results
ss.to_csv('ensemble_subm.csv', index=False)
!head ensemble_subm.csv
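
As a small stand-alone illustration of the index-to-label step, here is the same lookup in plain Python. The three-item vocab and the predicted indexes are made up; in the notebook the vocab comes from dls.vocab and the lookup is done by pandas' Series.map.

```python
# Hypothetical vocab; in the notebook this comes from dls.vocab
vocab = ['blast', 'hispa', 'normal']

# enumerate pairs each label with its position, giving an index -> label dict
mapping = dict(enumerate(vocab))

# Made-up predicted class indexes for four images
idxs = [2, 0, 0, 1]

# The same lookup that Series.map(mapping) performs element by element
labels = [mapping[i] for i in idxs]
print(labels)
```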

Gradient accumulation to prevent out of memory

If you run out of memory while training any of these large models, you can use GradientAccumulation to lower the memory usage. In the training loop we get the gradients, subtract the gradients times the learning rate from the weights, and then zero the gradients. What you could do instead is halve the batch size, for example from 64 to 32, and then only do the update and zero the gradients every two iterations. You then calculate in two batches what you used to calculate in one batch, and the result is mathematically identical. That’s called gradient accumulation. In fastai, GradientAccumulation is added to the Learner as a callback; callbacks are things that change the behavior of the training.
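
The equivalence described above can be checked on a tiny example. This is a hand-written sketch of the idea for a one-parameter model with a squared-error loss, not fastai's GradientAccumulation implementation; the data, starting weight and learning rate are made up.

```python
def grad_sum(w, batch):
    """Sum of d/dw (w*x - y)^2 over the (x, y) pairs in `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch)

# Made-up data and starting weight for a one-parameter model y ≈ w * x
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr = 0.5, 0.01

# (a) One update computed from a single batch of 4 items
full_grad = grad_sum(w, data) / len(data)
w_full = w - lr * full_grad

# (b) Two half-size batches: accumulate the gradients and only
#     update (and zero) after the second one
accum = 0.0
for half in (data[:2], data[2:]):
    accum += grad_sum(w, half)  # weights are NOT updated between the halves
accum_grad = accum / len(data)
w_accum = w - lr * accum_grad

print(w_full, w_accum)
```

Both routes end up at the same weight (up to floating-point rounding), while route (b) only ever needs to hold half as many items in memory at once.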

How batches work: we randomly shuffle the dataset, grab the next batch-size worth of images, resize them all to be the same size, and stack them on top of each other. If they’re black and white images, for example, we would have 64 (or whatever the batch size is) images of 640 x 480 (or whatever image size you want), so we end up with a 64 x 640 x 480 tensor. Pretty much all of the functionality provided by PyTorch will work fine for a mini-batch of things just as it would for a single thing.
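
To see how the batch dimension ends up in front, here is a plain-Python sketch that uses nested lists in place of tensors. The `shape` and `blank_image` helpers are hypothetical, and the sizes are tiny stand-ins for a batch size of 64 and 640 x 480 images.

```python
def shape(nested):
    """Shape of a regularly nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def blank_image(height, width):
    """A made-up grayscale image: `height` rows of `width` pixel values."""
    return [[0.0] * width for _ in range(height)]

# Tiny stand-ins for bs=64 and 640 x 480 so the example stays fast
bs, height, width = 4, 6, 8
images = [blank_image(height, width) for _ in range(bs)]

# "Stacking" equally sized images adds a new leading batch dimension
batch = list(images)

print(shape(images[0]))  # one image: (height, width)
print(shape(batch))      # one mini-batch: (bs, height, width)
```

The stacking only works because every image was first resized to the same height and width.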

Inference is often done on CPU instead of GPU since you only need to process one thing at a time. Or people will queue a few of them up and stick them on a GPU.

In my next blog post I walk through the discussion and code from Live Coding 11.