!pip install -qq timm==0.6.13
import timm
timm.__version__
'0.6.13'
Vishal Bakshi
February 5, 2024
In the fastai course Part 1 Lesson 6 video Jeremy Howard walked through the notebooks First Steps: Road to the Top, Part 1 and Small models: Road to the Top, Part 2 where he builds increasingly accurate solutions to the Paddy Doctor: Paddy Disease Classification Kaggle Competition. In the video, Jeremy referenced a series of walkthrough videos that he made while working through the four-notebook series for this competition. I’m excited to watch these walkthroughs to better understand how to approach a Kaggle competition from the perspective of a former #1 Kaggle grandmaster.
In this blog post series, I’ll walk through the code Jeremy shared in each of the 6 Live Coding videos focused on this competition, submitting predictions to Kaggle along the way. My last two blog posts in this series reference Jeremy’s Scaling Up: Road to the Top, Part 3 notebook to improve my large model ensemble predictions. Here are the links to each of the blog posts in this series:
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle
from fastkaggle import *
from fastai.vision.all import *
comp = 'paddy-disease-classification'
path = setup_comp(comp, install='fastai')
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
(#4) [Path('../input/paddy-disease-classification/sample_submission.csv'),Path('../input/paddy-disease-classification/train_images'),Path('../input/paddy-disease-classification/train.csv'),Path('../input/paddy-disease-classification/test_images')]
All deep learning models return a set of probabilities. That's what their final layer produces, and we decode them by taking the argmax across them. There's nothing to stop you from using those probabilities directly.
The images in the Paddy Disease Classification competition are kind of like the natural images you see in ImageNet, but ImageNet doesn't have any categories about diseases; its categories are about what the main object in the image is, such as different types of grass or fields. So it's a bit different from ImageNet, which is what most of our pretrained models are trained on. Nearly all of the images in this competition are the same shape and size.
There are two key dimensions that really seem to impact how well a model can be fine-tuned:

- How similar is your dataset to the dataset used for the pretrained model?
  - If it's similar (like PETS is to ImageNet), then the critical factor is how well the fine-tuning maintains the pretrained weights. They're probably not going to change very much, and you can take advantage of really big, accurate models that have learned to do almost exactly what you're trying to do.
  - If it's not similar (like Planets to ImageNet), a lot of the weights of the pretrained model are going to be useless for fine-tuning, because they've learned specific features (like what does text look like, what do eyeballs look like, what does fur look like), none of which are going to be useful at all.
- How big is your dataset?
  - On a big dataset, you've got the time and epochs to take advantage of a model with lots of parameters and learn to use them effectively. If you don't have much data, you don't have much ability to do that.
Jeremy and Thomas Capelle analyzed which models are the best for fine-tuning, and Jeremy published the results in this notebook. They used YAML files with Weights and Biases to define the different models and parameters they wanted to test. You can use the wandb web GUI to view the training results; this gist has the results.
You can export a pandas.DataFrame to a StringIO() object, which essentially stores the data as a string.
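Here's a minimal sketch of that pattern (the DataFrame contents are made up); the resulting txt string is what gets passed to create_gist below:

```python
import pandas as pd
from io import StringIO

# toy results table standing in for the real wandb export
df = pd.DataFrame({'model': ['convnext_small', 'vit_small'], 'error_rate': [0.021, 0.025]})

buf = StringIO()
df.to_csv(buf, index=False)  # write the CSV into the in-memory buffer
txt = buf.getvalue()         # the whole DataFrame as one string
print(txt)
```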
You can also create a gist programmatically:
import ghapi.core as gh
g = gh.GhApi()
# create a public gist named 'name.csv' containing the CSV text
gist = g.create_gist('name', txt, filename='name.csv', public=True)
# view URL
gist.html_url
The vit family of models is particularly good at rapidly identifying features of data types it hasn't seen before (like medical imaging or satellite imagery). They also have good error rates with low memory usage. The swin family, also a transformer-based model like vit, was the most accurate for fine-tuning on the Planets dataset. For the Planets dataset, the really big slow models don't necessarily have better error rates, which makes sense: if they have heaps of parameters but they're trying to learn something they've never seen before, it's unlikely we will be able to take advantage of those parameters.
For some models (like vit_small_patch16_224) you can only use 224x224 image sizes, while with others (like convnext_tiny) you can use any sized images.
Jeremy ran the vision model fine-tuning on 3 RTX GPUs for about 12 hours. They didn’t try all combinations of all parameters. Thomas ran a learning rate sweep to get a sense of what learning rates work well, and then they tried a couple of learning rates, a couple of the best resize methods and a couple of the best pooling types across a few broadly different kinds of models across the two different datasets. In every single case, the same learning rate, resize method and pooling method was the best.
Let's try out some of these models on the paddy classification task to identify which models' larger versions we should try training next. We use a fixed validation seed (seed=42) so that the same validation set is created each time we run train. A convnext model downsamples its input by a factor of 32 by its final layer, so you generally want both sides of the image to be multiples of 32. The correct dimensions for Resize are 640 by 480.
def train(arch, item, batch, accum=False):
    # use a smaller batch size when gradient accumulation is requested
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(4) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
    # report the TTA error rate on the validation set
    return error_rate(*learn.tta(dl=dls.valid))
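Not every cell that kicked off a run is reproduced below; as an illustration, the convnext_small_in22k run with a squish resize (the first set of results shown) would be launched with something like:

```python
arch = 'convnext_small_in22k'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75))
```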
Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.069485 | 0.617368 | 0.193657 | 01:13 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.513920 | 0.278679 | 0.093224 | 01:31 |
1 | 0.369457 | 0.236429 | 0.076886 | 01:33 |
2 | 0.344589 | 0.229747 | 0.074003 | 01:30 |
3 | 0.259019 | 0.175089 | 0.050457 | 01:28 |
4 | 0.224322 | 0.149210 | 0.041326 | 01:28 |
5 | 0.176708 | 0.155431 | 0.047573 | 01:28 |
6 | 0.128338 | 0.155574 | 0.040846 | 01:28 |
7 | 0.096755 | 0.103420 | 0.026430 | 01:28 |
8 | 0.083143 | 0.086435 | 0.025469 | 01:28 |
9 | 0.053020 | 0.089935 | 0.021624 | 01:28 |
10 | 0.038454 | 0.082519 | 0.021624 | 01:31 |
11 | 0.041188 | 0.081926 | 0.019222 | 01:32 |
TensorBase(0.0211)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.092957 | 0.656337 | 0.207593 | 01:12 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.534311 | 0.283374 | 0.097069 | 01:29 |
1 | 0.404589 | 0.271343 | 0.091783 | 01:30 |
2 | 0.366122 | 0.263794 | 0.077367 | 01:28 |
3 | 0.291584 | 0.194437 | 0.056223 | 01:26 |
4 | 0.245451 | 0.202364 | 0.058145 | 01:26 |
5 | 0.176800 | 0.145820 | 0.043248 | 01:27 |
6 | 0.141820 | 0.128727 | 0.038443 | 01:26 |
7 | 0.105305 | 0.103860 | 0.029313 | 01:26 |
8 | 0.082278 | 0.099908 | 0.024988 | 01:26 |
9 | 0.061129 | 0.090908 | 0.020183 | 01:26 |
10 | 0.049765 | 0.085010 | 0.017780 | 01:26 |
11 | 0.042815 | 0.082840 | 0.018260 | 01:26 |
TensorBase(0.0202)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.074075 | 0.577121 | 0.189332 | 01:27 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.515035 | 0.284355 | 0.092263 | 01:48 |
1 | 0.400951 | 0.292205 | 0.091783 | 01:48 |
2 | 0.322861 | 0.263579 | 0.079769 | 01:48 |
3 | 0.302507 | 0.182555 | 0.056223 | 01:48 |
4 | 0.240202 | 0.166032 | 0.049015 | 01:48 |
5 | 0.181676 | 0.171471 | 0.046132 | 01:48 |
6 | 0.128153 | 0.124866 | 0.036040 | 01:47 |
7 | 0.105105 | 0.111518 | 0.028352 | 01:48 |
8 | 0.073392 | 0.093408 | 0.024988 | 01:48 |
9 | 0.051107 | 0.083389 | 0.024027 | 01:48 |
10 | 0.042867 | 0.083621 | 0.023066 | 01:48 |
11 | 0.038255 | 0.084581 | 0.022585 | 01:48 |
TensorBase(0.0187)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.043271 | 0.641115 | 0.211917 | 01:40 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.481680 | 0.278677 | 0.089380 | 02:02 |
1 | 0.364523 | 0.263106 | 0.082653 | 02:02 |
2 | 0.349608 | 0.226119 | 0.063431 | 02:02 |
3 | 0.297600 | 0.197567 | 0.056223 | 02:01 |
4 | 0.221989 | 0.189447 | 0.058145 | 02:01 |
5 | 0.160790 | 0.156223 | 0.037482 | 02:02 |
6 | 0.120237 | 0.125078 | 0.037963 | 02:02 |
7 | 0.092999 | 0.136008 | 0.035079 | 02:01 |
8 | 0.070052 | 0.101822 | 0.027391 | 02:01 |
9 | 0.051421 | 0.095571 | 0.024507 | 02:01 |
10 | 0.037683 | 0.093875 | 0.023066 | 02:01 |
11 | 0.040058 | 0.093482 | 0.023066 | 02:01 |
TensorBase(0.0226)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.241114 | 0.609537 | 0.202787 | 01:00 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.603312 | 0.330619 | 0.102355 | 01:05 |
1 | 0.454617 | 0.272407 | 0.090822 | 01:05 |
2 | 0.432220 | 0.399525 | 0.128784 | 01:05 |
3 | 0.343562 | 0.381830 | 0.123018 | 01:05 |
4 | 0.276432 | 0.273114 | 0.068717 | 01:06 |
5 | 0.229089 | 0.318629 | 0.077847 | 01:05 |
6 | 0.167870 | 0.146931 | 0.033157 | 01:05 |
7 | 0.117221 | 0.128760 | 0.037963 | 01:05 |
8 | 0.090773 | 0.112749 | 0.031235 | 01:05 |
9 | 0.073209 | 0.105501 | 0.028352 | 01:05 |
10 | 0.060867 | 0.107474 | 0.027871 | 01:05 |
11 | 0.061845 | 0.104577 | 0.028832 | 01:05 |
TensorBase(0.0245)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.264427 | 0.745677 | 0.241711 | 00:57 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.636773 | 0.356237 | 0.111485 | 01:03 |
1 | 0.512687 | 0.324432 | 0.112926 | 01:03 |
2 | 0.445590 | 0.373493 | 0.122537 | 01:03 |
3 | 0.386593 | 0.335397 | 0.106679 | 01:03 |
4 | 0.314561 | 0.262394 | 0.074003 | 01:03 |
5 | 0.236516 | 0.197571 | 0.060067 | 01:03 |
6 | 0.197938 | 0.153093 | 0.040846 | 01:03 |
7 | 0.159178 | 0.132239 | 0.038924 | 01:03 |
8 | 0.109954 | 0.117727 | 0.029313 | 01:03 |
9 | 0.084283 | 0.104230 | 0.025469 | 01:03 |
10 | 0.073850 | 0.100741 | 0.024988 | 01:03 |
11 | 0.064490 | 0.098695 | 0.024988 | 01:03 |
TensorBase(0.0250)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.313841 | 0.846934 | 0.269582 | 01:04 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.678171 | 0.413112 | 0.135031 | 01:11 |
1 | 0.497201 | 0.349746 | 0.111004 | 01:10 |
2 | 0.411814 | 0.311638 | 0.098991 | 01:10 |
3 | 0.410544 | 0.440684 | 0.128784 | 01:10 |
4 | 0.309415 | 0.252958 | 0.070159 | 01:10 |
5 | 0.241980 | 0.270128 | 0.073042 | 01:10 |
6 | 0.186923 | 0.202601 | 0.056223 | 01:10 |
7 | 0.130820 | 0.165027 | 0.043729 | 01:10 |
8 | 0.092804 | 0.121890 | 0.030274 | 01:10 |
9 | 0.072829 | 0.123613 | 0.029313 | 01:10 |
10 | 0.069157 | 0.110147 | 0.029793 | 01:10 |
11 | 0.054325 | 0.108744 | 0.026430 | 01:09 |
TensorBase(0.0221)
/opt/conda/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /usr/local/src/pytorch/aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Downloading: "https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth" to /root/.cache/torch/hub/checkpoints/swinv2_base_patch4_window12_192_22k.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.036088 | 0.583672 | 0.193176 | 02:14 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.509049 | 0.234983 | 0.078328 | 02:38 |
1 | 0.385443 | 0.205435 | 0.070159 | 02:38 |
2 | 0.334598 | 0.355438 | 0.089380 | 02:38 |
3 | 0.285663 | 0.368389 | 0.106679 | 02:39 |
4 | 0.238095 | 0.159115 | 0.045651 | 02:38 |
5 | 0.183420 | 0.140284 | 0.041326 | 02:38 |
6 | 0.141127 | 0.129525 | 0.036040 | 02:38 |
7 | 0.103826 | 0.111331 | 0.029313 | 02:38 |
8 | 0.077789 | 0.109304 | 0.027391 | 02:38 |
9 | 0.053972 | 0.096646 | 0.022585 | 02:38 |
10 | 0.041229 | 0.088552 | 0.021624 | 02:38 |
11 | 0.034090 | 0.088425 | 0.021144 | 02:38 |
TensorBase(0.0173)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.345018 | 0.807008 | 0.224892 | 02:29 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.566454 | 0.335172 | 0.117251 | 03:16 |
1 | 0.569964 | 0.336681 | 0.125901 | 03:17 |
2 | 0.562002 | 0.343439 | 0.118212 | 03:17 |
3 | 0.469339 | 0.393603 | 0.124459 | 03:17 |
4 | 0.297434 | 0.332929 | 0.090822 | 03:17 |
5 | 0.269842 | 0.198136 | 0.051898 | 03:17 |
6 | 0.186959 | 0.181704 | 0.054781 | 03:17 |
7 | 0.134943 | 0.134798 | 0.036040 | 03:17 |
8 | 0.113144 | 0.102160 | 0.030274 | 03:17 |
9 | 0.085017 | 0.104802 | 0.025469 | 03:17 |
10 | 0.048129 | 0.101891 | 0.022105 | 03:17 |
11 | 0.057491 | 0.094901 | 0.022585 | 03:17 |
TensorBase(0.0183)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=192, min_scale=0.75), accum=True)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.316884 | 1.035790 | 0.263335 | 02:35 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.617098 | 0.291554 | 0.094666 | 03:22 |
1 | 0.603711 | 0.409637 | 0.126862 | 03:23 |
2 | 0.573029 | 0.425025 | 0.127823 | 03:23 |
3 | 0.401325 | 0.402042 | 0.117732 | 03:23 |
4 | 0.340665 | 0.308467 | 0.089380 | 03:23 |
5 | 0.236972 | 0.177212 | 0.046132 | 03:23 |
6 | 0.212541 | 0.151314 | 0.041807 | 03:23 |
7 | 0.099307 | 0.110350 | 0.026430 | 03:23 |
8 | 0.054712 | 0.108030 | 0.022105 | 03:23 |
9 | 0.051622 | 0.100666 | 0.020183 | 03:22 |
10 | 0.032429 | 0.102271 | 0.022105 | 03:22 |
11 | 0.031421 | 0.097009 | 0.022105 | 03:22 |
TensorBase(0.0192)
Downloading: "https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth" to /root/.cache/torch/hub/checkpoints/swin_small_patch4_window7_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.424551 | 0.834437 | 0.278712 | 01:35 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.659169 | 0.377870 | 0.125420 | 01:48 |
1 | 0.487998 | 0.293272 | 0.092263 | 01:48 |
2 | 0.439836 | 0.344214 | 0.101874 | 01:49 |
3 | 0.337822 | 0.243527 | 0.074964 | 01:48 |
4 | 0.262154 | 0.199788 | 0.065353 | 01:49 |
5 | 0.206655 | 0.129096 | 0.038924 | 01:48 |
6 | 0.179885 | 0.116743 | 0.031716 | 01:48 |
7 | 0.118040 | 0.118282 | 0.035079 | 01:48 |
8 | 0.092112 | 0.114298 | 0.028832 | 01:48 |
9 | 0.078792 | 0.105398 | 0.025949 | 01:48 |
10 | 0.064473 | 0.097622 | 0.024027 | 01:48 |
11 | 0.057387 | 0.097082 | 0.024027 | 01:49 |
TensorBase(0.0207)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.420280 | 0.869214 | 0.276790 | 01:34 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.727566 | 0.402595 | 0.133109 | 01:47 |
1 | 0.549589 | 0.400400 | 0.129265 | 01:48 |
2 | 0.440090 | 0.304687 | 0.101394 | 01:48 |
3 | 0.397689 | 0.340592 | 0.112926 | 01:48 |
4 | 0.288660 | 0.184638 | 0.057184 | 01:48 |
5 | 0.246669 | 0.180551 | 0.049976 | 01:47 |
6 | 0.189145 | 0.161568 | 0.043729 | 01:48 |
7 | 0.151034 | 0.160868 | 0.039885 | 01:48 |
8 | 0.110399 | 0.115093 | 0.026910 | 01:48 |
9 | 0.084655 | 0.098188 | 0.025469 | 01:48 |
10 | 0.070253 | 0.093308 | 0.023066 | 01:48 |
11 | 0.064076 | 0.095348 | 0.024027 | 01:48 |
TensorBase(0.0231)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.479291 | 1.005589 | 0.330610 | 01:41 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.758326 | 0.441894 | 0.145123 | 01:55 |
1 | 0.548370 | 0.436102 | 0.139356 | 01:54 |
2 | 0.444455 | 0.361651 | 0.104277 | 01:55 |
3 | 0.370136 | 0.280115 | 0.088419 | 01:55 |
4 | 0.269262 | 0.184901 | 0.059106 | 01:54 |
5 | 0.242950 | 0.177827 | 0.054781 | 01:55 |
6 | 0.171754 | 0.153312 | 0.039404 | 01:55 |
7 | 0.128885 | 0.118345 | 0.030754 | 01:54 |
8 | 0.098144 | 0.103212 | 0.025949 | 01:54 |
9 | 0.078017 | 0.098263 | 0.024988 | 01:54 |
10 | 0.062568 | 0.092275 | 0.021624 | 01:54 |
11 | 0.055316 | 0.091669 | 0.021624 | 01:54 |
TensorBase(0.0183)
I'll summarize the training run parameters and resulting TTA error rates on the validation set in the following table, grouped by architecture and sorted within each group by ascending TTA Error Rate (First Run).
| Architecture | item_tfms | batch_tfms | TTA Error Rate (First Run) | Minutes per epoch (First Run) | TTA Error Rate (Second Run) |
|---|---|---|---|---|---|
| convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(288,224), min_scale=0.75) | 0.0178* | 01:51 | 0.0187 |
| convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(320,240), min_scale=0.75) | 0.0202 | 02:07 | 0.0226 |
| convnext_small_in22k | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0211 | 01:30 | 0.0211 |
| convnext_small_in22k | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 01:29 | 0.0202 |
| vit_small_patch16_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 00:44 | 0.0250 |
| vit_small_patch16_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 00:47 | 0.0245 |
| vit_small_patch16_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0226 | 00:50 | 0.0221 |
| swinv2_base_window12_192_22k | Resize(480, method='squish') | aug_transforms(size=192, min_scale=0.75) | 0.0163* | 02:30 | 0.0173 |
| swinv2_base_window12_192_22k | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=192, min_scale=0.75) | 0.0187 | 03:27 | 0.0192 |
| swinv2_base_window12_192_22k | Resize(480) | aug_transforms(size=192, min_scale=0.75) | 0.0197 | 03:22 | 0.0183 |
| swin_small_patch4_window7_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 01:48 | 0.0207 |
| swin_small_patch4_window7_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0207 | 01:47 | 0.0231 |
| swin_small_patch4_window7_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0221 | 01:54 | 0.0183 |
* = lowest error rate for the architecture
I’ll retrain and create an ensemble of the top 3 models based on TTA Error Rate (First Run):
The swin_small_patch4_window7_224 models did not outperform the quicker/smaller vit model so I won’t use them in this submission.
Later on in the video, Jeremy walks through an example of how he trained large versions of the small models he tested. In this section, he used the following training function, which I'll use here for these small models to prepare my submission predictions. Note that Jeremy has removed seed=42, since for the submission ensemble we want to use a different validation set when training each model (whereas before we wanted the same validation set to better compare the performance between models). I've also changed a couple of things (I'm not exporting the models, and I'm using a smaller batch size).
def train(arch, item, batch, accum=False):
    # no fixed seed here: each ensemble member gets a different validation split
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(2) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
    # TTA predictions on the test set, appended to the tta_res list defined outside this function
    tst_dl = dls.test_dl(tst_files)
    tta_res.append(learn.tta(dl=tst_dl))
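The three ensemble members are then trained with the top-3 configurations from the summary table above. The exact cells aren't reproduced here, so treat the calls below as a sketch: tst_files and tta_res are assumed to be defined before calling train, and using accum=True for the swinv2 model is my guess based on the earlier runs.

```python
tst_files = get_image_files(path/'test_images').sorted()
tta_res = []

train('swinv2_base_window12_192_22k', item=Resize(480, method='squish'),
      batch=aug_transforms(size=192, min_scale=0.75), accum=True)
train('convnext_small_in22k', item=Resize((640,480)),
      batch=aug_transforms(size=(288,224), min_scale=0.75))
train('vit_small_patch16_224', item=Resize(480),
      batch=aug_transforms(size=224, min_scale=0.75))
```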
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.093678 | 0.758215 | 0.250360 | 02:12 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.472679 | 0.248962 | 0.077847 | 02:36 |
1 | 0.383199 | 0.263211 | 0.081211 | 02:36 |
2 | 0.360025 | 0.292500 | 0.103316 | 02:36 |
3 | 0.305790 | 0.223976 | 0.066314 | 02:36 |
4 | 0.232600 | 0.209275 | 0.058145 | 02:36 |
5 | 0.185068 | 0.171094 | 0.043729 | 02:36 |
6 | 0.134446 | 0.165977 | 0.039885 | 02:36 |
7 | 0.108682 | 0.135310 | 0.031716 | 02:36 |
8 | 0.074768 | 0.124852 | 0.026430 | 02:36 |
9 | 0.052246 | 0.107549 | 0.024027 | 02:36 |
10 | 0.040028 | 0.102177 | 0.023546 | 02:36 |
11 | 0.038975 | 0.102109 | 0.022585 | 02:36 |
Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.088028 | 0.659407 | 0.192696 | 01:26 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.488645 | 0.251234 | 0.082172 | 01:45 |
1 | 0.394844 | 0.260079 | 0.086497 | 01:45 |
2 | 0.341203 | 0.206835 | 0.065834 | 01:46 |
3 | 0.294899 | 0.183829 | 0.057665 | 01:45 |
4 | 0.224933 | 0.172018 | 0.045651 | 01:45 |
5 | 0.179294 | 0.139805 | 0.037482 | 01:46 |
6 | 0.131405 | 0.104101 | 0.027871 | 01:45 |
7 | 0.094273 | 0.112815 | 0.031235 | 01:45 |
8 | 0.064216 | 0.106544 | 0.029313 | 01:46 |
9 | 0.045855 | 0.091775 | 0.021144 | 01:45 |
10 | 0.039155 | 0.086264 | 0.021624 | 01:45 |
11 | 0.027725 | 0.083699 | 0.020183 | 01:45 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.258543 | 0.658905 | 0.220087 | 00:56 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.630974 | 0.367167 | 0.113407 | 01:02 |
1 | 0.496218 | 0.381497 | 0.124940 | 01:03 |
2 | 0.424657 | 0.341580 | 0.111004 | 01:02 |
3 | 0.381134 | 0.273908 | 0.087458 | 01:02 |
4 | 0.326845 | 0.227150 | 0.072561 | 01:02 |
5 | 0.253998 | 0.209598 | 0.062951 | 01:02 |
6 | 0.179893 | 0.189200 | 0.046612 | 01:02 |
7 | 0.146728 | 0.211501 | 0.045651 | 01:02 |
8 | 0.113472 | 0.159040 | 0.036040 | 01:02 |
9 | 0.076088 | 0.145309 | 0.033157 | 01:02 |
10 | 0.068731 | 0.140491 | 0.031716 | 01:02 |
11 | 0.059864 | 0.140173 | 0.030754 | 01:02 |
Before I stack the predictions and prepare them for the submission, I’ll save the list of predictions:
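A sketch of that step using fastcore's save_pickle (the filename is my choice):

```python
# save_pickle is provided by fastcore and is available after the fastai star import
save_pickle('tta_res.pkl', tta_res)
```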
Next, I’ll take a quick detour and follow the steps Jeremy shares in Live Coding 11.
First, he takes the first item from each element of tta_res (the predictions) and stores them in a list called tta_prs. The tuple returned by learn.tta has a second item of None, which represents the targets (which we don't have for the test set), so we need to pick out just the first item (the predictions).
zipping the items in tta_res creates a list of two tuples: a tuple with the three sets of predictions (the first item in each element of tta_res) and a tuple with three Nones (the second item of each element of tta_res).
Here’s a toy example to illustrate:
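(The original cell isn't shown here, so this stands in for it, using made-up prediction placeholders.)

```python
from fastcore.basics import first

# three fake (preds, targs) pairs standing in for the real tta_res
toy_res = [('preds_a', None), ('preds_b', None), ('preds_c', None)]

list(zip(*toy_res))    # [('preds_a', 'preds_b', 'preds_c'), (None, None, None)]
first(zip(*toy_res))   # ('preds_a', 'preds_b', 'preds_c')
```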
The first function returns the first element of an iterable object.
Signature: first(x, f=None, negate=False, **kwargs)
Source:
def first(x, f=None, negate=False, **kwargs):
    "First element of `x`, optionally filtered by `f`, or None if missing"
    x = iter(x)
    if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
    return next(x, None)
File: /opt/conda/lib/python3.10/site-packages/fastcore/basics.py
Type: function
The second element of the zipped tta_res list is a tuple of Nones.
I'll now apply this code to tta_res:
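A sketch of that cell (tta_prs matches the variable name used above):

```python
tta_prs = first(zip(*tta_res))  # tuple of 3 prediction tensors; the Nones are discarded
```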
Next, in order to take the mean value of the predictions, we stack them into a tensor:
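A sketch of the stacking step (t_tta is an illustrative name):

```python
t_tta = torch.stack(tta_prs)  # shape: (3 models, n_test_images, 10 classes)
```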
Then, we take the mean of the three predictions for each of the 10 classes for each image.
We then get the index of the largest probability out of the 10 classes for each image, which is the “prediction” that the model has made for the image.
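Those two steps look roughly like this, continuing from t_tta above:

```python
avg_pr = t_tta.mean(0)        # average the three models' probabilities: (n_test_images, 10)
idxs = avg_pr.argmax(dim=1)   # index of the most probable class per image
```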
Finally, we convert those indexes to strings of disease names using the vocab and prepare the submission file:
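And a sketch of that final step, following the pattern in Jeremy's Scaling Up notebook. It assumes a dls built from the training folder is in scope (any of the DataLoaders above share the same vocab), and the submission filename is my choice:

```python
vocab = np.array(dls.vocab)                     # class index -> disease name
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = vocab[idxs]                       # map each predicted index to its name
ss.to_csv('subm.csv', index=False)
```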
If you run out of memory while training any of these large models, you can use GradientAccumulation to lower the memory usage. In the training loop we get the gradients, we update the weights by the gradients times the learning rate, and then we zero the gradients. What you could do is halve the batch size, say from 64 to 32, and then only do the update and zero the gradients every two iterations. So you calculate in two batches what you used to calculate in one batch, and it will be mathematically identical. That's called GradientAccumulation, which is added to the Learner as a callback; callbacks are things that change the behavior of training.
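Here's a minimal sketch of that idea with fastai's GradientAccumulation callback. The callback's argument is the effective number of samples to accumulate before each weight update, so pairing bs=32 with GradientAccumulation(64) reproduces the "halve the batch size, step every two batches" example (the values and architecture here are illustrative):

```python
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, bs=32,  # halved physical batch size
                                   item_tfms=Resize(480, method='squish'),
                                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()  # step once per 64 samples
learn.fine_tune(12, 0.01)
```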
How batches work: we randomly shuffle the dataset, grab the next batch-size worth of images, resize them all to be the same size, and stack them on top of each other. If it's black-and-white images, for example, we would have 64 (or whatever the batch size is) 640x480 (or whatever image size you want) images, so we end up with a 64x640x480 tensor. Pretty much all of the functionality provided by PyTorch will work fine for a mini-batch of things just as it would for a single thing.
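A toy illustration of that stacking (made-up random "images"):

```python
import torch

# 64 grayscale images, already resized to 640x480
imgs = [torch.rand(640, 480) for _ in range(64)]
batch = torch.stack(imgs)
batch.shape  # torch.Size([64, 640, 480])
```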
Inference is often done on CPU instead of GPU since you only need to process one thing at a time. Or people will queue a few of them up and stick them on a GPU.
In my next blog post I walk through the discussion and code from Live Coding 11.