!pip install -qq timm==0.6.13
import timm
timm.__version__
'0.6.13'
Vishal Bakshi
February 5, 2024
In the fastai course Part 1 Lesson 6 video Jeremy Howard walked through the notebooks First Steps: Road to the Top, Part 1 and Small models: Road to the Top, Part 2 where he builds increasingly accurate solutions to the Paddy Doctor: Paddy Disease Classification Kaggle Competition. In the video, Jeremy referenced a series of walkthrough videos that he made while working through the four-notebook series for this competition. I’m excited to watch these walkthroughs to better understand how to approach a Kaggle competition from the perspective of a former #1 Kaggle grandmaster.
In this blog post series, I’ll walk through the code Jeremy shared in each of the 6 Live Coding videos focused on this competition, submitting predictions to Kaggle along the way. My last two blog posts in this series reference Jeremy’s Scaling Up: Road to the Top, Part 3 notebook to improve my large model ensemble predictions. Here are the links to each of the blog posts in this series:
# install fastkaggle if not available
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle
from fastkaggle import *
from fastai.vision.all import *
comp = 'paddy-disease-classification'
path = setup_comp(comp, install='fastai')
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
(#4) [Path('../input/paddy-disease-classification/sample_submission.csv'),Path('../input/paddy-disease-classification/train_images'),Path('../input/paddy-disease-classification/train.csv'),Path('../input/paddy-disease-classification/test_images')]
All deep learning models return a set of probabilities. That's what their final layer produces, and we decode them by taking the argmax across them. There's nothing to stop you from using those probabilities directly.
The images in the Paddy Disease Classification competition are kind of like the natural images you see in ImageNet, but ImageNet doesn't have any categories about diseases; its categories are about what the main object in the image is, such as different types of grass or fields. So it's a bit different from ImageNet, which is what most of our pretrained models are trained on. Nearly all of the images in this competition are the same shape and size.
There are two key dimensions that really seem to impact how well a model can be fine-tuned:

- How similar is your dataset to the dataset used for the pretrained model?
  - If it's similar (like PETS is to ImageNet), then the critical factor is how well the fine-tuning maintains the pretrained weights. They're probably not going to change very much, and you can take advantage of really big, accurate models that have learned to do almost exactly what you're trying to do.
  - If it's not similar (like Planets to ImageNet), a lot of the weights of the pretrained model are going to be useless for fine-tuning, because they've learned specific features (like what does text look like, what do eyeballs look like, what does fur look like), none of which are going to be useful at all.
- How big is your dataset?
  - On a big dataset, you've got the time and epochs to take advantage of a model with lots of parameters and learn to use them effectively. If you don't have much data, you don't have much ability to do that.
Jeremy and Thomas Capelle analyzed which models are the best for fine-tuning, and Jeremy published the results in this notebook. They used YAML files with Weights and Biases to define the different models and parameters they wanted to test. You can use the wandb web GUI to view the training results; this gist has the results.
You can export a pandas.DataFrame to a StringIO() object, which essentially stores the data as a string.
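Here's a minimal sketch of that pattern (the DataFrame contents are made up); the resulting txt string is what gets passed to create_gist below:

```python
import pandas as pd
from io import StringIO

# toy results table standing in for the real wandb export
df = pd.DataFrame({'model': ['convnext_small', 'vit_small'], 'error_rate': [0.021, 0.025]})

buf = StringIO()
df.to_csv(buf, index=False)  # write the CSV into the in-memory buffer
txt = buf.getvalue()         # the whole DataFrame as one string
print(txt)
```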
You can also create a gist programmatically:
import ghapi.core as gh
g = gh.GhApi()
# create a public gist named 'name.csv' containing the CSV text
gist = g.create_gist('name', txt, filename='name.csv', public=True)
# view URL
gist.html_url
The vit family of models is particularly good at rapidly identifying features of data types it hasn't seen before (like medical imaging or satellite imagery). They also have good error rates with low memory usage. The swin family, also a transformer-based model like vit, was the most accurate for fine-tuning on the Planets dataset. For the Planets dataset, the really big slow models don't necessarily have better error rates, which makes sense: if they have heaps of parameters but they're trying to learn something they've never seen before, it's unlikely we will be able to take advantage of those parameters.
For some models (like vit_small_patch16_224) you can only use 224x224 image sizes, while with others (like convnext_tiny) you can use any sized images.
Jeremy ran the vision model fine-tuning on 3 RTX GPUs for about 12 hours. They didn’t try all combinations of all parameters. Thomas ran a learning rate sweep to get a sense of what learning rates work well, and then they tried a couple of learning rates, a couple of the best resize methods and a couple of the best pooling types across a few broadly different kinds of models across the two different datasets. In every single case, the same learning rate, resize method and pooling method was the best.
Let's try out some of these models on the paddy classification task to identify which models' larger versions we should try training next. We use a fixed validation seed (seed=42) so that the same validation set is created each time we run train. A convnext model downsamples its input by a factor of 32 by its final layer, so you generally want both sides of the image to be multiples of 32. The correct dimensions for Resize are 640 by 480.
def train(arch, item, batch, accum=False):
    # use a smaller batch size when gradient accumulation is requested
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(4) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
    # report the TTA error rate on the validation set
    return error_rate(*learn.tta(dl=dls.valid))
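Not every cell that kicked off a run is reproduced below; as an illustration, the convnext_small_in22k run with a squish resize (the first set of results shown) would be launched with something like:

```python
arch = 'convnext_small_in22k'
train(arch, item=Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75))
```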
Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.069485 | 0.617368 | 0.193657 | 01:13 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.513920 | 0.278679 | 0.093224 | 01:31 |
1 | 0.369457 | 0.236429 | 0.076886 | 01:33 |
2 | 0.344589 | 0.229747 | 0.074003 | 01:30 |
3 | 0.259019 | 0.175089 | 0.050457 | 01:28 |
4 | 0.224322 | 0.149210 | 0.041326 | 01:28 |
5 | 0.176708 | 0.155431 | 0.047573 | 01:28 |
6 | 0.128338 | 0.155574 | 0.040846 | 01:28 |
7 | 0.096755 | 0.103420 | 0.026430 | 01:28 |
8 | 0.083143 | 0.086435 | 0.025469 | 01:28 |
9 | 0.053020 | 0.089935 | 0.021624 | 01:28 |
10 | 0.038454 | 0.082519 | 0.021624 | 01:31 |
11 | 0.041188 | 0.081926 | 0.019222 | 01:32 |
TensorBase(0.0211)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.092957 | 0.656337 | 0.207593 | 01:12 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.534311 | 0.283374 | 0.097069 | 01:29 |
1 | 0.404589 | 0.271343 | 0.091783 | 01:30 |
2 | 0.366122 | 0.263794 | 0.077367 | 01:28 |
3 | 0.291584 | 0.194437 | 0.056223 | 01:26 |
4 | 0.245451 | 0.202364 | 0.058145 | 01:26 |
5 | 0.176800 | 0.145820 | 0.043248 | 01:27 |
6 | 0.141820 | 0.128727 | 0.038443 | 01:26 |
7 | 0.105305 | 0.103860 | 0.029313 | 01:26 |
8 | 0.082278 | 0.099908 | 0.024988 | 01:26 |
9 | 0.061129 | 0.090908 | 0.020183 | 01:26 |
10 | 0.049765 | 0.085010 | 0.017780 | 01:26 |
11 | 0.042815 | 0.082840 | 0.018260 | 01:26 |
TensorBase(0.0202)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.074075 | 0.577121 | 0.189332 | 01:27 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.515035 | 0.284355 | 0.092263 | 01:48 |
1 | 0.400951 | 0.292205 | 0.091783 | 01:48 |
2 | 0.322861 | 0.263579 | 0.079769 | 01:48 |
3 | 0.302507 | 0.182555 | 0.056223 | 01:48 |
4 | 0.240202 | 0.166032 | 0.049015 | 01:48 |
5 | 0.181676 | 0.171471 | 0.046132 | 01:48 |
6 | 0.128153 | 0.124866 | 0.036040 | 01:47 |
7 | 0.105105 | 0.111518 | 0.028352 | 01:48 |
8 | 0.073392 | 0.093408 | 0.024988 | 01:48 |
9 | 0.051107 | 0.083389 | 0.024027 | 01:48 |
10 | 0.042867 | 0.083621 | 0.023066 | 01:48 |
11 | 0.038255 | 0.084581 | 0.022585 | 01:48 |
TensorBase(0.0187)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.043271 | 0.641115 | 0.211917 | 01:40 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.481680 | 0.278677 | 0.089380 | 02:02 |
1 | 0.364523 | 0.263106 | 0.082653 | 02:02 |
2 | 0.349608 | 0.226119 | 0.063431 | 02:02 |
3 | 0.297600 | 0.197567 | 0.056223 | 02:01 |
4 | 0.221989 | 0.189447 | 0.058145 | 02:01 |
5 | 0.160790 | 0.156223 | 0.037482 | 02:02 |
6 | 0.120237 | 0.125078 | 0.037963 | 02:02 |
7 | 0.092999 | 0.136008 | 0.035079 | 02:01 |
8 | 0.070052 | 0.101822 | 0.027391 | 02:01 |
9 | 0.051421 | 0.095571 | 0.024507 | 02:01 |
10 | 0.037683 | 0.093875 | 0.023066 | 02:01 |
11 | 0.040058 | 0.093482 | 0.023066 | 02:01 |
TensorBase(0.0226)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.241114 | 0.609537 | 0.202787 | 01:00 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.603312 | 0.330619 | 0.102355 | 01:05 |
1 | 0.454617 | 0.272407 | 0.090822 | 01:05 |
2 | 0.432220 | 0.399525 | 0.128784 | 01:05 |
3 | 0.343562 | 0.381830 | 0.123018 | 01:05 |
4 | 0.276432 | 0.273114 | 0.068717 | 01:06 |
5 | 0.229089 | 0.318629 | 0.077847 | 01:05 |
6 | 0.167870 | 0.146931 | 0.033157 | 01:05 |
7 | 0.117221 | 0.128760 | 0.037963 | 01:05 |
8 | 0.090773 | 0.112749 | 0.031235 | 01:05 |
9 | 0.073209 | 0.105501 | 0.028352 | 01:05 |
10 | 0.060867 | 0.107474 | 0.027871 | 01:05 |
11 | 0.061845 | 0.104577 | 0.028832 | 01:05 |
TensorBase(0.0245)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.264427 | 0.745677 | 0.241711 | 00:57 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.636773 | 0.356237 | 0.111485 | 01:03 |
1 | 0.512687 | 0.324432 | 0.112926 | 01:03 |
2 | 0.445590 | 0.373493 | 0.122537 | 01:03 |
3 | 0.386593 | 0.335397 | 0.106679 | 01:03 |
4 | 0.314561 | 0.262394 | 0.074003 | 01:03 |
5 | 0.236516 | 0.197571 | 0.060067 | 01:03 |
6 | 0.197938 | 0.153093 | 0.040846 | 01:03 |
7 | 0.159178 | 0.132239 | 0.038924 | 01:03 |
8 | 0.109954 | 0.117727 | 0.029313 | 01:03 |
9 | 0.084283 | 0.104230 | 0.025469 | 01:03 |
10 | 0.073850 | 0.100741 | 0.024988 | 01:03 |
11 | 0.064490 | 0.098695 | 0.024988 | 01:03 |
TensorBase(0.0250)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.313841 | 0.846934 | 0.269582 | 01:04 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.678171 | 0.413112 | 0.135031 | 01:11 |
1 | 0.497201 | 0.349746 | 0.111004 | 01:10 |
2 | 0.411814 | 0.311638 | 0.098991 | 01:10 |
3 | 0.410544 | 0.440684 | 0.128784 | 01:10 |
4 | 0.309415 | 0.252958 | 0.070159 | 01:10 |
5 | 0.241980 | 0.270128 | 0.073042 | 01:10 |
6 | 0.186923 | 0.202601 | 0.056223 | 01:10 |
7 | 0.130820 | 0.165027 | 0.043729 | 01:10 |
8 | 0.092804 | 0.121890 | 0.030274 | 01:10 |
9 | 0.072829 | 0.123613 | 0.029313 | 01:10 |
10 | 0.069157 | 0.110147 | 0.029793 | 01:10 |
11 | 0.054325 | 0.108744 | 0.026430 | 01:09 |
TensorBase(0.0221)
/opt/conda/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /usr/local/src/pytorch/aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Downloading: "https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth" to /root/.cache/torch/hub/checkpoints/swinv2_base_patch4_window12_192_22k.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.036088 | 0.583672 | 0.193176 | 02:14 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.509049 | 0.234983 | 0.078328 | 02:38 |
1 | 0.385443 | 0.205435 | 0.070159 | 02:38 |
2 | 0.334598 | 0.355438 | 0.089380 | 02:38 |
3 | 0.285663 | 0.368389 | 0.106679 | 02:39 |
4 | 0.238095 | 0.159115 | 0.045651 | 02:38 |
5 | 0.183420 | 0.140284 | 0.041326 | 02:38 |
6 | 0.141127 | 0.129525 | 0.036040 | 02:38 |
7 | 0.103826 | 0.111331 | 0.029313 | 02:38 |
8 | 0.077789 | 0.109304 | 0.027391 | 02:38 |
9 | 0.053972 | 0.096646 | 0.022585 | 02:38 |
10 | 0.041229 | 0.088552 | 0.021624 | 02:38 |
11 | 0.034090 | 0.088425 | 0.021144 | 02:38 |
TensorBase(0.0173)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.345018 | 0.807008 | 0.224892 | 02:29 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.566454 | 0.335172 | 0.117251 | 03:16 |
1 | 0.569964 | 0.336681 | 0.125901 | 03:17 |
2 | 0.562002 | 0.343439 | 0.118212 | 03:17 |
3 | 0.469339 | 0.393603 | 0.124459 | 03:17 |
4 | 0.297434 | 0.332929 | 0.090822 | 03:17 |
5 | 0.269842 | 0.198136 | 0.051898 | 03:17 |
6 | 0.186959 | 0.181704 | 0.054781 | 03:17 |
7 | 0.134943 | 0.134798 | 0.036040 | 03:17 |
8 | 0.113144 | 0.102160 | 0.030274 | 03:17 |
9 | 0.085017 | 0.104802 | 0.025469 | 03:17 |
10 | 0.048129 | 0.101891 | 0.022105 | 03:17 |
11 | 0.057491 | 0.094901 | 0.022585 | 03:17 |
TensorBase(0.0183)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=192, min_scale=0.75), accum=True)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.316884 | 1.035790 | 0.263335 | 02:35 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.617098 | 0.291554 | 0.094666 | 03:22 |
1 | 0.603711 | 0.409637 | 0.126862 | 03:23 |
2 | 0.573029 | 0.425025 | 0.127823 | 03:23 |
3 | 0.401325 | 0.402042 | 0.117732 | 03:23 |
4 | 0.340665 | 0.308467 | 0.089380 | 03:23 |
5 | 0.236972 | 0.177212 | 0.046132 | 03:23 |
6 | 0.212541 | 0.151314 | 0.041807 | 03:23 |
7 | 0.099307 | 0.110350 | 0.026430 | 03:23 |
8 | 0.054712 | 0.108030 | 0.022105 | 03:23 |
9 | 0.051622 | 0.100666 | 0.020183 | 03:22 |
10 | 0.032429 | 0.102271 | 0.022105 | 03:22 |
11 | 0.031421 | 0.097009 | 0.022105 | 03:22 |
TensorBase(0.0192)
Downloading: "https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth" to /root/.cache/torch/hub/checkpoints/swin_small_patch4_window7_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.424551 | 0.834437 | 0.278712 | 01:35 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.659169 | 0.377870 | 0.125420 | 01:48 |
1 | 0.487998 | 0.293272 | 0.092263 | 01:48 |
2 | 0.439836 | 0.344214 | 0.101874 | 01:49 |
3 | 0.337822 | 0.243527 | 0.074964 | 01:48 |
4 | 0.262154 | 0.199788 | 0.065353 | 01:49 |
5 | 0.206655 | 0.129096 | 0.038924 | 01:48 |
6 | 0.179885 | 0.116743 | 0.031716 | 01:48 |
7 | 0.118040 | 0.118282 | 0.035079 | 01:48 |
8 | 0.092112 | 0.114298 | 0.028832 | 01:48 |
9 | 0.078792 | 0.105398 | 0.025949 | 01:48 |
10 | 0.064473 | 0.097622 | 0.024027 | 01:48 |
11 | 0.057387 | 0.097082 | 0.024027 | 01:49 |
TensorBase(0.0207)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.420280 | 0.869214 | 0.276790 | 01:34 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.727566 | 0.402595 | 0.133109 | 01:47 |
1 | 0.549589 | 0.400400 | 0.129265 | 01:48 |
2 | 0.440090 | 0.304687 | 0.101394 | 01:48 |
3 | 0.397689 | 0.340592 | 0.112926 | 01:48 |
4 | 0.288660 | 0.184638 | 0.057184 | 01:48 |
5 | 0.246669 | 0.180551 | 0.049976 | 01:47 |
6 | 0.189145 | 0.161568 | 0.043729 | 01:48 |
7 | 0.151034 | 0.160868 | 0.039885 | 01:48 |
8 | 0.110399 | 0.115093 | 0.026910 | 01:48 |
9 | 0.084655 | 0.098188 | 0.025469 | 01:48 |
10 | 0.070253 | 0.093308 | 0.023066 | 01:48 |
11 | 0.064076 | 0.095348 | 0.024027 | 01:48 |
TensorBase(0.0231)
train(arch, item=Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros), batch=aug_transforms(size=224, min_scale=0.75))
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.479291 | 1.005589 | 0.330610 | 01:41 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.758326 | 0.441894 | 0.145123 | 01:55 |
1 | 0.548370 | 0.436102 | 0.139356 | 01:54 |
2 | 0.444455 | 0.361651 | 0.104277 | 01:55 |
3 | 0.370136 | 0.280115 | 0.088419 | 01:55 |
4 | 0.269262 | 0.184901 | 0.059106 | 01:54 |
5 | 0.242950 | 0.177827 | 0.054781 | 01:55 |
6 | 0.171754 | 0.153312 | 0.039404 | 01:55 |
7 | 0.128885 | 0.118345 | 0.030754 | 01:54 |
8 | 0.098144 | 0.103212 | 0.025949 | 01:54 |
9 | 0.078017 | 0.098263 | 0.024988 | 01:54 |
10 | 0.062568 | 0.092275 | 0.021624 | 01:54 |
11 | 0.055316 | 0.091669 | 0.021624 | 01:54 |
TensorBase(0.0183)
I'll summarize the training run parameters and resulting TTA error rates on the validation set in the following table, grouped by architecture and sorted within each group by ascending TTA Error Rate (First Run).
| Architecture | item_tfms | batch_tfms | TTA Error Rate (First Run) | Minutes per epoch (First Run) | TTA Error Rate (Second Run) |
|---|---|---|---|---|---|
| convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(288,224), min_scale=0.75) | 0.0178* | 01:51 | 0.0187 |
| convnext_small_in22k | Resize((640,480)) | aug_transforms(size=(320,240), min_scale=0.75) | 0.0202 | 02:07 | 0.0226 |
| convnext_small_in22k | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0211 | 01:30 | 0.0211 |
| convnext_small_in22k | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 01:29 | 0.0202 |
| vit_small_patch16_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 00:44 | 0.0250 |
| vit_small_patch16_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0216 | 00:47 | 0.0245 |
| vit_small_patch16_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0226 | 00:50 | 0.0221 |
| swinv2_base_window12_192_22k | Resize(480, method='squish') | aug_transforms(size=192, min_scale=0.75) | 0.0163* | 02:30 | 0.0173 |
| swinv2_base_window12_192_22k | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=192, min_scale=0.75) | 0.0187 | 03:27 | 0.0192 |
| swinv2_base_window12_192_22k | Resize(480) | aug_transforms(size=192, min_scale=0.75) | 0.0197 | 03:22 | 0.0183 |
| swin_small_patch4_window7_224 | Resize(480, method='squish') | aug_transforms(size=224, min_scale=0.75) | 0.0202* | 01:48 | 0.0207 |
| swin_small_patch4_window7_224 | Resize(480) | aug_transforms(size=224, min_scale=0.75) | 0.0207 | 01:47 | 0.0231 |
| swin_small_patch4_window7_224 | Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros) | aug_transforms(size=224, min_scale=0.75) | 0.0221 | 01:54 | 0.0183 |
* = lowest error rate for the architecture
I’ll retrain and create an ensemble of the top 3 models based on TTA Error Rate (First Run):
The swin_small_patch4_window7_224 models did not outperform the quicker/smaller vit model so I won’t use them in this submission.
Later on in the video, Jeremy walks through an example of how he trained large versions of the small models he tested. In this section, he used the following training function, which I'll use here for these small models to prepare my submission predictions. Note that Jeremy has removed seed=42, since for the submission ensemble we want to use a different validation set when training each model (whereas before we wanted the same validation set to better compare the performance between models). I've also changed a couple of things (I'm not exporting the models, and I'm using a smaller batch size).
def train(arch, item, batch, accum=False):
    # no fixed seed here: each ensemble member gets a different validation split
    kwargs = {'bs': 16} if accum else {}
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item, batch_tfms=batch, **kwargs)
    cbs = GradientAccumulation(2) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(12, 0.01)
    # TTA predictions on the test set, appended to the tta_res list defined outside this function
    tst_dl = dls.test_dl(tst_files)
    tta_res.append(learn.tta(dl=tst_dl))
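The three ensemble members are then trained with the top-3 configurations from the summary table above. The exact cells aren't reproduced here, so treat the calls below as a sketch: tst_files and tta_res are assumed to be defined before calling train, and using accum=True for the swinv2 model is my guess based on the earlier runs.

```python
tst_files = get_image_files(path/'test_images').sorted()
tta_res = []

train('swinv2_base_window12_192_22k', item=Resize(480, method='squish'),
      batch=aug_transforms(size=192, min_scale=0.75), accum=True)
train('convnext_small_in22k', item=Resize((640,480)),
      batch=aug_transforms(size=(288,224), min_scale=0.75))
train('vit_small_patch16_224', item=Resize(480),
      batch=aug_transforms(size=224, min_scale=0.75))
```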
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.093678 | 0.758215 | 0.250360 | 02:12 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.472679 | 0.248962 | 0.077847 | 02:36 |
1 | 0.383199 | 0.263211 | 0.081211 | 02:36 |
2 | 0.360025 | 0.292500 | 0.103316 | 02:36 |
3 | 0.305790 | 0.223976 | 0.066314 | 02:36 |
4 | 0.232600 | 0.209275 | 0.058145 | 02:36 |
5 | 0.185068 | 0.171094 | 0.043729 | 02:36 |
6 | 0.134446 | 0.165977 | 0.039885 | 02:36 |
7 | 0.108682 | 0.135310 | 0.031716 | 02:36 |
8 | 0.074768 | 0.124852 | 0.026430 | 02:36 |
9 | 0.052246 | 0.107549 | 0.024027 | 02:36 |
10 | 0.040028 | 0.102177 | 0.023546 | 02:36 |
11 | 0.038975 | 0.102109 | 0.022585 | 02:36 |
Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.088028 | 0.659407 | 0.192696 | 01:26 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.488645 | 0.251234 | 0.082172 | 01:45 |
1 | 0.394844 | 0.260079 | 0.086497 | 01:45 |
2 | 0.341203 | 0.206835 | 0.065834 | 01:46 |
3 | 0.294899 | 0.183829 | 0.057665 | 01:45 |
4 | 0.224933 | 0.172018 | 0.045651 | 01:45 |
5 | 0.179294 | 0.139805 | 0.037482 | 01:46 |
6 | 0.131405 | 0.104101 | 0.027871 | 01:45 |
7 | 0.094273 | 0.112815 | 0.031235 | 01:45 |
8 | 0.064216 | 0.106544 | 0.029313 | 01:46 |
9 | 0.045855 | 0.091775 | 0.021144 | 01:45 |
10 | 0.039155 | 0.086264 | 0.021624 | 01:45 |
11 | 0.027725 | 0.083699 | 0.020183 | 01:45 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.258543 | 0.658905 | 0.220087 | 00:56 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.630974 | 0.367167 | 0.113407 | 01:02 |
1 | 0.496218 | 0.381497 | 0.124940 | 01:03 |
2 | 0.424657 | 0.341580 | 0.111004 | 01:02 |
3 | 0.381134 | 0.273908 | 0.087458 | 01:02 |
4 | 0.326845 | 0.227150 | 0.072561 | 01:02 |
5 | 0.253998 | 0.209598 | 0.062951 | 01:02 |
6 | 0.179893 | 0.189200 | 0.046612 | 01:02 |
7 | 0.146728 | 0.211501 | 0.045651 | 01:02 |
8 | 0.113472 | 0.159040 | 0.036040 | 01:02 |
9 | 0.076088 | 0.145309 | 0.033157 | 01:02 |
10 | 0.068731 | 0.140491 | 0.031716 | 01:02 |
11 | 0.059864 | 0.140173 | 0.030754 | 01:02 |
Before I stack the predictions and prepare them for the submission, I’ll save the list of predictions:
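A sketch of that step using fastcore's save_pickle (the filename is my choice):

```python
# save_pickle is provided by fastcore and is available after the fastai star import
save_pickle('tta_res.pkl', tta_res)
```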
Next, I’ll take a quick detour and follow the steps Jeremy shares in Live Coding 11.
First, he takes the first item from each element of tta_res (the predictions) and stores them in a list called tta_prs. The tuple returned by learn.tta has a second item of None, which represents the targets (which we don't have for the test set), so we need to pick out just the first item (the predictions).
zipping the items in tta_res creates a list of two tuples: a tuple with the three sets of predictions (the first item in each element of tta_res) and a tuple with three Nones (the second item of each element of tta_res).
Here’s a toy example to illustrate:
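(The original cell isn't shown here, so this stands in for it, using made-up prediction placeholders.)

```python
from fastcore.basics import first

# three fake (preds, targs) pairs standing in for the real tta_res
toy_res = [('preds_a', None), ('preds_b', None), ('preds_c', None)]

list(zip(*toy_res))    # [('preds_a', 'preds_b', 'preds_c'), (None, None, None)]
first(zip(*toy_res))   # ('preds_a', 'preds_b', 'preds_c')
```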
The first function returns the first element of an iterable object.
Signature: first(x, f=None, negate=False, **kwargs)
Source:
def first(x, f=None, negate=False, **kwargs):
    "First element of `x`, optionally filtered by `f`, or None if missing"
    x = iter(x)
    if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
    return next(x, None)
File: /opt/conda/lib/python3.10/site-packages/fastcore/basics.py
Type: function
The second element of the zipped tta_res list is a tuple of Nones.
I'll now apply this code to tta_res:
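A sketch of that cell (tta_prs matches the variable name used above):

```python
tta_prs = first(zip(*tta_res))  # tuple of 3 prediction tensors; the Nones are discarded
```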
Next, in order to take the mean value of the predictions, we stack them into a tensor:
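A sketch of the stacking step (t_tta is an illustrative name):

```python
t_tta = torch.stack(tta_prs)  # shape: (3 models, n_test_images, 10 classes)
```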
Then, we take the mean of the three predictions for each of the 10 classes for each image.
We then get the index of the largest probability out of the 10 classes for each image, which is the “prediction” that the model has made for the image.
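Those two steps look roughly like this, continuing from t_tta above:

```python
avg_pr = t_tta.mean(0)        # average the three models' probabilities: (n_test_images, 10)
idxs = avg_pr.argmax(dim=1)   # index of the most probable class per image
```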
Finally, we convert those indexes to strings of disease names using the vocab and prepare the submission file:
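And a sketch of that final step, following the pattern in Jeremy's Scaling Up notebook. It assumes a dls built from the training folder is in scope (any of the DataLoaders above share the same vocab), and the submission filename is my choice:

```python
vocab = np.array(dls.vocab)                     # class index -> disease name
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = vocab[idxs]                       # map each predicted index to its name
ss.to_csv('subm.csv', index=False)
```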
If you run out of memory while training any of these large models, you can use GradientAccumulation to lower the memory usage. In the training loop we get the gradients, we update the weights by the gradients times the learning rate, and then we zero the gradients. What you could do is halve the batch size, say from 64 to 32, and then only do the update and zero the gradients every two iterations. So you calculate in two batches what you used to calculate in one batch, and it will be mathematically identical. That's called GradientAccumulation, which is added to the Learner as a callback; callbacks are things that change the behavior of training.
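Here's a minimal sketch of that idea with fastai's GradientAccumulation callback. The callback's argument is the effective number of samples to accumulate before each weight update, so pairing bs=32 with GradientAccumulation(64) reproduces the "halve the batch size, step every two batches" example (the values and architecture here are illustrative):

```python
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, bs=32,  # halved physical batch size
                                   item_tfms=Resize(480, method='squish'),
                                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()  # step once per 64 samples
learn.fine_tune(12, 0.01)
```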
How batches work: we randomly shuffle the dataset, grab the next batch-size worth of images, resize them all to be the same size, and stack them on top of each other. If it's black-and-white images, for example, we would have 64 (or whatever the batch size is) 640x480 (or whatever image size you want) images, so we end up with a 64x640x480 tensor. Pretty much all of the functionality provided by PyTorch will work fine for a mini-batch of things just as it would for a single thing.
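A toy illustration of that stacking (made-up random "images"):

```python
import torch

# 64 grayscale images, already resized to 640x480
imgs = [torch.rand(640, 480) for _ in range(64)]
batch = torch.stack(imgs)
batch.shape  # torch.Size([64, 640, 480])
```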
Inference is often done on CPU instead of GPU since you only need to process one thing at a time. Or people will queue a few of them up and stick them on a GPU.
In my next blog post I walk through the discussion and code from Live Coding 11.