# It's a good idea to ensure you're running the latest version of any libraries you need.
# `!pip install -Uqq <libraries>` upgrades to the latest version of <libraries>
# NB: You can safely ignore any warnings or errors pip spits out about running as root or incompatibilities
!pip install -Uqq fastai fastbook duckduckgo_search timm
Practical Deep Learning for Coders - Part 1: Notes and Examples
Vishal Bakshi
This notebook contains my notes (on the course videos, example notebooks, and book chapters) and exercises for Part 1 of the course Practical Deep Learning for Coders.
Lesson 1: Getting Started
Notebook Exercise
The first thing I did was to run through the lesson 1 notebook from start to finish. In that notebook, they download training and validation images of birds and forests, then train an image classifier that reaches 100% accuracy at identifying images of birds.
The first exercise is for us to create our own image classifier with our own image searches. I’ll create a classifier which accurately predicts an image of an alligator.
I’ll start by using their example code for getting images using DuckDuckGo image search:
from duckduckgo_search import ddg_images
from fastcore.all import *
def search_images(term, max_images=30):
print(f"Searching for '{term}'")
return L(ddg_images(term, max_results=max_images)).itemgot('image')
The `search_images` function takes a search `term` and a `max_images` maximum number of images value. It prints out a line of text that it’s "Searching for" the `term` and returns an `L` object with the `image` URLs.
The `ddg_images` function returns a list of JSON objects containing the `title`, `image` URL, `thumbnail` URL, `height`, `width` and `source` of the image.
search_object = ddg_images('alligator', max_results=1)
search_object
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:60: UserWarning: ddg_images is deprecated. Use DDGS().images() generator
warnings.warn("ddg_images is deprecated. Use DDGS().images() generator")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:64: UserWarning: parameter page is deprecated
warnings.warn("parameter page is deprecated")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:66: UserWarning: parameter max_results is deprecated
warnings.warn("parameter max_results is deprecated")
[{'title': 'The Creature Feature: 10 Fun Facts About the American Alligator | WIRED',
'image': 'https://www.wired.com/wp-content/uploads/2015/03/Gator-2.jpg',
'thumbnail': 'https://tse4.mm.bing.net/th?id=OIP.FS96VErnOXAGSWU092I_DQHaE8&pid=Api',
'url': 'https://www.wired.com/2015/03/creature-feature-10-fun-facts-american-alligator/',
'height': 3456,
'width': 5184,
'source': 'Bing'}]
Wrapping this list in an `L` object and calling `.itemgot('image')` on it extracts the URL value associated with the `image` key in the JSON object.
L(search_object).itemgot('image')
(#1) ['https://www.wired.com/wp-content/uploads/2015/03/Gator-2.jpg']
Next, they provide some code to download the image to a destination filename and view the image:
urls = search_images('alligator', max_images=1)

from fastdownload import download_url
dest = 'alligator.jpg'
download_url(urls[0], dest, show_progress=False)

from fastai.vision.all import *
im = Image.open(dest)
im.to_thumb(256,256)
Searching for 'alligator'
For my not-alligator images, I’ll use images of a swamp.
download_url(search_images('swamp photos', max_images=1)[0], 'swamp.jpg', show_progress=False)
Image.open('swamp.jpg').to_thumb(256,256)
Searching for 'swamp photos'
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:60: UserWarning: ddg_images is deprecated. Use DDGS().images() generator
warnings.warn("ddg_images is deprecated. Use DDGS().images() generator")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:64: UserWarning: parameter page is deprecated
warnings.warn("parameter page is deprecated")
/usr/local/lib/python3.9/dist-packages/duckduckgo_search/compat.py:66: UserWarning: parameter max_results is deprecated
warnings.warn("parameter max_results is deprecated")
In the following code, I’ll search for both terms, `alligator` and `swamp`, and store the images in the `alligator_or_not/alligator` and `alligator_or_not/swamp` paths, respectively.
The `parents=True` argument creates any intermediate parent directories that don’t exist (in this case, the `alligator_or_not` directory). The `exist_ok=True` argument suppresses the `FileExistsError` when the directory already exists and simply does nothing.
searches = 'swamp','alligator'
path = Path('alligator_or_not')
from time import sleep

for o in searches:
    dest = (path/o)
    dest.mkdir(exist_ok=True, parents=True)
    download_images(dest, urls=search_images(f'{o} photo'))
    sleep(10)  # Pause between searches to avoid over-loading server
    download_images(dest, urls=search_images(f'{o} sun photo'))
    sleep(10)
    download_images(dest, urls=search_images(f'{o} shade photo'))
    sleep(10)
    resize_images(path/o, max_size=400, dest=path/o)
Searching for 'swamp photo'
Searching for 'swamp sun photo'
Searching for 'swamp shade photo'
Searching for 'alligator photo'
Searching for 'alligator sun photo'
Searching for 'alligator shade photo'
Next, I’ll train my model using the code they have provided.
The `get_image_files` function is a fastai function which takes a `Path` object and returns an `L` object with paths to the image files.
type(get_image_files(path))
fastcore.foundation.L
get_image_files(path)
(#349) [Path('alligator_or_not/swamp/1b3c3a61-0f7f-4dc2-a704-38202d593207.jpg'),Path('alligator_or_not/swamp/9c9141f2-024c-4e26-b343-c1ca1672fde8.jpeg'),Path('alligator_or_not/swamp/1340dd85-5d98-428e-a861-d522c786c3d7.jpg'),Path('alligator_or_not/swamp/2d3f91dc-cc5f-499b-bec6-7fa0e938fb13.jpg'),Path('alligator_or_not/swamp/84afd585-ce46-4016-9a09-bd861a5615db.jpg'),Path('alligator_or_not/swamp/6222f0b6-1f5f-43ec-b561-8e5763a91c61.jpg'),Path('alligator_or_not/swamp/a71c8dcb-7bbb-4dba-8ae6-8a780d5c27c6.jpg'),Path('alligator_or_not/swamp/bbd1a832-a901-4e8f-8724-feac35fa8dcb.jpg'),Path('alligator_or_not/swamp/45b358b3-1a12-41d4-8972-8fa98b2baa52.jpg'),Path('alligator_or_not/swamp/cf664509-8eb6-42c8-9177-c17f48bc026b.jpg')...]
The fastai `parent_label` function takes a `Path` object and returns a string of the file’s parent folder name.
parent_label(Path('alligator_or_not/swamp/18b55d4f-3d3b-4013-822b-724489a23f01.jpg'))
'swamp'
Some image files that are downloaded may be corrupted, so they have provided a `verify_images` function to find images that can’t be opened. Those images are then removed (`unlink`ed) from the path.
failed = verify_images(get_image_files(path))
failed.map(Path.unlink)
len(failed)
1
failed
(#1) [Path('alligator_or_not/alligator/1eb55508-274b-4e23-a6ae-dbbf1943a9d1.jpg')]
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=[Resize(192, method='squish')]
).dataloaders(path, bs=32)
dls.show_batch(max_n=6)
I’ll train the model using their code, which uses the `resnet18` image classification model and `fine_tune`s it for 3 epochs.
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.690250 | 0.171598 | 0.043478 | 00:03 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.127188 | 0.001747 | 0.000000 | 00:02 |
1 | 0.067970 | 0.006409 | 0.000000 | 00:02 |
2 | 0.056453 | 0.004981 | 0.000000 | 00:02 |
The accuracy is 100%.
Next, I’ll test the model as they’ve done in the lesson.
PILImage.create('alligator.jpg').to_thumb(256,256)

is_alligator,_,probs = learn.predict(PILImage.create('alligator.jpg'))
print(f"This is an: {is_alligator}.")
print(f"Probability it's an alligator: {probs[0]:.4f}")
This is an: alligator.
Probability it's an alligator: 1.0000
Video Notes
In this section, I’ll take notes while I watch the lesson 1 video.
- This is the fifth version of the course!
- What seemed impossible in 2015 (image recognition of a bird) is now free and something we can build in 2 minutes.
- All models need numbers as their inputs. Images are already stored as numbers in computers. [PixSpy] allows you to (among other things) view the color of each pixel in an image file.
- A `DataBlock` gives fastai all the information it needs to create a computer vision model.
- Creating really interesting, real, working programs with deep learning is something that doesn’t take a lot of code, math, or more than a laptop computer. It’s pretty accessible.
- Deep Learning models are doing things that very few of us, if any, believed computers would be able to do in our lifetime.
- See the Practical Data Ethics course as well.
- Meta Learning: How To Learn Deep Learning And Thrive In The Digital World.
- Books on learning/education:
- Mathematician’s Lament by Paul Lockhart
- Making Learning Whole by David Perkins
- Why are we able to create a bird-recognizer in a minute or two? And why couldn’t we do it before?
- 2012: Project looking at 5-year survival of breast cancer patients, pre-deep learning approach
- Assembled a team to build ideas for thousands of features that required a lot of expertise, took years.
- They fed these features into a logistic regression model to predict survival.
- Neural networks don’t require us to build these features, they build them for us.
- 2015: Matthew D. Zeiler and Rob Fergus looked inside a neural network to see what it had learned.
- We don’t give it features, we ask it to learn features.
- The neural net is the basic function used in deep learning.
- You start with a random neural network, feed it examples and you have it learn to recognize things.
- The deeper you get, the more sophisticated the features it can find are.
- What we’re going to learn is how neural networks do this automatically.
- This is the key difference in why we can now do things that we couldn’t previously conceive of as possible.
- An image recognizer can also be used to classify sounds (pictures of waveforms).
- Turning time series into pictures for image classification.
- fastai is built on top of PyTorch.
- `!pip install -Uqq fastai` to update.
- Always view your data at every step of building a model.
- For computer vision algorithms you don’t need particularly big images.
- For big images, most of the time is taken up opening them; the neural net on the GPU is much faster.
- The main thing you’re going to try and figure out is how do I get this data into my model?
- `DataBlock`:
- `blocks=(ImageBlock, CategoryBlock)`: `ImageBlock` is the type of input to the model, `CategoryBlock` is the type of model output.
- `get_image_files(path)` returns a list of all image files in a `path`.
- It’s critical that you put aside some data for testing the accuracy of your model (a validation set) with something like `RandomSplitter` for the `splitter` parameter.
- `get_y` tells fastai how to get the correct label for the photo.
- Most computer vision architectures need all of your inputs to be the same size, using `Resize` (either `crop` out a piece in the middle or `squish` the image) for the parameter `item_tfms`.
- `DataLoaders` contains iterators that PyTorch can run through to grab batches of your data to feed the training algorithm.
- `show_batch` shows you a batch of input/label pairs.
- A `Learner` combines a model (the actual neural network that we are training) and the data we use to train it with.
- PyTorch Image Models (timm).
- resnet has already been trained to recognize over 1 million images of over 1000 different types. fastai downloads this so you can start with a neural network that can do a lot.
- `fine_tune` takes those pretrained weights downloaded for you and adjusts them in a carefully controlled way to teach the model the differences between your dataset and what it was originally trained for.
- You pass `.predict` an image, which is how you would deploy your model; it returns whether it’s a bird or not as a string, an integer, and the probability of whether it’s a bird (in this example).
In the code blocks below, I’ll train the different types of models presented in the video lesson.
Image Segmentation
from fastai.vision.all import *
path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames = get_image_files(path/"images"),
    label_func = lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes = np.loadtxt(path/'codes.txt', dtype=str)
)

learn = unet_learner(dls, resnet34)
learn.fine_tune(8)
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 3.454409 | 3.015761 | 00:06 |
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.928762 | 1.719756 | 00:02 |
1 | 1.649520 | 1.394089 | 00:02 |
2 | 1.533350 | 1.344445 | 00:02 |
3 | 1.414438 | 1.279674 | 00:02 |
4 | 1.291168 | 1.063977 | 00:02 |
5 | 1.174492 | 0.980055 | 00:02 |
6 | 1.073124 | 0.931532 | 00:02 |
7 | 0.992161 | 0.922516 | 00:02 |
learn.show_results(max_n=3, figsize=(7,8))
It’s amazing how many pixels it’s getting correct, given that this model was trained in about 24 seconds using a tiny amount of data.
I’ll take a look at the codes out of curiosity; they form an array of string elements describing the different objects in view.
np.loadtxt(path/'codes.txt', dtype=str)
array(['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car',
'CartLuggagePram', 'Child', 'Column_Pole', 'Fence', 'LaneMkgsDriv',
'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter', 'OtherMoving',
'ParkingBlock', 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk',
'SignSymbol', 'Sky', 'SUVPickupTruck', 'TrafficCone',
'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel',
'VegetationMisc', 'Void', 'Wall'], dtype='<U17')
Tabular Analysis
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names='salary',
    cat_names = ['workclass', 'education', 'marital-status', 'occupation',
                 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
dls.show_batch()
workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | State-gov | Some-college | Divorced | Adm-clerical | Own-child | White | False | 42.0 | 138162.000499 | 10.0 | <50k |
1 | Private | HS-grad | Married-civ-spouse | Other-service | Husband | Asian-Pac-Islander | False | 40.0 | 73025.003080 | 9.0 | <50k |
2 | Private | Assoc-voc | Married-civ-spouse | Prof-specialty | Wife | White | False | 36.0 | 163396.000571 | 11.0 | >=50k |
3 | Private | HS-grad | Never-married | Sales | Own-child | White | False | 18.0 | 110141.999831 | 9.0 | <50k |
4 | Self-emp-not-inc | 12th | Divorced | Other-service | Unmarried | White | False | 28.0 | 33035.002716 | 8.0 | <50k |
5 | ? | 7th-8th | Separated | ? | Own-child | White | False | 50.0 | 346013.994175 | 4.0 | <50k |
6 | Self-emp-inc | HS-grad | Never-married | Farming-fishing | Not-in-family | White | False | 36.0 | 37018.999571 | 9.0 | <50k |
7 | State-gov | Masters | Married-civ-spouse | Prof-specialty | Husband | White | False | 37.0 | 239409.001471 | 14.0 | >=50k |
8 | Self-emp-not-inc | Doctorate | Married-civ-spouse | Prof-specialty | Husband | White | False | 50.0 | 167728.000009 | 16.0 | >=50k |
9 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 247111.001513 | 9.0 | >=50k |
For tabular models, there’s not generally going to be a pretrained model that already does something like what you want, because every table of data is very different, so generally it doesn’t make too much sense to `fine_tune` a tabular model.
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.373780 | 0.365976 | 0.832770 | 00:06 |
1 | 0.356514 | 0.358780 | 0.833999 | 00:05 |
Collaborative Filtering
The basis of most recommendation systems.
from fastai.collab import *
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
dls.show_batch()
userId | movieId | rating | |
---|---|---|---|
0 | 457 | 457 | 3.0 |
1 | 407 | 2959 | 5.0 |
2 | 294 | 356 | 4.0 |
3 | 78 | 356 | 5.0 |
4 | 596 | 3578 | 4.5 |
5 | 547 | 541 | 3.5 |
6 | 105 | 1193 | 4.0 |
7 | 176 | 4993 | 4.5 |
8 | 430 | 1214 | 4.0 |
9 | 607 | 858 | 4.5 |
There’s actually no pretrained collaborative filtering model, so we could use `fit_one_cycle`, but `fine_tune` works here as well.
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(10)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.498450 | 1.417215 | 00:00 |
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.375927 | 1.357755 | 00:00 |
1 | 1.274781 | 1.176326 | 00:00 |
2 | 1.033917 | 0.870168 | 00:00 |
3 | 0.810119 | 0.719341 | 00:00 |
4 | 0.704180 | 0.679201 | 00:00 |
5 | 0.640635 | 0.667121 | 00:00 |
6 | 0.623741 | 0.661391 | 00:00 |
7 | 0.620811 | 0.657624 | 00:00 |
8 | 0.606947 | 0.656678 | 00:00 |
9 | 0.605081 | 0.656613 | 00:00 |
learn.show_results()
userId | movieId | rating | rating_pred | |
---|---|---|---|---|
0 | 15.0 | 35.0 | 4.5 | 3.886339 |
1 | 68.0 | 64.0 | 5.0 | 3.822170 |
2 | 62.0 | 33.0 | 4.0 | 3.088149 |
3 | 39.0 | 91.0 | 4.0 | 3.788227 |
4 | 37.0 | 7.0 | 5.0 | 4.434169 |
5 | 38.0 | 98.0 | 3.5 | 4.380877 |
6 | 3.0 | 25.0 | 3.0 | 3.443295 |
7 | 23.0 | 13.0 | 2.0 | 3.220192 |
8 | 15.0 | 7.0 | 4.0 | 4.306846 |
Note: RISE turns your notebook into a presentation.
Generally speaking, if it’s something that a human can do reasonably quickly, even an expert human (like looking at a Go board and deciding if it’s a good position or not), then that’s probably something deep learning will be good at. If it’s something that takes a logical thought process over time, particularly if it’s not based on much data, deep learning probably won’t do that well.
The first neural network was built in 1957. The basic ideas have not changed much at all.
What’s going on in these models?
- Arthur Samuel in late 1950s invented Machine Learning.
- Normal program: input -> program -> results.
- Machine Learning model: input and weights (parameters) -> model -> results.
- The model is a mathematical function that takes the inputs, multiplies them by one set of weights and adds them up, then does that again for a second set of weights, and so forth (see the sketch after this list).
- It takes all of the negative numbers and replaces them with 0.
- It takes all those numbers as inputs to the next layer.
- And it repeats a few times.
- Weights start out as being random.
- A more useful workflow: input/weights -> model -> results -> loss -> update weights.
- The loss is a number that says how good the results were.
- We need a way to come up with a new set of weights that are a bit better than the current weights.
- “bit better” weights means it makes the loss a bit better.
- If we make it a little bit better a few times, it’ll eventually get good.
- Neural nets have been proven able to solve any computable function (i.e., they’re flexible enough that updating the weights can make the results good).
- “Generate artwork based on someone’s twitter bio” is a computable function.
- Once we’ve finished the training procedure we don’t need the loss anymore, and the weights can be integrated into the model.
- We end up with inputs -> model -> results which looks like our original idea of a program.
- Deploying a model will have lots of tricky details, but there will be one line of code which says `learn.predict`, which takes an input and provides results.
- The most important thing to do is experiment.
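As a minimal sketch of my own (not from the lesson), here’s that forward pass in plain NumPy: inputs multiplied by one set of weights and summed, negative values replaced with 0, then fed through a second set of weights. The shapes and values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # inputs (e.g., 4 pixel values)
w1 = rng.normal(size=(4, 8))      # first set of weights (random to start)
w2 = rng.normal(size=(8, 1))      # second set of weights

h = x @ w1                        # multiply inputs by weights and add them up
h = np.maximum(h, 0)              # replace all negative numbers with 0 (ReLU)
result = h @ w2                   # feed those numbers into the next layer
print(result)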
Book Notes
Chapter 1: Your Deep Learning Journey
In this section, I’ll take notes while I read Chapter 1 of the textbook.
Deep Learning is for Everyone
- What you don’t need for deep learning: lots of math, lots of data, lots of expensive computers.
- Deep learning is a computer technique to extract and transform data by using multiple layers of neural networks. Each of these layers takes its inputs from previous layers and progressively refines them. The layers are trained by algorithms that minimize their errors and improve their accuracy. In this way, the network learns to perform a specified task.
Neural Networks: A Brief History
- Warren McCulloch and Walter Pitts developed a mathematical model of an artificial neuron in 1943.
- Most of Pitts’s famous work was done while he was homeless.
- Psychologist Frank Rosenblatt further developed the artificial neuron to give it the ability to learn and built the first device that used these principles, the Mark I Perceptron, which was able to recognize simple shapes.
- Marvin Minsky and Seymour Papert wrote a book about the Perceptron showing that a single layer of these devices could not learn some simple but critical functions (such as XOR), and that using multiple layers of the devices would allow the limitations of a single layer to be addressed.
- The 1986 book Parallel Distributed Processing (PDP) by David Rumelhart, James McClelland, and the PDP Research Group defined PDP as requiring the following:
- A set of processing units.
- A state of activation.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network of connectivities.
- An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit.
- A learning rule whereby patterns of connectivity are modified by experience.
- An environment within which the system must operate.
How to Learn Deep Learning
- The hardest part of deep learning is artisanal: how do you know if you’ve got enough data, whether it is in the right format, if your model is training properly, and, if it’s not, what you should do about it?
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path,
    get_image_files(path),
    valid_pct=0.2,
    seed=42,
    label_func=is_cat,
    item_tfms=Resize(224)
)

dls.show_batch()

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
/usr/local/lib/python3.10/dist-packages/fastai/vision/learner.py:288: UserWarning: `cnn_learner` has been renamed to `vision_learner` -- please update your code
warn("`cnn_learner` has been renamed to `vision_learner` -- please update your code")
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|██████████| 83.3M/83.3M [00:00<00:00, 162MB/s]
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.140327 | 0.019135 | 0.007442 | 01:05 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.070464 | 0.024966 | 0.006766 | 01:00 |
The error rate is the proportion of images that were incorrectly identified.
Let’s check that this model actually works with an image of a dog or cat. I’ll download a picture from Google and use it for prediction:
import ipywidgets as widgets
uploader = widgets.FileUpload()
uploader

im = PILImage.create(uploader.data[0])
is_cat, _, probs = learn.predict(im)
im.to_thumb(256)
print(f'Is this a cat?: {is_cat}.')
print(f"Probability it's a cat: {probs[1].item():.6f}")
Is this a cat?: True.
Probability it's a cat: 1.000000
What is Machine Learning?
- A traditional program: inputs -> program -> results.
- In 1949, IBM researcher Arthur Samuel started working on machine learning. His basic idea was this: instead of telling the computer the exact steps required to solve a problem, show it examples of the problem to solve, and let it figure out how to solve it itself.
- In 1961 his checkers-playing program had learned so much that it beat the Connecticut state champion.
- Weights are just variables and a weight assignment is a particular choice of values for those variables.
- The program’s inputs are values that it processes in order to produce its results (for instance, taking image pixels as inputs, and returning the classification “dog” as a result).
- Because the weights affect the program, they are in a sense another kind of input.
- A program using weight assignment: inputs and weights -> model -> results.
- A model is a special kind of program, one that can do many different things depending on the weights.
- Weights = parameters, with the term “weights” reserved for a particular type of model parameter.
- Learning would become entirely automatic when the adjustment of the weights was also automatic.
- Training a machine learning model: inputs and weights -> model -> results -> performance -> update weights.
- Results are different from the performance of a model.
- Using a trained model as a program: inputs -> model -> results.
- Machine learning is the training of programs developed by allowing a computer to learn from its experience, rather than through manually coding the individual steps.
What is a Neural Network?
- A neural network is a mathematical function that can solve any problem to any level of accuracy.
- Stochastic Gradient Descent (SGD) is a completely general way to update the weights of a neural network, to make it improve at any given task.
- Image classification problem:
- Our inputs are the images.
- Our weights are the weights in the neural net.
- Our model is a neural net.
- Our results are the values that are calculated by the neural net, like “dog” or “cat”.
A Bit of Deep Learning Jargon
- The functional form of the model is called its architecture.
- The weights are called parameters.
- The predictions are calculated from the independent variable, which is the data not including the labels.
- The results of the model are called predictions.
- The measure of performance is called the loss.
- The loss depends not only on the predictions, but also on the correct labels (also known as targets or the dependent variable).
- Detailed training loop: inputs and parameters -> architecture -> predictions (+ labels) -> loss -> update parameters (see the sketch below).
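Here’s a minimal sketch of my own (not from the book) of that detailed training loop in PyTorch, using made-up toy data: the architecture plus its parameters produces predictions, the loss compares predictions to labels, and SGD updates the parameters.

import torch
from torch import nn

# Toy data: 100 examples with 4 features each, and binary labels
inputs = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,)).float()

architecture = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))  # template with random parameters
loss_func = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(architecture.parameters(), lr=0.1)

for epoch in range(5):
    preds = architecture(inputs).squeeze(1)   # inputs + parameters -> predictions
    loss = loss_func(preds, labels)           # predictions + labels -> loss
    loss.backward()                           # compute gradients
    opt.step()                                # update parameters
    opt.zero_grad()
    print(epoch, loss.item())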
Limitations Inherent to Machine Learning
- A model cannot be created without data.
- A model can learn to operate on only the patterns seen in the input data used to train it.
- This learning approach creates only predictions, not recommended actions.
- It’s not enough to just have examples of input data, we need labels for that data too.
- Positive feedback loop: the more the model is used, the more biased the data becomes, making the model even more biased, and so forth.
How Our Image Recognizer Works
- `item_tfms` are applied to each item, while `batch_tfms` are applied to a batch of items at a time using the GPU.
- A classification model attempts to predict a class, or category.
- A regression model is one that attempts to predict one or more numeric quantities, such as temperature or location.
- The parameter `seed=42` sets the random seed to the same value every time we run this code, which means we get the same validation set every time we run it. This way, if we change our model and retrain it, we know that any differences are due to the changes to the model, not due to having a different random validation set.
- We care about how well our model works on previously unseen images.
- The longer you train for, the better your accuracy will get on the training set; the validation set accuracy will also improve for a while, but eventually it will start getting worse as the model starts to memorize the training set rather than finding generalizable underlying patterns in the data. When this happens, we say that the model is overfitting.
- Overfitting is the single most important and challenging issue when training for all machine learning practitioners, and all algorithms.
- You should only use methods to avoid overfitting after you have confirmed that overfitting is occurring (i.e., if you have observed the validation accuracy getting worse during training)
- fastai defaults to `valid_pct=0.2`.
- Models using architectures with more layers take longer to train and are more prone to overfitting; on the other hand, when using more data, they can be quite a bit more accurate.
- A metric is a function that measures the quality of the model’s predictions using the validation set.
- error_rate tells you what percentage of inputs in the validation set are being classified incorrectly.
- accuracy = `1.0 - error_rate`.
- The entire purpose of loss is to define a “measure of performance” that the training system can use to update weights automatically. A good choice for loss is a choice that is easy for stochastic gradient descent to use. But a metric is defined for human consumption, so a good metric is one that is easy for you to understand.
- A model that has weights that have already been trained on another dataset is called a pretrained model.
- When using a pretrained model, `cnn_learner` will remove the last layer and replace it with one or more new layers with randomized weights. This last part of the model is known as the head.
- Using a pretrained model for a task different from what it was originally trained for is known as transfer learning.
- The architecture only describes a template for a mathematical function; it doesn’t actually do anything until we provide values for the millions of parameters it contains.
- To fit a model, we have to provide at least one piece of information: how many times to look at each image (known as number of epochs).
- `fit` will fit a model (i.e., look at images in the training set multiple times, each time updating the parameters to make the predictions closer and closer to the target labels).
- Fine-Tuning: a transfer learning technique that updates the parameters of a pretrained model by training for additional epochs using a different task from that used for pretraining.
- `fine_tune` has a few parameters you can set, but in the default form it does two steps (see the sketch after this list):
- Use one epoch to fit just those parts of the model necessary to get the new random head to work correctly with your dataset.
- Use the number of epochs requested when calling the method to fit the entire model, updating the weights of the later layers (especially the head) faster than the earlier layers (which don’t require many changes from the pretrained weights).
- The head of the model is the part that is newly added to be specific to the new dataset.
- An epoch is one complete pass through the dataset.
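As a rough sketch of those two default steps (my own simplification, not the actual fastai source, and not meant to be re-run here), the behavior of `learn.fine_tune(1)` from the cell above is roughly equivalent to:

# Simplified, illustrative sketch of fine_tune's default two steps;
# the real implementation also adjusts learning rates between the stages.
learn.freeze()            # step 1: train only the new, randomly initialized head
learn.fit_one_cycle(1)
learn.unfreeze()          # step 2: train the entire model
learn.fit_one_cycle(1)    # the number of epochs passed to fine_tune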
What Our Image Recognizer Learned
- When we fine tune our pretrained models, we adapt what the last layers focus on to specialize on the problem at hand.
Image Recognizers Can Tackle Non-Image Tasks
- A lot of things can be represented as images.
- Sound can be converted to a spectrogram (see the sketch after this list).
- Time series data can be converted into an image using a Gramian Angular Difference Field (GADF).
- If the human eye can recognize categories from the images, then a deep learning model should be able to do so too.
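As a minimal sketch of my own (not from the book) of the sound-to-image idea, the code below turns a synthetic 440 Hz tone into a spectrogram image with matplotlib's `specgram`; a real project would use recorded audio instead.

import numpy as np
import matplotlib.pyplot as plt

sr = 16_000                                  # sample rate in Hz
t = np.linspace(0, 1, sr, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)         # a 440 Hz tone standing in for real audio

plt.specgram(signal, Fs=sr)                  # time-frequency "picture" of the sound
plt.axis('off')
plt.savefig('spectrogram.png', bbox_inches='tight', pad_inches=0)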
Jargon Recap
Term | Meaning |
---|---|
Label | The data that we’re trying to predict |
Architecture | The template of the model that we’re trying to fit; i.e., the actual mathematical function that we’re passing the input data and parameters to |
Model | The combination of the architecture with a particular set of parameters |
Parameters | The values in the model that change what task it can do and that are updated through model training |
Fit | Update the parameters of the model such that the predictions of the model using the input data match the target labels |
Train | A synonym for fit |
Pretrained Model | A model that has already been trained, generally using a large dataset, and will be fine-tuned |
Fine-tune | Update a pretrained model for a different task |
Epoch | One complete pass through the input data |
Loss | A measure of how good the model is, chosen to drive training via SGD |
Metric | A measurement of how good the model is using the validation set, chosen for human consumption |
Validation set | A set of data held out from training, used only for measuring how good the model is |
Training set | The data used for fitting the model; does not include any data from the validation set |
Overfitting | Training a model in such a way that it remembers specific features of the input data, rather than generalizing well to data not seen during training |
CNN | Convolutional neural network; a type of neural network that works particularly well for computer vision tasks |
Deep Learning is Not Just for Image Classification
- Segmentation
- Natural language processing (see below)
- Tabular (see Adults income classification above)
- Collaborative filtering (see MovieLens ratings predictor above)
- Start by using one of the cut-down dataset versions and later scale up to the full-size version. This is how the world’s top practitioners do their modeling in practice; they do most of their experimentation and prototyping with subsets of their data, and use the full dataset only when they have a good understanding of what they have to do.
Validation Sets and Test Sets
- If the model makes an accurate prediction for a data item, that should be because it has learned characteristics of that kind of item, and not because the model has been shaped by actually having seen that particular item.
- Hyperparameters: various modeling choices regarding network architecture, learning rates, data augmentation strategies, and other factors.
- We, as modelers, are evaluating the model by looking at predictions on the validation data when we decide to explore new hyperparameter values and we are in danger of overfitting the validation data through human trial and error and exploration.
- The test set can be used only to evaluate the model at the very end of our efforts.
- Training data is fully exposed to training and modeling processes, validation data is less exposed and test data is fully hidden.
- The test and validation sets should have enough data to ensure that you get a good estimate of your accuracy.
- The discipline of the test set helps us keep ourselves intellectually honest.
- It’s a good idea for you to try out a simple baseline model yourself, so you know what a really simple model can achieve (see the sketch below).
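For example, here’s a minimal majority-class baseline sketch of my own, using hypothetical labels: always predict the most common training class and measure accuracy on the validation labels.

from collections import Counter

# Hypothetical labels standing in for a real dataset
train_labels = ['alligator', 'swamp', 'swamp', 'alligator', 'swamp']
valid_labels = ['swamp', 'alligator', 'swamp', 'swamp']

majority = Counter(train_labels).most_common(1)[0][0]   # most frequent training class
baseline_acc = sum(label == majority for label in valid_labels) / len(valid_labels)
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")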
Use Judgment in Defining Test Sets
- A key property of the validation and test sets is that they must be representative of the new data you will see in the future.
- As an example, for time series data, use earlier dates for the training set and later, more recent dates for the validation set (see the sketch after this list).
- The data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.
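Here’s a minimal sketch of my own of such a date-based split, using made-up daily data: hold out the most recent 20% of rows as the validation set instead of sampling rows at random.

import numpy as np
import pandas as pd

# Hypothetical daily time series: 100 days of a measured value
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100, freq='D'),
    'value': np.random.randn(100).cumsum(),
}).sort_values('date')

cut = int(len(df) * 0.8)                 # hold out the most recent 20% of dates
train_df, valid_df = df.iloc[:cut], df.iloc[cut:]
print(train_df['date'].max(), valid_df['date'].min())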
from fastai.text.all import *
# I'm using IMDB_SAMPLE instead of the full IMDB dataset since it either takes too long or
# I get a CUDA Out of Memory error if the batch size is more than 16 for the full dataset
# Using a batch size of 16 with the sample dataset works fast
dls = TextDataLoaders.from_csv(
    path=untar_data(URLs.IMDB_SAMPLE),
    csv_fname='texts.csv',
    text_col=1,
    label_col=0,
    bs=16)
dls.show_batch()
text | category | |
---|---|---|
0 | xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is | negative |
1 | xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies | positive |
2 | xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after 20 - odd years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , | negative |
3 | xxbos i really wanted to love this show . i truly , honestly did . \n\n xxmaj for the first time , gay viewers get their own version of the " the xxmaj bachelor " . xxmaj with the help of his obligatory " hag " xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance of love with 15 suitors ( or " mates " as they are referred to in the show ) . xxmaj the only problem is half of them are straight and xxmaj james does n't know this . xxmaj if xxmaj james picks a gay one , they get a trip to xxmaj new xxmaj zealand , and xxmaj if he picks a straight one , straight guy gets $ 25 , xxrep 3 0 . xxmaj how can this not be fun | negative |
4 | xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics that are terribly dated today , the game xxunk you into the role of xxunk even * think * xxmaj i 'm going to attempt spelling his last name ! ) , an xxmaj american xxup xxunk . caught in an underground bunker . xxmaj you fight and search your way through xxunk in order to achieve different xxunk for the six xxunk , let 's face it , most of them are just an excuse to hand you a weapon | positive |
5 | xxbos xxmaj i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie " october xxmaj sky " ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj hickam 's memoirs , but i am still able to enjoy and understand their film adaptation . xxmaj the film , directed by xxmaj joe xxmaj xxunk and written by xxmaj lewis xxmaj xxunk , xxunk the story of teenager xxmaj homer xxmaj hickam ( jake xxmaj xxunk ) , beginning in xxmaj october of 1957 . xxmaj it opens with the sound of a radio broadcast , bringing news of the xxmaj russian satellite xxmaj xxunk , the first artificial satellite in | positive |
6 | xxbos xxmaj to review this movie , i without any doubt would have to quote that memorable scene in xxmaj tarantino 's " pulp xxmaj fiction " ( xxunk ) when xxmaj jules and xxmaj vincent are talking about xxmaj mia xxmaj wallace and what she does for a living . xxmaj jules tells xxmaj vincent that the " only thing she did worthwhile was pilot " . xxmaj vincent asks " what the hell is a pilot ? " and xxmaj jules goes into a very well description of what a xxup tv pilot is : " well , the way they make shows is , they make one show . xxmaj that show 's called a ' pilot ' . xxmaj then they show that show to the people who make shows , and on the strength of that one show they decide if they 're going to | negative |
7 | xxbos xxmaj how viewers react to this new " adaption " of xxmaj shirley xxmaj jackson 's book , which was promoted as xxup not being a remake of the original 1963 movie ( true enough ) , will be based , i suspect , on the following : those who were big fans of either the book or original movie are not going to think much of this one … and those who have never been exposed to either , and who are big fans of xxmaj hollywood 's current trend towards " special effects " being the first and last word in how " good " a film is , are going to love it . \n\n xxmaj things i did not like about this adaption : \n\n 1 . xxmaj it was xxup not a true adaption of the book . xxmaj from the xxunk i had | negative |
8 | xxbos xxmaj the trouble with the book , " memoirs of a xxmaj geisha " is that it had xxmaj japanese xxunk but underneath the xxunk it was all an xxmaj american man 's way of thinking . xxmaj reading the book is like watching a magnificent ballet with great music , sets , and costumes yet performed by xxunk animals dressed in those xxunk far from xxmaj japanese ways of thinking were the characters . \n\n xxmaj the movie is n't about xxmaj japan or real geisha . xxmaj it is a story about a few xxmaj american men 's mistaken ideas about xxmaj japan and geisha xxunk through their own ignorance and misconceptions . xxmaj so what is this movie if it is n't about xxmaj japan or geisha ? xxmaj is it pure fantasy as so many people have said ? xxmaj yes , but then why | negative |
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.629276 | 0.553454 | 0.740000 | 00:19 |
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.466581 | 0.548400 | 0.740000 | 00:30 |
1 | 0.410401 | 0.418941 | 0.825000 | 00:30 |
2 | 0.286162 | 0.410872 | 0.830000 | 00:31 |
3 | 0.192047 | 0.405275 | 0.845000 | 00:31 |
# view actual vs prediction
learn.show_results()
text | category | category_ | |
---|---|---|---|
0 | xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' " la xxmaj xxunk , " based on a play by xxmaj arthur xxmaj xxunk , who is given an " inspired by " credit . xxmaj it starts from one person , a prostitute , standing on a street corner in xxmaj brooklyn . xxmaj she is picked up by a home contractor , who has sex with her on the hood of a car , but ca n't come . xxmaj he refuses to pay her . xxmaj when he 's off xxunk , she | positive | positive |
1 | xxbos xxmaj bonanza had a great cast of wonderful actors . xxmaj xxunk xxmaj xxunk , xxmaj pernell xxmaj whitaker , xxmaj michael xxmaj xxunk , xxmaj dan xxmaj blocker , and even xxmaj guy xxmaj williams ( as the cousin who was brought in for several episodes during 1964 to replace xxmaj adam when he was leaving the series ) . xxmaj the cast had chemistry , and they seemed to genuinely like each other . xxmaj that made many of their weakest stories work a lot better than they should have . xxmaj it also made many of their best stories into great western drama . \n\n xxmaj like any show that was shooting over thirty episodes every season , there are bound to be some weak ones . xxmaj however , most of the time each episode had an interesting story , some kind of conflict , | positive | negative |
2 | xxbos i watched xxmaj grendel the other night and am compelled to put together a xxmaj public xxmaj service xxmaj announcement . \n\n xxmaj grendel is another version of xxmaj beowulf , the thousand - year - old xxunk - saxon epic poem . xxmaj the scifi channel has a growing catalog of xxunk and uninteresting movies , and the previews promised an xxunk low - budget mini - epic , but this one xxunk to let me switch xxunk . xxmaj it was xxunk , xxunk , bad . i watched in xxunk and horror at the train wreck you could n't tear your eyes away from . i reached for a xxunk and managed to capture part of what i was seeing . xxmaj the following may contain spoilers or might just save your xxunk . xxmaj you 've been warned . \n\n - xxmaj just to get | negative | negative |
3 | xxbos xxmaj this is the last of four xxunk from xxmaj france xxmaj i 've xxunk for viewing during this xxmaj christmas season : the others ( in order of viewing ) were the uninspired xxup the xxup black xxup tulip ( 1964 ; from the same director as this one but not nearly as good ) , the surprisingly effective xxup lady xxmaj oscar ( 1979 ; which had xxunk as a xxmaj japanese manga ! ) and the splendid xxup cartouche ( xxunk ) . xxmaj actually , i had watched this one not too long ago on late - night xxmaj italian xxup tv and recall not being especially xxunk over by it , so that i was genuinely surprised by how much i enjoyed it this time around ( also bearing in mind the xxunk lack of enthusiasm shown towards the film here and elsewhere when | positive | positive |
4 | xxbos xxmaj this is not really a zombie film , if we 're xxunk zombies as the dead walking around . xxmaj here the protagonist , xxmaj xxunk xxmaj louque ( played by an unbelievably young xxmaj dean xxmaj xxunk ) , xxunk control of a method to create zombies , though in fact , his ' method ' is to mentally project his thoughts and control other living people 's minds turning them into hypnotized slaves . xxmaj this is an interesting concept for a movie , and was done much more effectively by xxmaj xxunk xxmaj lang in his series of ' dr . xxmaj mabuse ' films , including ' dr . xxmaj mabuse the xxmaj xxunk ' ( 1922 ) and ' the xxmaj testament of xxmaj dr . xxmaj mabuse ' ( 1933 ) . xxmaj here it is unfortunately xxunk to his quest to | negative | positive |
5 | xxbos " once upon a time there was a charming land called xxmaj france … . xxmaj people lived happily then . xxmaj the women were easy and the men xxunk in their favorite xxunk : war , the only xxunk of xxunk which the people could enjoy . " xxmaj the war in question was the xxmaj seven xxmaj year 's xxmaj war , and when it was noticed that there were more xxunk of soldiers than soldiers , xxunk were sent out to xxunk the ranks . \n\n xxmaj and so it was that xxmaj fanfan ( gerard xxmaj philipe ) , caught xxunk a farmer 's daughter in a pile of hay , escapes marriage by xxunk in the xxmaj xxunk xxunk … but only by first believing his future as xxunk by a gypsy , that he will win fame and fortune in xxmaj his xxmaj | positive | positive |
6 | xxbos xxup ok , let me again admit that i have n't seen any other xxmaj xxunk xxmaj ivory ( the xxunk ) films . xxmaj nor have i seen more celebrated works by the director , so my capacity to xxunk xxmaj before the xxmaj rains outside of analysis of the film itself is xxunk . xxmaj with that xxunk , let me begin . \n\n xxmaj before the xxmaj rains is a different kind of movie that does n't know which genre it wants to be . xxmaj at first , it pretends to be a romance . xxmaj in most romances , the protagonist falls in love with a supporting character , is separated from the supporting character , and is ( sometimes ) united with his or her partner . xxmaj this movie 's hero has already won the heart of his lover but can not | negative | negative |
7 | xxbos xxmaj first off , anyone looking for meaningful " outcome xxunk " cinema that packs some sort of social message with meaningful performances and soul searching dialog spoken by dedicated , xxunk , heartfelt xxunk , please leave now . xxmaj you are wasting your time and life is short , go see the new xxmaj xxunk xxmaj jolie movie , have a good cry , go out & buy a xxunk car or throw away your conflict xxunk if that will make you feel better , and leave us alone . \n\n xxmaj do n't let the door hit you on the way out either . xxup the xxup incredible xxup melting xxup man is a grade b minus xxunk horror epic shot in the xxunk of xxmaj oklahoma by a young , xxup tv friendly cast & crew , and concerns itself with an astronaut who is | positive | negative |
8 | xxbos " national xxmaj treasure " ( 2004 ) is a thoroughly misguided xxunk - xxunk of plot xxunk that borrow from nearly every xxunk and dagger government conspiracy cliché that has ever been written . xxmaj the film stars xxmaj nicholas xxmaj cage as xxmaj benjamin xxmaj xxunk xxmaj xxunk ( how precious is that , i ask you ? ) ; a seemingly normal fellow who , for no other reason than being of a xxunk of like - minded misguided fortune hunters , decides to steal a ' national treasure ' that has been hidden by the xxmaj united xxmaj states xxunk fathers . xxmaj after a bit of subtext and background that plays laughably ( unintentionally ) like xxmaj indiana xxmaj jones meets xxmaj the xxmaj patriot , the film xxunk into one misguided xxunk after another attempting to create a ' stanley xxmaj xxunk | negative | negative |
= "I really liked the movie!"
review_text learn.predict(review_text)
('positive', tensor(1), tensor([0.0174, 0.9826]))
Questionnaire
- Do you need these for deep learning?
- Lots of Math (FALSE).
- Lots of Data (FALSE).
- Lots of expensive computers (FALSE).
- A PhD (FALSE).
- Name five areas where deep learning is now the best tool in the world
- Natural Language Processing (NLP).
- Computer vision.
- Medicine.
- Image generation.
- Recommendation systems.
- What was the name of the first device that was based on the principle of the artificial neuron?
- Mark I Perceptron.
- Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
- A series of processing units.
- A state of activation.
- An output function for each unit.
- A pattern of connectivity among units.
- A propagation rule for propagating patterns of activities through the network of connectivities.
- An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit.
- A learning rule whereby patterns of connectivity are modified by experience.
- An environment within which the system must operate.
- What were the two theoretical misunderstandings that held back the field of neural networks?
- Using multiple layers of the device would allow limitations of one layer to be addressed—this was ignored.
- More than two layers are needed to get practical, good performance; only in the last decade has this been more widely appreciated and applied.
- What is a GPU?
- A Graphical Processing Unit, which can perform thousands of tasks at the same time.
- Open a notebook and execute a cell containing `1+1`. What happens?
- Depending on the server, it may take some time for the output to generate, but running this cell will output `2`.
- Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.
- (I did this for the notebook shared for Lesson 1).
- Complete the Jupyter Notebook online appendix.
- Done. Will reference some of it again.
- Why is it hard to use a traditional computer program to recognize images in a photo?
- Because it’s hard to give a computer clear, step-by-step instructions for recognizing the contents of images.
- What did Samuel mean by “weight assignment”?
- A particular choice for weights (variables)
- What term do we normally use in deep learning for what Samuel called “weights”?
- Parameters
- Draw a picture that summarizes Samuel’s view of a machine learning model
- input and weights -> model -> results -> performance -> update weights/inputs
- Why is it hard to understand why a deep learning model makes a particular prediction?
- Because a deep learning model has many layers and connectivities and activations between neurons that are not intuitive to our understanding.
- What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?
- Universal approximation theorem.
- What do you need in order to train a model?
- Labeled data (Inputs and targets).
- Architecture.
- Initial weights.
- A measure of performance (loss, accuracy).
- A way to update the model (SGD).
- How could a feedback loop impact the rollout of a predictive policing model?
- The model will end up predicting where arrests are made, not where crime is taking place, so more police officers will go to locations where more arrests are predicted and feed that data back to the model which will reinforce the prediction of arrests in those areas, continuing this feedback loop of predictions -> arrests -> predictions.
- Do we always have to use 224x224-pixel images with the cat recognition model?
- No, that’s just the convention for image recognition models.
- You can use larger images but it will slow down the training process (it takes longer to open up bigger images).
- What is the difference between classification and regression?
- Classification predicts discrete classes or categories.
- Regression predicts continuous values.
- What is a validation set? What is a test set? Why do we need them?
- A validation set is a dataset upon which a model’s accuracy (or metrics in general) is calculated during training, as well as the dataset upon which the performance of different hyperparameters (like batch size and learning rate) are measured.
- A test set is a dataset upon which a model’s final performance is measured, a truly unseen dataset for both the model and the practitioner
- What will fastai do if you don’t provide a validation set?
- Set aside a random 20% of the data as the validation set by default
- Can we always use a random sample for a validation set? Why or why not?
- No, in situations where we want to ensure that the model’s accuracy is evaluated on data the model has not seen, we should not use a random validation set. Instead, we should create an intentional validation set. For example:
- For time series data, use the most recent dates as the validation set
- For human recognition data, use images of different people for training and validation sets
- What is overfitting? Provide an example.
- Overfitting is when a model memorizes features of the training dataset instead of learning generalizations of the features in the data. An example of this is when a model memorizes training data facial features but then cannot recognize different faces in the real world. Another example is when a model memorizes the handwritten digits in the training data, so it cannot then recognize digits written in different handwriting. Overfitting can be observed during training when the validation loss starts to increase as the training loss decreases.
- What is a metric? How does it differ from loss?
- A metric is a measurement of how well a model is performing, chosen for human consumption. A loss is also a measurement of how well a model is performing, but it’s chosen to drive training using an optimizer.
- How can pretrained models help?
- Pretrained models are already good at recognizing many generalized features and so they can help by providing a set of weights in an architecture that are capable, reducing the amount of time you need to train a model specific to your task.
- What is the “head” of the model?
- The last/top few neural network layers which are replaced with randomized weights in order to specialize your model via training on the task at hand (and not the task it was pretrained to perform).
- What kinds of features do the early layers of a CNN find? How about the later layers?
- Early layers: simple features like lines and color gradients
- Later layers: complex features like dog faces, outlines of people
- Are image models useful only for photos?
- No! Lots of things can be represented by images, so if you can represent something (like a sound) as an image (a spectrogram), and the differences between classes/categories are easily recognizable by the human eye, you can train an image classifier to recognize it.
- What is an architecture?
- A template mathematical function to which you pass input data in order to fit/train a model
- What is segmentation?
- Recognizing the content of every individual pixel in an image so that each object is outlined (in the output, each object category is shown as a different color)
- What is `y_range` used for? When do we need it?
  - It’s used to constrain a model’s outputs to a specified range. We need it when the target is a continuous value within a known range (see the sketch after this questionnaire).
- What are hyperparameters?
- Modeling choices such as network architecture, learning rates, data augmentation strategies and other higher level choices that govern the meaning of the weight parameters.
- What is the best way to avoid failures when using AI in an organization?
- Making sure you have good validation and test sets to evaluate the performance of a model on real world data.
- Trying out a simple baseline model to know what level of performance such a model can achieve.
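As a concrete illustration of `y_range` (my own addition, not from the book): fastai implements it with a `SigmoidRange` layer that squashes raw activations into the requested interval. The sketch below assumes a rating-style target between 0.5 and 5.5; the `dls` object in the commented line is hypothetical.

from fastai.vision.all import *
import torch

# SigmoidRange is the layer behind y_range: sigmoid(x) rescaled into (lo, hi)
sr = SigmoidRange(0.5, 5.5)
print(sr(torch.tensor([-10., 0., 10.])))  # roughly 0.50, 3.00, 5.50

# learn = vision_learner(dls, resnet18, y_range=(0.5, 5.5))  # hypothetical regression learner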
Further Research
- Why is a GPU useful for deep learning? How is a CPU different, and why is it less effective for deep learning?
- CPU vs GPU for Machine Learning
- CPUs process tasks in a sequential manner, GPUs process tasks in parallel.
- GPUs can have thousands of cores, processing tasks at the same time.
- GPUs have many cores processing at low speeds, CPUs have few cores processing at high speeds.
- Some algorithms are optimized for CPUs rather than GPUs (time series data, recommendation systems that need lots of memory).
- Neural networks are designed to process tasks in parallel.
- CPU vs GPU in Machine Learning Algorithms: Which is Better?
- Machine Learning Operations Preferred on CPUs
- Recommendation systems that involve huge memory for embedding layers.
- Support vector machines, time-series data, algorithms that don’t require parallel computing.
- Recurrent neural networks because they use sequential data.
- Algorithms with intensive branching.
- Machine Learning Operations Preferred on GPUs
- Operations that involve parallelism.
- Why Deep Learning Uses GPUs
- Neural networks are specifically made for running in parallel.
- Try to think of three areas where feedback loops might impact the use of machine learning. See if you can find documented examples of that happening in practice.
- Hidden Risks of Machine Learning Applied to Healthcare: Unintended Feedback Loops Between Models and Future Data Causing Model Degradation
- If clinicians fully trust the machine learning model (100% adoption of the predicted label) the false positive rate (FPR) grows uncontrollably with the number of updates.
- Runaway Feedback Loops in Predictive Policing
- Once police are deployed based on these predictions, data from observations in the neighborhood is then used to further update the model.
- Discovered crime data (e.g., arrest counts) are used to help update the model, and the process is repeated.
- Predictive policing systems have been empirically shown to be susceptible to runaway feedback loops, where police are repeatedly sent back to the same neighborhoods regardless of the true crime rate.
- Pitfalls of Predictive Policing: An Ethical Analysis
- Predictive policing relies on a large database of previous crime data and forecasts where crime is likely to occur. Since the program relies on old data, those previous arrests need to be unbiased to generate unbiased forecasts.
- People of color are arrested far more often than white people for committing the same crime.
- Racially biased arrest data creates biased forecasts in neighborhoods where more people of color are arrested.
- If the predictive policing algorithm is using biased data to divert more police forces towards less affluent neighborhoods and neighborhoods of color, then those neighborhoods are not receiving the same treatment as others.
- Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say
- The COMPAS algorithm predicts whether a person is “high-risk” (deemed more likely to be arrested in the future); a high-risk score can lead to being imprisoned (instead of sent to rehab) or to longer sentences.
- Can bots discriminate? It’s a big question as companies use AI for hiring
- If an older candidate makes it past the resume screening process but gets confused by or interacts poorly with the chatbot, that data could teach the algorithm that candidates with similar profiles should be ranked lower
- Echo chambers, rabbit holes, and ideological bias: How YouTube recommends content to real users
- We find that YouTube’s algorithm pushes real users into (very) mild ideological echo chambers.
- We found that 14 out of 527 (~3%) of our users ended up in rabbit holes.
- Finally, we found that, regardless of the ideology of the study participant, the algorithm pushes all users in a moderately conservative direction.
Lesson 2: Deployment
I’m going to do things a bit differently than how I approached Lesson 1. Jeremy suggested that we first watch the video without pausing in order to understand what we’re going to do, and then watch it a second time and follow along. I also want to be mindful of how long I’m running my Paperspace Gradient machine (at $0.51/hour) so that I don’t run the machine when I don’t need its GPU.
So, here’s how I’m going to approach Lesson 2:
- Read the Chapter 2 Questionnaire so I know what I’ll be “tested” on at the end
- Watch the video without taking notes or running code
- Rewatch the video and take notes in this notebook
- Add the Kaggle code cells to this notebook and run them in Paperspace
- Read the Gradio tutorial without running code
- Re-read the Gradio tutorial and follow along with my own code
- Read Chapter 2 in the textbook and run code in this notebook in Paperspace
- Read Chapter 2 in the textbook and take notes in this notebook (including answers to the Questionnaire)
With this approach, I’ll have a big picture understanding of each step of the lesson and I’ll minimize the time I’m spending running my Paperspace Gradient machine.
Video Notes
Link to this lesson’s video.
- In this lesson we’re doing things that haven’t been done in courses like this before.
- Resource: aiquizzes.com—I signed up and answered a couple of questions.
- Don’t forget the FastAI Forums
- Click “Summarize this Topic” to get a list of the most upvoted posts
- How do we go about putting a model in production?
- Figure out what problem you want to solve
- Figure out how to get data for it
- Gather some data
- Use DuckDuckGo image function
- Download data
- Get rid of images that failed to open
- Data cleaning
- Before you clean your data, train the model
- `ImageClassifierCleaner` can be used to clean (delete or re-label) the wrongly labeled data in the dataset
  - The cleaner orders images by loss so you only need to look at the first few
- Always build a model to find out what things are difficult to recognize in your data and to find the things the model can help you find that are problems in the data
- Train your model again
- Deploy to HuggingFace Spaces
- Install Jupyter Notebook Extensions to get features like table of contents and collapsible sections (with which you can also navigate sections using arrow keys)
- Type `??` followed by a function name to get its source code
- Type `?` followed by a function name to get brief info
- If you have nbdev installed, `doc(<fn>)` will give you a link to the documentation
- Different ways to resize an image:
  - `ResizeMethod.Squish` (to see the whole picture, with a different aspect ratio)
  - `ResizeMethod.Pad` (whole image in the correct aspect ratio)
- Data Augmentation
  - `RandomResizedCrop` (a different bit of the image every time)
  - `batch_tfms=aug_transforms()` (images get turned, squished, warped, saturated, recolored, etc.)
  - Use it if you are training for more than 5-10 epochs
  - The resizing/cropping/etc. happens in memory, in real time
- Confusion matrix (`ClassificationInterpretation`)
  - Only meaningful for category labels
  - Shows what category errors your model is making (actual vs. predicted)
  - In a lot of situations this will let you know what the hard categories to classify are (e.g. breeds of pets that are hard to identify)
- `.plot_top_losses` tells us where the loss is the highest (prediction/actual/loss/probability)
  - A loss will be bad (high) if we are wrong + confident or right + unconfident
- On your computer, normal RAM rarely fills up completely because the OS can swap memory out to the hard disk. GPUs don’t do swapping, so do only one thing at a time on the GPU so you’re not using up all of its memory.
- Gradio + HuggingFace Spaces
- Here is my Hello World HuggingFace Space!
- Next, we’ll put a deep learning model in production. In the code cells below, I will train and export a dog vs cat classifier.
# import all the stuff we need from fastai
from fastai.vision.all import *
from fastbook import *
# download and decompress our dataset
path = untar_data(URLs.PETS)/'images'
# define a function to label our images
def is_cat(x): return x[0].isupper()
# create `DataLoaders`
dls = ImageDataLoaders.from_name_func('.',
    get_image_files(path),
    valid_pct = 0.2,
    seed = 42,
    label_func = is_cat,
    item_tfms = Resize(192))
# view batch
dls.show_batch()
# train our model using resnet18 to keep it small and fast
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.199976 | 0.072374 | 0.020298 | 00:19 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.061802 | 0.081512 | 0.020974 | 00:20 |
1 | 0.047748 | 0.030506 | 0.010149 | 00:18 |
2 | 0.021600 | 0.026245 | 0.006766 | 00:18 |
# export our trained learner
learn.export('model.pkl')
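As a quick sanity check of the exported file, here is a minimal inference sketch (my own addition, not from the lesson). It loads `model.pkl` back with `load_learner` and predicts on a hypothetical image file `test_cat.jpg`.

from fastai.vision.all import *

# `is_cat` must be defined when loading, since the exported learner references it
def is_cat(x): return x[0].isupper()

learn_inf = load_learner('model.pkl')
pred, pred_idx, probs = learn_inf.predict('test_cat.jpg')  # hypothetical image path
print(f'Is it a cat?: {pred}; probability: {probs[pred_idx]:.4f}')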
- Following the script in the video, as well as the `git-lfs` and `requirements.txt` setup in Tanishq Abraham’s tutorial, I deployed a Dog and Cat Classifier on HuggingFace Spaces (a minimal Gradio app sketch follows at the end of these notes).
- If you run the training for long enough (a high number of epochs) the error rate will get worse. We’ll learn why in a future lesson.
- Use fastsetup to set up your local machine with Python and Jupyter.
- They recommend using mamba instead of conda as it is faster.
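For reference, here is a minimal sketch of the kind of `app.py` such a HuggingFace Space can use, based on my reading of the Gradio tutorial rather than the exact script from the video; the category names, label ordering, and file paths are assumptions.

# app.py (sketch) -- serve the exported classifier with Gradio
from fastai.vision.all import *
import gradio as gr

# the exported learner references `is_cat`, so it must exist at load time
def is_cat(x): return x[0].isupper()

learn = load_learner('model.pkl')
categories = ('Dog', 'Cat')  # assumed order: the vocab is (False, True) for is_cat

def classify_image(img):
    pred, idx, probs = learn.predict(PILImage.create(img))
    return dict(zip(categories, map(float, probs)))

demo = gr.Interface(fn=classify_image, inputs=gr.Image(), outputs=gr.Label())
demo.launch()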
Notebook Exercise
In the cells below, I’ll run the code provided in the Chapter 2 notebook.
# prepare path and subfolder names
bear_types = 'grizzly', 'black', 'teddy'
path = Path('bears')
# download images of grizzly, black and teddy bears
if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok = True)
        results = search_images_ddg(f'{o} bear')
        download_images(dest, urls = results)
# view file paths
fns = get_image_files(path)
fns
(#570) [Path('bears/grizzly/ca9c20c9-e7f4-4383-b063-d00f5b3995b2.jpg'),Path('bears/grizzly/226bc60a-8e2e-4a18-8680-6b79989a8100.jpg'),Path('bears/grizzly/2e68f914-0924-42ed-9e2e-19963fa03a37.jpg'),Path('bears/grizzly/38e2d057-3eb2-4e8e-8e8c-fa409052aaad.jpg'),Path('bears/grizzly/6abc4bc4-2e88-4e28-8ce4-d2cbdb05d7b5.jpg'),Path('bears/grizzly/3c44bb93-2ac5-40a3-a023-ce85d2286846.jpg'),Path('bears/grizzly/2c7b3f99-4c8e-4feb-9342-dacdccf60509.jpg'),Path('bears/grizzly/a59f16a6-fa06-42d5-9d79-b84e130aa4e3.jpg'),Path('bears/grizzly/d1be6dc8-da42-4bee-ac31-0976b175f1e3.jpg'),Path('bears/grizzly/7bc0d3bd-a8dd-477a-aa16-449124a1afb5.jpg')...]
# get list of corrupted images
failed = verify_images(fns)
failed
(#24) [Path('bears/grizzly/2e68f914-0924-42ed-9e2e-19963fa03a37.jpg'),Path('bears/grizzly/f77cfeb5-bfd2-4c39-ba36-621f117a65f6.jpg'),Path('bears/grizzly/37aa7eed-5a83-489d-b8f5-54020ba41390.jpg'),Path('bears/black/90a464ad-b0a7-4cf5-86ff-72d507857007.jpg'),Path('bears/black/f03a0ceb-4983-4b8f-a001-84a0875704e8.jpg'),Path('bears/black/6193c1cf-fda4-43f9-844e-7ba7efd33044.jpg'),Path('bears/teddy/474bdbb3-de2f-49e5-8c5b-62b4f3f50548.JPG'),Path('bears/teddy/58755f3f-227f-4fad-badc-a7d644e54296.JPG'),Path('bears/teddy/eb55dc00-3d01-4385-a7da-d81ac5211696.jpg'),Path('bears/teddy/97eadc96-dc4e-4b3f-8486-88352a3b2270.jpg')...]
# remove corrupted image files
failed.map(Path.unlink)
(#24) [None,None,None,None,None,None,None,None,None,None...]
# create DataBlock for training
bears = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = get_image_files,
    splitter = RandomSplitter(valid_pct = 0.2, seed = 42),
    get_y = parent_label,
    item_tfms = Resize(128)
)
# create DataLoaders object
dls = bears.dataloaders(path)
# view training batch -- looks good!
dls.show_batch(max_n = 4, nrows = 1)
# view validation batch -- looks good!
dls.valid.show_batch(max_n = 4, nrows = 1)
# observe how images react to the "squish" ResizeMethod
bears = bears.new(item_tfms = Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
Notice how the grizzlies in the third image look abnormally skinny, since the image is squished.
# observe how images react to the "pad" ResizeMethod
bears = bears.new(item_tfms = Resize(128, ResizeMethod.Pad, pad_mode = 'zeros'))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
In these images, the original aspect ratio is maintained.
# observe how images react to the transform RandomResizedCrop
bears = bears.new(item_tfms = RandomResizedCrop(128, min_scale = 0.3))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n = 4, nrows = 1)
# observe how images react to data augmentation transforms
bears = bears.new(item_tfms=Resize(128), batch_tfms = aug_transforms(mult = 2))
dls = bears.dataloaders(path)
# note that data augmentation occurs on training set
dls.train.show_batch(max_n = 8, nrows = 2, unique = True)
# train the model in order to clean the data
bears = bears.new(
    item_tfms = RandomResizedCrop(224, min_scale = 0.5),
    batch_tfms = aug_transforms())

dls = bears.dataloaders(path)
dls.show_batch()
# train the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 100MB/s]
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.221027 | 0.206999 | 0.055046 | 00:34 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.225023 | 0.177274 | 0.036697 | 00:32 |
1 | 0.162711 | 0.189059 | 0.036697 | 00:31 |
2 | 0.144491 | 0.191644 | 0.027523 | 00:31 |
3 | 0.122036 | 0.188296 | 0.018349 | 00:31 |
# view Confusion Matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
The model confused a grizzly for a black bear and a black bear for a grizzly bear. It didn’t confuse any of the teddy bears, which makes sense given how different they look to real bears.
# view images with the highest losses
interp.plot_top_losses(5, nrows = 1)
The fourth image has two humans in it, which is likely why the model didn’t recognize the bear. The model correctly predicted the third and fifth images but with low confidence (57% and 69%).
# clean the training and validation sets
from fastai.vision.widgets import *
cleaner = ImageClassifierCleaner(learn)
cleaner
I cleaned up the images (deleting an image of a cat, another of a cartoon bear, a dog, and a blank image).
# delete or move images based on the dropdown selections made in the cleaner
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)
# create new dataloaders object
bears = bears.new(
    item_tfms = RandomResizedCrop(224, min_scale = 0.5),
    batch_tfms = aug_transforms())

dls = bears.dataloaders(path)
dls.show_batch()
# retrain the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.289331 | 0.243501 | 0.074074 | 00:32 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.225567 | 0.256021 | 0.064815 | 00:32 |
1 | 0.218850 | 0.288018 | 0.055556 | 00:34 |
2 | 0.184954 | 0.315183 | 0.055556 | 00:31 |
3 | 0.141363 | 0.308634 | 0.055556 | 00:31 |
Weird!! After cleaning the data, the model got worse (1.8% error rate is now 5.6%). I’ll run the cleaning routine again and retrain the model to see if it makes a difference. Perhaps there are still erroneous images in the mix.
# view Confusion Matrix
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
This time, the model incorrectly predicted 3 grizzlies as black bears, 2 black bears as grizzlies and 1 black bear as a teddy.
cleaner = ImageClassifierCleaner(learn)
cleaner
# delete or move images based on the dropdown selections made in the cleaner
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)
# create new dataloaders object
bears = bears.new(
    item_tfms = RandomResizedCrop(224, min_scale = 0.5),
    batch_tfms = aug_transforms())

dls = bears.dataloaders(path)
# The lower right image (cartoon bear) is one that I selected "Delete" for
# in the cleaner so I'm not sure why it's still there
# I'm wondering if there's something wrong with the cleaner or how I'm using it?
dls.show_batch()
# retrain the model
learn = vision_learner(dls, resnet18, metrics = error_rate)
learn.fine_tune(4)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.270627 | 0.130137 | 0.046729 | 00:31 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.183445 | 0.078030 | 0.028037 | 00:32 |
1 | 0.201080 | 0.053461 | 0.018692 | 00:33 |
2 | 0.183515 | 0.019479 | 0.009346 | 00:37 |
3 | 0.144900 | 0.012682 | 0.000000 | 00:31 |
I’m still not confident that this is a 100% accurate model given the bad images in the training set (such as the cartoon bear) but I’m going to go with it for now.
Book Notes
Chapter 2: From Model to Production
- Underestimating the constraints and overestimating the capabilities of deep learning may lead to frustratingly poor results, at least until you gain some experience and can solve the problems that arise.
- Overestimating the constraints and underestimating the capabilities of deep learning may mean you do not attempt a solvable problem because you talk yourself out of it.
- The most important thing (as you learn deep learning) is to ensure that you have a project to work on.
- The goal is not to find the “perfect” dataset or project, but just to get started and iterate from there.
- Complete every step as well as you can in a reasonable amount of time, all the way to the end.
- Computer vision
- Object recognition: recognize items in an image
- Object detection: recognition + highlight the location and name of each found object.
- Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style from those used to train the model.
- NLP
- Deep learning is not good at generating correct responses.
- Text generation models will always be technologically a bit ahead of models for recognizing automatically generated text.
- Google’s online translation system is based on deep learning.
- Combining text and images
- A deep learning model can be trained on input images with output captions written in English, and can learn to generate surprisingly appropriate captions automatically for new images (with no guarantee the captions will be correct).
- Deep learning should be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely.
- Tabular data
- If you already have a system that is using random forests or gradient boosting machines then switching to or adding deep learning may not result in any dramatic improvement.
- Deep learning greatly increases the variety of columns that you can include.
- Deep learning models generally take longer to train than random forests or gradient boosting machines.
- Recommendation systems
- A special type of tabular data (a high-cardinality categorical variable representing users and another one representing products or something similar).
- Deep learning models are good at handling high cardinality categorical variables and thus recommendation systems.
- Deep learning models do well when combining these variables with other kinds of data such as natural language, images, or additional metadata represented as tables such as user information, previous transactions, and so forth.
- Nearly all machine learning approaches have the downside that they tell you only which products a particular user might like, rather than what recommendations would be helpful for a user.
- Other data types
- Using NLP deep learning methods is the current SOTA approach for many types of protein analysis since protein chains look a lot like natural language documents.
- The Drivetrain Approach
- Defined objective
- Levers (what inputs can we control)
- Data (what inputs we can collect)
- Models (how the levers influence the objective)
- Gathering data
- For most projects you can find the data online.
- Use `duckduckgo_search`
- From Data to DataLoaders
  - `DataLoaders` is a thin class that just stores whatever `DataLoader` objects you pass to it and makes them available as `train` and `valid`.
  - To turn data into a `DataLoaders` object we need to tell fastai four things:
    - What kinds of data we are working with.
    - How to get the list of items.
    - How to label these items.
    - How to create the validation set.
  - With the `DataBlock` API you can customize every stage of the creation of your `DataLoaders`:
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))
- Explanation of `DataBlock`:
  - `blocks` specifies types for the independent (the thing we are using to make predictions from) and dependent (our target) variables.
  - Computers don’t really know how to create random numbers at all, but simply create lists of numbers that look random; if you provide the same starting point for that list each time (called the seed) then you will get the exact same list each time.
  - Images need to be all the same size.
  - A `DataLoader` is a class that provides batches of a few items at a time to the GPU.
  - fastai’s default batch size is 64 items.
  - `Resize` crops the images to fit a square shape; alternatively you can pad (`ResizeMethod.Pad`) or squish (`ResizeMethod.Squish`) the images to fit the square.
  - Squishing (the model learns that things look different from how they actually are), cropping (removal of features that would allow us to perform recognition) and padding (a lot of empty space which is just wasted computation) are all wasteful or problematic approaches. Instead, randomly select part of the image and then crop to just that part. On each epoch, we randomly select a different part of each image (`RandomResizedCrop(min_scale)`).
  - Training the neural network with examples of images in which objects are in slightly different places and are slightly different sizes helps it to understand the basic concept of what an object is and how it can be represented in an image.
- Data Augmentation
  - Refers to creating random variations of our input data, such that they appear different but do not change the meaning of the data (rotation, flipping, perspective warping, brightness changes, and contrast changes).
  - `aug_transforms()` provides a standard set of augmentations.
  - Use `batch_tfms` to process a batch at a time on the GPU to save time.
- Training your model and using it to clean your data
  - View the confusion matrix with `ClassificationInterpretation.from_learner(learn)`. The diagonal shows images that are classified correctly. Calculated using the validation set.
  - Sort images by loss using `interp.plot_top_losses()`.
  - Loss is high if the model is incorrect (especially if it’s also confident) or if it’s correct but not confident.
  - A model can help you find data issues more quickly.
- Using the model for inference
  - `learn.export()` will export a .pkl file.
  - Get predictions with `learn_inf.predict(<input>)`. This returns three things: the predicted category in the same format you originally provided, the index of the predicted category, and the probabilities for each category.
  - You can access the `DataLoaders` as an attribute of the `Learner`: `learn_inf.dls`.
- Deploying your app
- You almost certainly do not need a GPU to serve your model in production.
- A GPU is only worth it if you can batch up many users’ images at a time (a high-volume scenario). If you do have this scenario, consider Microsoft’s ONNX Runtime or AWS SageMaker.
- It’s recommended, wherever possible, to deploy the model itself to a server and have your mobile/edge application connect to it as a web service.
- If your application uses sensitive data, your users may be concerned about an approach that sends that data to a remote server.
- How to Avoid Disaster
- Understanding and testing the behavior of a deep learning model is much more difficult than with most other code you write.
- The kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matter, which isn’t the kind of input this system is going to be getting in real life. We may need to do a lot of our own data collection and labeling to create a useful system.
- out-of-domain data: data that our model sees in production that is very different from what it saw during training.
- domain shift: data that our model sees changes over time.
- Deployment process
- Manual Process: run model in parallel, humans check all predictions.
- Limited scope deployment: careful human supervision, time or geography limited.
- Gradual expansion: good reporting systems needed, consider what could go wrong.
- Unforeseen consequences and feedback loops
- Your model may change the behavior of the system it’s a part of.
- feedback loops can result in negative implications of bias getting worse.
- A helpful exercise prior to rolling out a significant machine learning system is to consider the question “What would happen if it went really, really well?”
- Questionnaire
- Where do text models currently have a major deficiency?
- Providing correct or accurate information.
- What are possible negative societal implications of text generation models?
- The viral spread of misinformation, which can lead to real actions and harms.
- In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?
- Run the model in parallel with a human checking its predictions.
- What kind of tabular data is deep learning particularly good at?
- High-cardinality categorical data.
- What’s a key downside of directly using a deep learning model for recommendation systems?
- It will only tell you which products a particular user might like, rather than what recommendations may be helpful for a user.
- What are the steps of the Drivetrain Approach?
- Define an objective
- Determine what inputs (levers) you can control
- Collect data
- Create models (how the levers influence the objective)
- How do the steps of the Drivetrain Approach map to a recommendation system?
- Objective: drive additional sales due to recommendations.
- Levers: ranking of the recommendations.
- Data: must be collected to generate recommendations that will cause new sales.
- Models: two models of purchase probabilities, conditional on seeing or not seeing a recommendation; the difference between these two probabilities is a utility function for a given recommendation to a customer (low in cases when the algorithm recommends a familiar book that the customer has already rejected, or a book they would have bought even without the recommendation).
- Create an image recognition model using data you curate, and deploy it on the web.
- Here.
- What is `DataLoaders`?
  - A class that creates the validation and training sets/batches that are fed to the GPU.
- What four things do we need to tell fastai to create `DataLoaders`?
  - What kinds of data we are working with (independent and dependent variables).
  - How to get the list of items.
  - How to label these items.
  - How to create the validation set.
- What does the
splitter
parameter toDataBlock
do?- Set aside a percentage of the data as the validation set.
- How do we ensure a random split always gives the same validation set?
- Set the
seed
parameter to the same value.
- Set the
- What letters are often used to signify the independent and dependent variables?
- Independent: x
- Dependent: y
- What’s the difference between crop, pad and squish resize approaches? When might you choose one over the others?
- Crop: takes a section of the image and resizes it to the desired size. Use when it’s not necessary to have the model train on the whole image.
- Pad: keep the image aspect ratio as is, add white/black padding to make a square. Use when it’s necessary to have the model train on the whole image.
- Squish: distorts the image to fit a square. Use when it’s not necessary to have the model train on the original aspect ratio.
- What is data augmentation? Why is it needed?
- Data augmentation is the creation of random variations of input data through techniques like rotation, flipping, brightness changes, contrast changes, perspective warping. It is needed to help the model learn to recognize objects under different lighting/perspective conditions.
- Provide an example of where the bear classification model might work poorly in production, due to structural or style differences in the training data.
  - The training images are mostly clear, well-framed web photos, so the model might work poorly on low-resolution or nighttime camera images where the bear is partially obscured.
- What is the difference between `item_tfms` and `batch_tfms`?
  - `item_tfms` are transforms that are applied to each item in the set.
  - `batch_tfms` are transforms applied to a batch of items in the set.
- What is a confusion matrix?
- A matrix that shows the counts of predicted (columns) vs. actual (rows) labels, with the diagonal being correctly predicted data.
- What does `export` save?
  - Both the architecture and the parameters, as a `.pkl` file.
- What is it called when we use a model for making predictions, instead of training?
- Inference
- What are IPython widgets?
- interactive browser controls for Jupyter Notebooks.
- When would you use a CPU for deployment? When might a GPU be better?
- CPU: low-volume, single-user inputs for prediction.
- GPU: high-volume, multiple-user inputs for predictions.
- What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC?
- Requires internet connectivity (and latency).
- Sensitive data transfer may not be okay with your users.
- Managing complexity and scaling the server creates additional overhead.
- What are three examples of problems that could occur when rolling out a bear warning system in practice?
- out-of-domain data: the images captured of real bears may not be represented in the model’s training or validation datasets.
- Number of bear alerts doubles or halves after rollout of the new system in some location.
- out-of-domain data: the cameras may capture low-resolution images of the bears when the training and validation set had high resolution images.
- What is out-of-domain data?
- Data your model sees in production that it hasn’t seen during training.
- What is domain shift?
- Changes in the data that our model sees in production over time.
- What are the three steps in the deployment process?
- Manual Process
- Limited scope deployment
- Gradual expansion
- Further Research
- Consider how the Drivetrain Approach maps to a project or problem you’re interested in.
- I’ll take the example of a project I will be working on to practice what I’m learning in this book: training a deep learning model which correctly classifies the typeface from a collection of single letters.
- The objective: correctly classify typeface from a collection of single letters.
- Levers: observe key features of key letters that are the “tell” of a typeface.
- Data: using an HTML canvas object and Adobe Fonts, generate images of single letters of multiple fonts associated with each category of typeface.
- Models: output the probabilities of each typeface a given collection of single letters is predicted as. This allows for some flexibility in how you categorize letters based on the shared characteristics of more than one typeface that the particular font may possess.
- When might it be best to avoid certain types of data augmentation?
- In my typeface example, it’s best to avoid perspective warping because it will change key features used to recognize a typeface.
- For a project you’re interested in applying deep learning to, consider the thought experiment, “What would happen if it went really, really well?”
- If my typeface classifier works really well, I imagine it would be used by people to take pictures of real-world text and learn what typeface it is. This may inspire a new wave of typeface designers. If a feedback loop was possible, and the classifier went viral, the very definition of typefaces may be affected by popular opinion. Taken a step further, a generative model may be inspired by this classifier, and a new wave of AI typeface would be launched—however this last piece is highly undesirable unless the training of the model involves appropriate licensing and attribution of the typefaces used that are created by humans. Furthermore, from what I understand from reading about typefaces, the process of creating a typeface is an amazing experience and should not be replaced with AI generators. If I created such a generative model (in part 2 of the course) and it went viral (do HuggingFace Spaces go viral? Cuz that’s where I would launch it), I would take it down.
- Start a blog (done!)
Lesson 3: Neural Net Foundations
Video Notes
Link to this lesson’s video.
- How to do a fast.ai lesson
- Watch lecture
- Run notebook & experiment
- Reproduce results
- Repeat with different dataset
- fastbook repo contains “clean” folder with notebooks without markdown text.
- Two concepts: training the model and using it for inference.
- Over 500 architectures in
timm
(PyTorch Image Models). timm.list_models(pattern)
will list models matching the pattern.- Pass string name of timm model to the
Learner
like:vision_learner(dls, 'timm model string', ...)
. in22
= ImageNet with 22k categories,1k
= ImageNet with 1k categories.learn.predict
probabilities are in the order oflearn.dls.vocab
.learn.model
contains the trained model which contains lots of nested layers.learn.model.get_submodule
takes a dotted string navigating through the hierarchy.- Machine learning models fit functions to data.
- Things between dollar signs is LaTeX
"$...$"
. - General form of quadratic:
def quad(a,b,c,x): return a*x**2 + b*x + c
partial
fromfunctools
fixes parameters to a function.- Loss functions tells us how good our model is.
@interact
fromipywidgets
allows sliders tied to the function its above.- Mean Squared Error:
def mse(preds, acts): return ((preds - acts)**2).mean()
- For each parameter we need to know: does the loss get better when we increase or decrease the parameter?
- The derivative is the function that tells you: if you increase the input does the output increase or decrease, and by how much?
*params
spreads out the list into its elements and passes each to the function.- 1-D (rank 1) tensor (lists of numbers), 2-D tensor (tables of numbers) 3-D tensor (layers of tables of numbers) and so on.
tensor.requires_grad_()
calculates the gradient of the values in the tensor whenever its used in calculation.loss.backward()
calculates gradients on the inputs to the loss function.abc.grad
attribute added after gradients are calculated.- negative gradient means increasing the parameter will decrease the loss.
- update parameters
with torch.no_grad()
so PyTorch doesn’t calculate the gradient (since it’s being used in a function). We don’t want the derivative of the parameter update, we only want the derivative with respect to the loss. - Automate the steps
- Calculate Mean Squared Error
- Call
.backward.
- Subtract gradient * small number from the parameters
- All optimizers are built on the concept of gradient descent (calculate gradients and decrease the loss).
- We need a better function than quadratics
- Rectified Linear Unit:
def rectified_linear(m,b,x):
= m*x + b
y return torch.clip(y, 0.)
torch.clip
turns values less than value specified to the value specified (in this case, it turns negative values to 0.).- Adding rectified linear functions together gives us an arbitrarily squiggly function that will match as close as we want to the data.
- ReLU in 2D gives you surfaces, volumes in 3D, etc.
- With this incredibly simple foundation you can construct an arbitrarily precise, accurate model.
- When you have ReLU’s getting added together, and gradient descent to optimize the parameters, and samples of inputs and outputs that you want, the computer “draws the owl” so to speak.
- Deep learning is using gradient descent to set some parameters to make a wiggly function (the addition of lots of rectified linear units or something very similar to that) that matches your data.
- When selecting an architecture, the biggest beginner mistake is that they jump to the highest-accuracy models.
- At the start of the project, just use resnet18 so you can spend all of your time trying things out (data augmentation, data cleaning, different external data) as fast as possible.
- Trying better architectures is the very last thing to do.
- How do I know if I have enough data?
- Vast majority of projects in industry wait far too long until they train their first model.
- Train your first model on day 1 with whatever CSV files you can hack together.
- Semi-supervised training lets you get dramatically more out of your data.
- Often it’s easy to get lots of inputs but hard to get lots of outputs (labels).
- Units of parameter gradients: for each increase in parameter of 1, the gradient is the amount the loss would change by (if it stayed at that slope—which it doesn’t because it’s a curve).
- Once you get close enough to the optimal parameter value, all loss functions look like quadratics
- The slope of the loss function decreases as you approach the optimal
- Learning rate (a hyperparameter) is multiplied by the gradient, the product of which is subtracted from the parameters
- If you pick a learning rate that’s too large, you will diverge; if you pick too small, it’ll take too long to train.
- http://matrixmultiplication.xyz/
- Matrix multiplication is the critical foundational mathematical operation in deep learning
- GPUs are good at matrix multiplication with tensor cores (multiply together two 4x4 matrices)
- Use a spreadsheet to train a deep learning model on the Kaggle Titanic dataset in which you’re trying to predict if a person survived.
- Columns included (convert some of them to binary categorical variables):
- Survivor
- Pclass
- Convert to Pclass_1 and Pclass_2 (both 1/0).
- Sex
- Convert to Male (0/1) column.
- Age
- Remove blanks.
- Normalize (Age/Max(Age))
- SibSp (how many siblings they have)
- Parch (# of parents/children aboard)
- Fare
- Lots of very small and very large fares; the log of it has a much more even distribution (LOG10(Fare + 1)).
- Embarked (which city they got on at)
- Remove blanks.
- Convert to Embark_S and Embark_C (both 1/0)
- Ones
- Add a column of 1s.
- Create random numbers for params (including Const) with
=RAND() - 0.5
. - Regression
- Use
SUMPRODUCT
to calculate linear function. - Loss of linear function is (linear function result - Survived) ^ 2.
- Average loss = AVERAGE(individual losses).
- Use “Solver” with the GRG Nonlinear Solving Method. Set the Objective to minimize the cell with the average loss, changing the parameter variables.
- Use
- Neural Net
- Two sets of params.
- Two linear columns.
- Two ReLU columns.
- Adding two linear functions together just gives you another linear function; we want all those wiggles (non-linearity), so we use ReLUs (see the PyTorch sketch after these notes).
- ReLU:
IF(lin1 < 0, 0, lin1)
- Preds = sum of the two ReLUs.
- Loss same as regression.
- Solver process the same as well.
- Neural Net (Matrix Multiplication)
- Transpose params into two columns.
=MMULT(...)
for Lin1 and Lin2 columns.- Keep ReLU, Preds and Loss column the same.
- Optimize params using Solver.
- Helpful reminder to build intuition around matrix multiplication: it’s doing the same thing as the `SUMPRODUCT`s.
- Dummy variables: Pclass_1, Pclass_2, etc.
- Columns included (convert some of them to binary categorical variables):
- Next lesson: NLP
- It’s about making predictions with text data which most of the time is in the form of prose.
- First Farsi NLP resource was created by a student of the first fastai course.
- NLP most commonly and practically used for classification.
- Document = one or two words, a book, a wikipedia page, any length.
- Classification = figure out a category for a document.
- Sentiment analysis
- Author identification
- Legal discovery (is this document in-scope or out-of-scope)
- Organizing documents by topic
- Triaging inbound emails
- Classification of text looks similar to images.
- We’re going to use a different library: HuggingFace Transformers
- Helpful to see how things are done in more than one library.
- HuggingFace Transformers doesn’t have the same high-level API. Have to do more stuff manually. Which is good for students at this point of the course.
- It’s a good library.
- Before the next lesson take a look at the NLP notebook and U.S. Patent to Phrase Matching data.
- Trying to figure out in patents whether two concepts are referring to the same thing. The document is text1, text2, and the category is similar (1) or not-similar (0).
- Will also talk about the two very important topics of validation sets and metrics.
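To make the spreadsheet version concrete, here is a minimal PyTorch sketch of the same idea (my own addition, not from the lesson): two linear combinations of the input columns, each passed through a ReLU and summed into a prediction, with the parameters optimized by gradient descent on the squared-error loss. The tiny matrix `x` and labels `y` are made-up stand-ins for the normalized Titanic columns and the Survived column.

import torch

torch.manual_seed(0)
n_rows, n_cols = 8, 5                       # stand-ins for passengers and normalized columns (incl. the 'Ones' column)
x = torch.rand(n_rows, n_cols)              # made-up data for illustration
y = torch.randint(0, 2, (n_rows,)).float()  # made-up 'Survived' labels

params = (torch.rand(n_cols, 2) - 0.5).requires_grad_()  # two sets of params, like the two spreadsheet columns

def predict(x, params):
    lin = x @ params                    # matrix multiply = the spreadsheet SUMPRODUCTs
    return torch.relu(lin).sum(dim=1)   # ReLU each linear column, then add them up

for step in range(100):                 # 'Solver' replaced by plain gradient descent
    loss = ((predict(x, params) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        params -= params.grad * 0.1
        params.grad.zero_()

print(f'final loss: {loss:.3f}')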
Notebook Exercise
Training and Deploying: Pets Classifier
In this section, I’ll train a Pets dataset classifier as done by Jeremy in this notebook.
from fastai.vision.all import *
import timm
path = untar_data(URLs.PETS)/'images'

# Create DataLoaders object
dls = ImageDataLoaders.from_name_func('.',
    get_image_files(path),
    valid_pct=0.2,
    seed=42,
    label_func=RegexLabeller(pat = r'^([^/]+)_\d+'),
    item_tfms=Resize(224))
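To sanity-check what that labeling regex extracts, here is a tiny illustration using plain `re` on a couple of example file names (my own addition; `RegexLabeller` applies the pattern to each file’s name in roughly this way):

import re

# the capture group grabs everything before the trailing '_<digits>' in the file name
pat = r'^([^/]+)_\d+'
print(re.match(pat, 'great_pyrenees_173.jpg').group(1))  # great_pyrenees
print(re.match(pat, 'Abyssinian_1.jpg').group(1))        # Abyssinian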
dls.show_batch(max_n=4)
# train using resnet34 as architecture
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /root/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|██████████| 83.3M/83.3M [00:00<00:00, 196MB/s]
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.496086 | 0.316146 | 0.100135 | 01:12 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.441153 | 0.315289 | 0.093369 | 01:04 |
1 | 0.289844 | 0.215224 | 0.069012 | 01:05 |
2 | 0.123374 | 0.191152 | 0.060217 | 01:03 |
The pets classifier, using resnet34 and 3 epochs, is about 94% accurate.
# train using a timm architecture
# from the convnext family of architectures
learn = vision_learner(dls, 'convnext_tiny_in22k', metrics=error_rate).to_fp16()
learn.fine_tune(3)
/usr/local/lib/python3.10/dist-packages/timm/models/_factory.py:114: UserWarning: Mapping deprecated model name convnext_tiny_in22k to current convnext_tiny.fb_in22k.
model = create_fn(
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.130913 | 0.240275 | 0.085927 | 01:06 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.277886 | 0.193888 | 0.061570 | 01:08 |
1 | 0.196232 | 0.174544 | 0.055480 | 01:09 |
2 | 0.127525 | 0.156720 | 0.048038 | 01:07 |
Using convnext_tiny_in22k, the model is about 95.2% accurate, about a 20% decrease in error rate.
# export to use in gradio app
learn.export('pets_model.pkl')
You can view my pets classifier gradio app here.
Which image models are best?
In this section, I’ll plot the timm model results as shown in Jeremy’s notebook.
import pandas as pd
# load data
df_results = pd.read_csv("../../../fastai-course/data/results-imagenet.csv")
df_results.head()
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation | |
---|---|---|---|---|---|---|---|---|---|
0 | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k | 90.052 | 9.948 | 99.048 | 0.952 | 305.08 | 448 | 1.0 | bicubic |
1 | eva02_large_patch14_448.mim_in22k_ft_in22k_in1k | 89.966 | 10.034 | 99.012 | 0.988 | 305.08 | 448 | 1.0 | bicubic |
2 | eva_giant_patch14_560.m30m_ft_in22k_in1k | 89.786 | 10.214 | 98.992 | 1.008 | 1,014.45 | 560 | 1.0 | bicubic |
3 | eva02_large_patch14_448.mim_in22k_ft_in1k | 89.624 | 10.376 | 98.950 | 1.050 | 305.08 | 448 | 1.0 | bicubic |
4 | eva02_large_patch14_448.mim_m38m_ft_in1k | 89.570 | 10.430 | 98.922 | 1.078 | 305.08 | 448 | 1.0 | bicubic |
top1 = what percent of the time the model predicts the correct label with the highest probability.
top5 = what percent of the time the correct label is among the model's 5 highest-probability predictions.
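As a small sketch of how these two numbers could be computed from raw model outputs (my own illustration with made-up data, not the timm benchmark code):

import torch

torch.manual_seed(0)
logits = torch.randn(4, 10)          # made-up predictions: 4 samples, 10 classes
targets = torch.tensor([3, 7, 0, 9])

top1 = (logits.argmax(dim=1) == targets).float().mean()
top5 = (logits.topk(5, dim=1).indices == targets[:, None]).any(dim=1).float().mean()
print(f'top1: {top1:.2f}, top5: {top5:.2f}')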
# remove additional text from model name
df_results['model_org'] = df_results['model']
df_results['model'] = df_results['model'].str.split('.').str[0]
df_results.head()
model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation | model_org | |
---|---|---|---|---|---|---|---|---|---|---|
0 | eva02_large_patch14_448 | 90.052 | 9.948 | 99.048 | 0.952 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k |
1 | eva02_large_patch14_448 | 89.966 | 10.034 | 99.012 | 0.988 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_in22k_ft_in22k_in1k |
2 | eva_giant_patch14_560 | 89.786 | 10.214 | 98.992 | 1.008 | 1,014.45 | 560 | 1.0 | bicubic | eva_giant_patch14_560.m30m_ft_in22k_in1k |
3 | eva02_large_patch14_448 | 89.624 | 10.376 | 98.950 | 1.050 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_in22k_ft_in1k |
4 | eva02_large_patch14_448 | 89.570 | 10.430 | 98.922 | 1.078 | 305.08 | 448 | 1.0 | bicubic | eva02_large_patch14_448.mim_m38m_ft_in1k |
def get_data(part, col):
    # get benchmark data and merge with model data
    df = pd.read_csv(f'../../../fastai-course/data/benchmark-{part}-amp-nhwc-pt111-cu113-rtx3090.csv').merge(df_results, on='model')
    # convert samples/sec to sec/sample
    df['secs'] = 1. / df[col]
    # pull out the family name from the model name
    df['family'] = df.model.str.extract('^([a-z]+?(?:v2)?)(?:\d|_|$)')
    # removing `resnetv2_50d_gn` and `resnet50_gn` for some reason
    df = df[~df.model.str.endswith('gn')]
    # not sure why the following line is here, "in22" was removed in cell above
    df.loc[df.model.str.contains('in22'),'family'] = df.loc[df.model.str.contains('in22'),'family'] + '_in22'
    df.loc[df.model.str.contains('resnet.*d'),'family'] = df.loc[df.model.str.contains('resnet.*d'),'family'] + 'd'
    # only returns subset of families
    return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]
# load benchmark inference data
df = get_data('infer', 'infer_samples_per_sec')
df.head()
model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | param_count_x | top1 | top1_err | top5 | top5_err | param_count_y | img_size | crop_pct | interpolation | model_org | secs | family | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | levit_128s | 21485.80 | 47.648 | 1024 | 224 | 7.78 | 76.526 | 23.474 | 92.872 | 7.128 | 7.78 | 224 | 0.900 | bicubic | levit_128s.fb_dist_in1k | 0.000047 | levit |
13 | regnetx_002 | 17821.98 | 57.446 | 1024 | 224 | 2.68 | 68.746 | 31.254 | 88.536 | 11.464 | 2.68 | 224 | 0.875 | bicubic | regnetx_002.pycls_in1k | 0.000056 | regnetx |
15 | regnety_002 | 16673.08 | 61.405 | 1024 | 224 | 3.16 | 70.278 | 29.722 | 89.528 | 10.472 | 3.16 | 224 | 0.875 | bicubic | regnety_002.pycls_in1k | 0.000060 | regnety |
17 | levit_128 | 14657.83 | 69.849 | 1024 | 224 | 9.21 | 78.490 | 21.510 | 94.012 | 5.988 | 9.21 | 224 | 0.900 | bicubic | levit_128.fb_dist_in1k | 0.000068 | levit |
18 | regnetx_004 | 14440.03 | 70.903 | 1024 | 224 | 5.16 | 72.398 | 27.602 | 90.828 | 9.172 | 5.16 | 224 | 0.875 | bicubic | regnetx_004.pycls_in1k | 0.000069 | regnetx |
# plot the data
import plotly.express as px
w,h = 1000, 800

def show_all(df, title, size):
    return px.scatter(df,
        width=w,
        height=h,
        size=df[size]**2,
        title=title,
        x='secs',
        y='top1',
        log_x=True,
        color='family',
        hover_name='model_org',
        hover_data=[size]
    )

show_all(df, 'Inference', 'infer_img_size')
# plot a subset of the data
subs = 'levit|resnetd?|regnetx|vgg|convnext.*|efficientnetv2|beit|swin'

def show_subs(df, title, size, subs):
    df_subs = df[df.family.str.fullmatch(subs)]
    return px.scatter(df_subs,
        width=w,
        height=h,
        size=df_subs[size]**2,
        title=title,
        trendline='ols',
        trendline_options={'log_x':True},
        x='secs',
        y='top1',
        log_x=True,
        color='family',
        hover_name='model_org',
        hover_data=[size])

show_subs(df, 'Inference', 'infer_img_size', subs)
# plot inference speed vs parameter count
px.scatter(df,
    width=w,
    height=h,
    x='param_count_x',
    y='secs',
    log_x=True,
    log_y=True,
    color='infer_img_size',
    hover_name='model_org',
    hover_data=['infer_samples_per_sec', 'family']
)
# repeat plots for training data
tdf = get_data('train', 'train_samples_per_sec')
show_all(tdf, 'Training', 'train_img_size')
# subset of training data
show_subs(tdf, 'Training', 'train_img_size', subs)
How does a neural net really work?
In this section, I’ll recreate the content in Jeremy’s notebook here, where he walks through a quadratic example of training a function to match the data.
A neural network layer:
- Multiplies each input by a number of values. These values are known as parameters.
- Adds them up for each group of values.
- Replaces the negative numbers with zeros.
# helper functions
from ipywidgets import interact
from fastai.basics import *
# helper functions
plt.rc('figure', dpi=90)

def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    x = torch.linspace(min,max, 100)[:,None]
    if ylim: plt.ylim(ylim)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)
In the `plot_function` definition, I’ll look into why `[:,None]` is added after `torch.linspace(min, max, 100)`.

torch.linspace(-1, 1, 10), torch.linspace(-1, 1, 10).shape
(tensor([-1.0000, -0.7778, -0.5556, -0.3333, -0.1111, 0.1111, 0.3333, 0.5556,
0.7778, 1.0000]),
torch.Size([10]))
torch.linspace(-1, 1, 10)[:,None], torch.linspace(-1, 1, 10)[:,None].shape
(tensor([[-1.0000],
[-0.7778],
[-0.5556],
[-0.3333],
[-0.1111],
[ 0.1111],
[ 0.3333],
[ 0.5556],
[ 0.7778],
[ 1.0000]]),
torch.Size([10, 1]))
`[:, None]` adds a dimension (of length 1) to the tensor.
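A quick check (my own addition) that `[:, None]` does the same thing as `unsqueeze`:

import torch

t = torch.linspace(-1, 1, 10)
print(t[:, None].shape)      # torch.Size([10, 1])
print(t.unsqueeze(1).shape)  # torch.Size([10, 1]) -- equivalent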
Next he fits a quadratic function to data:
def f(x): return 3*x**2 + 2*x + 1
plot_function(f, '$3x^2 + 2x + 1$')
In order to simulate “finding” or “learning” the right model fit, he creates a general quadratic function:
def quad(a, b, c, x): return a*x**2 + b*x + c
and uses `partial` to make new quadratic functions:
def mk_quad(a, b, c): return partial(quad, a, b, c)
# recreating original quadratic with mk_quad
f2 = mk_quad(3, 2, 1)
plot_function(f2)
f2
functools.partial(<function quad at 0x148c6d000>, 3, 2, 1)
quad
<function __main__.quad(a, b, c, x)>
Next he simulates noisy measurements of the quadratic `f`:
# `scale` parameter is the standard deviation of the distribution
def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
# noise function matches quadratic x + x^2 (with noise) + constant noise
def add_noise(x, mult, add): return x * (1+noise(x, mult)) + noise(x,add)
np.random.seed(42)

x = torch.linspace(-2, 2, steps=20)[:, None]
y = add_noise(f(x), 0.15, 1.5)
# values match Jeremy's
x[:5], y[:5]
(tensor([[-2.0000],
[-1.7895],
[-1.5789],
[-1.3684],
[-1.1579]]),
tensor([[11.8690],
[ 6.5433],
[ 5.9396],
[ 2.6304],
[ 1.7947]], dtype=torch.float64))
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x148e16320>
# overlay data with variable quadratic
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
    plt.scatter(x, y)
    plot_function(mk_quad(a, b, c), ylim=(-3,13))
Important note when changing the sliders: only after changing the `b` and `c` values do you realize that `a` also needs to be changed.
Next, he creates a measure for how well the quadratic fits the data, mean absolute error (distance from each data point to the curve).
def mae(preds, acts): return (torch.abs(preds-acts)).mean()
# update interactive plot
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
    f = mk_quad(a,b,c)
    plt.scatter(x,y)
    loss = mae(f(x), y)
    plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")
In a neural network we’ll have tens of millions or more parameters to fit and thousands or millions of data points to fit them to, which we can’t do manually with sliders. We need to automate this process.
If we know the gradient of our `mae()` function with respect to our parameters `a`, `b` and `c`, then we know how adjusting a parameter will change the value of the function. If, say, `a` has a negative gradient, then we know increasing `a` will decrease `mae()`. So we find the gradient of the loss with respect to each parameter and adjust our parameters a bit in the opposite direction of the gradient sign.
To do this we need a function that will take the parameters as a single vector:
def quad_mae(params):
    f = mk_quad(*params)
    return mae(f(x), y)
# testing it out
# should equal 2.4219
quad_mae([1.1, 1.1, 1.1])
tensor(2.4219, dtype=torch.float64)
# pick an arbitrary starting point for our parameters
abc = torch.tensor([1.1, 1.1, 1.1])

# tell pytorch to calculate its gradients
abc.requires_grad_()

# calculate loss
loss = quad_mae(abc)
loss
tensor(2.4219, dtype=torch.float64, grad_fn=<MeanBackward0>)
# calculate gradients
loss.backward()
# view gradients
abc.grad
tensor([-1.3529, -0.0316, -0.5000])
# increase parameters to decrease loss based on gradient sign
with torch.no_grad():
    abc -= abc.grad*0.01
    loss = quad_mae(abc)

print(f'loss={loss:.2f}')
loss=2.40
The loss has gone down from 2.4219
to 2.40
. We’re moving in the right direction.
The small number we multiply gradients by is called the learning rate and is the most important hyper-parameter to set when training a neural network.
# use a loop to do a few more iterations
for i in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad(): abc -= abc.grad*0.01
    print(f'step={i}; loss={loss:.2f}')
step=0; loss=2.40
step=1; loss=2.36
step=2; loss=2.30
step=3; loss=2.21
step=4; loss=2.11
step=5; loss=1.98
step=6; loss=1.85
step=7; loss=1.72
step=8; loss=1.58
step=9; loss=1.46
The loss continues to decrease. Here are our parameters and their gradients at this stage:
abc
tensor([1.9634, 1.1381, 1.4100], requires_grad=True)
abc.grad
tensor([-13.4260, -1.0842, -4.5000])
A neural network can approximate any computable function, given enough parameters using two key steps:
- Matrix multiplication.
- The function \(max(x,0)\), which simply replaces all negative numbers with zero.
The combination of a linear function and \(max\) is called a rectified linear unit and can be written as:
def rectified_linear(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

plot_function(partial(rectified_linear, 1, 1))
# we can do the same thing using PyTorch
import torch.nn.functional as F
def rectified_linear2(m,b,x): return F.relu(m*x+b)
plot_function(partial(rectified_linear2, 1, 1))
Create an interactive ReLU:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m, b), ylim=(-1,4))
Observe what happens when we add two ReLUs together:
def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))
Creating a triple ReLU function to fit our data:
def triple_relu(m1,b1,m2,b2,m3,b3,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x) + rectified_linear(m3,b3,x)

def mk_triple_relu(m1,b1,m2,b2,m3,b3): return partial(triple_relu, m1,b1,m2,b2,m3,b3)

@interact(m1=-1.5, b1=-1.5, m2=0.5, b2=0.5, m3=1.5, b3=1.5)
def plot_triple_relu(m1, b1, m2, b2, m3, b3):
    f = mk_triple_relu(m1,b1,m2,b2,m3,b3)
    plt.scatter(x,y)
    loss = mae(f(x), y)
    plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")
This same approach can be extended to functions with 2, 3, or more parameters. Drawing squiggly lines through some points is literally all that deep learning does. The above steps will, given enough time and enough data, create (for example) an owl recognizer if you feed it enough owls and non-owls.
We could do thousands of computations on a GPU instead of the above CPU computation. We can greatly reduce the amount of computation and data needed by using a convolution instead of a matrix multiplication. We could make things much faster if, instead of starting with random parameters, we start with the parameters of someone else's model that does something similar to what we want (transfer learning).
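As a rough illustration of why a convolution needs far fewer parameters than a full matrix multiplication, here's a sketch I added (the layer sizes are arbitrary, not from the lesson):

import torch.nn as nn

# a dense layer mapping a 28x28 image to a 28x28 output stores one weight per
# input-output pair, while a 3x3 convolution reuses the same 9 weights everywhere
dense = nn.Linear(28*28, 28*28)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
sum(p.numel() for p in dense.parameters()), sum(p.numel() for p in conv.parameters())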
Gradient Descent with Microsoft Excel
Following the instructions in the fastai course lesson video, I’ve created a Microsoft Excel deep learning model here for the Titanic Kaggle data.
As shown in the course video, I trained three different models—linear regression, neural net (using SUMPRODUCT
) and neural net (using MMULT
). After running Microsoft Excel’s Solver, I got the final (different than video) mean loss for each model:
- linear: 0.14422715
- nnet: 0.14385956
- mmult: 0.14385956
The linear model loss in the video was about 0.10 and the neural net loss was about 0.08. So, my models didn’t do as well.
Book Notes
In this section, I’ll take notes while reading Chapter 4 in the fastai textbook.
Pixels: The Foundations of Computer Vision
- We’ll use the MNIST dataset for our experiments, which contains handwritten digits.
- MNIST was collected by the National Institute of Standards and Technology and collated into a machine learning dataset by Yann LeCun, who used it in 1998 in LeNet-5, the first computer system to demonstrate practically useful recognition of handwritten digits.
- We've seen that the only consistent trait among every fast.ai student who's gone on to become a world-class practitioner is that they are all very tenacious.
- In this chapter we’ll create a model that can classify any image as a 3 or a 7.
from fastai.vision.all import *
path = untar_data(URLs.MNIST_SAMPLE)
# ls method added by fastai
# lists the count of items
path.ls()
(#3) [Path('/root/.fastai/data/mnist_sample/labels.csv'),Path('/root/.fastai/data/mnist_sample/train'),Path('/root/.fastai/data/mnist_sample/valid')]
(path/'train').ls()
(#2) [Path('/root/.fastai/data/mnist_sample/train/3'),Path('/root/.fastai/data/mnist_sample/train/7')]
# 3 and 7 are the labels
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes
(#6131) [Path('/root/.fastai/data/mnist_sample/train/3/10.png'),Path('/root/.fastai/data/mnist_sample/train/3/10000.png'),Path('/root/.fastai/data/mnist_sample/train/3/10011.png'),Path('/root/.fastai/data/mnist_sample/train/3/10031.png'),Path('/root/.fastai/data/mnist_sample/train/3/10034.png'),Path('/root/.fastai/data/mnist_sample/train/3/10042.png'),Path('/root/.fastai/data/mnist_sample/train/3/10052.png'),Path('/root/.fastai/data/mnist_sample/train/3/1007.png'),Path('/root/.fastai/data/mnist_sample/train/3/10074.png'),Path('/root/.fastai/data/mnist_sample/train/3/10091.png')...]
# view one of the images
im3_path = threes[1]
im3 = Image.open(im3_path)
im3
# the image is stored as numbers
array(im3)[4:10, 4:10]
array([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 29],
[ 0, 0, 0, 48, 166, 224],
[ 0, 93, 244, 249, 253, 187],
[ 0, 107, 253, 253, 230, 48],
[ 0, 3, 20, 20, 15, 0]], dtype=uint8)
# same thing, but a PyTorch tensor
tensor(im3)[4:10, 4:10]
tensor([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 29],
[ 0, 0, 0, 48, 166, 224],
[ 0, 93, 244, 249, 253, 187],
[ 0, 107, 253, 253, 230, 48],
[ 0, 3, 20, 20, 15, 0]], dtype=torch.uint8)
# use pandas.DataFrame to color code the array
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15, 4:22])
df.style.set_properties(**{'font-size': '6pt'}).background_gradient('Greys')
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 29 | 150 | 195 | 254 | 255 | 254 | 176 | 193 | 150 | 96 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 48 | 166 | 224 | 253 | 253 | 234 | 196 | 253 | 253 | 253 | 253 | 233 | 0 | 0 | 0 |
3 | 0 | 93 | 244 | 249 | 253 | 187 | 46 | 10 | 8 | 4 | 10 | 194 | 253 | 253 | 233 | 0 | 0 | 0 |
4 | 0 | 107 | 253 | 253 | 230 | 48 | 0 | 0 | 0 | 0 | 0 | 192 | 253 | 253 | 156 | 0 | 0 | 0 |
5 | 0 | 3 | 20 | 20 | 15 | 0 | 0 | 0 | 0 | 0 | 43 | 224 | 253 | 245 | 74 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 249 | 253 | 245 | 126 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 101 | 223 | 253 | 248 | 124 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 11 | 166 | 239 | 253 | 253 | 253 | 187 | 30 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 16 | 248 | 250 | 253 | 253 | 253 | 253 | 232 | 213 | 111 | 2 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 98 | 98 | 208 | 253 | 253 | 253 | 253 | 187 | 22 | 0 |
The background white pixels are stored as the number 0, black as the number 255, and shades of gray as values in between. The entire image is 28 pixels across and 28 pixels down, for a total of 784 pixels.
How might a computer recognize these two digits?
Ideas:
3s and 7s have distinct features. A seven generally has two straight lines at different angles; a three has two sets of curves stacked on each other. The point where the two curves intersect could be a recognizable feature of the digit three. The point where the two straight-ish lines intersect could be a recognizable feature of the digit seven. One source of confusion could be handwritten threes with a straight line at the top, similar to a seven. Another could be a handwritten 3 with a straight-ish ending stroke at the bottom, matching a similar stroke of a 7.
First Try: Pixel Similarity
Idea: find the average pixel value for every pixel of the 3s, then do the same for the 7s. To classify an image, see which of the two ideal digits the image is most similar to.
Baseline: A simple model that you are confident should perform reasonably well. It should be simple to implement and easy to test, so that you can then test each of your improved ideas and make sure they are always better than your baseline. Without starting with a sensible baseline, it is difficult to know whether your super-fancy models are any good.
# list comprehension of all digit images
seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]
len(three_tensors), len(seven_tensors)
(6131, 6265)
# use fastai's show_image to display tensor images
show_image(three_tensors[1]);
For every pixel position, we want to compute the average over all the images of the intensity of that pixel. To do this, combine all the images in this list into a single three-dimensional tensor.
When images are floats, the pixel values are expected to be between 0 and 1.
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
# the length of a tensor's shape is its rank
# rank is the number of axes (dimensions) of a tensor
# shape is the size of each axis of a tensor
len(stacked_threes.shape)
3
# rank of a tensor
stacked_threes.ndim
3
We calculate the mean of all the image tensors by taking the mean along dimension 0 of our stacked, rank-3 tensor. This is the dimension that indexes over all the images.
mean3 = stacked_threes.mean(0)
mean3.shape
torch.Size([28, 28])
show_image(mean3);
This is the ideal number 3 based on the dataset. It’s saturated where all the images agree it should be saturated (much of the background, the intersection of the two curves, and top and bottom curve), but it becomes wispy and blurry where the images disagree.
# do the same for sevens
mean7 = stacked_sevens.mean(0)
show_image(mean7);
How would I calculate how similar a particular image is to each of our ideal digits?
I would take the average of the absolute difference between each pixel’s intensity and the corresponding mean digit pixel intensity. The lower the average difference, the closer the digit is to the ideal digit.
# sample 3
a_3 = stacked_threes[1]
show_image(a_3);
L1 norm = Mean of the absolute value of differences.
Root mean squared error (RMSE) = square root of mean of the square of differences.
# L1 norm
dist_3_abs = (a_3 - mean3).abs().mean()

# RMSE
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
dist_3_abs, dist_3_sqr
(tensor(0.1114), tensor(0.2021))
# L1 norm
dist_7_abs = (a_3 - mean7).abs().mean()

# RMSE
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs, dist_7_sqr
(tensor(0.1586), tensor(0.3021))
For both L1 norm and RMSE, the distance between the 3 and the “ideal” 3 is less than the distance to the ideal 7, so our simple model will give the right prediction in this case.
Both distances are provided in PyTorch:
F.l1_loss(a_3.float(), mean7), F.mse_loss(a_3, mean7).sqrt()
(tensor(0.1586), tensor(0.3021))
MSE = mean squared error.
MSE will penalize bigger mistakes more heavily (and be lenient with small mistakes) than L1 norm.
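A tiny numeric check of that claim (my own sketch, not from the book):

import torch

errs = torch.tensor([0.1, 0.1, 1.0])
# L1 penalizes the 1.0 error 10x as much as a 0.1 error; MSE penalizes it 100x as much
errs.abs().mean(), (errs**2).mean()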
NumPy Arrays and PyTorch Tensors
A NumPy array is a multidimensional table of data with all items of the same type.
jagged array: nested arrays of different sizes.
If the items of the array are all of simple type such as integer or float, NumPy will store them as a compact C data structure in memory.
PyTorch tensors cannot be jagged. Unlike NumPy arrays, they can live on the GPU and can calculate their derivatives.
# creating arrays and tensors
data = [[1,2,3], [4,5,6]]
arr = array(data)
tns = tensor(data)
arr
array([[1, 2, 3],
[4, 5, 6]])
tns
tensor([[1, 2, 3],
[4, 5, 6]])
# select a row
tns[1]
tensor([4, 5, 6])
# select a column
tns[:, 1]
tensor([2, 5])
# slice
tns[1, 1:3]
tensor([5, 6])
# standard operators
tns + 1
tensor([[2, 3, 4],
[5, 6, 7]])
# tensor type
tns.type()
'torch.LongTensor'
# tensor changes type when needed
(tns * 1.5).type()
'torch.FloatTensor'
Computing Metrics Using Broadcasting
metric = a number that is calculated based on the predictions of our model and the correct labels in our dataset in order to tell us how good our model is.
Calculate the metric on the validation set.
valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255

valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255
valid_3_tens.shape, valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))
# measure distance between image and ideal
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
mnist_distance(a_3, mean3)
tensor(0.1114)
# calculate mnist_distance for digit 3 validation images
valid_3_dist = mnist_distance(valid_3_tens, mean3)
valid_3_dist, valid_3_dist.shape
(tensor([0.1109, 0.1202, 0.1276, ..., 0.1357, 0.1262, 0.1157]),
torch.Size([1010]))
PyTorch broadcasts mean3 across each of the 1010 images in valid_3_tens in order to calculate the distance. It doesn't actually copy mean3 1010 times; it does the whole calculation in C (or CUDA on the GPU).
In mean((-1, -2))
, the tuple (-1, -2)
represents a range of axes. This tells PyTorch that we want to take the mean ranging over the values indexed by the last two axes of the tensor—the horizontal and the vertical dimensions of an image.
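Here's a small sketch of the shapes involved in that broadcast (random data standing in for the real images, so the names are only placeholders):

import torch

batch = torch.rand(1010, 28, 28)   # stands in for valid_3_tens
ideal = torch.rand(28, 28)         # stands in for mean3
# the (28,28) tensor is broadcast across all 1010 images without being copied
(batch - ideal).abs().mean((-1, -2)).shape  # torch.Size([1010])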
If the distance between the digit in question and the ideal 3 is less than the distance to the ideal 7, then it’s a 3:
def is_3(x): return mnist_distance(x, mean3) < mnist_distance(x, mean7)
is_3(a_3), is_3(a_3).float()
(tensor(True), tensor(1.))
# full validation set---thanks to broadcasting
is_3(valid_3_tens)
tensor([ True, True, True, ..., False, True, True])
# calculate accuracy
accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()

accuracy_3s, accuracy_7s, (accuracy_3s + accuracy_7s) / 2
(tensor(0.9168), tensor(0.9854), tensor(0.9511))
We are getting more than 90% accuracy on both 3s and 7s. But they are very different looking digits and we’re classifying only 2 out of 10 digits, so we need to make a better model.
Stochastic Gradient Descent
Arthur Samuel’s description of machine learning
Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.
Our pixel similarity approach doesn’t have any weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. We can’t improve our pixel similarity approach.
We could look at each individual pixel and come up with a set of weights for each, such that the highest weights are associated with those pixels most likely to be black for a particular category. For example, pixels toward the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8. This can be represented as a function and set of weight values for each possible category, for instance, the probability of being the number 8:
def pr_eight(x,w): return (x*w).sum()
X is the image, represented as a vector (with all the rows stacked up end to end into a single long line) and the weights are a vector W. We need some way to update the weights to make them a little bit better. We want to find the specific values for the vector W that cause the result of our function to be high for those images that are 8s and low for those images that are not. Searching for the best vector W is a way to search for the best function for recognizing 8s.
Steps required to turn this function into a machine learning classifier:
- Initialize the weights.
- For each image, use these weights to predict whether it appears to be a 3 or a 7.
- Based on these predictions, calculate how good the model is (its loss).
- Calculate the gradient, which measures for each weight how changing that weight would change the loss.
- Step (that is, change) all the weights based on that calculation.
- Go back to step 2 and repeat the process.
- Iterate until you decide to stop the training process (for instance, because the model is good enough or you don’t want to wait any longer).
Initialize: Initialize parameters to random values.
Loss: We need a function that will return a number that is small if the performance of the model is good (by convention).
Step: Gradients allow us to directly figure out in which direction and by roughly how much to change each weight.
Stop: Keep training until the accuracy of the model starts getting worse, we run out of time, or the number of epochs we decided on is complete.
Calculating Gradients
Create an example loss function:
def f(x): return x**2
Pick a tensor value at which we want gradients:
xt = tensor(3.).requires_grad_()

yt = f(xt)
yt
tensor(9., grad_fn=<PowBackward0>)
Calculate gradients (backpropagation happens during the backward pass of the network, as opposed to the forward pass, where the activations are calculated):
yt.backward()
View the gradients:
xt.grad
tensor(6.)
The derivative of x**2 is 2*x. When x = 3 the derivative is 6, as calculated above.
Calculating vector gradients:
xt = tensor([3., 4., 10.]).requires_grad_()
xt
tensor([ 3., 4., 10.], requires_grad=True)
Add sum
to our function so it takes a vector and returns a scalar:
def f(x): return (x**2).sum()
yt = f(xt)
yt
tensor(125., grad_fn=<SumBackward0>)
yt.backward()
xt.grad
tensor([ 6., 8., 20.])
If the gradients are very large, that may suggest that we have more adjustments to do, whereas if they are very small, that may suggest that we are close to the optimal value.
Stepping with a Learning Rate
Deciding how to change our parameters based on the values of the gradients—multiplying the gradient by some small number called the learning rate (LR):
w -= w.grad * lr
This is known as stepping your parameters using an optimization step.
If you pick a learning rate too low, that can mean having to do a lot of steps. If you pick a learning rate too high, that’s even worse, because it can result in the loss getting worse. If the learning rate is too high it may also “bounce” around.
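To see those failure modes concretely, here's a toy sketch of my own (not from the lesson) of gradient descent on f(x) = x**2 with three different learning rates:

def descend(x, lr, steps=5):
    for _ in range(steps):
        x = x - lr * 2*x   # the gradient of x**2 is 2*x
    return x

# too small: barely moves; reasonable: heads toward the minimum at 0; too large: diverges
descend(3.0, 0.01), descend(3.0, 0.1), descend(3.0, 1.1)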
An End-to-End SGD Example
Example: measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, get slower as it went up the hill, and speed up again going downhill.
time = torch.arange(0,20).float(); time
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19.])
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
speed
tensor([72.1328, 55.1778, 39.8417, 33.9289, 21.9506, 18.0992, 11.3346, 0.3637,
7.3242, 4.0297, 3.9236, 4.1486, 1.9496, 6.1447, 12.7890, 23.8966,
30.6053, 45.6052, 53.5180, 71.2243])
plt.scatter(time, speed);
We added a bit of random noise since measuring things manually isn’t precise.
What was the roller coaster's speed? Using SGD, we can try to find a function that matches our observations. Guess that it will be a quadratic of the form a*(time**2) + (b*time) + c.
We want to distinguish clearly between the function’s input (the time when we are measuring the coaster’s speed) and its parameters (the values that define which quadratic we’re trying).
Collect parameters in one argument and separate t
and params
in the function’s signature:
def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c
Define a loss function:
def mse(preds, targets): return ((preds-targets)**2).mean()
Step 1: Initialize the parameters
params = torch.randn(3).requires_grad_()
Step 2: Calculate the predictions
preds = f(time, params)
Create a little function to see how close our predictions are to our targets:
def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)
show_preds(preds)
Step 3: Calculate the loss
loss = mse(preds, speed)
loss
tensor(11895.1143, grad_fn=<MeanBackward0>)
Step 4: Calculate the gradients
loss.backward()
params.grad
tensor([-35554.0117, -2266.8909, -171.8540])
params
tensor([-0.5364, 0.6043, 0.4822], requires_grad=True)
Step 5: Step the weights
lr = 1e-5
params.data -= lr * params.grad.data
params.grad = None
Let’s see if the loss has improved (it has) and take a look at the plot:
preds = f(time, params)
mse(preds, speed)
tensor(2788.1594, grad_fn=<MeanBackward0>)
show_preds(preds)
Step 6: Repeat the process
def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds
for i in range(10): apply_step(params)
2788.159423828125
1064.841552734375
738.7333984375
677.02001953125
665.3380737304688
663.1239013671875
662.7010498046875
662.6172485351562
662.59765625
662.5902709960938
_, axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()
Step 7: Stop
We decided to stop after 10 epochs arbitrarily. In practice, we would watch the training and validation losses and our metrics to decide when to stop.
Summarizing Gradient Descent
- At the beginning, the weights of our model can be random (training from scratch) or come from a pretrained model (transfer learning).
- In both cases the model will need to learn better weights.
- Use a loss function to compare model outputs to targets.
- Change the weights to make the loss a bit lower by multiplying the gradients by the learning rate and subtracting the result from the parameters.
- Iterate until you have reached the lowest loss and then stop.
The MNIST Loss Function
Concatenate the images into a single tensor. view
changes the shape of a tensor without changing its contents. -1
is a special parameter to view
that means “make this axis as big as necessary to fit all the data”.
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
Use the label 1
for 3s and 0
for 7s. Unsqueeze adds a dimension of size one.
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_x.shape, train_y.shape
(torch.Size([12396, 784]), torch.Size([12396, 1]))
PyTorch Dataset
is required to return a tuple of (x,y)
when indexed.
dset = list(zip(train_x, train_y))
x,y = dset[0]
x.shape,y
(torch.Size([784]), tensor([1]))
Prepare the validation dataset:
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x, valid_y))
x,y = valid_dset[0]
x.shape, y
(torch.Size([784]), tensor([1]))
Step 1: Initialize the parameters
We need an initially random weight for every pixel.
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1))
weights.shape
torch.Size([784, 1])
\(y = wx + b\).
We created w (weights) now we need to create b (intercept or bias):
bias = init_params(1)
bias
tensor([-0.0313], requires_grad=True)
Step 2: Calculate the predictions
Prediction for one image
(train_x[0] * weights.T).sum() + bias
tensor([0.5128], grad_fn=<AddBackward0>)
In Python, matrix multiplication is represented with the @ operator:
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)
preds
tensor([[ 0.5128],
[-3.8324],
[ 4.9791],
...,
[ 3.0790],
[ 4.1521],
[ 0.3523]], grad_fn=<AddBackward0>)
To decide if an output represents a 3 or a 7, we can just check whether it’s greater than 0:
corrects = (preds>0.0).float() == train_y
corrects
tensor([[ True],
[False],
[ True],
...,
[False],
[False],
[False]])
corrects.float().mean().item()
0.38964182138442993
Step 3: Calculate the loss
A very small change in the value of a weight will often not change the accuracy at all, and thus the gradient is 0 almost everywhere. It’s not useful to use accuracy as a loss function.
We need a loss function that when our weights result in slightly better predictions, gives us a slightly better loss.
In this case, a "slightly better prediction" means: if the correct answer is a 3 (label 1), the score is a little higher, or if the correct answer is a 7 (label 0), the score is a little lower.
The loss function receives not the images themselves, but the predictions from the model.
The loss function will measure how distant each prediction is from 1 (if it should be 1) and how distant it is from 0 (if it should be 0) and then it will take the mean of all those distances.
def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()
Try it out with sample predictions and targets:
trgts = tensor([1,0,1])
prds = tensor([0.9, 0.4, 0.2])
torch.where(trgts==1, 1-prds, prds)
tensor([0.1000, 0.4000, 0.8000])
This function returns a lower number when predictions are more accurate, when accurate predictions are more confident and when inaccurate predictions are less confident.
Since we need a scalar for the final loss, mnist_loss
takes the mean of the previous tensor:
mnist_loss(prds, trgts)
tensor(0.4333)
mnist_loss
assumes that predictions are between 0 and 1. We need to ensure that, using sigmoid
, which always outputs a number between 0 and 1:
def sigmoid(x): return 1/(1+torch.exp(-x))
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)
It’s also a smooth curve that only goes up, which makes it easier for SGD to find meaningful gradients. Update mnist+loss
to first apply sigmoid
to the inputs:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
We already had a metric, which was overall accuracy. So why did we define a loss?
To drive automated learning, the loss must be a function that has a meaningful derivative. It can’t have big flat sections and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level.
The loss function is calculated for each item in our dataset, and then at the end of an epoch, the loss values are all averaged and the overall mean is reported for the epoch.
It is important that we focus on metrics, rather than the loss, when judging the performance of a model.
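Here's a small sketch of my own (reusing the mnist_loss idea) contrasting the two: nudging a prediction changes the smooth loss and gives a non-zero gradient, while accuracy wouldn't change at all:

import torch

targets = torch.tensor([[1.]])
preds = torch.tensor([[0.2]], requires_grad=True)

loss = torch.where(targets==1, 1-preds.sigmoid(), preds.sigmoid()).mean()
loss.backward()
preds.grad   # non-zero: the loss responds to a small change in the prediction
# accuracy, (preds>0.0) == targets, stays True whether preds is 0.2 or 0.21,
# so its gradient would be zero almost everywhere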
SGD and Mini-Batches
The optimization step: change or update the weights based on the gradients.
To take an optimization step, we need to calculate the loss over one or more data items. Calculating the loss for the whole dataset would take a long time, calculating it for a single item would not use much information so it would result in an imprecise and unstable gradient.
Calculate the average loss for a few data items at a time (mini-batch). The number of data items in the mini-batch is called the batch-size.
A larger batch size means you will get a more accurate and stable estimate of your dataset’s gradients from the loss function, but it will take longer and you will process fewer mini-batches per epoch. Using batches of data works well for GPUs, but give the GPU too many items at once and it will run out of memory.
We get better generalization if we can vary things during training (like performing data augmentation). One simple and effective thing we can vary is which data items we put in each mini-batch, by randomly shuffling the dataset before we create mini-batches. The DataLoader will do the shuffling and mini-batch collation for you:
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
[tensor([10, 3, 8, 11, 0]),
tensor([6, 1, 7, 9, 4]),
tensor([12, 13, 5, 2, 14])]
For training, we want a collection containing independent and dependent variables. A Dataset
in PyTorch is a collection containing tuples of independent and dependent variables.
ds = L(enumerate(string.ascii_lowercase))
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
list(enumerate(string.ascii_lowercase))[:5]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
When we pass a Dataset
to a Dataloader
we will get back many batches that are themselves tuples of tensors representing batches of independent and dependent variables:
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
[(tensor([24, 2, 4, 8, 9, 13]), ('y', 'c', 'e', 'i', 'j', 'n')),
(tensor([23, 17, 6, 14, 25, 18]), ('x', 'r', 'g', 'o', 'z', 's')),
(tensor([22, 5, 7, 20, 3, 19]), ('w', 'f', 'h', 'u', 'd', 't')),
(tensor([ 0, 21, 12, 1, 16, 10]), ('a', 'v', 'm', 'b', 'q', 'k')),
(tensor([11, 15]), ('l', 'p'))]
Putting It All Together
In code, the process will be implemented something like this for each epoch:
for x,y in dl:
    # calculate predictions
    pred = model(x)
    # calculate the loss
    loss = loss_func(pred, y)
    # calculate the gradients
    loss.backward()
    # step the weights
    parameters -= parameters.grad * lr
Step 1: Initialize the parameters
weights = init_params((28*28, 1))
bias = init_params(1)
A DataLoader
can be created from a Dataset
:
dl = DataLoader(dset, batch_size=256)
xb,yb = first(dl)
xb.shape, yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
Do the same for the validation set:
valid_dl = DataLoader(valid_dset, batch_size=256)
Create a mini-batch of size 4 for testing:
batch = train_x[:4]
batch.shape
torch.Size([4, 784])
preds = linear1(batch)
preds
tensor([[10.4546],
[ 9.4603],
[-0.2426],
[ 6.7868]], grad_fn=<AddBackward0>)
loss = mnist_loss(preds, train_y[:4])
loss
tensor(0.1404, grad_fn=<MeanBackward0>)
Step 4: Calculate the gradients
loss.backward()
weights.grad.shape, weights.grad.mean(), bias.grad
(torch.Size([784, 1]), tensor(-0.0089), tensor([-0.0619]))
Create a function to calculate gradients:
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()
Test it:
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad
(tensor(-0.0178), tensor([-0.1238]))
Look what happens when we call it again:
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(), bias.grad
(tensor(-0.0267), tensor([-0.1857]))
The gradients have changed. loss.backward
adds the gradients of loss
to any gradients that are currently stored. So we have to set the current gradients to 0 first:
weights.grad.zero_(); bias.grad.zero_()
Methods in PyTorch whose names end in an underscore modify their objects in place.
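A quick sketch of that in-place convention (mine, not from the book):

import torch

t = torch.ones(3)
t.add(1)     # returns a new tensor; t is unchanged
t.add_(1)    # trailing underscore: modifies t in place
t            # tensor([2., 2., 2.])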
Step 5: Step the weights
When we update the weights and biases based on the gradient and learning rate, we have to tell PyTorch not to take the gradient of this step. If we assign to the data
attribute of a tensor, PyTorch will not take the gradient of that step. Here’s our basic training loop for an epoch:
def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()
We want to check how we’re doing by looking at the accuracy of the validation set. To decide if an output represents a 3 (1
) or a 7 (0
) we can just check whether the prediction is greater than 0.
preds, train_y[:4]
(tensor([[10.4546],
[ 9.4603],
[-0.2426],
[ 6.7868]], grad_fn=<AddBackward0>),
tensor([[1],
[1],
[1],
[1]]))
(preds>0.0).float() == train_y[:4]
tensor([[ True],
[ True],
[False],
[ True]])
# if preds is greater than 0 and the label is 1 -> correct 3 prediction
# if preds is not greater than 0 and the label is 0 -> correct 7 prediction
True == 1, False == 0
(True, True)
Create a function to calculate validation accuracy:
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()
batch_accuracy(linear1(batch), train_y[:4])
tensor(0.7500)
Put the batches back together:
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
Starting point accuracy:
validate_epoch(linear1)
0.5703
Let’s train for 1 epoch and see if the accuracy improves:
lr = 1.
params = weights, bias

train_epoch(linear1, lr, params)
validate_epoch(linear1)
0.6928
Step 6: Repeat the process
Then do a few more:
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
0.852 0.9061 0.931 0.9418 0.9477 0.9569 0.9584 0.9594 0.9599 0.9633 0.9647 0.9652 0.9657 0.9662 0.9672 0.9677 0.9687 0.9696 0.9701 0.9696
We’re already about at the same accuracy as our “pixel similarity” approach.
Creating an Optimizer
Replace our linear function with PyTorch's nn.Linear module. A module is an object of a class that inherits from the PyTorch nn.Module class, and it behaves identically to a standard Python function in that you can call it using parentheses and it will return the activations of a model.
nn.Linear
does the same thing as our init_params
and linear
together. It contains both weights and biases in a single class:
linear_model = nn.Linear(28*28, 1)
Every PyTorch module knows what parameters it has that can be trained; they are available through the parameters
method:
w,b = linear_model.parameters()
w.shape, b.shape
(torch.Size([1, 784]), torch.Size([1]))
We can use this information to create an optimizer:
class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
We can create our optimizer by passing in the model’s parameters:
opt = BasicOptim(linear_model.parameters(), lr)
Simplify our training loop:
def train_epoch(model):
    for xb,yb in dl:
        # calculate the gradients
        calc_grad(xb,yb,model)
        # step the weights
        opt.step()
        opt.zero_grad()
Our validation function doesn’t need to change at all:
validate_epoch(linear_model)
0.3985
Put our training loop in a function:
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')
Similar results as the previous training:
train_model(linear_model, 20)
0.4932 0.7959 0.8506 0.9136 0.9341 0.9492 0.9556 0.9629 0.9658 0.9683 0.9702 0.9717 0.9741 0.9746 0.9761 0.9766 0.9775 0.978 0.9785 0.979
fastai provides the SGD
class that by default does the same thing as our BasicOptim
:
linear_model = nn.Linear(28*28, 1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
0.4932 0.8735 0.8174 0.9082 0.9331 0.9468 0.9546 0.9614 0.9653 0.9668 0.9692 0.9727 0.9736 0.9751 0.9756 0.9761 0.9775 0.978 0.978 0.9785
fastai provides Learner.fit
which we can use instead of train_model
. To create a Learner
we first need to create a DataLoaders
, by passing our training and validation DataLoader
s:
dls = DataLoaders(dl, valid_dl)
To create a Learner
without using an application such as cnn_learner
we need to pass in all the elements that we’ve created in this chapter: the DataLoaders
, the model, the optimization function (which will be passed the parameters), the loss function, and optionally any metrics to print:
learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.636474 | 0.503518 | 0.495584 | 00:00 |
1 | 0.550751 | 0.189374 | 0.840530 | 00:00 |
2 | 0.201501 | 0.178350 | 0.839549 | 00:00 |
3 | 0.087588 | 0.105257 | 0.912659 | 00:00 |
4 | 0.045719 | 0.076968 | 0.933759 | 00:00 |
5 | 0.029454 | 0.061683 | 0.947498 | 00:00 |
6 | 0.022817 | 0.052156 | 0.954367 | 00:00 |
7 | 0.019893 | 0.045825 | 0.962709 | 00:00 |
8 | 0.018424 | 0.041383 | 0.965653 | 00:00 |
9 | 0.017549 | 0.038113 | 0.967125 | 00:00 |
Adding a Nonlinearity
Adding a nonlinearity between two linear classifiers gives us a neural network.
def simple_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res
# initialize weights
w1 = init_params((28*28, 30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
w1
has 30 output activations which means w2
must have 30 input activations so that they match. 30 output activations means that the first layer can construct 30 different features, each representing a different mix of pixels. You can change that 30 to anything you like to make the model more or less complex.
res.max(tensor(0.0))
is called a rectified linear unit or ReLU. It replaces every negative number with a zero.
plot_function(F.relu)
We need a nonlinearity because a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters.
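A quick check of that claim (a sketch I added, with arbitrary small sizes): composing two weight matrices without a nonlinearity collapses into a single matrix.

import torch

W1, W2 = torch.randn(3, 5), torch.randn(1, 3)
x = torch.randn(5)
combined = W2 @ W1                                        # a single 1x5 linear map
torch.allclose(W2 @ (W1 @ x), combined @ x, atol=1e-5)    # True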
The neural net can solve any computable problem to an arbitrarily high level of accuracy if you can find the right parameters w1
and w2
and if you make the matrices big enough.
We can replace our function with PyTorch:
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)
nn.Sequential
creates a module that will call each of the listed layers or functions in turn. When using nn.Sequential
PyTorch requires us to use the module version (nn.ReLU
) and not the function version (F.relu
). Modules are classes so you have to instantiate them.
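A tiny sketch of my own showing that the module and function versions produce the same output:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])
# nn.ReLU is a module: instantiate it, then call it; F.relu is a plain function
torch.equal(nn.ReLU()(x), F.relu(x))  # True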
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)
epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.363529 | 0.409795 | 0.505888 | 00:00 |
1 | 0.165949 | 0.239534 | 0.792934 | 00:00 |
2 | 0.089140 | 0.117148 | 0.913150 | 00:00 |
3 | 0.056798 | 0.078107 | 0.941119 | 00:00 |
4 | 0.042071 | 0.060734 | 0.957311 | 00:00 |
5 | 0.034718 | 0.051121 | 0.962218 | 00:00 |
6 | 0.030605 | 0.045103 | 0.964181 | 00:00 |
7 | 0.027994 | 0.040995 | 0.966143 | 00:00 |
8 | 0.026145 | 0.037990 | 0.969087 | 00:00 |
9 | 0.024728 | 0.035686 | 0.970559 | 00:00 |
10 | 0.023585 | 0.033853 | 0.972522 | 00:00 |
11 | 0.022634 | 0.032346 | 0.973994 | 00:00 |
12 | 0.021826 | 0.031080 | 0.975466 | 00:00 |
13 | 0.021127 | 0.029996 | 0.976448 | 00:00 |
14 | 0.020514 | 0.029053 | 0.975957 | 00:00 |
15 | 0.019972 | 0.028221 | 0.976448 | 00:00 |
16 | 0.019488 | 0.027481 | 0.977920 | 00:00 |
17 | 0.019051 | 0.026818 | 0.978410 | 00:00 |
18 | 0.018654 | 0.026219 | 0.978410 | 00:00 |
19 | 0.018291 | 0.025677 | 0.978901 | 00:00 |
20 | 0.017958 | 0.025181 | 0.978901 | 00:00 |
21 | 0.017650 | 0.024727 | 0.980373 | 00:00 |
22 | 0.017363 | 0.024310 | 0.980864 | 00:00 |
23 | 0.017096 | 0.023925 | 0.980864 | 00:00 |
24 | 0.016846 | 0.023570 | 0.981845 | 00:00 |
25 | 0.016610 | 0.023241 | 0.982336 | 00:00 |
26 | 0.016389 | 0.022935 | 0.982336 | 00:00 |
27 | 0.016179 | 0.022652 | 0.982826 | 00:00 |
28 | 0.015980 | 0.022388 | 0.982826 | 00:00 |
29 | 0.015791 | 0.022142 | 0.982826 | 00:00 |
30 | 0.015611 | 0.021913 | 0.983317 | 00:00 |
31 | 0.015440 | 0.021700 | 0.983317 | 00:00 |
32 | 0.015276 | 0.021500 | 0.983317 | 00:00 |
33 | 0.015120 | 0.021313 | 0.983317 | 00:00 |
34 | 0.014969 | 0.021137 | 0.983317 | 00:00 |
35 | 0.014825 | 0.020972 | 0.983317 | 00:00 |
36 | 0.014686 | 0.020817 | 0.982826 | 00:00 |
37 | 0.014553 | 0.020671 | 0.982826 | 00:00 |
38 | 0.014424 | 0.020532 | 0.982826 | 00:00 |
39 | 0.014300 | 0.020401 | 0.982826 | 00:00 |
You can view the training process in learn.recorder
:
plt.plot(L(learn.recorder.values).itemgot(2))
View the final accuracy:
learn.recorder.values[-1][2]
0.982826292514801
At this point we have:
- A function that can solve any problem to any level of accuracy (the neural network) given the correct set of parameters.
- A way to find the best set of parameters for any function (stochastic gradient descent).
Going Deeper
We can add as many layers in our neural network as we want, as long as we add a nonlinearity between each pair of linear layers.
The deeper the model gets, the harder it is to optimize the parameters.
With a deeper model (one with more layers) we do not need to use as many parameters. We can use smaller matrices with more layers and get better results than we would get with larger matrices and few layers.
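A rough parameter count comparing one wide hidden layer to two narrower ones (a sketch with arbitrary sizes, not from the book):

import torch.nn as nn

def n_params(m): return sum(p.numel() for p in m.parameters())

wide = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 1))
deep = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                     nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
n_params(wide), n_params(deep)  # the deeper net here uses far fewer parameters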
In the 1990s what held back the field for years was that so few researchers were experimenting with more than one nonlinearity.
Training an 18-layer model:
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
/usr/local/lib/python3.10/dist-packages/fastai/vision/learner.py:288: UserWarning: `cnn_learner` has been renamed to `vision_learner` -- please update your code
warn("`cnn_learner` has been renamed to `vision_learner` -- please update your code")
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.098852 | 0.014919 | 0.996075 | 02:01 |
Jargon Recap
Activations: Numbers that are calculated (both by linear and nonlinear layers)
Parameters: Numbers that are randomly initialized and optimized (that is, the numbers that define the model).
Part of becoming a good deep learning practitioner is getting used to the idea of looking at your activations and parameters, plotting them, and testing whether they are behaving correctly.
Activations and parameters are all contained in tensors. The number of dimensions of a tensor is its rank.
A neural network contains a number of layers. Each layer is either linear or nonlinear. We generally alternate between these two kinds of layers in a neural network. Sometimes a nonlinearity is referred to as an activation function.
Key concepts related to SGD:
Term | Meaning |
---|---|
ReLU | Function that returns 0 for negative numbers and doesn’t change positive numbers. |
Mini-batch | A small group of inputs and labels gathered together in two arrays. A gradient descent step is performed on this batch (rather than on a whole epoch). |
Forward pass | Applying the model to some input and computing the predictions. |
Loss | A value that represents how well or badly our model is doing. |
Gradient | The derivative of the loss with respect to some parameter of the model. |
Backward pass | Computing the gradients of the loss with respect to all model parameters. |
Gradient descent | Taking a step in the direction opposite to the gradients to make the model parameters a little bit better. |
Learning rate | The size of the step we take when applying SGD to update the parameters of the model. |
Questionnaire
1. How is a grayscale image represented on a computer? How about a color image?
Grayscale image pixels can be 0 (black) to 255 (white). Color image pixels have three values (Red, Green, Blue) where each value can be from 0 to 255.
2. How are the files and folders in the MNIST_SAMPLE
dataset structured? Why?
path.ls()
(#3) [Path('/root/.fastai/data/mnist_sample/labels.csv'),Path('/root/.fastai/data/mnist_sample/train'),Path('/root/.fastai/data/mnist_sample/valid')]
MNIST_SAMPLE
path has a labels.csv
file, a train
folder, and a valid
folder.
(path/'train').ls()
(#2) [Path('/root/.fastai/data/mnist_sample/train/3'),Path('/root/.fastai/data/mnist_sample/train/7')]
The train
folder has a 3
and a 7
folder, each which contains training images.
(path/'valid').ls()
(#2) [Path('/root/.fastai/data/mnist_sample/valid/3'),Path('/root/.fastai/data/mnist_sample/valid/7')]
The valid
folder contains a 3
and a 7
folder, each containing validation set images.
3. Explain how the “pixel similarity” approach to classifying digits works.
Pixel similarity works by comparing each image to the mean ("ideal") 3 and the mean 7 using the average absolute pixel difference (L1 norm): if an image is closer to the ideal 3 than to the ideal 7, it's classified as a 3, and otherwise as a 7. The model's accuracy is the fraction of correct classifications across each digit's validation set.
4. What is list comprehension? Create one now that selects odd numbers from a list and doubles them.
List comprehension is syntax for creating a new list based on another sequence or iterable (docs)
# for each element in range(10)
# if the modulo of the element and 2 is not 0
# double the element's value and store in this new list
doubled_odds = [2*elem for elem in range(10) if elem % 2 != 0]
doubled_odds
[2, 6, 10, 14, 18]
5. What is a rank-3 tensor?
A rank-3 tensor is a “cube” (3-dimensional tensor).
6. What is the difference between tensor rank and shape? How do you get the rank from the shape?
Tensor rank is the number of dimensions of the tensor. Tensor shape is the number of elements in each dimension. You get the rank from the shape by taking its length (rank = len(shape)). The following tensor is a 2-dimensional tensor with rank 2, whose shape is 3 elements by 2 elements.
a_tensor = tensor([[1,3], [4,5], [5,6]])
# dim == rank
a_tensor.dim(), a_tensor.shape
(2, torch.Size([3, 2]))
7. What are RMSE and L1 norm?
RMSE = Root Mean Squared Error: The square root of the mean of squared differences between two sets of values.
L1 norm = mean absolute difference: the mean of the absolute value of differences between two sets of values.
8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
You can do so by using tensors on a GPU.
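As a rough illustration (my own sketch), even on a CPU a single vectorized tensor operation beats an equivalent Python loop by a large factor, and moving the tensor to a GPU widens the gap further:

import time
import torch

x = torch.rand(1_000_000)

t0 = time.perf_counter()
y = x * 2                                        # one vectorized op
t1 = time.perf_counter()
z = torch.tensor([v * 2 for v in x.tolist()])    # element-by-element Python loop
t2 = time.perf_counter()
print(f"vectorized: {t1-t0:.4f}s  loop: {t2-t1:.4f}s")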
9. Create a 3x3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom four numbers.
a_tensor = tensor([[1,2,3], [4,5,6], [7,8,9]])
a_tensor
tensor([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
a_tensor = 2 * a_tensor
a_tensor
tensor([[ 2, 4, 6],
[ 8, 10, 12],
[14, 16, 18]])
a_tensor.view(-1, 9)[0,-4:]
tensor([12, 14, 16, 18])
10. What is broadcasting?
Broadcasting is when a tensor of smaller rank (or a scalar) is expanded so that you can perform an operation between it and a tensor of larger rank. Broadcasting makes the two operands' shapes compatible without actually copying the smaller tensor.
a_tensor + tensor([1,2,3])
tensor([[ 3, 6, 9],
[ 9, 12, 15],
[15, 18, 21]])
11. Are metrics generally calculated using the training set or the validation set? Why?
Metrics are calculated on the validation set because that is data the model does not see during training, so the metric tells you how your model performs on data it hasn't seen before.
12. What is SGD?
SGD is Stochastic Gradient Descent, an automated process where a model learns the right parameters needed to solve problems like image classification. The randomly (from scratch) or pretrained (transfer learning) parameters are updated using their gradients with respect to the loss and the learning rate. Metrics like the accuracy measure how well the model is performing.
13. Why does SGD use mini-batches?
One reason is to utilize the ability of a GPU to process a lot of data at once.
Another reason is that calculating the loss one image at a time leads to an unstable loss function whereas calculating the loss on the entire dataset takes too long. Mini-batches fall in between these two extremes.
14. What are the seven steps in SGD for machine learning?
- Initialize the weights.
- Calculate the predictions.
- Calculate the loss.
- Calculate gradients.
- Step the weights.
- Repeat the process.
- Stop.
15. How do we initialize the weights in a model?
Either randomly (if training from scratch) or using pretrained weights (if transfer learning from an existing model like resnet18).
16. What is loss?
A machine-friendly way to measure how well (or badly) the model is performing. The model is learning to step the weights in order to decrease the loss.
17. Why can’t we always use a high learning rate?
Because we risk overshooting the minimum loss (getting stuck back and forth between the two sides of the parabola) or diverging (resulting in larger losses each step).
18. What is a gradient?
The rate of change or derivative of one variable with respect to another variable. In our case, gradients are the ratio of change in loss to change in parameter at one point.
19. Do you need to know how to calculate gradients yourself?
Nope! Although you should understand the basic concept of derivatives. PyTorch calculates gradients with the .backward
method.
20. Why can’t we use accuracy as a loss function?
Because small changes in predictions do not result in small changes in accuracy. Accuracy drastically jumps (from 0
to 1
in our MNIST_SAMPLE
example) at one point, with 0 slope elsewhere. We want a smooth function where you can calculate non-zero and non-infinite derivatives everywhere.
21. Draw the sigmoid function. What is special about its shape?
The sigmoid function outputs values between 0 and 1 for inputs going from -inf to +inf. It also has a smooth positive slope everywhere, so it's easy to take its derivative.
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4)
22. What is the difference between a loss function and a metric?
The loss function is a machine-friendly way to measure the performance of the model while a metric is a human-friendly way to do the same.
The purpose of the loss function is to provide a smooth function to take derivatives of, so the training system can change the weights little by little toward the optimum.
The purpose of the metric is to inform the human how well or badly the model is learning during training.
23. What is the function to calculate new weights using a learning rate?
In code, the function is:
parameters.data -= parameters.grad * lr
The new weights are stepped incrementally in the opposite direction of the gradients. If the gradient is negative, the weights will be increased. If the gradient is positive, the weights will be decreased.
24. What does the DataLoader
class do?
The DataLoader
class prepares training and validation batches and feeds them to the GPU during training. It also performs any necessary item_tfms
or batch_tfms
to the data.
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
def train_epoch(model):
    # calculate predictions
    preds = model(xb)
    # calculate the loss
    loss = loss_func(preds, targets)
    # calculate gradients
    loss.backward()
    # step the weights
    params.data -= params.grad * lr
    # reset the gradients
    params.grad.zero_()
    # calculate accuracy
    acc = tensor([accuracy for each batch]).mean()
26. Create a function that, if passed two arguments [1, 2, 3, 4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure?
def zipped_tuples(x, y): return list(zip(x,y))
zipped_tuples([1,2,3,4], 'abcd')
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
The output data structure is the same structure as the PyTorch Dataset
.
27. What does view
do in PyTorch?
view
changes the rank and shape of the tensor.
tensor([1,2,3],[4,5,6]).view(3,2)
tensor([[1, 2],
[3, 4],
[5, 6]])
tensor([1,2,3],[4,5,6]).view(6)
tensor([1, 2, 3, 4, 5, 6])
28. What are the bias parameters in a neural network? Why do we need them?
The bias parameters are the intercept \(b\) in the function \(y = wx + b\). We need them for situations where the inputs are 0 (since \(w*0 = 0\)). Bias also helps to create a more flexible function (source).
29. What does the @
operator do in Python?
Matrix multiplication.
v1 = tensor([1,2,3])
v2 = tensor([4,5,6])
v1 @ v2
tensor(32)
30. What does the backward
method do?
Calculate the gradients of the loss function with respect to the parameters.
31. Why do we have to zero the gradients?
Each time you call .backward
PyTorch will add the new gradients to the current gradients, so we need to zero the gradients to prevent them from accumulating.
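A minimal sketch of my own showing that accumulation behaviour:

import torch

w = torch.tensor(2.0, requires_grad=True)
(w * 3).backward()
w.grad            # tensor(3.)
(w * 3).backward()
w.grad            # tensor(6.) -- the gradients accumulated
w.grad.zero_()    # reset before the next backward pass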
32. What information do we have to pass to Learner
?
Reference:
Learner(dls, simple_net, opt_func=SGD,
loss_func=mnist_loss, metrics=batch_accuracy)
We pass to the Learner
:
- DataLoaders containing training and validation sets.
- The model we want to train.
- An optimizer function.
- A loss function.
- Any metrics we want calculated.
33. Show Python or pseudocode for the basic steps of a training loop.
See #25.
34. What is ReLU? Draw a plot for it for values from -2 to +2.
ReLU is Rectified Linear Unit. It’s a function where if the inputs are negative, they are set to zero, and if the inputs are positive, they are kept as is.
plot_function(F.relu, min=-2, max=2)
35. What is an activation function?
An activation function is the nonlinearity placed between linear layers (in our case, the ReLU between the two nn.Linear layers). Sometimes a nonlinearity is referred to as an activation function.
36. What’s the difference between F.relu
and nn.ReLU
?
F.relu
is a function whereas nn.ReLU
is a class that needs to be instantiated.
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
Using more layers lets us use smaller matrices and, in practice, gives better results than a single very wide layer.
Further Research
Since this lesson’s Further Research was so intensive, I decided to create separate blog posts for each one:
Lesson 4: Natural Language (NLP)
As recommended at the end of the lesson 3 video, I will read + run through the code from Jeremy’s notebook Getting started with NLP for absolute beginners before starting lesson 4.
In this notebook we’ll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
Notebook Exercise: Getting started with NLP for absolute beginners
Download the Data
!pip install kaggle
! pip install -q datasets
! pip install transformers[sentencepiece]
!pip install accelerate -U
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

# `creds` (not shown here) should hold your Kaggle API credentials JSON,
# as defined earlier in Jeremy's original notebook
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
path = Path('us-patent-phrase-to-phrase-matching')
import zipfile,kaggle
kaggle.api.competition_download_cli(str(path))
zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading us-patent-phrase-to-phrase-matching.zip to /content
100%|██████████| 682k/682k [00:00<00:00, 750kB/s]
!ls {path}
sample_submission.csv test.csv train.csv
View the Data
import pandas as pd
df = pd.read_csv(path/'train.csv')
df
id | anchor | target | context | score | |
---|---|---|---|---|---|
0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
... | ... | ... | ... | ... | ... |
36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |
36472 | 8d135da0b55b8c88 | wood article | wooden substrate | B44 | 0.50 |
36473 rows × 5 columns
df.describe(include='object')
id | anchor | target | context | |
---|---|---|---|---|
count | 36473 | 36473 | 36473 | 36473 |
unique | 36473 | 733 | 29340 | 106 |
top | 37d61fd2272659b1 | component composite coating | composition | H01 |
freq | 1 | 152 | 24 | 2186 |
In the describe
output, freq
is the number of rows with the top
value in a given column.
df.query('anchor == "component composite coating"').shape
(152, 5)
Structure the input
data:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input.head()
0 TEXT1: A47; TEXT2: abatement of pollution; ANC...
1 TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2 TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3 TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4 TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
Tokenization
Transformers use a Dataset
object for storing a dataset. We can create one like so:
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds
Dataset({
features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
num_rows: 36473
})
A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:
- Tokenization: Split each text up into words (tokens).
- Numericalization: Convert each word (or token) into a number.
The details of how this is done depend on the model. So pick a model first:
model_nm = 'microsoft/deberta-v3-small'
AutoTokenizer
will create a tokenizer appropriate for a given model:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/dist-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces):
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")
['▁G',
"'",
'day',
'▁folks',
',',
'▁I',
"'",
'm',
'▁Jeremy',
'▁from',
'▁fast',
'.',
'ai',
'!']
Uncommon words will be split into pieces. The start of a new word is represented by ▁ (the SentencePiece word-boundary marker).
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
['▁A',
'▁platypus',
'▁is',
'▁an',
'▁or',
'ni',
'tho',
'rhynch',
'us',
'▁an',
'at',
'inus',
'.']
Here’s a simple function which tokenizes our inputs:
def tok_func(x): return tokz(x["input"])
To run this quickly in parallel on every row in our dataset, use map
:
tok_ds = ds.map(tok_func, batched=True)
This adds a new item to our dataset called input_ids
. For instance, here is the input and IDs for the first row of our data:
row = tok_ds[0]
row['input'], row['input_ids']
('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
[1,
54453,
435,
294,
336,
5753,
346,
54453,
445,
294,
47284,
265,
6435,
346,
23702,
435,
294,
47284,
2])
There’s a list called vocab
in the tokenizer which contains a unique integer for every possible token string. We can look them up like this, for instance to find the token for the word “of”:
tokz.vocab['▁of']
265
265
is present in our input_ids
for the first row of data.
tokz.vocab['of']
1580
Finally, we need to prepare our labels. Transformers always assumes that your labels are in a column named labels
, but in our dataset it’s currently score
. Therefore, we need to rename it:
tok_ds = tok_ds.rename_columns({'score':'labels'})
Test and validation sets
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()
id | anchor | target | context | |
---|---|---|---|---|
count | 36 | 36 | 36 | 36 |
unique | 36 | 34 | 36 | 29 |
top | 4112d61851461f60 | el display | inorganic photoconductor drum | G02 |
freq | 1 | 2 | 1 | 3 |
This is the test set. Possibly the most important idea in machine learning is that of having separate training, validation, and test data sets.
Validation set
To explain the motivation, let’s start simple, and imagine we’re trying to fit a model where the true relationship is this quadratic:
def f(x): return -3*x**2 + 2*x + 20
Unfortunately matplotlib (the most common library for plotting in Python) doesn’t come with a way to visualize a function, so we’ll write something to do this ourselves:
import numpy as np
import matplotlib.pyplot as plt
def plot_function(f, min=-2.1, max=2.1, color='r'):
    x = np.linspace(min, max, 100)[:,None]
    plt.plot(x, f(x), color)
plot_function(f)
For instance, perhaps we’ve measured the height above ground of an object before and after some event. The measurements will have some random error. We can use numpy’s random number generator to simulate that. I like to use seed when writing about simulations like this so that I know you’ll see the same thing I do:
from numpy.random import normal,seed,uniform
np.random.seed(42)
def noise(x, scale): return normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)
x = np.linspace(-2, 2, num=20)[:,None]
y = add_noise(f(x), 0.2, 1.3)
plt.scatter(x,y);
Now let’s see what happens if we underfit or overfit these predictions. To do that, we’ll create a function that fits a polynomial of some degree (e.g. a line is degree 1, quadratic is degree 2, cubic is degree 3, etc). The details of how this function works don’t matter too much so feel free to skip over it if you like!
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def plot_poly(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    plt.scatter(x, y)
    plot_function(model.predict)
plot_poly(1)
As you see, the points on the red line (the line we fitted) aren’t very close at all. This is under-fit – there’s not enough detail in our function to match our data.
And what happens if we fit a degree 10 polynomial to our measurements?
plot_poly(10)
Well now it fits our data better, but it doesn’t look like it’ll do a great job predicting points other than those we measured – especially those in earlier or later time periods. This is over-fit – there’s too much detail such that the model fits our points, but not the underlying process we really care about.
Let’s try a degree 2 polynomial (a quadratic), and compare it to our “true” function (in blue):
plot_poly(2)
plot_function(f, color='b')
That’s not bad at all!
So, how do we recognise whether our models are under-fit, over-fit, or “just right”? We use a validation set. This is a set of data that we “hold out” from training – we don’t let our model see it at all. If you use the fastai library, it automatically creates a validation set for you if you don’t have one, and will always report metrics (measurements of the accuracy of a model) using the validation set.
The validation set is only ever used to see how we’re doing. It’s never used as inputs to training the model.
Transformers uses a DatasetDict
for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use train_test_split
:
dds = tok_ds.train_test_split(0.25, seed=42)
dds
DatasetDict({
train: Dataset({
features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 27354
})
test: Dataset({
features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 9119
})
})
As you see above, the validation set here is called test
and not validate
, so be careful!
In practice, a random split like we’ve used here might not be a good idea – here’s what Dr Rachel Thomas has to say about it:
“One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a train_test_split method, this method takes a random subset of the data, which is a poor choice for many real-world problems.”
Test set
So that’s the validation set explained, and created. What about the “test set” then – what’s that for?
The test set is yet another dataset that’s held out from training. But it’s held out from reporting metrics too! The accuracy of your model on the test set is only ever checked after you’ve completed your entire training process, including trying different models, training methods, data processing, etc.
You see, as you try all these different things, to see their impact on the metrics on the validation set, you might just accidentally find a few things that entirely coincidentally improve your validation set metrics, but aren’t really better in practice. Given enough time and experiments, you’ll find lots of these coincidental improvements. That means you’re actually over-fitting to your validation set!
That's why we keep a test set held back. Kaggle's public leaderboard is like a test set that you can check from time to time. But don't check too often, or you'll end up over-fitting to the test set too!
Kaggle has a second test set, which is yet another held-out dataset that’s only used at the end of the competition to assess your predictions. That’s called the “private leaderboard”.
We’ll use eval as our name for the test set, to avoid confusion with the test dataset that was created above.
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)
Metrics and correlation
When we're training a model, there will be one or more metrics that we're interested in maximising or minimising. These are the measurements that should, hopefully, represent how well our model will work for us.
In real life, outside of Kaggle, things are not so easy… As my partner Dr Rachel Thomas notes in The problem with metrics is a big problem for AI:
At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so. This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.
In Kaggle, however, it’s very straightforward to know what metric to use: Kaggle will tell you! According to this competition’s evaluation page, “submissions are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores.” This coefficient is usually abbreviated using the single letter r. It is the most widely used measure of the degree of relationship between two variables.
r can vary between -1, which means perfect inverse correlation, and +1, which means perfect positive correlation. The mathematical formula for it is much less important than getting a good intuition for what the different values look like. To start to get that intuition, let’s look at some examples using the California Housing dataset, which shows “is the median house value for California districts, expressed in hundreds of thousands of dollars”. This dataset is provided by the excellent scikit-learn library, which is the most widely used library for machine learning outside of deep learning.
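As a reference point (my own addition, not from the notebook), r is the covariance of the two variables divided by the product of their standard deviations, which we can verify against numpy on a tiny made-up sample:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# r = cov(x, y) / (std(x) * std(y))
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())
r_manual, np.corrcoef(x, y)[0, 1]   # the two values agree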
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
housing = housing['data'].join(housing['target']).sample(1000, random_state=52)
housing.head()
MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal | |
---|---|---|---|---|---|---|---|---|---|
7506 | 3.0550 | 37.0 | 5.152778 | 1.048611 | 729.0 | 5.062500 | 33.92 | -118.28 | 1.054 |
4720 | 3.0862 | 35.0 | 4.697897 | 1.055449 | 1159.0 | 2.216061 | 34.05 | -118.37 | 3.453 |
12888 | 2.5556 | 24.0 | 4.864905 | 1.129222 | 1631.0 | 2.395007 | 38.66 | -121.35 | 1.057 |
13344 | 3.0057 | 32.0 | 4.212687 | 0.936567 | 1378.0 | 5.141791 | 34.05 | -117.64 | 0.969 |
7173 | 1.9083 | 42.0 | 3.888554 | 1.039157 | 1535.0 | 4.623494 | 34.05 | -118.19 | 1.192 |
We can see all the correlation coefficients for every combination of columns in this dataset by calling np.corrcoef
:
np.set_printoptions(precision=2, suppress=True)
np.corrcoef(housing, rowvar=False)
array([[ 1. , -0.12, 0.43, -0.08, 0.01, -0.07, -0.12, 0.04, 0.68],
[-0.12, 1. , -0.17, -0.06, -0.31, 0. , 0.03, -0.13, 0.12],
[ 0.43, -0.17, 1. , 0.76, -0.09, -0.07, 0.12, -0.03, 0.21],
[-0.08, -0.06, 0.76, 1. , -0.08, -0.07, 0.09, 0. , -0.04],
[ 0.01, -0.31, -0.09, -0.08, 1. , 0.16, -0.15, 0.13, 0. ],
[-0.07, 0. , -0.07, -0.07, 0.16, 1. , -0.16, 0.17, -0.27],
[-0.12, 0.03, 0.12, 0.09, -0.15, -0.16, 1. , -0.93, -0.16],
[ 0.04, -0.13, -0.03, 0. , 0.13, 0.17, -0.93, 1. , -0.03],
[ 0.68, 0.12, 0.21, -0.04, 0. , -0.27, -0.16, -0.03, 1. ]])
This works well when we’re getting a bunch of values at once, but it’s overkill when we want a single coefficient:
np.corrcoef(housing.MedInc, housing.MedHouseVal)
array([[1. , 0.68],
[0.68, 1. ]])
Therefore, we’ll create this little function to just return the single number we need given a pair of variables:
def corr(x,y): return np.corrcoef(x,y)[0][1]
corr(housing.MedInc, housing.MedHouseVal)
0.6760250732906
Now we’ll look at a few examples of correlations, using this function (the details of the function don’t matter too much):
def show_corr(df, a, b):
    x, y = df[a], df[b]
    plt.scatter(x, y, alpha=0.5, s=4)
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')
show_corr(housing, 'MedInc', 'MedHouseVal')
So that’s what a correlation of 0.68 looks like. It’s quite a close relationship, but there’s still a lot of variation. (Incidentally, this also shows why looking at your data is so important – we can see clearly in this plot that house prices above $500,000 seem to have been truncated to that maximum value).
Let’s take a look at another pair:
show_corr(housing, 'MedInc', 'AveRooms')
The relationship looks like it is similarly close to the previous example, but r is much lower than the income vs valuation case. Why is that? The reason is that there are a lot of outliers – values of AveRooms
well outside the mean.
r is very sensitive to outliers. If there are outliers in your data, the relationship between them will dominate the metric. In this case, the houses with a very high number of rooms don't tend to be that valuable, so they're decreasing r from where it would otherwise be.
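A tiny side demonstration of that sensitivity (my own, using made-up data rather than the housing set):

x = np.arange(50, dtype=float)
y = x + np.random.normal(0, 2, 50)   # a strong linear relationship
print(corr(x, y))                    # close to 1

y_outlier = y.copy()
y_outlier[0] = 500                   # a single extreme outlier
print(corr(x, y_outlier))            # r drops dramatically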
Let’s remove the outliers and try again:
subset = housing[housing.AveRooms<15]
show_corr(subset, 'MedInc', 'AveRooms')
As we expected, now the correlation is very similar to our first comparison.
Here’s another relationship using AveRooms
on the subset:
show_corr(subset, 'MedHouseVal', 'AveRooms')
At this level, with r of 0.34, the relationship is becoming quite weak.
Let’s look at one more:
show_corr(subset, 'HouseAge', 'AveRooms')
As you see here, a correlation of -0.2 shows a very weak negative trend.
We’ve seen now examples of a variety of levels of correlation coefficient, so hopefully you’re getting a good sense of what this metric means.
Transformers expects metrics to be returned as a dict
, since that way the trainer knows what label to use, so let’s create a function to do that:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
Training Our Model
To train a model in Transformers we’ll need this:
from transformers import TrainingArguments,Trainer
We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly:
bs = 128
epochs = 4
The most important hyperparameter is the learning rate. fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value you can use without training failing.
lr = 8e-5
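One rough way to do that trial and error (my own sketch of the doubling heuristic, not something the notebook runs): start with a very low rate and keep doubling it, keeping the largest value for which the loss still decreases smoothly.

# candidate learning rates to try, doubling each time: 1e-5, 2e-5, 4e-5, ..., 3.2e-4;
# train briefly with each and keep the largest one that doesn't diverge
candidate_lrs = [1e-5 * 2**i for i in range(6)]
candidate_lrs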
Transformers uses the TrainingArguments
class to set up arguments. Don’t worry too much about the values we’re using here – they should generally work fine in most cases. It’s just the 3 parameters above that you may need to change for different models.
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
We can now create our model, and Trainer
, which is a class which combines the data and model together (just like Learner
in fastai):
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.weight', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Let’s train our model!
trainer.train();
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Epoch | Training Loss | Validation Loss | Pearson |
---|---|---|---|
1 | No log | 0.032255 | 0.790911 |
2 | No log | 0.023222 | 0.814958 |
3 | 0.040500 | 0.022491 | 0.828246 |
4 | 0.040500 | 0.023501 | 0.828109 |
The key thing to look at is the “Pearson” value in table above. As you see, it’s increasing, and is already above 0.8. That’s great news! We can now submit our predictions to Kaggle if we want them to be scored on the official leaderboard. Let’s get some predictions on the test set:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds
array([[ 0.58],
[ 0.69],
[ 0.57],
[ 0.33],
[-0.01],
[ 0.5 ],
[ 0.55],
[-0.01],
[ 0.31],
[ 1.15],
[ 0.29],
[ 0.24],
[ 0.76],
[ 0.91],
[ 0.75],
[ 0.43],
[ 0.33],
[-0.01],
[ 0.66],
[ 0.33],
[ 0.46],
[ 0.26],
[ 0.18],
[ 0.22],
[ 0.59],
[-0.04],
[-0.02],
[ 0.01],
[-0.03],
[ 0.59],
[ 0.3 ],
[-0. ],
[ 0.68],
[ 0.52],
[ 0.47],
[ 0.23]])
Look out - some of our predictions are <0, or >1! This once again shows the value of remembering to actually look at your data. Let's fix those out-of-bounds predictions:
preds = np.clip(preds, 0, 1)
preds
array([[0.58],
[0.69],
[0.57],
[0.33],
[0. ],
[0.5 ],
[0.55],
[0. ],
[0.31],
[1. ],
[0.29],
[0.24],
[0.76],
[0.91],
[0.75],
[0.43],
[0.33],
[0. ],
[0.66],
[0.33],
[0.46],
[0.26],
[0.18],
[0.22],
[0.59],
[0. ],
[0. ],
[0.01],
[0. ],
[0.59],
[0.3 ],
[0. ],
[0.68],
[0.52],
[0.47],
[0.23]])
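The notebook excerpt stops here, but to actually score these on Kaggle the clipped predictions would need to go into a submission.csv. A minimal sketch of my own (I'm assuming the expected columns are id and score, as in sample_submission.csv):

import pandas as pd

submission = pd.DataFrame({
    'id': eval_df['id'],        # ids from the test set we loaded earlier
    'score': preds.squeeze(),   # clipped predictions, flattened from (36, 1) to (36,)
})
submission.to_csv('submission.csv', index=False)
submission.head()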
Notebook Exercise: Deeper Dive: Iterate like a grandmaster!
In this section I’ll run through the explanation and code provided in Jeremy’s notebook here.
In this notebook I’ll try to give a taste of how a competitions grandmaster might tackle the U.S. Patent Phrase to Phrase Matching competition. The focus generally should be two things:
- Creating an effective validation set
- Iterating rapidly to find changes which improve results on the validation set.
If you can do these two things, then you can try out lots of experiments and find what works, and what doesn’t. Without these two things, it will be nearly impossible to do well in a Kaggle competition (and, indeed, to create highly accurate models in real life!)
The more code you have, the more you have to maintain, and the more chances there are to make mistakes. So keep it simple!
from pathlib import Path
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    !pip install -Uqq fastai
else:
    import zipfile, kaggle
    path = Path('us-patent-phrase-to-phrase-matching')
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading us-patent-phrase-to-phrase-matching.zip to /content
100%|██████████| 682k/682k [00:00<00:00, 1.49MB/s]
from fastai.imports import *
if iskaggle: path = Path('../input/us-patent-phrase-to-phrase-matching')
path.ls()
(#3) [Path('us-patent-phrase-to-phrase-matching/sample_submission.csv'),Path('us-patent-phrase-to-phrase-matching/test.csv'),Path('us-patent-phrase-to-phrase-matching/train.csv')]
Let’s look at the training set:
df = pd.read_csv(path/'train.csv')
df
id | anchor | target | context | score | |
---|---|---|---|---|---|
0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
... | ... | ... | ... | ... | ... |
36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |
36472 | 8d135da0b55b8c88 | wood article | wooden substrate | B44 | 0.50 |
36473 rows × 5 columns
And the test set:
eval_df = pd.read_csv(path/'test.csv')
len(eval_df)
36
eval_df.head()
id | anchor | target | context | |
---|---|---|---|---|
0 | 4112d61851461f60 | opc drum | inorganic photoconductor drum | G02 |
1 | 09e418c93a776564 | adjust gas flow | altering gas flow | F23 |
2 | 36baf228038e314b | lower trunnion | lower locating | B60 |
3 | 1f37ead645e7f0c8 | cap component | upper portion | D06 |
4 | 71a5b6ad068d531f | neural stimulation | artificial neural network | H04 |
df.target.value_counts()
composition 24
data 22
metal 22
motor 22
assembly 21
..
switching switch over valve 1
switching switch off valve 1
switching over valve 1
switching off valve 1
wooden substrate 1
Name: target, Length: 29340, dtype: int64
We see that there’s nearly as many unique targets as items in the training set, so they’re nearly but not quite unique. Most importantly, we can see that these generally contain very few words (1-4 words in the above sample).
df.anchor.value_counts()
component composite coating 152
sheet supply roller 150
source voltage 140
perfluoroalkyl group 136
el display 135
...
plug nozzle 2
shannon 2
dry coating composition1 2
peripheral nervous system stimulation 1
conduct conducting material 1
Name: anchor, Length: 733, dtype: int64
We can see here that there’s far fewer unique values (just 733) and that again they’re very short (2-4 words in this sample).
df.context.value_counts()
H01 2186
H04 2177
G01 1812
A61 1477
F16 1091
...
B03 47
F17 33
B31 24
A62 23
F26 18
Name: context, Length: 106, dtype: int64
The first character is the section the patent was filed under – let’s create a column for that and look at the distribution:
df['section'] = df.context.str[0]
df.section.value_counts()
B 8019
H 6195
G 6013
C 5288
A 4094
F 4054
E 1531
D 1279
Name: section, dtype: int64
Finally, we’ll take a look at a histogram of the scores:
df.score.hist();
There’s a small number that are scored 1.0
- here’s a sample:
df[df.score==1]
id | anchor | target | context | score | section | |
---|---|---|---|---|---|---|
28 | 473137168ebf7484 | abatement | abating | F24 | 1.0 | F |
158 | 621b048d70aa8867 | absorbent properties | absorbent characteristics | D01 | 1.0 | D |
161 | bc20a1c961cb073a | absorbent properties | absorption properties | D01 | 1.0 | D |
311 | e955700dffd68624 | acid absorption | absorption of acid | B08 | 1.0 | B |
315 | 3a09aba546aac675 | acid absorption | acid absorption | B08 | 1.0 | B |
... | ... | ... | ... | ... | ... | ... |
36398 | 913141526432f1d6 | wiring trough | wiring troughs | F16 | 1.0 | F |
36435 | ee0746f2a8ecef97 | wood article | wood articles | B05 | 1.0 | B |
36440 | ecaf479135cf0dfd | wood article | wooden article | B05 | 1.0 | B |
36464 | 8ceaa2b5c2d56250 | wood article | wood article | B44 | 1.0 | B |
36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.0 | B |
1154 rows × 6 columns
We can see from this that these are just minor rewordings of the same concept, and aren't likely to be specific to the context. Any pretrained model should be pretty good at finding these already.
Training
! pip install transformers[sentencepiece] datasets accelerate
from torch.utils.data import DataLoader
import warnings,transformers,logging,torch
from transformers import TrainingArguments,Trainer
from transformers import AutoModelForSequenceClassification,AutoTokenizer
if iskaggle:
    !pip install -q datasets
import datasets
from datasets import load_dataset, Dataset, DatasetDict

# quiet huggingface warnings
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)
# specify which model we are going to be using
model_nm = 'microsoft/deberta-v3-small'
We can now create a tokenizer for this model. Note that pretrained models assume that text is tokenized in a particular way. In order to ensure that your tokenizer matches your model, use the AutoTokenizer
, passing in your model name.
tokz = AutoTokenizer.from_pretrained(model_nm)
We’ll need to combine the context, anchor, and target together somehow. There’s not much research as to the best way to do this, so we may need to iterate a bit. To start with, we’ll just combine them all into a single string. The model will need to know where each section starts, so we can use the special separator token to tell it:
sep = tokz.sep_token
sep
'[SEP]'
df['inputs'] = df.context + sep + df.anchor + sep + df.target
Generally we’ll get best performance if we convert pandas DataFrames into HuggingFace Datasets, so we’ll convert them over, and also rename the score column to what Transformers expects for the dependent variable, which is label
:
ds = Dataset.from_pandas(df).rename_column('score', 'label')
eval_ds = Dataset.from_pandas(eval_df)
To tokenize the data, we’ll create a function (since that’s what Dataset.map
will need):
def tok_func(x): return tokz(x["inputs"])
tok_func(ds[0])
{'input_ids': [1, 336, 5753, 2, 47284, 2, 47284, 265, 6435, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The only bit we care about at the moment is input_ids
. We can see in the tokens that it starts with a special token 1
(which represents the start of text), and then has our three fields separated by the separator token 2
. We can check the indices of the special token IDs like so:
tokz.all_special_tokens
['[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]']
We can now tokenize the input. We’ll use batching to speed it up, and remove the columns we no longer need:
inps = "anchor","target","context"
tok_ds = ds.map(tok_func, batched=True, remove_columns=inps+('inputs','id','section'))
Looking at the first item of the dataset we should see the same information as when we checked tok_func above:
tok_ds[0]
{'label': 0.5,
'input_ids': [1, 336, 5753, 2, 47284, 2, 47284, 265, 6435, 2],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Creating a validation set
According to this post, the private test anchors do not overlap with the training set. So let’s do the same thing for our validation set.
First, create a randomly shuffled list of anchors:
anchors = df.anchor.unique()
np.random.seed(42)
np.random.shuffle(anchors)
anchors[:5]
array(['time digital signal', 'antiatherosclerotic', 'filled interior',
'dispersed powder', 'locking formation'], dtype=object)
Now we can pick some proportion (e.g 25%) of these anchors to go in the validation set:
val_prop = 0.25
val_sz = int(len(anchors)*val_prop)
val_anchors = anchors[:val_sz]
Now we can get a list of which rows match val_anchors
, and get their indices:
# is_val is a boolean array
is_val = np.isin(df.anchor, val_anchors)
idxs = np.arange(len(df))
val_idxs = idxs[ is_val]
trn_idxs = idxs[~is_val]
len(val_idxs), len(trn_idxs)
(9116, 27357)
Our training and validation Dataset
s can now be selected, and put into a DatasetDict
ready for training:
dds = DatasetDict({"train": tok_ds.select(trn_idxs),
                   "test": tok_ds.select(val_idxs)})
BTW, a lot of people do more complex stuff for creating their validation set, but with a dataset this large there’s not much point. As you can see, the mean scores in the two groups are very similar despite just doing a random shuffle:
df.iloc[trn_idxs].score.mean(),df.iloc[val_idxs].score.mean()
(0.3623021530138539, 0.3613426941641071)
Initial model
Let’s now train our model! We’ll need to specify a metric, which is the correlation coefficient provided by numpy (we need to return a dictionary since that’s how Transformers knows what label to use):
def corr(eval_pred): return {'pearson': np.corrcoef(*eval_pred)[0][1]}
We pick a learning rate and batch size that fits our GPU, and pick a reasonable weight decay and small number of epochs:
lr,bs = 8e-5,128
wd,epochs = 0.01,4
Transformers uses the TrainingArguments
class to set up arguments. We’ll use a cosine scheduler with warmup, since at fast.ai we’ve found that’s pretty reliable. We’ll use fp16 since it’s much faster on modern GPUs, and saves some memory. We evaluate using double-sized batches, since no gradients are stored so we can do twice as many rows at a time.
def get_trainer(dds):
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
        evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
        num_train_epochs=epochs, weight_decay=wd, report_to='none')
    model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=corr)
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=wd, report_to='none')
We can now create our model, and Trainer
, which is a class which combines the data and model together (just like Learner
in fastai):
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr)
trainer.train();
Epoch | Training Loss | Validation Loss | Pearson |
---|---|---|---|
1 | No log | 0.027171 | 0.794542 |
2 | No log | 0.026872 | 0.811033 |
3 | 0.035300 | 0.024633 | 0.816882 |
4 | 0.035300 | 0.024581 | 0.817413 |
Improving the model
We now want to start iterating to improve this. To do that, we need to know whether the model gives stable results. I tried training it 3 times from scratch, and got a range of outcomes from 0.808-0.810. This is stable enough to make a start - if we’re not finding improvements that are visible within this range, then they’re not very significant! Later on, if and when we feel confident that we’ve got the basics right, we can use cross validation and more epochs of training.
Iteration speed is critical, so we need to quickly be able to try different data processing and trainer parameters. So let’s create a function to quickly apply tokenization and create our DatasetDict
:
def get_dds(df):
    ds = Dataset.from_pandas(df).rename_column('score', 'label')
    tok_ds = ds.map(tok_func, batched=True, remove_columns=inps+('inputs','id','section'))
    return DatasetDict({"train": tok_ds.select(trn_idxs), "test": tok_ds.select(val_idxs)})
def get_model(): return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
def get_trainer(dds, model=None):
    if model is None: model = get_model()
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
        evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
        num_train_epochs=epochs, weight_decay=wd, report_to='none')
    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=corr)
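To check the stability point mentioned above, one could retrain a few times with these helpers and compare the final validation Pearson values. A sketch of my own (each run takes several minutes on a GPU):

# train three times from scratch and collect the final Pearson metric from each run
scores = []
for _ in range(3):
    trainer = get_trainer(get_dds(df))
    trainer.train()
    scores.append(trainer.evaluate()['eval_pearson'])
scores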
Perhaps using the special separator character isn’t a good idea, and we should use something we create instead. Let’s see if that makes things better. First we’ll change the separator and create the DatasetDict
:
sep = " [s] "
df['inputs'] = df.context + sep + df.anchor + sep + df.target
dds = get_dds(df)
get_trainer(dds).train()
Epoch | Training Loss | Validation Loss | Pearson |
---|---|---|---|
1 | No log | 0.027216 | 0.799765 |
2 | No log | 0.025568 | 0.814325 |
3 | 0.031000 | 0.023474 | 0.817759 |
4 | 0.031000 | 0.024206 | 0.817377 |
TrainOutput(global_step=856, training_loss=0.023552694610346144, metrics={'train_runtime': 207.9058, 'train_samples_per_second': 526.335, 'train_steps_per_second': 4.117, 'total_flos': 582121520370810.0, 'train_loss': 0.023552694610346144, 'epoch': 4.0})
That’s looking quite a bit better, so we’ll keep that change.
(Vishal note: I trained it a few times but couldn’t get the pearson coefficient past 0.8174)
Often changing to lowercase is helpful. Let’s see if that helps too:
df['inputs'] = df.inputs.str.lower()
dds = get_dds(df)
get_trainer(dds).train()
Epoch | Training Loss | Validation Loss | Pearson |
---|---|---|---|
1 | No log | 0.025207 | 0.798847 |
2 | No log | 0.024926 | 0.813183 |
3 | 0.031800 | 0.023556 | 0.815640 |
4 | 0.031800 | 0.024359 | 0.815295 |
TrainOutput(global_step=856, training_loss=0.024133934595874536, metrics={'train_runtime': 197.3858, 'train_samples_per_second': 554.386, 'train_steps_per_second': 4.337, 'total_flos': 582121520370810.0, 'train_loss': 0.024133934595874536, 'epoch': 4.0})
Special tokens
What if we made the patent section a special token? Then potentially the model might learn to recognize that different sections need to be handled in different ways. To do that, we’ll use, e.g. [A] for section A. We’ll then add those as special tokens:
df['sectok'] = '[' + df.section + ']'
sectoks = list(df.sectok.unique())
tokz.add_special_tokens({'additional_special_tokens': sectoks})
8
df['inputs'] = df.sectok + sep + df.context + sep + df.anchor.str.lower() + sep + df.target
dds = get_dds(df)
Since we’ve added more tokens, we need to resize the embedding matrix in the model:
model = get_model()
model.resize_token_embeddings(len(tokz))
Embedding(128009, 768)
trainer = get_trainer(dds, model=model)
trainer.train()
Epoch | Training Loss | Validation Loss | Pearson |
---|---|---|---|
1 | No log | 0.025942 | 0.810038 |
2 | No log | 0.025694 | 0.814332 |
3 | 0.010500 | 0.023547 | 0.816508 |
4 | 0.010500 | 0.024562 | 0.817200 |
TrainOutput(global_step=856, training_loss=0.009868621826171875, metrics={'train_runtime': 221.7169, 'train_samples_per_second': 493.548, 'train_steps_per_second': 3.861, 'total_flos': 695370741753690.0, 'train_loss': 0.009868621826171875, 'epoch': 4.0})
Before submitting a model, retrain it on the full dataset, rather than just the 75% training subset we’ve used here. Create a function like the ones above to make that easy for you!
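A sketch of what such a function might look like (my own; it reuses get_trainer, and keeps a small nominal "test" split only so the Trainer's per-epoch evaluation still has something to report):

def get_final_dds(df):
    # tokenize as before, but put every row into the training split
    ds = Dataset.from_pandas(df).rename_column('score', 'label')
    tok_ds = ds.map(tok_func, batched=True, remove_columns=inps+('inputs','id','section'))
    return DatasetDict({"train": tok_ds, "test": tok_ds.select(range(1000))})

trainer = get_trainer(get_final_dds(df))
trainer.train()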
Video Notes
In this section, I’ll take notes while watching this lesson’s video.
- Introduction
- In the book, we do NLP using Recurrent Neural Networks (RNNs).
- In the video, we’re going to be fine-tuning a pretrained NLP model using a library called HuggingFace Transformers.
- It’s useful to have experience in using more than one library. See the same concepts applied in different ways. Great for understanding the concepts.
- HuggingFace libraries are SOTA in NLP.
- Transformers library in process of being integrated into fastai library.
- HuggingFace Transformers doesn’t have the same layered API as fastai.
- Fine-Tuning a Pretrained Model
- In the quadratic/sliders example, a pretrained model is like someone telling you that they are confident what parameter a should be, are somewhat confident what b should be, and have no idea what c should be. Then, we would train c until it fits our data, adjust b a little, and keep a as is. That's what it's like fine-tuning a pretrained model.
- A pretrained model is a bunch of parameters that have already been fit, where for some of them we're pretty confident of what they should be, and for some of them we really have no idea at all.
- Fine-tuning is the process of taking those ones where we have no idea at all what they should be and trying to get them right, and then moving the other ones a little bit.
- ULMFiT
- The idea of fine-tuning a pretrained NLP model was pioneered by ULMFiT which was first introduced in a fastai course, later turned into an academic paper by Jeremy and Sebastian Ruder which inspired a huge change in NLP capabilities around the world.
- Step 1
- Build a language model using all of Wikipedia that tried to predict the next word of a Wikipedia article. Filling in these kinds of things requires understanding a lot about how language is structured and about the world. Getting good at fitting a language model requires a neural net getting good at a lot of things. It needs to understand language at a reasonably good level, what is true, what is not true, different ways in which things are expressed and so on. Started with random weights. At the end was a model that could predict more than 30% of the time correctly what the next word in a Wikipedia article would be.
- Step 2
- Create a second language model that predicts the next word of a sentence. Took the pretrained model and ran a few more epochs using IMDb movie reviews. So it got very good at predicting the next word of an IMDb movie review.
- Step 3
- Took those weights and fine-tuned them for the task of predicting whether or not a movie review was positive or negative sentiment.
- The first two models don’t require labels. The labels was what’s the next word of the sentence.
- ULMFiT built with RNNs.
- Transformers developed at the same time of ULMFiT’s release.
- Transformers can take advantage of modern accelerators like Google’s TPUs.
- Transformers don't allow you to predict the next word of a sentence; it's just not how they are structured. Instead they deleted a few words at random and asked the model to predict which words were deleted. The basic concept is similar to ULMFiT: the RNN is replaced with a Transformer, and the language model with a masked language model.
- How do you go from a model that’s trained to predict the next word to a model that does classification?
- The first layer of an ImageNet classification model finds basic features like diagonal edges, gradients, etc. Layer two combines those (ReLUs added together, activations from sets of ReLUs matrix multiplied, etc.)
- Layer 5 had bird and lizard eyeball detectors, dog face detectors, flowers detectors, etc.
- Later layers do things much more specific to the training task.
- Pretty unlikely that you need to change the early layers.
- The layer that says “what is this” is deleted in fine-tuning (the layer that has one output per category). The model is then spitting out a few hundred activations. We stick a new random matrix on top of that and train it, so it can predict what you’re trying to predict. Then we gradually train the rest of the layers.
- Getting started with NLP for absolute beginners
- US Patent Phrase to Phrase Matching Competition.
- Classification is probably the most widely used NLP application.
- Document = an input to an NLP model that contains text.
- Classifying a document is a rich thing to do: sentiment analysis, author identification, legal discovery, organizing documents by topic, triaging inbound emails.
- The Kaggle competition on US Patents does not immediately look like a classification problem.
- Columns: Anchor, target, context, score
- Goal: come up with a model that automatically determines which anchor and target pairs are talking about the same thing. score = 1.0 means the anchor and target mean the same thing, 0.0 means they are not.
- Whether the anchor and target are determined to be similar or not depends on the context.
- Represent the problem as
<constant string><anchor><separator><constant string><target>
and choose category 0.0, 0.25, 0.50, 0.75 or 1.00. - Kaggle data is already on Kaggle.
- Always look through the competition’s Data page and read through it before jumping into the data.
- Use
DataFrame.describe(include='object')
to see stats about the fields (count, unique, top, frequency of top). - This dataset contains very small documents (3-4 words) that are not very unique. There’s not a lot of unique data to work with.
- Create a single string of anchor, target, and context with separators and store as the
input
column. - Neural networks work with numbers: We’re going to take the numbers, multiply by matrices, replace negatives with zeros, add them up, and do this a few times.
- Tokenization: Split each document into tokens (words).
- The list of unique words is called the vocabulary.
- Numericalization: Each word in the vocabulary gets a number. The bigger the vocab, the more memory gets used, the more data we need to train. We don't want a large vocabulary.
- Tokenize into sub-words (pieces of words).
- We can turn a pandas DataFrame into a Huggingface dataset’s Dataset using
Dataset.from_pandas
. - Whatever pretrained model you used comes with a tokenizer. Before you start tokenizing, you have to decide on which model to use.
- The HuggingFace Model Hub has pretrained models trained on specific corpora.
- There are some generally good models,
deberta-v3
is one of those. - NLP has been practically effective for general users for only a year or two, a lot of this stuff we’re figuring out as a community.
- Always start with a
small
model, it’s faster to train, we’re going to be able to do more iterations. AutoTokenizer.from_pretrained(<model name>)
will download the vocab and details about how this particular model tokenized the dataset._
represents the start of a word.def tok_func(x): return tokx(x['input'])
takes a documentx
, and tokenizes it’sinput
.Dataset.map
will parallelize the process of calling the function on each value.batched=True
will do a bunch at a time. Tokenizer library is an optimized Rust library.input_ids
will contain numbers in the position of each of the tokens.- How do you choose the keywords and the order of the fields when creating
input
?- It’s arbitrary, try a few things. We just want something it can learn from that separates one field from another.
- If one of the fields was long (1000 characters) is there any special handling required there?
- Long documents in ULMFiT require no special consideration. ULMFiT is the best approach for large documents. It will split large documents into pieces.
- Large documents are challenging for Transformers, which process the whole document at once.
- Documents over 2000 words: look at ULMFiT.
- Under 2000 words: Transformers should be fine unless you have a laptop GPU with not much memory.
- HuggingFace transformers expect that your target is a column called
labels
. test.csv
doesn’t have ascore
field.- Perhaps the most important idea in machine learning is having separate training, validation and test datasets.
- Test and validation sets are all about identifying and controlling for overfitting.
- Underfit: not enough complexity in the model fit to match the data that’s there. It’s systematically biased.
- Common misunderstanding is that simpler models are more reliable in some way, but models that are too simple will be systematically incorrect.
- Overfit: it’s done a good job of fitting our data points, but if we sample some more data points from our distribution the model won’t be close to them.
- Underfitting is easy to recognize (we can look at training data and see that it’s not very close).
- Overfitting is harder to recognize because the training data is very close.
- How do we tell if we have a good fit that’s not overfitting? We measure how good our model is by looking ONLY at the points we set aside as the validation set.
- fast.ai won’t let you train a model without a validation set and shows metrics only on the validation set.
- Creating a good validation set is not generally as simple as just randomly pulling some of your data out of the data that you train your model on.
- Kaggle is a great place to learn how to create a good validation set.
- A test set is another validation set that you don’t use for metrics. Helps you see if you overfit using the validation set.
- Kaggle has two test sets: leaderboard feedback during competition and second test set that is private until after competition is finished.
- Don’t accidentally find a model that is good by coincidence. Only if you have a test set that you hold out will you know if you’ve done this.
- If your model is terrible on the test set—go back to square one.
- You don’t want functions with gradient of 0 of inf (like accuracy) you want something smooth.
- One metric is not enough to capture all of the real world dynamics involved in a model’s use.
- Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
- AI is really good at optimizing metrics so you have to be careful what metrics you choose for models that are used in real life (impacting people’s lives).
- Pearson correlation coefficient is the most widely used measure of how similar two variables are
- -1.0 to +1.0.
- Abbreviated as r.
- How do I plot datasets with far too many points? The answer is: get less points (sample).
np.corrcoef
gives a diagonally symmetric matrix of r values.- Visualizing your data is important so you can see things like how data is truncated.
alpha=0.5
for scatter plots creates darker areas where there’s lots of dots.- r relies on the square of the difference, big outliers increase that by a lot.
- r is very sensitive to outliers.
- If you’re trying to win a Kaggle competition that uses r and even a couple of your rows are really wrong, it will be a disaster.
- You almost can’t see the relationship for \(r=0.34\)
- Transformers expects metric to be returned as a
dict
. tok_ds.train_test_split()
returns aDatasetDict({train: Dataset, test: Dataset})
- Transformers calls its validation set
test
, on which it calculates metrics.
Learner
is the HuggingFace Transformer’sTrainer
. - The larger the batch size, the more you can do in parallel and the faster it’ll be, but if it’s too large you’ll get an out-of-memory error on the GPU.
- If you’re using a framework that doesn’t have a learning rate finder like fastai, you can just start with a really low learning rate and then keep doubling it until it falls apart.
TrainingArguments
is a class that takes all of the configuration (like learning rate, warmup ratio, scheduler type, weight decay, etc.).- You always want
fp16=True
as it will be faster. AutoModelForSequenceClassification
will create an model for classification,.from_pretrained
will use a pretrained model which has anum_labels
param which is the number of output columns we have, which in this case is 1 (the score).Trainer
takes the model, the training and validation data,TrainingArguments()
, tokenizer and metrics).Trainer.train()
will train the model.- HuggingFace is very verbose, the warnings which you can ignore.
- The only reason we get a high r value after 4 epochs is because we used a pretrained model.
- The pretrained model already knows a lot about language and has a good sense of whether two phrases have the same meaning or not.
- How do you decide when it’s okay to remove outliers?
- Outliers should never just be removed for modelling.
- Instead we would observe that clearly from looking at this dataset, these two groups can’t be treated the same way (low income/high # of rooms vs. high income/high # of rooms). Split them into two separate analyses.
- Outlier exists in a statistical sense, it doesn’t exist in a real sense (i.e. things that we should ignore or throw away). Some of the most useful insights in data projects are digging into outliers and understanding what are they? and where did they come from? It’s in those edge cases where you discover really important things like when processes go wrong, labelling problems. Never delete outliers. Investigate them, have a strategy about what you’re going to do with them.
- Training with HuggingFace’s Transformer is similar to the things we’ve seen before with fastai.
trainer.predict(eval_ds).predictions.astype(float)
to get predictions fromTrainer
object.- Always look at your outputs. So you can see things like having negative predictions or predictions over 1, which are outside the range of the patent phrase matching score. For now, we can at least round these off up to 0 and down to 1, respectively, better ways to do this but this is better than nothing.
- Kaggle expects submissions to generally be in a CSV file.
- NLP is probably where the biggest opportunities are for big wins in research and commercialization.
- It’s worth thinking about both use and misuse of modern NLP.
- You can create bots to generate context-appropriate conversation and scale it up to 99% of Twitter and nobody would know. This is worrying because a lot of how people see the world is coming out of social media conversations, which at this point are controllable. It would not be that hard to create something that's optimized towards moving a point of view amongst a billion people in a very subtle way, very gradually over a long period of time by multiple bots each pretending to argue with each other and one of them getting the upper hand and so forth.
- What GPT is used for we may not know for decades, if ever.
- 2017: millions of submissions to the FTC about Net Neutrality very heavily biased against it. An analysis showed that something like 99% of them were auto-generated. We don’t know for sure but this seems successful because repealing Net Neutrality went through, the comments were factored into this decision.
- You can always create a generative model that beats bot classifiers designed to classify its content as auto-generated. Similar problem with spam prevention.
- If you pass
num_labels=1
toAutoModelForSequenceClassification
it treats it as a regression problem.
Book Notes
In this section, I’ll take notes and run code examples from Chapter 10: NLP Deep Dive: RNNs in the textbook.
- In general, in NLP the pretrained model is trained on a different task.
- language model: a model that has been trained to guess the next word in a text (having read the ones before).
- self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels.
- To properly guess the next word in a sentence, the model will have to develop an understanding of the natural language.
- Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pretraining a model used for transfer learning.
- Self-supervised learning and computer vision
- Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting.
- You get even better results if you fine-tune the sequence-based language model prior to fine-tuning the classification model.
- The IMDb dataset contains 100k movie reviews (50k unlabeled, 25k labeled training set reviews, 25k labeled validation set reviews). We can use all of these reviews to fine-tune the pretrained language model, which was trained only on Wikipedia articles, this will result in a language model that is particularly good at predicting the next word of a movie review. This is known as Universal Language Model Fine-tuning (ULMFiT).
- The extra stage of fine-tuning the language model, prior to transfer learning to classification task, resulted in significantly better predictions.
Text Preprocessing
- Using categorical variables as independent variables for a neural network:
- Make a list of all possible levels of that categorical variable (the vocab).
- Replace each level with its index in the vocab.
- Create an embedding matrix for this containing a row for each level (i.e., for each item of the vocab).
- Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step 2; this is equivalent to, but faster and more efficient than, a matrix that takes as input one-hot-encoded vectors representing the indexes).
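The equivalence mentioned in the last step can be checked with a tiny PyTorch sketch (my own illustration; the sizes are arbitrary):

import torch
import torch.nn.functional as F
from torch import nn

vocab_sz, emb_sz = 10, 3
emb = nn.Embedding(vocab_sz, emb_sz)

idxs = torch.tensor([4, 7])                   # raw vocab indexes
one_hot = F.one_hot(idxs, vocab_sz).float()   # the equivalent one-hot vectors

# indexing rows of the weight matrix gives the same result as multiplying
# one-hot vectors by it -- the embedding layer just skips building the one-hots
emb(idxs), one_hot @ emb.weight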
- We can do nearly the same thing with text:
- First we concatenate all of the documents in our dataset into one big long string and split it into words (or tokens), giving us a very long list of words.
- Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word.
- Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus.
- Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words, we won’t have anything, so we will just initialize the corresponding row with a random vector.
- Steps for creating a language model:
- Tokenization: convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
- Numericalization: List all of the unique words that appear (vocab) and convert each word into a number by looking up its index in the vocab.
- Language model data loader creation: fastai’s
LMDataLoader
automatically handles creating a dependent variable that is offset from the independent variable by one token, and handles important details like shuffling the training data so that the dependent and independent variables maintain their structure as required. - Language model creation: we need a model that handles input lists that could be arbitrarily big or small. We use a Recurrent Neural Network (RNN).
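A minimal fastai sketch of those steps (my own, using the small IMDB_SAMPLE dataset; the chapter itself uses the full IMDb corpus):

from fastai.text.all import *

# download a small sample of IMDb reviews and build language-model DataLoaders:
# fastai handles tokenization, numericalization, and the offset dependent variable
path = untar_data(URLs.IMDB_SAMPLE)
dls_lm = TextDataLoaders.from_csv(path, 'texts.csv', text_col='text', is_lm=True)
dls_lm.show_batch(max_n=2)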
Tokenization
There is no one approach to tokenization. There are three main approaches:
- Word-based: Split a sentence on spaces and separate parts of meaning even when there are no spaces (“don’t” -> “do n’t”). Punctuation marks are generally split into separate tokens.
- Subword based: Split words into smaller parts, based on the most commonly occurring substrings (“occasion” -> “o c ca sion”).
- Character-based: Split a sentence into its individual characters.
Word Tokenization with fastai
Rather than providing its own tokenizers, fastai provides a consistent interface to a range of tokenizers in external libraries.
Let’s try it out with the IMDb dataset:
from fastai.text.all import *
path = untar_data(URLs.IMDB)
path.ls()
(#7) [Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/test'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/README')]
get_text_files
gets all the text files in a path
files = get_text_files(path, folders=['train', 'test', 'unsup'])
files[:10]
(#10) [Path('/root/.fastai/data/imdb/unsup/42765_0.txt'),Path('/root/.fastai/data/imdb/unsup/19120_0.txt'),Path('/root/.fastai/data/imdb/unsup/8649_0.txt'),Path('/root/.fastai/data/imdb/unsup/32022_0.txt'),Path('/root/.fastai/data/imdb/unsup/30143_0.txt'),Path('/root/.fastai/data/imdb/unsup/14876_0.txt'),Path('/root/.fastai/data/imdb/unsup/28162_0.txt'),Path('/root/.fastai/data/imdb/unsup/32133_0.txt'),Path('/root/.fastai/data/imdb/unsup/21844_0.txt'),Path('/root/.fastai/data/imdb/unsup/830_0.txt')]
Here’s a review that we will tokenize:
txt = files[0].open().read(); txt[:75]
"Despite some humorous banter and a decent supporting cast, I can't really r"
WordTokenizer will always point to fastai's current default word tokenizer. fastai's coll_repr(collection, n) displays the first n items of collection, along with the full size.

tokz = WordTokenizer()
toks = first(tokz([txt]))
print(coll_repr(toks, 30))
(#243) ['Despite','some','humorous','banter','and','a','decent','supporting','cast',',','I','ca',"n't",'really','recommend','this','movie','.','The','leads','are',"n't",'very','likable','and','I','did',"n't",'particularly','care'...]
Tokenization is a surprisingly subtle task. “.” is separated when it terminates a sentence but not in an acronym or number:
first(tokz(['The U.S. dollar $1 is $1.00.']))
(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']
fastai adds some functionality to the tokenization process with the Tokenizer class:

tkn = Tokenizer(tokz)
print(coll_repr(tkn(txt), 31))
(#264) ['xxbos','xxmaj','despite','some','humorous','banter','and','a','decent','supporting','cast',',','i','ca',"n't",'really','recommend','this','movie','.','xxmaj','the','leads','are',"n't",'very','likable','and','i','did',"n't"...]
Tokens that start with xx
are special tokens.
xxbos
is a special token that indicates the start of a new text (“BOS” is a standard NLP acronym that means “beginning of stream”). By recognizing this start token, the model will be able to learn it needs to “forget” what was said previously and focus on upcoming words. These special tokens don’t come from the external tokenizer. fastai adds them by default by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. We are translating the original English language sequence into a simplified tokenized language that is designed to be easy for a model to learn.
For example, the rules will replace a sequence of four exclamation points with a special repeated-character token, followed by the number four and then a single exclamation point.
tkn('!!!!')
(#4) ['xxbos','xxrep','4','!']
In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token followed by the lowercase version of the word, so the embedding matrix only needs the lowercase versions of words, saving compute and memory, while the model can still learn the concept of capitalization.
Here are some of the main special tokens:
xxbos
: Indicates the beginning of a text (in this case, a review).
xxmaj
: Indicates the next word begins with a capital.
xxunk
: Indicates the next word is unknown.
defaults.text_proc_rules
[<function fastai.text.core.fix_html(x)>,
<function fastai.text.core.replace_rep(t)>,
<function fastai.text.core.replace_wrep(t)>,
<function fastai.text.core.spec_add_spaces(t)>,
<function fastai.text.core.rm_useless_spaces(t)>,
<function fastai.text.core.replace_all_caps(t)>,
<function fastai.text.core.replace_maj(t)>,
<function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]
fix_html
: replaces special HTML characters with a readable version.
replace_rep
: Replaces any character repeated three times or more with a special token for repetition (xxrep
), the number of times it’s repeated, then the character.
replace_wrep: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it's repeated, then the word.
spec_add_spaces
: adds spaces around / and #.
rm_useless_spaces
: Removes all repetitions of the space character.
replace_all_caps: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it.
replace_maj
: Lowercases a capitalized word and adds a special token for capitalized (xxmaj
) in front of it.
lowercase
: Lowercases all text and adds a special token at the beginning (xxbos
) and/or the end (xxeos
).
"© Fast.ai www.fast.ai/INDEX"), 31) coll_repr(tkn(
"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
Subword Tokenization
Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However this assumption is not always appropriate. Languages like Chinese and Japanese don’t use spaces. Turkish and Hungarian can add many subwords together without spaces.
Two steps of subword tokenization:
- Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
- Tokenize the corpus string using this vocab of subword units.
txts = L(o.open().read() for o in files[:2000])
! pip install sentencepiece
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
setup
reads the documents and finds the common sequences of characters to create the vocab.
subword(1000)
"▁De s p ite ▁some ▁humor ous ▁b ant er ▁and ▁a ▁de cent ▁support ing ▁cast , ▁I ▁can ' t ▁really ▁recommend ▁this ▁movie . ▁The ▁lead s ▁are n ' t ▁very ▁li k able ▁and ▁I"
When using fastai's subword tokenizer, the special character ▁ represents a space character in the original text.
If we use a smaller vocab, each token will represent fewer characters and it will take more tokens to represent a sentence.
subword(200)
'▁ D es p it e ▁ s o m e ▁h u m or o us ▁b an ter ▁and ▁a ▁ d e c ent ▁ s u p p or t ing ▁ c a s t'
If we use a larger vocab, most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence:
subword(10000)
"▁Des pite ▁some ▁humorous ▁ban ter ▁and ▁a ▁decent ▁support ing ▁cast , ▁I ▁can ' t ▁really ▁recommend ▁this ▁movie . ▁The ▁leads ▁are n ' t ▁very ▁likable ▁and ▁I ▁didn ' t ▁particular ly ▁care ▁if ▁they"
A larger vocab means fewer tokens per sentence, which means faster training, less memory and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.
Subword tokenization provides a way to easily scale between character tokenization (using a small subword vocab) and word tokenization (using a large subword vocab) and handles every human language. It can even handle genomic sequences or MIDI music notation. It's likely to become (or has already become) the most common tokenization approach.
Numericalization with fast.ai
Numericalization is the process of mapping tokens to integers.
- Make a list of all possible levels of the categorical variable (the vocab).
- Replace each level with its index in the vocab.
toks = tkn(txt)
print(coll_repr(tkn(txt), 31))
(#264) ['xxbos','xxmaj','despite','some','humorous','banter','and','a','decent','supporting','cast',',','i','ca',"n't",'really','recommend','this','movie','.','xxmaj','the','leads','are',"n't",'very','likable','and','i','did',"n't"...]
Just like with SubwordTokenizer
, we need to call setup
on Numericalize
to create the vocab. That means we’ll need our tokenized corpus first:
toks200 = txts[:200].map(tkn)
toks200[0]
(#264) ['xxbos','xxmaj','despite','some','humorous','banter','and','a','decent','supporting'...]
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab, 20)
"(#2200) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','is','in','i','it'...]"
Our special rules tokens appear first, and then every word appears once in frequency order.
The defaults for Numericalize are min_freq=3 and max_vocab=60000. max_vocab results in fastai replacing all words other than the most common 60,000 with a special unknown word token, xxunk. This is useful to avoid having an overly large embedding matrix, since that can slow down training and use up too much memory. Rare words can also lack enough data to train useful representations, which is better handled by setting min_freq: any word appearing fewer than min_freq times is replaced with xxunk.
fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the vocab
parameter.
The Numericalize object is used like a function:

nums = num(toks)[:20]; nums
TensorText([ 2, 8, 418, 68, 0, 0, 12, 13, 618, 419, 190, 11, 18,
259, 38, 93, 445, 21, 28, 10])
We can check that the integers map back to the original text:
' '.join(num.vocab[o] for o in nums)
"xxbos xxmaj despite some xxunk xxunk and a decent supporting cast , i ca n't really recommend this movie ."
Putting Our Texts into Batches for a Language Model
We want our language model to read text in order, so that it can efficiently predict what the next word is. This means each new batch should begin precisely where the previous one left off.
At the beginning of each epoch we will shuffle the order of the documents to make a new stream.
We then cut this stream into a certain number of mini-streams (that number is our batch size). For example, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (1 to 5,000 for the first mini-stream, then 5,001 to 10,000, and so on) because we want the model to read continuous rows of text. An xxbos token is added at the start of each text during preprocessing, so that the model knows, when it reads the stream, that a new entry is beginning.
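Here is a toy version I wrote to convince myself how the stream gets cut up (just an illustration, not how LMDataLoader is actually implemented):

import torch

stream = torch.arange(100)                           # pretend these are 100 numericalized tokens
bs, seq_len = 10, 5
stream = stream[:(len(stream)//bs)*bs]               # drop any remainder
mini_streams = stream.view(bs, -1)                   # 10 mini-streams of 10 tokens each
first_batch_x  = mini_streams[:, :seq_len]           # first seq_len tokens of every mini-stream
second_batch_x = mini_streams[:, seq_len:2*seq_len]  # continues exactly where the first batch left off
print(first_batch_x[0], second_batch_x[0])           # tensor([0, 1, 2, 3, 4]) tensor([5, 6, 7, 8, 9])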
First apply our Numericalize object to the tokenized texts:

nums200 = toks200.map(num)
Then pass it to the LMDataLoader
:
dl = LMDataLoader(nums200)
x,y = first(dl)
x.shape, y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))
x[:1], y[:1]
(LMTensorText([[ 2, 8, 418, 68, 0, 0, 12, 13, 618, 419, 190,
11, 18, 259, 38, 93, 445, 21, 28, 10, 8, 9,
693, 42, 38, 72, 1274, 12, 18, 81, 38, 479, 420,
58, 47, 305, 274, 17, 9, 135, 10, 18, 619, 81,
38, 49, 9, 221, 120, 221, 47, 305, 274, 11, 29,
8, 0, 8, 1275, 783, 74, 59, 446, 15, 43, 9,
0, 285, 114, 0, 24, 0]]),
TensorText([[ 8, 418, 68, 0, 0, 12, 13, 618, 419, 190, 11,
18, 259, 38, 93, 445, 21, 28, 10, 8, 9, 693,
42, 38, 72, 1274, 12, 18, 81, 38, 479, 420, 58,
47, 305, 274, 17, 9, 135, 10, 18, 619, 81, 38,
49, 9, 221, 120, 221, 47, 305, 274, 11, 29, 8,
0, 8, 1275, 783, 74, 59, 446, 15, 43, 9, 0,
285, 114, 0, 24, 0, 30]]))
Looking at the first row of the independent variable:
' '.join(num.vocab[o] for o in x[0][:20])
"xxbos xxmaj despite some xxunk xxunk and a decent supporting cast , i ca n't really recommend this movie ."
Which is the start of the text.
The dependent variable is the same thing offset by one token:
' '.join(num.vocab[o] for o in y[0][:20])
"xxmaj despite some xxunk xxunk and a decent supporting cast , i ca n't really recommend this movie . xxmaj"
We are now ready to train our text classifier.
Training a Text Classifier
Two steps to training a state-of-the-art text classifier using transfer learning:
- Fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews.
- Use that model to train a classifier.
Language Model Using DataBlock
fastai handles tokenization and numericalization automatically when TextBlock is passed to DataBlock.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
from_folder tells TextBlock how to access the texts so that it can do initial preprocessing. fastai performs a few optimizations:
- It saves the tokenized documents in a temporary folder, so it doesn’t have to tokenize them more than once.
- It runs multiple tokenization processes in parallel, to take advantage of your computer’s CPUs.
dls_lm.show_batch(max_n=2)
text | text_ | |
---|---|---|
0 | xxbos xxmaj caught this at xxmaj cinequest . xxmaj it was well attended , but the crowd seemed disappointed . xxmaj in my humble opinion , " charlie the xxmaj ox " was very amateurish and overrated ( it pales in comparison with other cinequest pics i saw ) . xxmaj acting ( with the exception of xxmaj polito ) seemed self - conscious and " stagey . " xxmaj photography , despite originating on high - end xxup hd | xxmaj caught this at xxmaj cinequest . xxmaj it was well attended , but the crowd seemed disappointed . xxmaj in my humble opinion , " charlie the xxmaj ox " was very amateurish and overrated ( it pales in comparison with other cinequest pics i saw ) . xxmaj acting ( with the exception of xxmaj polito ) seemed self - conscious and " stagey . " xxmaj photography , despite originating on high - end xxup hd , |
1 | career , seemed to specialize in patriarch roles , such as in " all the xxmaj president 's xxmaj men " , " max xxmaj dugan xxmaj returns " , and " you xxmaj ca n't xxmaj take it xxmaj with xxmaj you " . xxmaj and in this case , those of us who never saw him on the stage get a big treat , because this was a taped xxmaj broadway production . xxmaj he dominates every scene | , seemed to specialize in patriarch roles , such as in " all the xxmaj president 's xxmaj men " , " max xxmaj dugan xxmaj returns " , and " you xxmaj ca n't xxmaj take it xxmaj with xxmaj you " . xxmaj and in this case , those of us who never saw him on the stage get a big treat , because this was a taped xxmaj broadway production . xxmaj he dominates every scene , |
Each item in the training dataset is a document:
' '.join(dls_lm.vocab[o] for o in dls_lm.train.dataset[0][0])
"xxbos xxmaj it is a delight to watch xxmaj laurence xxmaj harvey as a neurotic chess player , who schemes to murder the opponent he can not defeat at the chessboard . xxmaj this movie has wonderful pacing and several cliffhanger moments , as xxmaj harvey 's plot several times seems on the point of failure or exposure , but he manages to beat the odds yet again . xxmaj columbo wages a skilful war of nerves against this high - strung genius , and the scene where he manages to rattle him enough to cause him to make a mistake while playing chess is one of the highlights of the movie , as xxmaj harvey looks down in disbelief at the board , where he has just allowed himself to be xxunk . xxmaj the climax is almost as strong , and watching xxmaj laurence xxmaj harvey collapse completely as his scheme is exposed brings the movie to a satisfying finish . xxmaj highly recommended ."
' '.join(dls_lm.vocab[o] for o in dls_lm.train.dataset[2][0])
"xxbos xxmaj eyeliner was worn nearly 6 xxrep 3 0 years ago in xxmaj egypt . xxmaj really not that much of a stretch for it to be around in the 12th century . i also did n't realize the series flopped . xxmaj there is a second season airing now is n't there ? xxmaj it is amazing to me when commentaries are made by those who are either ill - informed or do n't watch a show at all . xxmaj it is a waste of space on the boards and of other 's time . xxmaj the first show of the series was maybe a bit painful as the cast began to fall into place , but that is to be expected from any show . xxmaj the remainder of the first season is excellent . i can hardly wait for the second season to begin in the xxmaj united xxmaj states ."
To confirm my understanding, that the first item in each batch is continuing the mini-stream, I’ll take a look at the first mini-stream of the first two batches:
counter = 0
for xb, yb in dls_lm.train:
    output = ' '.join(dls_lm.vocab[o] for o in xb[0])
    print(output)
    counter += 1
    if counter == 2: break
xxbos xxmaj just got this in the mail and i was positively surprised . xxmaj as a big fan of 70 's cinema it does n't take much to satisfy me when it comes to these kind of flicks . xxmaj despite the obvious low budget on this movie , the acting is overall good and you can already see why xxmaj pesci was to become on of the greatest actors ever . xxmaj i 'm not sure how authentic
this movie is , but it sure is a good contribution to the mob genre … .. xxbos xxmaj why on earth should you explore the mesmerizing nature documentary " earth " ? xxmaj how much time do you have on earth so i can explain this to you ? xxup ok , i will not xxunk my review exploration on " earth " to infinity , but i must stand my ground on why this is a " must
Confirmed! The second batch’s first mini-stream is a continuation of the first batch’s first mini-stream. In this case, the first mini-stream of the second batch also contains the start of the next movie review (document) as indicated by the xxbos
special token.
Fine-Tuning the Language Model
To convert the integer word indices into activations that we can use for our neural network, we will use embeddings. We feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM.
The embeddings in the pretrained model are merged with random embeddings added for words that weren’t in the pretraining vocabulary.
learn = language_model_learner(
    dls_lm, AWD_LSTM,
    drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab).
Perplexity is a metric often used in NLP for language models. It is the exponential of loss (i.e., torch.exp(cross_entropy)
).
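A quick sanity check of that relationship with made-up logits and targets (not real model outputs):

import torch
import torch.nn.functional as F

logits  = torch.randn(8, 100)          # 8 predictions over a 100-word vocab
targets = torch.randint(0, 100, (8,))  # the "next word" for each prediction
loss = F.cross_entropy(logits, targets)
print(loss, torch.exp(loss))           # perplexity = exp(cross-entropy loss)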
language_model_learner
automatically calls freeze
when using a pretrained model (which is the default) so this will train only the embeddings (the part of the model that contains randomly initialized weights—embeddings for the words that are in our IMDb vocab, but aren’t in the pretrained model vocab).
I wasn't able to train my model on Google Colab (I got an out-of-memory error even with small batch sizes) so I trained the IMDb language model on Paperspace and wrote a separate blog post about it.
Disinformation and Language Models
- Even simple algorithms could be used to create fraudulent accounts and try to influence policymakers (99% of the 2017 Net Neutrality public comments were likely faked).
- Many people assume or hope that algorithms will come to our defense here. The problem is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms.
Questionnaire
1. What is self-supervised learning?
Self-supervised learning is when you train a model on data that does not contain any external labels. Instead, the labels are embedded in the independent variable.
2. What is a language model?
A language model is a model that predicts the next word based on the previous words in a text.
3. Why is a language model considered self-supervised?
Because we do not train the model with external labels. The dependent variable is the next token in a sequence of previous tokens (independent variable).
4. What are self-supervised models usually used for?
Pretraining a model that will be used for transfer learning.
5. Why do we fine-tune language models?
In order for it to learn the style of language used in our specific corpus.
6. What are the three steps to create a state-of-the-art text classifier?
- Train a language model on a large general corpus like Wikipedia.
- Fine-tune a language model using your task-specific corpus.
- Fine-tune a classifier using the encoder of the twice-pretrained language model.
7. How do the 50,000 unlabeled movie reviews help create a better text classifier for the IMDb dataset?
The 50k unlabeled movie reviews help create a better text classifier for the IMDb dataset because when you fine-tune the pretrained Wikipedia language model using this data, the model learns the particular style and content of IMDb movie reviews, which helps it better understand what the language used in the reviews means when classifying it as positive or negative.
8. What are the three steps to prepare your data for a language model?
- Tokenization: convert the text into a list of words (or characters or substrings).
- Numericalization: List all of the words that appear (the vocab) and convert each word into a number by looking up its index in the vocab.
- Language model data loader creation: combine the documents into one string and split it into fixed sequence length batches while preserving the order of the tokens, create a dependent variable that is offset from the independent variable by one token, and shuffle the training data (maintaining independent/dependent variable structure).
9. What is tokenization? Why do we need it?
Tokenization is the conversion of text into smaller parts (like words, subwords or characters). In order to convert our documents into numbers (categories) that the language model can learn something about, we first tokenize them (break them into smaller parts) so that we can generate a list of unique tokens (unique levels of a categorical variable) contained in the corpus (categorical variable).
10. Name three approaches to tokenization.
- word-based: split a sentence based on spaces.
- subword based: split words into commonly occurring substrings.
- character-based: split a sentence into its individual characters.
11. What is xxbos
?
A special token that tells the language model that we are at the start of a new stream (document).
12. List four rules that fastai applies to text during tokenization.
I’ll list them all:
- fix_html: replace special HTML characters (like ©, the copyright symbol) with a readable version.
- replace_rep: replace repeated characters with a special token for repetition (xxrep), the number of times it's repeated, and then the character.
- replace_wrep: do the same as replace_rep but for repeated words (using the special token xxwrep).
- spec_add_spaces: add spaces around / and #.
- rm_useless_spaces: remove all repetitions of the space character.
- replace_all_caps: lowercase a word written in all caps and place a special token for all caps (xxup) in front of it.
- replace_maj: lowercase a capitalized word and place a special token for capitalization (xxmaj) in front of it.
- lowercase: lowercase all text and place a special token at the beginning (xxbos) and/or at the end (xxeos).
13. Why are repeated characters replaced with a token showing the number of repetitions and the character that’s repeated?
So that the model’s embedding matrix can encode information about general concepts such as repeated punctuation without requiring a unique token for every number of repetitions of a character.
14. What is numericalization?
Converting a token to a number by looking up its index in the vocab (unique list of all tokens).
15. Why might there be words that are replaced with the “unknown word” token?
In order to avoid having an overly large embedding matrix, fastai's numericalization replaces two types of words with the unknown word token xxunk:
- Words that appear fewer than min_freq times.
- Words that are not among the max_vocab most frequent words.
For example, if min_freq = 3, then all words that appear once or twice are replaced with xxunk. If max_vocab = 60000, then any word that is less frequent than the 60,000th most frequent word is replaced with xxunk.
16. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain?
The second row contains the first 64 tokens of the second mini-stream, i.e., the (n/b/s + 1)th group of tokens, where n is the total number of tokens, b is the batch size (the number of mini-streams) and s is the sequence length. So, if we have 90 tokens divided into 6 mini-streams (rows) with a sequence length (columns) of 5, each mini-stream holds 3 groups of tokens, and the second row of the first batch contains the 4th (i.e., 3 + 1) group of tokens.
Putting Tanishq’s answer here as well:
The dataset is split into 64 mini-streams (batch size).
Each batch has 64 rows (batch size) and 64 columns (sequence length).
The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64).
The second row of the first batch contains the beginning of the second mini-stream.
The first row of the second batch contains the second chunk of the first mini-stream (tokens 65 - 128).
17. Why do we need padding for text classification? Why don’t we need it for language modeling?
When the data is prepared for language modeling, the documents are concatenated into a single string and broken up into equally-sized batches, so there is no need to pad any batches—they’re already the right size.
In the case of text classification, each document is maintained in full length in a batch, and documents will very likely have a varying number of tokens (i.e., everyone is not writing the same length of movie reviews with the same number of special tokens) so in each batch, all of the documents (except the largest) will need to be padded to the batch’s largest document’s size. fastai sorts the data by length each epoch and groups together documents of similar lengths for each batch before applying the padding.
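Here is a rough sketch of the padding idea (my own illustration, not fastai's actual padding code; I'm using index 1 for the pad token since that's where xxpad sits in the vocab above):

import torch

docs = [torch.tensor([2, 10, 11]), torch.tensor([2, 10, 11, 12, 13])]  # two "documents" of different lengths
pad_id = 1                                                             # xxpad
max_len = max(len(d) for d in docs)
batch = torch.stack([torch.cat([d, torch.full((max_len - len(d),), pad_id)]) for d in docs])
print(batch)  # both rows now have length 5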
Something that I would like to understand however is:
What if the number of tokens in the training dataset is not divisible by the selected batch size and sequence length? Does fastai use padding in that case? Suppose you have 1000 tokens in total, a batch size of 16 and sequence length of 20. 320 goes into 1000 3 times with a remainder. Does fastai create a 4th batch with padding? Or remove the tokens so there’s only 3 batches? I’ll see if I can figure out what it does with some sample code:
bs,sl = 5, 2
ints = L([[0,1,2,3,4,5,6,7,8,9,10,11,12,13]]).map(tensor)

dl = LMDataLoader(ints, bs=bs, seq_len=sl)
list(dl)
[(LMTensorText([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]]),
tensor([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]]))]
list(LMDataLoader(ints, bs=bs, seq_len=sl, drop_last=False))
[(LMTensorText([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]]),
tensor([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]]))]
Looks like fastai drops the last batch if it’s not full. I’ve posted this question in the fastai forums to get a confirmation on my understanding.
18. What does an embedding matrix for NLP contain? What is its shape?
It contains the embedding parameters that are trained by the neural net, with one row of parameters for each token in the vocab.
From Tanishq’s solutions:
The embedding matrix has the size (vocab_size x embedding_size) where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
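A tiny illustration of that shape (the vocab and embedding sizes here are made up):

import torch
from torch import nn

vocab_size, emb_size = 2000, 400
emb = nn.Embedding(vocab_size, emb_size)
print(emb.weight.shape)                      # torch.Size([2000, 400])
print(emb(torch.tensor([2, 8, 418])).shape)  # torch.Size([3, 400]) -- one row of latent factors per token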
19. What is perplexity?
A metric used in NLP. It is the exponential of the loss.
20. Why do we have to pass the vocabulary of the language model to the classifier data block?
So that each token keeps the same index it had in the language model's vocab; otherwise the embeddings learned while fine-tuning the language model would not line up with the tokens the classifier sees.
21. What is gradual unfreezing?
Gradual unfreezing is when we unfreeze a few layers at a time, training for an epoch after each unfreezing, rather than unfreezing everything at once, until eventually the full model (including all layers of the encoder) is unfrozen and trained.
22. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?
Because text generation models can be trained to beat automatic identification algorithms.
Further Research
1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
- Here is a tweet thread by Arvind Narayanan talking about how the danger of ChatGPT is that “you can’t tell when it’s wrong unless you already know the answer”.
- This New York Times article walks through different examples of ChatGPT responding to prompts with disinformation.
- This NewsGuard article, which was referenced in the NYT article, discusses how ChatGPT-4 is more prone to perpetuating misinformation than its predecessor GPT-3.5. GPT-3.5 generated 80 of 100 false narratives given as prompts while GPT-4 generated 100 of 100 false narratives. Also, “ChatGPT-4’s responses that contained false and misleading claims were less likely to include disclaimers about the falsity of those claims (23% of the time) [than ChatGPT-3.5 (51% of the time)].”
- This NBC New York article walks through an example of how a ChatGPT-written story on Michael Bloomberg was full of made-up quotes and sources. It also talks about how some educators are embracing ChatGPT in the classroom, and that, while not very effective, there are machine-generated text identification algorithms available. It's important to note, as discussed in the fastai course, that text generation models will always be ahead of automatic identification models (generative models can be trained to beat identification models).
- In this Harvard Business School Working Knowledge article Scott Van Voorhiss and Tsedal Neeley summarize the story of how Dr. Timnit Gebru went from Ethiopia, to Boston, to a PhD at Stanford, to co-lead of Google AI Ethics, only to be fired because she co-authored a paper asking companies to hold off on building large language models until we figure out how to handle the bias perpetuated by these models.
The article’s authors use these events as a case study to learn from when handling issues of ethics in AI.
- “The biggest message I want to convey is that AI can scale bias in ways that we can barely understand today”.
- “in failing to give Gebru the independence to do her job, might have sacrificed an opportunity to become a global leader in responsible AI development”.
- Finally, in this paper the authors test detection tools for AI-generated text in academic settings. “The researchers conclude that the available detection tools are neither accurate nor reliable and have a main bias towards classifying the output as human-written rather than detecting AI-generated text”. Across the 14 tools, the highest average accuracy was less than 80%, with 50% for AI-generated/human-edited text and 26% for machine-paraphrased AI-generated text.
2. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?
The first thing that comes to mind is Glaze by the University of Chicago which “works by understanding the AI models that are training on human art, and using machine learning algorithms, computing a set of minimal changes to artworks, such that it appears unchanged to human eyes, but appears to AI models like a dramatically different art style…So when someone then prompts the model to generate art mimicking the charcoal artist, they will get something quite different from what they expected.”
I can't imagine how something analogous to Glaze could be created for language, since plain text is just plain text, but conceptually, if human-written language were altered in a similar way, LLMs like GPT would be prevented from generating similar text. This would affect not just LLMs but anyone training a model on such altered data, but perhaps that is a cost worth bearing to prevent the perpetuation of copyrighted content or disinformation.
Another idea is that disinformation detection may benefit from a human-in-the-loop. AI-generated content that is not identified automatically may be identified by a human as disinformation. A big enough sample of accounts spreading this misinformation may lead to identifying broader trends in which accounts are fake.
Lesson 5: From-scratch Model
Notebook Exercise: Linear model and neural net from scratch
In this section I’ll run code cells from the “clean” version (no markdown or outputs) of this notebook by Jeremy. I’ll add some thoughts as I run cells and add code to understand what is going on.
from pathlib import Path
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if iskaggle: path = Path("../input/titanic")
else:
    path = Path('titanic')
    if not path.exists():
        import zipfile, kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 2.77MB/s]
import torch, numpy as np, pandas as pd
np.set_printoptions(linewidth=140)
torch.set_printoptions(linewidth=140, sci_mode=False, edgeitems=7)
pd.set_option('display.width', 140)
# load the training data and look at it
df = pd.read_csv(path/'train.csv')
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# see how many null values are in each column
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Since each Name
is unique, there are 891 modes for the Name
column. df.mode()
will print these out as a DataFrame
, with rows containing NaN
for columns with fewer modes (e.g., Age
has 1 mode, 24
, and that is listed once in the first row in the output DataFrame
for df.mode()
).
# see the most frequent values in each column
df.mode()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3.0 | Abbing, Mr. Anthony | male | 24.0 | 0.0 | 0.0 | 1601 | 8.05 | B96 B98 | S |
1 | 2 | NaN | NaN | Abbott, Mr. Rossmore Edward | NaN | NaN | NaN | NaN | 347082 | NaN | C23 C25 C27 | NaN |
2 | 3 | NaN | NaN | Abbott, Mrs. Stanton (Rosa Hunt) | NaN | NaN | NaN | NaN | CA. 2343 | NaN | G6 | NaN |
3 | 4 | NaN | NaN | Abelson, Mr. Samuel | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | NaN | NaN | Abelson, Mrs. Samuel (Hannah Wizosky) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | NaN | NaN | de Mulder, Mr. Theodore | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
887 | 888 | NaN | NaN | de Pelsmaeker, Mr. Alfons | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
888 | 889 | NaN | NaN | del Carlo, Mr. Sebastiano | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
889 | 890 | NaN | NaN | van Billiard, Mr. Austin Blyler | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
890 | 891 | NaN | NaN | van Melkebeke, Mr. Philemon | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
891 rows × 12 columns
# view the topmost row of the modes DataFrame
modes = df.mode().iloc[0]
modes
PassengerId 1
Survived 0.0
Pclass 3.0
Name Abbing, Mr. Anthony
Sex male
Age 24.0
SibSp 0.0
Parch 0.0
Ticket 1601
Fare 8.05
Cabin B96 B98
Embarked S
Name: 0, dtype: object
# fill missing data with the column's mode
df.fillna(modes, inplace=True)
# check that we no longer have missing data
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
import numpy as np
# view a summary of the data
df.describe(include=(np.number))
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 28.566970 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 13.199572 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 24.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# view the skewed distribution of Fares
df['Fare'].hist();
So that it’s more normally distributed, we take the log of Fare
. We add 1
to Fare
before taking the logarithm so that we aren’t ever taking log of 0
(which is undefined).
df['LogFare'] = np.log(df['Fare']+1)
df['LogFare'].hist();
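As an aside, numpy's log1p computes log(x + 1) in a single call (the "why you should use a framework" notebook later uses it); a quick check that the two are equivalent:

import numpy as np

fare = 7.25
print(np.log(fare + 1), np.log1p(fare))  # same value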
# view the unique values of pclass
pclasses = sorted(df.Pclass.unique())
pclasses
[1, 2, 3]
# look at string columns
df.describe(include=[object])
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 891 | 891 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
freq | 1 | 577 | 7 | 691 | 646 |
# get_dummies returns DataFrame with 0/1 values for categorical variable columns
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])
df.columns
Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'LogFare', 'Sex_female', 'Sex_male',
'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
dtype='object')
# view the new dummy variables
added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()
Sex_male | Sex_female | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
from torch import tensor
# convert dependent variable to a tensor
t_dep = tensor(df.Survived)
t_dep[:5]
tensor([0, 1, 1, 1, 0])
# convert independent variables to a tensor
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols
indep_cols
['Age',
'SibSp',
'Parch',
'LogFare',
'Sex_male',
'Sex_female',
'Pclass_1',
'Pclass_2',
'Pclass_3',
'Embarked_C',
'Embarked_Q',
'Embarked_S']
df[indep_cols].values
array([[22., 1., 0., ..., 0., 0., 1.],
[38., 1., 0., ..., 1., 0., 0.],
[26., 0., 0., ..., 0., 0., 1.],
...,
[24., 1., 2., ..., 0., 0., 1.],
[26., 0., 0., ..., 1., 0., 0.],
[32., 0., 0., ..., 0., 1., 0.]])
t_indep = tensor(df[indep_cols].values, dtype=torch.float)
t_indep
tensor([[22.0000, 1.0000, 0.0000, 2.1102, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[38.0000, 1.0000, 0.0000, 4.2806, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
[26.0000, 0.0000, 0.0000, 2.1889, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[35.0000, 1.0000, 0.0000, 3.9908, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[35.0000, 0.0000, 0.0000, 2.2028, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[24.0000, 0.0000, 0.0000, 2.2469, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[54.0000, 0.0000, 0.0000, 3.9677, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
...,
[25.0000, 0.0000, 0.0000, 2.0857, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[39.0000, 0.0000, 5.0000, 3.4054, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[27.0000, 0.0000, 0.0000, 2.6391, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[19.0000, 0.0000, 0.0000, 3.4340, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[24.0000, 1.0000, 2.0000, 3.1966, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[26.0000, 0.0000, 0.0000, 3.4340, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
[32.0000, 0.0000, 0.0000, 2.1691, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000]])
# 891 individuals
# 12 columns
t_indep.shape
torch.Size([891, 12])
# initialize parameters
torch.manual_seed(442)

n_coeff = t_indep.shape[1]
coeffs = torch.rand(n_coeff)-0.5
coeffs
tensor([-0.4629, 0.1386, 0.2409, -0.2262, -0.2632, -0.3147, 0.4876, 0.3136, 0.2799, -0.4392, 0.2103, 0.3625])
coeffs.shape
torch.Size([12])
# normalize large values
t_indep.max(dim=0)
torch.return_types.max(
values=tensor([80.0000, 8.0000, 6.0000, 6.2409, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]),
indices=tensor([630, 159, 678, 258, 0, 1, 1, 9, 0, 1, 5, 0]))
vals,indices = t_indep.max(dim=0)

# divide values in each column by the maximum in each column
# using broadcasting
t_indep = t_indep / vals
t_indep
tensor([[0.2750, 0.1250, 0.0000, 0.3381, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.4750, 0.1250, 0.0000, 0.6859, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
[0.3250, 0.0000, 0.0000, 0.3507, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.4375, 0.1250, 0.0000, 0.6395, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[0.4375, 0.0000, 0.0000, 0.3530, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.3000, 0.0000, 0.0000, 0.3600, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.6750, 0.0000, 0.0000, 0.6358, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
...,
[0.3125, 0.0000, 0.0000, 0.3342, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.4875, 0.0000, 0.8333, 0.5456, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.3375, 0.0000, 0.0000, 0.4229, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[0.2375, 0.0000, 0.0000, 0.5502, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[0.3000, 0.1250, 0.3333, 0.5122, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.3250, 0.0000, 0.0000, 0.5502, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000],
[0.4000, 0.0000, 0.0000, 0.3476, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000]])
When calculating the predictions, each row of t_indep
is element-wise multiplied by coeffs
and is summed together with .sum(axis=1)
.
# calculate predictions
# predictions = matrix multiplication of independent variable values and parameters
# each row
preds = (t_indep*coeffs).sum(axis=1)
preds.shape
torch.Size([891])
preds[:10]
tensor([ 0.1927, -0.6239, 0.0979, 0.2056, 0.0968, 0.0066, 0.1306, 0.3476, 0.1613, -0.6285])
To visualize the preds
calculation, I’ll do the first prediction (0.1927
) manually:
# multiply coeffs by first row of t_indep and take the sum
(t_indep[0]*coeffs)
tensor([-0.1273, 0.0173, 0.0000, -0.0765, -0.2632, -0.0000, 0.0000, 0.0000, 0.2799, -0.0000, 0.0000, 0.3625])
(t_indep[0]*coeffs).sum()
tensor(0.1927)
loss = torch.abs(preds-t_dep).mean()
loss
tensor(0.5382)
# collect calculations into functions
def calc_preds(coeffs, indeps): return (indeps*coeffs).sum(axis=1)
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps)-deps).mean()
# get ready to calculate gradient
coeffs.requires_grad_()
tensor([-0.4629, 0.1386, 0.2409, -0.2262, -0.2632, -0.3147, 0.4876, 0.3136, 0.2799, -0.4392, 0.2103, 0.3625], requires_grad=True)
loss = calc_loss(coeffs, t_indep, t_dep)
loss
tensor(0.5382, grad_fn=<MeanBackward0>)
loss.backward()
coeffs.grad
tensor([-0.0106, 0.0129, -0.0041, -0.0484, 0.2099, -0.2132, -0.1212, -0.0247, 0.1425, -0.1886, -0.0191, 0.2043])
If we calculate loss again and calculate the gradients they will be added to the existing gradients:
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
# notice how these are 2x the original gradients
coeffs.grad
tensor([-0.0212, 0.0258, -0.0082, -0.0969, 0.4198, -0.4265, -0.2424, -0.0494, 0.2851, -0.3771, -0.0382, 0.4085])
This is why we set gradients back to zero.
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
with torch.no_grad():
    coeffs.sub_(coeffs.grad * 0.1)
    coeffs.grad.zero_()
    print(calc_loss(coeffs, t_indep, t_dep))
tensor(0.4945)
Our loss decreased after doing gradient descent.
Split data into training and validation sets
from fastai.data.transforms import RandomSplitter
# RandomSplitter gives indexes of the corresponding training/validation split
RandomSplitter(seed=42)(df)
((#713) [788,525,821,253,374,98,215,313,281,305...],
(#178) [303,778,531,385,134,476,691,443,386,128...])
trn_split,val_split = RandomSplitter(seed=42)(df)
len(trn_split), len(val_split)
(713, 178)
Using the training and validation indexes, create training and validation set independent and dependent variables:
trn_indep,val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep,val_dep = t_dep[trn_split], t_dep[val_split]
trn_indep.shape, trn_dep.shape
(torch.Size([713, 12]), torch.Size([713]))
val_indep.shape, val_dep.shape
(torch.Size([178, 12]), torch.Size([178]))
Put the stepping the parameters code into a function:
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()
Create function to train model for one epoch:
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")
Create a function to initialize parameters:
def init_coeffs(): return (torch.rand(n_coeff)-0.5).requires_grad_()
Create function to train a model for a given number of epochs:
def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for i in range(epochs): one_epoch(coeffs, lr=lr)
    return coeffs
Train model for 18 epochs:
coeffs = train_model(18, lr=0.2)
0.536; 0.502; 0.477; 0.454; 0.431; 0.409; 0.388; 0.367; 0.349; 0.336; 0.330; 0.326; 0.329; 0.304; 0.314; 0.296; 0.300; 0.289;
The loss consistently decreases each epoch.
def show_coeffs(): return dict(zip(indep_cols, coeffs.requires_grad_(False)))
Positive coefficients indicate a positive correlation with survival, negative coefficients indicate negative correlation. For example, Sex_male
coefficient is negative meaning that survival variable decreases if Sex_male
is 1
.
show_coeffs()
{'Age': tensor(-0.2694),
'SibSp': tensor(0.0901),
'Parch': tensor(0.2359),
'LogFare': tensor(0.0280),
'Sex_male': tensor(-0.3990),
'Sex_female': tensor(0.2345),
'Pclass_1': tensor(0.7232),
'Pclass_2': tensor(0.4112),
'Pclass_3': tensor(0.3601),
'Embarked_C': tensor(0.0955),
'Embarked_Q': tensor(0.2395),
'Embarked_S': tensor(0.2122)}
With coefficients, we can calculate predictions and therefore accuracy:
preds = calc_preds(coeffs, val_indep)
preds[:10]
tensor([ 0.8160, 0.1295, -0.0148, 0.1831, 0.1520, 0.1350, 0.7279, 0.7754, 0.3222, 0.6740])
# recall that we split the data into training and validation sets
preds.shape
torch.Size([178])
val_dep.bool()[:10]
tensor([ True, False, False, False, False, False, True, True, False, True])
(preds>0.5)[:10]
tensor([ True, False, False, False, False, False, True, True, False, True])
results = val_dep.bool()==(preds>0.5)
results[:10]
tensor([True, True, True, True, True, True, True, True, True, True])
Calculate accuracy:
results.float().mean()
tensor(0.7865)
Put accuracy calculation into a function:
def acc(coeffs): return (val_dep.bool()==(calc_preds(coeffs, val_indep)>0.5)).float().mean()
acc(coeffs)
tensor(0.7865)
View sigmoid function:
import sympy
"1/(1+exp(-x))", xlim=(-5,5)) sympy.plot(
<sympy.plotting.plot.Plot at 0x79c4dd276650>
Notice how large positive values of x result in y values closer to 1 and large negative x values result in y closer to 0.
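A quick numeric check of the same behavior with torch.sigmoid:

import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
print(torch.sigmoid(x))   # approx [0.0000, 0.1192, 0.5000, 0.8808, 1.0000]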
"1/(1+exp(-x))", xlim=(-10,10)) sympy.plot(
<sympy.plotting.plot.Plot at 0x79c4dc682fe0>
We update our prediction calculation function to incorporate sigmoid:
def calc_preds(coeffs, indeps): return torch.sigmoid((indeps*coeffs).sum(axis=1))
coeffs = train_model(lr=100)
0.510; 0.327; 0.294; 0.207; 0.201; 0.199; 0.198; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194;
acc(coeffs)
tensor(0.8258)
Sex_male's coefficient has increased significantly in magnitude (it is now much more negative):
show_coeffs()
{'Age': tensor(-1.5061),
'SibSp': tensor(-1.1575),
'Parch': tensor(-0.4267),
'LogFare': tensor(0.2543),
'Sex_male': tensor(-10.3320),
'Sex_female': tensor(8.4185),
'Pclass_1': tensor(3.8389),
'Pclass_2': tensor(2.1398),
'Pclass_3': tensor(-6.2331),
'Embarked_C': tensor(1.4771),
'Embarked_Q': tensor(2.1168),
'Embarked_S': tensor(-4.7958)}
Predict test data set values:
tst_df = pd.read_csv(path/'test.csv')
tst_df.isna().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
Replace missing Fare
with 0
:
tst_df['Fare'] = tst_df.Fare.fillna(0)
Replace other missing values with training set modes
:
tst_df.fillna(modes, inplace=True)
Apply the same data transformations as training set:
tst_df['LogFare'] = np.log(tst_df['Fare']+1)
tst_df = pd.get_dummies(tst_df, columns=['Sex', 'Pclass', 'Embarked'])
tst_indep = tensor(tst_df[indep_cols].values, dtype=torch.float)
tst_indep = tst_indep / vals
tst_indep[:10]
tensor([[0.4313, 0.0000, 0.0000, 0.3490, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.5875, 0.1250, 0.0000, 0.3332, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.7750, 0.0000, 0.0000, 0.3796, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000],
[0.3375, 0.0000, 0.0000, 0.3634, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.2750, 0.1250, 0.1667, 0.4145, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.1750, 0.0000, 0.0000, 0.3725, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.3750, 0.0000, 0.0000, 0.3453, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000],
[0.3250, 0.1250, 0.1667, 0.5450, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000],
[0.2250, 0.0000, 0.0000, 0.3377, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000],
[0.2625, 0.2500, 0.0000, 0.5167, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000]])
tst_indep.shape
torch.Size([418, 12])
Calculate predictions in the format expected by Kaggle:
tst_df['Survived'] = (calc_preds(tst_indep, coeffs)>0.5).int()
sub_df = tst_df[['PassengerId', 'Survived']]
sub_df.head()
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 0 |
Use @
operator for matrix multiplication:
(val_indep*coeffs).sum(axis=1)
tensor([ 12.3288, -14.8119, -15.4540, -13.1513, -13.3512, -13.6469, 3.6248, 5.3429, -22.0878, 3.1233, -21.8742, -15.6421, -21.5504,
3.9393, -21.9190, -12.0010, -12.3775, 5.3550, -13.5880, -3.1015, -21.7237, -12.2081, 12.9767, 4.7427, -21.6525, -14.9135,
-2.7433, -12.3210, -21.5886, 3.9387, 5.3890, -3.6196, -21.6296, -21.8454, 12.2159, -3.2275, -12.0289, 13.4560, -21.7230,
-3.1366, -13.2462, -21.7230, -13.6831, 13.3092, -21.6477, -3.5868, -21.6854, -21.8316, -14.8158, -2.9386, -5.3103, -22.2384,
-22.1097, -21.7466, -13.3780, -13.4909, -14.8119, -22.0690, -21.6666, -21.7818, -5.4439, -21.7407, -12.6551, -21.6671, 4.9238,
-11.5777, -13.3323, -21.9638, -15.3030, 5.0243, -21.7614, 3.1820, -13.4721, -21.7170, -11.6066, -21.5737, -21.7230, -11.9652,
-13.2382, -13.7599, -13.2170, 13.1347, -21.7049, -21.7268, 4.9207, -7.3198, -5.3081, 7.1065, 11.4948, -13.3135, -21.8723,
-21.7230, 13.3603, -15.5670, 3.4105, -7.2857, -13.7197, 3.6909, 3.9763, -14.7227, -21.8268, 3.9387, -21.8743, -21.8367,
-11.8518, -13.6712, -21.8299, 4.9440, -5.4471, -21.9666, 5.1333, -3.2187, -11.6008, 13.7920, -21.7230, 12.6369, -3.7268,
-14.8119, -22.0637, 12.9468, -22.1610, -6.1827, -14.8119, -3.2838, -15.4540, -11.6950, -2.9926, -3.0110, -21.5664, -13.8268,
7.3426, -21.8418, 5.0744, 5.2582, 13.3415, -21.6289, -13.9898, -21.8112, -7.3316, 5.2296, -13.4453, 12.7891, -22.1235,
-14.9625, -3.4339, 6.3089, -21.9839, 3.1968, 7.2400, 2.8558, -3.1187, 3.7965, 5.4667, -15.1101, -15.0597, -22.9391,
-21.7230, -3.0346, -13.5206, -21.7011, 13.4425, -7.2690, -21.8335, -12.0582, 13.0489, 6.7993, 5.2160, 5.0794, -12.6957,
-12.1838, -3.0873, -21.6070, 7.0744, -21.7170, -22.1001, 6.8159, -11.6002, -21.6310])
val_indep@coeffs
tensor([ 12.3288, -14.8119, -15.4540, -13.1513, -13.3511, -13.6468, 3.6248, 5.3429, -22.0878, 3.1233, -21.8742, -15.6421, -21.5504,
3.9393, -21.9190, -12.0010, -12.3775, 5.3550, -13.5880, -3.1015, -21.7237, -12.2081, 12.9767, 4.7427, -21.6525, -14.9135,
-2.7433, -12.3210, -21.5886, 3.9387, 5.3890, -3.6196, -21.6296, -21.8454, 12.2159, -3.2275, -12.0289, 13.4560, -21.7230,
-3.1366, -13.2462, -21.7230, -13.6831, 13.3092, -21.6477, -3.5868, -21.6854, -21.8316, -14.8158, -2.9386, -5.3103, -22.2384,
-22.1097, -21.7466, -13.3780, -13.4909, -14.8119, -22.0690, -21.6666, -21.7818, -5.4439, -21.7407, -12.6551, -21.6671, 4.9238,
-11.5777, -13.3323, -21.9638, -15.3030, 5.0243, -21.7614, 3.1820, -13.4721, -21.7170, -11.6066, -21.5737, -21.7230, -11.9652,
-13.2382, -13.7599, -13.2170, 13.1347, -21.7049, -21.7268, 4.9207, -7.3198, -5.3081, 7.1065, 11.4948, -13.3135, -21.8723,
-21.7230, 13.3603, -15.5670, 3.4105, -7.2857, -13.7197, 3.6909, 3.9763, -14.7227, -21.8268, 3.9387, -21.8743, -21.8367,
-11.8518, -13.6712, -21.8299, 4.9440, -5.4471, -21.9666, 5.1333, -3.2187, -11.6008, 13.7920, -21.7230, 12.6369, -3.7268,
-14.8119, -22.0637, 12.9468, -22.1610, -6.1827, -14.8119, -3.2838, -15.4540, -11.6950, -2.9926, -3.0110, -21.5664, -13.8268,
7.3426, -21.8418, 5.0744, 5.2582, 13.3415, -21.6289, -13.9898, -21.8112, -7.3316, 5.2296, -13.4453, 12.7891, -22.1235,
-14.9625, -3.4339, 6.3089, -21.9839, 3.1968, 7.2400, 2.8558, -3.1187, 3.7965, 5.4667, -15.1101, -15.0597, -22.9391,
-21.7230, -3.0346, -13.5206, -21.7011, 13.4425, -7.2690, -21.8335, -12.0582, 13.0489, 6.7993, 5.2160, 5.0794, -12.6957,
-12.1838, -3.0873, -21.6070, 7.0744, -21.7170, -22.1001, 6.8159, -11.6002, -21.6310])
Update prediction calculation so that it uses matrix multiplication operator:
def calc_preds(coeffs, indeps): return torch.sigmoid(indeps@coeffs)
Recreate coefficients and dependent variable so they are in the correct shape for matrix multiplication (when doing matrix-matrix products later on):
def init_coeffs(): return (torch.rand(n_coeff, 1)*0.1).requires_grad_()
trn_dep.shape, val_dep.shape
(torch.Size([713]), torch.Size([178]))
trn_dep = trn_dep[:, None]
val_dep = val_dep[:, None]
trn_dep.shape, val_dep.shape
(torch.Size([713, 1]), torch.Size([178, 1]))
coeffs = train_model(lr=100)
0.512; 0.323; 0.290; 0.205; 0.200; 0.198; 0.197; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194;
acc(coeffs)
tensor(0.8258)
Our model hasn’t changed other than the fact that we are now using matrix product explicitly.
Let’s create a neural net:
torch.rand(1)[0]
tensor(0.6722)
def init_coeffs(n_hidden=20):
    layer1 = (torch.rand(n_coeff, n_hidden)-0.5)/n_hidden
    layer2 = torch.rand(n_hidden, 1)-0.3
    const = torch.rand(1)[0]
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()
import torch.nn.functional as F
def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs
    res = F.relu(indeps@l1)
    res = res@l2 + const
    return torch.sigmoid(res)
As an aside, showing that the order of matrix multiplication operands matters—you get very different results:
tensor([[1,2,3], [4,5,6]]).shape
torch.Size([2, 3])
tensor([[1, 2], [3, 4], [5, 6]]).shape
torch.Size([3, 2])
tensor([[1,2,3], [4,5,6]]) @ tensor([[1, 2], [3, 4], [5, 6]])
tensor([[22, 28],
[49, 64]])
tensor([[1, 2], [3, 4], [5, 6]]) @ tensor([[1,2,3], [4,5,6]])
tensor([[ 9, 12, 15],
[19, 26, 33],
[29, 40, 51]])
Back to updating our functions to handle neural nets:
def update_coeffs(coeffs, lr):
    for layer in coeffs:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()
coeffs = train_model(lr=1.4)
0.543; 0.532; 0.520; 0.505; 0.487; 0.466; 0.439; 0.407; 0.373; 0.343; 0.319; 0.301; 0.286; 0.274; 0.264; 0.256; 0.250; 0.245; 0.240; 0.237; 0.234; 0.231; 0.229; 0.227; 0.226; 0.224; 0.223; 0.222; 0.221; 0.220;
coeffs = train_model(lr=20)
0.543; 0.400; 0.260; 0.390; 0.221; 0.211; 0.197; 0.195; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192;
acc(coeffs)
tensor(0.8258)
Next we train a deep learning model:
def init_coeffs():
    hiddens = [10,10]
    sizes = [n_coeff] + hiddens + [1]
    n = len(sizes)
    layers = [(torch.rand(sizes[i], sizes[i+1])-0.3)/sizes[i+1]*4 for i in range(n-1)]
    consts = [(torch.rand(1)[0]-0.5)*0.1 for i in range(n-1)]
    for l in layers+consts: l.requires_grad_()
    return layers,consts
I’ll run through this function’s code line by line to make sure I see what’s going on:
hiddens = [10,10]
sizes = [n_coeff] + hiddens + [1]
sizes
[12, 10, 10, 1]
n = len(sizes)
n
4
[(sizes[i], sizes[i+1]) for i in range(n-1)]
[(12, 10), (10, 10), (10, 1)]
[(torch.rand(1)[0]-0.5)*0.1 for i in range(n-1)]
[tensor(-0.0371), tensor(0.0406), tensor(-0.0461)]
Cool! I can see it now. Next we update the function which calculates predictions to handle a deep neural net:
def calc_preds(coeffs, indeps):
    layers,consts = coeffs
    n = len(layers)
    res = indeps
    for i,l in enumerate(layers):
        res = res@l + consts[i]
        # pass through ReLU for all layers except the last one
        if i!=n-1: res = F.relu(res)
    return torch.sigmoid(res)
def update_coeffs(coeffs, lr):
    layers,consts = coeffs
    for layer in layers+consts:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()
coeffs = train_model(lr=4)
0.521; 0.483; 0.427; 0.379; 0.379; 0.379; 0.379; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.377; 0.376; 0.371; 0.333; 0.239; 0.224; 0.208; 0.204; 0.203; 0.203; 0.207; 0.197; 0.196; 0.195;
acc(coeffs)
tensor(0.8258)
That’s a wrap for that notebook! It all makes clear sense now after running through the code line by line. We trained a linear model, neural net, and deep learning model and got similar results. In this case, as discussed in the video, the deep learning model doesn’t improve our results.
Notebook Exercise: Why you should use a framework
In this section I run through the “clean” version of Jeremy’s notebook.
from fastai.tabular.all import *
pd.options.display.float_format = '{:.2f}'.format
set_seed(42)
# read in the data
df = pd.read_csv(path/'train.csv')
# view the data
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.00 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.10 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00 | 0 | 0 | 373450 | 8.05 | NaN | S |
# feature engineering
def add_features(df):
    df['LogFare'] = np.log1p(df['Fare'])
    df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
    df['Family'] = df.SibSp+df.Parch
    df['Alone'] = df.Family == 0
    df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
    df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    df['Title'] = df.Title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master"))
I’ll look at some of these in more detail to break down what is happening:
df.Cabin.str[0].unique()
array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG")).unique()
array([nan, 'ABC', 'DE', 'FG'], dtype=object)
df.Ticket
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
...
886 211536
887 112053
888 W./C. 6607
889 111369
890 370376
Name: Ticket, Length: 891, dtype: object
df.groupby('Ticket')['Ticket'].transform('count')
0 1
1 1
2 1
3 2
4 1
..
886 1
887 1
888 2
889 1
890 1
Name: Ticket, Length: 891, dtype: int64
# the count for this ticket should be 2
df.query('Ticket == "113803"')
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.10 | C123 | S |
137 | 138 | 0 | 1 | Futrelle, Mr. Jacques Heath | male | 37.00 | 1 | 0 | 113803 | 53.10 | C123 | S |
# expand = True splits into separate columns
df.Name.str.split(', ', expand=True).head()
0 | 1 | |
---|---|---|
0 | Braund | Mr. Owen Harris |
1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) |
2 | Heikkinen | Miss. Laina |
3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) |
4 | Allen | Mr. William Henry |
df.Name.str.split(', ', expand=False)
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs Thayer)]
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0].unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
'Jonkheer'], dtype=object)
The line df.Title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master"))
reduces the number of titles to 4.
df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0].map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master")).unique()
array(['Mr', 'Mrs', 'Miss', 'Master', nan], dtype=object)
# add the features to our dataframe
add_features(df)
df.Title.unique()
array(['Mr', 'Mrs', 'Miss', 'Master', nan], dtype=object)
df.Deck.unique()
array([nan, 'ABC', 'DE', 'FG'], dtype=object)
df.Family.unique()
array([ 1, 0, 4, 2, 6, 5, 3, 7, 10])
df.LogFare.hist();
df.Alone.unique()
array([False, True])
df.TicketFreq.hist();
# create training and validation index lists
splits = RandomSplitter(seed=42)(df)
splits
((#713) [788,525,821,253,374,98,215,313,281,305...],
(#178) [303,778,531,385,134,476,691,443,386,128...])
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'LogFare', 'Deck',
'Family', 'Alone', 'TicketFreq', 'Title'],
dtype='object')
# create dataloaders object
dls = TabularPandas(
    df, splits=splits,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],
    cont_names=["Age", "SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Survived",
    y_block=CategoryBlock()
).dataloaders(path=".")
# view a batch
dls.show_batch()
Sex | Pclass | Embarked | Deck | Title | Age_na | Age | SibSp | Parch | LogFare | Alone | TicketFreq | Family | Survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 3 | Q | #na# | Mr | True | 28.00 | 1.00 | -0.00 | 2.80 | 0.00 | 2.00 | 1.00 | 0 |
1 | male | 3 | C | #na# | Mr | False | 30.00 | 0.00 | -0.00 | 2.11 | 1.00 | 1.00 | -0.00 | 0 |
2 | male | 3 | S | #na# | Mr | False | 28.00 | 2.00 | -0.00 | 2.19 | 0.00 | 1.00 | 2.00 | 0 |
3 | female | 3 | S | #na# | Miss | False | 45.00 | 0.00 | -0.00 | 2.17 | 1.00 | 1.00 | -0.00 | 0 |
4 | male | 2 | S | #na# | Mr | True | 28.00 | 0.00 | -0.00 | 0.00 | 1.00 | 1.00 | -0.00 | 0 |
5 | male | 3 | S | #na# | Mr | True | 28.00 | 0.00 | -0.00 | 2.78 | 1.00 | 1.00 | -0.00 | 0 |
6 | male | 1 | S | ABC | Mr | False | 38.00 | 0.00 | 1.00 | 5.04 | 0.00 | 3.00 | 1.00 | 0 |
7 | male | 1 | C | ABC | #na# | False | 32.00 | 0.00 | -0.00 | 3.45 | 1.00 | 1.00 | -0.00 | 1 |
8 | male | 2 | S | #na# | Mr | False | 24.00 | 2.00 | -0.00 | 4.31 | 0.00 | 5.00 | 2.00 | 0 |
9 | male | 2 | S | #na# | Mr | False | 48.00 | 0.00 | -0.00 | 2.64 | 1.00 | 1.00 | -0.00 | 0 |
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.lr_find(suggest_funcs=(slide, valley))
SuggestedLRs(slide=0.04786301031708717, valley=0.015848932787775993)
learn.fit(16, lr=0.03)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.551385 | 0.558225 | 0.595506 | 00:00 |
1 | 0.498181 | 0.578588 | 0.752809 | 00:00 |
2 | 0.472778 | 0.471495 | 0.803371 | 00:00 |
3 | 0.447318 | 0.430369 | 0.825843 | 00:00 |
4 | 0.432644 | 0.454893 | 0.808989 | 00:00 |
5 | 0.421892 | 0.397669 | 0.825843 | 00:00 |
6 | 0.413710 | 0.406790 | 0.814607 | 00:00 |
7 | 0.406777 | 0.430182 | 0.825843 | 00:00 |
8 | 0.402777 | 0.434063 | 0.837079 | 00:00 |
9 | 0.397782 | 0.425264 | 0.814607 | 00:00 |
10 | 0.392991 | 0.413648 | 0.837079 | 00:00 |
11 | 0.390115 | 0.422005 | 0.820225 | 00:00 |
12 | 0.385480 | 0.412861 | 0.837079 | 00:00 |
13 | 0.383542 | 0.403564 | 0.820225 | 00:00 |
14 | 0.380573 | 0.422910 | 0.831461 | 00:00 |
15 | 0.378466 | 0.444065 | 0.820225 | 00:00 |
# prep test data for submission
tst_df = pd.read_csv(path/'test.csv')
tst_df['Fare'] = tst_df.Fare.fillna(0)
tst_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
add_features(tst_df)
tst_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked', 'LogFare', 'Deck', 'Family',
'Alone', 'TicketFreq', 'Title'],
dtype='object')
tst_dl = learn.dls.test_dl(tst_df)
tst_dl.show_batch()
Sex | Pclass | Embarked | Deck | Title | Age_na | Age | SibSp | Parch | LogFare | Alone | TicketFreq | Family | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | male | 3 | Q | #na# | Mr | False | 34.50 | 0.00 | -0.00 | 2.18 | 1.00 | 1.00 | -0.00 |
1 | female | 3 | S | #na# | Mrs | False | 47.00 | 1.00 | -0.00 | 2.08 | 0.00 | 1.00 | 1.00 |
2 | male | 2 | Q | #na# | Mr | False | 62.00 | 0.00 | -0.00 | 2.37 | 1.00 | 1.00 | -0.00 |
3 | male | 3 | S | #na# | Mr | False | 27.00 | 0.00 | -0.00 | 2.27 | 1.00 | 1.00 | -0.00 |
4 | female | 3 | S | #na# | Mrs | False | 22.00 | 1.00 | 1.00 | 2.59 | 0.00 | 1.00 | 2.00 |
5 | male | 3 | S | #na# | Mr | False | 14.00 | 0.00 | -0.00 | 2.32 | 1.00 | 1.00 | -0.00 |
6 | female | 3 | Q | #na# | Miss | False | 30.00 | 0.00 | -0.00 | 2.16 | 1.00 | 1.00 | -0.00 |
7 | male | 2 | S | #na# | Mr | False | 26.00 | 1.00 | 1.00 | 3.40 | 0.00 | 1.00 | 2.00 |
8 | female | 3 | C | #na# | Mrs | False | 18.00 | 0.00 | -0.00 | 2.11 | 1.00 | 1.00 | -0.00 |
9 | male | 3 | S | #na# | Mr | False | 21.00 | 2.00 | -0.00 | 3.22 | 0.00 | 1.00 | 2.00 |
get_preds
returns predictions for both categories of Survived
(0 and 1).
learn.get_preds(dl=tst_dl)[0][:5]
tensor([[0.9141, 0.0859],
[0.5954, 0.4046],
[0.9711, 0.0289],
[0.9268, 0.0732],
[0.4136, 0.5864]])
learn.get_preds(dl=tst_dl)[0][:5].sum(axis=1)
tensor([1., 1., 1., 1., 1.])
# targets are empty---why?
learn.get_preds(dl=tst_dl)[1]
preds,_ = learn.get_preds(dl=tst_dl)
tst_df['Survived'] = (preds[:,1]>0.5).int()
tst_df.Survived.unique()
array([0, 1], dtype=int32)
sub_df = tst_df[['PassengerId', 'Survived']]
sub_df.head()
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
# ensembling
def ensemble():
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    with learn.no_bar(), learn.no_logging(): learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]

learns = [ensemble() for _ in range(5)]
ens_preds = torch.stack(learns).mean(0)
torch.stack(learns).shape
torch.Size([5, 418, 2])
ens_preds.shape
torch.Size([418, 2])
tst_df['Survived'] = (ens_preds[:,1]>0.5).int()
sub_df = tst_df[['PassengerId', 'Survived']]
sub_df.head()
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
Notebook Exercise: How random forests really work
In this section I run through the “clean” version of Jeremy’s notebook.
from fastai.imports import *
np.set_printoptions(linewidth=130)
df = pd.read_csv(path/'train.csv')
tst_df = pd.read_csv(path/'test.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.00 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00 | 0 | 0 | STON/O2. 3101282 | 7.92 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.10 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00 | 0 | 0 | 373450 | 8.05 | NaN | S |
tst_df.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.50 | 0 | 0 | 330911 | 7.83 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.00 | 1 | 0 | 363272 | 7.00 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.00 | 0 | 0 | 240276 | 9.69 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.00 | 0 | 0 | 315154 | 8.66 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.00 | 1 | 1 | 3101298 | 12.29 | NaN | S |
modes = df.mode().iloc[0]
modes
PassengerId 1
Survived 0.00
Pclass 3.00
Name Abbing, Mr. Anthony
Sex male
Age 24.00
SibSp 0.00
Parch 0.00
Ticket 1601
Fare 8.05
Cabin B96 B98
Embarked S
Name: 0, dtype: object
# pre-processing
def proc_data(df):
    df['Fare'] = df.Fare.fillna(0)
    df.fillna(modes, inplace=True)
    df['LogFare'] = np.log1p(df['Fare'])
    df['Embarked'] = pd.Categorical(df.Embarked)
    df['Sex'] = pd.Categorical(df.Sex)
df.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)
pd.Categorical(df.Embarked).unique()
['S', 'C', 'Q', NaN]
Categories (3, object): ['C', 'Q', 'S']
proc_data(df)
proc_data(tst_df)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'LogFare'],
dtype='object')
tst_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked', 'LogFare'],
dtype='object')
df.Sex
0 male
1 female
2 female
3 female
4 male
...
886 male
887 female
888 female
889 male
890 male
Name: Sex, Length: 891, dtype: category
Categories (2, object): ['female', 'male']
tst_df.Sex
0 male
1 female
2 male
3 male
4 female
...
413 male
414 female
415 male
416 male
417 male
Name: Sex, Length: 418, dtype: category
Categories (2, object): ['female', 'male']
cats = ["Sex", "Embarked"]
conts = ['Age', 'SibSp', 'Parch', 'LogFare', 'Pclass']
dep = "Survived"
Categoricals are stored as integers but shown as their labels:
df.Sex.head()
0 male
1 female
2 female
3 female
4 male
Name: Sex, dtype: category
Categories (2, object): ['female', 'male']
df.Sex.cat.codes.head()
0 1
1 0
2 0
3 0
4 1
dtype: int8
import seaborn as sns
Sex alone is a pretty good indicator of survival:
fig,axs = plt.subplots(1,2, figsize=(11,5))
sns.barplot(data=df, y=dep, x="Sex", ax=axs[0]).set(title="Survival rate")
sns.countplot(data=df, x="Sex", ax=axs[1]).set(title="Histogram");
from numpy import random
from sklearn.model_selection import train_test_split
random.seed(42)
trn_df,val_df = train_test_split(df, test_size=0.25)
trn_df[cats] = trn_df[cats].apply(lambda x: x.cat.codes)
val_df[cats] = val_df[cats].apply(lambda x: x.cat.codes)
trn_df[cats].head()
Sex | Embarked | |
---|---|---|
298 | 1 | 2 |
884 | 1 | 2 |
247 | 0 | 2 |
478 | 1 | 2 |
305 | 1 | 2 |
val_df[cats].head()
Sex | Embarked | |
---|---|---|
709 | 1 | 0 |
439 | 1 | 2 |
840 | 1 | 2 |
720 | 0 | 2 |
39 | 0 | 0 |
def xs_y(df):
    xs = df[cats+conts].copy()
    return xs,df[dep] if dep in df else None

trn_xs,trn_y = xs_y(trn_df)
val_xs,val_y = xs_y(val_df)
trn_xs.head()
Sex | Embarked | Age | SibSp | Parch | LogFare | Pclass | |
---|---|---|---|---|---|---|---|
298 | 1 | 2 | 24.00 | 0 | 0 | 3.45 | 1 |
884 | 1 | 2 | 25.00 | 0 | 0 | 2.09 | 3 |
247 | 0 | 2 | 24.00 | 0 | 2 | 2.74 | 2 |
478 | 1 | 2 | 22.00 | 0 | 0 | 2.14 | 3 |
305 | 1 | 2 | 0.92 | 1 | 2 | 5.03 | 1 |
trn_y.head()
298 1
884 0
247 1
478 0
305 1
Name: Survived, dtype: int64
# sex as the only predictor
preds = val_xs.Sex==0
from sklearn.metrics import mean_absolute_error
mean_absolute_error(val_y, preds)
0.21524663677130046
df_fare = trn_df[trn_df.LogFare>0]
fig,axs = plt.subplots(1,2, figsize=(11,5))
sns.boxenplot(data=df_fare, x=dep, y="LogFare", ax=axs[0])
sns.kdeplot(data=df_fare, x="LogFare", ax=axs[1]);
It looks like people tended to survive when LogFare was above roughly 2.7 (roughly 2.5 is the median LogFare for those who died).
# LogFare as a sole predictor
preds = val_xs.LogFare>2.7
mean_absolute_error(val_y, preds)
0.336322869955157
We get a larger error than we did with Sex as the predictor.
def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot

def score(col, y, split):
    lhs = col<=split
    return (_side_score(lhs, y) + _side_score(~lhs, y))/len(y)
"Sex"], trn_y, 0.5) score(trn_xs[
0.40787530982063946
= trn_xs["Sex"] <= 0.5 lhs
lhs.sum()
229
trn_y[lhs].std()*lhs.sum()
100.36927432272375
trn_y[~lhs].std()*(~lhs).sum()
172.0914326374634
len(trn_y)
668
(100.36927432272375 + 172.0914326374634)/668
0.40787530982063946
"LogFare"], trn_y, 2.7) score(trn_xs[
0.47180873952099694
A smaller score means less variation on each side.
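To see what the score is rewarding, here is a toy example (made-up numbers, reusing the score function defined just above): a split that cleanly separates 0s from 1s scores lower than one that mixes them.

import pandas as pd

# made-up column and 0/1 dependent variable (not the Titanic data)
toy_col = pd.Series([1., 2., 3., 10., 11., 12.])
toy_y   = pd.Series([0, 0, 0, 1, 1, 1])

print(score(toy_col, toy_y, 3.0))   # clean split: both sides are constant, so the score is 0.0
print(score(toy_col, toy_y, 1.5))   # poor split: the right side mixes 0s and 1s, so the score is ~0.46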
def iscore(nm, split):
    col = trn_xs[nm]
    return score(col, trn_y, split)
from ipywidgets import interact
interact(nm=conts, split=15.5)(iscore);
interact(nm=cats, split=15.5)(iscore);
= "Age"
nm = trn_xs[nm]
col = col.unique()
unq
unq.sort() unq
array([ 0.42, 0.67, 0.75, 0.83, 0.92, 1. , 2. , 3. , 4. , 5. , 6. , 7. , 8. , 9. , 10. , 11. , 12. ,
13. , 14. , 14.5 , 15. , 16. , 17. , 18. , 19. , 20. , 21. , 22. , 23. , 24. , 24.5 , 25. , 26. , 27. ,
28. , 28.5 , 29. , 30. , 31. , 32. , 32.5 , 33. , 34. , 34.5 , 35. , 36. , 36.5 , 37. , 38. , 39. , 40. ,
40.5 , 41. , 42. , 43. , 44. , 45. , 45.5 , 46. , 47. , 48. , 49. , 50. , 51. , 52. , 53. , 54. , 55. ,
55.5 , 56. , 57. , 58. , 59. , 60. , 61. , 62. , 64. , 65. , 70. , 70.5 , 74. , 80. ])
scores = np.array([score(col, trn_y, o) for o in unq if not np.isnan(o)])
unq[scores.argmin()]
6.0
scores.min()
0.478316717508991
"Age"], trn_y, 6) score(trn_xs[
0.478316717508991
def min_col(df, nm):
    col, y = df[nm], df[dep]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx],scores[idx]
"Age") min_col(trn_df,
(6.0, 0.478316717508991)
cols = cats+conts
{o: min_col(trn_df, o) for o in cols}
{'Sex': (0, 0.40787530982063946),
'Embarked': (0, 0.47883342573147836),
'Age': (6.0, 0.478316717508991),
'SibSp': (4, 0.4783740258817434),
'Parch': (0, 0.4805296527841601),
'LogFare': (2.4390808375825834, 0.4620823937736597),
'Pclass': (2, 0.46048261885806596)}
"Sex")
cols.remove(= trn_df.Sex==1
ismale = trn_df[ismale], trn_df[~ismale] males, females
for o in cols} {o: min_col(males, o)
{'Embarked': (0, 0.3875581870410906),
'Age': (6.0, 0.3739828371010595),
'SibSp': (4, 0.3875864227586273),
'Parch': (0, 0.3874704821461959),
'LogFare': (2.803360380906535, 0.3804856231758151),
'Pclass': (1, 0.38155442004360934)}
{o: min_col(females, o) for o in cols}
{'Embarked': (0, 0.4295252982857327),
'Age': (50.0, 0.4225927658431649),
'SibSp': (4, 0.42319212059713535),
'Parch': (3, 0.4193314500446158),
'LogFare': (4.256321678298823, 0.41350598332911376),
'Pclass': (2, 0.3335388911567601)}
The next split after Sex
is Age<=6
for males
and Pclass<=2
for females
.
from sklearn.tree import DecisionTreeClassifier, export_graphviz
m = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_xs, trn_y);
import graphviz
def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                        special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))
draw_tree(m, trn_xs, size=10)
def gini(cond):
    act = df.loc[cond, dep]
    return 1 - act.mean()**2 - (1-act).mean()**2
gini(df.Sex=='female'), gini(df.Sex=='male')
(0.3828350034484158, 0.3064437162277842)
mean_absolute_error(val_y, m.predict(val_xs))
0.2242152466367713
m = DecisionTreeClassifier(min_samples_leaf=50)
m.fit(trn_xs, trn_y)
draw_tree(m, trn_xs, size=60)
mean_absolute_error(val_y, m.predict(val_xs))
0.18385650224215247
tst_df[cats] = tst_df[cats].apply(lambda x: x.cat.codes)
tst_xs,_ = xs_y(tst_df)
tst_xs.head()
Sex | Embarked | Age | SibSp | Parch | LogFare | Pclass | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 34.50 | 0 | 0 | 2.18 | 3 |
1 | 0 | 2 | 47.00 | 1 | 0 | 2.08 | 3 |
2 | 1 | 1 | 62.00 | 0 | 0 | 2.37 | 2 |
3 | 1 | 2 | 27.00 | 0 | 0 | 2.27 | 3 |
4 | 0 | 2 | 22.00 | 1 | 1 | 2.59 | 3 |
def subm(preds, suff):
    tst_df['Survived'] = preds
    sub_df = tst_df[['PassengerId', 'Survived']]
    sub_df.to_csv(f'sub-{suff}.csv', index=False)

subm(m.predict(tst_xs), 'tree')
def get_tree(prop=0.75):
    n = len(trn_y)
    idxs = random.choice(n, int(n*prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

trees = [get_tree() for t in range(100)]
all_probs = [t.predict(val_xs) for t in trees]
avg_probs = np.stack(all_probs).mean(0)
mean_absolute_error(val_y, avg_probs)
0.22811659192825115
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(100, min_samples_leaf=5)
rf.fit(trn_xs, trn_y);
mean_absolute_error(val_y, rf.predict(val_xs))
0.18834080717488788
pd.DataFrame(dict(cols=trn_xs.columns, imp=m.feature_importances_)).plot('cols', 'imp', 'barh');
Video Notes
- Kaggle sets an environment variable that you can check to see if you’re on Kaggle.
df.isna()
returns a DataFrame
with boolean values (True
if the value is NaN
).
- If you call
sum
on a DataFrame
it sums up each column.
- The easiest method to impute missing values is to replace them with the mode.
- Mode works for both categorical and continuous variables.
- First baseline model shouldn’t involve doing complicated things.
- Never throw out columns with missing values. Maybe it turns out that the row missing a value is great predictor.
- Some types of models (like linear) don’t like long-tailed distributions like
Fare
. Neural nets are also better behaved without them.
- Things that grow exponentially you want to take the log of (money, population, etc.).
- Dummy variables turn categoricals into 1/0 valued columns for each categorical.
- For n levels if you create n 0/1 columns you don’t have to add a constant term to the model.
- You can create an 82% accurate model just using names.
- Idea of tensor came from notation in 1950. Ken Iverson.
- The most important attribute of a tensor is its
shape
. The length of the shape is its rank.
Linear Model
- The number of coefficients we need is the number of columns in the independent variable.
- Computers can’t create truly random numbers and instead create a sequence of numbers that behave in a random-like way.
- A lot of people are into reproducible results—Jeremy disagrees. An important part of understanding your data is understanding how much it varies from run to run. Run things a few times and get an intuitive sense of how stable it is.
- broadcasting comes from APL. Happens in optimized C code (CPU) or CUDA (GPU). As long as the last axes match it’ll broadcast. It uses a kind of “virtual copying”.
- Linear model: coefficients times the values, added together.
Age
is bigger than any other column, so it will always have a larger value. Not ideal for optimization.
- Normalize the columns (divide by the maximum in the column).
- Another common way to normalize is subtracting the mean and dividing by the standard deviation.
- Mean absolute value is a good loss function to start with.
- In notebooks, do everything step-by-step manually and then copy it into a function.
- PyTorch functions with an underscore at the end will do an in-place operation.
.backward()
calls the gradient function.
- If the gradient is negative, increasing that coefficient will make the loss go down. If it’s positive, decreasing that coefficient will make the loss go down.
RandomSplitter(seed=42)(df)
returns the indexes (training, validation) of the split.
- We can’t use accuracy as a loss function because it doesn’t have a smooth gradient.
- Sigmoid makes it easier to optimize—optimizer doesn’t have to exactly hit 0 or 1, it can predict a really big number and it gets converted to 1 or a really small number that gets converted to 0.
- Sigmoid =
1/(1+exp(-x))
sympy
package does symbolic calculations and plots.
- With sigmoid, we could increase the learning rate from
0.1
to2
, showing that it truly is easier to optimize.
- Binary dependent variable: chuck it through sigmoid.
- fastai always creates an extra category called “other” for categorical columns. At test time if you have a level that wasn’t in training, fastai puts it into the “other” category for you.
- For categorical variables fastai puts less common ones into “other”.
Neural Net Model
(indeps*coeffs).sum(axis=1)
is the same thing as matrix multiplication.init_coeffs
changed to create anncoeff
by1
matrix instead of anncoeff
vector, since for the neural net we will have multiple columns of coefficients.tensor[:,None]
indexes into second dimensionNone
it creates that dimension.- Dimension of
1
is a “unit axis”. torch.Size([12, 1])
represents a rank-2 tensor with a trailing unit axis.- If our coefficients are too big or too small, it’s not going to train at all so you have to fiddle with their magnitude in a from-scratch model.
Deep Learning Model
- Jeremy divides the first layer coefficients by
n_hidden
since the coeffs will get multiplied by a second layer as well and we want the coeffs to be a similar size as the linear model. - The final layer absolutely needs a constant term.
- A deep learning model has multiple hidden layers.
torch.sigmoid
andF.relu
are the activation functions for the layers.- For very small datasets with very few columns and columns that are really simple, deep learning is not necessarily going to give you the best result. Nothing is going to be as good as a carefully designed model that uses just the name column.
- For data types which have a very consistent structure, like images or natural language text documents, you can chuck a deep learning neural net at it and expect great results. Generally for tabular data that’s not the case. Normally you have to think pretty long and hard about feature engineering to get good results.
- You want to make choices for the non-obvious things and have the obvious things done for you by a package like fastai.
Using fastai
Categorify
handles dummy variables.learner.lr_find
starts at a very small learning rate like10e-7
, trains one batch of data and calculates the loss, increases the learning rate slightly and calculates the loss again. Picking a learning rate betweenslide
andvalley
generally works well for training.learn.dls.test_dl
creates aDataLoader
that contains exactly the same processing steps that our learner used.- You want to make sure your inference time pre-processing and transformations are exactly the same as training time.
- Ensembling is about creating multiple models and combining their predictions.
Random Forests
- Random forests are elegant, and almost impossible to mess up. Jeremy has seen far more examples in industry of people messing up logistic regression than random forests.
- Handy shortcut:
from fastai.imports import *
. df.col_name.cat.codes
shows actual values (numbers corresponding to list of categories) for categorical column.- A random forest is an ensemble of trees, a tree is an ensemble of binary splits.
- A binary split is something that splits the rows into two groups.
- Kernel density plot is like a histogram with infinitesimally narrow bins.
- A good split is one where all of the values of the dependent variable on one side are all pretty much the same and all of dependent variable values on the other side are pretty much the same.
- You want each of your groups, within the group, to be as similar as possible on the dependent variable.
- “how similar are all the things in the group” = standard deviation.
- Sex is the best single binary split model we can find.
- “OneR” model: create a single binary split and stop.
- Don’t assume that you have to go complicated. It’s not a bad idea to always create a OneR baseline (a decision tree with a single binary split), as sketched below.
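As a concrete version of that OneR baseline, here is a minimal sketch (assuming the trn_xs/trn_y/val_xs/val_y splits created earlier in this notebook): a depth-1 decision tree is exactly one binary split.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error

# a tree limited to depth 1 can only make a single binary split: a "OneR" model
oner = DecisionTreeClassifier(max_depth=1).fit(trn_xs, trn_y)
mean_absolute_error(val_y, oner.predict(val_xs))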
Book Notes
- The objective of tabular modeling is to predict the value in one column based on the values in the other columns.
Categorical Embeddings
- Continuous variables can be directly fed to the model (with some optional preprocessing).
- Categorical variables need to be converted to numbers. Addition and multiplication don’t have meaning for them even if they’re stored as numbers.
- Rossmann competition example notebook
- The embedding layer is just another layer in the model.
- The embedding transforms the categorical variables into inputs that are both continuous and meaningful.
- The raw categorical data is transformed by an embedding layer before it interacts with the raw continuous input data.
- Deep learning is not always the best starting point for analyzing tabular data.
Beyond Deep Learning
- Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:
- Ensembles of decision trees (random forests and gradient boosting machines), mainly for structured data. They train faster, are often easier to interpret, do not require GPU for inference at scale, often require less hyperparameter tuning, and have a more mature ecosystem of tooling and documentation.
- Multilayered neural networks learned with SGD (shallow and/or deep learning) mainly for unstructured data (audio, images, and natural language)
- The critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles.
- There are tools and methods for answering questions like:
- Which columns in the dataset were the most important for your predictions?
- How are they related to the dependent variable?
- How do they interact with each other?
- Which particular features were most important for some particular observation?
- Ensembles of decision trees are our first approach for analyzing a new tabular dataset except when there are some high-cardinality categorical variables that are very important or when there are some columns that contain data that would be understood with a neural network such as plain text data.
The Dataset
- Blue Book for Bulldozers Kaggle competition: the goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration.
!pip install dtreeviz
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
from pathlib import Path
cred_path = Path("~/.kaggle/kaggle.json").expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
import zipfile,kaggle
path = Path('bluebook-for-bulldozers')
if not path.exists():
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading bluebook-for-bulldozers.zip to /content
100%|██████████| 48.4M/48.4M [00:01<00:00, 36.3MB/s]
path.ls(file_type='text')
(#7) [Path('bluebook-for-bulldozers/Valid.csv'),Path('bluebook-for-bulldozers/median_benchmark.csv'),Path('bluebook-for-bulldozers/Test.csv'),Path('bluebook-for-bulldozers/Machine_Appendix.csv'),Path('bluebook-for-bulldozers/TrainAndValid.csv'),Path('bluebook-for-bulldozers/ValidSolution.csv'),Path('bluebook-for-bulldozers/random_forest_benchmark_test.csv')]
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
len(df.columns)
53
df.columns
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
dtype='object')
df.SalePrice.hist();
="saledate", y="SalePrice"); df.plot(x
len(df.SalesID.unique())
412698
len(df.MachineID.unique())
348808
df.MachineHoursCurrentMeter.unique()
array([ 68., 4640., 2838., ..., 11612., 12097., 14650.])
df.Forks.unique()
array(['None or Unspecified', nan, 'Yes'], dtype=object)
df.Pad_Type.unique()
array([nan, 'None or Unspecified', 'Reversible', 'Street', 'Grouser'],
dtype=object)
df.Backhoe_Mounting.unique()
array([nan, 'None or Unspecified', 'Yes'], dtype=object)
df.ProductSize.unique()
array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large',
'Compact'], dtype=object)
df.SalePrice.unique()[:10]
array([66000., 57000., 10000., 38500., 11000., 26500., 21000., 27000.,
21500., 65000.])
Tell pandas about a suitable ordering of these levels like so:
sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
FutureWarning: The `inplace` parameter in pandas.Categorical.set_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
# I believe the ordering should be reverse of this
df.ProductSize.unique()
[NaN, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact']
Categories (6, object): ['Large' < 'Large / Medium' < 'Medium' < 'Small' < 'Mini' < 'Compact']
The metric we will use is RMSLE (root mean squared log error) between the actual and predicted auction prices. Take the log of the prices so that the m_rmse
of that value will give us the metric.
dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])
df.SalePrice.hist();
Decision Trees
A decision tree asks a series of binary (yes or no) questions about the data. After each question the data at that part of the tree is split between a Yes and a No branch. After one or more questions, either a prediction can be made on the basis of all previous answers or another question is required.
The basic steps to train a decision tree (a minimal sketch of the split search follows the list):
- Loop through each column of the dataset in turn.
- For each column, loop through each possible level of that column in turn.
- Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
- Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. Treat this as a very simple “model” in which our predictions are simply the average sale price of the item’s group.
- After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
- We now have two groups of our data, based on the selected split. Treat each group as a separate dataset, and find the best split for each by going back to step 1 for each group.
- Continue this process recursively until you have reached some stopping criterion for each group–for instance, stop splitting a group further when it has only 20 items in it.
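To make steps 1-5 concrete, here is a rough sketch of the single best-split search on made-up arrays; it scores each candidate split by how far each item is from its group's average (sum of squared errors), which is one simple way to express "see how close that is to the actual sale price". The recursion in steps 6-7 would then repeat this search inside each of the two resulting groups.

import numpy as np

def best_split(X, y):
    # score every (column, threshold) pair by the SSE when each side predicts its mean
    best = (None, None, np.inf)
    for col in range(X.shape[1]):
        for thresh in np.unique(X[:, col]):
            lhs = X[:, col] <= thresh
            if lhs.all() or (~lhs).all(): continue   # skip splits that leave one side empty
            sse = sum(((y[m] - y[m].mean())**2).sum() for m in (lhs, ~lhs))
            if sse < best[2]: best = (col, thresh, sse)
    return best

# made-up data: both columns can separate the low prices from the high ones
X = np.array([[1., 0.], [2., 0.], [3., 1.], [4., 1.]])
y = np.array([10., 11., 20., 21.])
print(best_split(X, y))   # (0, 2.0, 1.0): col 0 <= 2 and col 1 <= 0 are equally good; the first one found wins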
Handling Dates
To help our algorithm handle dates intelligently, we’d like our model to know more than whether a date is more recent or less recent than another. We might want our model to make decisions based on that date’s day of the week, on whether a day is a holiday, on what month it is in, and so forth. To do this, replace every date column with a set of date metadata columns, such as holiday, day of week, and month. These columns provide categorical data that we suspect will be useful.
df = add_datepart(df, 'saledate')
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')
len(df.columns)
65
' '.join(o for o in df.columns if o.startswith('sale'))
'saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed'
Using TabularPandas and TabularProc
A TabularProc
is like a regular Transform
except for the following:
- It returns the exact same object that’s passed to it, after modifying the object in place.
- It runs the transform once, when data is first passed in, rather than lazily as the data is accessed.
Categorify
is a TabularProc
that replaces a column with a numerical categorical column. FillMissing
is a TabularProc
that replaces missing values with the median of the column, and creates a new Boolean column that is set to True
for any row where the value was missing.
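A tiny demonstration of those two procs on a made-up frame (hypothetical column names, not the bulldozers data), assuming the fastai tabular imports above:

import pandas as pd
from fastai.tabular.all import TabularPandas, Categorify, FillMissing

toy = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],   # categorical column with strings
                    'size':  [1.0, None, 3.0, 4.0],             # continuous column with a missing value
                    'y':     [0, 1, 0, 1]})

to_toy = TabularPandas(toy, procs=[Categorify, FillMissing],
                       cat_names=['color'], cont_names=['size'], y_names='y')

# Categorify stores 'color' as numeric codes; FillMissing imputes the median for 'size'
# and adds a boolean 'size_na' column flagging the row that was imputed
to_toy.items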
procs = [Categorify, FillMissing]
We need to be very careful about our validation set. We want to design it so that it is like the test set Kaggle will use to judge the contest.
The test set date range is from May 2012 to November 2012.
str) + "/" + df_test.saleMonth.astype(str)).unique() (df_test.saleYear.astype(
array(['2012/5', '2012/6', '2012/7', '2012/8', '2012/9', '2012/10',
'2012/11'], dtype=object)
The test set dates are later than any data in the training set (which has a latest date of April 2012).
str) + "/" + df.saleMonth.astype(str)).unique())[-10:] np.sort((df.saleYear.astype(
array(['2011/4', '2011/5', '2011/6', '2011/7', '2011/8', '2011/9',
'2012/1', '2012/2', '2012/3', '2012/4'], dtype=object)
We’ll define a validation set consisting of the sales from October 2011 onward (the condition below keeps earlier sales for training).
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
TabularPandas
needs to be told which columns are continuous and which are categorical.
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
len(to.train), len(to.valid)
(404710, 7988)
The data is still displayed as strings for categories.
to.show(3)
UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | fiModelSeries | fiModelDescriptor | ProductSize | fiProductClassDesc | state | ProductGroup | ProductGroupDesc | Drive_System | Enclosure | Forks | Pad_Type | Ride_Control | Stick | Transmission | Turbocharged | Blade_Extension | Blade_Width | Enclosure_Type | Engine_Horsepower | Hydraulics | Pushblock | Ripper | Scarifier | Tip_Control | Tire_Size | Coupler | Coupler_System | Grouser_Tracks | Hydraulics_Flow | Track_Type | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | auctioneerID_na | MachineHoursCurrentMeter_na | SalesID | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | saleYear | saleMonth | saleWeek | saleDay | saleDayofweek | saleDayofyear | saleElapsed | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Low | 521D | 521 | D | #na# | #na# | #na# | Wheel Loader - 110.0 to 120.0 Horsepower | Alabama | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139246 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | 2006 | 11 | 46 | 16 | 3 | 320 | 1.163635e+09 | 11.097410 |
1 | Low | 950FII | 950 | F | II | #na# | Medium | Wheel Loader - 150.0 to 175.0 Horsepower | North Carolina | WL | Wheel Loader | #na# | EROPS w AC | None or Unspecified | #na# | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | 2 Valve | #na# | #na# | #na# | #na# | 23.5 | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Standard | Conventional | False | False | False | False | False | False | False | False | 1139248 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | 2004 | 3 | 13 | 26 | 4 | 86 | 1.080259e+09 | 10.950807 |
2 | High | 226 | 226 | #na# | #na# | #na# | #na# | Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity | New York | SSL | Skid Steer Loaders | #na# | OROPS | None or Unspecified | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | Auxiliary | #na# | #na# | #na# | #na# | #na# | None or Unspecified | None or Unspecified | None or Unspecified | Standard | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | #na# | False | False | False | False | False | False | False | False | 1139249 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | 2004 | 2 | 9 | 26 | 3 | 57 | 1.077754e+09 | 9.210340 |
But the underlying items are all numeric:
"state", "ProductGroup", "Drive_System", "Enclosure"]].head(3) to.items[[
state | ProductGroup | Drive_System | Enclosure | |
---|---|---|---|---|
0 | 1 | 6 | 0 | 3 |
1 | 33 | 6 | 0 | 3 |
2 | 32 | 3 | 0 | 6 |
There’s no particular meaning to the numbers in the categorical columns after conversion, they are chosen consecutively as they are seen in a column. The exception is if you first convert a column to a Pandas ordered category.
to.classes['ProductSize']
['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
Creating the Decision Tree
xs,y = to.train.xs, to.train.y
valid_xs,valid_y = to.valid.xs, to.valid.y
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs,y);
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                        special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))
draw_tree(m, xs, size=10, leaves_parallel=True, precision=2)
The topmost node is the initial model when all data is in one group. Predicts the average value of the whole dataset. In this case it predicts 10.1 for the logarithm of the sales price, and gives a mean squared error of 0.48. The square root of this is 0.69. There are 404710 records in this group which is the total size of our training set. The best split found was a split based on the coupler_system
column. Asking only about coupler_system
predicts an average value of 9.21 versus 10.1.
import dtreeviz
samp_idx = np.random.permutation(len(y))[:500]

viz_model = dtreeviz.model(m,
    X_train=xs.iloc[samp_idx],
    y_train=y.iloc[samp_idx],
    feature_names=xs.columns,
    target_name=dep_var)

viz_model.view(fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
    orientation='LR')
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but DecisionTreeRegressor was fitted with feature names
The YearMade
data has values of 1000
which we need to change to make it more realistic:
xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs,y);
viz_model = dtreeviz.model(m,
    X_train=xs.iloc[samp_idx],
    y_train=y.iloc[samp_idx],
    feature_names=xs.columns,
    target_name=dep_var)

viz_model.view(fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
    orientation='LR')
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but DecisionTreeRegressor was fitted with feature names
The change in YearMade
doesn’t change the model in any significant way—shows how resilient decision trees are to data issues.
Build a bigger tree (don’t pass any stopping criteria).
m = DecisionTreeRegressor()
m.fit(xs,y);
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
m_rmse(m, xs, y)
0.0
The model has 0.0
root mean square error but that is on the training set. Let’s check the validation error:
m_rmse(m, valid_xs, valid_y)
0.331731
The model is overfitting pretty badly.
m.get_n_leaves(), len(xs)
(324567, 404710)
We have nearly as many leaf nodes as data points.
Let’s change the stopping rule to tell sklearn to ensure every leaf node contains at least 25 auction records:
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(to.train.xs, to.train.y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.248564, 0.323369)
That looks better.
m.get_n_leaves()
12397
Random Forests
- Leo Breiman in 1994 while retired published a technical report called “Bagging Predictors” which turned out to be one of the most influential ideas in modern machine learning.
- Here is his procedure, known as bagging:
- Randomly choose a subset of rows of your data.
- Train a model using this subset.
- Save that model, and then return to step 1 a few times.
- This will give you multiple trained models. To make a prediction, predict using all of the models, and then take the average of each of those model’s predictions.
- Although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors is zero.
- If we take the average of all of the models’ predictions, we should end up with a prediction that gets closer and closer to the correct answer, the more models we have.
- We can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of data, and averaging its predictions.
- Random Forest: a model that averages the predictions of a large number of decision trees which are generated by randomly varying various parameters that specify what data is used to train the tree and other tree parameters.
- Ensembling: combining the results of multiple models together.
Creating a Random Forest
- Similar to creating a decision tree except now we are also specifying parameters that indicate how many trees should be in the forest, how we should subset the data items (the rows) and how we should subset the fields (the columns).
- In the function rf:
  - n_estimators: number of trees.
  - max_samples: number of rows to sample for training each tree.
  - max_features: how many columns to sample at each split point (where 0.5 means "take half the total number of columns").
  - min_samples_leaf: when to stop splitting the tree nodes.
  - n_jobs=-1: tell sklearn to use all our CPUs to build the trees in parallel.
def rf(xs, y, n_estimators=40, max_samples=200_000, max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators, max_samples=max_samples,
                                 max_features=max_features, min_samples_leaf=min_samples_leaf,
                                 oob_score=True).fit(xs,y)
m = rf(xs, y);
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
(0.170922, 0.233145)
Random forests aren’t very sensitive to the hyperparameter choices such as max_features
. You can set n_estimators
to as high a number as you have time to train. The more trees you have, the more accurate the model will be. If you have over 200k data points, set max_samples
to 200k and it will train faster with little impact on accuracy. The models with the lowest error result from using a subset of features with a larger number of trees.
Get the predictions from each individual tree in our forest:
preds = np.stack([t.predict(valid_xs) for t in m.estimators_]);
r_mse(preds.mean(0), valid_y)
0.233145
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)]);
The improvement levels off quite a bit after around 30 trees.
We don’t know if the performance on the validation set is worse than on our training set because we’re overfitting or because the validation set covers a different time period.
Out-of-Bag Error
In a random forest, each tree is trained on a different subset of the training data. The OOB error is a way of measuring prediction error in the training dataset by including in the calculation of a row’s error trees only where that row was not included in training. This allows us to see whether the model is overfitting without needing a separate validation set. Since every tree was trained with a different randomly selected subset of rows, out-of-bag error is a little like imagining that every tree therefore also has its own validation set, which is simply the rows that were not selected for that tree’s training. This is particularly beneficial when you have a small amount of training data.
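Here is a hand-rolled sketch of the same idea on made-up data (it does not use sklearn's built-in oob_prediction_): each tree records the rows it never saw, and each row is scored only by those trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))            # made-up features
y = 2*X[:, 0] + rng.normal(0, 0.5, size=200)     # made-up target

n_trees, n = 25, len(y)
preds, counts = np.zeros(n), np.zeros(n)
for _ in range(n_trees):
    idx = rng.integers(0, n, n)                  # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)        # rows this tree never saw
    t = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
    preds[oob] += t.predict(X[oob])
    counts[oob] += 1

# average each row's prediction over only the trees that did not train on it
# (a row sampled by every tree keeps a 0 here, which is fine for a sketch)
oob_pred = preds / np.maximum(counts, 1)
np.sqrt(((oob_pred - y)**2).mean())              # an out-of-bag estimate of RMSE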
len(m.oob_prediction_)
404710
# use training y
r_mse(m.oob_prediction_, y)
0.210661
OOB error is lower than validation set error, which means that something else is causing that error, in addition to normal generalization error. I’m not sure what that means but the text says it’s looked into later in this chapter.
Model Interpretation
- How confident are we in our predictions using a particular row of data?
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Which columns are the strongest predictors, which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary as we vary these columns?
Tree Variance for Prediction Confidence
The standard deviation of predictions across the trees tells us the relative confidence of predictions. We would want to be more cautious of using the results for rows where trees give very different results (higher standard deviations), compared to cases where they are more consistent (lower standard deviations).
We have a prediction for every tree and every auction in the validation set (40 trees and 7,988 auctions):
preds = np.stack([t.predict(valid_xs) for t in m.estimators_]);
preds.shape
(40, 7988)
Get the standard deviation of the predictions over all the trees for each auction:
preds_std = preds.std(0)
preds_std[:5]
array([0.21168835, 0.09996709, 0.0911939 , 0.25939701, 0.08520345])
len(preds_std)
7988
The confidence in the predictions varies widely. For some auctions there is low std meaning the trees agree. For others it’s higher, meaning the trees don’t agree.
Feature Importance
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols': df.columns,
                         'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:10]
cols | imp | |
---|---|---|
57 | YearMade | 0.180981 |
6 | ProductSize | 0.116321 |
30 | Coupler_System | 0.089648 |
7 | fiProductClassDesc | 0.074037 |
32 | Hydraulics_Flow | 0.064145 |
54 | ModelID | 0.059373 |
31 | Grouser_Tracks | 0.053432 |
65 | saleElapsed | 0.050231 |
3 | fiSecondaryDesc | 0.043258 |
1 | fiModelDesc | 0.031560 |
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30]);
The feature importance algorithm loops through each tree and then recursively explores each branch. At each branch, it looks at which feature was used for that split and how much the model improves as a result of that split. The improvement (weighted by the number of rows in that group) is added to the importance score for that feature. This is summed across all branches of all trees, and finally the scores are normalized so that they add up to 1.
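As a sanity check of that description, the same quantity can be recomputed from a single fitted tree's internals and compared with sklearn's feature_importances_ (a sketch that assumes the random forest m fitted above; the tree_ arrays are sklearn's):

import numpy as np

def mdi_importance(tree):
    # recompute mean-decrease-in-impurity importance from sklearn's tree_ arrays
    t = tree.tree_
    imp = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1: continue                   # leaf node: no split, no contribution
        w, wl, wr = (t.weighted_n_node_samples[i] for i in (node, left, right))
        gain = w*t.impurity[node] - wl*t.impurity[left] - wr*t.impurity[right]
        imp[t.feature[node]] += gain              # credit the feature used at this split
    return imp / imp.sum()                        # normalize so the importances sum to 1

one_tree = m.estimators_[0]
np.allclose(mdi_importance(one_tree), one_tree.feature_importances_)   # should print True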
Removing Low-Importance Variables
Retrain the model using the subset of columns with importance greater than 0.005
:
to_keep = fi[fi.imp>0.005].cols
len(to_keep)
20
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
(0.181078, 0.231864)
Our accuracy is about the same with fewer columns that we have to study.
len(xs.columns), len(xs_imp.columns)
(66, 20)
plot_fi(rf_feat_importance(m, xs_imp));
Removing Redundant Features
from scipy.cluster import hierarchy as hc
def cluster_columns(df, figsize=(10,6), font_size=12):
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()
cluster_columns(xs_imp)
The pairs of columns that are most similar are the ones that were merged together early, far from the “root” of the tree at the left.
The most similar pairs are found by calculating the rank correlation, which means that all the values are replaced with their rank (first, second, third, etc within the column) and then the correlation is calculated.
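A quick sketch of what rank correlation means here, on made-up arrays: rank each column, then take the ordinary (Pearson) correlation of the ranks, which is what scipy.stats.spearmanr computes.

import numpy as np, scipy.stats

a = np.array([1., 2., 3., 4., 5.])       # made-up columns: monotonically related,
b = np.array([1., 4., 9., 16., 25.])     # but not linearly

pearson_of_ranks = np.corrcoef(scipy.stats.rankdata(a), scipy.stats.rankdata(b))[0, 1]
spearman = scipy.stats.spearmanr(a, b).correlation
print(pearson_of_ranks, spearman)        # both 1.0: identical orderings give perfect rank correlation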
Let’s try removing some of these closely related features to see if the model can be simplified without impacting accuracy.
Create a function that quickly trains a random forest and returns the OOB score by using a lower max_samples
and higher min_samples_leaf
.
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
        max_samples=50_000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df, y)
    return m.oob_score_
# baseline
get_oob(xs_imp)
0.8764807857774278
Remove each of our potentially redundant variables and see what score we get:
{c: get_oob(xs_imp.drop(c, axis=1)) for c in (
    'saleYear', 'saleElapsed', 'ProductGroupDesc', 'ProductGroup',
    'fiModelDesc', 'fiBaseModel',
    'Hydraulics_Flow', 'Grouser_Tracks', 'Coupler_System')}
{'saleYear': 0.875295601149204,
'saleElapsed': 0.8717546976024381,
'ProductGroupDesc': 0.8767983331719241,
'ProductGroup': 0.8764742908741526,
'fiModelDesc': 0.8748490763639009,
'fiBaseModel': 0.8760895658282863,
'Hydraulics_Flow': 0.8770549322909539,
'Grouser_Tracks': 0.8775175679664963,
'Coupler_System': 0.8764559009225574}
Now let’s try dropping multiple variables:
to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))
0.8751984884391564
This is really not much worse than the model with all the fields so we’ll create DataFrame
s without these columns:
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
# check accuracy
m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.182723, 0.232926)
Partial Dependence
Important to understand the relationship between the two most important predictors (ProductSize
and YearMade
) and sale price.
p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);
ax = valid_xs_final['YearMade'].hist()
Partial dependence plots try to answer the question: if a row varied on nothing other than the feature in question, how would it impact the dependent variable?
How does YearMade
impact sale price, all other things being equal? We can’t just take the average sale price for each YearMade
, as it would capture the effect of how every other field also changed along with YearMade
and how that overall change affected price.
Instead we replace every single value in the YearMade
column with 1950, and then calculate the predicted sale price for every auction, and take the average over all auctions, then do the same for every single year. This isolates the effect of only YearMade
.
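The same idea can be sketched by hand before reaching for sklearn's helper (assuming the m, valid_xs_final and plt objects from this notebook):

import numpy as np

years = np.arange(1950, 2012)                   # candidate YearMade values to sweep
avg_preds = []
xs_pd = valid_xs_final.copy()
for yr in years:
    xs_pd['YearMade'] = yr                      # pretend every auction item was made in this year
    avg_preds.append(m.predict(xs_pd).mean())   # average prediction over all auctions

plt.plot(years, avg_preds);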
from sklearn.inspection import partial_dependence
fig,ax = plt.subplots(figsize=(6,4))
pdp = partial_dependence(m, valid_xs_final, ['YearMade', 'ProductSize'],
                         grid_resolution=20)
ax.plot(array([0,1,2,3,4,5,6]), pdp['average'].mean(axis=1).squeeze());

fig,ax = plt.subplots(figsize=(6,4))
ax.plot(pdp['values'][0], pdp['average'].mean(axis=2).squeeze());
After 1990, where most of the data is, there is a linear relationship in the plot (the y-axis is log(SalePrice), so this is an exponential relationship between YearMade and SalePrice). SalePrice is lowest for the last two ProductSize categories (Large and #na#). This doesn't make sense because I would expect the price to increase with ProductSize. Missing values can sometimes be useful predictors. Sometimes, they can indicate data leakage.
Data Leakage
Data leakage is the use of information in the model training process which would not be expected to be available at prediction time.
For example, if you are trying to predict successful grant applications, using information that is only filled in after an application is accepted (such as the date of processing) is data leakage, since that information would not be available at the time the application is received.
Identifying data leakage involves building a model and then:
- Check whether the accuracy of the model is too good to be true.
- Look for important predictors that don’t make sense in practice.
- Look for partial dependence plot results that don’t make sense in practice.
It’s often a good idea to build a model first and then do your data cleaning, as the model can help you identify potentially problematic data issues.
Tree Interpreter
We still have to answer the following question:
- For predicting with a particular row of data, what were the most important factors, and how did they influence the prediction?
!pip install treeinterpreter
!pip install waterfallcharts
We computed feature importance across the entire random forest by looking at the contribution of each variable to improving the model at each branch of every tree, and then adding up all of these contributions per variable.
We can do the same thing for a single row of data. Let's say we are looking at a single item at auction. The model might predict that this item will be very expensive, and we want to know why. We take that one row of data, put it through the first decision tree, and look at what split is used at each point throughout the tree. For each split, we find the increase or decrease in the prediction compared to the parent node of the tree. We do this for every tree and add up the total change in prediction by split variable.
row = valid_xs_final.iloc[:5]

from treeinterpreter import treeinterpreter
prediction, bias, contributions = treeinterpreter.predict(m, row.values)

prediction[0], bias[0], contributions[0].sum()
(array([10.03216082]), 10.104110088290454, -0.0719492660421904)
prediction is the prediction that the random forest makes. bias is the prediction based on taking the mean of the dependent variable (i.e. the model that is at the root of every tree). contributions is the total change in prediction due to each of the independent variables. The sum of contributions plus bias must equal the prediction for each row.
from waterfall_chart import plot as waterfall

waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
          rotation_value=45, formatting='{:,.3f}');
Extrapolation and Neural Networks
Random forests, like all machine learning or deep learning algorithms, don’t always generalize well to new data.
The Extrapolation Problem
Consider the simple task of making predictions from 40 data points showing a slightly noisy linear relationship:
x_lin = torch.linspace(0,20,steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin);
sklearn expects a matrix of independent variables:
xs_lin = x_lin.unsqueeze(1)
x_lin.shape, xs_lin.shape
(torch.Size([40]), torch.Size([40, 1]))
x_lin[:, None].shape
torch.Size([40, 1])
# use only the first 30 rows
m_lin = RandomForestRegressor().fit(xs_lin[:30], y_lin[:30])
Test the model on the full dataset:
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5);
What we are seeing is that a tree and a random forest can never predict values outside the range of the training data, because a tree simply predicts the average value of the rows in a leaf and a random forest just averages the predictions of a number of trees. Predictions outside the training domain will be systematically too low. More generally, random forests are not able to extrapolate outside the types of data they have seen, which is why we need to make sure our validation set does not contain out-of-domain data.
Finding Out-of-Domain Data
Use a random forest to predict whether a row is in the validation set or the training set.
df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]
 | cols | imp |
---|---|---|
6 | saleElapsed | 0.874998 |
9 | SalesID | 0.088186 |
12 | MachineID | 0.032512 |
0 | YearMade | 0.000888 |
5 | ModelID | 0.000784 |
11 | Enclosure | 0.000594 |
Three columns differ significantly between the training and validation sets: saleElapsed, SalesID, and MachineID. It makes sense that saleElapsed is different since it directly encodes the date (number of days between the start of the dataset and each row), and the other two likely increment over time.
# baseline
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID', 'saleElapsed', 'MachineID'):
    m = rf(xs_final.drop(c, axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c, axis=1), valid_y))
orig 0.232669
SalesID 0.230199
saleElapsed 0.235264
MachineID 0.231392
We should be able to remove SalesID and MachineID without losing accuracy:
time_vars = ['SalesID', 'MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)

m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)
0.229906
Removing these variables has improved the accuracy and will make the model more resilient over time.
xs['saleYear'].hist();
Try just using the most recent few years of the data:
filt = xs['saleYear'] > 2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]

m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)
(0.177093, 0.229919)
Using a Neural Network
Replicate the steps to set up the TabularPandas object:
df_nn = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
df_nn['ProductSize'].cat.set_categories(sizes, ordered=True, inplace=True)
df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, 'saledate')
FutureWarning: The `inplace` parameter in pandas.Categorical.set_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]
A great way to handle categorical variables in a neural net is with embeddings. Embedding sizes larger than 10,000 should generally be used only after you've tested whether there are better ways to group the variable, so use 9,000 as max_card (columns with cardinality below max_card are treated as categorical, and fastai creates an embedding for each of them):
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)
We don’t want to treat saleElapsed
as categorical since we want to predict auction sale prices in the future and a categorical variable cannot extrapolate outside the range of values that it has seen:
'saleElapsed' in cont_nn, 'saleElapsed' in cat_nn
(True, False)
# look at cardinality
df_nn_final[cat_nn].nunique()
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
Hydraulics_Flow 3
ModelID 5281
fiSecondaryDesc 177
fiModelDesc 5059
Hydraulics 12
Enclosure 6
fiModelDescriptor 140
ProductGroup 6
Drive_System 4
dtype: int64
The earlier analysis of redundant features relied on similar variables being sorted in the same order (they need to have similarly named levels). Here we see that ModelID and fiModelDesc both have 5,000+ levels, meaning each would need 5,000+ rows in its embedding matrix. Let's see the impact of removing one of these model columns on the random forest:
xs_filt2 = xs_filt.drop('fiModelDesc', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDesc', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
(0.183026, 0.233514)
xs_filt2 = xs_filt.drop('ModelID', axis=1)
valid_xs_time2 = valid_xs_time.drop('ModelID', axis=1)
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)
(0.18152, 0.232451)
Dropping ModelID has the smaller effect on accuracy, so we'll drop that variable.
cat_nn.remove('ModelID')
df_nn_final[cat_nn].nunique()
YearMade 73
ProductSize 6
Coupler_System 2
fiProductClassDesc 74
Hydraulics_Flow 3
fiSecondaryDesc 177
fiModelDesc 5059
Hydraulics 12
Enclosure 6
fiModelDescriptor 140
ProductGroup 6
Drive_System 4
dtype: int64
A neural net cares about normalization whereas a random forest doesn’t:
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)
Tabular models and data don’t generally require much GPU RAM so we can use larger batch sizes:
dls = to_nn.dataloaders(1024)
Set y_range for regression models:
y = to_nn.train.y
y.min(), y.max()
(8.465899, 11.863583)
from fastai.tabular.all import *

learn = tabular_learner(dls, y_range=(8,12), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
SuggestedLRs(valley=0.0002754228771664202)
learn.fit_one_cycle(5, 1e-2)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.062091 | 0.074148 | 00:08 |
1 | 0.054561 | 0.066272 | 00:04 |
2 | 0.048428 | 0.053494 | 00:06 |
3 | 0.043653 | 0.051082 | 00:04 |
4 | 0.040581 | 0.051459 | 00:05 |
preds, targs = learn.get_preds()
r_mse(preds, targs)
0.226845
The neural net is more accurate than the random forest.
tabular_learner??
TabularModel??
Ensembling
It would be reasonable to expect that the kinds of errors that each model makes (random forest and neural network) would be quite different. We might expect that the average of their predictions would be better than either one’s individual predictions.
rf_preds = m.predict(valid_xs_time)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
0.223161
This result is better than each individual model.
Boosting
bagging = combining many models (each trained on a different data subset) by averaging their predictions.
boosting = adding models instead of averaging them:
- Train a small model that underfits your dataset.
- Calculate the predictions in the training set for this model.
- Subtract the predictions from the targets; these are called the residuals and represent the error for each point in the training set.
- Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
- Continue doing this until you reach a stopping criterion, such as maximum number of trees, or you observe your validation set error getting worse.
Each new tree attempts to fit the error of all the previous trees combined, so the residuals get smaller and smaller each time. To make predictions, calculate the predictions from each tree and then add them all together. The most common names for these models are Gradient Boosting Machines (GBMs) and Gradient Boosted Decision Trees (GBDTs). XGBoost is the most popular library for implementing them.
Using more trees in a random forest does not lead to overfitting, because each tree is independent of the others. In a boosted ensemble, the more trees you have, the better the training error becomes and eventually you will see overfitting on the validation set. Unlike random forests, gradient boosted trees are extremely sensitive to the choices of hyperparameters.
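To make the procedure concrete, here is a minimal boosting sketch. It assumes xs_final and y are the training data from earlier and fits a shallow sklearn tree to the current residuals on each round; the round count and tree depth are arbitrary illustrative choices, not the book's code:

from sklearn.tree import DecisionTreeRegressor

def boosted_trees(xs, y, n_rounds=20):
    "Fit a sequence of small trees, each predicting the previous residuals."
    trees, residual = [], y.copy()
    for _ in range(n_rounds):
        t = DecisionTreeRegressor(max_depth=3).fit(xs, residual)
        residual = residual - t.predict(xs)   # what's left for the next tree to explain
        trees.append(t)
    return trees

def boosted_predict(trees, xs):
    "Predictions are the sum of every tree's output, not the average."
    return sum(t.predict(xs) for t in trees)

# e.g. trees = boosted_trees(xs_final, y); preds = boosted_predict(trees, valid_xs_final)

In practice you would use a tuned library such as XGBoost rather than this loop, but the residual-fitting structure is the same.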
Combining Embeddings with Other Methods
Using the embeddings obtained from a trained neural network as input features, instead of the raw categorical columns, considerably boosts the performance of all of the machine learning methods tested.
At inference time, you can just use an embedding along with a small decision tree ensemble.
Once a set of embeddings are learned for a column for a particular task, they could be stored in a central place and reused across multiple models.
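A rough sketch of what that swap could look like for the tabular learner trained above. It assumes the processed, integer-coded frame (e.g. to_nn.train.xs) and that learn.model.embeds holds one embedding layer per categorical column in the same order as cat_nn; the helper name and the derived column names are my own, not the book's:

def add_embeddings(learn, xs, cat_names):
    "Replace each categorical code column with its learned embedding vectors."
    xs_emb = xs.drop(columns=cat_names)
    for i, name in enumerate(cat_names):
        # weight matrix of the i-th embedding layer, one row per category level
        emb_weights = learn.model.embeds[i].weight.detach().cpu().numpy()
        vecs = emb_weights[xs[name].values]          # look up one embedding row per sample
        for j in range(vecs.shape[1]):
            xs_emb[f'{name}_emb{j}'] = vecs[:, j]
    return xs_emb

# e.g. xs_with_emb = add_embeddings(learn, to_nn.train.xs, cat_nn), then feed it to rf()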
Conclusion
- Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require little preprocessing. They are fast to train, and should not overfit if you have enough trees. But they can be a little less accurate especially if extrapolation is required, such as predicting future time periods.
- Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of hyperparameters. They can overfit, but they are often a little more accurate than random forests.
- Neural networks take the longest time to train and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.
Start your analysis with a random forest. Then use that model for feature selection and partial dependence analysis, to get a better understanding of your data. Then try neural nets and GBMs and use them if they give significantly better results on your validation set in a reasonable amount of time. If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data and see if that helps your decision trees learn better.
Questionnaire
1. What is a continuous variable?
A variable that can take on any value within a range.
2. What is a categorical variable?
A variable that can only take on discrete values or levels within a fixed set.
3. Provide two of the words that are used for the possible values of a categorical variable.
Levels or categories.
4. What is a dense layer?
A linear layer.
5. How do entity embeddings reduce memory usage and speed up neural networks?
They are dense compared to one-hot-encoded vectors which are sparse.
6. What kinds of datasets are entity embeddings especially useful for?
Datasets with categorical variables with high cardinality.
7. What are the two main families of machine learning algorithms?
- Ensembles of decision trees for structured data.
- Multilayered neural networks learned with SGD for unstructured data.
8. Why do some categorical columns need a special ordering in their classes? How do you do this in Pandas?
Ordinal columns have a natural order (like size) and can be specified using the Series.cat.set_categories
Pandas method.
9. Summarize what a decision tree algorithm does.
A decision tree algorithm loops through each column and for each column loops through all possible splits in the data, and calculates the objective (such as average SalePrice
) for each group in the split. It then splits the data with the best split, meaning the split that has the highest average objective. Within each split, it continues to split the data and calculate the next best split until some stopping criteria is met.
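A minimal sketch of that scoring step for a single column, using the weighted standard deviation of the dependent variable as the "quality of split" measure; the helper names are hypothetical, not the book's code, and NaN edge cases are ignored:

def split_score(col, y, threshold):
    "Score a binary split: weighted standard deviation of the target on each side (lower is better)."
    lhs = col <= threshold
    def side(mask): return y[mask].std() * mask.sum()
    return (side(lhs) + side(~lhs)) / len(y)

def best_split(df, y, col_name):
    "Try every unique value of one column as a threshold and keep the best (score, threshold) pair."
    vals = sorted(df[col_name].dropna().unique())
    return min(((split_score(df[col_name], y, v), v) for v in vals[:-1]), default=None)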
10. Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be useful in a model?
Dates have many meanings such as day of the week, the month it’s in and whether it’s a holiday. You can preprocess a date variable with fastai’s add_datepart
function.
11. Should you pick a random validation set in the bulldozer competition? If no, what kind of validation set should you pick?
No. Since we want to predict auction price for future dates, the validation set should have date values that come after the training set dates.
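A minimal sketch of building such a split, assuming df is the raw training DataFrame with a parsed saledate column; the cutoff date is an arbitrary illustrative choice:

import numpy as np, pandas as pd

cond = df.saledate < pd.Timestamp('2011-10-01')   # everything before the cutoff is training data
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))        # the (train, valid) index lists TabularPandas expects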
12. What is pickle and what is it useful for?
Pickle is a method to save (serialize) Python objects.
13. How are mse
, samples
, and values
calculated in the decision tree drawn in this chapter?
The mse
is calculated between the average sale price and the individual sale price in the group. samples
are the number of rows in the dataset that correspond to the given split that resulted in the group. values
is the average sale price in the group.
14. How do we deal with outliers before building a decision tree?
Decision trees are resilient to data issues, but if you want to treat outliers you can do so by setting them to a more reasonable value (as we did by setting any YearMade value less than 1900 to 1950).
15. How do we handle categorical variables in a decision tree?
We don’t have to handle them in anyway other than encoding them as integers. Research shows that one-hot-encoding categorical variables doesn’t improve model performance.
16. What is bagging?
Averaging predictions across multiple models that are trained on different subsets of the dataset. Since different models make different errors, the errors tend to average out to zero.
17. What is the difference between max_samples
and max_features
when creating a random forest?
max_samples
is the maximum number of rows to sample for each decision tree.
max_features
defines how many columns to sample at each split point.
18. If you increase n_estimators
to a very high value, can that lead to overfitting? Why or why not?
No, because each decision tree is trained on a different subset of the data, independently of the others.
19. In the section “Creating a Random Forest”, after Figure 9-7, why did preds.mean(0)
give the same result as our random forest?
Since the random forest does the same thing: take the average prediction across all trees.
20. What is out-of-bag error?
The error of a tree’s prediction on rows from the dataset that it has not been trained on.
21. List the reasons that a model’s validation set error might be worse than the OOB error. How could you test your hypotheses?
It could be that the model does not generalize well to data other than the training set. It could also mean that the distribution of the validation set is different from the training set (which can be tested by training a random forest on is_valid
–1/0 whether the data is validation/train and seeing which features have high importance).
22. Explain why random forests are well suited to answering each of the following questions:
- How confident are we in our predictions using a particular row of data?
- This is answered by calculating the standard deviation of the trees’ predictions for each row in the validation set.
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Using treeinterpreter we can see how much each column contributed to the total change in prediction.
- Which columns are the strongest predictors?
- This is answered by calculating the feature importance, which is the (weighted by number of rows in each branch group) improvement to the model made by each feature.
- How do predictions vary as we vary these columns?
- We can look at partial dependence plots to answer this question.
23. What’s the purpose of removing unimportant variables?
To simplify the model so that we can understand and study how each feature influences the predictions.
24. What’s a good type of plot for showing tree interpreter results?
Waterfall charts.
25. What is the extrapolation problem?
Random forests and trees can never predict values outside the range of the training data. Predictions in this case will systematically be too low.
26. How can you tell if your test or validation set is distributed in a different way than your training set?
By training a random forest where the dependent variable is is_valid, a field that is 1 for rows in the validation set and 0 for rows in the training set, and then calculating feature importance. Features with the highest importance will differ in value between the training and validation sets.
27. Why do we make saleElapsed a continuous variable, even though it has fewer than 9,000 distinct values?
saleElapsed is the number of days since the start of the dataset that the auction took place, so it represents the date/time of the auction. Since we want to extrapolate auction prices to future dates, we want to treat it as something that can be extrapolated (a continuous variable) as opposed to something that can't be extrapolated (a categorical variable).
28. What is boosting?
Boosting is when you train a model to underfit the dataset and train subsequent models on residuals (difference between targets and predictions) and then add together the predictions from the models.
29. How could we use embeddings with a random forest? Would we expect this to help?
Research shows that using neural net trained categorical embeddings as inputs (instead of categorical columns) improves the accuracy of random forests.
30. Why might we not always use a neural net for tabular modeling?
Neural nets take the longest time to train (compared to random forests and gradient boosting), require preprocessing and are sensitive to hyperparameters.
Lesson 6: Random Forests
- Further Research
- Pick a competition on Kaggle with tabular data (current or past) and try to adapt the techniques seen in this chapter to get the best possible results. Compare your results to the private leaderboard.
- Implement the decision tree algorithm in this chapter from scratch yourself, and try it on the dataset you used in the first exercise.
- Use the embeddings from the neural net in this chapter in a random forest, and see if you can improve on the random forest results we saw.
- Explain what each line of the source of
TabularModel
does (with the exception ofBatchNorm1d
andDropout
layers).
Video Notes
How random forests really work
- We created binary splits in the Titanic dataset for continuous and categorical variables.
- We came up with a score of how good a job did that split do of grouping the survival characteristics into two groups where nearly all of one survived and all of one didn’t survive. Small (weighted) standard deviation in each group.
- What if we split Males and Females into two other groups each?
- Age <=6 is the biggest predictor of whether males survive.
- Pclass <= 2 is the biggest predictor of whether females survive.
- We hope to get the strongest prediction about survival in the leaf nodes of our decision tree.
- We use sklearn’s
DecisionTreeClassifier
. - scikit-learn focuses on classical machine learning algorithms.
- Decision trees as exploratory data analysis: allows us to get a quick picture of what are the key driving variables in this dataset and how much do they predict what was happening in the data.
- gini is another way of measuring how good a split is: how likely is it that if you go into that sample and grab one item and then go in again and grab another item—how likely is it that you’re going to grab the same item each time? If the entire leaf node is just people who survived or just people who didn’t survive, the probability would be 1.0. If it was an exactly equal mix the probability would be 0.5.
- OneR MAE was 0.215, decision tree with four leaf nodes’ MAE was 0.224. Reflects the fact that we have a small validation set.
- Decision tree with minimum samples of 50 per node has MAE of 0.183.
- One of the biggest mistakes is not to submit to the leaderboard on Kaggle for a competition. You should try and submit something to the leaderboard everyday.
- We don’t need to do as much preprocessing for decision trees. All the decision tree cares about is the ordering of the data.
- For tabular data, always start with a decision tree approach.
- Use dummy variables for <= 4 levels, numeric codes otherwise.
- There are limitations to how accurate a decision tree can be.
- Leo Breiman came up with the idea of bagging. Decision trees on average will predict the average, they are not biased. Build lots of unbiased, better-than-nothing, uncorrelated models, and average their predictions, ending up with errors on either side of the correct prediction whose average is 0. So it will be better than any individual model.
- We can get many trees who use some random proportion of rows and columns (called a random forest), make predictions with each of them, and then average the predictions.
- At each split of each decision tree in the random forest you can calculate how much the prediction improved (e.g., how much the gini value was reduced, weighted by sample size) by splitting on the given column. This gives you the feature importances—how often did the trees pick the feature, and how much did it improve the gini when picked as a split?
- Create a feature importance plot first with a tabular dataset to find the most important columns.
- Rule of thumb: use a maximum of 100 trees.
- If you don’t have much data you can get away with not having a validation set since for each tree in the random forest you can pick the rows not used in that tree as the validation set. The error across all rows not used in training a tree is called out-of-bag (OOB) error.
- Five important insights random forests can provide:
- How confident are we in our predictions using a particular row of data?
- For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?
- Which columns are the strongest predictors and which can we ignore?
- Which columns are effectively redundant with each other, for purposes of prediction?
- How do predictions vary, as we vary these columns?
- You can do a partial dependence plot for any machine learning model.
- Take the dataset and leave it exactly as it is except for the column you want to understand partial dependence on (such as
YearMade
). Set the column in question to its first value, then predict the dependent variable for every row and average it. Repeat for each value of the column in question.
- Take the dataset and leave it exactly as it is except for the column you want to understand partial dependence on (such as
- You can do feature importance for one row to understand why the random forest made the prediction.
- If you start deleting trees then you are no longer having an unbiased prediction of the dependent variable. You are biasing it by making a choice. Even the bad trees will be improving the quality of overall average.
- Can you overfit a random forest? Basically no. Adding more trees will make it more accurate, but accuracy asymptotes. If you don't have enough trees and you let the trees grow very deep, that could overfit, so you have to make sure you have enough trees.
- Giving a random forest lots of randomly generated columns with fake data does not affect its performance.
- Gradient boosting machine: fit a very small tree, get the residual (the difference between the prediction and the actual), then create another very small tree which attempts to predict the residual and so forth. Each one is predicting the residual from all the previous ones. Then to calculate the prediction you take the sum of all of the trees’ predictions, because each one has predicted the difference between the actual and all of the previous trees. More accurate than random forests, but you can absolutely overfit, so it’s not the first go-to model.
First Steps: Road to the top, Part 1
- What does it look like to pick a Kaggle competition and just do like the normal, sensible, mechanical steps you would do for any computer vision model.
- Paddy Disease Classification: recognizing diseases in rice paddies.
- The library
fastkaggle
makes it easier to setup Kaggle competition stuff. Usesetup_comp
to grab data. - You can’t hide from the truth in a Kaggle competition.
- Focus on two things:
- Creating an effective validation set.
- Iterating rapidly to find changes which improve results on the validation set.
- What can I do that’s going to train in a minute or so and will quickly give me a sense of what I can try and what’s going to work. Try 80 things.
- To be successful in Kaggle competitions and machine learning in general you have to do not just one thing well but everything well.
- Only use random seed when you are sharing a notebook, otherwise you want to see how much things change each time so you can tell if the modifications you are making are improving it, making it worse, or is it just random variation?
- PIL images size is columns x rows. PyTorch size is rows x columns.
- The amount of time it takes to decode a JPEG is quite significant. Use
fastcore.parallel
. - Most common way to do things is to either squish or crop every image to be a square.
- Models are a great way to understand your data. Refer to the notebook The best vision models for fine tuning—trained on PETS (fine-tuning to similar things they are pretrained on) and Planet (fine-tuning to things different than what is pretrained) datasets which are very different datasets. Measured how much memory it used, how accurate was it and how long did it take to fit.
- What matters about a model, which is just a function, is its inputs, its outputs, how accurate it is, and how fast it is.
lr_find
will train one batch a time and track the loss at increasing learning rates (starting very small). LR recommendations are conservative.- We submit as soon as we can. We want a dataloader that is exactly like what we made for training but pointed at the test set. Use
dls.test_dl
method. Pass it test dataset files. A test dataloader does not have any labels. with_decoded
inlearn.get_preds
tells you the index of the most probably class. Map them to strings indls.vocab
.- Make everything fast and easy in the iteration including submitting to Kaggle.
- If you can create models that predict things well and you can communicate your results in a way that is clear and compelling, you’re a pretty good data scientist.
- Be highly intentional like a scientist; have hypotheses that you test carefully and come out with conclusions that you implement.
- Test out your hypotheses over a couple models from each of the main families (e.g., does squish or crop work better with different models).
- Random forests will give you good results, GBMs for better results (would run a hyperparameters sweep).
Small Models: Road to the top, Part 2
- Initial training took a minute on home computer, took 4 minutes per epoch on Kaggle. Because they only have two virtual CPUs. You want at least 8 physical CPUs per GPU. It was spending all its time reading data.
- Step 1 was making Kaggle implementation faster—
resize_images
. It was four times faster with no loss of accuracy. - Kaggle GPU was hardly being used so moved from
resnet26d
toconvnext_small_in22k
which was over twice as good. - resnets are the fastest, use convnext if you’re not sure what to use.
- Use crop instead of squish.
- Get everything for training into a single function that returns a
learner
- Padding is the only way of preprocessing images that doesn't distort (squish) or lose data (crop), with the downside of having empty black pixels.
- Test time augmentation (TTA). Get predictions for all augmented images and take the average. Like a mini-bagging approach.
learn.tta
does this for you. TTA usually gives a better result. TTA uses the same data augmentation that you used during training. - Your images don’t have to be square. They just have to be the same size.
- idxs has the vocab index for each test set image, vocab = np.array(learn.dls.vocab), and results = pd.Series(vocab[idxs], name='idxs')
will map index to vocab item.- Generally speaking in Kaggle competitions, top 25% is a solid, competent, very reasonable level. It’s not easy, you gotta know what you’re doing.
- Batch things that are similar aspect ratios together and use the median rectangles for those and have had good results but honestly 99.99% of people chuck everything into a square.
- fastai uses reflection padding as a default, also provide copy padding, neither really help. Computer wants to know where the image ends.
Lesson 7: Collaborative Filtering
Video Notes
- Digging into what’s inside of a neural net in this lesson.
- A neural net has a sandwich of fully connected layers and ReLUs. There’s a lot of tweaks that we can do. Most of the tweaks we care about are tweaking the very first or the very last layer. Over the next couple of weeks we’ll look at the tweaks we can do inside as well.
Paddy Doctor Competition
- Created a ConvNeXt model. Did a few types of preprocessing. Added Test Time Augmentation. Scaled that up to larger images and rectangular images.
- Larger models have more parameters which means they can learn more tricky features, and they ought to be more accurate. They also take up more memory on the GPU when calculating gradients. The GPU is not as clever as the CPU at sticking stuff it doesn't need right now onto virtual memory on the hard drive. When it runs out of memory, it runs out of memory. It also doesn't shuffle things around to try and find memory, it just allocates blocks of memory that stay allocated until you remove them.
- If you get a CUDA Out-Of-Memory error, restart your notebook. Tricky to recover from otherwise.
- Will I be able to train on 16GB? One way to quickly do that is train only on one label and see how much memory it used.
- Call python’s garbage collection
gc.collect()
and PyTorch’storch.cuda.empty_cache()
will get GPU back to a clean state. - If you run out of memory—use
GradientAccumulation
. Using a small batch size (bs = 16
instead ofbs = 64
) will solve the memory problem but will change the dynamics of the training, since the smaller your batch size the more volatility there is, so now your learning rates need to change. You don't want to mess around trying to find different hyperparameters for every batch size for every architecture. GradientAccumulation(bs)
makes the training behave as if the batch size isbs
even when it’s not. - Consider the training loop:
for x,y in dl:
    calc_loss(coeffs, x, y).backward()
    coeffs.data.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()
- Note that you don’t need
with torch.no_grad()
since you are usingcoeffs.data
. - Here’s a variation of that loop with
GradientAccumulation
added:
count = 0                                  # track count of items seen since last weight update
for x,y in dl:
    count += len(x)                        # update count based on this minibatch size
    calc_loss(coeffs, x, y).backward()
    if count >= 64:                        # count has reached the accumulation target, so do a weight update
        coeffs.data.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
        count = 0                          # reset count
- In PyTorch if you call
backward()
without zeroing the gradients then it adds new gradients to old gradients. - Doing two half-size batches without zeroing out between them is adding up the gradients.
- You don’t need to buy a bigger GPU to train bigger models. Just use
GradientAccumulation
. GradientAccumulation
is numerically identical for some architectures. Other architectures use batch normalization (which keeps track of the moving average of standard deviation and averages and does it in a mathematically slightly incorrect way). UsingGradientAccumulation
with batch normalization can introduce more volatility. Which is not necessarily a bad thing but it’s not numerically identical so you won’t get the same results.lr_find
uses yourDataLoaders
batch size.- Pick the largest batch size that you can (you’re getting more parallel processing). Generally a good idea for it to be a multiple of 8 for performance reasons.
- In fastai use
GradientAccumulation
by passing it as acbs
(callback)
cbs = GradientAccumulation(<effective batch size>)
learn = vision_learner(dls, arch, metrics, cbs=cbs)
- For bigger models you’ll get to a linear scaling with
GradientAccumulation
. Models have a bit of an overhead. - Nearly all transformer models have a fixed input size.
- Use different training sets (i.e. don’t set
seed
in the validation splitter) when you are going to ensemble. - A popular thing is to do k-fold cross validation. 5-fold CV does something similar to what Jeremy did with training on a random 80% split. In theory that could be slightly better because you’re guaranteed that every row appears four times. Also has the benefit that you can average those five validation sets that have no overlap. Jeremy usually doesn’t bother because this way he can add or remove models very easily.
- NVIDIA consumer cards (RTX) are just as good as enterprise cards. NVIDIA will not allow you to use an RTX card in a data center. Which is why cloud computing is more expensive.
- teacher-student models and model distillation—there are ways to make inference faster by training small models that work the same way as large models.
- Build a model to predict both disease and variety of rice. The first thing you need is a
DataLoaders
that have two dependent variables:
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock, CategoryBlock),
    n_inp=1,                 # otherwise it doesn't know which of the 3 is ind/dep var
    get_items=get_image_files,
    get_y=[parent_label, get_variety],
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75)
).dataloaders(path)
- Jeremy first created a
DataBlock
that did exactly the same thing as the single-dependent-variable disease-classifier and then once he got that to work, expanded it to two dependent variables. - In pandas you can set one column to be the index:
df = pd.read_csv(path/'train.csv', index_col='image_id')
So that you can then use df.loc['100330.jpg', 'variety']
to get the variety
column for a given image_id
. You can then wrap this into a function to use for get_y
:
def get_variety(p): return df.loc[p.name, 'variety']
Where p
is a Path
object.
- How do we get a model that predicts two things? We never had a model that predicted one thing, we had a model that predicts 10 things (probabilities for 10 disease classes). We want a model that now predicts 20 things.
- fastai will pass to metrics and loss function three things: the input and two dependent variables. Can’t just use
error_rate
as metric since that takes only two inputs. Instead have to create a custom metric that takes three inputs and returns the error rate for disease-only (same thing for loss):
def disease_err(inp, disease, variety): return error_rate(inp, disease)
def disease_loss(inp, disease, variety): return F.cross_entropy(inp, disease)
- The stuff in the middle of the model, you're not going to think about that much, but the stuff at the ends you think about a lot.
- Cross Entropy Loss example: assume you have a mini imagenet with 5 classes (cat, dog, plane, fish, building):
 | output | exp | softmax | actuals | index |
---|---|---|---|---|---|
cat | -4.89 | 0.01 | 0.00 | 0 | 1 |
dog | 2.60 | 13.43 | 0.87 | 1 | 1 |
plane | 0.59 | 1.81 | 0.12 | 0 | 2 |
fish | -2.07 | 0.13 | 0.01 | 0 | 3 |
building | -4.57 | 0.01 | 0.00 | 0 | 4 |
- output is the output from the model (5 values for 5 classes). They’re not probabilities yet, they’re just 5 numbers. We want to convert these into probabilities.
Softmax: \[\frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]
- We’re going to go through each of the categories (1 to K = 5). Take \(e\) to the power of the output (\(z\)). Sum them all together. That’s the denominator. The numerator is \(e\) to the power of the thing that we care about (each row). The sum of these fractions is 1. Now we have things that are probabilities: numbers that are between 0 and 1 and add up to 1.0. Since we did \(e\) to the power of the output, the bigger outputs will be pushed up closer to 1.0. We’re making the model really try to pick one thing. There’s no way for it to predict anything other than the categories we are giving it. We are forcing it to pick one. You can have the probabilities add up to more than one (more than one thing being true) or less than one (no things being true).
- The first part of what
nn.CrossEntropy
does it to calculate the softmax. It’s actually the log of the softmax. - Now that we have the 5 probabilities, the next step is the actual cross-entropy calculation:
 | softmax | actuals | x-entropy |
---|---|---|---|
cat | 0.00 | 0 | 1 |
dog | 0.87 | 1 | 1 |
plane | 0.12 | 0 | 2 |
fish | 0.01 | 0 | 3 |
building | 0.00 | 0 | 4 |
- The actuals are one-hot encoded (1 for the thing that is True and 0 everywhere else).
- We would expect a smaller loss where the softmax is high if the actual is high. Formula for cross-entropy:
\[-\sum_{j=1}^M y_j\log(p(y_j))\]
- Where \(y_j\) is an indicator variable and \(p(y_j)\) is the predicted probability (the softmax column). Cross-entropy is then -log(softmax) for the true class. For the four classes whose actual is 0, the term contributes 0. The equation is simply finding the probability for the class whose indicator is 1 and taking its log.
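Using the same numbers again, this checks that taking -log of the softmax value for the true class (dog, index 1) matches what F.cross_entropy computes; the [None] just adds the batch dimension it expects:

import torch
import torch.nn.functional as F

output = torch.tensor([-4.89, 2.60, 0.59, -2.07, -4.57])  # cat, dog, plane, fish, building
target = torch.tensor(1)                                   # the actual class is dog

manual = -torch.log_softmax(output, dim=0)[target]         # -log(softmax) of the true class
builtin = F.cross_entropy(output[None], target[None])      # what nn.CrossEntropyLoss / F.cross_entropy computes
print(manual, builtin)                                     # both ~0.135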
- Here’s the equation for binary cross-entropy:
\[-\sum_{i=1}^N y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i))\]
- Where \(y_i\) is the label, and \(p(y_i)\) is the probability of the positive class. \(y_i=1\) if it is a cat, \(y_i=0\) if it is not a cat.
- PyTorch loss functions have two versions:
nn
Class which you can instantiate passing in various tweaks and theF
function that doesn’t allow these tweaks. - When you have multiple targets you can’t rely on fastai to know what loss function to use so you have to pass your custom loss function to the
loss_func
parameter in the learner. Same for metrics. Also, fastai no longer knows how many activations to create because there is more than one target so you have to pass a value ton_out
which is the number of targets (the size of the last matrix). - For two-target situation, we have to set
n_out
to20
when creating the learner since 10 of those targets are for disease and 10 are for variety of rice. How does the model know what it’s predicting? The answer is: with the loss function—you’re going to have to tell it.inp
is going to have 20 columns (sincen_out
is 20) so we’re just going to have to decide that the first 10 columns correspond to the disease predictions.
def disease_loss(inp, disease, variety): return F.cross_entropy(inp[:,:10], disease)
- For variety, we use the second ten columns:
def variety_loss(inp, disease, variety): return F.cross_entropy(inp[:,10:], variety)
- The overall loss function is the sum of those two things:
def combine_loss(inp,disease,variety): return disease_loss(inp,disease,variety)+variety_loss(inp,disease,variety)
As the model trains, this loss function will be minimized when the first ten columns are doing a good job at predicting disease probabilities and the second ten columns are doing a good job at predicting variety probabilities. Therefore the gradients will point in the appropriate direction, the coefficients will get better and better at using those columns for those purposes.
- Do the same for
error_rate
:
def disease_err(inp,disease,variety): return error_rate(inp[:,:10],disease)
def variety_err(inp,disease,variety): return error_rate(inp[:,10:],variety)
err_metrics = (disease_err, variety_err)
- the
Learner
looks like:
learn = vision_learner(dls, arch, loss_func=combine_loss, metrics=err_metrics, n_out=20)
It was slightly less good at predicting disease but that makes sense because we have trained it for the same number of epochs (5 in this case) but have given it more stuff to do.
If we train it for longer, this model might end up getting better at predicting disease than the single-label disease model. It turns out quite often that the kind of features that help you recognize variety of rice also help recognize disease, maybe there are certain textures, or maybe some diseases impact different varieties in different ways.
Build a model that predicts two things in the Titanic dataset.
Look at the inputs and outputs of the multi-target part 4 notebook.
Collaborative Filtering Deep Dive
- This kind of data is very common:
user | movie | rating |
---|---|---|
196 | 242 | 3 |
186 | 302 | 3 |
22 | 377 | 1 |
244 | 51 | 2 |
166 | 346 | 1 |
Anytime you have a user and product you’ll have this kind of data. What happens when the rating is blank? How do you fill it in? To figure this out, ideally we’d like to know for each movide: what kind of movie is it? What are the features of it? If we had three categories: science-fiction, action and old movies, then Last Skywalker would be represented by the following (where each value ranges from -1 to 1):
= np.array([0.98, 0.9, -0.9]) last_skywalker
It’s very science-fictiony, very action-y and very not old.
A user who liked modern sci-fi could be represented by:
= np.array([0.9, 0.8, -0.6]) user1
To calculate the match between last_skywalker
and user1
we can multiply the values and sum:
(user1 * last_skywalker).sum() # = 2.142
- On the other hand, the movie Casablanca, not science-fiction, not really very action, and very much an old classic:
casablanca = np.array([-0.99, -0.3, 0.8])
- Matching it with the user:
(user1 * casablanca).sum() # -1.611
- Multiplying the corresponding elements of two vectors and adding them up is called dot product. The above is a dot product of the users preferences and a type of movie. The problem is we weren’t given this information about users and movies. What we can do is create things called Latent Factors: I don’t know what things about movies matter to people, but there’s probably something, and let’s just try using SGD to find them. We can do it in Microsoft Excel!
- In Excel we create 5 latent factors (rows) of random numbers for each movieId and userId. We don’t know what these represent but they represent something. Only quirk is that if the actual rating is blank we’re going to set the dot product to 0 by default.
- The matrix product of a row and a column is the same thing as a dot product. These dot products are everybody’s predicted ratings for movies. They are terrible predictions since the latent factors are just random numbers, but they are predictions nonetheless.
- When we have predictions using random numbers, we know how to make them better: stochastic gradient descent. To do that we need a loss function: RMSE = square root of sum of x minus y squared divided by the count (in Excel: =SQRT(SUMXMY2()/COUNT()))
- Excel solver: minimize cell with loss by changing userId and movieId latent factors. In Jeremy’s workbook: starts at 2.81 and ends at 0.42. In my workbook: starts at 2.92 and ends up at 0.43 after Solver is run.
- The cosine of the angle between vectors is the same as the normalized dot product.
- Using embeddings: replicating what we’ll have in Python which is a table with rows userid, movieid and rating. We’ll get the embeddings for each userid and the embeddings for each movieid all in one row, and then use Excel function SUMPRODUCT (which is dot product) to get the prediction. This is the same as before but when we put everything next to each other we have to lookup the index of userId and movieId and then lookup the embeddings.
- For each row calculate the error squared (pred-rating)^2 and take the square root of the average of error squareds to get the rmse, which is 2.71 in my case (Jeremy used the same random initial numbers for the dot product tab and the movielens_emb tab).
- Running solver: my rmse goes from 2.71 to 0.443 which is about the same as before (with different randomly initiated embeddings).
- What is an embedding? It’s just looking something up in an array.
- How do we do this in PyTorch? We’re going to need
DataLoaders
.
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()
outputs:
movie | title | |
---|---|---|
0 | 1 | Toy Story (1995) |
1 | 2 | GoldenEye (1995) |
2 | 3 | Four Rooms (1995) |
3 | 4 | Get Shorty (1995) |
4 | 5 | Copycat (1995) |
Merge this with ratings
so we can get the movie titles:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
outputs:
user | movie | rating | timestamp | |
---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
Merge with movies
to get title
:
ratings = ratings.merge(movies)
ratings.head()
output:
user | movie | rating | timestamp | title | |
---|---|---|---|---|---|
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
Next we create the DataLoaders
with CollabDataLoaders
which expects a user
column and item
column where item
is the service or product that the user
is rating. By default the user column should be called user
and the item column called item
.
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
outputs:
user | title | rating | |
---|---|---|---|
0 | 518 | Richard III (1995) | 3 |
1 | 546 | Star Wars (1977) | 5 |
2 | 264 | Adventures of Priscilla, Queen of the Desert, The (1994) | 4 |
3 | 201 | Kolya (1996) | 4 |
4 | 664 | Dances with Wolves (1990) | 3 |
5 | 391 | Jerry Maguire (1996) | 4 |
6 | 401 | Beauty and the Beast (1991) | 2 |
7 | 771 | Strictly Ballroom (1992) | 5 |
8 | 330 | 101 Dalmatians (1996) | 4 |
9 | 594 | One Flew Over the Cuckoo’s Nest (1975) | 4 |
Now we’re going to create the user factors and movie factors (i.e. the two embedding matrices we created in the Excel file). The number of rows of movie factors is equal to the number of movies and the number of columns will be whatever we want (however many factors we want to create). How many factors to use? Jeremy wrote down how many factors he thought was appropriate for different sized categories in Excel and fitted a function to that and that’s the function fastai uses—a mathematical function that fits Jeremy’s intuition about what works well. It’s pretty fast to train these things so you can try a few.
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
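For reference, the rule of thumb fastai uses to choose an embedding width looks like the function below (paraphrased from the fastai source as I understand it; treat the exact constants as an implementation detail that may change):

def emb_sz_rule(n_cat):
    "fastai's rule of thumb for embedding width given a column's cardinality."
    return min(600, round(1.6 * n_cat**0.56))

emb_sz_rule(n_movies)   # a column with ~1,600 levels gets an embedding of width ~100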
Now we need to lookup the index of our movie in our movie latent factor matrix (and user index for the user latent factor matrix). When we’ve learned about deep learning we’ve learned about matrix multiplication, not look-something-up-in-a-matrix. In Excel we were using OFFSET
which can actually be represented as matrix multiplication. “Find this element in this list” is the same as matrix multiplying a one-hot-encoded vector. Taking the dot product of a one-hot-encoded vector with another vector is the same as looking up that index in the vector.
one_hot_3 = one_hot(3, n_users).float()
is a vector where the 3rd element is set to 1
and everything else is set to 0
s.
If we matrix multiply that by our user_factors
transposed:
user_factors.t() @ one_hot_3
We get tensor([-1.2493, -0.3099, 1.4229, 0.0840, 0.4132])
which is the same as the vector at index 3 in the matrix:
user_factors[3]
You can think of an embedding as a computational shortcut for multiplying something by a one-hot-encoded vector. It’s like dummy variables (without having to create the dummy variables). We never have to create a one-hot-encoded vector, we can just look up an array.
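A quick standalone PyTorch check of that equivalence (a generic example, independent of the MovieLens code above):

import torch
from torch import nn

emb = nn.Embedding(10, 5)                      # 10 "users", 5 latent factors
one_hot_3 = torch.zeros(10); one_hot_3[3] = 1. # one-hot vector for index 3

by_matmul = emb.weight.t() @ one_hot_3         # matrix multiply by the one-hot vector
by_lookup = emb(torch.tensor(3))               # embedding "lookup"
print(torch.allclose(by_matmul, by_lookup))    # True: the same five numbers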
- Building a collaborative filtering model from scratch
In PyTorch, a model is a class
. Example:
class Example:
    def __init__(self, a): self.a = a
    def say(self, x): return f'Hello {self.a}, {x}.'
__init__
is called when you create an object of the given class.
ex = Example('Sylvain')
ex.say('nice to meet you')
outputs:
'Hello Sylvain, nice to meet you.'
You can put something in parenthesis after your class name, the super class, which will give you some functionality for free. A PyTorch model has to have Module
as its super class. fastai also has its own Module
class. Here’s a DotProduct
class:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
PyTorch calls a forward
method when you call a model object. This is where you put the calculation of your model. dim=1
because we are summing across the columns for each row in the batch–a prediction for each row. We can now pass the model to the Learner
:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
Then we can train:
learn.fit_one_cycle(5, 5e-3)
This runs on CPU and takes about 10 seconds per epoch (100k rows) and gets to 0.86 loss after 5 epochs. A whole lot faster than our few dozen rows in Excel. It’s not a great model. One problem is that some of the predictions are greater than 5.
When we add sigmoid, it squishes things to between 0 and 1 so the model doesn’t have to work so hard to get the predictions into the right zone. If you pass something through sigmoid and multiply it by 5, now you’re going to get something between 0 and 5. Use sigmoid_range
to do that:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
Why not use upper limit of 5? That’s because sigmoid can never hit 1. So sigmoid times 5 can never hit 5. In this case, this didn’t improve the loss.
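For reference, sigmoid_range is essentially the one-liner below (paraphrased from the fastai source as I understand it):

import torch

def sigmoid_range(x, low, high):
    "Squash x into the range (low, high) with a sigmoid."
    return torch.sigmoid(x) * (high - low) + low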
Some users just loved movies–they give everything 4s and 5s. Some people's ratings have much more range (1s, 2s, 5s). Some people give nothing a 5. At the moment we don't have any way in our formulation of this model to say this user tends to give low scores and this user tends to give high scores. That would be very easy to add. Let's add one more number to our 5 factors. Now instead of just matrix multiplying, let's add this number to it. Matrix multiplication plus user bias plus movie bias. Effectively that's making it so that we don't have an intercept of 0 anymore. Implementing this in my Excel dropped the loss from 0.43 to 0.40.
Here’s the PyTorch version:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
In Jeremy’s case, this made the training worse (the loss increased) and the validation loss started increasing after the second epoch—we might be overfitting. One way to avoid overfitting is to use weight decay (also known as L2 regularization). When we compute the gradients we’ll add to our loss function the sum of the weights squared (times some small number). What would make that loss function go down? If we reduce the magnitude of our weights. For example if we reduce all of our weights to 0, that part of the los function will be 0. The problem is, if our weights are all 0, our model doesn’t do anything. So we want it to increase the weights. But if it increases the weights too much, then it starts overfitting. How is it going to actually get the lowest value of the loss function? By finding the right mix. Weights not too high but high enough to be useful for predicting. If there’s some paramter that’s not useful, it can just set the weight to 0. It won’t be used to predict anything but it also won’t contribute to the weight decay.
loss_with_wd = loss + wd * (parameters**2).sum()
In fact, we don’t even need to do this because the whole purpose of the loss is to take its gradient. The gradient of parameters squared is 2 times parameters.
parameters.grad += wd * 2 * parameters
Fold the 2 into the wd
since it’s just some number we’re going to pick. When you call fit, pass in the wd
parameter:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
This finally improves our loss. In fastai applications like vision, fastai sets wd
appropriately, but in things like tabular and collaborative filtering fastai doesn’t know enough about your data so you just try a few wd
values. Regularization is about making your model no more complex than it has to be. The higher the weights, the more they’re moving the model around, we want to keep the weights down, but not so far down that they don’t make a prediction. If wd
is higher, it’ll keep the weights down more, reduce overfitting, but will also reduce the capacity of your model to make good predictions. If it’s lower, it increases the capacity of your model, and increases overfitting.
Can recommendation systems be built based on average ratings of users’ experience rather than collaborative filtering? Not really, if you’ve got lots of metadata you could (demographic data on users for example) then sure averages would be fine. But if all you’ve got is purchasing history, then you really want the granular data, there’s not enough information there to use averages.
Book Notes
Collaborative filtering: look at which products the current user has used or liked, find other users who have used or liked similar products, and then recommend other products that those users have used or liked. We don’t necessarily need to know anything about the products except who liked them.
Latent factors: the key foundational idea in collaborative filtering—the underlying concepts behind users and items that don’t need to be defined explicitly with columns of data.
A First Look at the Data
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
user | movie | rating | timestamp | |
---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
If we knew for each user to what degree they liked each important category that a movie might fall into, such as genre, age, preferred directors, and actors, and so forth, and we knew the same information about each movie, then a simple way to fill in the empty ratings would be to multiply this information together for each movie and user combination.
last_skywalker = np.array([0.98, 0.9, -0.9]) # sci-fi, action, old movie
last_skywalker
array([ 0.98, 0.9 , -0.9 ])
user1 = np.array([0.9, 0.8, -0.6]) # sci-fi, action, old movie
user1
array([ 0.9, 0.8, -0.6])
# combination with 3 being the max
(user1 * last_skywalker).sum()
2.1420000000000003
dot product: the mathematical operation of multiplying the elements of two vectors together and then summing up the results.
casablanca = np.array([-0.99, -0.3, 0.8]) # sci-fi, action, old movie
casablanca
array([-0.99, -0.3 , 0.8 ])
# user1 won't like this as much as last skywalker
(user1 * casablanca).sum()
-1.611
Learning the Latent Factors
There is surprisingly little difference between specifying the structure of a model and learning one, since we can just use our general gradient descent approach:
- Step 1: randomly initialize some parameters.
- Step 2: calculate our predictions.
- Step 3: calculate our loss.
More details:
- Step 1: the parameters we randomly initialize will be a set of latent factors for each user and movie. We’ll use 5 latent factors for now.
- Step 2: calculate predictions by taking the dot product of each movie with each user. If the first latent user factor represents how much the user likes action movies and the first latent movie factor represents whether the movie has a lot of action, the product of those will be particularly high if either the user likes action movies and the movie has a lot of action in it, or the user doesn’t like action movies and the movie doesn’t have any action in it. The product will be low if we have a mismatch.
- Step 3: We’ll pick mean squared error for now.
With this in place we can optimize our parameters using stochastic gradient descent such as to minimize the loss. At each step, the stochastic gradient descent optimizer will calculate the match between each movie and each user using the dot product and will compare it to the actual rating that each user gave to each movie. It will then calculate the derivative of this value and step the weights by multiplying this by the learning rate. After doing this lots of times the loss will get better and the recommendations will also get better and better.
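To make those steps concrete, here is a minimal sketch of that loop in plain PyTorch (my own illustration, not the book's code). It assumes n_users, n_movies and n_factors as defined in the next section, and hypothetical user_ids, movie_ids, and rating tensors holding one batch of known ratings:

import torch

# Step 1: randomly initialized latent factors, marked as trainable
user_factors = torch.randn(n_users, n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

def sgd_step(user_ids, movie_ids, rating, lr=1e-2):
    # Step 2: predictions are the dot products of the matching factor vectors
    preds = (user_factors[user_ids] * movie_factors[movie_ids]).sum(dim=1)
    # Step 3: mean squared error against the actual ratings
    loss = ((preds - rating)**2).mean()
    loss.backward()
    with torch.no_grad():
        user_factors.sub_(lr * user_factors.grad); user_factors.grad.zero_()
        movie_factors.sub_(lr * movie_factors.grad); movie_factors.grad.zero_()
    return loss.item()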
Creating the DataLoaders
To use the Learner.fit
function, we will need to get our data into a DataLoaders
. When showing the data we would rather see movie titles than their IDs:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()
movie | title | |
---|---|---|
0 | 1 | Toy Story (1995) |
1 | 2 | GoldenEye (1995) |
2 | 3 | Four Rooms (1995) |
3 | 4 | Get Shorty (1995) |
4 | 5 | Copycat (1995) |
ratings = ratings.merge(movies)
ratings.head()
user | movie | rating | timestamp | title | |
---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
user | title | rating | |
---|---|---|---|
0 | 210 | Some Like It Hot (1959) | 5 |
1 | 651 | Godfather, The (1972) | 4 |
2 | 515 | Starship Troopers (1997) | 4 |
3 | 49 | Swimming with Sharks (1995) | 4 |
4 | 512 | Nikita (La Femme Nikita) (1990) | 5 |
5 | 497 | Rob Roy (1995) | 4 |
6 | 664 | Dave (1993) | 3 |
7 | 880 | Empire Strikes Back, The (1980) | 5 |
8 | 185 | Leaving Las Vegas (1995) | 4 |
9 | 815 | Aladdin (1992) | 3 |
To represent collaborative filtering in PyTorch, we can’t just use the crosstab representation directly, especially if we want to fit into our deep learning framework. We can represent our movie and user latent factor tables as simple matrices:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

n_users, n_movies, n_factors
(944, 1665, 5)
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

user_factors.shape, movie_factors.shape
(torch.Size([944, 5]), torch.Size([1665, 5]))
To calculate the result for a particular movie and user combination, we have to look up the index of the movie in our movie latent factor matrix, and the index of the user in our user latent factor matrix; then we can do our dot product between the two latent factor vectors.
We can represent looking up an index as a matrix product by replacing the index with a one-hot-encoded vector.
one_hot_3 = one_hot(3, n_users).float()
user_factors.t() @ one_hot_3
tensor([-2.5648, -0.4866, -0.9996, -1.8835, -1.0867])
user_factors[3]
tensor([-2.5648, -0.4866, -0.9996, -1.8835, -1.0867])
If we do that for a few indices at once, we will have a matrix of one-hot-encoded vectors and that operation will be a matrix multiplication. This would be a perfectly acceptable way to build models using this kind of architecture, except that it would use a lot more memory and time than necessary.
one_hot_3
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0.])
one_hots = torch.stack([
    one_hot(3, n_users).float(),
    one_hot(4, n_users).float(),
    one_hot(5, n_users).float()])

one_hots @ user_factors
tensor([[-2.5648, -0.4866, -0.9996, -1.8835, -1.0867],
[-0.0096, -0.0892, -1.4639, 0.6083, -1.0248],
[ 0.0330, -0.6358, 0.6536, -0.9384, 0.0973]])
user_factors[3:6]
tensor([[-2.5648, -0.4866, -0.9996, -1.8835, -1.0867],
[-0.0096, -0.0892, -1.4639, 0.6083, -1.0248],
[ 0.0330, -0.6358, 0.6536, -0.9384, 0.0973]])
There is no real underlying reason to store the one-hot-encoded vector, or to search through it to find the occurrence of the number 1–we should just be able to index into an array directly with an integer.
embedding: a special layer that indexes into a vector using an integer, but has its derivative calculated in such a way that it is identical to what it would have been if it had done a matrix multiplication with a one-hot-encoded vector.
How do we determine the numbers to characterize these different features of movies and users? We don’t. We let the model learn them. By analyzing the existing relations between users and movies, our model can figure out itself the features that seem important or not.
We will attribute to each of our users and each of our movies a random vector of a certain length (here, n_factors=5
), and we will make those learnable parameters. That means that at each step, when we compute the loss by comparing our predictions to our targets, we will compute the gradients of the loss with respect to those embedding vectors and update them with the rules of SGD (or another optimizer).
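To convince myself that embedding vectors really are trainable parameters, here is a quick check of my own (not from the book), using fastai's Embedding layer (which appears in the next section) and the n_users and n_factors defined above:

# my own check: an Embedding's weight table receives gradients like any other parameter
emb = Embedding(n_users, n_factors)
out = emb(torch.tensor([3, 4])).sum()
out.backward()
emb.weight.grad.shape  # expect torch.Size([944, 5]); only rows 3 and 4 get nonzero gradients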
Collaborative Filtering from Scratch
# example class
class Example:
    def __init__(self, a): self.a = a
    def say(self, x): return f'Hello {self.a}, {x}.'

Example('Vishal').say('how are you?')
'Hello Vishal, how are you?.'
Creating a new PyTorch module requires inheriting from Module
. When your module is called, PyTorch will call a method in your class called forward
and will pass along to that any parameters that are included in the call.
Module??
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
x, y = dls.one_batch()
x.shape
torch.Size([64, 2])
# doing from scratch so use plain Learner
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.arch
AttributeError: 'DotProduct' object has no attribute 'arch'
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.311295 | 1.325218 | 00:11 |
1 | 1.011121 | 1.110533 | 00:12 |
2 | 0.882769 | 1.013594 | 00:11 |
3 | 0.790874 | 0.926204 | 00:12 |
4 | 0.769741 | 0.900602 | 00:11 |
Apply sigmoid_range
to force predictions to be between 0 and 5:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)
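As a side note, my understanding is that sigmoid_range is just a scaled and shifted sigmoid; this is my own sketch of it, not fastai's source:

# my own sketch of what sigmoid_range does (a scaled/shifted sigmoid)
def sigmoid_range_sketch(x, low, high):
    return torch.sigmoid(x) * (high - low) + low

sigmoid_range_sketch(torch.tensor([-10., 0., 10.]), 0, 5.5)  # any activation gets squashed into (0, 5.5)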
Before I train, I want to look at why dim=1
is set in sum
:
x.shape
torch.Size([64, 2])
user_factors = Embedding(n_users, n_factors)
movie_factors = Embedding(n_movies, n_factors)
users = user_factors(x[:,0])
movies = movie_factors(x[:,1])
users.shape, movies.shape
(torch.Size([64, 5]), torch.Size([64, 5]))
(users * movies).sum()
tensor(-0.0019, grad_fn=<SumBackward0>)
(users * movies).sum(dim=1).shape
torch.Size([64])
So, dim=1 sums users * movies over the factor dimension, producing one prediction per item in the batch (a tensor of shape 64) instead of a single number for the whole batch.
# doing from scratch so use plain Learner
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.923018 | 1.000980 | 00:13 |
1 | 0.663725 | 0.956221 | 00:13 |
2 | 0.439762 | 0.960771 | 00:12 |
3 | 0.361286 | 0.964587 | 00:13 |
4 | 0.333990 | 0.961902 | 00:14 |
This actually worsened the model!
One obvious missing piece is that some users are just more positive or negative in their recommendations than others, and some movies are just plain better or worse than others. In our current implementation we do not have any way to encode such things. If all you can say about a movie is, for instance, that it is very sci-fi, very action-oriented, and very not old, then you don’t really have any way to say whether most people like it. We can handle this missing piece with biases—a single number for each user and movie that we can add to our score.
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
Before I train I want to walk through this code to make sure I understand what’s happening at each step and why.
x, y = dls.one_batch()
x.shape
torch.Size([64, 2])
user_factors = Embedding(n_users, n_factors)
user_bias = Embedding(n_users, 1)
movie_factors = Embedding(n_movies, n_factors)
movie_bias = Embedding(n_movies, 1)
user_factors, user_bias, movie_factors, movie_bias
(Embedding(944, 5), Embedding(944, 1), Embedding(1665, 5), Embedding(1665, 1))
users = user_factors(x[:,0])
movies = movie_factors(x[:,1])
users.shape, movies.shape
(torch.Size([64, 5]), torch.Size([64, 5]))
users[0]
tensor([ 0.0005, -0.0128, -0.0086, 0.0043, -0.0140],
grad_fn=<SelectBackward0>)
x[0]
tensor([ 422, 1407])
user_factors(torch.tensor([422]))
tensor([[ 0.0005, -0.0128, -0.0086, 0.0043, -0.0140]],
grad_fn=<EmbeddingBackward0>)
(users * movies).sum(dim=1, keepdim=True).shape, (users * movies).sum(dim=1, keepdim=False).shape
(torch.Size([64, 1]), torch.Size([64]))
res = (users * movies).sum(dim=1, keepdim=True)
res += user_bias(x[:,0]) + movie_bias(x[:,1])
res.shape
torch.Size([64, 1])
res = (users * movies).sum(dim=1, keepdim=False)
res += user_bias(x[:,0]) + movie_bias(x[:,1])
res.shape
RuntimeError: output with shape [64] doesn't match the broadcast shape [64, 64]
user_bias(x[:,0]).shape
torch.Size([64, 1])
keepdim=True is needed so that res has shape 64 x 1, matching the 64 x 1 bias tensors we add to it; without it, broadcasting the shape-64 result against the 64 x 1 biases produces the 64 x 64 shape error above.
model = DotProductBias(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.838959 | 0.961033 | 00:12 |
1 | 0.584438 | 0.919422 | 00:12 |
2 | 0.406340 | 0.946146 | 00:13 |
3 | 0.323580 | 0.960616 | 00:13 |
4 | 0.303791 | 0.959865 | 00:16 |
The valid_loss
was decreasing from the first to second epoch but increased from the second to third and third to fourth epoch, which is a sign of overfitting.
Weight Decay
Add the sum of all weights squared to the loss so that when we compute the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.
The larger the coefficients are the sharper the canyons we will have in the loss function. Letting our model learn high parameters might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which will lead to overfitting.
loss_with_wd = loss + wd * (parameters ** 2).sum()
Limiting our weights from growing too much is going to hinder the training of the model, but it will yield a state where it generalizes better.
In practice it would be very inefficient and maybe numerically unstable to compute that big sum and add it to the loss. Adding that sum to the loss function is the same as doing the following to the gradients:
parameters.grad += wd * 2 * parameters
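To make that concrete, here's a rough sketch of mine (not the lesson's code) of how that gradient tweak slots into a plain SGD update; parameters is assumed to be an iterable of trainable tensors such as model.parameters():

# my own sketch: weight decay folded into a plain SGD step
def sgd_step_with_wd(parameters, lr=5e-3, wd=0.1):
    with torch.no_grad():
        for p in parameters:
            p.grad += wd * 2 * p   # same effect as adding wd * (p**2).sum() to the loss
            p -= lr * p.grad       # usual SGD update
            p.grad.zero_()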
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.913064 | 0.965725 | 00:12 |
1 | 0.682149 | 0.911183 | 00:12 |
2 | 0.539120 | 0.890211 | 00:12 |
3 | 0.445476 | 0.877169 | 00:12 |
4 | 0.429690 | 0.872437 | 00:12 |
Finally! The loss consistently decreases each epoch.
Creating Our Own Embedding Module
Optimizers require that they can get all the parameters of a module from the module’s parameters
method, but this does not happen automatically. If we just add a tensor as an attribute to a Module
, it will not be included in parameters
:
class T(Module):
    def __init__(self): self.a = torch.ones(3)

L(T().parameters())
(#0) []
To tell Module
that we want to treat a tensor as a parameter, we have to wrap it in the nn.Parameter
class. This class doesn’t add any functionality (other than automatically calling requires_grad_
for us). It's used only as a "marker" to show what to include in parameters
.
class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3))

L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]
All PyTorch modules use nn.Parameter
for any trainable parameters:
class T(Module):
    def __init__(self): self.a = nn.Linear(1, 3, bias=False)

t = T()
L(t.parameters())
(#1) [Parameter containing:
tensor([[ 0.7111],
[-0.4145],
[ 0.4969]], requires_grad=True)]
type(t.a.weight)
torch.nn.parameter.Parameter
# create a tensor as a parameter, with random initialization
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
# create DotProductBias without embedding
class DotProductBias2(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
model = DotProductBias2(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.904102 | 0.942020 | 00:11 |
1 | 0.654384 | 0.880052 | 00:10 |
2 | 0.509761 | 0.857724 | 00:09 |
3 | 0.451252 | 0.838781 | 00:10 |
4 | 0.445227 | 0.834458 | 00:10 |
x[0]
tensor([ 422, 1407])
learn.model(x)
tensor([3.0069, 2.4127, 3.1486, 3.0827, 4.2363, 3.0727, 4.6257, 2.8603, 4.0537,
2.7821, 1.5680, 3.5075, 1.9389, 4.7167, 3.6408, 2.3103, 3.2855, 3.0170,
4.1721, 4.1315, 4.7459, 4.7006, 3.5047, 3.6348, 2.6413, 2.6047, 4.9137,
3.3948, 3.4747, 3.9256, 3.8448, 3.3731, 3.9032, 1.3075, 4.4401, 3.6986,
3.3045, 3.5697, 4.3211, 3.7295, 1.1871, 3.2220, 3.5667, 3.5309, 4.5825,
2.2819, 4.1466, 3.3740, 4.4830, 3.2561, 3.0671, 3.9423, 2.9384, 4.2050,
2.3258, 4.0698, 4.0947, 2.9023, 3.8935, 3.0650, 3.3615, 3.3382, 3.2790,
4.0849], grad_fn=<AddBackward0>)
ratings.head(2)
user | movie | rating | timestamp | title | |
---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
learn.get_preds(
    dl=learn.dls.test_dl(ratings.head(2), with_input=True, with_decoded=True)
)
(tensor([3.9271, 3.7005]),
tensor([[3],
[3]]))
learn.get_preds(dl=learn.dls.test_dl(pd.DataFrame(data={'user': [196, 63], 'title': ['Kolya (1996)', 'Kolya (1996)']})))
(tensor([3.9271, 3.7005]), None)
learn.model(tensor([[196, 242]]))
tensor([3.1319], grad_fn=<AddBackward0>)
Interpreting Embeddings and Biases
The easiest parameters to interpret are biases. For movies with a low bias: even when a user is very well matched to its latent factors (which, as we will see in a moment, tend to represent things like level of action, age of movie, and so forth), they still generally don’t like it.
learn.model
DotProductBias2()
learn.model.movie_bias.shape
torch.Size([1665])
movie_bias = learn.model.movie_bias.squeeze()
movie_bias.shape
torch.Size([1665])
movie_bias[:5]
tensor([ 0.0034, -0.1036, 0.0292, -0.0822, 0.4568], grad_fn=<SliceBackward0>)
Note: PyTorch’s squeeze
:
Returns a tensor with all specified dimensions of input of size 1 removed.
# bottom 5 movies
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]
['Children of the Corn: The Gathering (1996)',
'Lawnmower Man 2: Beyond Cyberspace (1996)',
'Solo (1996)',
'Mortal Kombat: Annihilation (1997)',
'Crow: City of Angels, The (1996)']
# top 5 movies
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['Shawshank Redemption, The (1994)',
'Good Will Hunting (1997)',
'Titanic (1997)',
"Schindler's List (1993)",
'Rear Window (1954)']
learn.model.movie_factors.shape
torch.Size([1665, 50])
movie_bias.argsort(descending=True)[:5]
tensor([1318, 622, 1501, 1282, 1216])
dls.classes['title'][1318]
'Shawshank Redemption, The (1994)'
PCA code from the fastbook repo:
g = ratings.groupby('title')['rating'].count()
g
title
'Til There Was You (1997) 9
1-900 (1994) 5
101 Dalmatians (1996) 109
12 Angry Men (1957) 125
187 (1997) 41
...
Young Guns II (1990) 44
Young Poisoner's Handbook, The (1995) 41
Zeus and Roxanne (1997) 6
unknown 9
Á köldum klaka (Cold Fever) (1994) 1
Name: rating, Length: 1664, dtype: int64
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_movies[:5]
array(['Star Wars (1977)', 'Contact (1997)', 'Fargo (1996)',
'Return of the Jedi (1983)', 'Liar Liar (1997)'], dtype=object)
g.sort_values(ascending=False).index
Index(['Star Wars (1977)', 'Contact (1997)', 'Fargo (1996)',
'Return of the Jedi (1983)', 'Liar Liar (1997)',
'English Patient, The (1996)', 'Scream (1996)', 'Toy Story (1995)',
'Air Force One (1997)', 'Independence Day (ID4) (1996)',
...
'Girl in the Cadillac (1995)', 'He Walked by Night (1948)',
'Hana-bi (1997)', 'Object of My Affection, The (1998)',
'Office Killer (1997)', 'Great Day in Harlem, A (1994)',
'Other Voices, Other Rooms (1997)', 'Good Morning (1971)',
'Girls Town (1996)', 'Á köldum klaka (Cold Fever) (1994)'],
dtype='object', name='title', length=1664)
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
learn.dls.classes['title'].o2i['Star Wars (1977)']
1399
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]

plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x, y, i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
Using fastai.collab
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.928530 | 0.946365 | 00:12 |
1 | 0.683226 | 0.883067 | 00:10 |
2 | 0.506070 | 0.853040 | 00:10 |
3 | 0.446197 | 0.840192 | 00:10 |
4 | 0.439921 | 0.836284 | 00:10 |
View the names of the model layers
learn.model
EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1665, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1665, 1)
)
Replicate previous analyses:
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
['Titanic (1997)',
'Shawshank Redemption, The (1994)',
'Usual Suspects, The (1995)',
"Schindler's List (1993)",
'Silence of the Lambs, The (1991)']
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.i_weight.weight[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]

plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x, y, i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()
Embedding Distance
If there were two movies that were nearly identical, their embedding vectors would also have to be nearly identical, because the users who would like them would be nearly exactly the same. Movie similarity can be defined by the similarity of users who like those movies. The distance between two movies’ embedding vectors can define that similarity.
# find the most similar movie to Silence of the Lambs
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]
'His Girl Friday (1940)'
distances.argsort(descending=True)
tensor([1330, 688, 846, ..., 1048, 595, 850])
Bootstrapping a Collaborative Filtering Model
bootstrapping problem: having no users and therefore no history to learn from. What products do you recommend to your very first user? What do you do when a new user signs up? What do you do when you add a new product to your portfolio? Use your common sense.
- Pick a user to represent average taste (instead of averaging all user embeddings as this can incorrectly represent relationships between latent factors).
- Use a tabular model based on user metadata to construct your initial embedding vector. Think about what questions you could ask to help you understand users' tastes. Create a model where the dependent variable is the user's embedding vector and the independent variables are the results of the questions you ask them, along with their signup metadata.
- A small number of extremely enthusiastic users may end up effectively setting the recommendations for your whole user base. Such a problem can change the entire makeup of your user base and the behavior of the system, particularly because of positive feedback loops: a small number of users set the direction of the recommendation system, which ends up attracting more people like them to your system, amplifying the original bias exponentially. Ensure that humans are in the loop of the data pipeline, with careful monitoring of the system and a gradual and thoughtful rollout. Think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data.
The dot-product approach to collaborative filtering is known as probabilistic matrix factorization (PMF).
Deep Learning for Collaborative Filtering
Take the results of the embedding lookup and concatenate those activations together, giving us a matrix that we can then pass through linear layers and nonlinearities.
Since we’ll be concatenating the embedding matrices, rather than taking their dot product, the two embedding matrices can have different sizes (different number of latent factors). get_emb_sz
returns recommended sizes for embedding matrices based on Jeremy’s intuition for what works well.
embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
Working through this code step-by-step:
user_factors = Embedding(*embs[0])
movie_factors = Embedding(*embs[1])

user_factors, movie_factors
(Embedding(944, 74), Embedding(1665, 102))
layers = nn.Sequential(
    nn.Linear(embs[0][1]+embs[1][1], 100),
    nn.ReLU(),
    nn.Linear(100, 1)
)
layers
Sequential(
(0): Linear(in_features=176, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=1, bias=True)
)
y_range = (0, 5.5)
x = dls.one_batch()[0]
x.shape
torch.Size([64, 2])
embs = user_factors(x[:,0]), movie_factors(x[:,1])
embs = torch.cat(embs, dim=1)
embs.shape
torch.Size([64, 176])
x = layers(embs)
x.shape
torch.Size([64, 1])
x = sigmoid_range(x, *y_range)
x.shape
torch.Size([64, 1])
# create a model
embs = get_emb_sz(dls)
model = CollabNN(*embs)
model
CollabNN(
(user_factors): Embedding(944, 74)
(item_factors): Embedding(1665, 102)
(layers): Sequential(
(0): Linear(in_features=176, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=1, bias=True)
)
)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01) # note the smaller wd
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.909496 | 0.933792 | 00:24 |
1 | 0.865688 | 0.909720 | 00:24 |
2 | 0.821369 | 0.876649 | 00:23 |
3 | 0.772402 | 0.854223 | 00:14 |
4 | 0.768974 | 0.850676 | 00:12 |
# alternative way to train a NN
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.968165 | 0.987966 | 00:18 |
1 | 0.906930 | 0.917807 | 00:13 |
2 | 0.830938 | 0.884382 | 00:13 |
3 | 0.781613 | 0.859800 | 00:13 |
4 | 0.749270 | 0.855626 | 00:13 |
fastai uses EmbeddingNN
which inherits from TabularModel
.
**kwargs
in a parameter list means “put any additional keyword arguments into a dict called kwargs
.” And **kwargs
in an argument list means “insert all key/value pairs in the kwargs
dict as named arguments here.”
Jupyter Notebook doesn't know what parameters are available with **kwargs, so things like tab completion of parameter names and pop-up lists of signatures won't work. fastai resolves this by providing a special @delegates decorator which automatically changes the signature of the class or function to insert all of its keyword arguments into the signature.
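Here's a small sketch of my own showing the **kwargs pattern and fastcore's delegates decorator; the base_fit/nicer_wrapper names are made up for illustration:

from fastcore.meta import delegates

# a toy function whose keyword arguments we want to "pass through"
def base_fit(epochs, lr=1e-3, wd=0.0):
    print(epochs, lr, wd)

def plain_wrapper(model, **kwargs):     # extra keyword args get collected into the kwargs dict...
    return base_fit(**kwargs)           # ...and re-expanded as named arguments here

@delegates(base_fit)                    # copies base_fit's keyword arguments into the signature,
def nicer_wrapper(model, **kwargs):     # so tab completion and signature pop-ups show lr and wd
    return base_fit(**kwargs)

plain_wrapper(None, epochs=1, lr=1e-2)
nicer_wrapper(None, epochs=1, wd=0.1)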
We can incorporate other user and movie information with EmbeddingNN
since it uses TabularModel
(EmbeddingNN
is a TabularModel
with n_cont=0
and out_sz=1
.)
Questionnaire
1. What problem does collaborative filtering solve?
It solves the problem of “filling in the blanks” to predict which items which users (who haven’t bought those items) will buy or rate highly.
2. How does it solve it?
It solves it by learning different features (latent factors) of users and items and taking the dot product of a user's latent factors with an item's latent factors as the prediction.
3. Why might a collaborative filtering predictive model fail to be a very useful recommendation system?
4. What does a crosstab representation of collaborative filtering data look like?
It looks like a table with rows/columns as users/movies and the cells/values as the rating.
5. Write the code to create a crosstab representation of the MovieLens data.
pd.crosstab(index=ratings['user'], columns=ratings['movie'], values=ratings['rating'], aggfunc='max')
movie | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user | |||||||||||||||||||||
1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | 5.0 | 4.0 | 1.0 | 5.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 4.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
939 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
940 | NaN | NaN | NaN | 2.0 | NaN | NaN | 4.0 | 5.0 | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
941 | 5.0 | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
942 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 rows × 1682 columns
6. What is a latent factor? Why is it “latent”?
A latent factor is some characteristic of a variable such as user
or item
. It is latent because we do not explicitly know or define it. It is learned through training.
7. What is a dot product? Calculate a dot product manually using pure Python with lists.
A dot product is the sum of elementwise products between two sequences.
a = [0.5, 0.8, 0.6]
b = [0.3, 0.2, 0.1]

dotproduct = 0
for i in range(3):
    dotproduct += a[i] * b[i]

dotproduct
0.37000000000000005
8. What does pandas.DataFrame.merge
do?
It joins two DataFrames
based on a single key column. In our case, we merge
ratings
with movies
to get the title
column into ratings
.
9. What is an embedding matrix?
A matrix where the rows are users or movies, the columns are latent factors, and the values are the learned (continuous) factor weights.
10. What is the relationship between an embedding and a matrix of one-hot-encoded vectors?
An embedding is a lookup for which gradients are calculated for an equivalent one-hot-encoded vector matrix multiplication.
11. Why do we need Embedding
if we could use one-hot-encoded vectors for the same thing?
Because one-hot-encoded vectors take up a lot of memory.
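As a rough illustration (my own numbers, reusing n_users and fastai's one_hot from above), compare what gets stored for a batch of 64 lookups:

# my own illustration: one-hot matrices vs plain indices for selecting 64 user vectors
factors = torch.randn(n_users, 5)
one_hot_batch = torch.stack([one_hot(i, n_users).float() for i in range(64)])  # 64 x 944 floats just to pick rows
idx_batch = torch.arange(64)                                                   # 64 integers do the same job
torch.allclose(one_hot_batch @ factors, factors[idx_batch])                    # True: same result either way
one_hot_batch.numel(), idx_batch.numel()                                       # 60416 stored values vs 64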
12. What does an embedding contain before we start training (assuming we’re not using a pretrained model)?
Random numbers.
13. Create a class (without peeking, if possible!) and use it.
class Example2():
    def __init__(self, a): self.a = a
    def say(self, text): print(f"Hello {self.a}, {text}")

Example2('Vishal').say('how are you doing?')
Hello Vishal, how are you doing?
14. What does x[:,0]
mean?
All rows of the first column.
x = torch.rand((20, 3))
x
tensor([[0.9112, 0.0080, 0.1422],
[0.5259, 0.5318, 0.4177],
[0.7601, 0.4428, 0.7609],
[0.1857, 0.6702, 0.5156],
[0.7433, 0.9224, 0.0756],
[0.1144, 0.9052, 0.4352],
[0.2535, 0.6668, 0.5542],
[0.1815, 0.1204, 0.7027],
[0.6851, 0.2904, 0.1381],
[0.4445, 0.2967, 0.3887],
[0.0353, 0.8038, 0.7396],
[0.3166, 0.4250, 0.4495],
[0.8432, 0.9193, 0.0062],
[0.9012, 0.0966, 0.8314],
[0.7679, 0.5781, 0.4155],
[0.4149, 0.3091, 0.3061],
[0.3020, 0.6649, 0.8742],
[0.6168, 0.4744, 0.0328],
[0.9663, 0.3894, 0.5954],
[0.0722, 0.1334, 0.2033]])
x[:,0]
tensor([0.9112, 0.5259, 0.7601, 0.1857, 0.7433, 0.1144, 0.2535, 0.1815, 0.6851,
0.4445, 0.0353, 0.3166, 0.8432, 0.9012, 0.7679, 0.4149, 0.3020, 0.6168,
0.9663, 0.0722])
15. Rewrite the DotProduct
class (without peeking, if possible!) and train a model with it.
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)
model = DotProduct(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.326553 | 1.332127 | 00:11 |
1 | 1.056348 | 1.104372 | 00:14 |
2 | 0.907263 | 0.994020 | 00:14 |
3 | 0.811368 | 0.898383 | 00:11 |
4 | 0.746635 | 0.878394 | 00:11 |
16. What is a good loss function to use for MovieLens? Why?
Mean squared error, because our predictions are continuous values.
17. What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?
Currently the model predicts a single value—a continuous number for the rating.
model(torch.tensor([[114, 23]]))
tensor([2.3010], grad_fn=<SumBackward1>)
For Cross-Entropy loss, we need the model to predict 5 probabilities, one for each rating (1, 2, 3, 4, 5). My initial guess for doing this would be to add a linear layer that projects from 1 feature to 5 features.
x,y = dls.one_batch()
x.shape
torch.Size([64, 2])
class DotProduct2(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.linear = nn.Linear(1, 5)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return self.linear((users * movies).sum(dim=1).unsqueeze(0).permute(1,0))
model = DotProduct2(n_users, n_movies, n_factors=50)
model(x).shape
torch.Size([64, 5])
Running this with CollabDataLoaders
won’t work because that sets the ratings
as a continuous variable. Instead, I need to create a TabularDataLoaders
object where the dependent variable y
is a categorical ratings
column (with vocab
being 0
, 1
, 2
, 3
and 4
) and then using my updated linear layer model with CrossEntropyLossFlat
.
model = DotProduct2(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat())

# this won't work
learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | time |
---|
IndexError: Target 5 is out of bounds.
ratings[['user', 'title', 'rating']].head()
user | title | rating | |
---|---|---|---|
0 | 196 | Kolya (1996) | 3 |
1 | 63 | Kolya (1996) | 3 |
2 | 226 | Kolya (1996) | 5 |
3 | 154 | Kolya (1996) | 3 |
4 | 306 | Kolya (1996) | 5 |
dls = TabularDataLoaders.from_df(
    ratings[['user', 'title', 'rating']],
    procs=[Categorify],
    cat_names=['user','title'],
    y_names=['rating'],
    y_block=CategoryBlock)
dls.vocab
[1, 2, 3, 4, 5]
b = dls.one_batch()
b[0].shape, b[1].shape, b[2].shape
(torch.Size([64, 2]), torch.Size([64, 0]), torch.Size([64, 1]))
Since the TabularDataLoaders
is going to pass categorical and continuous (empty) values to the model, I’ll have to update the model’s forward
pass accordingly.
class DotProduct3(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.linear = nn.Linear(1, 5)

    def forward(self, x_cat, x_cont):
        x = x_cat
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return self.linear((users * movies).sum(dim=1).unsqueeze(0).permute(1,0))
model = DotProduct3(n_users, n_movies, n_factors=50)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

learn.fit_one_cycle(5, 5e-3)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.365067 | 1.474111 | 0.344700 | 00:35 |
1 | 1.181627 | 1.604257 | 0.351150 | 00:15 |
2 | 1.017730 | 1.770345 | 0.358350 | 00:16 |
3 | 0.942282 | 1.899246 | 0.360700 | 00:21 |
4 | 0.902040 | 1.926463 | 0.359450 | 00:25 |
It’s not a great model, but it works! To recap, the three changes I made:
- Project the dot product to 5 values using an nn.Linear layer.
- Use TabularDataLoaders instead of CollabDataLoaders.
- Use CrossEntropyLossFlat instead of MSELossFlat.
18. What is the use of bias in the dot product model?
If you take away user preferences and movie characteristics, the bias represents how good or bad a movie is. It’s like a baseline rating sans preference. Low bias = bad movie even if it matches your preferences, high bias = good movie regardless of your preference.
19. What is another name for weight decay?
L2 regularization.
20. Write the equation for weight decay (without peeking!)
loss_with_wd = loss + wd * (parameters ** 2).sum()
21. Write the equation for the gradient of weight decay. Why does it help reduce weights?
parameters.grad += 2 * wd * parameters
Increasing the loss by the weighted sum of the squared parameters causes the parameters to shrink, since the model is trying to minimize the loss and therefore also the magnitudes of the weights.
22. Why does reducing weights lead to better generalization?
Because lower weights result in smoother surfaces where the model won’t overfit to data as it does if weights are higher and the surface is full of sharp peaks and valleys.
23. What does argsort
do in PyTorch?
It returns the indices that would sort the tensor.
t = torch.tensor([1, 4, 3, 2])
t.argsort()
tensor([0, 3, 2, 1])
24. Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.894813 | 0.955763 | 00:21 |
1 | 0.693337 | 0.897217 | 00:15 |
2 | 0.512773 | 0.869337 | 00:13 |
3 | 0.448445 | 0.854036 | 00:11 |
4 | 0.438284 | 0.849818 | 00:10 |
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]
["Schindler's List (1993)",
'Good Will Hunting (1997)',
'Titanic (1997)',
'Shawshank Redemption, The (1994)',
'Silence of the Lambs, The (1991)']
ratings.groupby('title')['rating'].mean().sort_values(ascending=False)
title
They Made Me a Criminal (1939) 5.0
Marlene Dietrich: Shadow and Light (1996) 5.0
Saint of Fort Washington, The (1993) 5.0
Someone Else's America (1995) 5.0
Star Kid (1997) 5.0
...
Eye of Vichy, The (Oeil de Vichy, L') (1993) 1.0
King of New York (1990) 1.0
Touki Bouki (Journey of the Hyena) (1973) 1.0
Bloody Child, The (1996) 1.0
Crude Oasis, The (1995) 1.0
Name: rating, Length: 1664, dtype: float64
No, as shown above, sorting movie biases doesn't give the same result as sorting by average movie rating. The reason is that the model's predicted rating also accounts for movie and user characteristics via the latent factors, while the bias is what's left over after those are accounted for; a raw average is also skewed by which users happened to rate the movie.
25. How do you print the names and details of the layers in a model?
By running a cell with the model, like so:
learn.model
EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1665, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1665, 1)
)
26. What is the “bootstrapping problem” in collaborative filtering?
Dealing with new movies or new users that you don’t have information on.
27. How could you deal with the bootstrapping problem for new users? For new movies?
For new users, you can find the “average user” or use a TabularModel to predict the embeddings for this user. For new movies you can do the same. You can also collect metadata about the movies and users and use that in your model as additional information to train on.
28. How can feedback loops impact collaborative filtering systems?
A small number of users who are using and/or rating a lot of products will skew the recommendation system towards their latent factors, recommending the products they like to all users and thus attracting more users who like that narrow band of products. The platform will thus become focused on this narrow band of products and users.
29. When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?
Because we eventually will concatenate them before passing them through (matrix multiplying by) the first Linear Layer.
30. Why is there an nn.Sequential
in the CollabNN
model?
Because that is the Neural Net (NN).
31. What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering system?
Tabular.
Further Research
1. Take a look at all the differences between the Embedding version of DotProductBias and the create_params version, and try to understand why each of those changes is required. If you're not sure, try reverting each change to see what happens. (Even the type of brackets used in forward has changed!)
2. Find three other areas where collaborative filtering is being used, and identify the pros and cons of this approach in those areas.
3. Complete this notebook using the full MovieLens dataset, and compare your results to online benchmarks. See if you can improve your accuracy. Look on the book's website and the fast.ai forums for ideas. Note that there are more columns in the full dataset–see if you can use those too (the next chapter might give you ideas).
4. Create a model for MovieLens that works with cross-entropy loss, and compare it to the model in this chapter.
Label Smoothing Cross Entropy Loss
I’ll work through the example in Aman Arora’s blog post in which he implements Label Smoothing Cross Entropy Loss.
# logits
X = torch.tensor([
    [4.2, -2.4],
    [1.6, -0.6],
    [3.6, 1.2],
    [-0.5, 0.5],
    [-0.25, 1.7]
])

# labels
y = torch.tensor([0,1,1,0,0])

X, y
(tensor([[ 4.2000, -2.4000],
[ 1.6000, -0.6000],
[ 3.6000, 1.2000],
[-0.5000, 0.5000],
[-0.2500, 1.7000]]),
tensor([0, 1, 1, 0, 0]))
LabelSmoothingCrossEntropy(eps=0.1, reduction='none')(X,y) # matches Excel calculations
tensor([0.3314, 2.1951, 2.3668, 1.2633, 1.9855])
noisy_y = tensor([[0.95, 0.05], [0.05, 0.95], [0.05, 0.95], [0.95, 0.05], [0.95, 0.05]])
noisy_y
tensor([[0.9500, 0.0500],
[0.0500, 0.9500],
[0.0500, 0.9500],
[0.9500, 0.0500],
[0.9500, 0.0500]])
c = X.size()[-1]
c # number of classes
2
# with mean reduction
log_preds = F.log_softmax(X, dim=-1)
loss = reduce_loss(-log_preds.sum(dim=-1), 'mean')
nll = F.nll_loss(log_preds, y, reduction='mean')
(1-0.1)*nll + 0.1*(loss/c)
tensor(1.6284)
# without reduction
log_preds = F.log_softmax(X, dim=-1)
loss = reduce_loss(-log_preds.sum(dim=-1), 'none')
nll = F.nll_loss(log_preds, y, reduction='none')
(1-0.1)*nll + 0.1*(loss/c) # matches Excel calculations
tensor([0.3314, 2.1951, 2.3668, 1.2633, 1.9855])
((1-0.1)*nll + 0.1*(loss/c)).mean() # same as w/ mean reduction
tensor(1.6284)
loss # this is the negative sum of log_preds for both classes
tensor([6.6027, 2.4102, 2.5737, 1.6265, 2.2160])
-log_preds.sum(dim=-1) * 0.1 /2 # epsilon weighted negative sum of log_preds for both classes
tensor([0.3301, 0.1205, 0.1287, 0.0813, 0.1108])
nll * 0.9 # (1 - epsilon) times the negative log loss of the target classes
tensor([1.2235e-03, 2.0746e+00, 2.2382e+00, 1.1819e+00, 1.8747e+00])
0.9 * nll + -log_preds.sum(dim=-1) * 0.1 /2 # matches Excel calculations
tensor([0.3314, 2.1951, 2.3668, 1.2633, 1.9855])
X
tensor([[ 4.2000, -2.4000],
[ 1.6000, -0.6000],
[ 3.6000, 1.2000],
[-0.5000, 0.5000],
[-0.2500, 1.7000]])
(-torch.log(F.softmax(X, dim=-1)) * noisy_y).sum(dim=-1) # matches Excel calculations
tensor([0.3314, 2.1951, 2.3668, 1.2633, 1.9855])
nll
tensor([1.3595e-03, 2.3051e+00, 2.4868e+00, 1.3133e+00, 2.0830e+00])
-torch.log(F.softmax(X, dim=-1)) # notice that nll is the loss values of the target class
tensor([[1.3595e-03, 6.6014e+00],
[1.0508e-01, 2.3051e+00],
[8.6836e-02, 2.4868e+00],
[1.3133e+00, 3.1326e-01],
[2.0830e+00, 1.3302e-01]])
-log_preds # same as -torch.log(F.softmax(X,dim=-1))
tensor([[1.3595e-03, 6.6014e+00],
[1.0508e-01, 2.3051e+00],
[8.6836e-02, 2.4868e+00],
[1.3133e+00, 3.1326e-01],
[2.0830e+00, 1.3302e-01]])
-log_preds.sum(dim=-1) # same as `loss`
tensor([6.6027, 2.4102, 2.5737, 1.6265, 2.2160])
So I think what's going on here is that nll is just the negative log probability of the chosen label, whereas loss is the sum of both labels' negative log probabilities. So multiplying nll by 0.9 and then adding 0.05 * loss results in the ground-truth term being multiplied by 0.95 (0.9 + 0.05) and the not-truth term being multiplied by 0.05. I think.
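Putting that together, here is a small function of my own (a restatement of the formula worked through above, not fastai's actual implementation) that reproduces the LabelSmoothingCrossEntropy values computed earlier:

import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    # (1 - eps) * NLL of the target class + (eps / c) * negative log-probs summed over all classes
    c = logits.size(-1)
    log_preds = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_preds, targets, reduction='none')
    return (1 - eps) * nll + eps * (-log_preds.sum(dim=-1) / c)

label_smoothing_ce(X, y)  # should match tensor([0.3314, 2.1951, 2.3668, 1.2633, 1.9855]) above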
Lesson 8: Convolutions
THE FINAL LESSON OF PART 1!!!!
Video Notes
Before you dig into this, make sure you understand the Linear model and neural net from scratch notebook. I’ll start by walking through that code first:
from pathlib import Path
= Path("~/.kaggle/kaggle.json").expanduser()
cred_path if not cred_path.exists():
=True)
cred_path.parent.mkdir(exist_ok
cred_path.write_text(creds)0o600) cred_path.chmod(
path = Path('titanic')
if not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 15.2MB/s]
path = Path('/content/titanic')
import torch, numpy as np, pandas as pd
np.set_printoptions(linewidth=140)
torch.set_printoptions(linewidth=140, sci_mode=False, edgeitems=7)
pd.set_option('display.width', 140)
df = pd.read_csv(path/'train.csv')
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# replace missing values with mode
modes = df.mode().iloc[0]
modes
PassengerId 1
Survived 0.0
Pclass 3.0
Name Abbing, Mr. Anthony
Sex male
Age 24.0
SibSp 0.0
Parch 0.0
Ticket 1601
Fare 8.05
Cabin B96 B98
Embarked S
Name: 0, dtype: object
df.fillna(modes, inplace=True)
df.isna().sum().sum()
0
import numpy as np
df.describe(include=(np.number))
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 28.566970 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 13.199572 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 24.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
df['Fare'].hist();
# normalize the values with log
df['LogFare'] = np.log(df['Fare']+1)
df['LogFare'].hist();
# summary of non-numeric columns
df.describe(include=[object])
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 891 | 891 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
freq | 1 | 577 | 7 | 691 | 646 |
# create dummy variables for categorical columns with low cardinality
df = pd.get_dummies(df, columns=["Sex","Pclass","Embarked"], dtype=float)
df.columns
Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'LogFare', 'Sex_female', 'Sex_male',
'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
dtype='object')
added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()
Sex_male | Sex_female | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# create independent and dependent variables as tensors
from torch import tensor
t_dep = tensor(df.Survived)
t_dep[:10]
tensor([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])
df[added_cols].dtypes
Sex_male float64
Sex_female float64
Pclass_1 float64
Pclass_2 float64
Pclass_3 float64
Embarked_C float64
Embarked_Q float64
Embarked_S float64
dtype: object
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

t_indep = tensor(df[indep_cols].values, dtype=torch.float)
t_indep.shape
torch.Size([891, 12])
torch.manual_seed(442)

n_coeff = t_indep.shape[1]
n_coeff
12
coeffs = torch.rand(n_coeff)-0.5
coeffs.shape
torch.Size([12])
coeffs[:10]
tensor([-0.4629, 0.1386, 0.2409, -0.2262, -0.2632, -0.3147, 0.4876, 0.3136, 0.2799, -0.4392])
Our predictions will be calculated by multiplying each row by the coefficients, and adding them up.
# each row has 12 variables, one coefficient per variable
# the coefficients are broadcasted to each row
(t_indep*coeffs).shape
torch.Size([891, 12])
(t_indep*coeffs)[:2]
tensor([[-10.1838, 0.1386, 0.0000, -0.4772, -0.2632, -0.0000, 0.0000, 0.0000, 0.2799, -0.0000, 0.0000, 0.3625],
[-17.5902, 0.1386, 0.0000, -0.9681, -0.0000, -0.3147, 0.4876, 0.0000, 0.0000, -0.4392, 0.0000, 0.0000]])
# normalize so age doesn't dominate the values when summing the predictions
t_indep.max(), t_indep.max(dim=0)
(tensor(80.),
torch.return_types.max(
values=tensor([80.0000, 8.0000, 6.0000, 6.2409, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]),
indices=tensor([630, 159, 678, 258, 0, 1, 1, 9, 0, 1, 5, 0])))
vals, indices = t_indep.max(dim=0)
# division by vals broadcasted to each row
t_indep = t_indep / vals
t_indep[:2]
tensor([[0.2750, 0.1250, 0.0000, 0.3381, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 1.0000],
[0.4750, 0.1250, 0.0000, 0.6859, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000]])
(t_indep*coeffs)[:2]
tensor([[-0.1273, 0.0173, 0.0000, -0.0765, -0.2632, -0.0000, 0.0000, 0.0000, 0.2799, -0.0000, 0.0000, 0.3625],
[-0.2199, 0.0173, 0.0000, -0.1551, -0.0000, -0.3147, 0.4876, 0.0000, 0.0000, -0.4392, 0.0000, 0.0000]])
(t_indep*coeffs).sum()
tensor(5.1269)
preds = (t_indep*coeffs).sum(axis=1)
preds.shape
torch.Size([891])
# look at the predictions
pd.Series(preds).hist();
# loss function
torch.abs(preds-t_dep).mean()
tensor(0.5382)
(preds-t_dep)[:5]
tensor([ 0.1927, -1.6239, -0.9021, -0.7944, 0.0968])
# create helper functions
def calc_preds(coeffs, indeps): return (indeps*coeffs).sum(axis=1)
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps)-deps).mean()
(calc_preds(coeffs, t_indep) == (coeffs*t_indep).sum(axis=1)).sum()
tensor(891)
calc_loss(coeffs, t_indep, t_dep) == torch.abs(preds-t_dep).mean()
tensor(True)
# prepare for gradient descent
coeffs.requires_grad_()
tensor([-0.4629, 0.1386, 0.2409, -0.2262, -0.2632, -0.3147, 0.4876, 0.3136, 0.2799, -0.4392, 0.2103, 0.3625], requires_grad=True)
loss = calc_loss(coeffs, t_indep, t_dep)
loss
tensor(0.5382, grad_fn=<MeanBackward0>)
# calculate gradients
loss.backward()
coeffs.grad
tensor([-0.0106, 0.0129, -0.0041, -0.0484, 0.2099, -0.2132, -0.1212, -0.0247, 0.1425, -0.1886, -0.0191, 0.2043])
# gradients are accumulated (added) when calling backward again
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
coeffs.grad
tensor([-0.0212, 0.0258, -0.0082, -0.0969, 0.4198, -0.4265, -0.2424, -0.0494, 0.2851, -0.3771, -0.0382, 0.4085])
loss
tensor(0.5382, grad_fn=<MeanBackward0>)
# do a gradient descent step
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
with torch.no_grad():
    coeffs.sub_(coeffs.grad * 0.1)             # update the parameters (coefficients)
    coeffs.grad.zero_()                        # set gradients of coefficients to 0
    print(calc_loss(coeffs, t_indep, t_dep))   # yay, the loss decreased
tensor(0.4945)
from fastai.data.transforms import RandomSplitter
trn_split,val_split = RandomSplitter(seed=42)(df)
len(trn_split), len(val_split), trn_split[:5], val_split[:5]
(713, 178, (#5) [788,525,821,253,374], (#5) [303,778,531,385,134])
trn_indep,val_indep = t_indep[trn_split],t_indep[val_split]
trn_dep,val_dep = t_dep[trn_split],t_dep[val_split]
len(trn_indep),len(val_indep)
(713, 178)
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")

def init_coeffs(): return (torch.rand(n_coeff)-0.5).requires_grad_()

def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for i in range(epochs): one_epoch(coeffs, lr=lr)
    return coeffs

coeffs = train_model(18, lr=0.2)
0.536; 0.502; 0.477; 0.454; 0.431; 0.409; 0.388; 0.367; 0.349; 0.336; 0.330; 0.326; 0.329; 0.304; 0.314; 0.296; 0.300; 0.289;
def show_coeffs(): return dict(zip(indep_cols, coeffs.requires_grad_(False)))
show_coeffs()
{'Age': tensor(-0.2694),
'SibSp': tensor(0.0901),
'Parch': tensor(0.2359),
'LogFare': tensor(0.0280),
'Sex_male': tensor(-0.3990),
'Sex_female': tensor(0.2345),
'Pclass_1': tensor(0.7232),
'Pclass_2': tensor(0.4112),
'Pclass_3': tensor(0.3601),
'Embarked_C': tensor(0.0955),
'Embarked_Q': tensor(0.2395),
'Embarked_S': tensor(0.2122)}
preds = calc_preds(coeffs, val_indep)
results = val_dep.bool()==(preds>0.5)
results[:16]
tensor([ True, True, True, True, True, True, True, True, True, True, False, False, False, True, True, False])
results.float().mean()
tensor(0.7865)
def acc(coeffs): return (val_dep.bool()==(calc_preds(coeffs, val_indep)>0.5)).float().mean()
acc(coeffs)
tensor(0.7865)
preds[:28]
tensor([ 0.8160, 0.1295, -0.0148, 0.1831, 0.1520, 0.1350, 0.7279, 0.7754, 0.3222, 0.6740, 0.0753, 0.0389, 0.2216, 0.7631,
0.0678, 0.3997, 0.3324, 0.8278, 0.1078, 0.7126, 0.1023, 0.3627, 0.9937, 0.8050, 0.1153, 0.1455, 0.8652, 0.3425])
import sympy
"1/(1+exp(-x))", xlim=(-5,5)); sympy.plot(
def calc_preds(coeffs, indeps): return torch.sigmoid((indeps*coeffs).sum(axis=1))
coeffs = train_model(lr=100)
0.510; 0.327; 0.294; 0.207; 0.201; 0.199; 0.198; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194;
acc(coeffs)
tensor(0.8258)
show_coeffs()
{'Age': tensor(-1.5061),
'SibSp': tensor(-1.1575),
'Parch': tensor(-0.4267),
'LogFare': tensor(0.2543),
'Sex_male': tensor(-10.3320),
'Sex_female': tensor(8.4185),
'Pclass_1': tensor(3.8389),
'Pclass_2': tensor(2.1398),
'Pclass_3': tensor(-6.2331),
'Embarked_C': tensor(1.4771),
'Embarked_Q': tensor(2.1168),
'Embarked_S': tensor(-4.7958)}
tst_df = pd.read_csv(path/'test.csv')
tst_df
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
418 rows × 11 columns
tst_df['Fare'] = tst_df.Fare.fillna(0)
tst_df.fillna(modes, inplace=True)
tst_df['LogFare'] = np.log(tst_df['Fare']+1)
tst_df = pd.get_dummies(tst_df, columns=["Sex","Pclass","Embarked"], dtype=float)

tst_indep = tensor(tst_df[indep_cols].values, dtype=torch.float)
tst_indep = tst_indep / vals
# use the trained coefficients to predict survival on the test set
tst_df['Survived'] = (calc_preds(tst_indep, coeffs)>0.5).int()
sub_df = tst_df[['PassengerId','Survived']]
sub_df
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 0 |
... | ... | ... |
413 | 1305 | 0 |
414 | 1306 | 1 |
415 | 1307 | 0 |
416 | 1308 | 0 |
417 | 1309 | 0 |
418 rows × 2 columns
Multiplying elements together and then adding across rows is identical to doing a matrix-vector product!
((val_indep*coeffs).sum(axis=1)).shape
torch.Size([178])
(val_indep@coeffs).shape
torch.Size([178])
def calc_preds(coeffs, indeps): return torch.sigmoid(indeps@coeffs)
# need coeffs to be matrix for matrix products later on
def init_coeffs(): return (torch.rand(n_coeff, 1)*0.1).requires_grad_()
# add new dimension
trn_dep = trn_dep[:,None]
val_dep = val_dep[:,None]
trn_dep.shape
torch.Size([713, 1])
coeffs = train_model(lr=100)
0.512; 0.323; 0.290; 0.205; 0.200; 0.198; 0.197; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194;
acc(coeffs)
tensor(0.8258)
First, we’ll need to create coefficients for each of our layers. Our first set of coefficients will take our `n_coeff` inputs, and create `n_hidden` outputs. We can choose whatever `n_hidden` we like – a higher number gives our network more flexibility, but makes it slower and harder to train. So we need a matrix of size `n_coeff` by `n_hidden`. We’ll divide these coefficients by `n_hidden` so that when we sum them up in the next layer we’ll end up with similar magnitude numbers to what we started with.
torch.rand(1)[0]
tensor(0.6722)
def init_coeffs(n_hidden=20):
    layer1 = (torch.rand(n_coeff, n_hidden)-0.5)/n_hidden
    layer2 = torch.rand(n_hidden, 1)-0.3
    const = torch.rand(1)[0]
    return layer1.requires_grad_(),layer2.requires_grad_(),const.requires_grad_()
Now we have our coefficients, we can create our neural net. The key steps are the two matrix products, `indeps@l1` and `res@l2` (where `res` is the output of the first layer). The first layer output is passed to `F.relu` (that’s our non-linearity), and the second is passed to `torch.sigmoid` as before.
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    l1,l2,const = coeffs        # get the two linear layers and bias term
    res = F.relu(indeps@l1)     # matrix product of independent variable values and first linear layer, passed through the non-linearity
    res = res@l2 + const        # matrix product of that result and second layer, plus the constant
    return torch.sigmoid(res)   # that result passed through sigmoid
Finally, now that we have more than one set of coefficients, we need to add a loop to update each one:
def update_coeffs(coeffs, lr):
    for layer in coeffs:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()

coeffs = train_model(lr=1.4)
0.543; 0.532; 0.520; 0.505; 0.487; 0.466; 0.439; 0.407; 0.373; 0.343; 0.319; 0.301; 0.286; 0.274; 0.264; 0.256; 0.250; 0.245; 0.240; 0.237; 0.234; 0.231; 0.229; 0.227; 0.226; 0.224; 0.223; 0.222; 0.221; 0.220;
coeffs = train_model(lr=20)
0.543; 0.400; 0.260; 0.390; 0.221; 0.211; 0.197; 0.195; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192;
acc(coeffs)
tensor(0.8258)
torch.rand(1)
tensor([0.1287])
# deep learning
def init_coeffs():
    hiddens = [10, 10]  # <-- set this to the size of each hidden layer you want
    sizes = [n_coeff] + hiddens + [1]  # inputs, hidden layers, output
    n = len(sizes)
    layers = [(torch.rand(sizes[i], sizes[i+1])-0.3)/sizes[i+1]*4 for i in range(n-1)]
    consts = [(torch.rand(1)[0]-0.5)*0.1 for i in range(n-1)]
    for l in layers+consts: l.requires_grad_()
    return layers,consts
def calc_preds(coeffs, indeps):
    layers,consts = coeffs
    n = len(layers)
    res = indeps
    for i,l in enumerate(layers):
        res = res@l + consts[i]
        if i!=n-1: res = F.relu(res)
    return torch.sigmoid(res)
def update_coeffs(coeffs, lr):
    layers,consts = coeffs
    for layer in layers+consts:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()

coeffs = train_model(lr=4)
0.521; 0.483; 0.427; 0.379; 0.379; 0.379; 0.379; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.377; 0.376; 0.371; 0.333; 0.239; 0.224; 0.208; 0.204; 0.203; 0.203; 0.207; 0.197; 0.196; 0.195;
acc(coeffs)
tensor(0.8258)
Continuing with video notes:
We initialized `coeffs` (coefficients) and a bias term `const`, and updated them by going through each of the `layers` and subtracting the gradient `.grad` multiplied by the learning rate `lr`.
In PyTorch, we don’t have to keep track of what our coefficients (or parameters, or weights) are, PyTorch does that for us. It does that by looking inside our `Module` and trying to find anything that looks like a tensor of neural net `Parameter`s, and it keeps track of them.
Creating our own model in PyTorch:
from fastai.collab import *
from fastai.tabular.all import *
class T(Module):
    def __init__(self): self.a = torch.ones(3)

L(T().parameters())
(#0) []
PyTorch looks inside our `Module` and keeps track of anything that looks like a tensor of neural network parameters. We can find out what parameters PyTorch knows about in our model by instantiating the model and then asking for its `parameters`.
The way you tell PyTorch what your parameters are is by putting them inside a special object called `nn.Parameter`, which hardly does anything. The key thing it does is that when PyTorch checks to see which parameters it should update when it optimizes, it just looks for anything that’s been wrapped in this class.
class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3))

# nn.Parameter by default assumes that we're going to want to require gradients
L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]
class T(Module):
    def __init__(self): self.a = nn.Linear(1, 3, bias=False)  # automatically considered a parameter by PyTorch

L(T().parameters())
(#1) [Parameter containing:
tensor([[-0.5822],
[ 0.4630],
[-0.4310]], requires_grad=True)]
t = T()
type(t.a.weight)
torch.nn.parameter.Parameter
We want to create something that works like an `Embedding`, which creates a matrix that will be trained as we train the model, something we can index into (during the `forward` pass).
`user_bias` will be a vector of parameters, `user_factors` will be a matrix.
When you put a tensor inside `nn.Parameter` it has all the features a tensor has (for example, we can index into it).
The `create_params` function is all that’s required to recreate PyTorch’s `Embedding` layer from scratch.
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:, 0]]
        movies = self.movie_factors[x[:, 1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)
Let’s see if it trains:
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
ratings = ratings.merge(movies)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.882364 | 0.953949 | 00:11 |
1 | 0.654987 | 0.886329 | 00:10 |
2 | 0.521378 | 0.869882 | 00:10 |
3 | 0.444553 | 0.858153 | 00:09 |
4 | 0.429629 | 0.853653 | 00:10 |
# a parameter containing a bunch of numbers that have been trained
model.movie_bias
Parameter containing:
tensor([-0.0033, -0.1991, 0.0095, ..., 0.0080, 0.1408, 0.0015],
requires_grad=True)
# 1665 movies
model.movie_bias.shape
torch.Size([1665])
In PyTorch, a method whose name ends in an underscore modifies the tensor it’s applied to in place.
torch.zeros([4])
tensor([0., 0., 0., 0.])
torch.zeros([4]).normal_(0, 0.01)
tensor([-0.0008, -0.0040, 0.0108, -0.0071])
We trained this model—but what did it do? How is it going about predicting who’s going to like what movie?
We can find which movies have the highest and lowest movie bias, and grab the names of those movies from our `DataLoaders` for each of those 5 lowest or highest numbers.
learn.model.movie_bias.shape, learn.model.movie_bias.squeeze().shape
(torch.Size([1665]), torch.Size([1665]))
The movies with the lowest movie_bias
values are some pretty crappy movies. Why is that? That’s because when it does that matrix product it’s trying to figure out who’s going to like what movie based on previous movies people have enjoyed or not, and then it adds movie bias, which can be positive or negative, that’s a different number for each movie. In order to do a good job at predicting whether you’re going to like a movie or not, it has to know which movies are crap. So that crap movies are going to end up with a very low movie bias parameter. We can find out not only which movies do people really not like, but which movies do people like less than one would expect given the kind of movie that it is?
So “Lawnmower Man 2”, not only is it a crappy movie but based on the kind of movie it is (kind of like a high-tech pop kind of sci-fi movie) people who like those kinds of movies still don’t like “Lawnmower Man 2”. In this way we can use a model not just to predict things but to understand things about the data.
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs], movie_bias.sort()[0][:5]
(['Lawnmower Man 2: Beyond Cyberspace (1996)',
'Children of the Corn: The Gathering (1996)',
'Grease 2 (1982)',
'Beverly Hills Ninja (1997)',
'Island of Dr. Moreau, The (1996)'],
tensor([-0.3312, -0.3286, -0.2666, -0.2655, -0.2578], grad_fn=<SliceBackward0>))
If we sort by descending order, it’ll give us the exact opposite. Here are movies that people enjoy, even when they don’t enjoy that kind of movie.
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs], movie_bias.sort(descending=True)[0][:5]
(['Shawshank Redemption, The (1994)',
'Star Wars (1977)',
'L.A. Confidential (1997)',
"Schindler's List (1993)",
'Titanic (1997)'],
tensor([0.5890, 0.5813, 0.5789, 0.5435, 0.5397], grad_fn=<SliceBackward0>))
We can do the same with users and find out which users just love movies, even the crappy ones, and vice versa.
# users who don't like any movies
user_bias = learn.model.user_bias.squeeze()
idxs = user_bias.argsort()[:5]
[dls.classes['user'][i] for i in idxs], user_bias.sort()[0][:5]
([181, 405, 724, 774, 445],
tensor([-0.7536, -0.5753, -0.4381, -0.4175, -0.3963], grad_fn=<SliceBackward0>))
# users who like all movies
user_bias = learn.model.user_bias.squeeze()
idxs = user_bias.argsort(descending=True)[:5]
[dls.classes['user'][i] for i in idxs], user_bias.sort(descending=True)[0][:5]
([907, 295, 507, 472, 849],
tensor([0.7339, 0.6750, 0.6721, 0.6689, 0.6160], grad_fn=<SliceBackward0>))
What about the latent factors? We can do something called Principal Component Analysis (PCA), which compresses those 50 columns of latent factors down to however many components you specify.
g = ratings.groupby('title')['rating'].count()
g
title
'Til There Was You (1997) 9
1-900 (1994) 5
101 Dalmatians (1996) 109
12 Angry Men (1957) 125
187 (1997) 41
...
Young Guns II (1990) 44
Young Poisoner's Handbook, The (1995) 41
Zeus and Roxanne (1997) 6
unknown 9
Á köldum klaka (Cold Fever) (1994) 1
Name: rating, Length: 1664, dtype: int64
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_movies[:5]
array(['Star Wars (1977)', 'Contact (1997)', 'Fargo (1996)',
'Return of the Jedi (1983)', 'Liar Liar (1997)'], dtype=object)
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
top_idxs[:5]
tensor([1399, 334, 499, 1235, 861])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_w.shape
torch.Size([1000, 50])
movie_pca = movie_w.pca(3)
movie_pca.shape
torch.Size([1000, 3])
fac0, fac1, fac2 = movie_pca.t()
fac0.shape, fac1.shape, fac2.shape
(torch.Size([1000]), torch.Size([1000]), torch.Size([1000]))
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]

plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x, y, i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()  # compressed view of the latent factors
fastai provides a `collab_learner`:
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.907180 | 0.950933 | 00:17 |
1 | 0.657494 | 0.902355 | 00:11 |
2 | 0.501019 | 0.877131 | 00:11 |
3 | 0.444013 | 0.865720 | 00:12 |
4 | 0.409778 | 0.861589 | 00:11 |
learn.model
EmbeddingDotBias(
(u_weight): Embedding(944, 50)
(i_weight): Embedding(1665, 50)
(u_bias): Embedding(944, 1)
(i_bias): Embedding(1665, 1)
)
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs], movie_bias.sort(descending=True)[0][:5]
(['Shawshank Redemption, The (1994)',
'L.A. Confidential (1997)',
'Good Will Hunting (1997)',
"Schindler's List (1993)",
'Star Wars (1977)'],
tensor([0.6044, 0.5852, 0.5647, 0.5546, 0.5231], grad_fn=<SliceBackward0>))
The fastai model for collaborative filtering (without a neural network) is pretty much identical to the `DotProductBias` model we created from scratch. Here’s its `forward` method:
def forward(self, x):
    users,items = x[:,0],x[:,1]
    dot = self.u_weight(users) * self.i_weight(items)
    res = dot.sum(1) + self.u_bias(users).squeeze() + self.i_bias(items).squeeze()
    if self.y_range is None: return res
    return torch.sigmoid(res) * (self.y_range[1]-self.y_range[0]) + self.y_range[0]
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']

# calculate how far apart each embedding is from the Silence of the Lambs embedding
# CosineSimilarity is basically the angle between the vectors
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
distances.shape
torch.Size([1665])
idx = distances.argsort(descending=True)[1]  # the closest movie to Silence of the Lambs
dls.classes['title'][idx]
'Casablanca (1942)'
We can use Deep Learning instead of dot products.
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(        # layers of a neural network in order
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1)
        )
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))  # concatenate the user and item embeddings with torch.cat
        return sigmoid_range(x, *self.y_range)
Ask fastai how big our NN embeddings should be (based on a formula that matches Jeremy’s intuition):
embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
get_emb_sz??
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.918632 | 0.965365 | 00:15 |
1 | 0.868667 | 0.928044 | 00:14 |
2 | 0.825990 | 0.911105 | 00:16 |
3 | 0.777693 | 0.879813 | 00:26 |
4 | 0.767727 | 0.872732 | 00:19 |
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.953472 | 1.010605 | 00:17 |
1 | 0.891541 | 0.936580 | 00:16 |
2 | 0.816217 | 0.904296 | 00:17 |
3 | 0.758014 | 0.884302 | 00:15 |
4 | 0.749216 | 0.878866 | 00:15 |
The dot product version is doing better because it’s taking advantage of our understanding of the problem domain. In practice, companies create a combined model that has a dot product component and also a neural net component. The neural net component is particularly helpful if you have metadata; you can concatenate that in with your embeddings.
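As a rough illustration of that idea (this is a sketch, not fastai's API; the class name `DotPlusNN` and the sizes are made up), a combined model might add a dot product signal to a small neural net that also sees concatenated metadata features:

import torch
import torch.nn as nn

class DotPlusNN(nn.Module):                      # hypothetical combined model
    def __init__(self, n_users, n_items, n_factors, n_meta, n_act=100):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.i = nn.Embedding(n_items, n_factors)
        self.net = nn.Sequential(                # neural net component
            nn.Linear(n_factors*2 + n_meta, n_act),   # embeddings concatenated with metadata
            nn.ReLU(),
            nn.Linear(n_act, 1))

    def forward(self, user, item, meta):
        u, i = self.u(user), self.i(item)
        dot = (u*i).sum(dim=1, keepdim=True)              # dot product component
        x = self.net(torch.cat([u, i, meta], dim=1))      # NN component sees embeddings + metadata
        return (dot + x).squeeze(1)                       # combine the two signals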
In collaborative filtering in general there’s an issue where a small number of users and movies can overwhelm everybody else. A classic one is anime (a small number of viewers who watch a lot of it). You have to be careful about these subtle issues, which involves taking various ratios or normalizing things.
Embeddings are not just for collaborative filtering. You’ve probably heard about them in the context of Natural Language Processing (NLP). How do we go about using text as inputs to models? You can turn words into integers by taking the unique words from a text and assigning each an id. We then create an embedding matrix for those words. To give this text to a neural net, we list out our words, and for each word we look up the word id (`MATCH` in Excel) and then find that word’s embeddings using `OFFSET`. You can then train the embeddings and then interpret them as we’ve done with the movie bias and the latent factors.
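For instance, a tiny sketch of that word-id-then-lookup step in PyTorch (the toy vocabulary and embedding size here are made up for illustration):

import torch
import torch.nn as nn

vocab = ['the', 'movie', 'was', 'great']            # toy vocabulary of unique words
word2id = {w: i for i, w in enumerate(vocab)}       # like MATCH: word -> integer id
emb = nn.Embedding(len(vocab), 4)                   # one trainable 4-dim vector per word

ids = torch.tensor([word2id[w] for w in ['the', 'movie', 'was', 'great']])
print(emb(ids).shape)                               # torch.Size([4, 4]) -- like OFFSET into the matrix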
Our different models, and the inputs to them, are based on a relatively small number of basic principles. These principles are generally things like “look up something in an array.” And then we know that inside the model we’re multiplying things, adding them up, and replacing the negatives with 0s.
In `tabular_learner`, fastai creates an `Embedding` for each of the categorical variables (from number of inputs to number of factors based on `get_emb_sz`). In its `forward` pass, if there are embeddings it goes through and passes the inputs into them, concatenates the results, and runs it through the neural net layers.
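A rough sketch of that structure (this is not fastai's actual `TabularModel`; the class name and sizes are made up to show the idea of per-column embeddings concatenated with continuous inputs):

import torch
import torch.nn as nn

class MiniTabularModel(nn.Module):                  # hypothetical simplified version
    def __init__(self, emb_szs, n_cont, n_out, n_act=100):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(nf for _, nf in emb_szs)
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, n_act), nn.ReLU(), nn.Linear(n_act, n_out))

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]  # look up each categorical column
        return self.layers(torch.cat(x + [x_cont], dim=1))       # concatenate and run through the net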
You can create your neural net, get your trained embeddings, and put those embeddings into a random forest or gradient boosted tree, and your mean average percent error will dramatically improve.
Convolutions
We’ve learned about what goes into the model (categories, embeddings, or continuous numbers). We’ve learned about what comes out the other side (a bunch of activations—a tensor of numbers) which we can use things like softmax to constrain them to add up to 1. We’ve looked at what can go in the middle which is the matrix multiplication sandwiched together with rectified linear units. There are other things that can go in the middle, which is convolutions (another kind of matrix multiplication).
Convolutional Neural Networks are similar to what we’ve seen so far (inputs, things that are a form of matrix multiplication, sandwiched with activation functions). But there’s a particular thing that makes them very useful for computer vision.
Back in the mid-90s, Yann LeCun showed really practically useful performance on this dataset, which resulted in convnets being used in the American banking system for reading checks.
In the Excel file, Jeremy has recreated a 28x28 cell “7” from the MNIST dataset and is multiplying each 3x3 cells with the following filter:
1 | 1 | 1 |
0 | 0 | 0 |
-1 | -1 | -1 |
and taking the max of that dot product and 0. It’s like ReLU but it’s not doing a matrix product, it’s doing a dot product just on those 9 cells (3x3) and just those 9 weights (the 3x3 “filter”). When you move one to the right, it’s using the next 9 cells, and so on.
A convolution is when you slide a little 3x3 matrix across a bigger matrix and at each location you do a dot product of that 3x3 matrix with the 3x3 matrix of coefficients. Why does that create something that finds something like top edges? It’s because of the way we’ve constructed the coefficient matrix.
All of the rows just above are going to get a `1`. All of the ones just below are going to get a `-1`. And all of the ones in the middle are going to get a `0`. When the image’s 3x3 is:
1 | 1 | 1 |
1 | 1 | 1 |
1 | 1 | 1 |
Multiplying it by the filter:
1 | 1 | 1 |
0 | 0 | 0 |
-1 | -1 | -1 |
Gives us `0`. But when the image’s 3x3 is something like:
1 | 1 | 1 |
0.8 | 0.8 | 0.8 |
0 | 0 | 0 |
Multiplying it by the filter gives us `3`. We’ll only get such `3`s when the image’s 3x3 has the top row as dark as possible (`1`) and the bottom row blank (`0`). That’s only going to happen at a horizontal edge.
A vertical edge detector is the filter of coefficients:
1 | 0 | -1 |
1 | 0 | -1 |
1 | 0 | -1 |
The dot product will only be 3 where the 3x3’s leftmost column is 1’s and the rightmost is 0’s.
You can think of a convolution as being a sliding window, of little mini dot products of these little 3x3 matrices. They don’t have to be 3x3 we could have just as easily done 5x5 then we’d have a 5x5 matrix of coefficients. Whatever size you like. The size is called the kernel size. A 3x3 kernel for this convolution.
We repeat these steps again and again. In the second layer we now have two channels. In the first layer we just had one (the grayscale original image). The two channels are the horizontal edges channel and the vertical edges channel. Our filter is now 3x3x2 or two 3x3 kernels or one 3x3x2 kernel. It combines the horizontal and the vertical edge detectors.
We’ll eventually end up with a single set of 10 activations (one for each digit 0-9) or 1 activation (7 or not-7). We’d backpropagate through these calculations using SGD, and that is going to end up optimizing the coefficients in the filters. In real life you start with random numbers and then optimize them with SGD (instead of the manual edge detectors Jeremy instantiated).
A few years ago what we used was max pooling, which is like a convolution except you don’t take a dot product, you take the max of a sliding window (in our case, a 2x2 max pooling). With a 2x2 max pooling we lose half of our activations on each dimension, so we end up with only a quarter of the activations we started with. And that’s a good thing, because if we keep doing conv layers and max pools we’ll have fewer and fewer activations; then we take a dot product of those with a bunch of coefficients (a dense layer) for each channel and add them all up for our final big dot product. MNIST would have 10 such final activations, with a softmax layer after that.
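As a quick check of that arithmetic, assuming a random 28x28 activation map:

import torch
import torch.nn.functional as F

acts = torch.randn(1, 1, 28, 28)         # batch x channels x height x width
pooled = F.max_pool2d(acts, 2)           # 2x2 max pooling: keep the max of each 2x2 window
print(pooled.shape)                      # torch.Size([1, 1, 14, 14]) -- a quarter of the activations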
Nowadays we normally don’t have max pool layers. But instead when we do our sliding window, we skip one every time we move to the next 3x3 (after doing column I we skip column J and go straight to K). That’s called a “stride 2” convolution. So every time we do a convolution we reduce our effective feature size by 2 on each axis (reducing by 4x in total), instead of doing max pooling.
The other thing is nowadays we don’t have a single dense layer but instead we keep doing stride-2 convolutions until we’ve got about a 7x7 grid, and then we do a single pooling at the end (average instead of max). So we average the activations of each one of the 7x7 features. This is important to know because something like an ImageNet-style image detector is going to end up with a 7x7 grid for “is this a bear?” and for each of the 7x7 squares it’s seeing if there is a bear in that part of the photo, and it takes the average of those 49 predictions to decide whether there’s a bear in the photo. That works very well if it’s basically a photo of a bear. If the bear is big and takes up most of the frame, then most of the 7x7 bits are bits of the bear. On the other hand, if there’s a teeny tiny bear in the corner, then potentially only one of those 49 squares has a bear in it. Even worse, if it’s a picture with lots of different things, only one of which is a bear, it could end up being not a good bear detector. The details of how we construct our model turn out to be important. If you’re trying to find one part of the photo that has a bear in it, you might want to try max pooling (“I think this is a picture of a bear if any one of the 49 bits has a bear in it”). The max/average pool is happening right at the very end. fastai does max pool and average pool and concatenates them together (concat pooling), and that has since been reinvented in at least one paper.
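A minimal sketch of that concat pooling using plain PyTorch pooling functions, assuming a 7x7 grid of 512 features (fastai packages this up as `AdaptiveConcatPool2d`):

import torch
import torch.nn.functional as F

acts = torch.randn(1, 512, 7, 7)                 # e.g. the final 7x7 grid of features
avg = F.adaptive_avg_pool2d(acts, 1)             # "is there a bear on average?"
mx  = F.adaptive_max_pool2d(acts, 1)             # "is there a bear anywhere?"
pooled = torch.cat([mx, avg], dim=1)             # concatenate the two poolings along channels
print(pooled.shape)                              # torch.Size([1, 1024, 1, 1])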
Convolution is the same thing as a matrix multiplication. Here is convolution as a sliding window. The kernel:
\(\alpha\) | \(\beta\) |
\(\gamma\) | \(\delta\) |
and a 3x3 image:
A | B | C |
D | E | F |
G | H | J |
and the resulting convolution
P | Q |
R | S |
I’ll show the first sliding window multiplication (in italics/bold):
A | B | C |
D | E | F |
G | H | J |
Which is:
\(\alpha A+\beta B+ \gamma D + \delta E + b = P\)
and the rest of the sliding windows:
A | B | C |
D | E | F |
G | H | J |
\(\alpha B+\beta C+ \gamma E + \delta F + b = Q\)
A | B | C |
D | E | F |
G | H | J |
\(\alpha D+\beta E+ \gamma G + \delta H + b = R\)
A | B | C |
D | E | F |
G | H | J |
\(\alpha E+\beta F+ \gamma H + \delta J + b = S\)
We can also write it as a matrix multiplication. This matrix of kernel or filter values:
\(\alpha\) | \(\beta\) | 0 | \(\gamma\) | \(\delta\) | 0 | 0 | 0 | 0 |
0 | \(\alpha\) | \(\beta\) | 0 | \(\gamma\) | \(\delta\) | 0 | 0 | 0 |
0 | 0 | 0 | \(\alpha\) | \(\beta\) | 0 | \(\gamma\) | \(\delta\) | 0 |
0 | 0 | 0 | 0 | \(\alpha\) | \(\beta\) | 0 | \(\gamma\) | \(\delta\) |
Multiplied by a column of pixels:
A | |
B | |
C | |
D | |
E | |
F | |
G | |
H | |
J |
plus a column of biases:
b | |
b | |
b | |
b |
yields the convolution:
P | |
Q | |
R | |
S |
In practice it’s faster to do it as a sliding window, but this is a useful way to think about convolution as a special type of matrix multiplication.
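Here is a small numeric check of that equivalence for a 2x2 kernel and 3x3 image as above (the particular values are made up; the loop just builds the same zero-padded matrix of kernel values as in the tables):

import torch
import torch.nn.functional as F

img = torch.arange(9.).reshape(1, 1, 3, 3)            # the 3x3 "image" A..J
k = torch.tensor([[1., 2.], [3., 4.]])                # the 2x2 kernel (alpha..delta)
b = torch.tensor([0.5])                               # bias

conv = F.conv2d(img, k.reshape(1, 1, 2, 2), bias=b)   # sliding-window convolution

W = torch.zeros(4, 9)                                 # matrix-multiplication version
for out, (r, c) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):   # the four window positions
    for i in range(2):
        for j in range(2):
            W[out, (r + i) * 3 + (c + j)] = k[i, j]   # kernel value goes where that pixel sits
matmul = W @ img.flatten() + b

print(torch.allclose(conv.flatten(), matmul))         # True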
Dropout
Same convolutions as before, followed by a bunch of random numbers. We define a dropout factor (from 0.0 to 0.9) which we use to create a dropout mask (if the random number in a given cell/pixel is greater than the dropout factor, we use that random number, otherwise we set it to 0). We start with the image and then corrupt it (random bits of it have been deleted). A higher dropout factor will delete more of the picture. That “corrupted” image is the input to the next layer (which is max pool in our example).
Why would we delete some data at random from our processed image/activations after convolutions? The reason is that a human is able to look at the corrupted image and still recognize it’s a 7. A computer should be able to as well. If we randomly delete different bits of the activations each time, then the computer is forced to learn the underlying real representation rather than overfitting. Think of this as data augmentations for the activations. This is called a Dropout layer, which is really helpful for avoiding overfitting. The more dropout you use, the less good it will be on the training data but the better it ought to generalize.
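A minimal sketch of that dropout mask idea with a dropout factor `p` (note that `nn.Dropout` additionally scales the surviving activations by `1/(1-p)` at training time, rather than keeping the raw random numbers as in the Excel illustration):

import torch

torch.manual_seed(0)
p = 0.5                                       # dropout factor
acts = torch.randn(4, 4)                      # some activations
mask = (torch.rand(4, 4) > p).float()         # 1 where we keep, 0 where we "delete"
dropped = acts * mask / (1 - p)               # corrupted activations, rescaled like nn.Dropout
print(mask)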
A different set of activations will be deleted each batch. Dropout was initially rejected by NIPS, disseminated by arxiv. Peer-review is a very fallible thing in both directions.
Part 1 Summary
We’ve seen quite a few ways of dealing with inputs to neural networks, things that can happen in the middle of the NN. We’ve talked about Rectified Linear Units (ReLU) (0 if x is less than 0 or x otherwise), there are other activations you can use (except for Identity, with which you end up with a linear model)—these don’t matter very much, any non-linearity works fine–inputs can be one-hot encoded (or embeddings which is a computational shortcut), there are sandwiched layers of matrix multipliers and activation functions, matrix multipliers can be special cases like convolutions or embeddings, the output can go through some tweaking such as Softmax, and you’ve got the loss function such as cross-entropy loss or mean squared error or mean absolute error.
AMA
Read Radek’s book “Meta Learning”. One of the fastai alums went on to create the Mish activation function now used in many SOTA models around the world and is now at Mila, one of the top research labs of the world.
How do you stay motivated? You don’t have to know everything: nobody knows everything and that’s okay. Take an interest in some area and follow that and do the best job of keeping up with some little sub-area. If your sub-area is too much to keep up on, pick a sub-sub-area. From time to time, take a dip into other areas you’re not following as closely. Things are not changing that fast at all. Fundamentally the stuff that is in the course now is not that different to what was in the course five years ago. The foundations haven’t changed. It’s not that different to the convolutional neural network that Yann LeCun used on MNIST back in 1996. The basic ideas are forever. Everything else is tweaks. The more you learn about the basic ideas, the more you’ll recognize those tweaks as simple little tricks that you’ll quickly be able to get your head around.
The key thing to creating a legitimate business venture is to solve a legitimate problem. A problem that people need solving and will pay you to solve. It’s important not to start with your fun gradio prototype as the basis for your business, but instead start with: here’s a problem that I want to solve. Pick a problem that you understand better than most people. Eric Reis wrote “The Lean Startup” who recommends that what you do next is you fake it. You create a Minimum Viable Product—something that solves the problem that takes as little time to create. It could be very manual, it could be loss making, that’s fine. The bit in the middle where there’s going to be a neural net—you launch without it and do everything by hand. You’re just trying to find out: “are people going to pay for this? is this actually useful?” Once you have confirmed that the need is real and that people will pay for it and you can solve the need you can gradually make it less fake and more and more getting the product to where you want it to be.
Productivity hacks: don’t work too much. Jeremy spends fewer hours a day working than most people. Jeremy has spent half of every working day since he was 18 learning or practicing something new, doing it more slowly than if he used something that he already knew. In the other 50% of the time he’s constantly building up, exponentially, a base of expertise in a wide range of areas so he can do things multiples or orders of magnitude faster than the people around him. Try not to overdo things; get good sleep, eat well and exercise well. It’s also a case of tenacity: Jeremy has noticed a lot of people give up much earlier than he does. If you just keep going until something’s actually finished (nicely), then that’s going to put you in a small minority. Most people don’t do that. Jeremy makes things like nbdev that make it easier to finish something nicely. Make the things that you want to do easier so that you’ll do them more.
Book Notes
feature engineering: creating new transformations of the input data in order to make it easier to model.
In the context of an image, a feature is a visually distinctive attribute.
Finding the edges in an image is a very common task in computer vision and to do it we use something called a convolution which requires nothing more than multiplication and addition. Let’s do this with code:
from fastai.vision.all import *
matplotlib.rc('image', cmap='Greys')
top_edge = tensor([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]])

# this is our kernel
top_edge
tensor([[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]])
path = untar_data(URLs.MNIST_SAMPLE)
im3 = Image.open(path/'train'/'3'/'12.png')
show_image(im3);
Take the top 3x3-pixel square of our image, multiply each of those values by the corresponding item in our kernel, then add them up.
im3_t = tensor(im3)
im3_t[0:3, 0:3] * top_edge
tensor([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
(im3_t[0:3, 0:3] * top_edge).sum()
tensor(0)
# more interesting results
df = pd.DataFrame(im3_t[:10, :20])
df.style.set_properties(**{'font-size': '6pt'}).background_gradient('Greys')
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 12 | 99 | 91 | 142 | 155 | 246 | 182 | 155 | 155 | 155 | 155 | 131 | 52 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 138 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 252 | 210 | 122 | 33 | 0 |
7 | 0 | 0 | 0 | 220 | 254 | 254 | 254 | 235 | 189 | 189 | 189 | 189 | 150 | 189 | 205 | 254 | 254 | 254 | 75 | 0 |
8 | 0 | 0 | 0 | 35 | 74 | 35 | 35 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 224 | 254 | 254 | 153 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90 | 254 | 254 | 247 | 53 | 0 |
im3_t[4:7, 6:9]  # top edge
tensor([[ 0, 0, 0],
[142, 155, 246],
[254, 254, 254]], dtype=torch.uint8)
im3_t[4:7, 6:9] * top_edge
tensor([[ 0, 0, 0],
[ 0, 0, 0],
[254, 254, 254]])
(im3_t[4:7, 6:9] * top_edge).sum()
tensor(762)
im3_t[7:10, 17:20]  # right edge
tensor([[254, 75, 0],
[254, 153, 0],
[247, 53, 0]], dtype=torch.uint8)
im3_t[7:10, 17:20] * top_edge
tensor([[-254, -75, 0],
[ 0, 0, 0],
[ 247, 53, 0]])
(im3_t[7:10, 17:20] * top_edge).sum()
tensor(-29)
This calculation is returning a high number where the 3x3-pixel square represents a top edge (where there are low values at the top of the square and high values immediately underneath)—in that case the -1 values in our kernel have little impact.
Looking at the math, any window of size 3x3 in our image:
a1 | a2 | a3 |
a4 | a5 | a6 |
a7 | a8 | a9 |
Multiplying by a kernel:
1 | 1 | 1 |
0 | 0 | 0 |
-1 | -1 | -1 |
Will return:
a1 + a2 + a3 - a7 - a8 - a9.
If a1 = a7, a2 = a8, and a3 = a9, we’ll get 0. If a1 > a7, a2 > a8 and a3 > a9 we’ll get a positive number. This filter detects horizontal edges.
The kernel:
-1 | -1 | -1 |
0 | 0 | 0 |
1 | 1 | 1 |
detects horizontal edges where we go from light to dark.
The kernel:
1 | 1 | 1 |
0 | 0 | 0 |
-1 | -1 | -1 |
detects horizontal edges where we go from dark to light.
The kernel:
1 | 0 | -1 |
1 | 0 | -1 |
1 | 0 | -1 |
detects vertical edges where we go from dark (left) to light (right).
The kernel:
-1 | 0 | 1 |
-1 | 0 | 1 |
-1 | 0 | 1 |
detects vertical edges where we go from light (left) to dark (right).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from torch import tensor
# Create the tensor
data = tensor([[254, 0, 0],
               [254, 0, 0],
               [254, 0, 0]])

# Convert the tensor to a Pandas DataFrame
df = pd.DataFrame(data.numpy())

# Plot the heatmap
plt.figure(figsize=(3, 3))
sns.heatmap(df, annot=True, cmap='Greys', cbar=False, linewidths=.5, fmt='d')
plt.show()
# vertical edge detector (light to dark - left to right)
k = tensor([[-1, 0, 1],
            [-1, 0, 1],
            [-1, 0, 1]])

data = tensor([[254, 0, 0],
               [254, 0, 0],
               [254, 0, 0]])

res = data * k
# Convert the tensor to a Pandas DataFrame
df = pd.DataFrame(res.numpy())

# Plot the heatmap
plt.figure(figsize=(3, 3))
sns.heatmap(df, annot=True, cmap='Greys', cbar=False, linewidths=.5, fmt='d')
plt.show()
# vertical edge detector (dark to light - left to right)
k = tensor([[1, 0, -1],
            [1, 0, -1],
            [1, 0, -1]])

data = tensor([[254, 0, 0],
               [254, 0, 0],
               [254, 0, 0]])

res = data * k
# Convert the tensor to a Pandas DataFrame
df = pd.DataFrame(res.numpy())

# Plot the heatmap
plt.figure(figsize=(3, 3))
sns.heatmap(df, annot=True, cmap='Greys', cbar=False, linewidths=.5, fmt='d')
plt.show()
Let’s create a function to do this for one location, and check that it matches our result from before:
def apply_kernel(row, col, kernel):
    return (im3_t[row-1:row+2, col-1:col+2] * kernel).sum()

apply_kernel(5, 7, top_edge)
tensor(762)
l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
l[5-1:5+2], l[7-1:7+2]
([4, 5, 6], [6, 7, 8])
254*3
762
Note that we can’t apply this to a corner since there isn’t a complete 3x3 square there.
Mapping a Convolutional Kernel
We can map apply_kernel
across the coordinate grid—taking our 3x3 kernel and applying it to each 3x3 section of our image.
To get a grid of coordinates, use a nested list comprehension:
[[(i, j) for j in range(1,5) for i in range(1,5)]]
[[(1, 1),
(2, 1),
(3, 1),
(4, 1),
(1, 2),
(2, 2),
(3, 2),
(4, 2),
(1, 3),
(2, 3),
(3, 3),
(4, 3),
(1, 4),
(2, 4),
(3, 4),
(4, 4)]]
[[(i, j) for i in range(1,5) for j in range(1,5)]]
[[(1, 1),
(1, 2),
(1, 3),
(1, 4),
(2, 1),
(2, 2),
(2, 3),
(2, 4),
(3, 1),
(3, 2),
(3, 3),
(3, 4),
(4, 1),
(4, 2),
(4, 3),
(4, 4)]]
[[(i, j) for i in ['inner1', 'inner2', 'inner3'] for j in ['outer1', 'outer2', 'outer3']]]
[[('inner1', 'outer1'),
('inner1', 'outer2'),
('inner1', 'outer3'),
('inner2', 'outer1'),
('inner2', 'outer2'),
('inner2', 'outer3'),
('inner3', 'outer1'),
('inner3', 'outer2'),
('inner3', 'outer3')]]
# applying kernel over coordinate grid
rng = range(1,27)
top_edge3 = tensor([[apply_kernel(i, j, top_edge) for j in rng] for i in rng])
show_image(top_edge3);
tensor([[apply_kernel(i, j, top_edge) for j in rng] for i in rng]).shape, \
tensor([[apply_kernel(i, j, top_edge) for j in rng for i in rng]]).shape, \
tensor([[apply_kernel(i, j, top_edge) for i in rng] for j in rng]).shape, \
tensor([[apply_kernel(i, j, top_edge)] for j in rng for i in rng]).shape
(torch.Size([26, 26]),
torch.Size([1, 676]),
torch.Size([26, 26]),
torch.Size([676, 1]))
# left edge
left_edge = tensor([[-1, 1, 0],
                    [-1, 1, 0],
                    [-1, 1, 0]]).float()

left_edge3 = tensor([[apply_kernel(i, j, left_edge) for j in rng] for i in rng])
show_image(left_edge3);
# right edge
right_edge = tensor([[0, 1, -1],
                     [0, 1, -1],
                     [0, 1, -1]]).float()

right_edge3 = tensor([[apply_kernel(i, j, right_edge) for j in rng] for i in rng])
show_image(right_edge3);
top_edge
tensor([[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]])
# bottom edge
bottom_edge = tensor([[ 1,  1,  1],
                      [ 0,  0,  0],
                      [-1, -1, -1]]).float()

bottom_edge3 = tensor([[apply_kernel(i, j, bottom_edge) for j in rng] for i in rng])
show_image(bottom_edge3);
# top right diagonal
top_right_diagonal = tensor([[1, -1, -1],
                             [0,  1, -1],
                             [0,  0,  1]]).float()

top_right_diagonal3 = tensor([[apply_kernel(i, j, top_right_diagonal) for j in rng] for i in rng])
show_image(top_right_diagonal3);
# top left diagonal
top_left_diagonal = tensor([[-1, -1, 0],
                            [-1,  0, 1],
                            [ 0,  1, 1]]).float()

top_left_diagonal3 = tensor([[apply_kernel(i, j, top_left_diagonal) for j in rng] for i in rng])
show_image(top_left_diagonal3);
# bottom right diagonal
bottom_right_diagonal = tensor([[1,  1,  0],
                                [1,  0, -1],
                                [0, -1, -1]]).float()

bottom_right_diagonal3 = tensor([[apply_kernel(i, j, bottom_right_diagonal) for j in rng] for i in rng])
show_image(bottom_right_diagonal3);
# bottom left diagonal
bottom_left_diagonal = tensor([[ 0,  1, 1],
                               [-1,  0, 1],
                               [-1, -1, 0]]).float()

bottom_left_diagonal3 = tensor([[apply_kernel(i, j, bottom_left_diagonal) for j in rng] for i in rng])
show_image(bottom_left_diagonal3);
An image with height `h` and width `w` will have `h-2` by `w-2` 3x3 windows. In our case we have a 28x28 image, and a 26x26 result from the convolutions.
Convolutions in PyTorch
PyTorch wants a rank-4 tensor as `input` (minibatch, in_channels, iH, iW) and `weight` (out_channels, in_channels, kH, kW) so that it can apply a convolution to multiple images at the same time (every item in a batch at once) and apply multiple kernels at the same time.
diag1_edge = tensor([[ 0, -1, 1],
                     [-1,  1, 0],
                     [ 1,  0, 0]]).float()

diag1_edge3 = tensor([[apply_kernel(i, j, diag1_edge) for j in rng] for i in rng])
show_image(diag1_edge3);
diag2_edge = tensor([[1, -1,  0],
                     [0,  1, -1],
                     [0,  0,  1]])

diag2_edge3 = tensor([[apply_kernel(i, j, diag2_edge) for j in rng] for i in rng])
show_image(diag2_edge3);
edge_kernels = torch.stack([left_edge, top_edge, diag1_edge, diag2_edge])
edge_kernels.shape
torch.Size([4, 3, 3])
mnist = DataBlock((ImageBlock(cls=PILImageBW), CategoryBlock),
                  get_items=get_image_files,
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)

dls = mnist.dataloaders(path)
xb, yb = first(dls.valid)
xb.shape
torch.Size([64, 1, 28, 28])
# by default fastai puts batches onto the GPU when using DataBlocks
xb, yb = to_cpu(xb), to_cpu(yb)
A channel is a single basic color in an image. PyTorch represents an image as a rank-3 tensor with these dimensions:
[channels, rows, columns]
Kernels passed to `F.conv2d` need to be rank-4 tensors: `[features_out, channels_in, rows, columns]`.
`edge_kernels` is currently missing a dimension. We need to tell PyTorch that the number of input channels in the kernel is 1, which we can do by inserting an axis of size 1 (called a unit axis) at position 1.
edge_kernels.shape, edge_kernels.unsqueeze(1).shape
(torch.Size([4, 3, 3]), torch.Size([4, 1, 3, 3]))
edge_kernels = edge_kernels.unsqueeze(1)
batch_features = F.conv2d(xb, edge_kernels)
batch_features.shape
torch.Size([64, 4, 26, 26])
show_image(batch_features[0,0]);  # left edge
show_image(batch_features[0,1]);  # top edge
show_image(batch_features[0,2]);  # diag1
show_image(batch_features[0,3]);  # diag2
To become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time. Our manual convolution loop would be millions of times slower.
To avoid losing 2 pixels on each axis, we add padding (commonly zeroes).
Strides and Padding
If we use a kernel of size `ks` by `ks` (where `ks` is an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even `ks` would require a different amount of padding on the top/bottom and left/right, but in practice we almost never use an even filter size.
stride-2: move over two pixels after each kernel application, useful for decreasing the size of our outputs.
stride-1 convolutions are useful for adding layers without changing the output size.
The most common kernel size in practice is 3x3 and the most common padding is 1.
The general formula for the output size, given input image dimension `n`, padding `pad`, stride `stride` and kernel size `ks`:
(n + 2*pad - ks) // stride + 1
So for a 5x5 image with a 3x3 kernel, stride-2 and 1 pixel of padding:
(5 + 2 * 1 - 3) // 2 + 1 = 4/2 + 1 = 3
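We can sanity-check that formula against an actual PyTorch conv layer:

import torch
import torch.nn as nn

def conv_out_size(n, pad, stride, ks): return (n + 2*pad - ks) // stride + 1

x = torch.randn(1, 1, 5, 5)                              # 5x5 input
y = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x)
print(conv_out_size(5, 1, 2, 3), y.shape)                # 3, torch.Size([1, 1, 3, 3])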
When looking at convolution as a matrix multiplication, it has two properties:
The zeros in the matrix are untrainable. They stay 0 throughout the optimization process.
Some of the weights are equal and while they are trainable (i.e. changeable), they must remain equal. These are called shared weights.
Our First Convolutional Neural Network
There is no reason to believe that some particular edge filters are the most useful kernels for image recognition. We don’t have a good idea for how to manually construct lower layer filters (of which later layer convolution kernels become complex transformations). Have the model learn the values of the kernels. When we use convolutions instead of (or in addition to) regular linear layers we create a convolutional neural network (CNN).
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)
simple_net
Sequential(
(0): Linear(in_features=784, out_features=30, bias=True)
(1): ReLU()
(2): Linear(in_features=30, out_features=1, bias=True)
)
Use convolutional layers instead of linear.
broken_cnn = sequential(
    nn.Conv2d(1, 30, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(30, 1, kernel_size=3, padding=1)
)
We don’t need to specify `28*28` as the input size because the convolution is applied over each pixel automatically. The weights depend only on the number of input and output channels and the kernel size.
xb.shape
torch.Size([64, 1, 28, 28])
broken_cnn(xb).shape
torch.Size([64, 1, 28, 28])
show_image(broken_cnn(xb)[0,0]);
nn.Conv2d(1, 30, kernel_size=3, padding=1)(xb).shape
torch.Size([64, 30, 28, 28])
show_image(nn.Conv2d(1, 30, kernel_size=3, padding=1)(xb)[0,29]);
We can perform enough stride-2 convolutions to get this down to a single value for classification. 28x28 -> 14x14 -> 7x7 -> 4x4 -> 2x2 -> 1x1.
def conv(ni, nf, ks=3, act=True):
    res = nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())
    return res
When using stride-2, increase the number of features at the same time because we are decreasing the number of activations by 4 (we don’t want to decrease the capacity of a layer by too much at a time).
simple_cnn = sequential(
    conv(1, 4),             # 14x14
    conv(4, 8),             # 7x7
    conv(8, 16),            # 4x4
    conv(16, 32),           # 2x2
    conv(32, 2, act=False), # 1x1
    Flatten()
)
simple_cnn(xb).shape
torch.Size([64, 2])
conv(1,4)(xb).shape
torch.Size([64, 4, 14, 14])
conv(4, 8)(
    conv(1,4)(xb)).shape
torch.Size([64, 8, 7, 7])
conv(8, 16)(
    conv(4, 8)(
        conv(1,4)(xb))).shape
torch.Size([64, 16, 4, 4])
conv(16,32)(
    conv(8, 16)(
        conv(4, 8)(
            conv(1,4)(xb)))).shape
torch.Size([64, 32, 2, 2])
conv(32, 2, act=False)(
    conv(16,32)(
        conv(8, 16)(
            conv(4, 8)(
                conv(1,4)(xb))))).shape
torch.Size([64, 2, 1, 1])
Flatten()(
    conv(32, 2, act=False)(
        conv(16,32)(
            conv(8, 16)(
                conv(4, 8)(
                    conv(1,4)(xb)))))).shape
torch.Size([64, 2])
# create our Learner
learn = Learner(dls, simple_cnn, loss_func=F.cross_entropy, metrics=accuracy)
learn.summary()
Sequential (Input shape: 64 x 1 x 28 x 28)
============================================================================
Layer (type) Output Shape Param # Trainable
============================================================================
64 x 4 x 14 x 14
Conv2d 40 True
ReLU
____________________________________________________________________________
64 x 8 x 7 x 7
Conv2d 296 True
ReLU
____________________________________________________________________________
64 x 16 x 4 x 4
Conv2d 1168 True
ReLU
____________________________________________________________________________
64 x 32 x 2 x 2
Conv2d 4640 True
ReLU
____________________________________________________________________________
64 x 2 x 1 x 1
Conv2d 578 True
____________________________________________________________________________
64 x 2
Flatten
____________________________________________________________________________
Total params: 6,722
Total trainable params: 6,722
Total non-trainable params: 0
Optimizer used: <function Adam at 0x790bb9f92830>
Loss function: <function cross_entropy at 0x790c7f74e950>
Callbacks:
- TrainEvalCallback
- CastToTensor
- Recorder
- ProgressCallback
`Flatten` is like PyTorch’s `squeeze` but as a `Module`.
Let’s train! Since this is a deeper network than we’ve built from scratch before we’ll use a lower learning rate and more epochs:
learn.fit_one_cycle(2, 0.01)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.058796 | 0.036284 | 0.988714 | 00:28 |
1 | 0.022334 | 0.025466 | 0.991168 | 00:22 |
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv2d(input, weight, bias, self.stride,
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Understanding Convolution Arithmetic
Input size is `64x1x28x28`, which is `batch, channel, height, width`.
First layer of the model:
m = learn.model[0]
# 1 input channel, 4 output channels, 3x3 kernel
m
Sequential(
(0): Conv2d(1, 4, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): ReLU()
)
# layer 1 weights
m[0].weight.shape
torch.Size([4, 1, 3, 3])
4 x 1 x 3 x 3 = 36 weights, but `learn.summary` says this layer has 40 params. What are the other 4? Bias!
m[0].bias.shape  # one bias for each channel
torch.Size([4])
learn.model[1]
Sequential(
(0): Conv2d(4, 8, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): ReLU()
)
learn.model[1][0].weight.shape
torch.Size([8, 4, 3, 3])
8*4*3*3
288
288 params + 8 bias values = 296 params.
Ignoring bias, this layer has 14 x 14 = 196 locations multiplied by 288 parameters resulting in 56_448 multiplications.
The next layer:
learn.model[2]
Sequential(
(0): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): ReLU()
)
learn.model[2][0].weight.shape
torch.Size([16, 8, 3, 3])
16*8*3*3
1152
Will have 7 x 7 x 1152 = 56_448 multiplications. We halved the grid size from 14x14 to 7x7 (using stride-2) and doubled the number of filters from 8 to 16.
7*7*1152
56448
If we left the number of channels the same in each stride-2 layer, the amount of computation being done in the net would get less and less as it gets deeper, but we know that deeper layers have to compute semantically rich features (such as eyes or fur), so we wouldn’t expect that doing less computation would make sense.
Receptive Fields
The receptive field is the area of an image that is involved in the calculation of a layer. The deeper we are in the network (the more stride-2 convs we have before a layer) the larger the receptive field for an activation in that layer is. A larger receptive field means that a large amount of the input image is used to calculate each activation in that layer. We’d expect that we’d need more weights for each of the deeper layer’s richer features to handle this increased complexity—which is why with stride-2 we increase the number of features in each deeper layer (since the input size decreases).
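A small sketch of that receptive-field arithmetic, using the standard recurrence worked backwards from a single output activation (kernel sizes and strides here are just the 3x3 stride-2 case discussed above):

def receptive_field(layers):
    "layers: list of (kernel_size, stride) pairs from first layer to last"
    rf = 1
    for ks, stride in reversed(layers):
        rf = rf * stride + (ks - stride)   # each earlier layer widens the patch seen
    return rf

print(receptive_field([(3, 2)]))           # 3  -> one conv: each activation sees a 3x3 patch
print(receptive_field([(3, 2), (3, 2)]))   # 7  -> after two stride-2 convs: a 7x7 patch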
Color Images
A color picture is a rank-3 tensor.
im = image2tensor(Image.open('/content/grizzly.jpg'))
im.shape
torch.Size([3, 1000, 846])
show_image(im);
The first axis contains red, green and blue channels
_,axs = subplots(1,3)
for bear,ax,color in zip(im,axs,('Reds', 'Greens', 'Blues')):
    show_image(255-bear, ax=ax, cmap=color)
show_image(255-im[0], cmap='Reds');
show_image(255-im[1], cmap='Greens');
show_image(255-im[2], cmap='Blues');
_,axs = subplots(1,3)
for bear,ax in zip(im,axs):
    show_image(255-bear, ax=ax)
In one sliding window we have a certain number of channels and we need as many filters (we don’t use the same kernel for all the channels), so one kernel has size `ch_in x 3 x 3`. We sum the results of the window-times-filter multiplications over all the channels to produce a single number for each grid location, for each `ch_out` output feature. So the weights of our convolutional layer have size `ch_out x ch_in x ks x ks`.
There are as many biases as we have kernels. The bias is a vector of size `ch_out`.
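A quick shape check of that for a colour image with 3 input channels and, say, 16 output features:

import torch.nn as nn

layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(layer.weight.shape)   # torch.Size([16, 3, 3, 3])  -> ch_out x ch_in x ks x ks
print(layer.bias.shape)     # torch.Size([16])           -> one bias per kernel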
Changing the encoding of colors won’t make any difference to your model results, as long as you don’t lose information in the transformation (transforming to B/W is a bad idea as it loses color information while converting to HSV generally won’t make a difference).
Improving Training Stability
Create a 10-digit classifier.
def conv(ni, nf, ks=3, act=True):
    res = nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())
    return res
path = untar_data(URLs.MNIST)
path.ls()
(#2) [Path('/root/.fastai/data/mnist_png/testing'),Path('/root/.fastai/data/mnist_png/training')]
# create a function to change dls params
def get_dls(bs=64):
    return DataBlock(
        blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter('training', 'testing'),
        get_y=parent_label,
        batch_tfms=Normalize()
    ).dataloaders(path, bs=bs)
dls = get_dls()
dls.show_batch(max_n=9, figsize=(4,4))
pd.Series([el.parent.name for el in dls.valid.items]).sort_values().unique()
array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
A Simple Baseline
Use a similar CNN as before but with more activations (more numbers to differentiate = we’ll likely need more filters).
We generally want to double the number of filters each time we have a stride-2 layer. One way to increase the number of filters throughout our network is to double the number of activations in the first layer—then every layer after that will end up twice as big as in the previous version.
Neural networks will create useful features only if they're forced to do so, that is, if the number of outputs from an operation is significantly smaller than the number of inputs. With a 3x3 kernel the number of inputs is 9, so with 8 filters we'd be using 9 numbers to calculate 8 numbers; the layer isn't forced to learn much at all since the input and output sizes are nearly the same. To fix this, use a larger 5x5 kernel for the first layer, so that 25 values are used to compute 8 values (one for each of the 8 filters) at each location.
def simple_cnn():
    return sequential(
        conv(1, 8, ks=5),         # 14x14
        conv(8, 16),              # 7x7
        conv(16, 32),             # 4x4
        conv(32, 64),             # 2x2
        conv(64, 10, act=False),  # 1x1
        Flatten()
    )
xb, yb = first(dls.valid)
xb, yb = to_cpu(xb), to_cpu(yb)
conv(1, 8, ks=5)(xb).shape
torch.Size([64, 8, 14, 14])
We can look inside our models while they're training with the ActivationStats callback, which records the mean, standard deviation and histogram of the activations of every trainable layer.
from fastai.callback.hook import *

def fit(epochs=1):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy,
                    metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit(epochs, 0.06)
    return learn
learn = fit()
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.307114 | 2.306540 | 0.101000 | 01:07 |
That didn’t train well, let’s find out why.
learn.summary()
Sequential (Input shape: 64 x 1 x 28 x 28)
============================================================================
Layer (type) Output Shape Param # Trainable
============================================================================
64 x 8 x 14 x 14
Conv2d 208 True
ReLU
____________________________________________________________________________
64 x 16 x 7 x 7
Conv2d 1168 True
ReLU
____________________________________________________________________________
64 x 32 x 4 x 4
Conv2d 4640 True
ReLU
____________________________________________________________________________
64 x 64 x 2 x 2
Conv2d 18496 True
ReLU
____________________________________________________________________________
64 x 10 x 1 x 1
Conv2d 5770 True
____________________________________________________________________________
64 x 10
Flatten
____________________________________________________________________________
Total params: 30,282
Total trainable params: 30,282
Total non-trainable params: 0
Optimizer used: <function Adam at 0x7d6f11cb0700>
Loss function: <function cross_entropy at 0x7d6fd5b369e0>
Model unfrozen
Callbacks:
- ActivationStats
- TrainEvalCallback
- CastToTensor
- Recorder
- ProgressCallback
learn.activation_stats.plot_layer_stats(0)  # first layer
Generally our model should have a consistent, or at least smooth, mean and standard deviation of layer activations during training. Activations near zero are problematic because that means the model is doing nothing and that carries over to the next layer.
# penultimate layer
learn.activation_stats.plot_layer_stats(-2)
The problem, as expected, gets worse by the end of the network with nearly 100% of the activations close to 0.
l_stats = learn.activation_stats.layer_stats(0)
len(l_stats[0])
937
len(dls.train.items) / 64
937.5
Note: 937 = number of batches in training set, so the mean activation is across the batch (as is standard deviation, and % near zero).
Increase Batch Size
One way to make training more stable is to increase the batch size. Larger batches have more accurate gradients, since they're calculated from more data, but the downside is fewer opportunities to update the weights (fewer batches per epoch).
dls = get_dls(512)
learn = fit()
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.398756 | 0.203404 | 0.935500 | 01:01 |
# penultimate layer
learn.activation_stats.plot_layer_stats(-2)
Even though the accuracy is higher, most of the activations are near zero.
1cycle Training
Our initial weights are not well suited to the task we're trying to solve, so starting with a large learning rate may cause training to diverge from the start. We don't want to end with a high learning rate either, because we don't want to skip over the minimum. So we should change the learning rate from low, to high, and then back to low again. Leslie Smith developed this idea into 1cycle training: a schedule where in the first phase the learning rate grows from the minimum value to the maximum value (warmup), and then decreases back to the minimum (annealing). 1cycle training allows for higher learning rates, which train faster ("super-convergence") and overfit less by skipping over sharp local minima.
A model that generalizes well is one whose loss would not change very much if you changed the input by a small amount (I think one way to think about that is a smooth loss surface—no quick or sudden sharp changes). If a model trains with a large learning rate and finds a good loss when doing so (i.e. a loss that doesn’t change very much) it will generalize well.
Once we have found a nice smooth area for our parameters, we want to find the very best part of that area so we bring learning rates down again.
momentum: the optimizer takes a step not only in the direction of the gradients, but one that also continues in the direction of previous steps. Momentum varies in the opposite direction of the learning rate: high learning rates use less momentum (Leslie Smith, again!).
def fit(epochs=1, lr=0.06):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy,
                    metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit_one_cycle(epochs, lr)
    return learn
learn = fit()
/usr/local/lib/python3.10/dist-packages/fastai/callback/core.py:69: UserWarning: You are shadowing an attribute (modules) that exists in the learner. Use `self.learn.modules` to avoid this
warn(f"You are shadowing an attribute ({name}) that exists in the learner. Use `self.learn.{name}` to avoid this")
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.196365 | 0.070784 | 0.977500 | 01:11 |
learn.activation_stats.plot_layer_stats(-2)
% near zero is lower for some batches but still overall high.
learn.recorder.plot_sched()
fastai implements cosine annealing in the learning rate scheduler.
fit_one_cycle parameters (an example call follows this list):
- lr_max: the highest learning rate to be used (can also be a list of learning rates for each layer group, or a Python slice object containing the first and last layer group learning rates)
- div: how much to divide lr_max by to get the starting learning rate
- div_final: how much to divide lr_max by to get the ending learning rate
- pct_start: what percentage of the batches to use for the warmup
- moms: a tuple (mom1,mom2,mom3) where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum
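For reference, here is what a call spelling out those parameters might look like (my own example; the values shown are, as far as I can tell, the fastai defaults, so double-check against the docs):
learn.fit_one_cycle(
    1,
    lr_max=0.06,              # peak learning rate (0.06 to match fit() above)
    div=25.0,                 # start at lr_max/25
    div_final=100000.0,       # end at lr_max/100000
    pct_start=0.25,           # spend 25% of the batches warming up
    moms=(0.95, 0.85, 0.95),  # initial, minimum, and final momentum
)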
The x-axis of the graph is the batch number (60_000 training images / 512 images per batch = 117 batches).
# colorful dimension
learn.activation_stats.color_dim(-2)
This is a classic picture of "bad training". White = zero activations. Black at the bottom left is near-zero activations. The near-zero activations exponentially increase and then collapse, almost as if training is starting over. We see this increase and collapse a few more times before the distribution spreads throughout the range. Making training smooth from the start can be achieved with batch normalization.
Batch Normalization
We need to fix the initial large percentage of near-zero activations and then try to maintain a good distribution of activations throughout training.
From the Batch Normalization paper (2015, Ioffe and Szegedy):
internal covariate shift: the distribution of each layer's inputs changes during training as the parameters of the previous layers change, which slows down training (lower learning rates are required) and requires careful parameter initialization.
Making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.
batchnorm: taking an average of the mean and standard deviations of the activations of a layer and using those to normalize the activations. The network will want to make some activations really high to make accurate predictions, so there are two learnable parameters, gamma and beta. After normalizing the activations to get some new activation vector y, a batchnorm layer returns gamma*y + beta.
Our activations can have any mean and variance, independent from the mean and standard deviation of the results of the previous layer. During training we use the mean and std of the batch to normalize the data; during validation we use a running mean of the stats calculated during training.
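Here is a minimal sketch (my own addition in plain PyTorch, not fastai's or PyTorch's actual implementation) of what a batchnorm layer computes during training, ignoring the running statistics a real nn.BatchNorm2d also maintains for validation:
import torch

def batchnorm_sketch(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                # per-channel mean over the batch
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # per-channel variance
    y = (x - mean) / (var + eps).sqrt()                       # normalized activations
    return gamma * y + beta                                   # learnable scale and shift

x = torch.randn(64, 8, 14, 14)
gamma, beta = torch.ones(1, 8, 1, 1), torch.zeros(1, 8, 1, 1)
out = batchnorm_sketch(x, gamma, beta)
out.mean().item(), out.std().item()  # roughly 0 and 1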
# add batchnorm
def conv(ni, nf, ks=3, act=True):
    layers = [nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)]
    layers.append(nn.BatchNorm2d(nf))
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)
learn = fit()
/usr/local/lib/python3.10/dist-packages/fastai/callback/core.py:69: UserWarning: You are shadowing an attribute (modules) that exists in the learner. Use `self.learn.modules` to avoid this
warn(f"You are shadowing an attribute ({name}) that exists in the learner. Use `self.learn.{name}` to avoid this")
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.126032 | 0.055595 | 0.987400 | 01:12 |
learn.activation_stats.color_dim(-4)
That’s what we hope to see—a smooth development of activations with no collapses. We see batchnorm in nearly all modern neural networks.
We haven’t as yet seen rigorous analysis of what’s going on here, but most researchers believe that the reason models containing batch norm layers generalize better is that the normalization adds some extra randomness to the training process. Each mini-batch will have a somewhat different mean and std than other mini-batches. The activations will be normalized by different values each time. The model will learn to become robust to these variations to make accurate predictions. Adding additional randomization to the training process often helps.
learn = fit(5, lr=0.1)
/usr/local/lib/python3.10/dist-packages/fastai/callback/core.py:69: UserWarning: You are shadowing an attribute (modules) that exists in the learner. Use `self.learn.modules` to avoid this
warn(f"You are shadowing an attribute ({name}) that exists in the learner. Use `self.learn.{name}` to avoid this")
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.187965 | 0.117173 | 0.963600 | 01:09 |
1 | 0.079750 | 0.052391 | 0.983100 | 01:08 |
2 | 0.052439 | 0.048151 | 0.985200 | 01:03 |
3 | 0.032153 | 0.032204 | 0.989200 | 01:10 |
4 | 0.017457 | 0.024827 | 0.992300 | 01:04 |
learn.activation_stats.color_dim(-4)
Conclusion
Convolutions are matrix multiplications with two constraints: some elements are always zero and some elements are tied (forced to be equal). These constraints enforce a certain pattern of connectivity, and allow us to use fewer parameters without sacrificing the ability to represent complex visual features. We can train deeper models faster with less overfitting. Regular linear layers are called fully connected. Batch norm helps regularize training and makes it smoother.
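One way to see the matrix-multiplication view concretely is with a small sketch I added using F.unfold: extract every 3x3 patch of a tiny image as a column and multiply all of them by the same flattened kernel, which is exactly the "tied weights" constraint (the zeros show up if you expand this into one big image-sized matrix):
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 4, 4)
kernel = torch.randn(1, 1, 3, 3)

conv_out = F.conv2d(img, kernel).reshape(-1)   # four outputs (a 2x2 grid)
patches = F.unfold(img, kernel_size=3)[0]      # (9, 4): one flattened 3x3 patch per column
matmul_out = (kernel.reshape(1, 9) @ patches).reshape(-1)
torch.allclose(conv_out, matmul_out)           # True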
Questionnaire
1. What is a feature?
A visually distinctive attribute of an image.
2. Write out the convolutional kernel matrix for a top edge detector.
-1 | -1 | -1 |
0 | 0 | 0 |
1 | 1 | 1 |
3. Write out the mathematical operation applied by a 3x3 kernel to a single pixel in an image.
Assuming that this question pertains to a 3x3 grid in the image, suppose we apply the kernel in #2 to the following 3x3 grid:
a1 | a2 | a3 |
a4 | a5 | a6 |
a7 | a8 | a9 |
The result is the equation:
-a1 - a2 - a3 + a7 + a8 + a9
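A quick numerical check of that formula (my own addition), applying the top-edge kernel to a concrete 3x3 patch where a1..a9 are 1..9:
import torch

kernel = torch.tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]])
patch = torch.arange(1., 10.).reshape(3, 3)  # a1..a9 = 1..9
(kernel * patch).sum()                       # -1-2-3+7+8+9 = 18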
4. What is the value of a convolutional kernel applied to a 3x3 matrix of zeros?
0
5. What is padding?
Adding additional pixels around the border of the image so that the output doesn't lose two pixels on each axis. With padding, instead of the kernel having to fit fully inside the image at the edges, a portion of the kernel sits on the padding pixels.
6. What is stride?
The number of pixels by which the kernel moves or slides over the image. Stride-2 means the kernel moves over by two pixels (skipping one pixel).
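Putting padding and stride together, the output grid size follows the usual formula; conv_out_size below is a hypothetical helper I added to check the sizes used in this chapter:
def conv_out_size(n, ks, pad, stride):
    # standard convolution output-size formula
    return (n + 2 * pad - ks) // stride + 1

conv_out_size(28, 3, 1, 2), conv_out_size(14, 3, 1, 2)  # (14, 7)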
7. Create a nested list comprehension to complete any task that you choose.
[(i, j, k) for i in range(3) for j in range(4) for k in range(2)]
[(0, 0, 0),
(0, 0, 1),
(0, 1, 0),
(0, 1, 1),
(0, 2, 0),
(0, 2, 1),
(0, 3, 0),
(0, 3, 1),
(1, 0, 0),
(1, 0, 1),
(1, 1, 0),
(1, 1, 1),
(1, 2, 0),
(1, 2, 1),
(1, 3, 0),
(1, 3, 1),
(2, 0, 0),
(2, 0, 1),
(2, 1, 0),
(2, 1, 1),
(2, 2, 0),
(2, 2, 1),
(2, 3, 0),
(2, 3, 1)]
[[[i, j, k] for i in ['i1', 'i2', 'i3']] for j in ['j1', 'j2'] for k in ['k1', 'k2']]
[[['i1', 'j1', 'k1'], ['i2', 'j1', 'k1'], ['i3', 'j1', 'k1']],
[['i1', 'j1', 'k2'], ['i2', 'j1', 'k2'], ['i3', 'j1', 'k2']],
[['i1', 'j2', 'k1'], ['i2', 'j2', 'k1'], ['i3', 'j2', 'k1']],
[['i1', 'j2', 'k2'], ['i2', 'j2', 'k2'], ['i3', 'j2', 'k2']]]
8. What are the shapes of the input and weight parameters to PyTorch’s 2D convolution?
- input: (minibatch, in_channels, iH, iW)
- weight: filters of shape (out_channels, in_channels, kH, kW)
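A short demonstration of those shapes (my own addition) using F.conv2d directly:
import torch
import torch.nn.functional as F

x = torch.randn(64, 1, 28, 28)  # input: (minibatch, in_channels, iH, iW)
w = torch.randn(4, 1, 3, 3)     # weight: (out_channels, in_channels, kH, kW)
F.conv2d(x, w, stride=2, padding=1).shape  # torch.Size([64, 4, 14, 14])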
9. What is a channel?
A single basic color in an image.
10. What is the relationship between a convolution and a matrix multiplication?
A convolution is a matrix multiplication with two constraints:
- some elements in the kernel matrix always stay 0.
- some elements in that matrix are always equal to each other.
11. What is a convolutional neural network?
A neural network with non-linearity functions sandwiched between convolutions. In other words, a network where the fully connected (linear) layers are replaced with convolutions.
12. What is the benefit of refactoring parts of your neural network definition?
You’re less likely to get errors due to inconsistencies in the architecture, and it’s more obvious to the reader which parts of your layers are actually changing.
13. What is Flatten? Where does it need to be included in the MNIST CNN? Why?
A fastai/PyTorch layer that flattens the input to a single dimension per batch item; it’s used at the end of the model. In our case, it removes the final 1x1 dimensions left over from the last convolution.
x = torch.ones(2, 1, 1)
x.shape
torch.Size([2, 1, 1])
x
tensor([[[1.]],
[[1.]]])
x[1], x[1][0], x[1][0][0]
(tensor([[1.]]), tensor([1.]), tensor(1.))
x[1].shape, x[1][0].shape, x[1][0][0].shape
(torch.Size([1, 1]), torch.Size([1]), torch.Size([]))
Flatten()(x).shape
torch.Size([2, 1])
Flatten()(x), Flatten()(x)[1], Flatten()(x)[1][0]
(tensor([[1.],
[1.]]),
tensor([1.]),
tensor(1.))
Flatten()(x)[1].shape, Flatten()(x)[1][0].shape
(torch.Size([1]), torch.Size([]))
14. What does NCHW mean?
N = batch size
C = channels
H = height
W = width
15. Why does the third layer of the MNIST CNN have 7*7*(1168-16) multiplications?
7x7 is the size of the activation grid coming out of the previous convolution (the second layer), which is the input to this layer. 1152 (which is 1168 - 16) is the number of non-bias parameters in the third layer, which comes from the fact that the weight of that layer’s convolution has dimensions 16 x 8 x 3 x 3 (8 input channels, 16 output channels and a 3x3 kernel).
8*16*3*3
1152
16. What is a receptive field?
The area of an image that is involved in the calculation of a layer.
17. What is the size of the receptive field of an activation after two stride-2 convolutions? Why?
7x7. An activation in layer 2 is made from a 3x3 receptive field in layer 1, and that 3x3 area of layer 1 is made up of a 7x7 receptive field in layer 0 (the original image).
Let’s focus on the top-most and left-most activation in layer 2. That comes from the top-left-most 3x3 pixels in layer 1. The top-left-most pixel in layer 1 comes from a 3x3 grid starting at the top-left-most pixel in layer 0. The next pixel to the right in layer 1 comes from a 3x3 grid starting at the third pixel of the first row in layer 0, and the next one comes from a 3x3 grid starting at the fifth pixel of the first row in layer 0, which covers pixels 5, 6, and 7 of that row. So horizontally, pixels 1 through 7 of the first row are involved; the same holds vertically for the first column, and the whole 7x7 grid in layer 0 is involved when looking at the whole 3x3 grid in layer 1.
18. Run conv-example.xlsx yourself and experiment with trace precedents
I recreated the whole notebook while following the Lesson 8 video. I also used trace precedents when answering question #17.
19. Have a look at Jeremy’s or Sylvain’s recent Twitter “likes”, and see if you find any interesting resources or ideas there.
Likes are no longer public on Twitter, but I follow a bunch of folks that Jeremy follows.
20. How is a color image represented as a tensor?
A rank-3 tensor (3 channels, height, width).
21. How does a convolution work with a color input?
It applies a different filter to each channel and then sums the results for each pixel.
22. What method can we use to see the data in DataLoaders?
show_batch
23. Why do we double the number of filters after each stride-2 conv?
Since each stride-2 layer halves the grid size, we double the number of filters so that the amount of computation doesn’t decrease and the deeper layers have enough capacity to learn their richer, more complex features.
24. Why do we use a larger kernel in the first conv with MNIST (with simple_cnn)?
If we use a 3x3 kernel to produce 8 filters, that’s 9 inputs producing 8 outputs—the model isn’t learning much. For the model to learn things, the number of inputs should be larger than the number of outputs. So if we use a 5x5 kernel, that’s 25 pixels producing 8 outputs, so the model has to learn useful features.
25. What information does ActivationStats save for each layer?
The mean and standard deviation of the activations of each trainable layer, along with the percentage of activations near zero (and a histogram when with_hist=True).
26. How can we access a learner’s callback after training?
Learner.<name of callback>; for example, the ActivationStats callback is accessed after training with Learner.activation_stats.
27. What are the three statistics plotted by plot_layer_stats? What does the x-axis represent?
Three statistics plotted: mean activations, standard deviation of activations and % of activations near zero. The x-axis represents the batches.
28. Why are activations near zero problematic?
Because they result in the model doing nothing after the computation (multiplying by 0 = 0). Also, the resulting 0-activations result in more 0 activations in the next layer. In this way, the deeper you go, the more the activations are near zero if the early layers have many near-zero activations.
29. What are the upsides and downsides of training with a larger batch size?
Upside: smoother training. Downside: fewer gradient updates (opportunities to “learn”).
30. Why should we avoid using a high learning rate at the start of training?
We will “overshoot” the minimum and the training will diverge (exploding gradients).
31. What is 1cycle training?
A learning rate schedule where the learning rate starts off small, warms up to a larger value, then anneals back down to a smaller value.
32. What are the benefits of training with a high learning rate?
You can train quicker and overfit less (since we skip over sharp local minima).
33. Why do we want to use a low learning rate at the end of training?
Assuming that we have found the minimum, we don’t want to overshoot it so a smaller learning rate takes smaller steps towards the minimum.
34. What is cyclical momentum?
Momentum is “a technique whereby the optimizer takes a step not only in the direction of the gradients, but also that continues in the direction of previous steps.” Cyclical momentum is when the momentum follows a schedule going from large to small and back to large again (in the case of 1cycle training).
35. What callback tracks hyperparameter values during training (along with other information)?
Recorder
36. What does one column of pixels in the color_dim plot represent?
The activations for one batch.
37. What does “bad training” look like in color_dim? Why?
Cyclical increase and collapse of nonzero activations. This results in near zero activations at the end of training which can lead to poor results.
38. What trainable parameters does a batch normalization layer contain?
gamma and beta, where given a vector y of normalized activations, gamma*y + beta is the “learned” normalization returned by batchnorm.
39. What statistics are used to normalize in batch normalization during training? How about during validation?
training: mean and standard deviation of the activations of the batch.
validation: the running mean and standard deviation of the statistics calculated during training.
40. Why do models with batch normalization layers generalize better?
We don’t fully know (at least at the time of writing) but it’s likely because the normalization (including the learned parameters) adds randomness and additional randomness generally helps the model generalize better.