Fine-Tuning a Language Model as a Text Classifier

deep learning
python
In this notebook I work through the exercise presented in Chapter 10 of the fastai textbook.
Author

Vishal Bakshi

Published

August 5, 2023

In this notebook, I’ll fine-tune a language model on the IMDb reviews dataset, grab its encoder, create a new classification model with it, and then fine-tune that model to classify IMDb reviews as positive or negative. The code (and prose) below is taken from Chapter 10 of the fastai textbook.

from fastai.text.all import *
path = untar_data(URLs.IMDB)

The data is stored in three folders: train (25k labeled reviews), test (25k labeled reviews) and unsup (50k unlabeled reviews). The language model is fine-tuned on all 100k reviews, while the classification model is trained on the train set, with the test set serving as its validation set.

path.ls()
(#7) [Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/test')]

Fine-Tuning the Pretrained Language Model

First, we fine-tune the pretrained language model (which was trained on all of Wikipedia) using 100k movie reviews. This fine-tuned model will learn to predict the next word of an IMDb movie review.

Note that fastai’s TextBlock sets up its numericalizer’s vocab automatically.

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
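Since TextBlock builds the vocab automatically, we can peek at it once the DataLoaders exist. This is a quick sanity check of my own (not from the textbook): fastai's vocab begins with its special tokens, followed by the most frequent corpus tokens.

# A peek at the automatically-built vocab: the first entries are fastai's
# special tokens (xxunk, xxpad, xxbos, ...), then the most frequent words.
print(len(dls_lm.vocab))
print(dls_lm.vocab[:10])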

The dependent variable is the independent variable shifted over by one token:

dls_lm.show_batch(max_n=2)
text text_
0 xxbos xxmaj this movie is my favorite of all time . xxmaj the dialogue is spectacular , and is delivered with such rapid - fire speed that one viewing is not enough . xxmaj the film comedy was elevated to new heights with xxmaj howard xxmaj hawks outstanding direction . xxmaj based on the classic play " the xxmaj front xxmaj page " , xxmaj hawks gives it a delightful twist by xxmaj this movie is my favorite of all time . xxmaj the dialogue is spectacular , and is delivered with such rapid - fire speed that one viewing is not enough . xxmaj the film comedy was elevated to new heights with xxmaj howard xxmaj hawks outstanding direction . xxmaj based on the classic play " the xxmaj front xxmaj page " , xxmaj hawks gives it a delightful twist by presenting
1 xxmaj woody xxmaj woodpecker , " duck xxmaj amuck " and especially " one xxmaj froggy xxmaj evening " show up how weak this movie is in comparison . xxmaj plus the movie fits in shambolic slapstick alongside strained sentiment ( the underlying theme of the story is family ; our hero is n't ready to have a son , and his nemesis - xxmaj alan xxmaj cumming as the xxmaj norse woody xxmaj woodpecker , " duck xxmaj amuck " and especially " one xxmaj froggy xxmaj evening " show up how weak this movie is in comparison . xxmaj plus the movie fits in shambolic slapstick alongside strained sentiment ( the underlying theme of the story is family ; our hero is n't ready to have a son , and his nemesis - xxmaj alan xxmaj cumming as the xxmaj norse god
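To convince ourselves of that one-token shift, we can compare the raw tensors of a batch directly (my own check, not from the book):

# x and y are (bs, seq_len) tensors of token ids; y should equal x shifted
# left by one position, so these overlapping slices should match exactly.
x, y = dls_lm.one_batch()
print(x.shape, y.shape)
print((x[:, 1:] == y[:, :-1]).all())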
learn = language_model_learner(
    dls_lm,
    AWD_LSTM,
    drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()

I fine-tuned the model for one epoch and saved it to load and use later. language_model_learner automatically freezes the pretrained model so it trains only the randomly initialized embeddings (those for tokens in the IMDb vocab that aren't in the pretrained model's vocab).
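We can verify the freezing before training (my own check, not in the textbook): with a pretrained model, fastai freezes every parameter group except the last at creation time.

# Count frozen parameter tensors; only the final parameter group should be
# trainable at this point.
n_frozen = sum(1 for p in learn.model.parameters() if not p.requires_grad)
n_total = sum(1 for p in learn.model.parameters())
print(f'{n_frozen} of {n_total} parameter tensors are frozen')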

learn.fit_one_cycle(1, 2e-2)

Paperspace’s file browser is located at /notebooks so I change the learn.path to that location:

learn.path = Path('/notebooks')

I then save the learner so that it saves the trained embeddings.

learn.save('1epoch')
Path('/notebooks/models/1epoch.pth')

Later on, I load the saved model, unfreeze the layers of the pretrained language model and fine-tune it for 10 epochs on the IMDb reviews dataset at a smaller learning rate (as shown in the fastai text):

learn = learn.load('1epoch')
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch train_loss valid_loss accuracy perplexity time
0 4.214371 4.114542 0.300169 61.224136 41:36
1 3.917021 3.850335 0.316820 47.008827 42:00
2 3.752428 3.724050 0.326502 41.431866 42:13
3 3.660530 3.660284 0.331666 38.872364 42:32
4 3.560096 3.620281 0.335297 37.348042 42:36
5 3.507077 3.592660 0.338347 36.330578 42:44
6 3.430038 3.575986 0.340261 35.729839 42:39
7 3.360812 3.566898 0.341806 35.406578 42:53
8 3.310551 3.567138 0.342046 35.415089 43:28
9 3.297931 3.570799 0.341944 35.544979 44:01

We save the whole model except the final layer, which converts activations into probabilities of picking each token in our vocabulary. The model without this final layer is called the encoder.

learn.save_encoder('imdb_finetuned')
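Under the hood, the language model is a SequentialRNN with two pieces: the AWD_LSTM encoder and a linear decoder head, and save_encoder stores only the first piece. A quick inspection of my own (assuming the default architecture):

# The model is SequentialRNN(encoder, decoder); save_encoder keeps model[0].
print(type(learn.model[0]).__name__)  # AWD_LSTM -- the encoder we just saved
print(type(learn.model[1]).__name__)  # LinearDecoder -- the head we drop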

Before we fine-tune the model to be a classifier, the textbook has us generate random reviews:

TEXT = 'I liked this movie because'
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

print("\n".join(preds))
i liked this movie because it showed a lot of normal people in America about who we belong and what they say and do . 

 The acting was great , the story was fun and enjoyable and the movie was very well
i liked this movie because my family and i are great Canadians and also Canadians , especially the Canadians . This is not a Canadian and American movie , but instead of being a " mockumentary " about the

The reviews are certainly not polished, but it’s still fascinating to see how the model predicts the next word to create a somewhat sensical review.
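The temperature argument controls how adventurous the sampling is: lower values concentrate probability on the most likely next tokens, while higher values flatten the distribution. Here's a toy illustration of the general idea (my own sketch, not fastai's exact internals):

import torch

# Scaling logits by 1/temperature before softmax: low temperature sharpens
# the distribution, high temperature flattens it.
logits = torch.tensor([2.0, 1.0, 0.5])
for t in (0.25, 0.75, 1.5):
    print(t, torch.softmax(logits / t, dim=0))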

Fine-Tuning the Text Classifier

For the final piece of this lesson, we move from language model to classifier, starting with creating the classifier DataLoaders.

We pass it the vocab of the language model to make sure we use the same correspondence of token to index, so that the embeddings learned in the fine-tuned language model can be applied to the classifier.

The dependent variable in this classifier is the label of the parent folder, pos for positive and neg for negative.

Finally, we don’t pass is_lm=True to the TextBlock since it’s False by default (which is what we want here, because we have labeled data and don’t want to use the next token as the label).

(path/'train').ls()
(#4) [Path('/root/.fastai/data/imdb/train/pos'),Path('/root/.fastai/data/imdb/train/unsupBow.feat'),Path('/root/.fastai/data/imdb/train/neg'),Path('/root/.fastai/data/imdb/train/labeledBow.feat')]
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y = parent_label,
    get_items = partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

The independent variable is the movie review and the dependent variable is the sentiment (positive, pos, or negative, neg):

dls_clas.show_batch(max_n=3)
text category
0 xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero pos
1 xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released " kiki 's xxmaj delivery xxmaj service " on video which included a preview of the xxmaj laputa dub saying it was due out in " 1 xxrep 3 9 " . xxmaj it 's obviously way past that year now , but the dub has been finally completed . xxmaj and it 's not " laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky " , just " castle xxmaj in xxmaj the xxmaj sky " for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times pos
2 xxbos xxmaj titanic directed by xxmaj james xxmaj cameron presents a fictional love story on the historical setting of the xxmaj titanic . xxmaj the plot is simple , xxunk , or not for those who love plots that twist and turn and keep you in suspense . xxmaj the end of the movie can be figured out within minutes of the start of the film , but the love story is an interesting one , however . xxmaj kate xxmaj winslett is wonderful as xxmaj rose , an aristocratic young lady betrothed by xxmaj cal ( billy xxmaj zane ) . xxmaj early on the voyage xxmaj rose meets xxmaj jack ( leonardo dicaprio ) , a lower class artist on his way to xxmaj america after winning his ticket aboard xxmaj titanic in a poker game . xxmaj if he wants something , he goes and gets it pos

Each batch has to have tensors of the same size, so fastai does the following (when using a TextBlock with is_lm=False):

  • Batch together texts that are roughly the same length (by sorting the documents by length prior to each epoch).
  • Expand the shortest texts to make them all the same size (as the largest document in the batch) by padding them with a special padding token that will be ignored by the model.
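We can see both behaviors by peeking at one batch (my own check): every review in the batch has been padded to a single sequence length with the special xxpad token.

# All texts in a batch share one length; shorter reviews are padded with
# xxpad, one of fastai's default special tokens.
x, y = dls_clas.one_batch()
print(x.shape, y.shape)
pad_id = list(dls_lm.vocab).index('xxpad')
print(f'{(x == pad_id).sum().item()} padding tokens in this batch')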

Let’s create the model to classify texts:

learn = text_classifier_learner(
    dls_clas, 
    AWD_LSTM, 
    drop_mult=0.5, 
    metrics=accuracy
).to_fp16()

Load the encoder from our fine-tuned language model:

learn.path = Path('/notebooks')
learn = learn.load_encoder('imdb_finetuned')

The last step is to train with discriminative learning rates and gradual unfreezing. For NLP classifiers, the textbook recommends unfreezing a few layers at a time to achieve the best performance:

learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy time
0 0.245777 0.174727 0.934000 01:48

We get accuracy similar to the textbook's value (0.929320).
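The next steps use fastai's slice syntax for discriminative learning rates: the lowest layers get the smallest rate, the top layer gets the full rate, and the groups in between are spread geometrically. The textbook's heuristic divides the top rate by 2.6**4 to get the bottom one; for example:

# Endpoints of the discriminative learning-rate range used below.
lr_top = 1e-2
lr_bottom = lr_top / (2.6 ** 4)
print(f'{lr_bottom:.2e} -> {lr_top:.2e}')  # roughly 2.19e-04 -> 1.00e-02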

Next, train the model with all but the last two parameter groups frozen:

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
epoch train_loss valid_loss accuracy time
0 0.226701 0.161235 0.938800 01:59

The accuracy improved a bit!

Unfreeze one more parameter group (the last three are now trainable) and keep training:

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
epoch train_loss valid_loss accuracy time
0 0.188972 0.147045 0.946440 02:43

The accuracy continues to improve.

Finally, train the whole model:

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
epoch train_loss valid_loss accuracy time
0 0.163849 0.143639 0.947600 03:18
1 0.149648 0.144494 0.947840 03:19

We’ll test the model with a few low-hanging-fruit inputs:

learn.predict("I really like this movie!")
('pos', tensor(1), tensor([0.0034, 0.9966]))
learn.predict("I really did not like this movie!")
('neg', tensor(0), tensor([0.9985, 0.0015]))
learn.predict("I'm not sure if I loved or hated this movie")
('neg', tensor(0), tensor([0.6997, 0.3003]))
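Each call to predict returns the decoded label, the class index, and the probabilities for (neg, pos). A small convenience wrapper (a hypothetical helper of my own, not from the textbook) makes the output easier to read:

def classify(review):
    # learn.predict returns (decoded label, class index, probabilities)
    label, idx, probs = learn.predict(review)
    print(f'{label} ({probs[idx].item():.1%}): {review}')

classify('The plot dragged, but the acting saved it.')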

To recap, here are the three steps that were involved in creating the IMDb movie review classifier:

  • A language model was pretrained on all of Wikipedia.
  • We then fine-tuned that model on 100k IMDb movie reviews (documents).
  • Using the encoder from the fine-tuned language model, we created a classification model and fine-tuned it for a few epochs, gradually unfreezing layers for consecutive epochs. This model accurately classifies movie reviews as positive or negative.

That’s a wrap for this exercise. I hope you enjoyed this blog post!