!pip install textstat -qq
!pip install datasets -qq
Calculating the Flesch-Kincaid Reading Grade Level for the financial_phrasebank Dataset
I calculate the reading grade level of the financial_phrasebank dataset and find that it’s much higher than the average TinyStories reading level.
Background
In this notebook I’ll calculate the Flesch-Kincaid reading grade level for the financial_phrasebank
dataset. Previously, I found the TinyStories dataset had a median grade level of 2.7. I expect financial_phrasebank
to have a higher grade level due to financial jargon. If true, this suggests fine-tuning TinyInstruct-33M on financial_phrasebank
may be less effective than using a simplified version (which I’ll create with phi-3 as a future exercise) as I endeavor to build a “TinySentiment” classifier.
Setup
from datasets import load_dataset
import numpy as np
import textstat
ds = load_dataset(
    "financial_phrasebank", "sentences_allagree",
    split="train" # note that the dataset does not have a default test split
)
ds
Dataset({
features: ['sentence', 'label'],
num_rows: 2264
})
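Before computing anything, it’s worth peeking at a row and at the label names. This is a minimal sketch; I’m assuming the label column is stored as a ClassLabel feature, so the names shown come from the dataset itself rather than anything I define here.
# Peek at the first example and the sentiment class names
print(ds[0])                      # first row: a 'sentence' string plus its 'label'
print(ds.features['label'].names) # class names defined by the dataset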
ddf = ds.to_pandas() # convert to a DataFrame to apply textstat.flesch_kincaid_grade
ddf.shape
(2264, 2)
Calculating Flesch-Kincaid Reading Grade Level
I can calculate the reading grade level for the 2264 rows in about 0.5 seconds by using textstat.flesch_kincaid_grade. See my previous blog post for a deeper dive on how textstat calculates reading grade level.
ddf['fk_grade'] = ddf['sentence'].apply(lambda x: textstat.flesch_kincaid_grade(x))
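For reference, the Flesch-Kincaid grade is just a linear formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. Here’s a minimal sketch that recomputes it by hand for a single sentence using textstat’s word, sentence, and syllable counts (these counts are textstat’s own approximations, so small rounding differences are possible):
# Recompute the Flesch-Kincaid grade by hand for one sentence
example = ddf['sentence'].iloc[0]
words = textstat.lexicon_count(example, removepunct=True)
sentences = textstat.sentence_count(example)
syllables = textstat.syllable_count(example)
manual = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
round(manual, 2), textstat.flesch_kincaid_grade(example)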
The mean and median reading grade levels for this dataset are 8.6 and 8, respectively, about 6 grade levels higher than the TinyStories dataset.
ddf['fk_grade'].describe()
|       | fk_grade    |
|-------|-------------|
| count | 2264.000000 |
| mean  | 8.616078    |
| std   | 4.933004    |
| min   | -3.100000   |
| 25%   | 4.600000    |
| 50%   | 8.000000    |
| 75%   | 12.200000   |
| max   | 24.800000   |
Here’s an example of a Grade 8 reading level sentence:
"fk_grade == 8").iloc[0]['sentence'] ddf.query(
'Both operating profit and turnover for the three-month period increased , respectively from EUR0 .9 m and EUR8 .3 m , as compared to the corresponding period in 2005 .'
The highest grade level in this dataset is 24.8 (well beyond post-graduate), likely due to its long list of multisyllabic industries:
"fk_grade == 24.8").iloc[0]['sentence'] ddf.query(
'The company serves customers in various industries , including process and resources , industrial machinery , architecture , building , construction , electrical , transportation , electronics , chemical , petrochemical , energy , and information technology , as well as catering and households .'
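Because the formula rewards long sentences and polysyllabic words, we can break this top-scoring sentence into its two ingredients, words per sentence and syllables per word (a quick sketch reusing the textstat helpers above):
# The 24.8 score comes from one very long sentence packed with multisyllabic industry names
top = ddf.query("fk_grade == 24.8").iloc[0]['sentence']
textstat.lexicon_count(top, removepunct=True), textstat.syllable_count(top), textstat.sentence_count(top)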
The lowest reading level is -3.1:
"fk_grade == -3.1").iloc[0]['sentence'] ddf.query(
'R&D Loan ) .'
The histogram illustrates that a significant portion of the dataset (about 25%) is above Grade 12. By comparison, less than 0.0005% of TinyStories was above Grade 12.
ddf['fk_grade'].hist();
"fk_grade > 12")['fk_grade'].count() / 2264 ddf.query(
0.2539752650176678
Final Thoughts
As expected, financial_phrasebank
(median reading grade level of 8) uses more complex language than TinyStories (median reading grade level of 2.7). This may make it challenging for TinyInstruct-33M to learn from. To test this, I’ll:
- Fine-tune TinyInstruct-33M on
financial_phrasebank
- Create a lower reading grade version of the dataset using phi-3
- Fine-tune TinyInstruct-33M on the simplified dataset
- Compare performance of both fine-tuned models
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.