!pip install transformers -Uqq
!pip install accelerate -qq
!pip install torch==2.2.2 -qq
!pip install datasets~=2.16.1 -qq
!pip install scikit-learn==1.2 -qq
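The cells below also rely on a handful of imports whose cell is collapsed in the original notebook; a minimal sketch of what that cell would contain:

```python
# Assumed imports for the cells below (the original import cell is collapsed)
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset, Dataset
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
```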
Using Qwen2-1.5B-Instruct to classify sentiment in the financial_phrasebank dataset with 86.1% accuracy.
Vishal Bakshi
September 23, 2024
# load dataset
dataset = load_dataset(
    "financial_phrasebank", "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)
# create a new column with the numeric label verbalised as label_text (e.g. "negative" instead of 0)
label_map = {i: label_text for i, label_text in enumerate(dataset.features["label"].names)}

def add_label_text(example):
    example["label_text"] = label_map[example["label"]]
    return example

dataset = dataset.map(add_label_text)
print(dataset)
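The generation functions below also assume that model, tokenizer, and device already exist. That setup cell is collapsed in the notebook, but for Qwen2-1.5B-Instruct it would look roughly like this:

```python
# Assumed model/tokenizer setup (the original cell is collapsed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)
```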
def generate_response(prompt):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=2
    )

    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
def add_prompt(item, prompt):
    item['prompt'] = prompt.format(text=item['sentence'])
    return item

def generate_responses(dataset, prompt):
    responses = []

    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    print(dataset[0]['prompt'])

    for row in dataset:
        messages = [
            {"role": "user", "content": row['prompt']}
        ]

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = tokenizer([text], return_tensors="pt").to(device)

        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=2
        )

        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip().lower()
        responses.append(response)

    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()

    return df, acc
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    cm = confusion_matrix(y_true=df['label_text'], y_pred=df['responses'], labels=['negative', 'neutral', 'positive', 'other'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive', 'other'])

    # I chose 8x8 so it fits on one screen but still is large
    fig, ax = plt.subplots(figsize=(8,8))
    disp.plot(ax=ax, text_kw={'fontsize': 16}, cmap='Blues', colorbar=False);

    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(18)
    ax.yaxis.label.set_fontsize(18)

    # make tick labels larger
    ax.tick_params(axis='y', labelsize=16)
    ax.tick_params(axis='x', labelsize=16)
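Each prompt below is evaluated with these two helpers; the actual evaluation cells are collapsed in the notebook, but a typical call would look roughly like:

```python
# Rough usage sketch: run a prompt over the dataset, print the accuracy,
# and plot the confusion matrix (promptA is defined further down)
df, acc = generate_responses(dataset, promptA)
print(acc)
make_cm(df)
```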
In this notebook I’ll use Qwen2-1.5B-Instruct to classify sentiment in the financial_phrasebank dataset. In previous notebooks I have performed sentiment classification with phi-2, phi-3, phi-3.5, and the Claude series.

This notebook is part of a series of blog posts for a project I’m working on called TinySentiment, where I’m experimenting with tiny models to improve their ability to classify sentiment in the financial_phrasebank dataset. I was inspired to do so after reading this blog post and this corresponding notebook by Moritz Laurer as part of a fastai study group last year.
Here are the results from my experiments so far (**the best-performing prompt from this notebook):
Model | Prompting Strategy | Overall Accuracy | negative | neutral | positive |
---|---|---|---|---|---|
claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
claude-3-opus-20240229 | 0-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
phi-3.5 | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
phi-3 | 30-Shot w/System Prompt | 92.79% | 98% (290/297) | 94% (1284/1373) | 88% (499/564) |
claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
phi-2 | 6-Shot | 91.94% | 88% (267/302) | 94% (1299/1387) | 90% (510/569) |
**Qwen2-1.5B | 27-Shot | 86.10% | 90% (264/294) | 95.5% (1320/1382) | 61% (342/561) |
Here are the results from this notebook. The best-performing prompt was a randomly shuffled 27-Shot prompt (Prompt AD), yielding an overall accuracy of 86.10%.
prompt | strategy | accuracy | negative | neutral | positive |
---|---|---|---|---|---|
A | 0-Shot | 81.76% | 97% (293/303) | 85% (1185/1391) | 65% (373/570) |
B | 0-Shot | 51.86% | 99% (300/303) | 61% (846/1391) | 5% (28/570) |
C | 0-Shot | 81.40% | 93% (283/303) | 96% (1330/1391) | 40% (230/570) |
D | 0-Shot | 78.53% | 92% (279/303) | 92% (1281/1391) | 38% (218/570) |
E | 0-Shot | 66.21% | 100% (302/303) | 82% (1145/1391) | 9% (52/570) |
F | 0-Shot | 78.05% | 88% (267/303) | 97% (1355/1391) | 25% (145/570) |
G | 0-Shot | 66.70% | 94% (285/303) | 80% (1107/1391) | 21% (118/570) |
H | 0-Shot | 70.89% | 85% (259/303) | 90% (1247/1391) | 17% (99/570) |
I | 0-Shot | 69.17% | 58% (176/303) | 86% (1201/1391) | 33% (189/570) |
J | 0-Shot | 57.38% | 47% (142/303) | 78% (1086/1391) | 12% (71/570) |
K | 0-Shot | 41.87% | 34% (102/303) | 52% (728/1391) | 21% (118/570) |
L | 0-Shot | 42.84% | 66% (200/303) | 45% (629/1391) | 25% (141/570) |
M | 0-Shot | 51.46% | 26% (79/303) | 77% (1078/1391) | 1% (8/570) |
N | 0-Shot | 29.77% | 11% (33/303) | 44% (608/1391) | 6% (33/570) |
O | 0-Shot | 61.00% | 37% (113/303) | 90% (1257/1391) | 2% (11/570) |
P | 3-Shot | 78.20% | 91% (275/302) | 91% (1266/1390) | 40% (227/569) |
Q | 6-Shot | 76.93% | 96% (289/302) | 73% (1010/1387) | 77% (438/569) |
R | 20-Shot | 81.42% | 92% (274/299) | 94% (1301/1379) | 45% (252/566) |
S | 30-Shot | 81.51% | 87% (255/294) | 98% (1345/1379) | 39% (221/561) |
T | 27-Shot | 83.73% | 93% (272/294) | 94.3% (1303/1382) | 53% (298/561) |
U | 25-Shot | 82.94% | 90% (266/294) | 96.2% (1331/1384) | 46% (260/561) |
V | 21-Shot | 83.28% | 92% (273/296) | 94.9% (1314/1384) | 50% (281/563) |
W | 15-Shot | 81.55% | 94% (279/298) | 87.7% (1215/1386) | 60% (340/565) |
X | 30-Shot | 81.74% | 89% (261/293) | 96.7% (1336/1381) | 41% (229/560) |
Y | 60-Shot | 75.82% | 66% (186/283) | 99.8% (1368/1371) | 21% (117/550) |
Z | 27-Shot | 81.36% | 80% (236/294) | 99.4% (1374/1382) | 37% (210/561) |
AA | 23-Shot | 82.95% | 93% (276/296) | 94.9% (1314/1384) | 48% (269/561) |
AB | 25-Shot | 83.70% | 92% (270/294) | 95.3% (1317/1382) | 51% (287/563) |
AC | 23-Shot | 83.40% | 95% (278/294) | 93.8% (1296/1382) | 52% (295/565) |
AD | 27-Shot | 86.10% | 90% (264/294) | 95.5% (1320/1382) | 61% (342/561) |
AE | 60-Shot | 83.71% | 83% (234/283) | 97.8% (1341/1371) | 49% (270/550) |
AF | 15-Shot | 82.00% | 91% (272/298) | 88.8% (1231/1386) | 60% (341/565) |
AG | 27-Shot w/System Prompt | 84.40% | 84% (248/294) | 97.8% (1351/1382) | 52% (289/561) |
AH | 27-Shot w/System Prompt | 84.67% | 87% (256/294) | 97.1% (1342/1382) | 53% (296/561) |
AI | 27-Shot w/System Prompt | 84.99% | 88% (260/294) | 97.3% (1345/1382) | 50% (283/561) |
I’ll start out with a simple instruction.
promptA = """Label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}"""
print(promptA)
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Looks good! I’m able to generate a response.
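The call that produced the output below isn’t shown; it was presumably along these lines:

```python
# Hypothetical cell (the original is collapsed): classify the first dataset item with Prompt A
generate_response(promptA.format(text=dataset[0]['sentence']))
```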
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
'neutral'
At ~50ms per prompt, it would take about 2 minutes to run inference on the full 2264 item dataset.
45.4 ms ± 829 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Right off the bat, Qwen2-1.5B-Instruct has decent performance on sentiment classification. 81.76% is not bad at all! Let’s see if I can improve on that.
Interesting to note: Qwen2-1.5B-Instruct does not classify any other values than negative, neutral and positive.
I’ll repeat the instruction after providing the sentence, as this has usually improved performance.
Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
Interesting! Repeating the instruction usually increases the accuracy. Here, it drops by about 30%!
The model actually gets better at negative classification but significantly plummets for the other two.
I’ll return to Prompt A and change the format a bit.
Respond with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
This yields a similar accuracy to Prompt A.
The model does quite well with neutral and negative sentences but terribly with positive sentences. In fact, Qwen2-1.5B-Instruct (96%) beats the best-performing phi-3 prompt (94%) in neutral True Positive Rate (TPR).
I’ll add a period after the instruction.
Respond with a single word: negative, positive, or neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Nope! Period == bad.
I’ll repeat the instruction to see if it improves performance.
Respond with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with a single word: negative, positive, or neutral
Nope! The repeated instruction decreases the accuracy.
Note that the negative sentiment TPR is near perfect.
I realized I didn’t have “Instruct:” at the start of the prompt so I’ll add that to Prompt C.
Instruct: Respond with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
That decreases the accuracy by 3%.
Note that this prompt yields the best neutral TPR so far (1355/1391 = 97.4%) for this model.
I’ll now add some additional instructions.
Your task is to analyze the sentiment (from an investor's perspective) of the text below.
Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Additional instruction worsens the model’s performance by ~15%.
Nothing immediately notable about the results.
I’ll now try a series of prompts suggested by Claude (based on the previous prompts and accuracies achieved). These prompts are simple and concise.
Sentiment: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Output: [negative/positive/neutral]
I’m surprised that such a simple prompt performs well, though it’s 11% worse than my best prompts so far.
What’s notable about this prompt’s results is the incredibly bad performance on positive sentiment (17%). Also noteworthy is that with this prompt there are now other responses than negative, neutral or positive.
Another prompt suggested by Claude.
Classify sentiment: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Answer: [negative/positive/neutral]
A similar performance to Prompt H.
This prompt performs worse for all three sentiments, but especially negative and positive.
Text: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Sentiment classification: [negative/positive/neutral]
This prompt yields a significantly worse performance.
Another Claude suggested prompt:
Analyze the sentiment of this text:
According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Classification: [negative/positive/neutral]
This prompt yields worse results.
Another Claude-suggested prompt (so far, these have not yielded good results!).
Categorize the following text as negative, positive, or neutral:
According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Category:
Another subpar result.
Another Claude suggested prompt (3 more left to try out!)
Sentiment analysis task:
Input: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Output: [negative/positive/neutral]
Nope! No success with this one either (well, 51% success).
determine the sentiment:
According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
sentiment: [negative/positive/neutral]
This one was particularly bad.
Sentiment classification task:
'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .'
Label: [negative/positive/neutral]
Looks like I’ll stick with Prompt A for now!
I’ll revisit Prompt A and give it a few examples.
def few_shot_responses(dataset, prompt, examples):
    responses = []

    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    print(dataset[0]['prompt'])

    few_shot_examples = []
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})

    for row in dataset:
        messages = few_shot_examples + [{"role": "user", "content": row['prompt']}]

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = tokenizer([text], return_tensors="pt").to(device)

        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=2
        )

        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip().lower()
        responses.append(response)

    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()

    return df, acc
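For the few-shot runs I remove the example sentences from the evaluation set with a ds_subset helper. Its definition is collapsed in the notebook, but given the __index_level_0__ column in the outputs below, it likely goes through pandas, roughly like this:

```python
from datasets import Dataset

# Rough sketch of the collapsed ds_subset helper: drop the few-shot example rows
# so the model isn't evaluated on sentences it has already seen as examples
def ds_subset(dataset, exclude_idxs):
    df = dataset.to_pandas().drop(index=exclude_idxs)
    return Dataset.from_pandas(df)  # keeps the original row ids in __index_level_0__
```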
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2261
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
examples
[('According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
'neutral'),
("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
'positive'),
('Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .',
'negative')]
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .
3-Shot prompting does not improve performance.
I’ll try 6-Shot prompting next.
exclude_idxs=[0, 1, 292, 37, 38, 39]
promptQ_ds = ds_subset(dataset, exclude_idxs=exclude_idxs)
promptQ_ds
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2258
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
6
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .
6-Shot doesn’t fare much better.
I’ll now bump it up to 20 examples.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2244
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
20
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .
Interestingly, the 20-Shot prompt yields a worse accuracy than the 0-Shot prompt.
Compared to 0-Shot Prompt A (85%), the 20-Shot prompt has a significantly higher TPR for neutral sentiment (94%) whereas negative and positive perform worse.
I’ll increase the number of examples to 30, but will only add negative and positive examples as the model’s performance was lacking for those sentiments.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2234
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
30
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
Increasing the number of examples and changing the proportions has not improved the overall accuracy.
Interestingly, the number of correctly predicted neutral sentences has increased with this prompt while the number of correctly predicted negative and positive sentences decreased.

As the number of examples has increased, the model is classifying more and more negative and positive sentences as neutral. After chatting with Claude, I’ll start removing neutral sentences from the examples to see if that reverses this trend.
I’ll start by giving the model 9 examples per sentiment.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2237
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
27
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
Hooray! This strategy yielded results. The accuracy increases by a couple percent.
The neutral TPR has decreased, as expected, but negative and positive are now making a comeback.
I’ll continue to decrease the number of examples, this time by decreasing the number of neutral examples.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2239
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
25
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
Decreasing the number of neutral examples decreases the accuracy.
The model is actually worse at predicting negative and positive sentences.
I’ll continue decreasing examples, with equal amounts from each sentiment for a 21-Shot prompt.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2243
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
21
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Foundries division reports its sales increased by 9.7 % to EUR 63.1 mn from EUR 57.5 mn in the corresponding period in 2006 , and sales of the Machine Shop division increased by 16.4 % to EUR 41.2 mn from EUR 35.4 mn in the corresponding period in 2006 .
The accuracy improves, but is still second-best so far.
Compared to my best-performing prompting strategy, this prompt yields one more correct negative sentence, 11 more correct neutrals and 17 fewer correct positives.
I’ll try one more reduction in examples to 15 before increasing them past 27.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2249
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
15
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Clothing retail chain Sepp+ñl+ñ 's sales increased by 8 % to EUR 155.2 mn , and operating profit rose to EUR 31.1 mn from EUR 17.1 mn in 2004 .
Decreasing the number of examples to 15 has worsened the accuracy.
Interestingly, here the negative and positive sentiments perform better but neutral performs worse (compared to the best-performing prompt).
I’ll now go in the other direction and increase the number of examples from 27 to 30 (10 per sentiment).
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2234
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
30
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: MegaFon 's subscriber base increased 16.1 % in 2009 to 50.5 million users as of December 31 , while its market share by the number of customers amounted to 24 % as of late 2009 , up from 23 % as of late 2008 , according to TeliaSonera estimates .
Nope! Increasing the examples to 30 doesn’t yield better results.

A similar trend as before is appearing: the model gets a lot better at predicting neutral sentences at the cost of negative and positive sentences.
Before I return to the best-performing 27-Shot prompt, I’ll try and give the model a significantly higher number of examples (60) to see if that improves performance.
exclude_idxs = [
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,# positive
292, 293, 294, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 370, # negative
0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280 # neutral
]
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2204
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
60
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: The fair value of the property portfolio doubled as a result of the Kapiteeli acquisition and totalled EUR 2,686.2 1,259.7 million .
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Nope! 60 examples does not improve performance.
Interestingly, this prompt causes the model to perform ridiculously well on neutral sentences (1368/1371 = 99.8%), but abysmally on negative and especially positive sentences (117/550 = 21%).
I’ll return to the best-performing prompt: 27-Shot Prompt A. I’ll see if adding instructions helps.
Label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: {text}
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2237
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
27
Label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
The modified prompt resulted in a ~2% drop in accuracy.
This prompt performs better than the best-performing one for neutral sentences (1374 > 1303) at the cost of negative and positive sentences.
Since positive sentiment is performing the worst, I’ll see if reducing neutral and negative examples improves its performance. I’ll stick with the original Prompt A.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2241
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
23
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
The overall accuracy drops a bit.
The positive performance deteriorates while negative and neutral get better.
I’ll now reduce the positive examples while keeping the other two the same as my best-performing prompt.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2239
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
25
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Foundries division reports its sales increased by 9.7 % to EUR 63.1 mn from EUR 57.5 mn in the corresponding period in 2006 , and sales of the Machine Shop division increased by 16.4 % to EUR 41.2 mn from EUR 35.4 mn in the corresponding period in 2006 .
While still a bit lower than my best-performing accuracy, this approach might be worth expanding on.
The number of correctly predicted neutral sentences increases, while the other two decrease.
I’ll continue reducing positive examples.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2241
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
23
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Clothing retail chain Sepp+ñl+ñ 's sales increased by 8 % to EUR 155.2 mn , and operating profit rose to EUR 31.1 mn from EUR 17.1 mn in 2004 .
Nope! That doesn’t improve accuracy.
neutral and positive sentences are correctly predicted at a worse rate, while negative sentences are predicted a bit better.
As I was staring at exclude_idxs I realized that they are sorted by sentiment with positive first, then negative and then neutral. Perhaps this order affects the generations? I’ll try randomizing the order of the 27 examples that yielded the best accuracy.
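The shuffling cell itself is collapsed; it was presumably just something like this (the exact seed, if any, isn’t shown):

```python
# Hypothetical cell: randomize the order of the 27 example indices in place
import random
random.shuffle(exclude_idxs)
exclude_idxs
```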
[293,
266,
264,
6,
347,
7,
263,
9,
38,
37,
0,
352,
5,
4,
351,
350,
40,
294,
1,
8,
349,
2,
348,
292,
3,
39,
265]
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2237
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
27
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Wow, that actually improved my accuracy! It’s making me question all previous few-shot prompt results!
negative TPR decreases, neutral and positive increase quite a bit!
I’ll return to my 60-Shot prompt and shuffle the examples to see if that improves the accuracy.
exclude_idxs = [
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,# positive
292, 293, 294, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 370, # negative
0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280 # neutral
]
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2204
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
60
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: The fair value of the property portfolio doubled as a result of the Kapiteeli acquisition and totalled EUR 2,686.2 1,259.7 million .
The shuffled 60-Shot prompt does not yield better results.
Interestingly, it gets the same number of neutral sentences right (1341) but gets lower counts for the other two.
I’ll now try a shuffled 15-Shot prompt.
Dataset({
features: ['sentence', 'label', 'label_text', '__index_level_0__'],
num_rows: 2249
})
examples = []
for idx in exclude_idxs:
examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))
len(examples)
15
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Clothing retail chain Sepp+ñl+ñ 's sales increased by 8 % to EUR 155.2 mn , and operating profit rose to EUR 31.1 mn from EUR 17.1 mn in 2004 .
Shuffling the 15-Shot prompt doesn’t beat the 27-Shot prompt.
The last thing I’ll experiment with is adding a system prompt.
def few_shot_responses(dataset, prompt, examples, sp):
    responses = []

    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    print(dataset[0]['prompt'])

    few_shot_examples = []
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})

    for row in dataset:
        messages = [{"role": "system", "content": sp}] + few_shot_examples + [{"role": "user", "content": row['prompt']}]

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = tokenizer([text], return_tensors="pt").to(device)

        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=2
        )

        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip().lower()
        responses.append(response)

    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()

    return df, acc
sp = "You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral. Consider terms like 'growth', 'decline', 'improvement', or 'deterioration' as indicators of change."
You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral. Consider terms like 'growth', 'decline', 'improvement', or 'deterioration' as indicators of change.
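The evaluation cell is collapsed; it presumably looked something like the call below, where promptAG_ds is a stand-in name for the dataset with the 27 example rows removed (as in the earlier runs):

```python
# Hypothetical call (the original cell is collapsed); promptAG_ds is a stand-in
# for the dataset with the 27 few-shot example sentences removed
df, acc = few_shot_responses(promptAG_ds, promptA, examples, sp)
acc
```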
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
This particular system prompt doesn’t improve the accuracy.
I’ll simplify the system prompt.
You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification.
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
Using a simpler system prompt lowers the accuracy.
The model gets 22 more neutral sentences correct but does worse on negative and positive sentences (when compared to the best-performing Prompt AD).
I’ll further simplify the system prompt.
You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral.
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
I still can’t recover (or improve upon) the best-performing accuracy of 86%.
The neutral performance continues to increase while negative and positive sentences suffer.
def test_gen(examples):
    few_shot_examples = []
    for example in examples:
        few_shot_examples.append({"role": "user", "content": promptA.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})

    messages = few_shot_examples + [{"role": "user", "content": promptA.format(text=dataset[0]['sentence'])}]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=2
    )

    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip().lower()
    return response
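The timing below presumably came from an IPython %timeit cell along these lines:

```python
# Hypothetical timing cell: per-prompt latency with the few-shot examples prepended
%timeit test_gen(examples)
```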
208 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def few_shot_responses(dataset, prompt, examples):
    responses = []

    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})

    few_shot_examples = []
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})

    for row in dataset:
        messages = few_shot_examples + [{"role": "user", "content": row['prompt']}]

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        model_inputs = tokenizer([text], return_tensors="pt").to(device)

        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=2
        )

        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip().lower()
        responses.append(response)

    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()

    return df, acc
This prompt results in a pretty consistent overall accuracy, around 86%.
Here are the results from my experiments so far (**the best-performing prompt from this notebook):
Model | Prompting Strategy | Overall Accuracy | negative | neutral | positive |
---|---|---|---|---|---|
claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
claude-3-opus-20240229 | 0-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
phi-3.5 | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
phi-3 | 30-Shot w/System Prompt | 92.79% | 98% (290/297) | 94% (1284/1373) | 88% (499/564) |
claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
phi-2 | 6-Shot | 91.94% | 88% (267/302) | 94% (1299/1387) | 90% (510/569) |
**Qwen2-1.5B | 27-Shot | 86.10% | 90% (264/294) | 95.5% (1320/1382) | 61% (342/561) |
Here are the results from this notebook. The best-performing prompt was a randomly shuffled 27-Shot prompt (Prompt AD), yielding an overall accuracy of 86.10%.
prompt | strategy | accuracy | negative | neutral | positive |
---|---|---|---|---|---|
A | 0-Shot | 81.76% | 97% (293/303) | 85% (1185/1391) | 65% (373/570) |
B | 0-Shot | 51.86% | 99% (300/303) | 61% (846/1391) | 5% (28/570) |
C | 0-Shot | 81.40% | 93% (283/303) | 96% (1330/1391) | 40% (230/570) |
D | 0-Shot | 78.53% | 92% (279/303) | 92% (1281/1391) | 38% (218/570) |
E | 0-Shot | 66.21% | 100% (302/303) | 82% (1145/1391) | 9% (52/570) |
F | 0-Shot | 78.05% | 88% (267/303) | 97% (1355/1391) | 25% (145/570) |
G | 0-Shot | 66.70% | 94% (285/303) | 80% (1107/1391) | 21% (118/570) |
H | 0-Shot | 70.89% | 85% (259/303) | 90% (1247/1391) | 17% (99/570) |
I | 0-Shot | 69.17% | 58% (176/303) | 86% (1201/1391) | 33% (189/570) |
J | 0-Shot | 57.38% | 47% (142/303) | 78% (1086/1391) | 12% (71/570) |
K | 0-Shot | 41.87% | 34% (102/303) | 52% (728/1391) | 21% (118/570) |
L | 0-Shot | 42.84% | 66% (200/303) | 45% (629/1391) | 25% (141/570) |
M | 0-Shot | 51.46% | 26% (79/303) | 77% (1078/1391) | 1% (8/570) |
N | 0-Shot | 29.77% | 11% (33/303) | 44% (608/1391) | 6% (33/570) |
O | 0-Shot | 61.00% | 37% (113/303) | 90% (1257/1391) | 2% (11/570) |
P | 3-Shot | 78.20% | 91% (275/302) | 91% (1266/1390) | 40% (227/569) |
Q | 6-Shot | 76.93% | 96% (289/302) | 73% (1010/1387) | 77% (438/569) |
R | 20-Shot | 81.42% | 92% (274/299) | 94% (1301/1379) | 45% (252/566) |
S | 30-Shot | 81.51% | 87% (255/294) | 98% (1345/1379) | 39% (221/561) |
T | 27-Shot | 83.73% | 93% (272/294) | 94.3% (1303/1382) | 53% (298/561) |
U | 25-Shot | 82.94% | 90% (266/294) | 96.2% (1331/1384) | 46% (260/561) |
V | 21-Shot | 83.28% | 92% (273/296) | 94.9% (1314/1384) | 50% (281/563) |
W | 15-Shot | 81.55% | 94% (279/298) | 87.7% (1215/1386) | 60% (340/565) |
X | 30-Shot | 81.74% | 89% (261/293) | 96.7% (1336/1381) | 41% (229/560) |
Y | 60-Shot | 75.82% | 66% (186/283) | 99.8% (1368/1371) | 21% (117/550) |
Z | 27-Shot | 81.36% | 80% (236/294) | 99.4% (1374/1382) | 37% (210/561) |
AA | 23-Shot | 82.95% | 93% (276/296) | 94.9% (1314/1384) | 48% (269/561) |
AB | 25-Shot | 83.70% | 92% (270/294) | 95.3% (1317/1382) | 51% (287/563) |
AC | 23-Shot | 83.40% | 95% (278/294) | 93.8% (1296/1382) | 52% (295/565) |
AD | 27-Shot | 86.10% | 90% (264/294) | 95.5% (1320/1382) | 61% (342/561) |
AE | 60-Shot | 83.71% | 83% (234/283) | 97.8% (1341/1371) | 49% (270/550) |
AF | 15-Shot | 82.00% | 91% (272/298) | 88.8% (1231/1386) | 60% (341/565) |
AG | 27-Shot w/System Prompt | 84.40% | 84% (248/294) | 97.8% (1351/1382) | 52% (289/561) |
AH | 27-Shot w/System Prompt | 84.67% | 87% (256/294) | 97.1% (1342/1382) | 53% (296/561) |
AI | 27-Shot w/System Prompt | 84.99% | 88% (260/294) | 97.3% (1345/1382) | 50% (283/561) |
Here are my takeaways from working with Qwen2-1.5B-Instruct:

- The positive sentiment True Positive Rate was considerably worse than for neutral or negative sentiments. The most accurate positive sentiment classification was 77% (Prompt Q), compared to 99.8% for neutral (Prompt Y) and 100% for negative (Prompt E).
- Misclassifying negative or positive sentiment could be costly if people are making decisions based on the predicted sentiment.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.