Sentiment Classification with phi-3.5

python
LLM
TinySentiment
In this blog post I use phi-3.5 to classify sentiment in the financial_phrasebank dataset with 93.94% accuracy.
Author

Vishal Bakshi

Published

September 12, 2024

Setup

Show pip installs
!pip install torch==2.3.1 -qq
!pip install accelerate==0.31.0 -qq
!pip install transformers==4.41.2 -qq
!pip install huggingface_hub -qq
!pip install datasets~=2.16.1 -qq
!pip install plotly==5.19.0 -qq
!pip install scikit-learn==1.2 -qq
!pip install pynvml -qq
Show imports and setup
import gc
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype
import torch

def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()
    
import warnings
#warnings.filterwarnings("ignore")

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import time

from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 
from transformers.pipelines.pt_utils import KeyDataset
from fastcore.all import *

#torch.set_default_device("cuda")
torch.cuda.set_device(0)

model_nm = "microsoft/Phi-3.5-mini-instruct"
model = AutoModelForCausalLM.from_pretrained( 
    model_nm,  
    device_map="cuda",  
    torch_dtype="auto",  
    trust_remote_code=True,  
) 

tokenizer = AutoTokenizer.from_pretrained(model_nm)

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

# load dataset
dataset = load_dataset(
    "financial_phrasebank", "sentences_allagree", 
    split="train"  # note that the dataset does not have a default test split
)
# create a new column with the numeric label verbalised as label_text (e.g. "positive" instead of "0")
label_map = {i: label_text for i, label_text in enumerate(dataset.features["label"].names)}

def add_label_text(example):
    example["label_text"] = label_map[example["label"]]
    return example

dataset = dataset.map(add_label_text)

print(dataset)
Show add_prompt and generate_responses functions
def add_prompt(item, prompt):
    item['prompt'] = prompt.format(text=item['sentence'])
    return item
    
def generate_responses(dataset, prompt):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    
    # check that the prompt is correctly formatted
    print(dataset[0]['prompt'])
    print('---------')
    
    for row in dataset:
        messages = [  
            {"role": "user", "content": row['prompt']},
        ] 

        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()
    return df, acc
Show generate_response function
def generate_response(prompt):
    messages = [  
        {"role": "user", "content": prompt},
    ] 

    generation_args = { 
        "max_new_tokens": 2, 
        "return_full_text": False, 
        "temperature": 0.1, 
        "do_sample": True, 
    } 

    output = pipe(messages, **generation_args) 
    return output[0]['generated_text']
Show make_cm function
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""
    
    cm = confusion_matrix(y_true=df['label_text'], y_pred=df['responses'], labels=['negative', 'neutral', 'positive', 'other'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive', 'other'])
    
    # I chose 8x8 so it fits on one screen but still is large
    fig, ax = plt.subplots(figsize=(8,8))
    disp.plot(ax=ax,text_kw={'fontsize': 16}, cmap='Blues', colorbar=False);
    
    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(18)
    ax.yaxis.label.set_fontsize(18)
    
    # make tick labels larger
    ax.tick_params(axis='y', labelsize=16)
    ax.tick_params(axis='x', labelsize=16)
Show ds_subset function
def ds_subset(dataset, exclude_idxs, columns=[0, 1, 2]):
    idxs = list(range(len(dataset)))
    idxs = [x for x in idxs if x not in exclude_idxs]
    ddf = dataset.to_pandas()
    new_ds = Dataset.from_pandas(ddf.iloc[idxs, columns])
    return new_ds

Background

In this notebook I’ll use Phi-3.5-mini-instruct to classify sentiment in the financial_phrasebank dataset. In previous notebooks I have performed sentiment classification with phi-2 and the Claude series.

This notebook is part of a series of blog posts for a project I’m working on called TinySentiment, where I’m experimenting with tiny models to improve their ability to classify sentiment in the financial_phrasebank dataset. I was inspired to do so after reading this blog post and this corresponding notebook by Moritz Laurer as part of a fastai study group last year.

Here are the results from my experiments so far (**the best-performing prompt from this notebook):

| Model | Prompting Strategy | Overall Accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
| claude-3-opus-20240229 | 0-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
| **phi-3.5 | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
| phi-3 | 30-Shot w/System Prompt | 92.79% | 98% (290/297) | 94% (1284/1373) | 88% (499/564) |
| claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
| phi-2 | 6-Shot | 91.94% | 88% (267/302) | 94% (1299/1387) | 90% (510/569) |

Here are the per-prompt results from this notebook (phi-3.5):

| prompt | strategy | accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| A | 0-Shot | 62.32% | 98% (296/303) | 43% (592/1391) | 92% (523/570) |
| B | 0-Shot | 88.60% | 96% (290/303) | 87% (1215/1391) | 88% (501/570) |
| C | 0-Shot | 83.48% | 98% (298/303) | 76% (1062/1391) | 93% (530/570) |
| D | 0-Shot | 68.64% | 99% (300/303) | 51% (713/1391) | 95% (541/570) |
| E | 0-Shot | 88.25% | 96% (290/303) | 87% (1207/1391) | 88% (501/570) |
| F | 3-Shot | 84.65% | 98% (296/302) | 77% (1070/1390) | 96% (548/569) |
| G | 6-Shot | 77.99% | 98% (297/302) | 66% (913/1387) | 97% (551/569) |
| H | 3-Shot | 83.06% | 98% (296/302) | 74% (1028/1390) | 97% (554/569) |
| I | 3-Shot | 51.61% | 100% (302/302) | 32% (447/1390) | 73% (418/569) |
| J | 3-Shot | 85.94% | 98% (296/302) | 80% (1108/1390) | 95% (539/569) |
| K | 0-Shot | 77.96% | 98% (298/303) | 66% (919/1391) | 96% (548/570) |
| L | 0-Shot | 80.57% | 98% (297/303) | 70% (972/1391) | 97% (555/570) |
| M | 0-Shot | 91.30% | 97% (294/303) | 90% (1257/1391) | 91% (516/570) |
| N | 0-Shot w/System Prompt | 88.74% | 97% (295/303) | 85% (1184/1391) | 93% (530/570) |
| O | 0-Shot w/System Prompt | 87.10% | 94% (285/303) | 83% (1156/1391) | 93% (531/570) |
| P | 0-Shot | 92.23% | 94% (285/303) | 94% (1307/1391) | 87% (496/570) |
| Q | 0-Shot | 79.37% | 99% (300/303) | 73% (1009/1391) | 86% (488/570) |
| R | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
| S | 28-Shot | 93.25% | 94% (281/298) | 99% (1358/1373) | 79% (446/565) |
| T | 20-Shot | 84.54% | 78% (232/299) | 99.9% (1378/1379) | 51% (287/566) |

Prompt A

The Hugging Face model card for Phi-3.5-mini-instruct says:

Given the nature of the training data, the Phi-3.5-mini-instruct model is best suited for prompts using the chat format

So, the first prompt I’ll try will be a simple instruction:

promptA = """Label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}"""
text = dataset[1]["sentence"]
text
"For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."
formatted_prompt = promptA.format(text=text)
print(formatted_prompt)
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
generate_response(formatted_prompt)
You are not running the flash-attention implementation, expect numerical differences.
' Negative'
%time generate_response(formatted_prompt)
CPU times: user 101 ms, sys: 9.29 ms, total: 111 ms
Wall time: 109 ms
' Negative'
%timeit -n 10 generate_response(formatted_prompt)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
91.9 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Good, at least it works! It looks like I’ll have to strip the outputs of whitespace and convert them to lowercase, though. It takes about 0.1 seconds to generate a response, so it should take about 4 minutes to run inference on the whole dataset.
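Here is the rough arithmetic behind that estimate (a back-of-the-envelope sketch; actual wall time will vary with sentence length):

n_sentences = len(dataset)  # 2264 sentences in the sentences_allagree split
est_minutes = 0.1 * n_sentences / 60  # assumes ~0.1 s per generation, as timed above
print(f"~{est_minutes:.1f} minutes for the full dataset")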

df, acc = generate_responses(dataset, promptA)
Label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
---------
acc
0.6232332155477032
df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
df.to_csv('/notebooks/phi-3-5_A.csv', index=False)

This prompt struggled with the neutral sentiment: 568 of the 1391 neutral sentences received a response other than negative, positive, or neutral.
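As a quick numeric check on that (a small sketch of my own, using the df returned by generate_responses), I can count the neutral sentences whose response was mapped to "other":

neutral = df[df['label_text'] == 'neutral']
print((neutral['responses'] == 'other').sum(), 'of', len(neutral), 'neutral sentences labeled "other"')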

make_cm(df)

Prompt B

I’ll repeat the instruction after the sentence and see if that improves the performance (as it did for phi-2).

promptB = """Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral"""
df, acc = generate_responses(dataset, promptB)
Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
---------

The accuracy jumps from 62.3% to 88.6%! Repeating the instruction after the dataset item was something I learned to do in the fastai study group.

acc
0.8860424028268551
df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
df.to_csv('/notebooks/phi-3-5_B.csv', index=False)

The model does a much better job at predicting neutral sentiment with this adjustment.

make_cm(df)

Prompt C

I’ll add some introductory text to the prompt to see if that improves the model’s performance:

promptC = """Your task is to analyze the sentiment (from an investor's perspective) of the text below.

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral"""
df, acc = generate_responses(dataset, promptC)
Your task is to analyze the sentiment (from an investor's perspective) of the text below.

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
---------

The addition of this introductory text actually worsens the model’s performance by about 5%.

acc
0.8348056537102474
df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
make_cm(df)

df.to_csv('/notebooks/phi-3-5_C.csv', index=False)

Prompt D

I’ll try another prompt language adjustment to Prompt B: I’ll replace “label” with “Respond”.

promptD = """Instruct: Respond with only one of these words: negative, positive, or neutral
TEXT: {text}
Respond with only one of these words: negative, positive, or neutral"""
df, acc = generate_responses(dataset, promptD)
Instruct: Respond with only one of these words: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with only one of these words: negative, positive, or neutral
---------

Wow! The accuracy plummets to 69%. This was something that had improved the accuracy for phi-2.

acc
0.6863957597173145

The model actually improves its performance on negative and positive sentences, but significantly worsens its performance when classifying neutral sentences.
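For a quick numeric view of that per-label breakdown (a small sketch; the confusion matrix below shows the same information visually):

# correct predictions (sum of lm_match) and row counts per true label
df.groupby('label_text')['lm_match'].agg(['sum', 'count'])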

make_cm(df)

df.to_csv('/notebooks/phi-3-5_D.csv', index=False)

Prompt E

Another adjustment that improved phi-2’s performance was to add a period after the instruction. I’ll see if doing so improves phi-3.5’s performance.

promptE = """Instruct: label the following TEXT with a single word: negative, positive, or neutral.
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral."""
df, acc = generate_responses(dataset, promptE)
Instruct: label the following TEXT with a single word: negative, positive, or neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral.
---------

Interestingly, this actually worsens the overall accuracy a bit.

acc
0.8825088339222615

The negative and positive true positive rates are the same as Prompt B’s, but the neutral rate is worse (1207 < 1215).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_E.csv', index=False)

Prompt F

I’ll now move on to few-shot prompting to see if I can improve on the best overall accuracy so far (88.6%). To do so, I’ll create a new helper function, since the chat template handles few-shot prompts as multiple query-response exchanges between user and assistant.
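Concretely, each labelled example becomes a user/assistant exchange and the sentence to classify is the final user turn. A minimal sketch of the message list (sentences abbreviated; the actual construction happens inside few_shot_responses below):

messages = [
    {"role": "user", "content": promptB.format(text="<example sentence 1>")},
    {"role": "assistant", "content": "neutral"},
    {"role": "user", "content": promptB.format(text="<example sentence 2>")},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": promptB.format(text="<sentence to classify>")},
]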

Show few_shot_responses function
def few_shot_responses(dataset, prompt, examples):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})

    few_shot_examples = []
    
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})
    
    count = 0
    for row in dataset:
        count += 1
        messages = few_shot_examples + [{"role": "user", "content": row['prompt']}]
        
        if count == 1: print(messages)
        
        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()
    return df, acc
exclude_idxs = [0, 1, 292]
promptF_ds = ds_subset(dataset, exclude_idxs)
promptF_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2261
})
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

examples
[('According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  'neutral'),
 ("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
  'positive'),
 ('Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .',
  'negative')]
df, acc = few_shot_responses(promptF_ds, promptB, examples)
[{'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single word: negative, positive, or neutral'}]

The accuracy drops to 84.65%.

acc
0.8465280849181778

Compared to Prompt B, the true positive rate for neutral decreases (1070 < 1215) whereas for positive and negative sentiment the TPR increases (296 > 290, 548 > 501).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_F.csv', index=False)

Prompt G

I’ll now try a 6-Shot prompt using the examples that were best-performing for phi-2.

exclude_idxs=[0, 1, 292, 37, 38, 39]
promptG_ds = ds_subset(dataset, exclude_idxs=exclude_idxs)
promptG_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2258
})
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

examples[0], len(examples)
(('According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  'neutral'),
 6)
df, acc = few_shot_responses(promptG_ds, promptB, examples)
[{'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .\nlabel the TEXT with a single word: negative, positive, or neutral'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single word: negative, positive, or neutral'}]

Unexpectedly, the accuracy drops to 78%.

acc
0.7798937112488928

The model performs better with this prompt than with the best-performing 3-Shot prompt so far (84.7%) on negative sentences (297 > 296) and positive sentences (551 > 548), but much worse on neutral sentences (913 < 1070).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_G.csv', index=False)

Prompt H

I’ll return to the 3-Shot prompt (84.65%) and see if I can improve it by adjusting the language. First, I’ll add some introductory text to the start of the prompt. Note that this did not improve the 0-Shot performance.

promptH = """Your task is to analyze the sentiment (from an investor's perspective) of the text below.
Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral"""
exclude_idxs = [0, 1, 292]

examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

examples
[('According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  'neutral'),
 ("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
  'positive'),
 ('Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .',
  'negative')]
promptF_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2261
})
df, acc = few_shot_responses(promptF_ds, promptH, examples)
[{'role': 'user', 'content': "Your task is to analyze the sentiment (from an investor's perspective) of the text below.\nInstruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Your task is to analyze the sentiment (from an investor's perspective) of the text below.\nInstruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': "Your task is to analyze the sentiment (from an investor's perspective) of the text below.\nInstruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single word: negative, positive, or neutral"}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': "Your task is to analyze the sentiment (from an investor's perspective) of the text below.\nInstruct: label the following TEXT with a single word: negative, positive, or neutral\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single word: negative, positive, or neutral"}]

This does not improve the overall accuracy. Instead, it drops by about 1.6%.

acc
0.8306059265811587

Compared to the 3-Shot prompt, negative sentences are classified at the same frequency (296/302), neutral sentences at a lower rate (1028 < 1070) and positive sentences at a higher rate (554 > 548).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_H.csv', index=False)

Prompt I

Before I give the model more than 6 examples, I’ll deviate from the recommended multi-turn chat format for few-shot prompting and give the examples in a single prompt.

promptI = """Instruct: label the following TEXT with a single word: negative, positive, or neutral

Examples:

TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
neutral

TEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
label the TEXT with a single word: negative, positive, or neutral
positive

TEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
label the TEXT with a single word: negative, positive, or neutral
negative

TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral
"""
promptF_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2261
})
df, acc = generate_responses(promptF_ds, promptI)
Instruct: label the following TEXT with a single word: negative, positive, or neutral

Examples:

TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
neutral

TEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
label the TEXT with a single word: negative, positive, or neutral
positive

TEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
label the TEXT with a single word: negative, positive, or neutral
negative

TEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .
label the TEXT with a single word: negative, positive, or neutral

---------

Nope! The performance of few-shot prompting without multi-turn format is drastically worse.

acc
0.5161432994250331

The true positive rate for negative sentiment is actually higher (302/302, or 100%) but the rate is much lower for neutral sentiment (447 < 1070) and positive sentiment (418 < 548).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_I.csv', index=False)

Prompt J

I’ll try one more prompt with single-turn few-shot examples. I’ll add “Output:” before the label in each example, and add the “Instruct:” instructions before each example TEXT. I’ll also remove the extra new line that I have after the final instruction.

promptJ = """Instruct: label the following TEXT with a single word: negative, positive, or neutral

Examples:

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
Output: neutral

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
label the TEXT with a single word: negative, positive, or neutral
Output: positive

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
label the TEXT with a single word: negative, positive, or neutral
Output: negative

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral
Output: """
df, acc = generate_responses(promptF_ds, promptJ)
Instruct: label the following TEXT with a single word: negative, positive, or neutral

Examples:

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral
Output: neutral

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
label the TEXT with a single word: negative, positive, or neutral
Output: positive

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
label the TEXT with a single word: negative, positive, or neutral
Output: negative

Instruct: label the following TEXT with a single word: negative, positive, or neutral
TEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .
label the TEXT with a single word: negative, positive, or neutral
Output: 
---------

Wow! This actually made a difference: this is the best few-shot accuracy I have achieved so far.

acc
0.8593542680229986

Compared to the previous best few-shot prompt (3-Shot Prompt F), this prompt results in the same true positive rate for negative sentiment (296/302), a much higher rate for neutral sentiment (1108 > 1070), and a lower rate for positive sentiment (539 < 548).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_J.csv', index=False)

Prompt K

I’ll return to few-shot prompting in a bit, but want to first revisit zero-shot prompting as it yielded the best overall performance so far (88.6% overall accuracy).

I asked Claude for suggestions on how to improve that prompt and will be trying them out.

First suggestion:

Refine the Instruction: Try slight variations of the instruction to see if they yield better results:

promptK = """Instruct: Analyze the sentiment of the following financial statement and respond with a single word: negative, positive, or neutral
Financial statement: {text}
Sentiment (respond with a single word):"""
df, acc = generate_responses(dataset, promptK)
Instruct: Analyze the sentiment of the following financial statement and respond with a single word: negative, positive, or neutral
Financial statement: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Sentiment (respond with a single word):
---------

This yields a worse overall accuracy.

acc
0.7795936395759717

Compared to the best-performing 0-Shot Prompt B (88.6%), this prompt yields a higher true positive rate for negative sentiment (298 > 290) and positive sentiment (548 > 501) but a lower rate for neutral sentiment (919 < 1215).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_K.csv', index=False)

Prompt L

Given the success of that prompt with negative and positive sentiment, I’ll see if I can improve it for neutral sentiment by adding the phrase: “if you’re not sure, respond with neutral.”

promptL = """Instruct: Analyze the sentiment of the following financial statement and respond with a single word: negative, positive, or neutral. If you’re not sure, respond with neutral.
Financial statement: {text}
Sentiment (respond with a single word, if you’re not sure, respond with neutral):"""
df, acc = generate_responses(dataset, promptL)
Instruct: Analyze the sentiment of the following financial statement and respond with a single word: negative, positive, or neutral. If you’re not sure, respond with neutral.
Financial statement: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Sentiment (respond with a single word, if you’re not sure, respond with neutral):
---------

This improves the overall accuracy, but it is still lower than Prompt B’s (88.6%).

acc
0.8056537102473498

Compared to Prompt K, the true positive rate for neutral (972 > 919) and positive (555 > 548) sentiment increased, while the negative rate dipped slightly (297 < 298).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_L.csv', index=False)

Prompt M

Given the success of the phrase “if you’re not sure, respond with neutral” I’ll add it to Prompt B.

promptM = """Instruct: label the following TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral."""
df, acc = generate_responses(dataset, promptM)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
---------

Hooray!! With this language adjustment, I have achieved the best overall accuracy so far.

acc
0.9129858657243817

The true positive rate for all three sentiments has increased.

make_cm(df)

df.to_csv('/notebooks/phi-3-5_M.csv', index=False)

Prompt N

I’ll see if adding a system prompt improves the performance.

Show updated generate_responses function
def generate_responses(dataset, prompt, sp=False):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    
    # check that the prompt is correctly formatted
    print(dataset[0]['prompt'])
    print('---------')
    
    for row in dataset:
        
        if sp:
            messages = [{'role': 'system', 'content': 'You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification.'}
                       ] + [{"role": "user", "content": row['prompt']},]
            
        else: messages = [{"role": "user", "content": row['prompt']},] 

        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()
    return df, acc
df, acc = generate_responses(dataset, promptM, sp=True)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
---------
You are not running the flash-attention implementation, expect numerical differences.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

Adding that system prompt results in a worse accuracy.

acc
0.8873674911660777

Compared to Prompt B, the true positive rate for negative (295 > 290) and positive (530 > 501) sentiment increases, but for neutral (1184 < 1215) it decreases.

make_cm(df)

df.to_csv('/notebooks/phi-3-5_N.csv', index=False)

Prompt O

I’ll see if adding “if you’re not sure, respond with neutral” to the system message improves performance.

Show updated generate_responses function
def generate_responses(dataset, prompt, sp=False):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})
    
    # check that the prompt is correctly formatted
    print(dataset[0]['prompt'])
    print('---------')
    
    for row in dataset:
        
        if sp:
            messages = [{'role': 'system', 'content': sp}
                       ] + [{"role": "user", "content": row['prompt']},]
            
        else: messages = [{"role": "user", "content": row['prompt']},] 

        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()
    return df, acc
sp = "You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification. If you're not sure, respond with neutral."
print(sp)
You are an expert in financial sentiment analysis. Your task is to accurately classify the sentiment of financial statements as negative, positive, or neutral. Consider the overall impact and implications of the statement when making your classification. If you're not sure, respond with neutral.
df, acc = generate_responses(dataset, promptM, sp=sp)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral. If you're not sure, respond with neutral.
---------

This system prompt still performs worse than no system prompt.

acc
0.8710247349823321

Compared to the best-performing Prompt M, this prompt yields a higher true positive rate for positive sentiment (531 > 516) but a lower rate for neutral sentiment (1156 < 1257) and negative sentiment (285 < 294).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_O.csv', index=False)

Prompt P

I’ll move away from system prompts for now and revisit language adjustments. For Prompt M, I’ll replace:

If you’re not sure, respond with neutral.

with

If the amount of money is not explicitly increasing or decreasing, respond with neutral.

promptP = """Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."""
df, acc = generate_responses(dataset, promptP)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
---------

Excellent! The overall accuracy again increases, this time to 92.2%. I’m still quite surprised it’s taken so much effort to surpass phi-2, but I’ll reflect on that later on.

acc
0.9222614840989399

Compared to Prompt B, both negative and positive sentiment true positive rates decrease slightly, but this prompt results in almost 100 more correct neutral responses (1307 > 1215).

make_cm(df)

df.to_csv('/notebooks/phi-3-5_P.csv', index=False)

Prompt Q

Given the success of the zero-shot prompt with instructions on handling neutral statements, I’ll try a prompt suggested by Claude, which adds more nuance to handling neutral sentences:

promptQ = """Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral. 
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If key financial metrics are not clearly changing, respond with neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral."""
print(promptQ)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral. 
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If key financial metrics are not clearly changing, respond with neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral.
df, acc = generate_responses(dataset, promptQ)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral. 
TEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
label the TEXT with a single word: negative, positive, or neutral. If key financial metrics are not clearly changing, respond with neutral. If the amount of money, market share, or key performance indicators are not explicitly increasing or decreasing, respond with neutral.
---------
You are not running the flash-attention implementation, expect numerical differences.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

A more nuanced prompt actually deteriorates the overall accuracy by about 13%.

acc
0.7937279151943463

Compared to the best-performing Prompt P (92.2%), this prompt performs better on negative sentiment but worse on neutral and positive sentiment.

make_cm(df)

df.to_csv('/notebooks/phi-3-5_Q.csv', index=False)

Prompt R

I’ll now try providing a large number of examples (20) in the prompt. I don’t expect this to improve upon my 92.2% accuracy since 3-Shot and 6-Shot prompting performed worse. Nevertheless, I’ve heard that it’s not uncommon to give a model dozens of examples.

exclude_idxs = [1, 2, 3, 4, 292, 293, 294, 347, 0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 274, 283]
promptR_ds = ds_subset(dataset, exclude_idxs=exclude_idxs, columns=[0, 1, 2])
promptR_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2244
})
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

examples[0], len(examples)
(("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
  'positive'),
 20)
df, acc = few_shot_responses(promptR_ds, promptP, examples)
[{'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: However , the growth margin slowed down due to the financial crisis .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: 2009 3 February 2010 - Finland-based steel maker Rautaruukki Oyj ( HEL : RTRKS ) , or Ruukki , said today it slipped to a larger-than-expected pretax loss of EUR46m in the fourth quarter of 2009 from a year-earlier profit of EUR45m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Tiimari operates 194 stores in six countries -- including its core Finnish market -- and generated a turnover of 76.5 mln eur in 2005 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Finnish Talvivaara Mining Co HEL : TLV1V said Thursday it had picked BofA Merrill Lynch and JPMorgan NYSE : JPM as joint bookrunners of its planned issue of convertible notes worth up to EUR250m USD332m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The mall is part of the Baltic Pearl development project in the city of St Petersburg , where Baltic Pearl CJSC , a subsidiary of Shanghai Foreign Joint Investment Company , is developing homes for 35,000 people .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Vacon controls a further 5 % of the company via investment fund Power Fund I. EUR 1.0 = USD 1.397\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: 4 ) Complete name of the shareholder : Otto Henrik Bernhard Nyberg 5 ) Further information : The amount of shares now transferred corresponds to 5.68 % of the total number of shares in Aspo Plc. .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: It has some 30 offices worldwide and more than 90 pct of its net sales are generated outside Finland .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The contract value amounts to about EUR11m , the company added .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The business to be divested generates consolidated net sales of EUR 60 million annually and currently has some 640 employees .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}]

Wow! I’m so glad I tried a larger number of examples. The accuracy (93.94%) is now competitive with the Claude models!

acc
0.9393939393939394

With 20-Shot prompting, the number of correctly classified negative (286 vs. 285) and neutral (1355 vs. 1307) sentences increases compared to Prompt P, while it decreases for positive (467 vs. 496) sentences.
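
For reference, here is a minimal sketch of how these per-class counts and rates can be pulled from the results dataframe (assuming the label_text, responses, and lm_match columns produced by few_shot_responses):

# per-class true positive counts and rates from the Prompt R results
per_class = df.groupby('label_text')['lm_match'].agg(['sum', 'count', 'mean'])
per_class.columns = ['true_positives', 'total', 'tpr']
per_class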

make_cm(df)

df.to_csv('/notebooks/phi-3-5_R.csv', index=False)

Prompt S

I’ll increase the number of examples to 28 and see if that yields an improvement. I currently have 4 positive, 4 negative and 12 neutral examples. I’ll up that to 5:5:18.
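
As a rough sanity check on those proportions, the dataset's label balance can be inspected with a quick sketch (using the label_text column added during setup); neutral sentences make up the majority, which is why the examples are weighted toward neutral:

from collections import Counter

# count how many sentences carry each label in the sentences_allagree split
Counter(dataset['label_text'])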

exclude_idxs = [
    1, 2, 3, 4, 5, # positive
    292, 293, 294, 347, 348, # negative
    0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 274, 283, 284, 285, 286, 287, 288, 289 # neutral
]
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

examples[0], len(examples)
(("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
  'positive'),
 28)
promptS_ds = ds_subset(dataset, exclude_idxs=exclude_idxs, columns=[0, 1, 2])
promptS_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2236
})
df, acc = few_shot_responses(promptS_ds, promptP, examples)
[{'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'positive'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: However , the growth margin slowed down due to the financial crisis .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: 2009 3 February 2010 - Finland-based steel maker Rautaruukki Oyj ( HEL : RTRKS ) , or Ruukki , said today it slipped to a larger-than-expected pretax loss of EUR46m in the fourth quarter of 2009 from a year-earlier profit of EUR45m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: ( ADPnews ) - Feb 3 , 2010 - Finland-based steel maker Rautaruukki Oyj ( HEL : RTRKS ) , or Ruukki , said today it slipped to a larger-than-expected pretax loss of EUR 46 million ( USD 64.5 m ) in the fourth quarter of 2009 from a\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'negative'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .\nlabel the TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral."}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Tiimari operates 194 stores in six countries -- including its core Finnish market -- and generated a turnover of 76.5 mln eur in 2005 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Finnish Talvivaara Mining Co HEL : TLV1V said Thursday it had picked BofA Merrill Lynch and JPMorgan NYSE : JPM as joint bookrunners of its planned issue of convertible notes worth up to EUR250m USD332m .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The mall is part of the Baltic Pearl development project in the city of St Petersburg , where Baltic Pearl CJSC , a subsidiary of Shanghai Foreign Joint Investment Company , is developing homes for 35,000 people .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Vacon controls a further 5 % of the company via investment fund Power Fund I. EUR 1.0 = USD 1.397\nlabel the TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: 4 ) Complete name of the shareholder : Otto Henrik Bernhard Nyberg 5 ) Further information : The amount of shares now transferred corresponds to 5.68 % of the total number of shares in Aspo Plc. .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: It has some 30 offices worldwide and more than 90 pct of its net sales are generated outside Finland .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The contract value amounts to about EUR11m , the company added .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The business to be divested generates consolidated net sales of EUR 60 million annually and currently has some 640 employees .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The company generates net sales of about 600 mln euro $ 775.5 mln annually and employs 6,000 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The contract covers the manufacturing , surface-treatment and installation of the steel structures .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. 
If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The order also includes start-up and commissioning services .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: The phones are targeted at first time users in growth markets .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Tielinja generated net sales of 7.5 mln euro $ 9.6 mln in 2005 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Tikkurila Powder Coatings has some 50 employees at its four paint plants , which generated revenues of EUR2 .4 m USD3 .3 m in 2010 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.'}, {'role': 'assistant', 'content': 'neutral'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.\nTEXT: Clothing retail chain Sepp+ñl+ñ 's sales increased by 8 % to EUR 155.2 mn , and operating profit rose to EUR 31.1 mn from EUR 17.1 mn in 2004 .\nlabel the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."}]
You are not running the flash-attention implementation, expect numerical differences.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

Interestingly, that decreases the overall accuracy by about 0.7 percentage points (93.94% to 93.25%).

acc
0.932468694096601

Compared to Prompt R, this prompt yields fewer correct negative and positive sentences. It classifies 3 more neutral sentences correctly, but that doesn’t make up for the drop in performance on the other two sentiments.

make_cm(df)

df.to_csv('/notebooks/phi-3-5_S.csv', index=False)

Prompt T

I noticed in the Prompt R results that 14 sentences were classified as something “other” than neutral, positive, or negative. Instead of asking the model to respond with negative, neutral or positive, I’ll ask it to respond with 0, 1 or 2 and see if that simplification yields better results.
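
That count can be verified directly from the saved Prompt R results; a minimal sketch (assuming the responses column written out by few_shot_responses):

# responses outside the three expected labels were bucketed as "other"
promptR_df = pd.read_csv('/notebooks/phi-3-5_R.csv')
(promptR_df['responses'] == 'other').sum()  # reported as 14 for Prompt R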

promptT = """Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).
TEXT: {text}
label the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral)."""
print(promptT)
Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).
TEXT: {text}
label the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).
exclude_idxs = [1, 2, 3, 4, 292, 293, 294, 347, 0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 274, 283]
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], str(dataset[idx]['label'])))

examples[0], len(examples)
(("For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",
  '2'),
 20)
Show updated few_shot_responses function
def few_shot_responses(dataset, prompt, examples):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})

    few_shot_examples = []
    
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})
    
    count = 0
    for row in dataset:
        count += 1
        messages = few_shot_examples + [{"role": "user", "content": row['prompt']}]
        
        if count == 1: print(messages)
        
        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # collect responses; accuracy is computed after mapping the integer
    # responses back to label text (see the cell below)
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    return df
df = few_shot_responses(promptR_ds, promptT, examples)
[{'role': 'user', 'content': "Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral)."}, {'role': 'assistant', 'content': '2'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '2'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '2'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '2'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '0'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). 
If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '0'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: However , the growth margin slowed down due to the financial crisis .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '0'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: 2009 3 February 2010 - Finland-based steel maker Rautaruukki Oyj ( HEL : RTRKS ) , or Ruukki , said today it slipped to a larger-than-expected pretax loss of EUR46m in the fourth quarter of 2009 from a year-earlier profit of EUR45m .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '0'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': "Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral)."}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). 
If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Tiimari operates 194 stores in six countries -- including its core Finnish market -- and generated a turnover of 76.5 mln eur in 2005 .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Finnish Talvivaara Mining Co HEL : TLV1V said Thursday it had picked BofA Merrill Lynch and JPMorgan NYSE : JPM as joint bookrunners of its planned issue of convertible notes worth up to EUR250m USD332m .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: The mall is part of the Baltic Pearl development project in the city of St Petersburg , where Baltic Pearl CJSC , a subsidiary of Shanghai Foreign Joint Investment Company , is developing homes for 35,000 people .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Vacon controls a further 5 % of the company via investment fund Power Fund I. EUR 1.0 = USD 1.397\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: 4 ) Complete name of the shareholder : Otto Henrik Bernhard Nyberg 5 ) Further information : The amount of shares now transferred corresponds to 5.68 % of the total number of shares in Aspo Plc. .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). 
If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: It has some 30 offices worldwide and more than 90 pct of its net sales are generated outside Finland .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: The contract value amounts to about EUR11m , the company added .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: The business to be divested generates consolidated net sales of EUR 60 million annually and currently has some 640 employees .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}, {'role': 'assistant', 'content': '1'}, {'role': 'user', 'content': 'Instruct: label the following TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).\nTEXT: Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .\nlabel the TEXT with a single integer: 0 (negative), 1 (neutral), or 2 (positive). If the amount of money is not explicitly increasing or decreasing, respond with 1 (neutral).'}]
You are not running the flash-attention implementation, expect numerical differences.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
# map the integer responses (e.g. "1") back to label text (e.g. "neutral") before scoring
df['responses'] = df['responses'].apply(lambda x: dataset.features["label"].names[int(x)])
df['lm_match'] = df['label_text'] == df['responses']
acc = df['lm_match'].mean()

Interestingly, this decreases the overall accuracy by roughly 9 percentage points (93.94% to 84.54%).

acc
0.8453654188948306

While there are no "other" classifications and the neutral true positive rate increases (1378 vs. 1355), the rates for negative (232 vs. 286) and especially positive (287 vs. 467) sentiment decrease. The model classifies almost half of the positive sentences as neutral.
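
To see exactly where the positive sentences end up, here is a quick sketch cross-tabulating true labels against Prompt T's responses (assuming the df produced above):

# rows are true labels, columns are model responses; most positive errors land in neutral
pd.crosstab(df['label_text'], df['responses'])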

make_cm(df)

Running Inference with the Best Prompt Multiple Times

For phi-2 I ran the best-performing prompt 10 times to see if it consistently performed at a high accuracy. Inference with phi-3.5, given the 20 examples in each prompt, takes much longer:

exclude_idxs = [1, 2, 3, 4, 292, 293, 294, 347, 0, 37, 38, 39, 40, 263, 264, 265, 266, 270, 274, 283]
promptR_ds = ds_subset(dataset, exclude_idxs=exclude_idxs, columns=[0, 1, 2])
promptR_ds
Dataset({
    features: ['sentence', 'label', 'label_text', '__index_level_0__'],
    num_rows: 2244
})
examples = []
for idx in exclude_idxs:
    examples.append((dataset[idx]['sentence'], dataset[idx]['label_text']))

len(examples)
20
promptP = """Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral."""
print(promptP)
Instruct: label the following TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
TEXT: {text}
label the TEXT with a single word: negative, positive, or neutral. If the amount of money is not explicitly increasing or decreasing, respond with neutral.
Show test_gen function
def test_gen(examples):
    responses = []
    
    few_shot_examples = []
    
    for example in examples:
        few_shot_examples.append({"role": "user", "content": promptP.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})
    
    # time a single generation: 20 few-shot turns plus one query sentence
    messages = few_shot_examples + [{"role": "user", "content": promptP.format(text=dataset[0]['sentence'])}]

    generation_args = { 
        "max_new_tokens": 2, 
        "return_full_text": False, 
        "temperature": 0.1, 
        "do_sample": True, 
    } 

    response = pipe(messages, **generation_args) 
    responses.append(response[0]['generated_text'].strip().lower())
    return responses

The model takes about 1.2 seconds to generate a response for a single dataset item, or about 45 minutes for the 2244 items (on a Paperspace Free-A4000). Given the 6-hour session limit, the most I can do is run inference on the dataset 8 times. To be conservative, I'll do it 7 times.
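
To spell out that budget, a quick back-of-the-envelope calculation (assuming the ~1.2 s/item timing measured below and the 6-hour session limit):

# ~1.2 s/item over 2244 items is roughly 45 minutes per pass; ~8 passes fit in 6 hours
minutes_per_pass = 1.2 * 2244 / 60
max_passes = (6 * 60) // minutes_per_pass
minutes_per_pass, max_passes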

%timeit -n 10 test_gen(examples)
1.2 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Show few_shot_responses function
def few_shot_responses(dataset, prompt, examples):
    responses = []
    dataset = dataset.map(add_prompt, fn_kwargs={"prompt": prompt})

    few_shot_examples = []
    
    for example in examples:
        few_shot_examples.append({"role": "user", "content": prompt.format(text=example[0])})
        few_shot_examples.append({"role": "assistant", "content": example[1]})
    
    for row in dataset:
        messages = few_shot_examples + [{"role": "user", "content": row['prompt']}]
        
        generation_args = { 
            "max_new_tokens": 2, 
            "return_full_text": False, 
            "temperature": 0.1, 
            "do_sample": True, 
        } 

        response = pipe(messages, **generation_args) 
        responses.append(response[0]['generated_text'].strip().lower())
        
    # calculate accuracy
    df = dataset.to_pandas()
    df['responses'] = pd.Series(responses)
    df['responses'] = df['responses'].apply(lambda x: x if x in ['negative', 'positive', 'neutral'] else "other")
    df['lm_match'] = df['label_text'] == df['responses']
    acc = df.lm_match.mean()
    return df, acc
accs = []
for _ in range(7):
    df, acc = few_shot_responses(promptR_ds, promptP, examples)
    accs.append(acc)

The accuracy of this prompt is consistently around 93.9%.

pd.Series(accs).describe()
count    7.000000
mean     0.939139
std      0.000992
min      0.937611
25%      0.938503
50%      0.939394
75%      0.939840
max      0.940285
dtype: float64

Final Thoughts

Here is a summary of results including phi-2, phi-3, and the Claude family:

| Model | Prompting Strategy | Overall Accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
| claude-3-opus-20240229 | 0-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
| phi-3.5 | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
| phi-3 | 30-Shot w/System Prompt | 92.79% | 98% (290/297) | 94% (1284/1373) | 88% (499/564) |
| claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
| phi-2 | 6-Shot | 91.94% | 88% (267/302) | 94% (1299/1387) | 90% (510/569) |

Here are the per-prompt results from this notebook (phi-3.5):

| prompt | strategy | accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| A | 0-Shot | 62.32% | 98% (296/303) | 43% (592/1391) | 92% (523/570) |
| B | 0-Shot | 88.60% | 96% (290/303) | 87% (1215/1391) | 88% (501/570) |
| C | 0-Shot | 83.48% | 98% (298/303) | 76% (1062/1391) | 93% (530/570) |
| D | 0-Shot | 68.64% | 99% (300/303) | 51% (713/1391) | 95% (541/570) |
| E | 0-Shot | 88.25% | 96% (290/303) | 87% (1207/1391) | 88% (501/570) |
| F | 3-Shot | 84.65% | 98% (296/302) | 77% (1070/1390) | 96% (548/569) |
| G | 6-Shot | 77.99% | 98% (297/302) | 66% (913/1387) | 97% (551/569) |
| H | 3-Shot | 83.06% | 98% (296/302) | 74% (1028/1390) | 97% (554/569) |
| I | 3-Shot | 51.61% | 100% (302/302) | 32% (447/1390) | 73% (418/569) |
| J | 3-Shot | 85.94% | 98% (296/302) | 80% (1108/1390) | 95% (539/569) |
| K | 0-Shot | 77.96% | 98% (298/303) | 66% (919/1391) | 96% (548/570) |
| L | 0-Shot | 80.57% | 98% (297/303) | 70% (972/1391) | 97% (555/570) |
| M | 0-Shot | 91.30% | 97% (294/303) | 90% (1257/1391) | 91% (516/570) |
| N | 0-Shot w/System Prompt | 88.74% | 97% (295/303) | 85% (1184/1391) | 93% (530/570) |
| O | 0-Shot w/System Prompt | 87.10% | 94% (285/303) | 83% (1156/1391) | 93% (531/570) |
| P | 0-Shot | 92.23% | 94% (285/303) | 94% (1307/1391) | 87% (496/570) |
| Q | 0-Shot | 79.37% | 99% (300/303) | 73% (1009/1391) | 86% (488/570) |
| R | 20-Shot | 93.94% | 96% (286/299) | 98% (1355/1379) | 83% (467/566) |
| S | 28-Shot | 93.25% | 94% (281/298) | 99% (1358/1373) | 79% (446/565) |
| T | 20-Shot | 84.54% | 78% (232/299) | 99.9% (1378/1379) | 51% (287/566) |

I ran inference for phi-3 and phi-3.5 in separate notebooks at the same time, so I have shared final thoughts for both:

  • Few-shot prompting in a chat format is a different experience: The sentence/label pairs have to be presented as a multi-turn conversation. For a large number of examples, this can lead to running out of GPU memory (as it did for 30-Shot prompting with phi-3.5).
  • Few-shot example proportion matters: I used a higher proportion of neutral examples in my 20-shot prompt since the majority of the dataset is made up of neutral sentences. Determining whether the proportion I used is optimal would require further experimentation.
  • 20-Shot phi-3.5 approaches Opus, Sonnet and GPT4 accuracy: I was pleasantly surprised that phi-3.5 reached the 94% mark that was achieved by GPT4 (in the original work by Moritz Laurer), and 3-Opus and 3.5-Sonnet in my previous experiments.
  • The best-performing prompt suffers from a low true positive rate for positive sentiment: Although the 20-Shot phi-3.5 prompt achieved a high true positive rate (TPR) for neutral sentences (98%), it had one of the lowest TPRs for positive sentences (83%). It’s unclear if this is due to the imbalance between positive and neutral examples, since the TPR for negative sentiment is also high (96%) despite having the same number of examples (4) as positive.
  • phi-3 performed differently than phi-3.5: phi-3 performed well with a system prompt while phi-3.5 did not. On the other hand, phi-3.5 performed better with 0-Shot prompting than phi-3 (results not shown here).
  • Future work: My next step is to run inference on this dataset using the Qwen2-1.5B model. After that, I’ll analyze the errors, especially for sentences that a majority of models classified incorrectly. With prompt engineering, there is potentially unlimited future work. Before I finish this project, I’ll try 30-Shot prompts for phi-2 and Haiku to see if they can beat phi-3’s 92.79% overall accuracy (and maybe even phi-3.5’s 93.94% accuracy).

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.