Sentiment Classification with Claude Using claudette

python
LLM
TinySentiment
In this blog post I use Claude-3.5-Sonnet (94.8% accuracy), Claude-3-Opus (94.1%), and Claude-3-Haiku (92.4%) to classify sentiment in the financial_phrasebank dataset.
Author

Vishal Bakshi

Published

August 29, 2024

Background

In this blog post, I demonstrate how I achieved 94.8% accuracy in classifying sentiment in the financial_phrasebank dataset using Claude-3.5-Sonnet, 94.1% accuracy using Claude-3-Opus, and 92.4% accuracy using Claude-3-Haiku, all accessed through the Answer.AI library claudette.

This notebook is part of a series of blog posts for a project I’m working on called TinySentiment, where I’m experimenting with tiny models to improve their ability to classify sentiment in the financial_phrasebank dataset. This notebook establishes a baseline using these larger models.

Here is a summary of results from this notebook. The best-performing approach (3-Shot prompting with Sonnet) cost $2.27, while Zero-Shot Opus cost $3.50:

| Model | Prompt | Overall Accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
| claude-3-opus-20240229 | Zero-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
| claude-3-5-sonnet-20240620 | Zero-Shot | 94% | 98% (297/303) | 92% (1279/1391) | 97% (552/570) |
| claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
| claude-3-haiku-20240307 | Zero-Shot | 89.84% | 96% (292/303) | 85% (1183/1391) | 98% (559/570) |
| claude-3-haiku-20240307 | 6-Shot | 84.99% | 98% (296/303) | 76% (1059/1391) | 99% (564/570) |

Setup

Show imports and setup
!pip install datasets -Uqq
!pip install claudette -qq
from datasets import load_dataset, Dataset
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import pandas as pd
import matplotlib.pyplot as plt
from claudette import *

dataset = load_dataset(
    "financial_phrasebank", "sentences_allagree",
    split="train"  # note that the dataset does not have a default test split
)
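Before running anything, it’s worth checking the class balance, since these counts are the denominators in the result tables above (a quick sketch):

labels = dataset.features['label'].names  # ['negative', 'neutral', 'positive']
pd.Series(dataset['label']).map(dict(enumerate(labels))).value_counts()
# neutral: 1391, positive: 570, negative: 303 (2,264 sentences total)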
Show function to make confusion matrix
def make_cm(df):
    """Create confusion matrix for true vs predicted sentiment classes"""

    cm = confusion_matrix(y_true=df['label_text'], y_pred=df['responses'], labels=['negative', 'neutral', 'positive'])
    disp = ConfusionMatrixDisplay(cm, display_labels=['negative', 'neutral', 'positive'])

    fig, ax = plt.subplots(figsize=(4,4))
    disp.plot(ax=ax,text_kw={'fontsize': 12}, cmap='Blues', colorbar=False);

    # change label font size without changing label text
    ax.xaxis.label.set_fontsize(16)
    ax.yaxis.label.set_fontsize(16)

    # make tick labels larger
    ax.tick_params(axis='y', labelsize=14)
    ax.tick_params(axis='x', labelsize=14)

Performing Sentiment Classification with Claude

3-Opus

Since Opus is the priciest model, I’ll run sentiment classification with it only once.

I’ll start by asking Claude (through the UI on claude.ai) for a recommended prompt for this task:

Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

[Insert sentence here]

prompt = """Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

{sentence}"""
formatted_prompt = prompt.format(sentence=dataset['sentence'][0])
print(formatted_prompt)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
model = models[0]
model
'claude-3-opus-20240229'
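A quick aside: claudette’s models is a plain list of model-name strings, and I index into it throughout this post. The ordering can change between claudette versions, so it’s worth printing the list before trusting a hard-coded index:

for i, m in enumerate(models): print(i, m)  # confirm which index maps to which model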
chat = Chat(model, sp="""You are a helpful and concise assistant.""")
chat.use
In: 0; Out: 0; Total: 0

Testing it out on a single sentence, I get the correct result (a single-word response).

r = chat(formatted_prompt)
r

neutral

  • id: msg_01UGxLBjREK1n6rap8vqokJc
  • content: [{'text': 'neutral', 'type': 'text'}]
  • model: claude-3-opus-20240229
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'input_tokens': 73, 'output_tokens': 4, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}

Before I run all 2,264 rows, I’ll test performance on 25 rows, which should take about 5 minutes:

results = []
tokens = 0

for row in dataset.select(range(25)):
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens
2561
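That works out to roughly 100 tokens per row, so a quick back-of-the-envelope estimate for the full run (a rough sketch; actual usage depends on sentence length and came in lower):

tokens / 25 * len(dataset)  # ~232K tokens estimated; the full Opus run below used ~192K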

Looking at the responses, the model is consistently (correctly) responding with a single-word label.

results
['neutral',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'neutral',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive']
results = []
tokens = 0

for row in dataset:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(191874, 2264)
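One practical note: a sequential loop of 2,264 API calls can trip over rate limits or transient network errors. I didn’t need this here, but a simple retry wrapper with exponential backoff (a sketch; the backoff schedule is arbitrary) would keep a long run alive:

import time

def classify_with_retry(sentence, max_retries=5):
    """Call Claude once for a sentence, retrying with exponential backoff on errors."""
    for attempt in range(max_retries):
        try:
            chat = Chat(model, sp="""You are a helpful and concise assistant.""")
            return chat(prompt.format(sentence=sentence)).content[0].text
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying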
model
'claude-3-opus-20240229'
df = dataset.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label label_text responses match
0 According to Gran , the company has no plans t... 1 neutral neutral True
1 For the last quarter of 2010 , Componenta 's n... 2 positive positive True
2 In the third quarter of 2010 , net sales incre... 2 positive positive True
3 Operating profit rose to EUR 13.1 mn from EUR ... 2 positive positive True
4 Operating profit totalled EUR 21.1 mn , up fro... 2 positive positive True

Claude-3 Opus achieves 94% accuracy, which matches the GPT-4 accuracy in the original blog post by Moritz Laurer that motivated my TinySentiment project.

df['match'].mean()
0.9412544169611308

Opus’s per-class accuracy is highest for negative sentences (297/303), followed by neutral (1333/1391; note that for two of the misclassified neutral sentences, the response was neither positive nor negative) and positive (501/570) sentences.
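These per-class counts can be read off the confusion matrix below, or computed directly from df (a minimal sketch):

# correct predictions and per-class accuracy, grouped by true label
per_class = df.groupby('label_text')['match'].agg(correct='sum', total='count')
per_class['accuracy'] = per_class['correct'] / per_class['total']
per_class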

make_cm(df)

3.5-Sonnet

model = models[1]
model
'claude-3-5-sonnet-20240620'
print(prompt)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

{sentence}
results = []
tokens = 0

for row in dataset:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(194516, 2264)

What I immediately notice about the results is that Sonnet capitalized the first letter of the response in some cases:

results[0]
'Neutral'
df = dataset.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['responses'] = df['responses'].apply(lambda x: x.lower())
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label label_text responses match
0 According to Gran , the company has no plans t... 1 neutral neutral True
1 For the last quarter of 2010 , Componenta 's n... 2 positive positive True
2 In the third quarter of 2010 , net sales incre... 2 positive positive True
3 Operating profit rose to EUR 13.1 mn from EUR ... 2 positive positive True
4 Operating profit totalled EUR 21.1 mn , up fro... 2 positive positive True

Sonnet gets about the same accuracy, 94%, as Opus and GPT-4! Opus cost me $3.50, while Sonnet cost me $0.73.

df['match'].mean()
0.9399293286219081

Sonnet outshines Opus in correctly predicting positive sentences (552 vs. 501), matches it on negative sentences (297 each), and does a bit worse on neutral sentences (1279 vs. 1333).

make_cm(df)

3-Haiku

I expect Haiku to perform worse, but then again I might get surprised!

model = models[2]
model
'claude-3-haiku-20240307'
print(prompt)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

{sentence}
results = []
tokens = 0

for row in dataset:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(193756, 2264)
df = dataset.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['responses'] = df['responses'].apply(lambda x: x.lower())
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label label_text responses match
0 According to Gran , the company has no plans t... 1 neutral neutral True
1 For the last quarter of 2010 , Componenta 's n... 2 positive positive True
2 In the third quarter of 2010 , net sales incre... 2 positive positive True
3 Operating profit rose to EUR 13.1 mn from EUR ... 2 positive positive True
4 Operating profit totalled EUR 21.1 mn , up fro... 2 positive positive True

Haiku doesn’t perform as well, but it’s not too shabby at 90% accuracy.

df['match'].mean()
0.8984098939929329

Haiku actually beats both Opus and Sonnet in the number of correctly predicted positive sentences (559 vs. 552 and 501), and is competitive on negative sentences (292 vs. 297 for both). Where it lags is neutral sentences (1183 vs. 1279 and 1333).

make_cm(df)

Prompt Engineering with Haiku

Since Haiku is the cheapest model, I’ll try different prompts to see if they improve its performance.

Prompt B

I’ll create a few-shot prompt and exclude the three examples used in the prompt from the dataset.

exclude_idxs = [0, 1, 292]
def ds_subset(dataset, exclude_idxs):
    """Return a copy of dataset with the rows at exclude_idxs removed."""
    idxs = [i for i in range(len(dataset)) if i not in exclude_idxs]
    ddf = dataset.to_pandas()
    # note: from_pandas keeps the original row index as __index_level_0__
    new_ds = Dataset.from_pandas(ddf.iloc[idxs])
    return new_ds
promptB_ds = ds_subset(dataset, exclude_idxs)
promptB_ds
Dataset({
    features: ['sentence', 'label', '__index_level_0__'],
    num_rows: 2261
})
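As an aside, the pandas round-trip is what adds the __index_level_0__ column you’ll see in later outputs; datasets can do the same filtering natively with Dataset.select, which avoids it (an equivalent sketch):

keep_idxs = [i for i in range(len(dataset)) if i not in exclude_idxs]
promptB_ds_alt = dataset.select(keep_idxs)  # same rows, no extra index column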
for example in dataset.select([0, 1, 292]):
  print(example)
{'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'label': 1}
{'sentence': "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .", 'label': 2}
{'sentence': 'Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .', 'label': 0}
promptB = """Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

Examples:

sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Respond with ONLY the sentiment label, no other text.
sentiment: positive

sentence: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
Respond with ONLY the sentiment label, no other text.
sentiment: negative

sentence: {sentence}
Respond with ONLY the sentiment label, no other text.
sentiment: """
formatted_prompt = promptB.format(sentence=promptB_ds['sentence'][0])
print(formatted_prompt)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

Examples:

sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Respond with ONLY the sentiment label, no other text.
sentiment: positive

sentence: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
Respond with ONLY the sentiment label, no other text.
sentiment: negative

sentence: In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .
Respond with ONLY the sentiment label, no other text.
sentiment: 
model = models[2]
model
'claude-3-haiku-20240307'
chat = Chat(model, sp="""You are a helpful and concise assistant.""")
chat(formatted_prompt)

positive

  • id: msg_0194ehQYJKJ7u5SJVQRXGFSx
  • content: [{'text': 'positive', 'type': 'text'}]
  • model: claude-3-haiku-20240307
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {'input_tokens': 299, 'output_tokens': 4, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0}
promptB_ds['label'][0]
2

I’ll do a test run through 25 rows which should take less than a minute:

results = []
tokens = 0

for row in promptB_ds.select(range(25)):
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = promptB.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total

The results, at least in terms of formatting, look good.

results
['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive']

The results are also correct (all 25 are positive sentences):

promptB_ds.select(range(25))['label']
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
tokens
7655
results = []
tokens = 0

for row in promptB_ds:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = promptB.format(sentence=row['sentence'])

  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(650542, 2261)
df = promptB_ds.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['responses'] = df['responses'].apply(lambda x: x.lower())
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label __index_level_0__ label_text responses match
0 In the third quarter of 2010 , net sales incre... 2 2 positive positive True
1 Operating profit rose to EUR 13.1 mn from EUR ... 2 3 positive positive True
2 Operating profit totalled EUR 21.1 mn , up fro... 2 4 positive positive True
3 Finnish Talentum reports its operating profit ... 2 5 positive positive True
4 Clothing retail chain Sepp+ñl+ñ 's sales incre... 2 6 positive positive True

With 3-shot prompting, Haiku achieves 92.4% accuracy. Not bad!

df['match'].mean()
0.9239274657231313
make_cm(df)

Sonnet: 3-Shot Prompt

Given the success of 3-shot prompting with Haiku, I’ll spend a couple of dollars and use that prompt for Sonnet:

model = models[1]
model
'claude-3-5-sonnet-20240620'
print(promptB)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

Examples:

sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Respond with ONLY the sentiment label, no other text.
sentiment: positive

sentence: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
Respond with ONLY the sentiment label, no other text.
sentiment: negative

sentence: {sentence}
Respond with ONLY the sentiment label, no other text.
sentiment: 

Testing it out with 10 rows of data:

results = []
tokens = 0

for row in promptB_ds.select(range(10)):
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = promptB.format(sentence=row['sentence'])

  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total

The outputs look good!

results
['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive']

Running inference on the full dataset. I hit the daily token limit partway through, so the cell below resumes with the last 48 rows (the results list already held the first 2,213 responses):

model == 'claude-3-5-sonnet-20240620'
True
results = []
tokens = 0
idxs = [idx for idx in range(len(promptB_ds)) if idx > 2212]
idxs[0], idxs[-1], len(idxs)
(2213, 2260, 48)
for row in promptB_ds.select(idxs):
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = promptB.format(sentence=row['sentence'])

  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(650584, 2261)
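If you want a long run like this to survive interruptions without manual index bookkeeping, one option is to checkpoint responses to disk as you go (a sketch I didn’t run; the checkpoint path is made up):

import json, os

CKPT = 'responses.json'  # hypothetical checkpoint file

# reload any previously saved responses so a rerun resumes where it left off
results = json.load(open(CKPT)) if os.path.exists(CKPT) else []

for row in promptB_ds.select(range(len(results), len(promptB_ds))):
    chat = Chat(model, sp="""You are a helpful and concise assistant.""")
    r = chat(promptB.format(sentence=row['sentence']))
    results.append(r.content[0].text)
    with open(CKPT, 'w') as f:
        json.dump(results, f)  # save after every row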
df = promptB_ds.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['responses'] = df['responses'].apply(lambda x: x.lower())
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label __index_level_0__ label_text responses match
0 In the third quarter of 2010 , net sales incre... 2 2 positive positive True
1 Operating profit rose to EUR 13.1 mn from EUR ... 2 3 positive positive True
2 Operating profit totalled EUR 21.1 mn , up fro... 2 4 positive positive True
3 Finnish Talentum reports its operating profit ... 2 5 positive positive True
4 Clothing retail chain Sepp+ñl+ñ 's sales incre... 2 6 positive positive True

With 3-shot prompting, Sonnet-3.5 beats the Zero-Shot Opus result by 0.65 percentage points.

df['match'].mean()
0.9478107032286599

Compared to zero-shot prompting, the 3-shot prompt for Sonnet produced the same number of correct predictions for negative sentences (297), 23 more for neutral sentences (1302 vs. 1279), and 8 fewer for positive sentences (544 vs. 552).

make_cm(df)

6-Shot Haiku

As a final experiment, I’ll double the number of examples provided in the prompt to 6 and see if that improves Haiku’s performance.

promptC = """Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

Examples:

sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Respond with ONLY the sentiment label, no other text.
sentiment: positive

sentence: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
Respond with ONLY the sentiment label, no other text.
sentiment: negative

sentence: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: {sentence}
Respond with ONLY the sentiment label, no other text.
sentiment: """
formatted_prompt = promptC.format(sentence=dataset[10]['sentence'])
print(formatted_prompt)
Classify the sentiment of this financial news sentence as either negative, neutral, or positive. Respond with ONLY the sentiment label, no other text:

Examples:

sentence: According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .
Respond with ONLY the sentiment label, no other text.
sentiment: positive

sentence: Jan. 6 -- Ford is struggling in the face of slowing truck and SUV sales and a surfeit of up-to-date , gotta-have cars .
Respond with ONLY the sentiment label, no other text.
sentiment: negative

sentence: At the request of Finnish media company Alma Media 's newspapers , research manager Jari Kaivo-oja at the Finland Futures Research Centre at the Turku School of Economics has drawn up a future scenario for Finland 's national economy by using a model developed by the University of Denver .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: STOCK EXCHANGE ANNOUNCEMENT 20 July 2006 1 ( 1 ) BASWARE SHARE SUBSCRIPTIONS WITH WARRANTS AND INCREASE IN SHARE CAPITAL A total of 119 850 shares have been subscribed with BasWare Warrant Program .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: A maximum of 666,104 new shares can further be subscribed for by exercising B options under the 2004 stock option plan .
Respond with ONLY the sentiment label, no other text.
sentiment: neutral

sentence: Its board of directors will propose a dividend of EUR0 .12 per share for 2010 , up from the EUR0 .08 per share paid in 2009 .
Respond with ONLY the sentiment label, no other text.
sentiment: 
exclude_idxs = [0, 1, 292, 37, 38, 39]
promptC_ds = ds_subset(dataset, exclude_idxs=exclude_idxs)
promptC_ds
Dataset({
    features: ['sentence', 'label', '__index_level_0__'],
    num_rows: 2258
})
This run also hit the daily token limit partway through, so the cell below was re-run after an interruption; its output reflects a results list that already held the first 1,974 responses:

results = []
tokens = 0
idxs = [idx for idx in range(len(promptC_ds)) if idx > (len(results) - 1)]
idxs[0], idxs[-1], len(idxs)
(1974, 2257, 284)
model = models[2]
model == 'claude-3-haiku-20240307'
True
for row in promptC_ds.select(idxs):
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = promptC.format(sentence=row['sentence'])
  r = chat(formatted_prompt)
  results.append(r.content[0].text)
  tokens += chat.use.total
tokens, len(results)
(1144141, 2258)
df = promptC_ds.to_pandas()
df['label_text'] = df['label'].apply(lambda x: dataset.features['label'].names[x])
df['responses'] = results
df['responses'] = df['responses'].apply(lambda x: x.lower())
df['match'] = df['label_text'] == df['responses']
df.head()
sentence label __index_level_0__ label_text responses match
0 In the third quarter of 2010 , net sales incre... 2 2 positive positive True
1 Operating profit rose to EUR 13.1 mn from EUR ... 2 3 positive positive True
2 Operating profit totalled EUR 21.1 mn , up fro... 2 4 positive positive True
3 Finnish Talentum reports its operating profit ... 2 5 positive positive True
4 Clothing retail chain Sepp+ñl+ñ 's sales incre... 2 6 positive positive True

6-shot prompting actually worsened Haiku’s performance, from 92.4% (3-shot) down to 85%, while nearly doubling token usage (1.14M vs. 0.65M tokens).

df['match'].mean()
0.849867139061116

It’s interesting that 6-shot Haiku performed better than 3-shot on negative sentences (296 vs. 272) and positive sentences (564 vs. 550) but much worse on neutral sentences (1059 vs. 1267), even though four of the six examples in the prompt were neutral.

make_cm(df)

Final Thoughts

Here is a summary of results from this notebook. The best-performing approach (3-Shot prompting with Sonnet) cost $2.27, cheaper than the $3.50 for Zero-Shot Opus:

| Model | Prompt | Overall Accuracy | negative | neutral | positive |
|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 3-Shot | 94.78% | 98% (297/303) | 94% (1302/1391) | 95% (544/570) |
| claude-3-opus-20240229 | Zero-Shot | 94.13% | 98% (297/303) | 96% (1333/1391) | 88% (501/570) |
| claude-3-5-sonnet-20240620 | Zero-Shot | 94% | 98% (297/303) | 92% (1279/1391) | 97% (552/570) |
| claude-3-haiku-20240307 | 3-Shot | 92.39% | 90% (272/303) | 91% (1267/1391) | 96% (550/570) |
| claude-3-haiku-20240307 | Zero-Shot | 89.84% | 96% (292/303) | 85% (1183/1391) | 98% (559/570) |
| claude-3-haiku-20240307 | 6-Shot | 84.99% | 98% (296/303) | 76% (1059/1391) | 99% (564/570) |
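For reference, here is roughly how those dollar figures fall out of the token counts, using the per-million-token prices published when I ran this (the prices below are assumptions from memory; check current pricing):

# assumed $-per-million-token prices (input, output) at the time of writing
prices = {
    'claude-3-opus-20240229':     (15.00, 75.00),
    'claude-3-5-sonnet-20240620': ( 3.00, 15.00),
    'claude-3-haiku-20240307':    ( 0.25,  1.25),
}

def est_cost(model, input_tokens, output_tokens):
    inp, out = prices[model]
    return (input_tokens * inp + output_tokens * out) / 1e6

# e.g. the Zero-Shot Opus run used ~192K tokens, nearly all input (~4 output tokens/row)
est_cost('claude-3-opus-20240229', 183_000, 9_000)  # ~$3.42, close to the ~$3.50 I observed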

Here are my takeaways:

  • Opus is pricey: I wasn’t planning on experimenting with all three models. In fact, I accidentally selected Opus for the first experiment instead of Sonnet. What gave it away? How much it cost! Someone on Twitter replied to my post about this saying that inference for a multi-turn conversation was costing them $1 per turn. I believe it.
  • 1 million tokens is not that much: I entered this experiment thinking that I’d sit well below the daily 1M token limit. I was wrong! This experiment ended up taking over 3.3M tokens (3,291,956 input / 71,075 output).
  • Haiku is competent: For about 10% of the cost, 3-Shot Haiku was only 2.4 percentage points less accurate than the best-performing 3-Shot Sonnet. I totally understand now why folks talk about using Haiku for simpler tasks. It’s so much cheaper!

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.