Prompting LLMs Using Different Prompting Styles
Background
In this notebook, I test the performance of different LLMs on a small set of samples from math reasoning datasets using Zero-shot, Zero-shot-Chain-of-Thought and Plan-and-Solve+ (PS+) prompting techniques. The purpose of this exercise is for me to get some practice prompting LLMs.
In a previous blog post, I submitted questions from math reasoning datasets to different model chat interfaces (HuggingChat, Zephyr Chat and ChatGPT). However, when I presented those results to a fastai study group, one of the more experienced members noted that these chat interfaces have built-in system prompts that can interfere with the prompts provided by the user. So, to get a sense of how these models perform without a system prompt, in this exercise I prompt the models directly using Hugging Face transformers pipelines.
Models Tested
I’ll test three models in this exercise:

* mistralai/Mistral-7B-Instruct-v0.1
* HuggingFaceH4/zephyr-7b-alpha
* meta-llama/Llama-2-7b-chat-hf
Prompting Styles
I’ll show an example of the three different prompting styles that I’ll use for this evaluation.
Zero-shot
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: The answer is
Zero-shot-Chain-of-Thought
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let’s think step by step.
PS+
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let’s first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let’s carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
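To make the mechanics concrete, here is a minimal sketch of how these prompts are built: the question stays fixed and only the instruction appended after "A:" changes. The `build_prompt` helper below is hypothetical and just for illustration; the actual prompt construction I use appears later in the notebook.

# Hypothetical helper, for illustration only: the question stays the same
# and only the instruction appended after "A:" changes per prompting style.
def build_prompt(question, instruction):
    return "Q: " + question + "\n\nA: " + instruction

question = ("After eating at the restaurant, Sally, Sam, and Alyssa decided to "
            "divide the bill evenly. If each person paid 45 dollars, what was "
            "the total of the bill?")

print(build_prompt(question, "The answer is"))              # Zero-shot
print(build_prompt(question, "Let's think step by step."))  # Zero-shot-CoT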
Loading and Trying Out the Models
I’ll start by loading each model and testing out a prompt for each manually:
!pip install git+https://github.com/huggingface/transformers.git huggingface_hub

from transformers import pipeline
from huggingface_hub import notebook_login
import pandas as pd, torch

notebook_login()
Mistral-7B-Instruct-v0.1
pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto")

messages = [
    {"role": "user", "content": """
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
"""},
]

pipe.use_default_system_prompt = False
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s>[INST]
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
[/INST] Plan:
1. Extract the relevant variables and their corresponding numerals.
2. Devise a plan to solve the problem.
3. Carry out the plan and calculate intermediate results.
4. Solve the problem step by step.
5. Show the answer.
Solution:
1. Variables and their corresponding numerals:
- Total bill (T)
- Number of people (N)
- Payment per person (P)
2. Plan:
We need to determine the total bill (T) based on the number of people (N) and the payment per person (P). We can use the following equation:
T = P * N
3. Carrying out the plan and calculating intermediate results:
We are given the payment per person (P) as 45 dollars. We need to find the total bill (T) by multiplying the payment per person by the number of people. We are also given the number of people (N) as Sally, Sam, and Alyssa.
T = 45 * N
4. Solving the problem step by step:
We can substitute the given values into the equation
zephyr-7b-alpha
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-alpha",
    torch_dtype=torch.bfloat16,
    device_map="auto")

messages = [
    {"role": "user", "content": """
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
"""},
]

pipe.use_default_system_prompt = False
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
warnings.warn(
<|user|>
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
</s>
<|assistant|>
Let's assume the total bill is x.
Each person paid 45 dollars, so each person's share of the bill is 45.
If the total bill is x, then the sum of each person's share is also x, as each person's share is one-third of the total bill.
Therefore, we have:
x = 45 + 45 + 45 (the sum of each person's share)
x = 135 (the total bill)
So the total bill was $135.
In summary, we identified the variables (total bill and each person's share), calculated intermediate results (total of each person's share), and solved the problem by setting up an equation and calculating the total bill.
Llama-2-7b-chat-hf
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto")

messages = [
    {"role": "user", "content": """
Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.
"""},
]

pipe.use_default_system_prompt = False
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
<s>[INST] Q: After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid 45 dollars, what was the total of the bill?
A: Let's first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let's carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer. [/INST] Great, let's dive into the problem!
After eating at the restaurant, Sally, Sam, and Alyssa decided to divide the bill evenly. If each person paid $45, what was the total of the bill?
To solve this problem, we need to identify the variables involved and determine their values. Let's do this by breaking down the information given in the problem:
Variables:
* Sally
* Sam
* Alyssa
* Total bill
Values:
* Sally paid $45
* Sam paid $45
* Alyssa paid $45
* Total bill = $45 + $45 + $45 = $135
Now, let's put this information together to find the total of the bill:
Total bill = $135
So, the total of the bill was $135.
Let me know if you have any questions or if you'd like me to explain the steps in more detail!
Testing All Prompts
I have entered 20 questions (from the Appendix of the paper Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, Lei Wang et al.) into a CSV file:
df = pd.read_csv("/kaggle/input/questions5/questions.csv")
df.head()
| | dataset | question |
|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... |
To prepare the questions for the prompts, I’ll prepend each question with `"Q: "` and append `"\n\nA: <instruction>"`, where `<instruction>` is the instruction for the prompting style I’m constructing. For example, for Zero-shot prompting:
"Q: " + df.question[1] + "\n\nA: The answer is"
'Q: If 27 bottles of soda cost A cents, how much will B bottles cost in dollars? Answer Choices: (A) AB/2700 (B) 27/AB (C) AB/270 (D) 2700/AB (E) 100AB/27\n\nA: The answer is'
I’ll use the same parameters for these models as the official chat interfaces published by the model authors.
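For reference, here is a sketch of the generation settings I pass to the pipelines in the per-model runs below; grouping them into a dict here is just for illustration, and `max_new_tokens` and `temperature` vary slightly by model as noted in the comment.

# Sketch of the generation settings used in the per-model loops below.
# max_new_tokens and temperature differ per model (2048/0.1 for Mistral,
# 1024/0.7 for Zephyr, 1024/0.1 for Llama-2); the rest are shared.
gen_kwargs = dict(
    do_sample=True,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
    return_full_text=False,
)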
I’ll also create a copy of the `DataFrame` with questions in order to populate responses for each model.
model_names = ['Mistral-7B-Instruct-v0.1', 'zephyr-7b-alpha', 'Llama-2-7b-chat-hf']
prompting_styles = ['Zero-shot', 'Zero-shot-CoT', 'PS+']
dfs = []

for name in model_names:
    model_dfs = []
    for style in prompting_styles:
        # create a DataFrame for each prompting style
        temp_df = df.copy()
        temp_df['model'] = name
        temp_df['prompting_style'] = style
        temp_df['response'] = None
        model_dfs.append(temp_df)
    # Create a DataFrame for each model and store in a list
    model_df = pd.concat(model_dfs)
    dfs.append(model_df)
I now have three `DataFrame`s, one for each of my models, and each with 20 rows of questions for each prompting style.
dfs[0].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | Mistral-7B-Instruct-v0.1 | Zero-shot | None |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | Mistral-7B-Instruct-v0.1 | Zero-shot | None |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | Mistral-7B-Instruct-v0.1 | Zero-shot | None |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | Mistral-7B-Instruct-v0.1 | Zero-shot | None |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | Mistral-7B-Instruct-v0.1 | Zero-shot | None |
dfs[1].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | zephyr-7b-alpha | Zero-shot | None |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | zephyr-7b-alpha | Zero-shot | None |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | zephyr-7b-alpha | Zero-shot | None |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | zephyr-7b-alpha | Zero-shot | None |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | zephyr-7b-alpha | Zero-shot | None |
dfs[2].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | Llama-2-7b-chat-hf | Zero-shot | None |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | Llama-2-7b-chat-hf | Zero-shot | None |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | Llama-2-7b-chat-hf | Zero-shot | None |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | Llama-2-7b-chat-hf | Zero-shot | None |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | Llama-2-7b-chat-hf | Zero-shot | None |
There should be three instances of each question, one for each prompting style:
dfs[0].groupby(dfs[0].question).count()['dataset'].unique(),\
dfs[1].groupby(dfs[1].question).count()['dataset'].unique(),\
dfs[2].groupby(dfs[2].question).count()['dataset'].unique()
(array([3]), array([3]), array([3]))
Each model’s `DataFrame` should have 20 rows for each of the three prompting styles:
dfs[0].groupby(dfs[0].prompting_style).count()['dataset'].unique(),\
dfs[1].groupby(dfs[1].prompting_style).count()['dataset'].unique(),\
dfs[2].groupby(dfs[2].prompting_style).count()['dataset'].unique()
(array([20]), array([20]), array([20]))
And finally, I’ll check that each `DataFrame` has `Zero-shot`, `Zero-shot-CoT` and `PS+` as its prompting styles:
dfs[0].prompting_style.unique(),\
dfs[1].prompting_style.unique(),\
dfs[2].prompting_style.unique()
(array(['Zero-shot', 'Zero-shot-CoT', 'PS+'], dtype=object),
array(['Zero-shot', 'Zero-shot-CoT', 'PS+'], dtype=object),
array(['Zero-shot', 'Zero-shot-CoT', 'PS+'], dtype=object))
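The same three checks could also be written as asserts; here is an equivalent sketch:

# Sketch: assert-based versions of the three checks above.
for model_df in dfs:
    # each question appears once per prompting style
    assert (model_df.groupby('question').size() == 3).all()
    # 20 questions per prompting style
    assert (model_df.groupby('prompting_style').size() == 20).all()
    # all three prompting styles are present
    assert set(model_df.prompting_style) == {'Zero-shot', 'Zero-shot-CoT', 'PS+'}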
Finally, I’ll create a dictionary to look up the instruction for a given prompting style, which gets appended to the question in the prompt:
instructions = {
    'Zero-shot': 'The answer is',
    'Zero-shot-CoT': "Let's think step by step.",
    'PS+': "Let’s first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let’s carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer."
}
instructions['PS+']
'Let’s first understand the problem, extract relevant variables and their corresponding numerals, and devise a plan. Then, let’s carry out the plan, calculate intermediate results (pay attention to calculation and common sense), solve the problem step by step, and show the answer.'
With my `DataFrame`s of questions and prompting styles ready, I can prompt the models one at a time. The Kaggle GPUs ran out of memory when I tried to load more than one of these models, so I have to create a new session for each model run. I export the model responses to a CSV before stopping the session so I can combine all models’ responses into one `DataFrame` at the end to analyze the results.
Mistral-7B-Instruct-v0.1
pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto")

for i in range(len(dfs[0])):
    messages = [
        {"role": "user", "content": "Q: " + dfs[0].question.iloc[i] + "\n\nA: " + instructions[dfs[0].prompting_style.iloc[i]]}
    ]

    pipe.use_default_system_prompt = False

    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)

    outputs = pipe(
        prompt,
        do_sample=True,
        temperature=0.1,
        max_new_tokens=2048,
        top_p=0.95,
        repetition_penalty=1.2,
        top_k=50,
        return_full_text=False)

    dfs[0].response.iloc[i] = outputs[0]["generated_text"]
dfs[0].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | Mistral-7B-Instruct-v0.1 | Zero-shot | To solve this problem, we need to use the for... |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | Mistral-7B-Instruct-v0.1 | Zero-shot | To convert the number of bottles from one uni... |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | Mistral-7B-Instruct-v0.1 | Zero-shot | To find out how many total meters James runs ... |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | Mistral-7B-Instruct-v0.1 | Zero-shot | Let's break down this problem step by step:\n... |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | Mistral-7B-Instruct-v0.1 | Zero-shot | 33 extra apples.\n\nHere's the reasoning behi... |
dfs[0].to_csv('mistral_responses.csv')
zephyr-7b-alpha
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-alpha",
    torch_dtype=torch.bfloat16,
    device_map="auto")

for i in range(len(dfs[1])):
    messages = [
        {"role": "user", "content": "Q: " + dfs[1].question.iloc[i] + "\n\nA: " + instructions[dfs[1].prompting_style.iloc[i]]}
    ]

    pipe.use_default_system_prompt = False

    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)

    outputs = pipe(
        prompt,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=1024,
        top_p=0.95,
        repetition_penalty=1.2,
        top_k=50,
        return_full_text=False)

    dfs[1].response.iloc[i] = outputs[0]["generated_text"]
dfs[1].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | zephyr-7b-alpha | Zero-shot | The total earnings for the first seven days ar... |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | zephyr-7b-alpha | Zero-shot | (A) AB/2700 \n\nExplanation: Let's say the pri... |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | zephyr-7b-alpha | Zero-shot | James runs 3 sprints of 60 meters each, three ... |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | zephyr-7b-alpha | Zero-shot | Let's calculate this using math:\n\nFirstly, l... |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | zephyr-7b-alpha | Zero-shot | Let's calculate the total number of apples (re... |
dfs[1].to_csv('zephyr_responses.csv')
Llama-2-7b-chat-hf
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto")

for i in range(len(dfs[2])):
    messages = [
        {"role": "user", "content": "Q: " + dfs[2].question.iloc[i] + "\n\nA: " + instructions[dfs[2].prompting_style.iloc[i]]}
    ]

    pipe.use_default_system_prompt = False

    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)

    outputs = pipe(
        prompt,
        do_sample=True,
        temperature=0.1,
        max_new_tokens=1024,
        top_p=0.95,
        repetition_penalty=1.2,
        top_k=50,
        return_full_text=False)

    dfs[2].response.iloc[i] = outputs[0]["generated_text"]
dfs[2].head()
| | dataset | question | model | prompting_style | response |
|---|---|---|---|---|---|
| 0 | AQuA | The average wages of a worker during a fortnig... | Llama-2-7b-chat-hf | Zero-shot | To find the wage of the worker on the 8th da... |
| 1 | AQuA | If 27 bottles of soda cost A cents, how much w... | Llama-2-7b-chat-hf | Zero-shot | To find the cost of B bottles in dollars, we... |
| 2 | GSM8K | James decides to run 3 sprints 3 times a week.... | Llama-2-7b-chat-hf | Zero-shot | To find out how many total meters James runs... |
| 3 | GSM8K | In a dance class of 20 students, 20% enrolled ... | Llama-2-7b-chat-hf | Zero-shot | To find out what percentage of the entire gr... |
| 4 | MultiArith | The school cafeteria ordered 42 red apples and... | Llama-2-7b-chat-hf | Zero-shot | Great question! To find out how many extra a... |
dfs[2].to_csv('llama2_responses.csv')
Grading the Responses
I really liked the way Jonathan Whitaker (on the Data Science Castnet YouTube channel) used gradio as a tool to check responses from an LLM, so I’m using the approach from his notebook below.
For each row of data, the gradio app displays the question, the response and two buttons (“Correct” and “Incorrect”). If the answer is correct, I’ll press “Correct” and the `is_correct` column value will be set to `1`. If it’s not, I’ll press “Incorrect” and `is_correct` will stay `0`.
# install gradio (quick fix for install error on colab)
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install -q gradio
I’ll load the three CSVs (one per model, each with responses for the 3 prompting styles across 20 questions) into a single `DataFrame`, and sort it by the `question` column so it’s easier for me to grade the responses.
import gradio as gr, pandas as pd
from functools import partial
dfs = []
files = ['llama2_responses.csv', 'mistral_responses.csv', 'zephyr_responses.csv']

for file in files:
    df = pd.read_csv(file)
    df = df.drop('Unnamed: 0', axis=1)
    df['is_correct'] = 0
    dfs.append(df)

df = pd.concat(dfs, axis=0, ignore_index=True)

# sort by question so it's easier to grade
df = df.sort_values(by=['question'])
df = df.reset_index(drop=True)
df.head()
| | dataset | question | model | prompting_style | response | is_correct |
|---|---|---|---|---|---|---|
| 0 | Coin Flip | 'A coin is heads up. Lorena does not flip the ... | zephyr-7b-alpha | PS+ | Problem understanding:\n- The question asks wh... | 0 |
| 1 | Coin Flip | 'A coin is heads up. Lorena does not flip the ... | zephyr-7b-alpha | Zero-shot | Yes, according to the given scenario, since no... | 0 |
| 2 | Coin Flip | 'A coin is heads up. Lorena does not flip the ... | Llama-2-7b-chat-hf | Zero-shot | No, the coin is no longer heads up. Since no... | 0 |
| 3 | Coin Flip | 'A coin is heads up. Lorena does not flip the ... | Mistral-7B-Instruct-v0.1 | Zero-shot | Yes, the coin is still heads up.\n\nHere's th... | 0 |
| 4 | Coin Flip | 'A coin is heads up. Lorena does not flip the ... | Llama-2-7b-chat-hf | PS+ | Great! Let's tackle this problem together. H... | 0 |
The gradio app consists of a `Textbox` which displays the text of the question and response. I’ll modify the existing `get_text` function, which takes an index and returns a concatenated string.

The `label` function is called when the user clicks either of the buttons. If the clicked button has a value of `"Correct"`, then the `choice` parameter will be `"correct"` and the `is_correct` column will be set to `1`. The `label` function returns the next index and string to display.
def get_text(idx):
    """ Combine the question and answer into a single string """
    global df
    text = 'Q: ' + df.iloc[idx]['question'] + '\n\nA: ' + df.iloc[idx]['response']
    return text

def label(idx, choice='correct'):
    """ Set `is_correct` column to `1` if response is correct """
    if choice == 'correct':
        df.loc[idx, 'is_correct'] = 1
    return idx+1, get_text(idx+1)  # Move on to next one
To keep the gradio app from timing out, I call `queue` on the `demo` object before I `launch` it.
with gr.Blocks() as demo:
    starting_idx = 0
    with gr.Column():
        idx = gr.Slider(value=starting_idx, label="idx")  # Progress bar, borrowed the idea from https://www.kaggle.com/code/nbroad/create-science-wikipedia-dataset
        text = gr.Textbox(value=get_text(0), label="text")
        with gr.Row():
            correct_btn = gr.Button(value="Correct")
            incorrect_btn = gr.Button(value="Incorrect")

    correct_btn.click(fn=partial(label, choice='correct'), inputs=[idx], outputs=[idx, text])
    incorrect_btn.click(fn=partial(label, choice='incorrect'), inputs=[idx], outputs=[idx, text])

#demo.launch(debug=True, show_error=True)
demo.queue().launch(share=True, show_error=True)
Here is an example of what the gradio app looks like:
df.to_csv('graded_responses.csv', index=False)
Results
With all of the model responses graded, I can now calculate the percentage of correct responses from each model for each dataset and prompting style.
df.groupby(['model', 'prompting_style'])['is_correct'].mean() * 100
model prompting_style
Llama-2-7b-chat-hf PS+ 25.0
Zero-shot 45.0
Zero-shot-CoT 30.0
Mistral-7B-Instruct-v0.1 PS+ 60.0
Zero-shot 60.0
Zero-shot-CoT 45.0
zephyr-7b-alpha PS+ 30.0
Zero-shot 45.0
Zero-shot-CoT 45.0
Name: is_correct, dtype: float64
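The same numbers are a bit easier to compare side by side if I pivot the prompting styles into columns; this is just an alternative view of the output above:

# Alternative view: models as rows, prompting styles as columns.
(df.groupby(['model', 'prompting_style'])['is_correct'].mean() * 100).unstack()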
Overall, the best-performing model and prompting style combinations (60% correct) were Mistral-7B-Instruct-v0.1 with the PS+ and Zero-shot prompting styles. Mistral was also the only model where PS+ performed at least as well as Zero-shot and Zero-shot-CoT.
df.groupby(['model'])['is_correct'].mean() * 100
model
Llama-2-7b-chat-hf 33.333333
Mistral-7B-Instruct-v0.1 55.000000
zephyr-7b-alpha 40.000000
Name: is_correct, dtype: float64
Overall, the best-performing model across all prompting styles was Mistral (55%), followed by Zephyr (40%), with Llama-2 coming in last at 33.3%.
df.groupby(['dataset'])['is_correct'].mean() * 100
dataset
AQuA 5.555556
AddSub 77.777778
CSQA 27.777778
Coin Flip 77.777778
GSM8K 50.000000
Last Letters 0.000000
MultiArith 38.888889
SVAMP 77.777778
SingleEq 72.222222
StrategyQA 0.000000
Name: is_correct, dtype: float64
Across all models and prompting styles, the highest performing datasets (78% correct) were AddSub, Coin Flip and SVAMP. The lowest performing datasets (0% correct) were Last Letters and StrategyQA.
Final Thoughts
I want to reiterate that the purpose of this exercise was for me to get some practice prompting LLMs. I also got some experience setting up a gradio app to help me evaluate the models’ responses.
Given the small sample size for each model (20 questions and responses for each prompting style), I don’t think these results can be used to conclude anything about the performance of these prompting styles and models. Yes, Mistral and Zephyr outperformed Llama-2 on these 20 questions across these 3 particular prompting styles, but testing the models on another thousand questions might yield different results.
I hope you enjoyed this blog post!