Generating Full Text Search Keywords using claudette

python
RAG
information retrieval
LLM
fastbookRAG
In this blog post, I use Answer.AI’s claudette library to interface with the Claude-3.5 Sonnet API.
Author

Vishal Bakshi

Published

August 25, 2024

Background

In this notebook, I’ll use claudette to generate keywords from fastbook Questionnaire questions, which will then be used for SQLite full-text keyword search.

This notebook is part of a series of blog posts for a project I’m working on called fastbookRAG in which I’m building a hybrid search + LLM pipeline to answer questions from the end-of-chapter Questionnaires in the freely available fastai textbook.

Show imports
!pip install claudette
from claudette import *
import pandas as pd
models # available in claudette
('claude-3-opus-20240229',
 'claude-3-5-sonnet-20240620',
 'claude-3-haiku-20240307')

I’ll be using the Claude-3.5 Sonnet API.

model = models[1]
model
'claude-3-5-sonnet-20240620'

Testing out the Prompt

I have already created keywords for the Chapter 1 Questionnaire questions, so I’ll use a few of them as examples in my prompt.

chat = Chat(model, sp="""You are a helpful and concise assistant.""")
chat.use
In: 0; Out: 0; Total: 0
Show prompt
prompt = """I am working on a keyword search project and I need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.

No yapping.

Examples:

question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"

question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"

question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"

question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="Why is it hard to understand why a deep learning model makes a particular prediction?")
print(formatted_prompt)
I am working on a keyword search project and I need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.

No yapping. 

Examples:

question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"

question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"

question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"

question_text: Why is it hard to understand why a deep learning model makes a particular prediction?
keywords:
r = chat(formatted_prompt)
r

“deep learning, prediction, understanding, model, interpretability”

  • id: msg_01GH67UgWfn8yTmufJt83Dyh
  • content: [{‘text’: ‘“deep learning, prediction, understanding, model, interpretability”’, ‘type’: ‘text’}]
  • model: claude-3-5-sonnet-20240620
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {‘input_tokens’: 229, ‘output_tokens’: 16}
r.content[0].text
'"deep learning, prediction, understanding, model, interpretability"'
chat.use
In: 229; Out: 16; Total: 245

Generating Keywords for One Chapter

I’m always cautious when I use an API, as even with cheap per-token costs, things can add up quickly, so I’ll first test it out on a single chapter’s questions. A single question required a total of 245 tokens.

chat = Chat(model, sp="""You are a helpful and concise assistant.""")
chat.use
In: 0; Out: 0; Total: 0

The full set of questions (and “gold standard” answers, along with some metadata) is available in this gist that I created. Note that this only includes the chapters covered in Part 1 of the fastai course.

# get all questions
url = 'https://gist.githubusercontent.com/vishalbakshi/309fb3abb222d32446b2c4e29db753fe/raw/804510c62151142ea940faad9ce132c8c85585de/fastbookRAG_evals.csv'
df = pd.read_csv(url)
df.head()
chapter question_number question_text answer is_answerable
0 1 1 ""Do you need these for deep learning?nn- Lots... ""Lots of math - False\nLots of data - False\n... 1
1 1 2 ""Name five areas where deep learning is now t... ""Any five of the following:\nNatural Language... 1
2 1 3 ""What was the name of the first device that w... ""Mark I perceptron built by Frank Rosenblatt"" 1
3 1 4 ""Based on the book of the same name, what are... ""A set of processing units\nA state of activa... 1
4 1 5 ""What were the two theoretical misunderstandi... ""In 1969, Marvin Minsky and Seymour Papert de... 1
ch1_df = df.query("chapter == 1")
ch1_df.shape
(33, 5)
keyword_results = []
for row in ch1_df['question_text']:
  # strip the doubled quote characters wrapping each question in the CSV
  formatted_prompt = prompt.format(question_text=row[2:-2])
  r = chat(formatted_prompt)
  keyword_results.append(r.content[0].text)

These tokens add up fast! This run used 100k+ tokens. The issue is that each call re-sends all of the previous messages in the chat history. Without that accumulation, I would expect the token count to be closer to 8000 (245 tokens for one question times 33 questions).

245*33
8085
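My hunch about the chat history can be sanity-checked with a quick estimate. If each call re-sends every previous prompt and response, input tokens grow quadratically with the number of questions. Here is a rough sketch, assuming every exchange costs about the same 229 input / 16 output tokens as the single-question test above:

```python
# Estimate total tokens when the Chat object accumulates history:
# call i re-sends all i previous exchanges on top of the current prompt.
per_prompt_in, per_reply_out = 229, 16

total = 0
for i in range(33):  # 33 Chapter 1 questions
    input_tokens = per_prompt_in + i * (per_prompt_in + per_reply_out)
    total += input_tokens + per_reply_out
print(total)  # well over 100k, consistent with the observed usage
```

This matches the observed 100k+ usage far better than the naive 8085 estimate, which supports the accumulated-history explanation.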
keyword_results[-5:]
['"architecture, neural network, model structure, design"',
 '"segmentation, image processing, object detection, pixel-level classification"',
 '"y_range, output range, regression, model prediction"',
 '"hyperparameters, model configuration, tuning, machine learning"',
 '"AI implementation, failure prevention, organizational strategy, best practices"']

I’ll do chapter one again but this time I’ll create a new Chat object for each question so I don’t rack up the tokens so quickly.

keyword_results2 = []
tokens = 0
for row in ch1_df['question_text']:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(question_text=row[2:-2])
  r = chat(formatted_prompt)
  keyword_results2.append(r.content[0].text)
  tokens += chat.use.total

The token usage is much better!

tokens
8048

It does come up with different keywords, though, when it doesn’t have the accumulated chat history to draw on.

keyword_results2[-5:]
['"architecture, definition, structure, design"',
 '"segmentation, division, partitioning, classification"',
 '"y_range, purpose, usage, application"',
 '"hyperparameters, machine learning, model configuration, tuning"',
 '"AI failures, organization, best practices, risk mitigation, implementation"']
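Before these strings can be used in a search query, they’ll need a little cleanup: the model wraps each answer in literal double quotes. A minimal helper for that (my own sketch, not part of claudette):

```python
def parse_keywords(raw: str) -> list[str]:
    """Strip the surrounding double quotes and split on commas."""
    return [kw.strip() for kw in raw.strip().strip('"').split(",")]

parse_keywords('"AI failures, organization, best practices"')
# ['AI failures', 'organization', 'best practices']
```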

Improving the Prompt

One thing I want to make sure Claude does is prefer single-word keywords where possible. Currently, it groups words together, like "AI failures" or "risk mitigation", which I want split into "AI, failures" and "risk, mitigation".

Show prompt
prompt = """I am working on a keyword search project and I need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.

No yapping.

Examples:

question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"

question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"

question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"

question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="""What's the best way to avoid failures when using AI in an organization?""")
print(formatted_prompt)
I am working on a keyword search project and I need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.

No yapping. 

Examples:

question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"

question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"

question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"

question_text: What's the best way to avoid failures when using AI in an organization?
keywords:
chat = Chat(model, sp="""You are a helpful and concise assistant.""")
chat(formatted_prompt)

“AI, failures, avoid, organization, best practices”

  • id: msg_01FAMozL8CxeSu2ydhx1KdbC
  • content: [{‘text’: ‘“AI, failures, avoid, organization, best practices”’, ‘type’: ‘text’}]
  • model: claude-3-5-sonnet-20240620
  • role: assistant
  • stop_reason: end_turn
  • stop_sequence: None
  • type: message
  • usage: {‘input_tokens’: 236, ‘output_tokens’: 15}
keyword_results3 = []
tokens = 0
for row in ch1_df['question_text']:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(question_text=row[2:-2])
  r = chat(formatted_prompt)
  keyword_results3.append(r.content[0].text)
  tokens += chat.use.total
tokens
8265

The keywords look promising. Their true effectiveness will be tested when used in full-text search. I’ll adjust the prompt as needed based on those results.

keyword_results3[-10:]
['"metric, loss, difference, measurement, evaluation"',
 '"pretrained, models, help, benefits"',
 '"head, model, neural, network"',
 '"CNN, layers, features, early, later"',
 '"image, models, photos, usefulness"',
 '"architecture, definition, structure, design"',
 '"segmentation, division, partition, categorization"',
 '"y_range, purpose, usage, necessity"',
 '"hyperparameters, machine, learning, parameters, model, configuration"',
 '"AI, failures, avoid, organization, best practices"']

So far I have used 48 cents in API credits.

Generating Keywords for All Chapters

It took about 43 minutes and 20 cents to generate keywords for 220 questions:

keywords = []
tokens = 0
for row in df['question_text']:
  chat = Chat(model, sp="""You are a helpful and concise assistant.""")
  formatted_prompt = prompt.format(question_text=row[2:-2])
  r = chat(formatted_prompt)
  keywords.append(r.content[0].text)
  tokens += chat.use.total
tokens
55245
245*220 # estimated tokens used for 220 questions
53900
len(keywords)
220
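The 20-cent figure lines up with a back-of-the-envelope estimate based on the per-question usage shown earlier (~236 input, ~15 output tokens) and Claude 3.5 Sonnet’s pricing at the time, which I’m assuming was $3 per million input tokens and $15 per million output tokens (check Anthropic’s current rates):

```python
# Rough cost estimate: 220 questions at ~236 input / ~15 output tokens each.
# Assumed mid-2024 Claude 3.5 Sonnet pricing: $3/M input, $15/M output.
input_tokens = 236 * 220   # ~52k input tokens
output_tokens = 15 * 220   # ~3.3k output tokens
cost = input_tokens / 1e6 * 3 + output_tokens / 1e6 * 15
print(f"${cost:.2f}")
```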

Spot-checking the generated keywords, they look okay!

keywords[:5]
['"deep, learning, requirements, math, data, computers, PhD"',
 '"deep, learning, areas, best, world"',
 '"artificial, neuron, device, first, principle"',
 '"parallel, distributed, processing, PDP, requirements, book"',
 '"misunderstandings, neural, networks, theoretical, setbacks"']
keywords[-5:]
['"column, pixels, color, dim, plot, represent"',
 '"bad, training, color_dim, why"',
 '"batch, normalization, trainable, parameters, layer"',
 '"batch, normalization, statistics, training, validation"',
 '"batch, normalization, generalization, models"']
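These keywords will feed into SQLite full-text search in the next step. As a preview of what that might look like (the table and schema below are placeholder choices of mine, not the actual fastbookRAG database, and this assumes a SQLite build with FTS5 enabled):

```python
import sqlite3

# Build an in-memory FTS5 index over a couple of toy passages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE passages USING fts5(text)")
conn.executemany(
    "INSERT INTO passages (text) VALUES (?)",
    [("Deep learning is now the best approach in many areas.",),
     ("Batch normalization adds trainable parameters to a layer.",)],
)

# Turn a generated keyword string into an OR query and search.
raw = '"batch, normalization, trainable, parameters, layer"'
keywords = [kw.strip() for kw in raw.strip('"').split(",")]
query = " OR ".join(f'"{kw}"' for kw in keywords)
rows = conn.execute(
    "SELECT text FROM passages WHERE passages MATCH ?", (query,)
).fetchall()
print(rows)
```

Joining the keywords with `OR` keeps recall high; the real pipeline can then rank the matches before handing context to the LLM.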

Final Thoughts

Using claudette to generate keywords for my fastbookRAG eval questions was really straightforward. I’ll use these keywords for full-text search to answer the questions and plan to revisit and refine the prompt based on the quality of the context retrieved.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.