!pip install claudette
from claudette import *
import pandas as pd
Using claudette, a library to interface with the Claude-3.5 Sonnet API.
Vishal Bakshi
August 25, 2024
In this notebook, I’ll use claudette
to generate keywords from fastbook Questionnaire questions, which will then be used for SQLite full-text keyword search.
This notebook is part of a series of blog posts for a project I’m working on called fastbookRAG in which I’m building a hybrid search + LLM pipeline to answer questions from the end-of-chapter Questionnaires in the freely available fastai textbook.
('claude-3-opus-20240229',
'claude-3-5-sonnet-20240620',
'claude-3-haiku-20240307')
I’ll be using the Claude-3.5 Sonnet API.
I have already created keywords for the Chapter 1 Questionnaire questions, so I’ll use a few of them as examples in my prompt.
In: 0; Out: 0; Total: 0
prompt = """I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"
question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="Why is it hard to understand why a deep learning model makes a particular prediction?")
print(formatted_prompt)
I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"
question_text: Why is it hard to understand why a deep learning model makes a particular prediction?
keywords:
“deep learning, prediction, understanding, model, interpretability”
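Claude returns the keywords as a single quoted, comma-separated string, so downstream code needs to parse it into a list. A minimal helper of my own (not part of claudette) that tolerates both straight and curly quotes:

```python
def parse_keywords(response: str) -> list[str]:
    """Turn Claude's quoted, comma-separated reply into a list of keywords."""
    s = response.strip().strip('"').strip('“”')  # drop straight or curly double quotes
    return [kw.strip() for kw in s.split(',')]

parse_keywords('"deep learning, prediction, understanding, model, interpretability"')
# → ['deep learning', 'prediction', 'understanding', 'model', 'interpretability']
```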
I’m always cautious when I use an API: even with cheap per-token costs, things can add up quickly. I’ll first test it out on a single chapter’s questions. A single question required a total of 245 tokens.
In: 0; Out: 0; Total: 0
The full set of questions (and “gold standard” answers, along with some metadata) is available in this gist that I created. Note that this only includes the chapters covered in Part 1 of the fastai course.
# get all questions
url = 'https://gist.githubusercontent.com/vishalbakshi/309fb3abb222d32446b2c4e29db753fe/raw/804510c62151142ea940faad9ce132c8c85585de/fastbookRAG_evals.csv'
df = pd.read_csv(url)
df.head()
| | chapter | question_number | question_text | answer | is_answerable |
|---|---|---|---|---|---|
| 0 | 1 | 1 | "Do you need these for deep learning?\n\n- Lots... | "Lots of math - False\nLots of data - False\n... | 1 |
| 1 | 1 | 2 | "Name five areas where deep learning is now t... | "Any five of the following:\nNatural Language... | 1 |
| 2 | 1 | 3 | "What was the name of the first device that w... | "Mark I perceptron built by Frank Rosenblatt" | 1 |
| 3 | 1 | 4 | "Based on the book of the same name, what are... | "A set of processing units\nA state of activa... | 1 |
| 4 | 1 | 5 | "What were the two theoretical misunderstandi... | "In 1969, Marvin Minsky and Seymour Papert de... | 1 |
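To run one chapter at a time, I can filter the DataFrame on the `chapter` column — illustrated here with a tiny stand-in frame rather than the full gist:

```python
import pandas as pd

# stand-in for the gist's CSV, just to illustrate the per-chapter filter
df = pd.DataFrame({'chapter': [1, 1, 2], 'question_text': ['q1', 'q2', 'q3']})
chapter1 = df[df['chapter'] == 1]['question_text'].tolist()
```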
These tokens add up fast! Generating keywords for all of Chapter 1’s questions used 100k+ tokens. The issue is that each API call resends all of the previous messages in the chat history. Without that history, I would expect the token count to be closer to 8000 (245 tokens for one question times 33 questions).
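A rough back-of-the-envelope model shows why the accumulated history blows up the count, assuming each exchange costs about 245 tokens on its own:

```python
per_question = 245  # tokens for a single, history-free exchange
n = 33              # questions in Chapter 1

no_history = per_question * n  # fresh conversation for every question
# with accumulated history, the i-th call resends all previous exchanges
with_history = sum(per_question * (i + 1) for i in range(n))

print(no_history, with_history)  # 8085 vs 137445 — quadratic growth explains the 100k+ total
```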
['"architecture, neural network, model structure, design"',
'"segmentation, image processing, object detection, pixel-level classification"',
'"y_range, output range, regression, model prediction"',
'"hyperparameters, model configuration, tuning, machine learning"',
'"AI implementation, failure prevention, organizational strategy, best practices"']
I’ll do chapter one again but this time I’ll create a new Chat
object for each question so I don’t rack up the tokens so quickly.
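The fresh-conversation pattern looks roughly like this; `ask_fn` is a stand-in for a call that builds a brand-new claudette `Chat` and sends a single prompt (a sketch with hypothetical names, not claudette’s actual API):

```python
def keywords_per_question(questions, template, ask_fn):
    # ask_fn(prompt) -> model reply; each call uses a fresh conversation,
    # so no prior messages are resent and token usage stays linear
    return [ask_fn(template.format(question_text=q)) for q in questions]
```

In the notebook, `ask_fn` would construct a new `Chat(models[1])` internally so every question starts from an empty history.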
The token usage is much better!
Although it definitely comes up with different keywords when it doesn’t use the accumulated chat history.
['"architecture, definition, structure, design"',
'"segmentation, division, partitioning, classification"',
'"y_range, purpose, usage, application"',
'"hyperparameters, machine learning, model configuration, tuning"',
'"AI failures, organization, best practices, risk mitigation, implementation"']
Something I want to make sure Claude does is prefer single-word keywords separated by commas where possible. Currently, it groups words together, like `"AI failures"` or `"risk mitigation"` (which I want as `"AI, failures"` and `"risk, mitigation"`).
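The prompt tweak below asks for single-word keywords, but a defensive post-processing step could also enforce this by splitting any multi-word keyword the model returns (a helper of my own, not something claudette provides):

```python
def to_single_words(keywords):
    # split multi-word keywords into individual words,
    # preserving order and dropping duplicates
    out = []
    for kw in keywords:
        for w in kw.split():
            if w not in out:
                out.append(w)
    return out

to_single_words(["AI failures", "risk mitigation", "AI"])
# → ['AI', 'failures', 'risk', 'mitigation']
```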
prompt = """I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"
question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="""What's the best way to avoid failures when using AI in an organization?""")
print(formatted_prompt)
I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"
question_text: What's the best way to avoid failures when using AI in an organization?
keywords:
“AI, failures, avoid, organization, best practices”
The keywords look promising. Their true effectiveness will be tested when used in full-text search. I’ll adjust the prompt as needed based on those results.
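To give a sense of that next step, here is a minimal sketch of the generated keywords driving SQLite full-text search, assuming an FTS5 table of text chunks (the table and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
conn.execute("INSERT INTO chunks VALUES (?)",
             ("Deep learning models are hard to interpret",))

# OR the keywords together so a chunk matching any of them is retrieved
keywords = ['deep', 'learning', 'prediction']
query = " OR ".join(keywords)
rows = conn.execute("SELECT text FROM chunks WHERE chunks MATCH ?",
                    (query,)).fetchall()
```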
['"metric, loss, difference, measurement, evaluation"',
'"pretrained, models, help, benefits"',
'"head, model, neural, network"',
'"CNN, layers, features, early, later"',
'"image, models, photos, usefulness"',
'"architecture, definition, structure, design"',
'"segmentation, division, partition, categorization"',
'"y_range, purpose, usage, necessity"',
'"hyperparameters, machine, learning, parameters, model, configuration"',
'"AI, failures, avoid, organization, best practices"']
So far I have used 48 cents in API credits.
It took about 43 minutes and 20 cents to generate keywords for 220 questions:
I spot-checked the generated keywords and they look okay!
['"deep, learning, requirements, math, data, computers, PhD"',
'"deep, learning, areas, best, world"',
'"artificial, neuron, device, first, principle"',
'"parallel, distributed, processing, PDP, requirements, book"',
'"misunderstandings, neural, networks, theoretical, setbacks"']
Using claudette
to generate keywords for my fastbookRAG eval questions was really straightforward. I’ll use these keywords for full-text search to answer the questions and plan to revisit and refine the prompt based on the quality of the context retrieved.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.