!pip install claudette
from claudette import *
import pandas as pd
Using claudette, a library to interface with the Claude-3.5 Sonnet API.
Vishal Bakshi
August 25, 2024
In this notebook, I’ll use claudette
to generate keywords from fastbook Questionnaire questions, which will then be used for SQLite full-text keyword search.
This notebook is part of a series of blog posts for a project I’m working on called fastbookRAG in which I’m building a hybrid search + LLM pipeline to answer questions from the end-of-chapter Questionnaires in the freely available fastai textbook.
('claude-3-opus-20240229',
'claude-3-5-sonnet-20240620',
'claude-3-haiku-20240307')
I’ll be using the Claude-3.5 Sonnet API.
I have already created keywords for the Chapter 1 Questionnaire questions, so I’ll use a few of them as examples in my prompt.
In: 0; Out: 0; Total: 0
prompt = """I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"
question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="Why is it hard to understand why a deep learning model makes a particular prediction?")
print(formatted_prompt)
I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep learning, state of the art, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural network"
question_text: Why is it hard to understand why a deep learning model makes a particular prediction?
keywords:
“deep learning, prediction, understanding, model, interpretability”
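Claude returns the keywords as a single quoted, comma-separated string, so downstream code needs to parse it into a list. A minimal helper of my own (not part of claudette) that tolerates both straight and curly quotes:

```python
def parse_keywords(response: str) -> list[str]:
    """Turn Claude's quoted, comma-separated reply into a list of keywords."""
    s = response.strip().strip('"').strip('“”')  # drop straight or curly double quotes
    return [kw.strip() for kw in s.split(',')]

parse_keywords('"deep learning, prediction, understanding, model, interpretability"')
# → ['deep learning', 'prediction', 'understanding', 'model', 'interpretability']
```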
I’m always cautious when I use an API: even with cheap per-token costs, things can add up quickly. I’ll first test it out on a single chapter’s questions. A single question required a total of 245 tokens.
In: 0; Out: 0; Total: 0
The full set of questions (and “gold standard” answers, along with some metadata) is available in this gist that I created. Note that this only includes the chapters covered in Part 1 of the fastai course.
# get all questions
url = 'https://gist.githubusercontent.com/vishalbakshi/309fb3abb222d32446b2c4e29db753fe/raw/804510c62151142ea940faad9ce132c8c85585de/fastbookRAG_evals.csv'
df = pd.read_csv(url)
df.head()
| | chapter | question_number | question_text | answer | is_answerable |
|---|---|---|---|---|---|
| 0 | 1 | 1 | "Do you need these for deep learning?\n\n- Lots... | "Lots of math - False\nLots of data - False\n... | 1 |
| 1 | 1 | 2 | "Name five areas where deep learning is now t... | "Any five of the following:\nNatural Language... | 1 |
| 2 | 1 | 3 | "What was the name of the first device that w... | "Mark I perceptron built by Frank Rosenblatt" | 1 |
| 3 | 1 | 4 | "Based on the book of the same name, what are... | "A set of processing units\nA state of activa... | 1 |
| 4 | 1 | 5 | "What were the two theoretical misunderstandi... | "In 1969, Marvin Minsky and Seymour Papert de... | 1 |
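To run one chapter at a time, I can filter the DataFrame on the `chapter` column — illustrated here with a tiny stand-in frame rather than the full gist:

```python
import pandas as pd

# stand-in for the gist's CSV, just to illustrate the per-chapter filter
df = pd.DataFrame({'chapter': [1, 1, 2], 'question_text': ['q1', 'q2', 'q3']})
chapter1 = df[df['chapter'] == 1]['question_text'].tolist()
```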
These tokens add up fast! Generating keywords for all of Chapter 1’s questions used 100k+ tokens. The issue is that each API call resends all of the previous messages in the chat history. Without that history, I would expect the token count to be closer to 8000 (245 tokens for one question times 33 questions).
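A rough back-of-the-envelope model shows why the accumulated history blows up the count, assuming each exchange costs about 245 tokens on its own:

```python
per_question = 245  # tokens for a single, history-free exchange
n = 33              # questions in Chapter 1

no_history = per_question * n  # fresh conversation for every question
# with accumulated history, the i-th call resends all previous exchanges
with_history = sum(per_question * (i + 1) for i in range(n))

print(no_history, with_history)  # 8085 vs 137445 — quadratic growth explains the 100k+ total
```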
['"architecture, neural network, model structure, design"',
'"segmentation, image processing, object detection, pixel-level classification"',
'"y_range, output range, regression, model prediction"',
'"hyperparameters, model configuration, tuning, machine learning"',
'"AI implementation, failure prevention, organizational strategy, best practices"']
I’ll do chapter one again but this time I’ll create a new Chat
object for each question so I don’t rack up the tokens so quickly.
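The fresh-conversation pattern looks roughly like this; `ask_fn` is a stand-in for a call that builds a brand-new claudette `Chat` and sends a single prompt (a sketch with hypothetical names, not claudette’s actual API):

```python
def keywords_per_question(questions, template, ask_fn):
    # ask_fn(prompt) -> model reply; each call uses a fresh conversation,
    # so no prior messages are resent and token usage stays linear
    return [ask_fn(template.format(question_text=q)) for q in questions]
```

In the notebook, `ask_fn` would construct a new `Chat(models[1])` internally so every question starts from an empty history.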
The token usage is much better!
Although it definitely comes up with different keywords when it doesn’t use the accumulated chat history.
['"architecture, definition, structure, design"',
'"segmentation, division, partitioning, classification"',
'"y_range, purpose, usage, application"',
'"hyperparameters, machine learning, model configuration, tuning"',
'"AI failures, organization, best practices, risk mitigation, implementation"']
Something I want to make sure Claude does is prefer single-word keywords separated by commas where possible. Currently, it groups words together, like `"AI failures"` or `"risk mitigation"` (which I want as `"AI, failures"` and `"risk, mitigation"`).
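The prompt tweak below asks for single-word keywords, but a defensive post-processing step could also enforce this by splitting any multi-word keyword the model returns (a helper of my own, not something claudette provides):

```python
def to_single_words(keywords):
    # split multi-word keywords into individual words,
    # preserving order and dropping duplicates
    out = []
    for kw in keywords:
        for w in kw.split():
            if w not in out:
                out.append(w)
    return out

to_single_words(["AI failures", "risk mitigation", "AI"])
# → ['AI', 'failures', 'risk', 'mitigation']
```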
prompt = """I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"
question_text: {question_text}
keywords:"""
formatted_prompt = prompt.format(question_text="""What's the best way to avoid failures when using AI in an organization?""")
print(formatted_prompt)
I am working on a keyword search project and i need to create 3-6 keywords for each `question_text` that I provide you.
Do not generate keywords that stray too far in meaning from the `question_text`. Only respond with the comma-separated list of keywords surrounded by double quotes.
Try to use single-word keywords when possible.
No yapping.
Examples:
question_text: Name five areas where deep learning is now the best in the world
keywords: "deep, learning, best, world"
question_text: Why is it hard to use a traditional computer program to recognize images in a photo?
keywords: "image, recognize, recognition, traditional, computer, program"
question_text: What were the two theoretical misunderstandings that held back the field of neural networks?
keywords: "theoretical, misunderstandings, held, back, field, neural, network"
question_text: What's the best way to avoid failures when using AI in an organization?
keywords:
“AI, failures, avoid, organization, best practices”
The keywords look promising. Their true effectiveness will be tested when used in full-text search. I’ll adjust the prompt as needed based on those results.
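To give a sense of that next step, here is a minimal sketch of the generated keywords driving SQLite full-text search, assuming an FTS5 table of text chunks (the table and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
conn.execute("INSERT INTO chunks VALUES (?)",
             ("Deep learning models are hard to interpret",))

# OR the keywords together so a chunk matching any of them is retrieved
keywords = ['deep', 'learning', 'prediction']
query = " OR ".join(keywords)
rows = conn.execute("SELECT text FROM chunks WHERE chunks MATCH ?",
                    (query,)).fetchall()
```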
['"metric, loss, difference, measurement, evaluation"',
'"pretrained, models, help, benefits"',
'"head, model, neural, network"',
'"CNN, layers, features, early, later"',
'"image, models, photos, usefulness"',
'"architecture, definition, structure, design"',
'"segmentation, division, partition, categorization"',
'"y_range, purpose, usage, necessity"',
'"hyperparameters, machine, learning, parameters, model, configuration"',
'"AI, failures, avoid, organization, best practices"']
So far I have used 48 cents in API credits.
It took about 43 minutes and 20 cents to generate keywords for 220 questions:
I spot-checked the generated keywords and they look okay!
['"deep, learning, requirements, math, data, computers, PhD"',
'"deep, learning, areas, best, world"',
'"artificial, neuron, device, first, principle"',
'"parallel, distributed, processing, PDP, requirements, book"',
'"misunderstandings, neural, networks, theoretical, setbacks"']
Using claudette
to generate keywords for my fastbookRAG eval questions was really straightforward. I’ll use these keywords for full-text search to answer the questions and plan to revisit and refine the prompt based on the quality of the context retrieved.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.