Using Cosine Similarity to Retrieve Context to Answer fastbook Questionnaire
python
RAG
information retrieval
fastbookRAG
In this blog post I am able to answer 76% of the fastbook Chapter 1 Questionnaire using cosine similarity between the question and the chunked chapter text.
Author
Vishal Bakshi
Published
August 6, 2024
Background
In this notebook I’ll establish a baseline performance of embedding cosine similarity for retrieving relevant context needed to answer the fastbook Chapter 1 Questionnaire.
This is part of a larger project that I’m calling fastbookRAG, where I’ll be building a keyword+semantic search-to-LLM RAG pipeline. The goal of this pipeline will be to accurately answer questions from the fastbook end-of-chapter Questionnaires. fastbook is the freely available fastai textbook.
Here are the results from this notebook. "Retrieved context relevancy" means the percentage of the 33 questions in the Questionnaire I was able to answer given the retrieved notebook chunks. Eventually in my pipeline, I'll replace myself with an LLM that will use this retrieved context to answer the questions. "Top-n" means the selected chunks used for context had the top-n cosine similarity scores. The best performing approach was retrieving the top-5 small chunks (by cosine similarity).
| Top-n | Chunk Size | Retrieved Context Relevancy |
|-------|------------|-----------------------------|
| Top-5 | Small      | 76%                         |
| Top-3 | Small      | 70%                         |
| Top-3 | Large      | 67%                         |
| Top-1 | Small      | 51%                         |
| Top-1 | Large      | 48%                         |
Show imports
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap
import torch
from torch import tensor
import torch.nn.functional as F

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer

emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def print_wrap_text(text, width):
    # wrap each line of text to the given width, preserving whitespace
    wrapper = textwrap.TextWrapper(
        width=width, replace_whitespace=False, drop_whitespace=False)
    print("\n".join(wrapper.fill(line) for line in text.splitlines()))
Chunking fastbook Chapter 1 into Paragraphs
As I did with my full text search demo, I'll start by chunking the fastbook Chapter 1 Jupyter Notebook into paragraphs (each prefixed with the header of the section it's in).
Show the chunking code
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)
    return chunks

def filter_chunks(chunks, exclude_headers):
    filtered_chunks = []
    for chunk in chunks:
        lines = chunk.split('\n')
        # Check if the first line (header) is in the exclude list
        if not any(header in lines[0] for header in exclude_headers):
            filtered_chunks.append(chunk)
    return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
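The chunk list and the embeddings used below aren't created in the cells shown above, so here is a rough sketch of how they are built. The notebook filename and the questions variable are assumptions on my part (the questions live in the question_text column of the dataframe shown later); convert_to_tensor=True is assumed so that F.cosine_similarity receives PyTorch tensors.

chunks = get_chunks('01_intro.ipynb')  # assumed filename for the Chapter 1 notebook
filtered_chunks = filter_chunks(chunks, exclude_headers)

# embed the Questionnaire questions and the chunks with bge-small-en-v1.5
questions = df['question_text'].tolist()  # df holds the 33 Questionnaire questions
q_embs = emb_model.encode(questions, convert_to_tensor=True)
data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)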
Retrieving Context (Small Chunks) Using Cosine Similarity
I’ll use cosine similarity between the first question and the full set of notebook chunks to determine the best match between query and context:
res = F.cosine_similarity(q_embs[0], data_embs, dim=1).sort(descending=True)
res[0][0], res[1][0]
(tensor(0.8431), tensor(4))
# get the chunk with the highest cosine similarity value
filtered_chunks[res[1][0]]
'## Deep Learning Is for Everyone\n\n```asciidoc\n[[myths]]\n.What you don\'t need to do deep learning\n[options="header"]\n|======\n| Myth (don\'t need) | Truth\n| Lots of math | Just high school math is sufficient\n| Lots of data | We\'ve seen record-breaking results with <50 items of data\n| Lots of expensive computers | You can get what you need for state of the art work for free\n|======\n```'
That’s the correct context needed to answer this question! I’ll now loop through each question and store the response. Then, I’ll download the CSV of question/context pairs and score them manually in Excel to calculate the retrieved context relevancy (i.e. the percentage of retrieved contexts that are relevant and sufficient to answer the question).
results = []
for q in q_embs:
    res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
    results.append(filtered_chunks[res[1][0]])
assert len(results) == 33
df['cos_sim_res'] = pd.Series(results)
df.head(3)
|   | chapter | question_number | question_text | answer | keywords | cos_sim_res |
|---|---------|-----------------|---------------|--------|----------|-------------|
| 0 | 1 | 1 | Do you need these for deep learning?\\n\\n- Lo... | Lots of math - False\\nLots of data - False\\n... | math, data, expensive computers, PhD | ## Deep Learning Is for Everyone\n\n```asciido... |
| 1 | 1 | 2 | Name five areas where deep learning is now the... | Any five of the following:\\nNatural Language ... | deep learning, state of the art, best, world | ## Deep Learning Is for Everyone\n\nHere's a l... |
| 2 | 1 | 3 | What was the name of the first device that was... | Mark I perceptron built by Frank Rosenblatt | first, device, artificial, neuron | ## Neural Networks: A Brief History\n\nRosenbl... |
df.to_csv('cos_sim_results.csv', index=False)
Selecting the notebook chunk with the highest cosine similarity score resulted in a 51% retrieved context relevancy. The best I achieved with full text search (using sqlite) was 72% (with the retrieval of the top-3 BM25 ranked larger notebook chunks).
Selecting the Top-3 Cosine Similarity Small Chunks
Instead of selecting the chunk with the highest cosine similarity with the query, I’ll choose the top-3 and see if that allows me to answer more questions.
results = []
for q in q_embs:
    res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
    ctx = ''
    for idx in res[1][:3]:
        ctx += filtered_chunks[idx] + '\n'
    results.append(ctx)
Increasing the number of retrieved chunks to 3 improved the retrieved context relevancy to 70%. I noted that cosine similarity retrieved the right context for some questions where full text search did not, and vice versa, which leads me to believe that a hybrid approach will be optimal.
Increasing Chunk Size
As I did when experimenting with full text search, I’ll now increase the chunk size (to three paragraphs instead of 1) and see if retrieving the top-1 cosine similarity chunk answers more questions (than retrieving the top-1 single paragraph chunk).
larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]
len(larger_chunks)
103
Note that since my chunks’ content has changed, I’ll have to create new embeddings for them:
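Something like the following, reusing the same emb_model (the convert_to_tensor flag is an assumption, matching how the embeddings are used with F.cosine_similarity above), followed by the same top-1 retrieval loop as before, just indexing into larger_chunks:

# one embedding per three-paragraph chunk
data_embs = emb_model.encode(larger_chunks, convert_to_tensor=True)

# top-1 retrieval over the larger chunks, mirroring the small-chunk loop above
results = []
for q in q_embs:
    res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
    results.append(larger_chunks[res[1][0]])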
Interestingly enough, increasing the chunk size actually decreased the performance of retrieval using cosine similarity. I was able to answer only 48% of the questions with the retrieved context.
Selecting the Top-3 Larger Chunks
I’ll see if retrieving the top-3 larger chunks yields a better result.
Show cosine similarity for-loop
results = []
for q in q_embs:
    res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
    ctx = ''
    for idx in res[1][:3]:
        ctx += larger_chunks[idx] + '\n'
    results.append(ctx)
Retrieving the top-3 larger chunks yielded a retrieved context relevancy of 67%, better than a single larger chunk but still behind the top-3 small chunks. The best result came from going back to the small chunks and retrieving the top-5 of them: that allowed me to answer 76% of the questions! That's the best performance I have reached so far (with cosine similarity or full text search). I did not try retrieving the top-5 chunks with full text search, so that's something I'll try in the future.
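For reference, here's a sketch of that top-5 retrieval. small_embs is a name I'm using here to distinguish the re-embedded single-paragraph chunks from the larger-chunk embeddings above; otherwise it's the same encode call and loop as before, just taking the five highest-scoring chunks (this also mirrors the PyTorch loop shown in the next section):

# re-embed the small (single-paragraph) chunks; convert_to_tensor=True is assumed
small_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)

results = []
for q in q_embs:
    res = F.cosine_similarity(q, small_embs, dim=1).sort(descending=True)
    ctx = ''
    # concatenate the five highest-scoring small chunks into one context string
    for idx in res[1][:5]:
        ctx += filtered_chunks[idx] + '\n'
    results.append(ctx)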
Using NumPy to Calculate Cosine Similarity
I’m curious if I can implement cosine similarity using NumPy (as eventually I want to host this project and not have to install PyTorch and dependencies on the server).
First, I’ll recreate my embeddings for questions and context, returning NumPy arrays and not PyTorch tensors:
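A sketch of that step, reusing the questions list and filtered_chunks from earlier; by default SentenceTransformer.encode returns NumPy arrays, so no conversion flag is needed:

q_embs = emb_model.encode(questions)           # NumPy array, shape (33, 384)
data_embs = emb_model.encode(filtered_chunks)  # NumPy array, one row per small chunk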
# thanks Claude
def cosine_similarity_multiple(a, b):
    # Ensure a is a 2D array
    a = np.atleast_2d(a)
    # Compute dot product for each row of b with a
    dot_product = np.dot(a, b.T)
    # Compute magnitudes
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    # Compute cosine similarity
    similarity = dot_product / (a_norm[:, np.newaxis] * b_norm)
    similarity = similarity.flatten()
    return np.sort(similarity)[::-1], np.argsort(similarity)[::-1]
res = cosine_similarity_multiple(q_embs[0], data_embs)
assert res[1][0] == 4
assert res[0][0] - 0.8431 < 1e-4
Show pytorch cosine similarity for-loop
pt_results = []
for q in q_embs:
    res = F.cosine_similarity(tensor(q), tensor(data_embs), dim=1).sort(descending=True)
    ctx = ''
    for idx in res[1][:5]:
        ctx += filtered_chunks[idx] + '\n'
    pt_results.append(ctx)
Show numpy cosine similarity loop
np_results = []
for q in q_embs:
    res = cosine_similarity_multiple(q, data_embs)
    ctx = ''
    for idx in res[1][:5]:
        ctx += filtered_chunks[idx] + '\n'
    np_results.append(ctx)
Applying cosine similarity with NumPy yields the exact same results (retrieved contexts) as PyTorch.
pt_results == np_results
True
Summary of Results
Here are the results from this notebook. "Retrieved context relevancy" means the percentage of the 33 questions in the Questionnaire I was able to answer given the retrieved notebook chunks. Eventually in my pipeline, I'll replace myself with an LLM that will use this retrieved context to answer the questions. "Top-n" means the selected chunks used for context had the top-n cosine similarity scores. The best performing approach was retrieving the top-5 small chunks (by cosine similarity).
| Top-n | Chunk Size | Retrieved Context Relevancy |
|-------|------------|-----------------------------|
| Top-5 | Small      | 76%                         |
| Top-3 | Small      | 70%                         |
| Top-3 | Large      | 67%                         |
| Top-1 | Small      | 51%                         |
| Top-1 | Large      | 48%                         |
Final Thoughts
This exercise was fun. I enjoy the simplicity of implementing cosine similarity to yield decent results. A few things I was thinking about while looking at the data and results in this notebook:
Some questions may not have relevant chunks in the dataset. For example, some of the questions ask the reader to do some activity. These should not be included in the evaluation set.
Cosine similarity didn't perform well on some questions where I expected it would. For example, one of the Questionnaire questions is:
Are image models only useful for photos?
The answer to this question is “No” and there is a section in the chapter explicitly titled “Image Recognizers Can Tackle Non-Image Tasks” wherein the first paragraph reads:
An image recognizer can, as its name suggests, only recognize images. But a lot of things can be represented as images, which means that an image recognizer can learn to complete many tasks.
I would expect the cosine similarity between this question and context to be high. In fact, let’s take a look:
q = emb_model.encode("Are image models only useful for photos?")
c = emb_model.encode("An image recognizer can, as its name suggests, only recognize images. But a lot of things can be represented as images, which means that an image recognizer can learn to complete many tasks.")
q.shape, c.shape
((384,), (384,))
F.cosine_similarity(tensor(q), tensor(c), dim=0)
tensor(0.7046)
However, this cosine similarity is lower than the scores of the five chunks actually retrieved for this question, so this paragraph doesn't make it into the combined top-5 context.
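One way to check (a sketch, assuming data_embs still holds the small-chunk NumPy embeddings from the previous section) is to look at the five highest scores for this question:

scores, idxs = cosine_similarity_multiple(q, data_embs)
scores[:5]  # the five highest cosine similarity scores for this question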
Such behavior is fascinating to me, as it shows there is room to explore better solutions: for example, a hybrid approach combining cosine similarity with full text search, experimenting with different keywords for full text search, or trying different chunking strategies. These are the kinds of things I'll be exploring in future notebooks as I expand beyond Chapter 1 and tackle the rest of the Questionnaires in part 1 of the fastai course.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.