Using Cosine Similarity to Retrieve Context to Answer fastbook Questionnaire

Tags: python, RAG, information retrieval, fastbookRAG
In this blog post I answer 76% of the questions in the fastbook Chapter 1 Questionnaire using cosine similarity between each question and the chunked chapter text.
Author: Vishal Bakshi

Published: August 6, 2024

Background

In this notebook I’ll establish a baseline performance of embedding cosine similarity for retrieving relevant context needed to answer the fastbook Chapter 1 Questionnaire.

This is part of a larger project that I’m calling fastbookRAG, where I’ll be building a keyword+semantic search-to-LLM RAG pipeline. The goal of this pipeline will be to accurately answer questions from the fastbook end-of-chapter Questionnaires. fastbook is the freely available fastai textbook.

Here are the results from this notebook. “Retrieved context relevancy” is the percentage of the 33 questions in the Questionnaire I was able to answer given the retrieved notebook chunks. Eventually, I’ll replace myself in the pipeline with an LLM that uses the retrieved context to answer the questions. “Top-n” means the chunks selected as context had the top-n cosine similarity scores. The best-performing approach was retrieving the top-5 small chunks.

| Top-n | Chunk Size | Retrieved Context Relevancy |
|-------|------------|-----------------------------|
| Top-5 | Small      | 76%                         |
| Top-3 | Small      | 70%                         |
| Top-3 | Large      | 67%                         |
| Top-1 | Small      | 51%                         |
| Top-1 | Large      | 48%                         |
Show imports
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap
import torch
from torch import tensor
import torch.nn.functional as F

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def print_wrap_text(text, width):
  # wrap each line at `width` characters, preserving existing whitespace
  wrapper = textwrap.TextWrapper(
    width=width,
    replace_whitespace=False,
    drop_whitespace=False)
  print("\n".join(wrapper.fill(line) for line in text.splitlines()))

Chunking fastbook Chapter 1 into Paragraphs

As I did with my full text search demo, I’ll start by chunking the fastbook Chapter 1 Jupyter Notebook into paragraphs (each prefixed with the header of the section it belongs to).

Show the chunking code
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)

    return chunks

def filter_chunks(chunks, exclude_headers):
  filtered_chunks = []
  for chunk in chunks:
      lines = chunk.split('\n')
      # Check if the first line (header) is in the exclude list
      if not any(header in lines[0] for header in exclude_headers):
          filtered_chunks.append(chunk)
  return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
notebook_path = '01_intro.ipynb'
chunks = get_chunks(notebook_path)
assert len(chunks) == 315
filtered_chunks = filter_chunks(chunks, exclude_headers)
assert len(filtered_chunks) == 307
print(filtered_chunks[-3])
### Use Judgment in Defining Test Sets

Now that you have gotten a taste of how to build a model, you can decide what you want to dig into next.

Embedding the Notebook Chunks and Questionnaire Questions

I’ll embed the notebook chunks using the bge-small-en-v1.5 model.

data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)
data_embs.shape
torch.Size([307, 384])

I’ll grab the questions from a gist I created:

df = pd.read_csv("https://gist.githubusercontent.com/vishalbakshi/309fb3abb222d32446b2c4e29db753fe/raw/bc6cd2ab15b64a92ec23796c61702f413fdd2b40/fastbookRAG_evals.csv")
df.head()
chapter question_number question_text answer keywords
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron
3 1 4 Based on the book of the same name, what are t... A set of processing units\\nA state of activat... parallel, distributed, processing, requirement...
4 1 5 What were the two theoretical misunderstanding... In 1969, Marvin Minsky and Seymour Papert demo... theoretical, misunderstandings, held, back, fi...

And embed them:

q_embs = emb_model.encode(df['question_text'], convert_to_tensor=True)
q_embs.shape
torch.Size([33, 384])

Retrieving Context (Small Chunks) Using Cosine Similarity

I’ll use cosine similarity between the first question and the full set of notebook chunks to determine the best match between query and context:

res = F.cosine_similarity(q_embs[0], data_embs, dim=1).sort(descending=True)
res[0][0], res[1][0]
(tensor(0.8431), tensor(4))
# get the chunk with the highest cosine similarity value
filtered_chunks[res[1][0]]
'## Deep Learning Is for Everyone\n\n```asciidoc\n[[myths]]\n.What you don\'t need to do deep learning\n[options="header"]\n|======\n| Myth (don\'t need) | Truth\n| Lots of math | Just high school math is sufficient\n| Lots of data | We\'ve seen record-breaking results with <50 items of data\n| Lots of expensive computers | You can get what you need for state of the art work for free\n|======\n```'

That’s the correct context needed to answer this question! I’ll now loop through each question and store the response. Then, I’ll download the CSV of question/context pairs and score them manually in Excel to calculate the retrieved context relevancy (i.e. the percentage of retrieved contexts that are relevant and sufficient to answer the question).

results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  results.append(filtered_chunks[res[1][0]])
assert len(results) == 33
df['cos_sim_res'] = pd.Series(results)
df.head(3)
chapter question_number question_text answer keywords cos_sim_res
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('cos_sim_results.csv', index=False)

Selecting the notebook chunk with the highest cosine similarity score resulted in a 51% retrieved context relevancy. The best I achieved with full text search (using SQLite) was 72% (by retrieving the top-3 BM25-ranked larger notebook chunks).
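For reference, once the manual scores are in, the relevancy percentage is a one-liner. A minimal sketch, assuming a hypothetical score column (1 = relevant and sufficient, 0 = not) added during scoring in Excel:

scored = pd.read_csv('cos_sim_results_scored.csv')  # hypothetical manually-scored file
print(f"Retrieved context relevancy: {scored['score'].mean():.0%}")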

Selecting the Top-3 Cosine Similarity Small Chunks

Instead of selecting the chunk with the highest cosine similarity with the query, I’ll choose the top-3 and see if that allows me to answer more questions.

results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  ctx = ''
  for idx in res[1][:3]:
    ctx += filtered_chunks[idx] + '\n'
  results.append(ctx)
df['cos_sim_res'] = pd.Series(results)
df.head(3)
chapter question_number question_text answer keywords cos_sim_res
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top-3_cos_sim_results.csv', index=False)

Increasing the number of retrieved chunks to 3 resulted in a retrieved context relevancy of 70%. I noted that cosine similarity retrieved the right context for some questions where full text search did not, and vice versa, which leads me to believe that a hybrid approach will be optimal.

Increasing Chunk Size

As I did when experimenting with full text search, I’ll now increase the chunk size (three paragraphs instead of one) and see if retrieving the top-1 cosine similarity chunk answers more questions than retrieving the top-1 single-paragraph chunk.

larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]
len(larger_chunks)
103

Note that since my chunks’ content has changed, I’ll have to create new embeddings for them:

data_embs = emb_model.encode(larger_chunks, convert_to_tensor=True)
data_embs.shape
torch.Size([103, 384])
Show cosine similarity for-loop
results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  results.append(larger_chunks[res[1][0]])
assert len(results) == 33
df['cos_sim_res'] = pd.Series(results)
df.head(3)
chapter question_number question_text answer keywords cos_sim_res
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## How to Learn Deep Learning\n\n> : A PhD is ...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nDeep learn...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('larger_cos_sim_results.csv', index=False)

Interestingly enough, increasing the chunk size actually decreased retrieval performance with cosine similarity: I was able to answer only 48% of the questions with the retrieved context.

Selecting the Top-3 Larger Chunks

I’ll see if retrieving the top-3 larger chunks yields a better result.

Show cosine similarity for-loop
results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  ctx = ''
  for idx in res[1][:3]:
    ctx += larger_chunks[idx] + '\n'
  results.append(ctx)
df['cos_sim_res'] = pd.Series(results)
df.head(3)
chapter question_number question_text answer keywords cos_sim_res
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## How to Learn Deep Learning\n\n> : A PhD is ...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nDeep learn...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top-3-large_cos_sim_results.csv', index=False)

Retrieving the top-3 large chunks helped me answer 67% of the questions, one question fewer than with the top-3 small chunks.

Selecting the Top-5 Cosine Similarity Small Chunks

I am getting better results using smaller chunks so I’ll increase the number of small chunks retrieved to 5. I’ll have to re-embed the smaller chunks:

data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)
data_embs.shape
torch.Size([307, 384])
Show cosine similarity for-loop
results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  ctx = ''
  for idx in res[1][:5]:
    ctx += filtered_chunks[idx] + '\n'
  results.append(ctx)
df['cos_sim_res'] = pd.Series(results)
df.head(3)
chapter question_number question_text answer keywords cos_sim_res
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top-5_cos_sim_results.csv', index=False)

The chunks retrieved with this approach allowed me to answer 76% of the questions! That’s the best performance I have reached so far (with cosine similarity or full text search). I did not try retrieving top-5 small chunks for full text search, so that’s something I’ll try in the future.

Using NumPy to Calculate Cosine Similarity

I’m curious whether I can implement cosine similarity in NumPy (eventually I want to host this project without having to install PyTorch and its dependencies on the server).
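As a refresher, cosine similarity is the dot product of two vectors normalized by their magnitudes:

$$\mathrm{cos\_sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$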

First, I’ll recreate my embeddings for questions and context, returning NumPy arrays and not PyTorch tensors:

data_embs = emb_model.encode(filtered_chunks)
data_embs.shape
(307, 384)
q_embs = emb_model.encode(df['question_text'])
q_embs.shape
(33, 384)
type(data_embs), type(q_embs)
(numpy.ndarray, numpy.ndarray)
# thanks Claude
def cosine_similarity_multiple(a, b):
    # Ensure a is a 2D array
    a = np.atleast_2d(a)

    # Compute dot product for each row of b with a
    dot_product = np.dot(a, b.T)

    # Compute magnitudes
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)

    # Compute cosine similarity
    similarity = dot_product / (a_norm[:, np.newaxis] * b_norm)
    similarity = similarity.flatten()

    return np.sort(similarity)[::-1], np.argsort(similarity)[::-1]
res = cosine_similarity_multiple(q_embs[0], data_embs)
assert res[1][0] == 4
assert abs(res[0][0] - 0.8431) < 1e-4
Show pytorch cosine similarity for-loop
pt_results = []

for q in q_embs:
  res = F.cosine_similarity(tensor(q), tensor(data_embs), dim=1).sort(descending=True)
  ctx = ''
  for idx in res[1][:5]:
    ctx += filtered_chunks[idx] + '\n'
  pt_results.append(ctx)
Show numpy cosine similarity loop
np_results = []

for q in q_embs:
  res = cosine_similarity_multiple(q, data_embs)
  ctx = ''
  for idx in res[1][:5]:
    ctx += filtered_chunks[idx] + '\n'
  np_results.append(ctx)

Applying cosine similarity with NumPy yields the exact same results (retrieved contexts) as PyTorch.

pt_results == np_results
True
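As an extra check (not in the original notebook), the raw similarity scores for a question can also be compared numerically:

# compare PyTorch's scores (sorted descending) against the NumPy implementation's
pt_scores = F.cosine_similarity(tensor(q_embs[0]), tensor(data_embs), dim=1).numpy()
np_scores, _ = cosine_similarity_multiple(q_embs[0], data_embs)
assert np.allclose(np.sort(pt_scores)[::-1], np_scores, atol=1e-6)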

Summary of Results

Here are the results from this notebook. “Retrieved context relevancy” is the percentage of the 33 questions in the Questionnaire I was able to answer given the retrieved notebook chunks. Eventually, I’ll replace myself in the pipeline with an LLM that uses the retrieved context to answer the questions. “Top-n” means the chunks selected as context had the top-n cosine similarity scores. The best-performing approach was retrieving the top-5 small chunks.

| Top-n | Chunk Size | Retrieved Context Relevancy |
|-------|------------|-----------------------------|
| Top-5 | Small      | 76%                         |
| Top-3 | Small      | 70%                         |
| Top-3 | Large      | 67%                         |
| Top-1 | Small      | 51%                         |
| Top-1 | Large      | 48%                         |

Final Thoughts

This exercise was fun. I enjoy how such a simple technique as cosine similarity yields decent results. A few things I was thinking about while looking at the data and results in this notebook:

Some questions may not have relevant chunks in the dataset. For example, some of the questions ask the reader to do some activity. These should not be included in the evaluation set.

Cosine similarity didn’t perform well on questions I expected it to. For example, one of the Questionnaire questions is:

Are image models only useful for photos?

The answer to this question is “No” and there is a section in the chapter explicitly titled “Image Recognizers Can Tackle Non-Image Tasks” wherein the first paragraph reads:

An image recognizer can, as its name suggests, only recognize images. But a lot of things can be represented as images, which means that an image recognizer can learn to complete many tasks.

I would expect the cosine similarity between this question and context to be high. In fact, let’s take a look:

q = emb_model.encode("Are image models only useful for photos?")
c = emb_model.encode("An image recognizer can, as its name suggests, only recognize images. But a lot of things can be represented as images, which means that an image recognizer can learn to complete many tasks.")
q.shape, c.shape
((384,), (384,))
F.cosine_similarity(tensor(q), tensor(c), dim=0)
tensor(0.7046)

However, this cosine similarity is lower than the similarity between the question and its concatenated top-5 retrieved chunks:

F.cosine_similarity(tensor(q), tensor(emb_model.encode(pt_results[27])), dim=0)
tensor(0.7703)

This behavior fascinates me because it shows there’s room to explore better solutions: a hybrid approach of cosine similarity with full text search, experimenting with different keywords for full text search, or experimenting with different chunking strategies. These are the kinds of things I’ll explore in future notebooks as I expand beyond Chapter 1 and tackle the rest of the Questionnaires in Part 1 of the fastai course.
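To make the hybrid idea concrete, here’s a minimal sketch of one way keyword and semantic scores could be blended, assuming hypothetical bm25_scores and cos_scores arrays for a single query (an illustration of the idea, not a tested pipeline):

def hybrid_scores(bm25_scores, cos_scores, alpha=0.5):
    # min-max normalize each score array so the two scales are comparable
    def norm(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    # weighted blend of keyword (BM25) and semantic (cosine) relevance
    return alpha * norm(bm25_scores) + (1 - alpha) * norm(cos_scores)

# e.g., top5 = np.argsort(hybrid_scores(bm25_scores, cos_scores))[::-1][:5]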

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.