Using Hybrid Search to Answer the fastai Chapter 1 Questionnaire

python
RAG
information retrieval
fastbookRAG
In this blog post I use different approaches to combine FTS5 (keyword search) and Cosine Similarity (semantic search) to retrieve context necessary to answer questions about Chapter 1 of the fastai textbook.
Author

Vishal Bakshi

Published

August 11, 2024

Background

In this blog post I’ll work through creating a hybrid search (keyword search + semantic search) baseline for a project I’m working on called fastbookRAG. In this project, I’m building a hybrid search + LLM pipeline to answer questions from the end-of-chapter Questionnaires in the freely available fastai textbook. For now, I’m taking the place of the LLM in the pipeline (using the context retrieved by keyword/semantic search to answer the questions myself). In this notebook, I’ll focus on retrieving the relevant context needed to answer questions from Chapter 1. In future notebooks, I’ll expand this to all 8 lessons in Part 1 of the fastai course.

In two previous blog posts I was able to answer 72% of the Chapter 1 Questionnaire questions using full text search and 76% of the questions using cosine similarity. I’ll try to beat those results using a combined approach.

I’ll summarize my results from this notebook in the table below. Cosine Similarity is represented as “CS” and full text search as “BM25”:

| Approach | Chunk Size (Paragraphs) | Questions Answered |
|---|---|---|
| Top-5 CS + Top-3 BM25 | 1 (CS), 3 (BM25) | 85% |
| Top-3 CS + Top-2 BM25 | 1 | 76% |
| Top-5 Weighted Average | 1 | 76% |
| Top-3 Weighted Average | 1 | 64% |
| Top-3 CS of Top-10 BM25 Results | 1 | 58% |
| Top-1 CS of Top-10 BM25 Results | 3 | 58% |
| Top-1 CS of Top-10 BM25 Results | 1 | 52% |
| Top-1 CS of Top-12 BM25 Results | 3 | 48% |
| Top-1 Weighted Average | 1 | 48% |
Show imports
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap
import torch
from torch import tensor
import torch.nn.functional as F

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

Chunking the Chapter 1 Notebook into Paragraphs

As usual, I’ll start by chunking the Chapter 1 notebook into paragraphs and storing them in a SQLite database. I’ll wrap the database loading code in a function so that I can easily reuse it for different chunking strategies.

Show the chunking code
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)

    return chunks

def filter_chunks(chunks, exclude_headers):
  filtered_chunks = []
  for chunk in chunks:
      lines = chunk.split('\n')
      # Check if the first line (header) is in the exclude list
      if not any(header in lines[0] for header in exclude_headers):
          filtered_chunks.append(chunk)
  return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
notebook_path = '01_intro.ipynb'
chunks = get_chunks(notebook_path)
assert len(chunks) == 315
filtered_chunks = filter_chunks(chunks, exclude_headers)
assert len(filtered_chunks) == 307
Show the db loading function
conn = sqlite3.connect('/content/fastbook.db')

def load_data(filtered_chunks):
  conn = sqlite3.connect('/content/fastbook.db')
  cur = conn.cursor()
  res = cur.execute("""

  CREATE VIRTUAL TABLE fastbook_text
  USING FTS5(text);
  """)

  for string in filtered_chunks:
    cur.execute(f"INSERT INTO fastbook_text(text) VALUES (?)", (string,))

  conn.commit()
  res = cur.execute("SELECT * from fastbook_text").fetchall()
  conn.close()
  return len(res) == len(filtered_chunks)
load_data(filtered_chunks)
True

Retrieving Top-1 Cosine Similarity of Top-10 BM25-Ranked Keyword Search Results

To give me some flexibility in which chunks are retrieved, I’ll use cosine similarity on the top-10 BM25-ranked full text search results. In previous experiments my best performing cosine similarity approach used top-5 small chunks and my best performing full text search approach used top-3 large chunks.

I’ll first create embeddings for the chunked data and the questions:

data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)
data_embs.shape
torch.Size([307, 384])
# Chapter 1 Questionnaire questions, answers and keywords
df = pd.read_csv("https://gist.githubusercontent.com/vishalbakshi/309fb3abb222d32446b2c4e29db753fe/raw/bc6cd2ab15b64a92ec23796c61702f413fdd2b40/fastbookRAG_evals.csv")
df.head(3)
chapter question_number question_text answer keywords
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron
q_embs = emb_model.encode(df['question_text'], convert_to_tensor=True)
q_embs.shape
torch.Size([33, 384])

I’ll use ORDER BY rank and LIMIT 10 in my SQL query to get the top-10 BM25-ranked results:

Show the for-loop + query
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 10

    """
    res = cur.execute(q).fetchall()
    res = [item[0] for item in res]
    results.append(res)
  else:
    # if keywords == "No Answer"
    res = "No answer"
    results.append(res)
len(results), len(results[0])
(33, 10)
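As an aside, the MATCH expression can also be passed to the query as a bound parameter instead of being interpolated into the SQL string, which avoids quoting issues if a keyword ever contains an apostrophe. A minimal sketch (same table and ranking as above, not what the rest of this notebook runs):

keywords = "first, device, artificial, neuron"  # example keywords taken from the evals CSV
words = ' OR '.join([f'"{w.strip(",")}"' for w in keywords.split()])
res = cur.execute(
    "SELECT text, rank FROM fastbook_text WHERE fastbook_text MATCH ? ORDER BY rank LIMIT 10",
    (words,)
).fetchall()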

I’ll embed these chunks of context (up to 10 for each of the 33 questions) so that I can perform cosine similarity between them and the question embeddings. Note that not all questions have 10 chunks, as some keyword searches returned fewer than 10 chunks.

len(results[-3])
2
results_embs = {}

for idx, sublist in enumerate(results):
  results_embs[idx] = emb_model.encode(sublist, convert_to_tensor=True)

I’ll apply cosine similarity between the chunk embeddings for the given question’s keyword search result and the question embedding and select the top result:

cs_results = []

for i, q in enumerate(q_embs):
  if results[i] != "No answer":
    res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
    cs_results.append(results[i][res[1][0]])
  else:
    cs_results.append('No answer')
len(cs_results)
33
df['retrieved_context'] = pd.Series(cs_results)
df.head(3)
chapter question_number question_text answer keywords retrieved_context
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top1_top10_results.csv', index=False)

Using the retrieved chunks I was able to answer 17/33 or 52% of the Chapter 1 questions.

I’m curious: what was the BM25-rank of the context with the top-1 cosine similarity? In other words, did cosine similarity pick the highest ranked keyword search result?

cs_ranks = []

for i, q in enumerate(q_embs):
  if results[i] != "No answer":
    res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
    cs_ranks.append(res[1][0].item())
  else:
    cs_ranks.append(None)

For 16 of the 33 questions, the context with the highest cosine similarity to the question was also the highest BM25-ranked keyword search result. For 24 of the 33 questions, the highest-cosine-similarity chunk was among the top-3 BM25-ranked chunks. For 6 of the 33 questions, the highest cosine similarity belonged to a context that was not in the top-3 BM25 ranking.

pd.Series(cs_ranks).value_counts()
count
0.0 16
1.0 5
2.0 3
8.0 2
4.0 2
7.0 1
3.0 1

Retrieving Top-3 Cosine Similarity of Top-10 BM25-Ranked Keyword Search Results

For each question, I’ll now retrieve the 3 chunks with the highest cosine similarity values. I expect this to improve my ability to answer questions.

cs_results = []

for i, q in enumerate(q_embs):
  if results[i] != "No answer":
    res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
    res = '\n'.join([results[i][idx] for idx in res[1][:3]])
    cs_results.append(res)
  else:
    cs_results.append('No answer')
df['retrieved_context'] = pd.Series(cs_results)
df.head(3)
chapter question_number question_text answer keywords retrieved_context
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top3_top10_results.csv', index=False)

This approach slightly improves the performance. Now I can answer 19/33 or 58% of the questions with the given retrieved context. This still underperforms each individual approach.

Concatenating Keyword Search Result Before Applying Cosine Similarity

Next, I’ll try a different approach: I’ll concatenate the keyword search results three at a time and then perform cosine similarity on those larger chunks, picking the larger chunk with the highest cosine similarity.

Since I’m concatenating 3 keyword search results at a time, I’ll increase my LIMIT to 12 (a multiple of 3).

Show the for-loop + query
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 12

    """

    res = cur.execute(q).fetchall()
    concatenated_chunks = []
    for i in range(0, len(res), 3):
        # Select three tuples at a time
        chunk = res[i:i+3]
        # Extract strings and concatenate them
        concatenated_chunk = '\n'.join([t[0] for t in chunk])
        concatenated_chunks.append(concatenated_chunk)

    results.append(concatenated_chunks)
  else:
    # if keywords == "No Answer"
    res = "No answer"
    results.append(res)
len(results)
33
Show the embedding of results
results_embs = {}

for idx, sublist in enumerate(results):
  results_embs[idx] = emb_model.encode(sublist, convert_to_tensor=True)
cs_results = []

for i, q in enumerate(q_embs):
  if results[i] != "No answer":
    res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
    cs_results.append(results[i][res[1][0]])
  else:
    cs_results.append('No answer')
len(cs_results)
33
df['retrieved_context'] = pd.Series(cs_results)
df.head(3)
chapter question_number question_text answer keywords retrieved_context
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\n```asciido...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nHere's a l...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\n<img al...
df.to_csv('top1_group-by-3_results.csv', index=False)

This approach performed worse: I was able to answer only 16 out of the 33 questions, or 48%.

Storing Larger Chunks in the Database

So far, I haven’t been able to beat the individual performance of full text search or cosine similarity. I’ll try something that improved full text search: storing larger chunks of data. To do this, I’ll concatenate three paragraphs at a time before loading them into the database:

larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]
len(larger_chunks)
103

Note that I’m simply deleting my SQLite database file and recreating it from scratch each time I run load_data.
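That reset step is nothing fancier than removing the file so the FTS5 table gets rebuilt; a minimal sketch (assuming the /content/fastbook.db path used in load_data above):

import os

# Delete the existing database file so load_data can recreate the fastbook_text table from scratch
if os.path.exists('/content/fastbook.db'):
    os.remove('/content/fastbook.db')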

load_data(larger_chunks)
True
Show the for-loop + query
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 10

    """
    res = cur.execute(q).fetchall()
    res = [item[0] for item in res]
    results.append(res)
  else:
    # if keywords == "No Answer"
    res = "No answer"
    results.append(res)
len(results)
33
Show the cosine similarity code
results_embs = {}

for idx, sublist in enumerate(results):
  results_embs[idx] = emb_model.encode(sublist, convert_to_tensor=True)

cs_results = []

for i, q in enumerate(q_embs):
  if results[i] != "No answer":
    res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
    cs_results.append(results[i][res[1][0]])
  else:
    cs_results.append('No answer')
len(cs_results)
33
df['retrieved_context'] = pd.Series(cs_results)
df.head(3)
chapter question_number question_text answer keywords retrieved_context
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## How to Learn Deep Learning\n\n> : A PhD is ...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nDeep learn...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
df.to_csv('top1_top10-larger_results.csv', index=False)

With this approach (top-1 cosine similarity for the top-10 larger chunks retrieved by keyword search) I was able to answer 19 out of 33 questions, or 58%.

Weighted Average Between Cosine Similarity and BM25 Score

So far I’ve been applying cosine similarity to the top-n retrieved chunks based on BM25 score. I’ll now try a different approach: pick the top-n chunks based on a weighted average between cosine similarity and BM25 score.

To do this, I’ll revert to the database of smaller (1-paragraph) chunks. I’ll get the top-10 chunks based on BM25, then the top-10 chunks based on cosine similarity. Finally, I’ll normalize each score within its group and take a weighted average of the two. I’ll then pick the top-1, top-3, and top-5 chunks and see how many questions I can answer with the retrieved context.
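Schematically, the scoring works like this (toy numbers for three hypothetical chunks, just to illustrate the min-max normalization and the 0.3/0.7 weighting used below):

# Toy scores for three hypothetical chunks, scored by both methods
bm25_raw = [-5.0, -3.5, -2.1]   # FTS5 rank: more negative = better match
cs_raw = [0.82, 0.67, 0.55]     # cosine similarity: higher = better

def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

bm25_norm = minmax([-s for s in bm25_raw])  # negate so higher = better, then scale to [0, 1]
cs_norm = minmax(cs_raw)

weight_bm25, weight_cs = 0.3, 0.7
weighted = [(weight_bm25 * b + weight_cs * c) / (weight_bm25 + weight_cs)
            for b, c in zip(bm25_norm, cs_norm)]
# weighted is roughly [1.0, 0.46, 0.0] for this toy example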

load_data(filtered_chunks)
True
Show the for-loop + query
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 10

    """
    res = cur.execute(q).fetchall()
    results.append(res)
  else:
    # if keywords == "No Answer"
    res = "No answer"
    results.append(res)
Show code to embed the chunks
data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)
data_embs.shape
torch.Size([307, 384])
Show code calculating cosine similarity
cs_results = []

for q in q_embs:
  res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
  top10_chunks = [filtered_chunks[el.item()] for el in res[1][:10]]
  top10_cs = [el.item() for el in res[0][:10]]
  res = list(zip(top10_chunks, top10_cs))
  cs_results.append(res)
len(results), len(cs_results)
(33, 33)
Show code to create DataFrame from results
def create_dataframe(nested_list, label):
    rows = []
    for question_num, question_data in enumerate(nested_list, start=1):
        if question_data != "No answer":
          for chunk, score in question_data:
            rows.append({
                'label': label,
                'Question': question_num,
                'chunk': chunk,
                'score': score
            })
        else:
          rows.append({
              'label': label,
              'Question': question_num,
              'chunk': 'No answer',
              'score': np.nan
          })
    return pd.DataFrame(rows)
bm25 = create_dataframe(results, 'BM25')
cs = create_dataframe(cs_results, 'CS')
bm25.shape, cs.shape
((277, 4), (330, 4))
Show code to normalize scores within each group
def normalize_scores(group):
    min_score = group['score'].min()
    max_score = group['score'].max()
    group['normalized_score'] = (group['score'] - min_score) / (max_score - min_score)
    return group

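# FTS5's rank (BM25) is more negative for better matches, so negate it before normalizing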
bm25['score'] = -1 * bm25['score']
bm25 = bm25.groupby('Question').apply(normalize_scores).reset_index(drop=True)
cs = cs.groupby('Question').apply(normalize_scores).reset_index(drop=True)
Show code to prep data for the weighted average calcs
# Function to process each dataframe
def process_df(df):
    # Ensure 'Question' is treated as a string to avoid any numeric mismatch
    df['Question'] = df['Question'].astype(str)
    # Create a unique identifier for each question/chunk combination
    df['question_chunk'] = df['Question'] + '_' + df['chunk']
    return df

# Process both dataframes
bm25 = process_df(bm25)
cs = process_df(cs)

# Create a full set of all question/chunk combinations
all_combinations = set(bm25['question_chunk']).union(set(cs['question_chunk']))


# Function to get scores, using 0 for missing combinations
def get_scores(df, all_combinations):
    scores = df.set_index('question_chunk')['normalized_score']
    return pd.Series(index=all_combinations).fillna(0).add(scores, fill_value=0)

# Get scores for both dataframes
bm25_scores = get_scores(bm25, all_combinations)
cs_scores = get_scores(cs, all_combinations)
weight_bm25 = 0.3
weight_cs = 0.7
Show code to calculate weighted average
weighted_avg = (weight_bm25 * bm25_scores + weight_cs * cs_scores) / (weight_bm25 + weight_cs)

# Create the final dataframe
result = pd.DataFrame({
    'question_chunk': weighted_avg.index,
    'weighted_score': weighted_avg.values
})

# Split 'question_chunk' back into 'Question' and 'chunk'
result[['Question', 'chunk']] = result['question_chunk'].str.split('_', n=1, expand=True)

# Reorder columns
result = result[['Question', 'chunk', 'weighted_score']]
Show code to get top-n results
def get_top_n_chunks(df, n=5):
    # Function to get top N chunks and concatenate them
    def top_n_concat(group):
        top_n = group.nlargest(n, 'weighted_score')
        return pd.Series({
            'chunk': ' '.join(top_n['chunk']),
            'weighted_score': top_n['weighted_score'].mean()
        })

    # Apply the function to each question group
    result = df.groupby('Question').apply(top_n_concat).reset_index()

    # Sort the results by Question
    result = result.sort_values('Question')

    return result
get_top_n_chunks(result, n=1).to_csv('top1_weighted_average.csv', index=False)

Using the top-1 weighted average between BM25 and Cosine Similarity yielded chunks that allowed me to answer 16 out of 33 questions, or 48%.

get_top_n_chunks(result, n=3).to_csv('top3_weighted_average.csv', index=False)

Using the top-3 weighted average between BM25 and Cosine Similarity yielded chunks that allowed me to answer 21 out of 33 questions, or 64%, which is the best-performing hybrid approach so far.

get_top_n_chunks(result, n=5).to_csv('top5_weighted_average.csv', index=False)

Using the top-5 weighted average between BM25 and Cosine Similarity allowed me to answer 25 out of 33 questions, or 76%, which is the best-performing hybrid approach so far and matches the best Cosine Similarity-only approach.

Retrieving Top-3 Cosine Similarity and Top-2 Keyword Search Results

The next approach I’ll try: returning the top-3 chunks retrieved using Cosine Similarity and the top-2 chunks retrieved using Keyword Search.

Show code to get top-n results
def get_top_n_chunks(df, n=5):
    # Function to get top N chunks and concatenate them
    def top_n_concat(group):
        top_n = group.nlargest(n, 'normalized_score')
        return pd.Series({
            'chunk': ' '.join(top_n['chunk']),
            'normalized_score': top_n['normalized_score'].mean()
        })

    # Apply the function to each question group
    result = df.groupby('Question').apply(top_n_concat).reset_index()

    # Sort the results by Question
    result = result.sort_values('Question')

    return result
top2_bm25 = get_top_n_chunks(bm25,n=2)
top3_cs = get_top_n_chunks(cs, n=3)
top5_combined = pd.concat([top2_bm25, top3_cs]).groupby('Question').agg({
        'chunk': ' '.join,  # Concatenate all chunks
        'normalized_score': 'mean'  # Take the mean of the normalized scores
    }).reset_index()
top5_combined.to_csv('top5_combined.csv', index=False)

Using the top-3 Cosine Similarity + top-2 BM25 chunks allowed me to answer 25 out of 33 questions, or 76%. These were the same 25 questions I could answer using the top-5 chunks (by weighted average).

Retrieving Top-5 Cosine Similarity and Top-3 Keyword Search Results

I’m hesitant to use more than 5 chunks of context (for the eventual LLM in this pipeline): while more chunks raise the chance that relevant data is included, they also pull a lot of irrelevant data into the context, and I worry that the LLM (especially a relatively small one like phi-3) may get distracted by it. That said, how the LLM responds to different contexts is something I’ll experiment with in the future to determine when a context becomes “too long” for the model.

The last hybrid approach I’ll pursue is combining the two best-performing individual approaches:

  • BM25: Use the top-3 large (3-paragraph) chunks
  • Cosine Similarity: Use the top-5 small (1-paragraph) chunks

I’ll rewrite the database table to contain the 3-paragraph-long chunks to use for keyword search:

load_data(larger_chunks)
True
Show the for-loop + query
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 3

    """
    res = cur.execute(q).fetchall()
    results.append(res)
  else:
    # if keywords == "No Answer"
    res = "No answer"
    results.append(res)
bm25 = create_dataframe(results, 'BM25')
bm25['normalized_score'] = -1 * bm25['score']
top3_bm25 = get_top_n_chunks(bm25,n=3)
top3_bm25.shape
(33, 3)
top5_cs = get_top_n_chunks(cs, n=5)
top5_cs['Question'] = top5_cs['Question'].astype(int)
top5_cs.shape
(33, 3)
top8_combined = pd.concat([top3_bm25, top5_cs]).groupby('Question').agg({
        'chunk': ' '.join,  # Concatenate all chunks
        'normalized_score': 'mean'  # Take the mean of the normalized scores
    }).reset_index()
top8_combined.shape
(33, 3)
top8_combined.to_csv('top8_combined.csv', index=False)

This hybrid approach resulted in the best performance thus far (individual or hybrid)! With the top-8 chunks retrieved for each question (top-5 small chunks for Cosine Similarity, top-3 large chunks for keyword search) I was able to answer 28 out of 33 questions, or 85%. Factoring in that 3 of the questions are exercises to be done by the reader (and aren’t answerable using the Chapter 1 content), the true percentage is 93%.

Final Thoughts

I’ll summarize my results from this notebook in the table below. Cosine Similarity is represented as “CS” and full text search as “BM25”:

| Approach | Chunk Size (Paragraphs) | Questions Answered |
|---|---|---|
| Top-5 CS + Top-3 BM25 | 1 (CS), 3 (BM25) | 85% |
| Top-3 CS + Top-2 BM25 | 1 | 76% |
| Top-5 Weighted Average | 1 | 76% |
| Top-3 Weighted Average | 1 | 64% |
| Top-3 CS of Top-10 BM25 Results | 1 | 58% |
| Top-1 CS of Top-10 BM25 Results | 3 | 58% |
| Top-1 CS of Top-10 BM25 Results | 1 | 52% |
| Top-1 CS of Top-12 BM25 Results | 3 | 48% |
| Top-1 Weighted Average | 1 | 48% |

For the Chapter 1 Questionnaire, the most effective strategy is to combine the top-5 1-paragraph chunks (from Cosine Similarity) with the top-3 3-paragraph chunks (from BM25). Each method alone answered about 72% (BM25) and 76% (Cosine Similarity) of the questions, so combining them increases coverage, as each approach catches questions the other might miss.
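To make that recipe concrete, here’s a condensed sketch of the winning retrieval step, reusing the objects defined above (F, filtered_chunks, data_embs, q_embs, and a cursor over the database loaded with larger_chunks); the function packaging is mine, not code from the notebook:

def retrieve_hybrid_context(question_emb, keywords, cur):
    # Top-5 one-paragraph chunks by cosine similarity against the question embedding
    sims = F.cosine_similarity(question_emb, data_embs, dim=1).sort(descending=True)
    cs_chunks = [filtered_chunks[i.item()] for i in sims[1][:5]]

    # Top-3 three-paragraph chunks by BM25 (assumes fastbook_text currently holds larger_chunks)
    words = ' OR '.join([f'"{w.strip(",")}"' for w in keywords.split()])
    res = cur.execute(
        "SELECT text FROM fastbook_text WHERE fastbook_text MATCH ? ORDER BY rank LIMIT 3",
        (words,)
    ).fetchall()
    bm25_chunks = [r[0] for r in res]

    return '\n'.join(cs_chunks + bm25_chunks)

For a given question, question_emb would be the corresponding row of q_embs and keywords the row’s keywords string from the evals CSV.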

With this baseline established for Chapter 1, I’ll now move on to the rest of the chapters covered in Part 1 of the fastai course.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.