Evaluating 4 Retrieval Methods with 6 Chunking Strategies on my fastbook-benchmark Dataset
python
fastbookRAG
information retrieval
In this blog post, I perform retrieval on the fastbook chapter documents using 24 different retrieval method–chunking strategy combinations and auto-score the results with my fastbook-benchmark dataset.
Author
Vishal Bakshi
Published
November 26, 2024
Background
In this notebook I re-run my full text search and semantic search retrieval methods for fastbook questions and use my newly curated fastbook-benchmark dataset to calculate Answer Component MRR@10 and Answer Component Recall@10 retrieval metrics. Due to how my benchmark dataset is structured, I had to modify (with Claude’s help) the classic MRR and Recall functions:
- Answer Component MRR@10: returns the reciprocal of the rank of the last passage needed to satisfy all answer_components for the question. So, if a question has 4 answer_components and their relevant contexts are spread across the first 5 retrieved passages, MRR is 1/5 = 0.2.
- Answer Component Recall@10: measures the proportion of answer_components for which at least one supporting context was retrieved. Using the same example, if the top-10 passages only contain contexts relevant to 2 of the 4 answer_components, Recall is 2/4 = 0.5.
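To make these two metrics concrete, here's a minimal self-contained sketch (toy data, not the benchmark code) that scores a question with two answer_components against a ranked list of retrieved passages:

```python
# Toy example: two answer_components, each with one acceptable context string.
question = {
    "answer_context": [
        {"context": ["weights are called parameters"]},
        {"context": ["convolution applies a kernel"]},
    ]
}

# Ranked list of retrieved passages (normally the top-10 from a retrieval method).
retrieved = [
    "In deep learning, weights are called parameters.",   # covers component 1 at rank 1
    "Gradient descent updates the parameters.",
    "A convolution applies a kernel across an image.",    # covers component 2 at rank 3
]

# Rank of the first passage that covers each component (None if not covered).
ranks = [
    next((i for i, p in enumerate(retrieved[:10], start=1)
          if any(c in p for c in comp["context"])), None)
    for comp in question["answer_context"]
]

# Answer Component MRR@10: reciprocal of the deepest rank needed to cover all components.
mrr = 0.0 if None in ranks else 1.0 / max(ranks)           # 1/3
# Answer Component Recall@10: fraction of components covered by at least one passage.
recall = sum(r is not None for r in ranks) / len(ranks)    # 2/2 = 1.0
print(mrr, recall)
```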
See the section below for how my benchmark dataset is structured.
There are four retrieval methods I implement in this notebook:

- Full text search (using sqlite and Claude-generated keywords)
- Single-vector cosine similarity (using BAAI/bge-small-en-v1.5 embeddings)
- ColBERTv2 (via RAGatouille)
- answerai-colbert-small-v1 (via RAGatouille)

I pair each method with six chunking strategies (labeled A–F), which vary the chunk size (1 vs. 3 paragraphs) and whether markdown headers, HTML tags, and punctuation are kept. Chunking Strategy F, for example, uses 3-paragraph chunks (w/headers, w/o HTML tags, w/o punctuation).
Here are the results from this notebook:
Answer Component MRR@10 (columns are Chunking Strategies A–F)

| Retrieval Method | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Full text search | 0.30 | 0.46 | 0.29 | 0.44 | 0.46 | 0.46 |
| Single-vector cosine similarity | 0.38 | 0.50 | 0.35 | 0.46 | 0.50 | 0.49 |
| ColBERTv2 | 0.46 | 0.49 | 0.41 | 0.50 | 0.49 | 0.44 |
| answerai-colbert-small-v1 | 0.48 | 0.52 | 0.45 | 0.52 | 0.52 | 0.45 |
Answer Component Recall@10 (columns are Chunking Strategies A–F)

| Retrieval Method | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Full text search | 65% | 83% | 65% | 82% | 83% | 83% |
| Single-vector cosine similarity | 71% | 85% | 72% | 82% | 87% | 86% |
| ColBERTv2 | 80% | 80% | 74% | 80% | 81% | 71% |
| answerai-colbert-small-v1 | 82% | 84% | 77% | 82% | 84% | 73% |
The best-performing retrieval method and chunking strategies for each metric:

| Metric Name | Retrieval Method | Chunking Strategies | Metric Value |
|---|---|---|---|
| Answer Component MRR@10 | answerai-colbert-small-v1 | B, D, E | 0.52 |
| Answer Component Recall@10 | Single-vector cosine similarity | E | 87% |
The fastbook-benchmark Dataset
The fastbook-benchmark dataset contains a list of items (questions). Each item looks something like this:
```json
{
    "chapter": 1,
    "question_number": 1,
    "question_text": "Do you need these for deep learning?\n\n- Lots of math...",
    "answer_context": [
        {
            "answer_component": "\"Lots of math...\"",
            "scoring_type": "simple",
            "context": [
                "...Lots of math..."
            ],
            "explicit_context": "true",
            "extraneous_answer": "false"
        }
    ],
    "question_context": []
},
```
I have broken down each gold_standard_answer into separate answer_components, each of which is associated with one or more contexts from the chapter text that address it. Here’s an example of a question with two answer_components:
```json
{
    "chapter": 1,
    "question_number": 5,
    "question_text": "What were the two theoretical misunderstandings that held back the field of neural networks?",
    "gold_standard_answer": "\"In 1969...\"",
    "answer_context": [
        {
            "answer_component": "\"In 1969...\"",
            "scoring_type": "simple",
            "context": [
                "An MIT professor named..."
            ],
            "explicit_context": "true",
            "extraneous_answer": "false"
        },
        {
            "answer_component": "\"\n\nIn the 1980's...\"",
            "scoring_type": "simple",
            "context": [
                "In the 1980's..."
            ],
            "explicit_context": "true",
            "extraneous_answer": "false"
        }
    ],
    "question_context": []
},
```
Any one of the contexts is sufficient to address the associated answer_component. For example, for Chapter 1 Question 12:
```json
{
    "answer_component": "We instead use the term parameters.",
    "scoring_type": "simple",
    "context": [
        "By the way, what Samuel called \"weights\" are most generally referred to as model *parameters* these days",
        "The *weights* are called *parameters*"
    ],
    "explicit_context": "true",
    "extraneous_answer": "false"
}
```
For some questions’ gold_standard_answer, I found that certain answer_components were extraneous to the goal of the question. These have been marked with the flag "extraneous_answer": "true".
For some answer_components, the corresponding context addresses the component only implicitly. These have been marked with the flag "explicit_context": "false".
Finally, for some answer_components I did not find relevant context in the given chapter, so the context field is assigned an empty list ([]).
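To get a feel for how often these flags appear, here’s a small sketch that tallies them across the benchmark (the JSON filename is my assumption; adjust it to wherever the dataset lives):

```python
import json
from collections import Counter

# Load the benchmark; filename assumed, not taken from the post.
with open("fastbook-benchmark.json", "r", encoding="utf-8") as f:
    benchmark = json.load(f)

flags = Counter()
for q in benchmark["questions"]:
    for comp in q.get("answer_context", []):
        flags["extraneous_answer=" + comp.get("extraneous_answer", "false")] += 1
        flags["explicit_context=" + comp.get("explicit_context", "true")] += 1
        if not comp.get("context"):  # no supporting context found in the chapter
            flags["empty_context"] += 1

print(flags)
```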
All of the design decisions for this fastbook-benchmark dataset have largely been driven by one goal: don’t change the gold_standard_answer. I have been using the fastai Forums’ Wiki solutions page for each chapter as the gold standard answer set (example).
```python
import sqlite3
import json
import re
import os
import pandas as pd, numpy as np
import requests
import torch
import torch.nn.functional as F

from ftfy import fix_text
from sentence_transformers import SentenceTransformer
from ragatouille import RAGPretrainedModel

emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
```
Show chunking code
```python
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])

            # see if the cell starts with a markdown header
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:
                # grab the header
                current_header = header_match.group(1)

                # add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    # split content into paragraphs
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    # append each paragraph to the list of chunks
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                # split content into paragraphs
                paragraphs = re.split(r'\n\s*\n', content)
                # append each paragraph to the list of chunks
                for paragraph in paragraphs:
                    add_chunk(paragraph)

        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'

            # include the output of the code cell
            output_content = ''
            if 'outputs' in cell and cell['outputs']:
                for output in cell['outputs']:
                    if 'text' in output:
                        output_content += ''.join(output['text'])
                    elif 'data' in output and 'text/plain' in output['data']:
                        output_content += ''.join(output['data']['text/plain'])

            # combine code and output in the same chunk
            combined_content = code_content + '\n\nOutput:\n' + output_content if output_content else code_content
            add_chunk(combined_content)

    def filter_chunks(chunks, exclude_headers=["Questionnaire", "Further Research"]):
        filtered_chunks = []
        for chunk in chunks:
            lines = chunk.split('\n')
            # check if the first line (header) is in the exclude list
            if not any(header in lines[0] for header in exclude_headers):
                filtered_chunks.append(chunk)
        return filtered_chunks

    return filter_chunks(chunks)
```
Show chunking code
```python
def combine_chunks(chunks, num_p=3):
    combined_chunks = []
    current_header = None
    current_group = []

    for chunk in chunks:
        # Extract header from chunk
        header = chunk.split('\n\n')[0]

        if header != current_header:
            if len(current_group) > 1:  # Only add if group has content besides header
                # Add current group to combined chunks if header changes
                combined_chunks.append('\n\n'.join(current_group))
            # Update current header
            current_header = header
            # Start new group with header and content of current chunk
            current_group = [header, chunk.split('\n\n', 1)[1] if len(chunk.split('\n\n')) > 1 else '']
        else:
            if len(current_group) < num_p + 1:  # +1 to account for header
                # Add chunk content (without header) to current group
                current_group.append(chunk.split('\n\n', 1)[1] if len(chunk.split('\n\n')) > 1 else '')
            if len(current_group) == num_p + 1:  # +1 to account for header
                # Add full group to combined chunks
                combined_chunks.append('\n\n'.join(current_group))
                # Reset current group, keeping the header
                current_group = [current_header]

    if len(current_group) > 1:  # Only add if group has content besides header
        # Add any remaining group to combined chunks
        combined_chunks.append('\n\n'.join(current_group))

    return combined_chunks
```
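To see what combine_chunks actually produces, here’s a tiny self-contained example with made-up chunks (not from the book):

```python
toy_chunks = [
    "## Header A\n\npara 1",
    "## Header A\n\npara 2",
    "## Header A\n\npara 3",
    "## Header A\n\npara 4",
    "## Header B\n\npara 5",
]

# Paragraphs sharing a header are grouped (up to num_p per chunk);
# a new group starts whenever the header changes.
combine_chunks(toy_chunks, num_p=3)
# ['## Header A\n\npara 1\n\npara 2\n\npara 3',
#  '## Header A\n\npara 4',
#  '## Header B\n\npara 5']
```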
Show the load_data function used for full text search
```python
def load_data(chunks, db_path, chapter=1):
    try:
        # create virtual table if database doesn't exist
        if not os.path.exists(db_path):
            with sqlite3.connect(db_path) as conn:
                cur = conn.cursor()
                cur.execute("""
                    CREATE VIRTUAL TABLE fastbook_text
                    USING FTS5(chapter, text);
                """)
                conn.commit()

        # load in the chunks for each chapter
        with sqlite3.connect(db_path) as conn:
            cur = conn.cursor()
            for chunk in chunks:
                cur.execute("INSERT INTO fastbook_text(chapter, text) VALUES (?, ?)", (chapter, chunk))
            conn.commit()

            res = cur.execute("SELECT * FROM fastbook_text WHERE chapter = ?", (chapter,)).fetchall()

        # make sure all the data was loaded into the database
        if len(res) != len(chunks):
            raise ValueError(f"Number of inserted chunks ({len(res)}) doesn't match input chunks ({len(chunks)})")

        return True

    except sqlite3.Error as e:
        print(f"An error occurred: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return False
```
Show the db_search function used for full text search
```python
def db_search(df, limit=1):
    results = []
    with sqlite3.connect('fastbook.db') as conn:
        cur = conn.cursor()

        # concatenate the keywords into a string "keyword1 OR keyword2 OR keyword3 ..."
        for _, row in df.iterrows():
            keywords = ' OR '.join([f'"{keyword.strip(",")}"' for keyword in row['keywords'].replace('"', '').split()])

            q = f"""
            SELECT text, rank
            FROM fastbook_text
            WHERE fastbook_text MATCH ?
            AND chapter = ?
            ORDER BY rank
            LIMIT ?
            """

            res = cur.execute(q, (keywords, str(row['chapter']), limit)).fetchall()

            # grab the retrieved chunk from the query results
            res = [item[0] for item in res]

            # append the retrieved chunks for this question
            results.append(res)
    return results
```
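For reference, db_search expects a DataFrame with a chapter column and a keywords string per question. A hedged usage sketch (hypothetical keywords; the real ones are Claude-generated per question, and fastbook.db must already be populated via load_data):

```python
import pandas as pd

# Hypothetical single-question DataFrame; the real one pairs each of the
# 191 benchmark questions with Claude-generated keywords for its chapter.
questions_df = pd.DataFrame({
    "chapter": [1],
    "keywords": ["perceptron, Minsky, Papert, XOR"],
})

# Returns a list (one entry per question) of up to `limit` retrieved chunks.
top_passages = db_search(questions_df, limit=10)
print(top_passages[0][0][:200])  # first 200 characters of the top-ranked chunk
```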
Download chapter ipynb files
```python
urls = {
    '01_intro.ipynb': 'https://drive.google.com/uc?export=view&id=1mmBjFH_plndPBC4iRZHChfMazgBxKK4_',
    '02_production.ipynb': 'https://drive.google.com/uc?export=view&id=1Cf5QHthHy1z13H0iu3qrzAWgquCfqVHk',
    '04_mnist_basics.ipynb': 'https://drive.google.com/uc?export=view&id=113909_BNulzyLIKUNJHdya0Hhoqie30I',
    '08_collab.ipynb': 'https://drive.google.com/uc?export=view&id=1BtvStgFjUtvtqbSZNrL7Y2N-ey3seNZU',
    '09_tabular.ipynb': 'https://drive.google.com/uc?export=view&id=1rHFvwl_l-AJLg_auPjBpNrOgG9HDnfqg',
    '10_nlp.ipynb': 'https://drive.google.com/uc?export=view&id=1pg1pH7jMMElzrXS0kBBz14aAuDsi2DEP',
    '13_convolutions.ipynb': 'https://drive.google.com/uc?export=view&id=19P-eEHpAO3WrOvdxgXckyhHhfv_R-hnS'
}

def download_file(url, filename):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in write-binary mode
        with open(filename, 'wb') as file:
            # Write the content of the response to the file
            file.write(response.content)
        print(f"File downloaded successfully: {filename}")
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

for fname, url in urls.items():
    download_file(url, fname)
```
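Later cells reference an nbs dict mapping chapter numbers to notebook paths, but its definition isn’t shown in this post. A plausible reconstruction, consistent with the chapters used later (my assumption, not the original cell):

```python
# Assumed mapping from chapter number (as a string) to the downloaded notebook path.
nbs = {
    '1': '01_intro.ipynb',
    '2': '02_production.ipynb',
    '4': '04_mnist_basics.ipynb',
    '8': '08_collab.ipynb',
    '9': '09_tabular.ipynb',
    '10': '10_nlp.ipynb',
    '13': '13_convolutions.ipynb',
}
```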
```python
def calculate_mrr(question, retrieved_passages, cutoff=10):
    retrieved_passages = retrieved_passages[:cutoff]
    highest_rank = 0

    for ans_comp in question["answer_context"]:
        contexts = ans_comp.get("context", [])
        component_found = False

        for rank, passage in enumerate(retrieved_passages, start=1):
            if any(fix_text(context) in fix_text(passage) for context in contexts):
                highest_rank = max(highest_rank, rank)
                component_found = True
                break

        if not component_found:
            return 0.0

    return 1.0 / highest_rank if highest_rank > 0 else 0.0
```
Show calculate_recall function
```python
def calculate_recall(question, retrieved_passages, cutoff=10):
    retrieved_passages = retrieved_passages[:cutoff]

    # Track if we've found at least one context for each answer component
    ans_comp_found = []

    for ans_comp in question["answer_context"]:
        contexts = ans_comp.get("context", [])
        found = False

        # Check if any context for this answer component appears in retrieved passages
        for passage in retrieved_passages:
            if any(fix_text(context) in fix_text(passage) for context in contexts):
                found = True
                break

        ans_comp_found.append(found)

    # Recall is ratio of answer components with at least one found context
    return sum(ans_comp_found) / len(ans_comp_found)
```
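The loop that applies these two functions across all 191 questions isn’t shown in this post; a minimal sketch, assuming results is a list of top-10 passage lists aligned with benchmark["questions"]:

```python
import numpy as np

def score_results(benchmark, results, cutoff=10):
    # Average the two metrics over every (question, retrieved passages) pair.
    # Assumes results[i] holds the top-k passages for benchmark["questions"][i].
    mrrs = [calculate_mrr(q, passages, cutoff)
            for q, passages in zip(benchmark["questions"], results)]
    recalls = [calculate_recall(q, passages, cutoff)
               for q, passages in zip(benchmark["questions"], results)]
    return np.mean(mrrs), np.mean(recalls)
```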
Show fts_retrieval function
```python
def fts_retrieval(data, questions):
    if os.path.exists("fastbook.db"):
        os.remove("fastbook.db")

    for chapter, chunks in data.items():
        print(f"Chapter {chapter}:", load_data(chunks, 'fastbook.db', chapter))

    print("Retrieving passages...")
    results = db_search(questions, limit=10)

    assert len(results) == 191
    for res in results:
        assert len(res) <= 10

    print("Retrieval complete.")
    return results
```
Show single_vector_retrieval function
```python
def single_vector_retrieval(data, benchmark):
    # Group questions by chapter
    questions = {}
    for q in benchmark["questions"]:
        chapter = str(q["chapter"])
        if chapter not in questions:
            questions[chapter] = []
        questions[chapter].append(q['question_text'].strip('"\''))

    q_embs = {}
    print("Encoding Questions...")
    for chapter, _ in data.items():
        qs = questions[chapter]
        q_embs[chapter] = emb_model.encode(qs, convert_to_tensor=True)

    data_embs = {}
    print("Encoding Data...")
    for chapter, chunks in data.items():
        data_embs[chapter] = emb_model.encode(chunks, convert_to_tensor=True)

    results = []
    print("Retrieving passages...")
    for chapter in ['1', '2', '4', '8', '9', '10', '13']:
        # Compute cosine similarity and get top 10 indices for each row
        idxs = F.cosine_similarity(q_embs[chapter].unsqueeze(1), data_embs[chapter].unsqueeze(0), dim=2).sort(descending=True)[1]
        top_10_idxs = idxs[:, :10]  # Get the top 10 indices for each row

        # Extract top 10 chunks for each row
        top_10_chunks = [
            [data[chapter][idx.item()] for idx in row_idxs]
            for row_idxs in top_10_idxs
        ]

        results.extend(top_10_chunks)

    assert len(results) == 191
    for res in results:
        assert len(res) <= 10

    print("Retrieval complete.")
    return results
```
Show ragatouille_retrieval function
```python
def ragatouille_retrieval(data, benchmark, model_nm="colbert-ir/colbertv2.0"):
    # Group questions by chapter
    questions_by_chapter = {}
    for q in benchmark["questions"]:
        chapter = str(q["chapter"])
        if chapter not in questions_by_chapter:
            questions_by_chapter[chapter] = []
        questions_by_chapter[chapter].append(q)

    # Dictionary to store results per chapter
    chapter_results = {}
    chapter_metrics = {}

    # Initialize ColBERT model
    RAG = RAGPretrainedModel.from_pretrained(model_nm)

    # Process each chapter separately
    for chapter in nbs.keys():
        print(f"\nProcessing Chapter {chapter}")

        # Create chapter-specific index
        index_path = RAG.index(
            index_name=f"chapter_{chapter}_index",
            collection=data[chapter],
            document_ids=[f"{chapter}_{i}" for i in range(len(data[chapter]))]
        )

        # Get questions for this chapter
        chapter_questions = questions_by_chapter[chapter]

        # Perform retrieval for each question in this chapter
        results = []
        for q in chapter_questions:
            retrieved = RAG.search(q["question_text"].strip('"\''), k=10)
            results.append(retrieved)

        # Store results
        chapter_results[chapter] = results

    results = []
    for chapter, res in chapter_results.items():
        results.extend(res)

    assert len(results) == 191

    final_results = []
    for res in results:
        assert len(res) <= 10
        intermediate_results = [r['content'] for r in res]
        final_results.append(intermediate_results)

    print("Retrieval complete.")
    return final_results
```
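The answerai-colbert-small-v1 runs presumably reuse this same function with a different model name. A hedged usage sketch (the Hugging Face model ID is my assumption; it isn’t shown in this post):

```python
# ColBERTv2 retrieval (default model_nm)
colbertv2_results = ragatouille_retrieval(data, benchmark)

# answerai-colbert-small-v1 retrieval; model ID assumed, not taken from the post
answerai_results = ragatouille_retrieval(
    data, benchmark, model_nm="answerdotai/answerai-colbert-small-v1"
)
```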
Next, I’ll remove markdown headers from each chunk.
```python
# chunking each notebook
data = {}
for chapter, nb in nbs.items():
    data[chapter] = get_chunks(nb)

# strip markdown headers from each chunk
for chapter, chunks in data.items():
    data[chapter] = [re.sub(r'^#+\s+[^\n]+\n*', '', c) for c in data[chapter]]

total_chunks = 0
for chapter, chunks in data.items():
    print(chapter, len(chunks))
    total_chunks += len(chunks)

assert total_chunks == 1967  # 1-paragraph chunks
```
```python
def combine_chunks2(chunks, num_p=3):
    """
    Combines text chunks into groups of specified size (num_p).
    If chunks have no headers, treats them as standalone content.
    """
    combined_chunks = []
    current_group = []

    for chunk in chunks:
        if len(current_group) < num_p:
            current_group.append(chunk)

        if len(current_group) == num_p:
            combined_chunks.append('\n\n'.join(current_group))
            current_group = []

    # Add any remaining chunks
    if current_group:
        combined_chunks.append('\n\n'.join(current_group))

    return combined_chunks
```
```python
for chapter, chunks in data.items():
    data[chapter] = combine_chunks2(chunks, num_p=3)

total_chunks = 0
for chapter, chunks in data.items():
    print(chapter, len(chunks))
    total_chunks += len(chunks)

assert total_chunks == 659
```
Chunking Strategy E: 3-paragraph (w/headers, w/o HTML tags)
I’ll add headers back, but will remove HTML tags.
```python
# chunking each notebook
data = {}
for chapter, nb in nbs.items():
    data[chapter] = get_chunks(nb)

total_chunks = 0
for chapter, chunks in data.items():
    total_chunks += len(chunks)

assert total_chunks == 1967  # 1-paragraph chunks

for chapter, chunks in data.items():
    data[chapter] = combine_chunks(chunks, num_p=3)

total_chunks = 0
for chapter, chunks in data.items():
    total_chunks += len(chunks)

assert total_chunks == 713
```
```python
chunks[3]
```

```
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n<img src="images/chapter9_conv_basic.png" id="basic_conv" caption="Applying a kernel to one location" alt="Applying a kernel to one location" width="700">'
```
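The clean_html helper used in the next cell isn’t defined anywhere in this post; here’s a minimal sketch of what such a helper might look like (my assumption, not the original implementation), written so it strips tags like the img tag above while leaving the book’s <<cross_reference>> markers intact:

```python
import re

def clean_html(text):
    # Remove HTML tags (e.g. the <img ...> figure tags in the notebooks)
    # while skipping fastbook's <<cross_reference>> markers.
    # A stand-in sketch; the original helper isn't shown in the post.
    return re.sub(r'<(?!<)[^<>]+>(?!>)', '', text)
```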
```python
for chapter, chunks in data.items():
    data[chapter] = [clean_html(chunk) for chunk in chunks]

total_chunks = 0
for chapter, chunks in data.items():
    total_chunks += len(chunks)

assert total_chunks == 713
```
```python
chunks[3]
```

```
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n'
```
Chunking Strategy F: 3-paragraph (w/headers, w/o HTML tags, w/o punctuation)
Finally, I’ll keep headers, remove HTML tags and remove all punctuation.
```python
chunks[3]
```

```
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n'
```
```python
def remove_punctuation(text):
    import string
    return ''.join(
        char if char.isalnum() or char == '#'
        else ' ' if char in string.punctuation
        else char
        for char in text
    )

remove_punctuation(chunks[3])
```

```
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision and is surprisingly straightforward To do it we use something called a convolution A convolution requires nothing more than multiplication and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book \n\nA convolution applies a kernel across an image A kernel is a little matrix such as the 3×3 matrix in the top right of basic conv \n\n'
```
```python
for chapter, chunks in data.items():
    data[chapter] = [remove_punctuation(chunk) for chunk in chunks]

total_chunks = 0
for chapter, chunks in data.items():
    total_chunks += len(chunks)

assert total_chunks == 713
```
```python
chunks[3]
```

```
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision and is surprisingly straightforward To do it we use something called a convolution A convolution requires nothing more than multiplication and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book \n\nA convolution applies a kernel across an image A kernel is a little matrix such as the 3×3 matrix in the top right of basic conv \n\n'
```
Since I’m removing punctuation from the chunks, I need to remove it from the benchmark dataset’s context strings as well. I think a better solution would be to modify the scoring functions to strip punctuation there, but I’m saving some time and space by just copying the benchmark dataset and removing punctuation from each context string in it:
```python
def process_contexts(data):
    # Process questions
    for question in data['questions']:
        # Process only answer_context
        if 'answer_context' in question:
            for context_item in question['answer_context']:
                if 'context' in context_item:
                    if isinstance(context_item['context'], list):
                        # If context is a list, process each string in the list
                        context_item['context'] = [
                            remove_punctuation(text) if text else text
                            for text in context_item['context']
                        ]
                    elif isinstance(context_item['context'], str):
                        # If context is a single string, process it directly
                        context_item['context'] = remove_punctuation(context_item['context'])
    return data

modified_benchmark = process_contexts(benchmark)
```

```
['An MIT professor named Marvin Minsky who was a grade behind Rosenblatt at the same high school along with Seymour Papert wrote a book called Perceptrons MIT Press about Rosenblatt s invention They showed that a single layer of these devices was unable to learn some simple but critical mathematical functions such as XOR In the same book they also showed that using multiple layers of the devices would allow these limitations to be addressed Unfortunately only the first of these insights was widely recognized As a result the global academic community nearly entirely gave up on neural networks for the next two decades ']
```
Here are the definitions of the metrics, retrieval methods and chunking strategies that I am using in this benchmark evaluation:
Metrics
- Answer Component MRR@10: returns the reciprocal of the rank of the last passage needed to satisfy all answer_components for the question. So, if a question has 4 answer_components and their relevant contexts are spread across the first 5 retrieved passages, MRR is 1/5 = 0.2.
- Answer Component Recall@10: measures the proportion of answer_components for which at least one supporting context was retrieved. Using the same example, if the top-10 passages only contain contexts relevant to 2 of the 4 answer_components, Recall is 2/4 = 0.5.
Retrieval Methods
- Full text search (using sqlite and Claude-generated keywords)
- Single-vector cosine similarity (using BAAI/bge-small-en-v1.5 embeddings)
- ColBERTv2 (via RAGatouille)
- answerai-colbert-small-v1 (via RAGatouille)
Chunking Strategies
The six chunking strategies (A–F) vary the chunk size (1 vs. 3 paragraphs) and whether markdown headers, HTML tags, and punctuation are kept. Chunking Strategy F, for example, uses 3-paragraph chunks (w/headers, w/o HTML tags, w/o punctuation).
Here are the results from this notebook:
Answer Component MRR@10 (columns are Chunking Strategies A–F)

| Retrieval Method | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Full text search | 0.30 | 0.46 | 0.29 | 0.44 | 0.46 | 0.46 |
| Single-vector cosine similarity | 0.38 | 0.50 | 0.35 | 0.46 | 0.50 | 0.49 |
| ColBERTv2 | 0.46 | 0.49 | 0.41 | 0.50 | 0.49 | 0.44 |
| answerai-colbert-small-v1 | 0.48 | 0.52 | 0.45 | 0.52 | 0.52 | 0.45 |
Answer Component Recall@10 (columns are Chunking Strategies A–F)

| Retrieval Method | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Full text search | 65% | 83% | 65% | 82% | 83% | 83% |
| Single-vector cosine similarity | 71% | 85% | 72% | 82% | 87% | 86% |
| ColBERTv2 | 80% | 80% | 74% | 80% | 81% | 71% |
| answerai-colbert-small-v1 | 82% | 84% | 77% | 82% | 84% | 73% |
The best-performing retrieval method and chunking strategies for each metric:

| Metric Name | Retrieval Method | Chunking Strategies | Metric Value |
|---|---|---|---|
| Answer Component MRR@10 | answerai-colbert-small-v1 | B, D, E | 0.52 |
| Answer Component Recall@10 | Single-vector cosine similarity | E | 87% |
I was quite surprised that single-vector cosine similarity yielded the best Recall. I was less surprised that answerai-colbert-small-v1 had the best MRR@10 since it was better than the other retrieval methods for 5 out of 6 chunking strategies. Other noteworthy observations:
ColBERTv2 and answerai-colbert-small-v1 both experienced a considerable performance drop when punctuation was removed from the documents.
Full text search was very competitive after the chunk size was increased to 3 paragraphs (B, D, E, F). It yielded the second-highest MRR@10 for Chunking Strategy F (3-paragraph, w/headers, w/o HTML tags, w/o punctuation).
Removing HTML tags (Chunking Strategy E) improved the Recall@10 of all four retrieval methods compared to when they were included (Chunking Strategy D). The biggest beneficiary of removing them was single-vector cosine similarity (82% -> 87%).
A couple of notes about my process:
Having a benchmark dataset saved me about 15-20 hours of manual evaluation.
Refactoring the code (into a do_retrieval function) made it easier for me to quickly iterate on different chunking strategies.
Before I move on to experimenting with hybrid approaches (full text search + semantic search) I want to research and apply chunking strategies that are particularly suited to ColBERTv2 and answerai-colbert-small-v1 to see if I can improve on the overall-best Recall@10 of 87% and MRR@10 of 0.52.