Using Hybrid Search to Answer the fastai Chapter 1 Questionnaire
python
RAG
information retrieval
fastbookRAG
In this blog post I use different approaches to combine FTS5 (keyword search) and Cosine Similarity (semantic search) to retrieve the context necessary to answer questions about Chapter 1 of the fastai textbook.
Author
Vishal Bakshi
Published
August 11, 2024
Background
In this blog post I’ll work through creating a hybrid search (keyword search + semantic search) baseline for a project I’m working on called fastbookRAG. In this project, I’m building a hybrid search + LLM pipeline to answer questions from the end-of-chapter Questionnaires in the freely available fastai textbook. For now, I’m taking the place of the LLM in the pipeline (using the context retrieved using keyword/semantic search to answer questions). In this notebook, I’ll focus on retrieving the relevant context needed to answer questions from Chapter 1. In future notebooks, I’ll be expanding this to all 8 lessons in Part 1 of the fastai course.
In two previous blog posts I was able to answer 72% of the Chapter 1 Questionnaire questions using full text search and 76% of questions using cosine similarity. I’ll try to beat those results using a combined approach.
I’ll summarize my results from this notebook in the table below. Cosine Similarity is represented as “CS” and full text search as “BM25”:
| Approach | Chunk Size (Paragraphs) | Questions Answered |
|---|---|---|
| Top-5 CS + Top-3 BM25 | 1 (CS), 3 (BM25) | 85% |
| Top-3 CS + Top-2 BM25 | 1 (CS), 3 (BM25) | 76% |
| Top-5 Weighted Average | 1 | 76% |
| Top-3 Weighted Average | 1 | 64% |
| Top-3 CS of Top-10 BM25 Results | 1 | 58% |
| Top-3 CS of Top-10 BM25 Results | 3 | 58% |
| Top-1 CS of Top-10 BM25 Results | 1 | 52% |
| Top-1 CS of Top-12 BM25 Results | 3 | 48% |
| Top-1 Weighted Average | 1 | 48% |
Show imports
```python
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap
import torch
from torch import tensor
import torch.nn.functional as F

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer

emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
```
Chunking the Chapter 1 Notebook into Paragraphs
As usual, I’ll start by chunking the Chapter 1 notebook into paragraphs and storing them in a SQLite database. I’ll wrap the database loading code in a function so that I can easily reuse it for different chunking strategies.
Show the chunking code
```python
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)

    return chunks

def filter_chunks(chunks, exclude_headers):
    filtered_chunks = []
    for chunk in chunks:
        lines = chunk.split('\n')
        # Check if the first line (header) is in the exclude list
        if not any(header in lines[0] for header in exclude_headers):
            filtered_chunks.append(chunk)
    return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
```
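The cells that actually call these helpers aren’t shown above; a minimal sketch of the usage (the notebook path is an assumption):

```python
# Hypothetical usage: the path to the Chapter 1 notebook is an assumption
chunks = get_chunks('01_intro.ipynb')
filtered_chunks = filter_chunks(chunks, exclude_headers)
```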
```python
conn = sqlite3.connect('/content/fastbook.db')

def load_data(filtered_chunks):
    conn = sqlite3.connect('/content/fastbook.db')
    cur = conn.cursor()

    res = cur.execute("""
    CREATE VIRTUAL TABLE fastbook_text USING FTS5(text);
    """)

    for string in filtered_chunks:
        cur.execute("INSERT INTO fastbook_text(text) VALUES (?)", (string,))
    conn.commit()

    res = cur.execute("SELECT * from fastbook_text").fetchall()
    conn.close()

    return len(res) == len(filtered_chunks)
```
```python
load_data(filtered_chunks)
```

```
True
```
Retrieving Top-1 Cosine Similarity of Top-10 BM25-Ranked Keyword Search Results
To give me some flexibility in which chunks are retrieved, I’ll use cosine similarity on the top-10 BM25-ranked full text search results. In previous experiments my best performing cosine similarity approach used top-5 small chunks and my best performing full text search approach used top-3 large chunks.
I’ll first create embeddings for the chunked data and the questions:
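Here’s a minimal sketch of that step (the `question_text` column name is an assumption; these embeddings are used as `data_embs` and `q_embs` below):

```python
# Sketch: embed every chunk and every question with the bge-small model loaded above.
# The question column name is an assumption.
data_embs = emb_model.encode(filtered_chunks, convert_to_tensor=True)
q_embs = emb_model.encode(df['question_text'].tolist(), convert_to_tensor=True)
```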
I’ll use ORDER BY rank and LIMIT 10 in my SQL query to get the top-10 BM25-ranked results:
Show the for-loop + query
```python
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])

        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 10
        """

        res = cur.execute(q).fetchall()
        res = [item[0] for item in res]
        results.append(res)
    else: # if keywords == "No Answer"
        res = "No answer"
        results.append(res)
```
```python
len(results), len(results[0])
```

```
(33, 10)
```
I’ll embed these chunks of context (up to 10 for each of the 33 questions) so that I can perform cosine similarity between them and the question embeddings. Note that not all questions have 10 chunks, as some keyword searches returned fewer than 10 chunks.
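A sketch of that embedding step (one tensor of chunk embeddings per question, with `None` standing in for questions marked "No answer"):

```python
# Sketch: embed each question's retrieved chunks so they can be compared
# against that question's embedding.
results_embs = []
for res in results:
    if res != "No answer":
        results_embs.append(emb_model.encode(res, convert_to_tensor=True))
    else:
        results_embs.append(None)
```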
I’ll apply cosine similarity between the chunk embeddings for the given question’s keyword search result and the question embedding and select the top result:
```python
cs_results = []
for i, q in enumerate(q_embs):
    if results[i] != "No answer":
        res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
        cs_results.append(results[i][res[1][0]])
    else:
        cs_results.append('No answer')
```
Using the retrieved chunks I was able to answer 17/33 or 52% of the Chapter 1 questions.
I’m curious: what was the BM25-rank of the context with the top-1 cosine similarity? In other words, did cosine similarity pick the highest ranked keyword search result?
```python
cs_ranks = []
for i, q in enumerate(q_embs):
    if results[i] != "No answer":
        res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
        cs_ranks.append(res[1][0].item())
    else:
        cs_ranks.append(None)
```
For 16 of the 33 questions, the context with the highest cosine similarity to the question was also the highest BM25-ranked keyword search result. 24 of the 33 highest-cosine-similarity results were top-3 BM25-ranked chunks. For 6 out of 33 questions, the highest cosine similarity was for a context that was not in the top-3 BM25-ranked results.
```python
pd.Series(cs_ranks).value_counts()
```

```
     count
0.0     16
1.0      5
2.0      3
8.0      2
4.0      2
7.0      1
3.0      1
```
Retrieving Top-3 Cosine Similarity of Top-10 BM25-Ranked Keyword Search Results
I’ll now retrieve 3 chunks (for each question) that have the top-3 highest cosine similarity values. I expect this to improve my ability to answer questions.
```python
cs_results = []
for i, q in enumerate(q_embs):
    if results[i] != "No answer":
        res = F.cosine_similarity(q, results_embs[i], dim=-1).sort(descending=True)
        res = '\n'.join([results[i][idx] for idx in res[1][:3]])
        cs_results.append(res)
    else:
        cs_results.append('No answer')
```
This approach slightly improves the performance. Now I can answer 19/33 or 58% of the questions with the given retrieved context. This still underperforms each individual approach.
Concatenating Keyword Search Results Before Applying Cosine Similarity
Next, I’ll try a different approach: I’ll concatenate, three at a time, keyword search results and then perform cosine similarity on those larger chunks. I’ll pick the larger chunk with the highest cosine similarity.
Since I’m concatenating 3 keyword search results at a time, I’ll increase my LIMIT to 12 (a multiple of 3).
Show the for-loop + query
```python
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])

        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 12
        """

        res = cur.execute(q).fetchall()

        concatenated_chunks = []
        for i in range(0, len(res), 3):
            # Select three tuples at a time
            chunk = res[i:i+3]
            # Extract strings and concatenate them
            concatenated_chunk = '\n'.join([t[0] for t in chunk])
            concatenated_chunks.append(concatenated_chunk)

        results.append(concatenated_chunks)
    else: # if keywords == "No Answer"
        res = "No answer"
        results.append(res)
```
This approach led to a worse performance: I was able to answer only 16 out of the 33 questions, or 48%.
Storing Larger Chunks in the Database
So far, I haven’t been able to beat the individual performance of full text search or cosine similarity. I’ll try something that improved full text search: storing larger chunks of data. To do this, I’ll concatenate three paragraphs at a time before loading them into the database:
```python
larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]
len(larger_chunks)
```

```
103
```
Note that I’m just deleting the SQLite database file and recreating it from scratch each time I run load_data:
```python
load_data(larger_chunks)
```

```
True
```
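The deletion step itself isn’t shown; a minimal sketch, assuming the database file lives at the same /content/fastbook.db path that load_data uses:

```python
import os

# Remove the existing database file (if any) so load_data can recreate the FTS5 table from scratch
if os.path.exists('/content/fastbook.db'):
    os.remove('/content/fastbook.db')
```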
Show the for-loop + query
```python
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])

        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 10
        """

        res = cur.execute(q).fetchall()
        res = [item[0] for item in res]
        results.append(res)
    else: # if keywords == "No Answer"
        res = "No answer"
        results.append(res)
```
With this approach (top-1 cosine similarity for the top-10 larger chunks retrieved by keyword search) I was able to answer 19 out of 33 questions, or 58%.
Weighted Average Between Cosine Similarity and BM25 Score
So far I’ve been applying cosine similarity to the top-n retrieved chunks based on BM25 score. I’ll now try a different approach: pick the top-n chunks based on a weighted average between cosine similarity and BM25 score.
To do this, I’ll revert to the smaller-chunk database. I’ll get the top-10 chunks based on BM25 score, then the top-10 chunks based on cosine similarity. Next, I’ll normalize each score within its group and take a weighted average of the two. Finally, I’ll pick the top-1, top-3, and top-5 chunks and see how many questions I can answer with the retrieved context.
```python
load_data(filtered_chunks)
```

```
True
```
Show the for-loop + query
```python
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])

        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 10
        """

        res = cur.execute(q).fetchall()
        results.append(res)
    else: # if keywords == "No Answer"
        res = "No answer"
        results.append(res)
```
```python
cs_results = []
for q in q_embs:
    res = F.cosine_similarity(q, data_embs, dim=1).sort(descending=True)
    top10_chunks = [filtered_chunks[el.item()] for el in res[1][:10]]
    top10_cs = [el.item() for el in res[0][:10]]
    res = list(zip(top10_chunks, top10_cs))
    cs_results.append(res)
```
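The step that turns results and cs_results into the bm25 and cs dataframes (with their normalized_score columns) isn’t shown. Here’s one way it could look; the helper names are mine, and min-max scaling within each question is an assumption (FTS5’s rank is negated since a lower rank means a better match):

```python
# Sketch (assumed, not the post's exact code): build per-question dataframes of
# (Question, chunk, score) rows and min-max normalize the scores within each question.
def to_df(per_question_results, flip_sign=False):
    rows = []
    for q_idx, res in enumerate(per_question_results):
        if res == "No answer":
            continue
        for chunk, score in res:
            # FTS5 rank is more negative for better matches, so negate it for BM25
            rows.append({'Question': q_idx, 'chunk': chunk, 'score': -score if flip_sign else score})
    return pd.DataFrame(rows)

def normalize(df):
    grp = df.groupby('Question')['score']
    # min-max normalization within each question group
    df['normalized_score'] = (df['score'] - grp.transform('min')) / (grp.transform('max') - grp.transform('min'))
    return df

bm25 = normalize(to_df(results, flip_sign=True))  # results holds (chunk, rank) tuples per question
cs = normalize(to_df(cs_results))                 # cs_results holds (chunk, cosine similarity) tuples
```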
Show code to prep data for the weighted average calcs
```python
# Function to process each dataframe
def process_df(df):
    # Ensure 'Question' is treated as a string to avoid any numeric mismatch
    df['Question'] = df['Question'].astype(str)
    # Create a unique identifier for each question/chunk combination
    df['question_chunk'] = df['Question'] + '_' + df['chunk']
    return df

# Process both dataframes
bm25 = process_df(bm25)
cs = process_df(cs)

# Create a full set of all question/chunk combinations
all_combinations = set(bm25['question_chunk']).union(set(cs['question_chunk']))

# Function to get scores, using 0 for missing combinations
def get_scores(df, all_combinations):
    scores = df.set_index('question_chunk')['normalized_score']
    # a set can't be used directly as an index, so convert it to a list
    return pd.Series(index=list(all_combinations), dtype=float).fillna(0).add(scores, fill_value=0)

# Get scores for both dataframes
bm25_scores = get_scores(bm25, all_combinations)
cs_scores = get_scores(cs, all_combinations)
```
```python
weight_bm25 = 0.3
weight_cs = 0.7
```
Show code to calculate weighted average
```python
weighted_avg = (weight_bm25 * bm25_scores + weight_cs * cs_scores) / (weight_bm25 + weight_cs)

# Create the final dataframe
result = pd.DataFrame({
    'question_chunk': weighted_avg.index,
    'weighted_score': weighted_avg.values
})

# Split 'question_chunk' back into 'Question' and 'chunk'
result[['Question', 'chunk']] = result['question_chunk'].str.split('_', n=1, expand=True)

# Reorder columns
result = result[['Question', 'chunk', 'weighted_score']]
```
Show code to get top-n results
```python
def get_top_n_chunks(df, n=5):
    # Function to get top N chunks and concatenate them
    def top_n_concat(group):
        top_n = group.nlargest(n, 'weighted_score')
        return pd.Series({
            'chunk': ' '.join(top_n['chunk']),
            'weighted_score': top_n['weighted_score'].mean()
        })

    # Apply the function to each question group
    result = df.groupby('Question').apply(top_n_concat).reset_index()

    # Sort the results by Question
    result = result.sort_values('Question')
    return result
```
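The calls that produce the top-n contexts aren’t shown; presumably something like:

```python
# Hypothetical usage of the helper above on the weighted-average dataframe
top3_weighted = get_top_n_chunks(result, n=3)
top5_weighted = get_top_n_chunks(result, n=5)
```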
Using the top-3 weighted average between BM25 and Cosine Similarity yielded chunks that allowed me to answer 21 out of 33 questions, or 64%, which was the best performing hybrid approach so far.
Using the top-5 weighted average between BM25 and Cosine Similarity allowed me to answer 25 out of 33 questions, or 76%, which is the best performing hybrid approach so far and matches the best Cosine Similarity-only approach.
Retrieving Top-3 Cosine Similarity and Top-2 Keyword Search Results
The next approach I’ll try: returning the top-3 chunks retrieved using Cosine Similarity and the top-2 chunks retrieved using Keyword Search.
Show code to get top-n results
```python
def get_top_n_chunks(df, n=5):
    # Function to get top N chunks and concatenate them
    def top_n_concat(group):
        top_n = group.nlargest(n, 'normalized_score')
        return pd.Series({
            'chunk': ' '.join(top_n['chunk']),
            'normalized_score': top_n['normalized_score'].mean()
        })

    # Apply the function to each question group
    result = df.groupby('Question').apply(top_n_concat).reset_index()

    # Sort the results by Question
    result = result.sort_values('Question')
    return result
```
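The top2_bm25 and top3_cs dataframes used in the next cell aren’t shown being created; presumably:

```python
# Hypothetical usage: top-2 BM25 and top-3 cosine-similarity chunks per question
top2_bm25 = get_top_n_chunks(bm25, n=2)
top3_cs = get_top_n_chunks(cs, n=3)
```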
```python
top5_combined = pd.concat([top2_bm25, top3_cs]).groupby('Question').agg({
    'chunk': ' '.join,            # Concatenate all chunks
    'normalized_score': 'mean'    # Take the mean of the normalized scores
}).reset_index()
```
Using the top-3 Cosine Similarity + top-2 BM25 chunks allowed me to answer 25 out of 33 questions, or 76%. These were the same 25 questions I could answer using the top-5 chunks (by weighted average).
Retrieving Top-5 Cosine Similarity and Top-3 Keyword Search Results
I’m hesitant to give the eventual LLM in this pipeline more than 5 chunks of context: while more chunks raise the chance that the relevant data is included, they also pull in a lot of irrelevant text, and I worry that the LLM (especially a relatively small one like phi-3) may get distracted by that irrelevant context. That being said, how the LLM responds to different contexts is something I’ll experiment with in the future to determine what counts as “too long” a context for the model.
The last hybrid approach I’ll pursue is combining the two best-performing individual approaches:
- BM25: use the top-3 large (3-paragraph) chunks
- Cosine Similarity: use the top-5 small (1-paragraph) chunks
I’ll rewrite the database table to contain the 3-paragraph-long chunks to use for keyword search:
```python
load_data(larger_chunks)
```

```
True
```
Show the for-loop + query
```python
results = []
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()

for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])

        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 3
        """

        res = cur.execute(q).fetchall()
        results.append(res)
    else: # if keywords == "No Answer"
        res = "No answer"
        results.append(res)
```
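The top3_bm25 and top5_cs dataframes used in the next cell aren’t shown being built; presumably the BM25 dataframe is rebuilt from these large-chunk results and the same get_top_n_chunks helper is applied (reusing the hypothetical to_df/normalize helpers sketched earlier):

```python
# Assumed, not shown in the post: rebuild the BM25 dataframe from the large-chunk results,
# then take the top-3 BM25 and top-5 cosine-similarity contexts per question.
bm25_large = normalize(to_df(results, flip_sign=True))
top3_bm25 = get_top_n_chunks(bm25_large, n=3)
top5_cs = get_top_n_chunks(cs, n=5)
```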
```python
top8_combined = pd.concat([top3_bm25, top5_cs]).groupby('Question').agg({
    'chunk': ' '.join,            # Concatenate all chunks
    'normalized_score': 'mean'    # Take the mean of the normalized scores
}).reset_index()

top8_combined.shape
```
This hybrid approach resulted in the best performance thus far (individual or hybrid)! With the top-8 chunks retrieved for each question (top-5 small chunks for Cosine Similarity, top-3 large chunks for keyword search) I was able to answer 28 out of 33 questions, or 85%. Factoring in that 3 of the questions are exercises to be done by the reader (and aren’t answerable using the Chapter 1 content), the true percentage is 93%.
Final Thoughts
I’ll summarize my results from this notebook in the table below. Cosine Similarity is represented as “CS” and full text search as “BM25”:
| Approach | Chunk Size (Paragraphs) | Questions Answered |
|---|---|---|
| Top-5 CS + Top-3 BM25 | 1 (CS), 3 (BM25) | 85% |
| Top-3 CS + Top-2 BM25 | 1 (CS), 3 (BM25) | 76% |
| Top-5 Weighted Average | 1 | 76% |
| Top-3 Weighted Average | 1 | 64% |
| Top-3 CS of Top-10 BM25 Results | 1 | 58% |
| Top-3 CS of Top-10 BM25 Results | 3 | 58% |
| Top-1 CS of Top-10 BM25 Results | 1 | 52% |
| Top-1 CS of Top-12 BM25 Results | 3 | 48% |
| Top-1 Weighted Average | 1 | 48% |
For the Chapter 1 Questionnaire, the most effective strategy is to combine the top-5 1-paragraph chunks (from Cosine Similarity) with the top-3 3-paragraph chunks (from BM25). Each method alone answered about 72% (BM25) and 76% (Cosine Similarity) of the questions, so combining them increases coverage, as each approach catches questions the other might miss.
With this baseline established for Chapter 1, I’ll now move on to the rest of the chapters covered in Part 1 of the fastai course.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.