Using Full Text Search to Answer the fastbook Chapter 1 Questionnaire
python
RAG
information retrieval
fastbookRAG
In this blog post I’ll walk through my experiments using SQLite full text search to retrieve context relevant to answering chapter review questions.
This is part of a larger “fastbookRAG” project that I’m working on, where I’ll use BM25, cosine similarity and an LLM (probably phi-3) to answer questions from each of the book chapters’ “Questionnaire” section (from Part 1 of the fastai course).
Show imports
```python
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap

wrapper = textwrap.TextWrapper(
    width=50,
    replace_whitespace=False,
    drop_whitespace=False
)

def print_wrap_text(text):
    print("\n".join(wrapper.fill(line) for line in text.splitlines()))
```
Here is a summary of the results of this notebook’s experiments:
| Top BM25 Ranked Chunks Retrieved | Chunk Size | Retrieved Context Relevancy* |
|---|---|---|
| top-3 | Large | 72% |
| top-1 | Large | 54% |
| top-3 | Small | 54% |
| top-1 | Small | 40% |

*Retrieved Context Relevancy: The percentage of questions for which the retrieved context was relevant and sufficient for me to answer the question.
Chunking the Chapter 1 Jupyter Notebook by Paragraph
The first task at hand is to load the Chapter 1 Jupyter Notebook into a sqlite database (so that I can perform full text search on it). There are a few different ways to do this:
Load in the entire chapter as a single string of text
Chunk the chapter text based on headers
Chunk the chapter text based on paragraphs (text between line breaks)
To start, I’ll combine the second and third options: I’ll chunk the text by paragraph and include the section header at the start of each string. So, for example, the following text:
This is a Header
This is paragraph one. It has two sentences.
This is paragraph two. It has two sentences as well.
will get chunked into the following strings:
```python
# string 1
"""
### This is a Header

This is paragraph one. It has two sentences.
"""

# string 2
"""
### This is a Header

This is paragraph two. It has two sentences as well.
"""
```
In this way I am capturing the granular information (the paragraph text) along with the big picture theme (the header). I suppose this is one way of capturing metadata about the text.
After a few iterations of feedback, I got the following code from Claude Sonnet-3.5 to chunk a .ipynb file into a list of strings.
Show the chunking code
```python
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)

    return chunks
```
Using this chunking strategy, Chapter 1 has 315 chunks of text. For reference, this chapter has 24 headers (of different levels) across 52 pages. That’s about 13 chunks per header and 6 chunks per page.
## Who We Are
Sylvain, on the other hand, knows a lot about
formal technical education. In fact, he has
written 10 math textbooks, covering the entire
advanced French maths curriculum!
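For reference, here's roughly how the chunks variable used in the rest of this post gets created, plus a quick sanity check of the counts quoted above. This is a sketch: the chapter notebook filename is an assumption, and the header count is approximate since it just looks at distinct first lines.

```python
# A sketch (the chapter 1 notebook filename below is an assumption).
chunks = get_chunks('01_intro.ipynb')

print(len(chunks))  # 315 chunks

# Approximate header count: each chunk starts with its section header.
headers = {chunk.split('\n')[0] for chunk in chunks}
print(len(headers))                 # ~24 headers
print(len(chunks) / len(headers))   # ~13 chunks per header
```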
Load the Chunks into SQLite Database
I’ll load the list of chunks into a SQLite virtual table with a single text column that has SQLite’s full text search (FTS5) enabled.
I’ll filter out the two sections that I don’t want to show up in the results: the “Questionnaire” section (as the keyword search will match the questions closely) and the “Further Research” section (as it comes after the “Questionnaire” section and is not part of the main body of the chapter).
```python
def filter_chunks(chunks, exclude_headers):
    filtered_chunks = []
    for chunk in chunks:
        lines = chunk.split('\n')
        # Check if the first line (header) is in the exclude list
        if not any(header in lines[0] for header in exclude_headers):
            filtered_chunks.append(chunk)
    return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
filtered_chunks = filter_chunks(chunks, exclude_headers)
```
```python
[chunk for chunk in filtered_chunks if 'Questionnaire' in chunk]
```
[]
```python
[chunk for chunk in filtered_chunks if 'Further Research' in chunk]
```
[]
conn = sqlite3.connect('fastbook.db')
```python
cur = conn.cursor()
res = cur.execute("""
CREATE VIRTUAL TABLE fastbook_text
USING FTS5(text);
""")
```
```python
for string in filtered_chunks:
    cur.execute("INSERT INTO fastbook_text(text) VALUES (?)", (string,))
```
```python
res = cur.execute("SELECT * from fastbook_text LIMIT 40")
print_wrap_text(res.fetchall()[30][0])
```
## Who We Are
Sylvain, on the other hand, knows a lot about
formal technical education. In fact, he has
written 10 math textbooks, covering the entire
advanced French maths curriculum!
Retrieving Context for an LLM with Keyword Search
In the fastai textbook, each chapter ends with a “Questionnaire” section. It’s like a review quiz for the chapter content. While the answers to these questions are not always fixed, I’ll use official solutions provided on the fastai forums as the “gold standard” for my evals. I have saved a CSV with the questionnaire question text, gold standard answer and a list of keywords I wrote for each question in this gist. Here’s a sample:
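Here's a sketch of loading that CSV into the dataframe used throughout the rest of this notebook and peeking at a sample (the filename is an assumption; the columns match the outputs shown later in this post):

```python
# Hypothetical filename; the gist CSV has one row per questionnaire question.
df = pd.read_csv('fastbook_chapter_1_questionnaire.csv')

# Columns: chapter, question_number, question_text, answer, keywords
print(df.columns.tolist())
df.head(3)
```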
Here’s the keyword search for the first question using the keywords I came up with (“math, data, expensive computers, PhD”):
```python
res = cur.execute("""
SELECT *, rank from fastbook_text
WHERE fastbook_text MATCH '"math" OR "data" OR "expensive computers" OR "PhD"'
ORDER BY rank
LIMIT 5
""")
```
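The wrapped output below was produced by printing the top-ranked row, with something like the following (a sketch reusing the print_wrap_text helper defined in the imports):

```python
# Print the top-ranked (lowest bm25 rank value) chunk for inspection.
print_wrap_text(res.fetchall()[0][0])
```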
## Deep Learning Is for Everyone
asciidoc
[[myths]]
.What you don't need to do deep learning
[options="header"]
|======
| Myth (don't need) | Truth
| Lots of math | Just high school math is
sufficient
| Lots of data | We've seen record-breaking
results with <50 items of data
| Lots of expensive computers | You can get what
you need for state of the art work for free
|======
I would imagine that given the question and this corresponding chunk of context from the textbook, an LLM could answer the question correctly.
And that’s the metric that I’ll use to evaluate keyword search for the fastai textbook: can a reasonably capable LLM answer the question given this context?
I’ll iterate through the column of keywords (one set for each question) and store the top BM25-ranked result as the “retrieved context” from my database:
Show the for-loop code
```python
results = []
for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 5
        """
        res = cur.execute(q)
        results.append(res.fetchall()[0][0])
    else:  # if keywords == "No answer"
        res = "No answer"
        results.append(res)
```
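The results presumably then get stored on the dataframe, mirroring what I do for fts5_result2 later on (the exact cell isn't shown above, so this assignment is an assumption):

```python
# Assumed assignment: the fts5_result column shows up in the outputs below.
df['fts5_result'] = results
```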
Using the top-ranked BM25 result for the given keyword search, I was able to retrieve the context necessary to answer the question 13 out of 33 times, or 40% of the time.
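To make the metric concrete, here's a minimal sketch of the calculation, assuming I record a manual yes/no judgment for each of the 33 questions (the list below is just the tally, not my actual per-question judgments):

```python
# 13 of the 33 retrieved contexts were sufficient to answer the question.
sufficient = [True] * 13 + [False] * 20
print(f"{sum(sufficient) / len(sufficient):.1%}")  # 39.4%, which I round to ~40%
```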
Looking through these 33 question/context pairs, I saw some patterns, illustrated by the following examples.
In some cases, like for question #5, the retrieved context answered only half of the question.
df.iloc[4]
4
chapter
1
question_number
5
question_text
What were the two theoretical misunderstandings that held back the field of neural networks?
answer
In 1969, Marvin Minsky and Seymour Papert demonstrated in their book, “Perceptrons”, that a single layer of artificial neurons cannot learn simple, critical mathematical functions like XOR logic gate. While they subsequently demonstrated in the same book that additional layers can solve this problem, only the first insight was recognized, leading to the start of the first AI winter.\\n\\nIn the 1980’s, models with two layers were being explored. Theoretically, it is possible to approximate any mathematical function using two layers of artificial neurons. However, in practices, these networks were too big and too slow. While it was demonstrated that adding additional layers improved performance, this insight was not acknowledged, and the second AI winter began. In this past decade, with increased data availability, and improvements in computer hardware (both in CPU performance but more importantly in GPU performance), neural networks are finally living up to its potential.
## Neural Networks: A Brief History\n\nIn the 1980's most models were built with a second layer of neurons, thus avoiding the problem that had been identified by Minsky and Papert (this was their "pattern of connectivity among units," to use the framework above). And indeed, neural networks were widely used during the '80s and '90s for real, practical projects. However, again a misunderstanding of the theoretical issues held back the field. In theory, adding just one extra layer of neurons was enough to allow any mathematical function to be approximated with these neural networks, but in practice such networks were often too big and too slow to be useful.
The retrieved context for question #16 was not wrong per se, but it didn’t answer the question in the same way as the gold standard answer. The “correct” answer to “What do you need in order to train a model?” is from the perspective of the model builder, whereas the retrieved context answers the same question but from the perspective of a business or organization.
df.iloc[15]
15
chapter
1
question_number
16
question_text
What do you need in order to train a model?
answer
You will need an architecture for the given problem. You will need data to input to your model. For most use-cases of deep learning, you will need labels for your data to compare your model predictions to. You will need a loss function that will quantitatively measure the performance of your model. And you need a way to update the parameters of the model in order to improve its performance (this is known as an optimizer).
keywords
train, model, need
fts5_result
### Limitations Inherent To Machine Learning\n\n- A model cannot be created without data.\n- A model can only learn to operate on the patterns seen in the input data used to train it.\n- This learning approach only creates *predictions*, not recommended *actions*.\n- It's not enough to just have examples of input data; we need *labels* for that data too (e.g., pictures of dogs and cats aren't enough to train a model; we need a label for each one, saying which ones are dogs, and which are cats).
In some cases, I was on the fence about the relevancy or accuracy of the retrieved context. I erred on the side of caution and considered the following question/answer/context triple as insufficient context:
df.iloc[24]
24
chapter
1
question_number
25
question_text
How can pretrained models help?
answer
Pretrained models have been trained on other problems that may be quite similar to the current task. For example, pretrained image recognition models were often trained on the ImageNet dataset, which has 1000 classes focused on a lot of different types of visual objects. Pretrained models are useful because they have already learned how to handle a lot of simple features like edge and color detection. However, since the model was trained for a different task than already used, this model cannot be used as is.
keywords
pretrained, model, help
fts5_result
### What Our Image Recognizer Learned\n\nWhen we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem. More generally, we could specialize such a pretrained model on many different tasks. Let's have a look at some examples.
For a number of questions, the keyword search resulted in an HTML img tag or code snippet since it contained the necessary keywords:
df.iloc[2]
2
chapter
1
question_number
3
question_text
What was the name of the first device that was based on the principle of the artificial neuron?
answer
Mark I perceptron built by Frank Rosenblatt
keywords
first, device, artificial, neuron
fts5_result
## Neural Networks: A Brief History\n\n<img alt="Natural and artificial neurons" width="500" caption="Natural and artificial neurons" src="images/chapter7_neuron.png" id="neuron"/>
df.iloc[17]
17
chapter
1
question_number
18
question_text
Do we always have to use 224×224-pixel images with the cat recognition model?
answer
No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
In some cases, the retrieved context seemed to be the chunk right before the chunk that would describe the solution in the text. This makes me wonder whether my chunk sizes are too small.
df.iloc[9]
9
chapter
1
question_number
10
question_text
Why is it hard to use a traditional computer program to recognize images in a photo?
answer
For us humans, it is easy to identify images in a photos, such as identifying cats vs dogs in a photo. This is because, subconsciously our brains have learned which features define a cat or a dog for example. But it is hard to define set rules for a traditional computer program to recognize a cat or a dog. Can you think of a universal rule to determine if a photo contains a cat or dog? How would you encode that as a computer program? This is very difficult because cats, dogs, or other objects, have a wide variety of shapes, textures, colors, and other features, and it is close to impossible to manually encode this in a traditional computer program.
keywords
image, recognize, recognition, traditional, computer, program
fts5_result
### What Is Machine Learning?\n\n```python\n#hide_input\n#caption A traditional program\n#id basic_program\n#alt Pipeline inputs, program, results\ngv('''program[shape=box3d width=1 height=0.7]\ninputs->program->results''')\n```
df.iloc[10]
10
chapter
1
question_number
11
question_text
What did Samuel mean by "weight assignment"?
answer
“weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
keywords
Samuel, weight, assignment
fts5_result
### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.
Including More Chunks During Retrieval
Based on the initial keyword search results, some of the retrieved chunks of context either partially answer the question, or only begin to set up the answer to the question. I see two ways of remedying this:
Increase the chunk size (i.e. choose a different chunking strategy)
Increase the number of chunks selected during keyword search
The second approach seems easier to implement, and I like the idea of retrieving a few small chunks rather than a single large chunk. I can imagine a couple of trade-offs:
A few small chunks may not capture information that is spread across a long paragraph or section and that the LLM needs to answer the question sufficiently.
A single large chunk may include information irrelevant to the question and thus introduce noise into the answer, confusing the LLM.
I’ll retrieve the top 3 BM25-ranked results and then evaluate them.
Show the for-loop code
```python
results = []
for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
        q = f"""
        SELECT *, rank from fastbook_text
        WHERE fastbook_text MATCH '{words}'
        ORDER BY rank
        LIMIT 3
        """
        res = cur.execute(q)
        results.append(res.fetchall())
    else:  # if keywords == "No answer"
        res = "No answer"
        results.append(res)
```
df['fts5_result2'] = results
df.head(3)
|   | chapter | question_number | question_text | answer | keywords | fts5_result2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Do you need these for deep learning?\\n\\n- Lo... | Lots of math - False\\nLots of data - False\\n... | math, data, expensive computers, PhD | [(## Deep Learning Is for Everyone\n\n```ascii... |
| 1 | 1 | 2 | Name five areas where deep learning is now the... | Any five of the following:\\nNatural Language ... | deep learning, state of the art, best, world | [(## Deep Learning Is for Everyone\n\nHere's a... |
| 2 | 1 | 3 | What was the name of the first device that was... | Mark I perceptron built by Frank Rosenblatt | first, device, artificial, neuron | [(## Neural Networks: A Brief History\n\n<img ... |
```python
top_3_results = df['fts5_result2'].apply(pd.Series)
top_3_results.columns = [f'fts5_result2_{i+1}' for i in range(top_3_results.shape[1])]
```
```python
for col in ['fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']:
    top_3_results[col] = top_3_results[col].apply(lambda x: x[0] if isinstance(x, tuple) else x)
```
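Later cells reference a top_3_small dataframe (and df itself shows the expanded columns), so presumably the expanded columns were joined back onto the questions. The exact cell didn't make it into this post, but it was likely something along these lines (a hedged reconstruction; the concat and the alias are my assumptions):

```python
# Hypothetical reconstruction: attach the expanded top-3 columns to the
# question dataframe so rows can be inspected by question.
df = pd.concat([df, top_3_results], axis=1)
top_3_small = df
```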
Using the top 3 BM25 ranked chunks improved the results! 18 out of the 33 questions, or 54%, are now answerable with the given retrieved context.
Here’s an example of a question that I couldn’t answer with the top-1 result but I can answer with the second-ranked result (fts5_result2_2):
df.iloc[19]
19
chapter
1
question_number
20
question_text
What is a validation set? What is a test set? Why do we need them?
answer
The validation set is the portion of the dataset that is not used for training the model, but for evaluating the model during training, in order to prevent overfitting. This ensures that the model performance is not due to “cheating” or memorization of the dataset, but rather because it learns the appropriate features to use for prediction. However, it is possible that we overfit the validation data as well. This is because the human modeler is also part of the training process, adjusting hyperparameters (see question 32 for definition) and training procedures according to the validation performance. Therefore, another unseen portion of the dataset, the test set, is used for final evaluation of the model. This splitting of the dataset is necessary to ensure that the model generalizes to unseen data.
keywords
validation, set, test
fts5_result2
[(## Validation Sets and Test Sets\n\nTo avoid this, our first step was to split our dataset into two sets: the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data that generalize to new data, the validation data., -9.62686734459166), (## Validation Sets and Test Sets\n\nHaving two levels of "reserved data"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible., -9.21394932308204), (### Use Judgment in Defining Test Sets\n\nInstead, use the earlier data as your training set (and the later data for the validation set), as shown in <<timeseries3>>., -9.06398171911968)]
fts5_result2_1
## Validation Sets and Test Sets\n\nTo avoid this, our first step was to split our dataset into two sets: the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data that generalize to new data, the validation data.
fts5_result2_2
## Validation Sets and Test Sets\n\nHaving two levels of "reserved data"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible.
fts5_result2_3
### Use Judgment in Defining Test Sets\n\nInstead, use the earlier data as your training set (and the later data for the validation set), as shown in <<timeseries3>>.
Increasing Chunk Size
Using 3 chunks of context instead of 1 increased the performance of retrieval from 40% to 54%, meaning that I would be able to answer the Chapter 1 questions with the retrieved context 54% of the time. I’ll call this metric “retrieved context relevancy”.
I’ll now increase the chunk size and see how that affects performance:
```python
larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]
```
Now I’ll create a separate table, fastbook_text_large, in my database to hold these chunks:
```python
cur = conn.cursor()
res = cur.execute("""
CREATE VIRTUAL TABLE fastbook_text_large
USING FTS5(text);
""")
```
```python
for string in larger_chunks:
    cur.execute("INSERT INTO fastbook_text_large(text) VALUES (?)", (string,))

res = cur.execute("SELECT * from fastbook_text_large LIMIT 2")
```
The outputs (which include Markdown) were messing up my Quarto blog rendering, so I've excluded them.
I’ll now iterate through my list of questions, passing the corresponding keywords to the query to conduct a full text search:
Show the for-loop code
```python
results = []
for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
        q = f"""
        SELECT *, rank from fastbook_text_large
        WHERE fastbook_text_large MATCH '{words}'
        ORDER BY rank
        LIMIT 1
        """
        res = cur.execute(q)
        results.append(res.fetchall()[0][0])
    else:  # if keywords == "No answer"
        res = "No answer"
        results.append(res)
```
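As before, the retrieved strings presumably get stored as a column (the large_chunk_result field appears in the outputs below); top_1_large, referenced in a later comparison, is my guess at a name for this view:

```python
# Assumed assignment: one top-ranked large chunk (a string) per question.
df['large_chunk_result'] = results
top_1_large = df   # hypothetical alias used in the comparisons further down
```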
Retrieving only the top-ranked large chunk for each question again gave a retrieved context relevancy of 54% (18 of the 33 questions). Here's an example (question #11, "What did Samuel mean by 'weight assignment'?") where the single large chunk contains the full answer:

answer
“weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
large_chunk_result
### What Is Machine Learning?\n\n- The idea of a "weight assignment" \n- The fact that every weight assignment has some "actual performance"\n- The requirement that there be an "automatic means" of testing that performance, \n- The need for a "mechanism" (i.e., another automatic process) for improving the performance by changing the weight assignments\n### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.\n### What Is Machine Learning?\n\nWeights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in order to produce its results—for instance, taking image pixels as inputs, and returning the classification "dog" as a result. The program's weight assignments are other values that define how the program will operate.
For comparison, here are the top 3 small chunks that were retrieved for the same question:

answer
“weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
fts5_result2_1
### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.
fts5_result2_2
### What Is Machine Learning?\n\n```python\n#hide_input\n#caption A program using weight assignment\n#id weight_assignment\ngv('''model[shape=box3d width=1 height=0.7]\ninputs->model->results; weights->model''')\n```
fts5_result2_3
### What Is Machine Learning?\n\n- The idea of a "weight assignment" \n- The fact that every weight assignment has some "actual performance"\n- The requirement that there be an "automatic means" of testing that performance, \n- The need for a "mechanism" (i.e., another automatic process) for improving the performance by changing the weight assignments
Here’s another question where larger chunks resulted in the correct retrieval (whereas 3 smaller chunks did not):
Do we always have to use 224×224-pixel images with the cat recognition model?
answer
No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
large_chunk_result
### How Our Image Recognizer Works\n\nFinally, we define the `Transform`s that we need. A `Transform` contains code that is applied automatically during training; fastai includes many predefined `Transform`s, and adding new ones is as simple as creating a Python function. There are two kinds: `item_tfms` are applied to each item (in this case, each item is resized to a 224-pixel square), while `batch_tfms` are applied to a *batch* of items at a time using the GPU, so they're particularly fast (we'll see many examples of these throughout this book).\n### How Our Image Recognizer Works\n\nWhy 224 pixels? This is the standard size for historical reasons (old pretrained models require this size exactly), but you can pass pretty much anything. If you increase the size, you'll often get a model with better results (since it will be able to focus on more details), but at the price of speed and memory consumption; the opposite is true if you decrease the size.\n### How Our Image Recognizer Works\n\n> Note: Classification and Regression: _classification_ and _regression_ have very specific meanings in machine learning. These are the two main types of model that we will be investigating in this book. A classification model is one which attempts to predict a class, or category. That is, it's predicting from a number of discrete possibilities, such as "dog" or "cat." A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location. Sometimes people use the word _regression_ to refer to a particular kind of model called a _linear regression model_; this is a bad practice, and we won't be using that terminology in this book!
Do we always have to use 224×224-pixel images with the cat recognition model?
answer
No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
### Running Your First Notebook\n\n```python\n#id first_training\n#caption Results from the first training\n# CLICK ME\nfrom fastai.vision.all import *\npath = untar_data(URLs.PETS)/'images'\n\ndef is_cat(x): return x[0].isupper()\ndls = ImageDataLoaders.from_name_func(\n path, get_image_files(path), valid_pct=0.2, seed=42,\n label_func=is_cat, item_tfms=Resize(224))\n\nlearn = vision_learner(dls, resnet34, metrics=error_rate)\nlearn.fine_tune(1)\n```
fts5_result2_3
### How Our Image Recognizer Works\n\nFinally, we define the `Transform`s that we need. A `Transform` contains code that is applied automatically during training; fastai includes many predefined `Transform`s, and adding new ones is as simple as creating a Python function. There are two kinds: `item_tfms` are applied to each item (in this case, each item is resized to a 224-pixel square), while `batch_tfms` are applied to a *batch* of items at a time using the GPU, so they're particularly fast (we'll see many examples of these throughout this book).
Even after reviewing each of the question/answer/context triplets for the three approaches, I'm still not developing a strong intuition for what works best. I'm hoping that after I have done this exercise for eight chapters, I'll have built some of that intuition.
Including More Large Chunks During Retrieval
The final experiment I’ll run is retrieving the top 3 BM25 ranked large chunks for each question. Using 3 small chunks and using 1 large chunk both resulted in a retrieved context relevancy of 54%. However they answered a different set of 18 questions. Perhaps if I combine both approaches (use larger chunk size AND use the top 3 BM25-ranked results for evaluation) I’ll obtain a higher retrieved context relevancy.
Show the for-loop code
```python
results = []
for keywords in df['keywords']:
    if keywords != 'No answer':
        words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
        q = f"""
        SELECT *, rank from fastbook_text_large
        WHERE fastbook_text_large MATCH '{words}'
        ORDER BY rank
        LIMIT 3
        """
        res = cur.execute(q)
        results.append(res.fetchall())
    else:  # if keywords == "No answer"
        res = "No answer"
        results.append(res)
```
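As with the small chunks, I expand each list of (text, rank) tuples into one column per ranked chunk. The variable and column names here are my assumptions, chosen to match the result_1/result_2 fields shown in the output further down:

```python
# Hypothetical reconstruction of the expansion step for the large chunks.
top_3_large = pd.Series(results).apply(pd.Series)
top_3_large.columns = [f'result_{i+1}' for i in range(top_3_large.shape[1])]

# Keep just the chunk text from each (text, rank) tuple.
for col in top_3_large.columns:
    top_3_large[col] = top_3_large[col].apply(lambda x: x[0] if isinstance(x, tuple) else x)

# Carry the question text alongside the results for inspection.
top_3_large['question_text'] = df['question_text']
```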
The combined approach (retrieving more, larger chunks) resulted in a retrieved context relevancy of 72%!! This is an increase of 18 percentage points over the previous two approaches (retrieving 3 small chunks, retrieving 1 large chunk). However, I'm concerned that the large amount of irrelevant text that comes along with it may distract the model from answering the question correctly and concisely, something I'll have to rigorously experiment with once I add an LLM to the pipeline.
Here is an example question for which the combined approach provided relevant context (whereas the previous two methods did not):
For the question:
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
Using the 3 highest BM25-ranked small chunks did not provide enough context to answer the question:
```python
# not relevant
top_3_small.iloc[3][['question_text', 'fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']]
```
3
question_text
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
fts5_result2_1
## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:
fts5_result2_2
## Neural Networks: A Brief History\n\nPerhaps the most pivotal work in neural networks in the last 50 years was the multi-volume *Parallel Distributed Processing* (PDP) by David Rumelhart, James McClellan, and the PDP Research Group, released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:
fts5_result2_3
## Neural Networks: A Brief History\n\nWe will see in this book that modern neural networks handle each of these requirements.
Neither did using the top-1 larger chunk:
```python
# not relevant
top_1_large.iloc[3][['question_text', 'large_chunk_result']]
```
3
question_text
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
large_chunk_result
## Neural Networks: A Brief History\n\n> : People are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. ...We will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain.\n## Neural Networks: A Brief History\n\nThe premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had been (at that point) so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claimed that the PDP approach was "closer \nthan other frameworks" to how the brain works, and therefore it might be better able to handle these kinds of tasks.\n## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:
However, using the top-3 larger chunks included the necessary context across the first and second-highest ranked chunks:
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
result_1
## Neural Networks: A Brief History\n\n> : People are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. ...We will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain.\n## Neural Networks: A Brief History\n\nThe premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had been (at that point) so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claimed that the PDP approach was "closer \nthan other frameworks" to how the brain works, and therefore it might be better able to handle these kinds of tasks.\n## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:
result_2
## Neural Networks: A Brief History\n\nRosenblatt further developed the artificial neuron to give it the ability to learn. Even more importantly, he worked on building the first device that actually used these principles, the Mark I Perceptron. In "The Design of an Intelligent Automaton" Rosenblatt wrote about this work: "We are now about to witness the birth of such a machine–-a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control." The perceptron was built, and was able to successfully recognize simple shapes.\n## Neural Networks: A Brief History\n\nAn MIT professor named Marvin Minsky (who was a grade behind Rosenblatt at the same high school!), along with Seymour Papert, wrote a book called _Perceptrons_ (MIT Press), about Rosenblatt's invention. They showed that a single layer of these devices was unable to learn some simple but critical mathematical functions (such as XOR). In the same book, they also showed that using multiple layers of the devices would allow these limitations to be addressed. Unfortunately, only the first of these insights was widely recognized. As a result, the global academic community nearly entirely gave up on neural networks for the next two decades.\n## Neural Networks: A Brief History\n\nPerhaps the most pivotal work in neural networks in the last 50 years was the multi-volume *Parallel Distributed Processing* (PDP) by David Rumelhart, James McClellan, and the PDP Research Group, released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:
Here is a summary of the results of this notebook’s experiments:
| Top BM25 Ranked Chunks Retrieved | Chunk Size | Retrieved Context Relevancy* |
|---|---|---|
| top-3 | Large | 72% |
| top-1 | Large | 54% |
| top-3 | Small | 54% |
| top-1 | Small | 40% |

*Retrieved Context Relevancy: The percentage of questions for which the retrieved context was relevant and sufficient for me to answer the question.
Final Thoughts
The experiments in this notebook are promising: using BM25 to retrieve the context necessary to answer Chapter 1 Questionnaire questions works 72% of the time. Of course, I still had to interpret the retrieved chunks to extract the answer, but that's something that can be easily done with an LLM down the road. In my next notebook, I'll use cosine similarity between the embeddings of the question text and the embeddings of the chunks, and see how that compares to BM25. In the notebook after that, I'll combine both and see how a hybrid approach performs.
Something else I'll experiment with is the list of keywords that I came up with, as they are critical to the performance of full text search.
Once I’ve established a baseline that I’m confident in, I’ll start introducing an LLM into the pipeline—first to generate keywords from the question for use in full text search, and then to extract the answer from the retrieved context.
I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.