Using Full Text Search to Answer the fastbook Chapter 1 Questionnaire

python
RAG
information retrieval
fastbookRAG
In this blog post I’ll walk through my experiments using SQLite full text search to retrieve context relevant to answering chapter review questions. This is part of a larger fastbookRAG project I’m working on.
Author

Vishal Bakshi

Published

August 4, 2024

Background

In this notebook I’ll use BM25 (full text search in SQLite) to answer questions about Chapter 1 of the freely available fastai textbook.

This is part of a larger “fastbookRAG” project that I’m working on, where I’ll use BM25, cosine similarity, and an LLM (probably phi-3) to answer questions from each book chapter’s “Questionnaire” section (for the chapters covered in Part 1 of the fastai course).

Show imports
import sqlite3
import json
import re
import pandas as pd, numpy as np
import textwrap

wrapper = textwrap.TextWrapper(
    width=50,
    replace_whitespace=False,
    drop_whitespace=False)

def print_wrap_text(text):
  print("\n".join(wrapper.fill(line) for line in text.splitlines()))

Here is a summary of the results of this notebook’s experiments:

Top BM25 Ranked Chunks Retrieved Chunk Size Retrieved Context Relevancy*
top-3 Large 72%
top-1 Large 54%
top-3 Small 54%
top-1 Small 40%

*Retrieved Context Relevancy: the percentage of questions for which the retrieved context was relevant and sufficient for me to answer the question.

Chunking the Chapter 1 Jupyter Notebook by Paragraph

The first task at hand is to load the Chapter 1 Jupyter Notebook into a SQLite database (so that I can perform full text search on it). There are a few different ways to do this:

  • Load in the entire chapter as a single string of text
  • Chunk the chapter text based on headers
  • Chunk the chapter text based on paragraphs (text between line breaks)

To start, I’ll combine the second and third options: chunk the text by paragraph and prepend the section header to the start of each string. For example, the following text:


This is a Header

This is paragraph one. It has two sentences.

This is paragraph two. It has two sentences as well.


will get chunked into the following strings:

# string 1
"""### This is a Header

This is paragraph one. It has two sentences.
"""

# string 2
"""### This is a Header

This is paragraph two. It has two sentences as well.
"""

In this way I am capturing the granular information (the paragraph text) along with the big picture theme (the header). I suppose this is one way of capturing metadata about the text.

After a few iterations of feedback, I got the following code from Claude 3.5 Sonnet to chunk a .ipynb file into a list of strings.

Show the chunking code
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:  # Check if the cell starts with a header
                current_header = header_match.group(1)
                # Add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                paragraphs = re.split(r'\n\s*\n', content)
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
            code_content = '```python\n' + ''.join(cell['source']) + '\n```'
            add_chunk(code_content)

    return chunks

Using this chunking strategy, Chapter 1 has 315 chunks of text. For reference, this chapter has 24 headers (of different levels) across 52 pages. That’s about 13 chunks per header and 6 chunks per page.

notebook_path = '01_intro.ipynb'
chunks = get_chunks(notebook_path)
len(chunks)
315

Here’s an example of one of the chunks:

print_wrap_text(chunks[30])
## Who We Are

Sylvain, on the other hand, knows a lot about 
formal technical education. In fact, he has 
written 10 math textbooks, covering the entire 
advanced French maths curriculum!

Load the Chunks into SQLite Database

I’ll load the list of chunks into a SQLite virtual table with a single column, text, that has full text search (FTS5) enabled.

I’ll filter out the two sections that I don’t want to show up in the results: the “Questionnaire” section (as the keyword search will match the questions closely) and the “Further Research” section (as it comes after the “Questionnaire” section and is not part of the main body of the chapter).

def filter_chunks(chunks, exclude_headers):
    filtered_chunks = []
    for chunk in chunks:
        lines = chunk.split('\n')
        # Check if the first line (header) is in the exclude list
        if not any(header in lines[0] for header in exclude_headers):
            filtered_chunks.append(chunk)
    return filtered_chunks

exclude_headers = ["Questionnaire", "Further Research"]
filtered_chunks = filter_chunks(chunks, exclude_headers)
[chunk for chunk in filtered_chunks if 'Questionnaire' in chunk]
[]
[chunk for chunk in filtered_chunks if 'Further Research' in chunk]
[]
conn = sqlite3.connect('fastbook.db')
cur = conn.cursor()
res = cur.execute("""

CREATE VIRTUAL TABLE fastbook_text
USING FTS5(text);

""")
for string in filtered_chunks:
  cur.execute(f"INSERT INTO fastbook_text(text) VALUES (?)", (string,))
res = cur.execute("SELECT * from fastbook_text LIMIT 40")
print_wrap_text(res.fetchall()[30][0])
## Who We Are

Sylvain, on the other hand, knows a lot about 
formal technical education. In fact, he has 
written 10 math textbooks, covering the entire 
advanced French maths curriculum!

Evaluating Retrieved Context

Using the top-ranked BM25 result for the given keyword search, I was able to retrieve the context necessary to answer the question 13 out of 33 times, or 40% of the time.
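
For reference, here’s a minimal sketch of that top-1 query (the same shape as the top-3 version shown later in this post, just with LIMIT 1); it assumes df, with the Chapter 1 questions, gold answers, and my keywords, has already been loaded:

results = []

for keywords in df['keywords']:
  if keywords != 'No answer':
    # quote each keyword and OR them together for the FTS5 MATCH query
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 1

    """

    res = cur.execute(q)
    results.append(res.fetchall()[0][0])  # text of the top-ranked chunk
  else:
    # keywords == 'No answer'
    results.append("No answer")

df['fts5_result'] = results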

Looking through these 33 question/context pairs, I noticed some patterns, illustrated by the following examples.

In some cases, like for question #5, the retrieved context answered only half of the question.

df.iloc[4]
4
chapter 1
question_number 5
question_text What were the two theoretical misunderstandings that held back the field of neural networks?
answer In 1969, Marvin Minsky and Seymour Papert demonstrated in their book, “Perceptrons”, that a single layer of artificial neurons cannot learn simple, critical mathematical functions like XOR logic gate. While they subsequently demonstrated in the same book that additional layers can solve this problem, only the first insight was recognized, leading to the start of the first AI winter.\\n\\nIn the 1980’s, models with two layers were being explored. Theoretically, it is possible to approximate any mathematical function using two layers of artificial neurons. However, in practices, these networks were too big and too slow. While it was demonstrated that adding additional layers improved performance, this insight was not acknowledged, and the second AI winter began. In this past decade, with increased data availability, and improvements in computer hardware (both in CPU performance but more importantly in GPU performance), neural networks are finally living up to its potential.
keywords theoretical, misunderstandings, held, back, field, neural network
fts5_result ## Neural Networks: A Brief History\n\nIn the 1980's most models were built with a second layer of neurons, thus avoiding the problem that had been identified by Minsky and Papert (this was their "pattern of connectivity among units," to use the framework above). And indeed, neural networks were widely used during the '80s and '90s for real, practical projects. However, again a misunderstanding of the theoretical issues held back the field. In theory, adding just one extra layer of neurons was enough to allow any mathematical function to be approximated with these neural networks, but in practice such networks were often too big and too slow to be useful.


The retrieved context for question #16 was not wrong per se, but it didn’t answer the question in the same way as the gold standard answer. The “correct” answer to “What do you need in order to train a model?” is from the perspective of the model builder, whereas the retrieved context answers the same question but from the perspective of a business or organization.

df.iloc[15]
15
chapter 1
question_number 16
question_text What do you need in order to train a model?
answer You will need an architecture for the given problem. You will need data to input to your model. For most use-cases of deep learning, you will need labels for your data to compare your model predictions to. You will need a loss function that will quantitatively measure the performance of your model. And you need a way to update the parameters of the model in order to improve its performance (this is known as an optimizer).
keywords train, model, need
fts5_result ### Limitations Inherent To Machine Learning\n\n- A model cannot be created without data.\n- A model can only learn to operate on the patterns seen in the input data used to train it.\n- This learning approach only creates *predictions*, not recommended *actions*.\n- It's not enough to just have examples of input data; we need *labels* for that data too (e.g., pictures of dogs and cats aren't enough to train a model; we need a label for each one, saying which ones are dogs, and which are cats).


In some cases, I was on the fence about the relevancy or accuracy of the retrieved context. I erred on the side of caution and considered the following question/answer/context triple as insufficient context:

df.iloc[24]
24
chapter 1
question_number 25
question_text How can pretrained models help?
answer Pretrained models have been trained on other problems that may be quite similar to the current task. For example, pretrained image recognition models were often trained on the ImageNet dataset, which has 1000 classes focused on a lot of different types of visual objects. Pretrained models are useful because they have already learned how to handle a lot of simple features like edge and color detection. However, since the model was trained for a different task than already used, this model cannot be used as is.
keywords pretrained, model, help
fts5_result ### What Our Image Recognizer Learned\n\nWhen we fine-tuned our pretrained model earlier, we adapted what those last layers focus on (flowers, humans, animals) to specialize on the cats versus dogs problem. More generally, we could specialize such a pretrained model on many different tasks. Let's have a look at some examples.


For a number of questions, the keyword search returned an HTML img tag or a code snippet because it contained the necessary keywords:

df.iloc[2]
2
chapter 1
question_number 3
question_text What was the name of the first device that was based on the principle of the artificial neuron?
answer Mark I perceptron built by Frank Rosenblatt
keywords first, device, artificial, neuron
fts5_result ## Neural Networks: A Brief History\n\n<img alt="Natural and artificial neurons" width="500" caption="Natural and artificial neurons" src="images/chapter7_neuron.png" id="neuron"/>

df.iloc[17]
17
chapter 1
question_number 18
question_text Do we always have to use 224×224-pixel images with the cat recognition model?
answer No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
keywords 224, pixel, image, cat, recognition, model
fts5_result ### How Our Image Recognizer Works\n\n```python\ndls = ImageDataLoaders.from_name_func(\n path, get_image_files(path), valid_pct=0.2, seed=42,\n label_func=is_cat, item_tfms=Resize(224))\n```


In some cases, the retrieved context seemed to be the chunk right before the chunk that would describe the solution in the text. This makes me wonder whether my chunk sizes are too small.

df.iloc[9]
9
chapter 1
question_number 10
question_text Why is it hard to use a traditional computer program to recognize images in a photo?
answer For us humans, it is easy to identify images in a photos, such as identifying cats vs dogs in a photo. This is because, subconsciously our brains have learned which features define a cat or a dog for example. But it is hard to define set rules for a traditional computer program to recognize a cat or a dog. Can you think of a universal rule to determine if a photo contains a cat or dog? How would you encode that as a computer program? This is very difficult because cats, dogs, or other objects, have a wide variety of shapes, textures, colors, and other features, and it is close to impossible to manually encode this in a traditional computer program.
keywords image, recognize, recognition, traditional, computer, program
fts5_result ### What Is Machine Learning?\n\n```python\n#hide_input\n#caption A traditional program\n#id basic_program\n#alt Pipeline inputs, program, results\ngv('''program[shape=box3d width=1 height=0.7]\ninputs->program->results''')\n```

df.iloc[10]
10
chapter 1
question_number 11
question_text What did Samuel mean by "weight assignment"?
answer “weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
keywords Samuel, weight, assignment
fts5_result ### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.

Including More Chunks During Retrieval

Based on the initial keyword search results, some of the retrieved chunks either partially answer the question or only begin to set up the answer. I see two ways to remediate this:

  • Increase the chunk size (i.e. choose a different chunking strategy)
  • Increase the number of chunks selected during keyword search

The second approach seems easier to implement. I like the idea of retrieving multiple small chunks rather than a single large chunk. I can imagine a couple of trade-offs:

  • A few small chunks may not capture information that is spread across a long paragraph/section needed for the LLM to answer the question sufficiently.
  • A single large chunk may include information irrelevant to the question and thus introduce noise into the answer, confusing the LLM.

I’ll retrieve the top 3 BM25-ranked results and then evaluate them.

Show the for-loop code
results = []

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text
    WHERE fastbook_text MATCH '{words}'
    ORDER BY rank
    LIMIT 3

    """

    res = cur.execute(q)
    results.append(res.fetchall())
  else:
    # keywords == 'No answer'
    results.append("No answer")
df['fts5_result2'] = results
df.head(3)
chapter question_number question_text answer keywords fts5_result2
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD [(## Deep Learning Is for Everyone\n\n```ascii...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world [(## Deep Learning Is for Everyone\n\nHere's a...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron [(## Neural Networks: A Brief History\n\n<img ...
top_3_results = df['fts5_result2'].apply(pd.Series)
top_3_results.columns = [f'fts5_result2_{i+1}' for i in range(top_3_results.shape[1])]
for col in ['fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']:
    top_3_results[col] = top_3_results[col].apply(lambda x: x[0] if isinstance(x, tuple) else x)
top_3_results.head()
fts5_result2_1 fts5_result2_2 fts5_result2_3
0 ## Deep Learning Is for Everyone\n\n```asciido... ## How to Learn Deep Learning\n\nPaul Lockhart... ## Who We Are\n\nAll this means that between u...
1 ## Deep Learning Is for Everyone\n\nHere's a l... ## How to Learn Deep Learning\n\n- Teaching th... ## Deep Learning Is for Everyone\n\n```asciido...
2 ## Neural Networks: A Brief History\n\n<img al... ## Neural Networks: A Brief History\n\nRosenbl... ## Neural Networks: A Brief History\n\nIn 1943...
3 ## Neural Networks: A Brief History\n\nIn fact... ## Neural Networks: A Brief History\n\nPerhaps... ## Neural Networks: A Brief History\n\nWe will...
4 ## Neural Networks: A Brief History\n\nIn the ... ### What Is a Neural Network?\n\nHaving zoomed... ## Deep Learning Is for Everyone\n\nBut neural...
df = pd.concat([df, top_3_results], axis=1)
df.to_csv('top-3-retrieval-results.csv', index=False)

Using the top 3 BM25 ranked chunks improved the results! 18 out of the 33 questions, or 54%, are now answerable with the given retrieved context.

Here’s an example of a question that I couldn’t answer with the top-1 result but could answer with the second-ranked result (fts5_result2_2):

df.iloc[19]
19
chapter 1
question_number 20
question_text What is a validation set? What is a test set? Why do we need them?
answer The validation set is the portion of the dataset that is not used for training the model, but for evaluating the model during training, in order to prevent overfitting. This ensures that the model performance is not due to “cheating” or memorization of the dataset, but rather because it learns the appropriate features to use for prediction. However, it is possible that we overfit the validation data as well. This is because the human modeler is also part of the training process, adjusting hyperparameters (see question 32 for definition) and training procedures according to the validation performance. Therefore, another unseen portion of the dataset, the test set, is used for final evaluation of the model. This splitting of the dataset is necessary to ensure that the model generalizes to unseen data.
keywords validation, set, test
fts5_result2 [(## Validation Sets and Test Sets\n\nTo avoid this, our first step was to split our dataset into two sets: the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data that generalize to new data, the validation data., -9.62686734459166), (## Validation Sets and Test Sets\n\nHaving two levels of "reserved data"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible., -9.21394932308204), (### Use Judgment in Defining Test Sets\n\nInstead, use the earlier data as your training set (and the later data for the validation set), as shown in <<timeseries3>>., -9.06398171911968)]
fts5_result2_1 ## Validation Sets and Test Sets\n\nTo avoid this, our first step was to split our dataset into two sets: the *training set* (which our model sees in training) and the *validation set*, also known as the *development set* (which is used only for evaluation). This lets us test that the model learns lessons from the training data that generalize to new data, the validation data.
fts5_result2_2 ## Validation Sets and Test Sets\n\nHaving two levels of "reserved data"—a validation set and a test set, with one level representing data that you are virtually hiding from yourself—may seem a bit extreme. But the reason it is often necessary is because models tend to gravitate toward the simplest way to do good predictions (memorization), and we as fallible humans tend to gravitate toward fooling ourselves about how well our models are performing. The discipline of the test set helps us keep ourselves intellectually honest. That doesn't mean we *always* need a separate test set—if you have very little data, you may need to just have a validation set—but generally it's best to use one if at all possible.
fts5_result2_3 ### Use Judgment in Defining Test Sets\n\nInstead, use the earlier data as your training set (and the later data for the validation set), as shown in <<timeseries3>>.

Increasing Chunk Size

Using 3 chunks of context instead of 1 increased the performance of retrieval from 40% to 54%, meaning that I would be able to answer the Chapter 1 questions with the retrieved context 54% of the time. I’ll call this metric “retrieved context relevancy”.
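
As a quick aside, the metric itself is just a percentage over my manual yes/no judgments; here’s a minimal sketch of the tally (the judgments list below is illustrative, not the actual data):

# Illustrative only: a list of manual yes/no judgments, one per question
judgments = [True] * 18 + [False] * 15  # 18 of 33 contexts judged relevant and sufficient

def retrieved_context_relevancy(judgments):
    # percentage of questions whose retrieved context was relevant and sufficient
    return 100 * sum(judgments) / len(judgments)

print(f"{retrieved_context_relevancy(judgments):.1f}%")  # 54.5%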

I’ll now increase the chunk size and see how that affects performance:

larger_chunks = ["\n".join(filtered_chunks[i:i+3]) for i in range(0, len(filtered_chunks), 3)]

Each larger chunk concatenates three consecutive smaller chunks. Now I’ll create a separate table, fastbook_text_large, in my database to hold them:

cur = conn.cursor()
res = cur.execute("""

CREATE VIRTUAL TABLE fastbook_text_large
USING FTS5(text);

""")
for string in larger_chunks:
  cur.execute(f"INSERT INTO fastbook_text_large(text) VALUES (?)", (string,))

res = cur.execute("SELECT * from fastbook_text_large LIMIT 2")

The outputs (which include Markdown) were messing up my Quarto blog rendering, so I’ve excluded them.

I’ll now iterate through my list of questions, passing the corresponding keywords to the query to conduct a full text search:

Show the for-loop code
results = []

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text_large
    WHERE fastbook_text_large MATCH '{words}'
    ORDER BY rank
    LIMIT 1

    """

    res = cur.execute(q)
    results.append(res.fetchall()[0][0])
  else:
    # keywords == 'No answer'
    results.append("No answer")
large_df = df.drop(['fts5_result2', 'fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3'], axis=1)
large_df['large_chunk_result'] = results
large_df.head()
chapter question_number question_text answer keywords large_chunk_result
0 1 1 Do you need these for deep learning?\\n\\n- Lo... Lots of math - False\\nLots of data - False\\n... math, data, expensive computers, PhD ## Deep Learning Is for Everyone\n\nA lot of p...
1 1 2 Name five areas where deep learning is now the... Any five of the following:\\nNatural Language ... deep learning, state of the art, best, world ## Deep Learning Is for Everyone\n\nA lot of p...
2 1 3 What was the name of the first device that was... Mark I perceptron built by Frank Rosenblatt first, device, artificial, neuron ## Neural Networks: A Brief History\n\nRosenbl...
3 1 4 Based on the book of the same name, what are t... A set of processing units\\nA state of activat... parallel, distributed, processing, requirement... ## Neural Networks: A Brief History\n\n> : Peo...
4 1 5 What were the two theoretical misunderstanding... In 1969, Marvin Minsky and Seymour Papert demo... theoretical, misunderstandings, held, back, fi... ## Neural Networks: A Brief History\n\n1. A se...
large_df.to_csv('large_chunk_results.csv', index=False)

Using larger chunks resulted in a retrieved context relevancy of 54% (18/33). However, these were not the same 18 results as before.
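
To pinpoint which questions flipped, one option is to record the per-question judgments and compare them as sets. A rough sketch, assuming hypothetical boolean columns relevant_top3_small (on df) and relevant_top1_large (on large_df) that hold those manual judgments:

# Hypothetical columns of manual relevance judgments (not part of the CSVs above)
top3_small_ok = set(df.loc[df['relevant_top3_small'], 'question_number'])
top1_large_ok = set(large_df.loc[large_df['relevant_top1_large'], 'question_number'])

print("answerable with both approaches:", sorted(top3_small_ok & top1_large_ok))
print("only with top-3 small chunks:   ", sorted(top3_small_ok - top1_large_ok))
print("only with top-1 large chunk:    ", sorted(top1_large_ok - top3_small_ok))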

For example, here’s a question for which using larger chunks retrieved the right context (whereas retrieving three smaller chunks did not):

large_df.iloc[10][['question_text', 'answer', 'large_chunk_result']]
10
question_text What did Samuel mean by "weight assignment"?
answer “weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
large_chunk_result ### What Is Machine Learning?\n\n- The idea of a "weight assignment" \n- The fact that every weight assignment has some "actual performance"\n- The requirement that there be an "automatic means" of testing that performance, \n- The need for a "mechanism" (i.e., another automatic process) for improving the performance by changing the weight assignments\n### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.\n### What Is Machine Learning?\n\nWeights are just variables, and a weight assignment is a particular choice of values for those variables. The program's inputs are values that it processes in order to produce its results—for instance, taking image pixels as inputs, and returning the classification "dog" as a result. The program's weight assignments are other values that define how the program will operate.

df.iloc[10][['question_text', 'answer', 'fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']]
10
question_text What did Samuel mean by "weight assignment"?
answer “weight assignment” refers to the current values of the model parameters. Arthur Samuel further mentions an “ automatic means of testing the effectiveness of any current weight assignment ” and a “ mechanism for altering the weight assignment so as to maximize the performance ”. This refers to the evaluation and training of the model in order to obtain a set of parameter values that maximizes model performance.
fts5_result2_1 ### What Is Machine Learning?\n\nLet us take these concepts one by one, in order to understand how they fit together in practice. First, we need to understand what Samuel means by a *weight assignment*.
fts5_result2_2 ### What Is Machine Learning?\n\n```python\n#hide_input\n#caption A program using weight assignment\n#id weight_assignment\ngv('''model[shape=box3d width=1 height=0.7]\ninputs->model->results; weights->model''')\n```
fts5_result2_3 ### What Is Machine Learning?\n\n- The idea of a "weight assignment" \n- The fact that every weight assignment has some "actual performance"\n- The requirement that there be an "automatic means" of testing that performance, \n- The need for a "mechanism" (i.e., another automatic process) for improving the performance by changing the weight assignments


Here’s another question where larger chunks resulted in the correct retrieval (whereas 3 smaller chunks did not):

large_df.iloc[17][['question_text', 'answer', 'large_chunk_result']]
17
question_text Do we always have to use 224×224-pixel images with the cat recognition model?
answer No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
large_chunk_result ### How Our Image Recognizer Works\n\nFinally, we define the `Transform`s that we need. A `Transform` contains code that is applied automatically during training; fastai includes many predefined `Transform`s, and adding new ones is as simple as creating a Python function. There are two kinds: `item_tfms` are applied to each item (in this case, each item is resized to a 224-pixel square), while `batch_tfms` are applied to a *batch* of items at a time using the GPU, so they're particularly fast (we'll see many examples of these throughout this book).\n### How Our Image Recognizer Works\n\nWhy 224 pixels? This is the standard size for historical reasons (old pretrained models require this size exactly), but you can pass pretty much anything. If you increase the size, you'll often get a model with better results (since it will be able to focus on more details), but at the price of speed and memory consumption; the opposite is true if you decrease the size.\n### How Our Image Recognizer Works\n\n> Note: Classification and Regression: _classification_ and _regression_ have very specific meanings in machine learning. These are the two main types of model that we will be investigating in this book. A classification model is one which attempts to predict a class, or category. That is, it's predicting from a number of discrete possibilities, such as "dog" or "cat." A regression model is one which attempts to predict one or more numeric quantities, such as a temperature or a location. Sometimes people use the word _regression_ to refer to a particular kind of model called a _linear regression model_; this is a bad practice, and we won't be using that terminology in this book!

df.iloc[17][['question_text', 'answer', 'fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']]
17
question_text Do we always have to use 224×224-pixel images with the cat recognition model?
answer No we do not. 224x224 is commonly used for historical reasons. You can increase the size and get better performance, but at the price of speed and memory consumption.
fts5_result2_1 ### How Our Image Recognizer Works\n\n```python\ndls = ImageDataLoaders.from_name_func(\n path, get_image_files(path), valid_pct=0.2, seed=42,\n label_func=is_cat, item_tfms=Resize(224))\n```
fts5_result2_2 ### Running Your First Notebook\n\n```python\n#id first_training\n#caption Results from the first training\n# CLICK ME\nfrom fastai.vision.all import *\npath = untar_data(URLs.PETS)/'images'\n\ndef is_cat(x): return x[0].isupper()\ndls = ImageDataLoaders.from_name_func(\n path, get_image_files(path), valid_pct=0.2, seed=42,\n label_func=is_cat, item_tfms=Resize(224))\n\nlearn = vision_learner(dls, resnet34, metrics=error_rate)\nlearn.fine_tune(1)\n```
fts5_result2_3 ### How Our Image Recognizer Works\n\nFinally, we define the `Transform`s that we need. A `Transform` contains code that is applied automatically during training; fastai includes many predefined `Transform`s, and adding new ones is as simple as creating a Python function. There are two kinds: `item_tfms` are applied to each item (in this case, each item is resized to a 224-pixel square), while `batch_tfms` are applied to a *batch* of items at a time using the GPU, so they're particularly fast (we'll see many examples of these throughout this book).


Even after reviewing each of the question/answer/context triplets for the three approaches, I still don’t have a strong intuition for what works best. I’m hoping that after I’ve done this exercise for all eight chapters, I’ll have built some of that intuition.

Including More Large Chunks During Retrieval

The final experiment I’ll run is retrieving the top 3 BM25-ranked large chunks for each question. Using 3 small chunks and using 1 large chunk both resulted in a retrieved context relevancy of 54%, but they answered different sets of 18 questions. Perhaps if I combine both approaches (use the larger chunk size AND retrieve the top 3 BM25-ranked results) I’ll obtain a higher retrieved context relevancy.

Show the for-loop code
results = []

for keywords in df['keywords']:
  if keywords != 'No answer':
    words = ' OR '.join([f'"{word.strip(",")}"' for word in keywords.split()])
    q = f"""

    SELECT *, rank
      from fastbook_text_large
    WHERE fastbook_text_large MATCH '{words}'
    ORDER BY rank
    LIMIT 3

    """

    res = cur.execute(q)
    results.append(res.fetchall())
  else:
    # keywords == 'No answer'
    results.append("No answer")
top_3_large = large_df[['chapter', 'question_number', 'question_text', 'answer']].copy()

top_3_large['result'] = results

top_3 = top_3_large['result'].apply(pd.Series)
top_3.columns = [f'result_{i+1}' for i in range(top_3.shape[1])]

for col in ['result_1', 'result_2', 'result_3']:
    top_3[col] = top_3[col].apply(lambda x: x[0] if isinstance(x, tuple) else x)

top_3_large = pd.concat([top_3_large, top_3], axis=1)
top_3_large.to_csv('top-3-large-retrieval-results.csv', index=False)

The combined approach (retrieving more, larger chunks) resulted in a retrieved context relevancy of 72%! That’s an increase of 18 percentage points over the previous two approaches (retrieving 3 small chunks, or retrieving 1 large chunk). However, I’m concerned that the larger amount of irrelevant text included may distract the model from answering the question correctly and concisely, something I’ll have to rigorously experiment with once I add an LLM to the pipeline.

Here is an example question for which the combined approach provided relevant context (whereas the previous two methods did not):

For the question:

Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?

Using the top 3 BM25-ranked small chunks did not provide enough context to answer the question:

top_3_small = pd.read_csv("/content/top-3-retrieval-results.csv")
top_1_large = pd.read_csv("/content/large_chunk_results.csv")
# not relevant
top_3_small.iloc[3][['question_text', 'fts5_result2_1', 'fts5_result2_2', 'fts5_result2_3']]
3
question_text Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
fts5_result2_1 ## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:
fts5_result2_2 ## Neural Networks: A Brief History\n\nPerhaps the most pivotal work in neural networks in the last 50 years was the multi-volume *Parallel Distributed Processing* (PDP) by David Rumelhart, James McClellan, and the PDP Research Group, released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:
fts5_result2_3 ## Neural Networks: A Brief History\n\nWe will see in this book that modern neural networks handle each of these requirements.


Neither did using the top-1 larger chunk:

# not relevant
top_1_large.iloc[3][['question_text', 'large_chunk_result']]
3
question_text Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
large_chunk_result ## Neural Networks: A Brief History\n\n> : People are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. ...We will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain.\n## Neural Networks: A Brief History\n\nThe premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had been (at that point) so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claimed that the PDP approach was "closer \nthan other frameworks" to how the brain works, and therefore it might be better able to handle these kinds of tasks.\n## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:


However, the top-3 larger chunks included the necessary context across the first- and second-highest-ranked chunks:

top_3_large.iloc[3][['question_text', 'result_1', 'result_2']]
3
question_text Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
result_1 ## Neural Networks: A Brief History\n\n> : People are smarter than today's computers because the brain employs a basic computational architecture that is more suited to deal with a central aspect of the natural information processing tasks that people are so good at. ...We will introduce a computational framework for modeling cognitive processes that seems… closer than other frameworks to the style of computation as it might be done by the brain.\n## Neural Networks: A Brief History\n\nThe premise that PDP is using here is that traditional computer programs work very differently to brains, and that might be why computer programs had been (at that point) so bad at doing things that brains find easy (such as recognizing objects in pictures). The authors claimed that the PDP approach was "closer \nthan other frameworks" to how the brain works, and therefore it might be better able to handle these kinds of tasks.\n## Neural Networks: A Brief History\n\nIn fact, the approach laid out in PDP is very similar to the approach used in today's neural networks. The book defined parallel distributed processing as requiring:
result_2 ## Neural Networks: A Brief History\n\nRosenblatt further developed the artificial neuron to give it the ability to learn. Even more importantly, he worked on building the first device that actually used these principles, the Mark I Perceptron. In "The Design of an Intelligent Automaton" Rosenblatt wrote about this work: "We are now about to witness the birth of such a machine–-a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control." The perceptron was built, and was able to successfully recognize simple shapes.\n## Neural Networks: A Brief History\n\nAn MIT professor named Marvin Minsky (who was a grade behind Rosenblatt at the same high school!), along with Seymour Papert, wrote a book called _Perceptrons_ (MIT Press), about Rosenblatt's invention. They showed that a single layer of these devices was unable to learn some simple but critical mathematical functions (such as XOR). In the same book, they also showed that using multiple layers of the devices would allow these limitations to be addressed. Unfortunately, only the first of these insights was widely recognized. As a result, the global academic community nearly entirely gave up on neural networks for the next two decades.\n## Neural Networks: A Brief History\n\nPerhaps the most pivotal work in neural networks in the last 50 years was the multi-volume *Parallel Distributed Processing* (PDP) by David Rumelhart, James McClellan, and the PDP Research Group, released in 1986 by MIT Press. Chapter 1 lays out a similar hope to that shown by Rosenblatt:


Here is a summary of the results of this notebook’s experiments:

Top BM25 Ranked Chunks Retrieved Chunk Size Retrieved Context Relevancy*
top-3 Large 72%
top-1 Large 54%
top-3 Small 54%
top-1 Small 40%

*Retrieved Context Relevancy: the percentage of questions for which the retrieved context was relevant and sufficient for me to answer the question.

Final Thoughts

The experiments in this notebook are promising: using BM25 to retrieve the context necessary to answer Chapter 1 Questionnaire questions works 72% of the time. Of course, I still had to interpret the retrieved chunks to extract the answer, but that’s something that can easily be done with an LLM down the road. In my next notebook, I’ll use cosine similarity between the embeddings of the question text and the embeddings of the chunks, and see how that compares to BM25. In the notebook after that, I’ll combine both and see how a hybrid approach performs.
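
As a rough preview of that next step, here’s a sketch of what the cosine-similarity retrieval might look like (the embedding model here is a placeholder assumption, not a final choice):

# Sketch only: the embedding model and top-k value are placeholder assumptions
from sentence_transformers import SentenceTransformer
import numpy as np

emb_model = SentenceTransformer('all-MiniLM-L6-v2')  # hypothetical embedding model

# Embed every chunk once; normalized embeddings make cosine similarity a dot product
chunk_embs = emb_model.encode(filtered_chunks, normalize_embeddings=True)

def top_k_by_cosine(question, k=3):
    q_emb = emb_model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_embs @ q_emb
    return [filtered_chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(top_k_by_cosine("What is a validation set?")[0][:200])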

Something else I’ll experiment with is the list of keywords that I came up with, as they are critical to the performance of full text search.

Once I’ve established a baseline that I’m confident in, I’ll start introducing an LLM into the pipeline: first to generate keywords from the question for use in full text search, and then to extract the answer from the retrieved context.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.