Evaluating 4 Retrieval Methods with 6 Chunking Strategies on my fastbook-benchmark Dataset

python
fastbookRAG
information retrieval
In this blog post, I perform retrieval on the fastbook chapter documents using 24 different retrieval method-chunking strategy combinations, auto-scoring the results with my fastbook-benchmark dataset.
Author

Vishal Bakshi

Published

November 26, 2024

Background

In this notebook I re-run my full text search and semantic search retrieval methods for fastbook questions and use my newly curated fastbook-benchmark dataset to calculate Answer Component MRR@10 and Answer Component Recall@10 retrieval metrics. Due to how my benchmark dataset is structured, I had to modify (with Claude’s help) the classic MRR and Recall functions:

  • Answer Component MRR@10: the reciprocal of the rank of the last passage needed to satisfy all answer_components for the question. So, if a question has 4 answer_components and their relevant contexts are spread across the first 5 retrieved passages (the last component is first satisfied at rank 5), MRR would be 1/5 = 0.2.

  • Answer Component Recall@10: the proportion of answer_components for which at least one supporting context was retrieved. Using the same example, if the top-10 passages only contain contexts relevant to 2 of the 4 answer_components, Recall would be 2/4 = 0.5. (Both calculations are written out in the short sketch below.)
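
To make that arithmetic concrete, here are the two example calculations written out (toy numbers from the example above, not notebook output):

# 4 answer_components; the last one is first satisfied by the passage at rank 5
mrr_at_10 = 1 / 5       # 0.2
# only 2 of the 4 answer_components have a supporting context in the top 10
recall_at_10 = 2 / 4    # 0.5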

See the section below for how my benchmark dataset is structured.

There are four retrieval methods I implement in this notebook:

  • Full text search (using sqlite and Claude-generated keywords)
  • Single-vector cosine similarity (using BAAI/bge-small-en-v1.5)
  • ColBERTv2
  • answerai-colbert-small-v1

There are six chunking strategies I implement:

Chunking Strategy Name   Description
A                        1-paragraph (w/headers)
B                        3-paragraph (w/headers)
C                        1-paragraph (w/o headers)
D                        3-paragraph (w/o headers)
E                        3-paragraph (w/headers, w/o HTML tags)
F                        3-paragraph (w/headers, w/o HTML tags, w/o punctuation)

Here are the results from this notebook:

Answer Component MRR@10

Retrieval Method                  A      B      C      D      E      F
Full text search                  0.30   0.46   0.29   0.44   0.46   0.46
Single-vector cosine similarity   0.38   0.50   0.35   0.46   0.50   0.49
ColBERTv2                         0.46   0.49   0.41   0.50   0.49   0.44
answerai-colbert-small-v1         0.48   0.52   0.45   0.52   0.52   0.45


Answer Component Recall@10

Retrieval Method                  A     B     C     D     E     F
Full text search                  65%   83%   65%   82%   83%   83%
Single-vector cosine similarity   71%   85%   72%   82%   87%   86%
ColBERTv2                         80%   80%   74%   80%   81%   71%
answerai-colbert-small-v1         82%   84%   77%   82%   84%   73%

The best-performing retrieval method and chunking strategies:

Metric Name                  Retrieval Method                  Chunking Strategies   Metric Value
Answer Component MRR@10      answerai-colbert-small-v1         B, D, E               0.52
Answer Component Recall@10   Single-vector cosine similarity   E                     87%

The fastbook-benchmark Dataset

The fastbook-benchmark dataset contains a list of items (questions). Each item looks something like this:

{
            "chapter": 1,
            "question_number": 1,
            "question_text": "Do you need these for deep learning?\n\n- Lots of math..."",
            "answer_context": [
                {
                    "answer_component": "\"Lots of math..."",
                    "scoring_type": "simple",
                    "context": [
                        "...Lots of math..."
                    ],
                    "explicit_context": "true",
                    "extraneous_answer": "false"
                }
            ],
            "question_context": []
        },

I have broken down each gold_standard_answer into separate answer_components, each of which is associated with one or more contexts from the chapter text that address it. Here’s an example of a question with two answer_components:

{
            "chapter": 1,
            "question_number": 5,
            "question_text": "What were the two theoretical misunderstandings that held back the field of neural networks?",
            "gold_standard_answer": "\"In 1969..."",
            "answer_context": [
                {
                    "answer_component": "\"In 1969..."",
                    "scoring_type": "simple",
                    "context": [
                        "An MIT professor named..."
                    ],
                    "explicit_context": "true",
                    "extraneous_answer": "false"
                },
                {
                    "answer_component": "\"\n\nIn the 1980's..."",
                    "scoring_type": "simple",
                    "context": [
                        "In the 1980's..."
                    ],
                    "explicit_context": "true",
                    "extraneous_answer": "false"
                }
            ],
            "question_context": []
        },

Any one of the contexts is sufficient to address the associated answer_component. For example, for Chapter 1 Question 12:

{
  "answer_component": "We instead use the term parameters.",
  "scoring_type": "simple",
  "context": [
      "By the way, what Samuel called \"weights\" are most generally referred to as model *parameters* these days",

      "The *weights* are called *parameters*"
  ],
  "explicit_context": "true",
  "extraneous_answer": "false"
}

For some questions’ gold_standard_answer I found that some answer_components were extraneous to the goal of the question. These have been marked with the flag "extraneous_answer": "true".

For some answer_components the corresponding context only implicitly addresses them. These have been marked with the flag "explicit_context": "false".

Finally, for some answer_components I did not find relevant context in the given chapter, so the context field is assigned an empty list []. (The short sketch below tallies how often each of these cases occurs.)
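
To make these flags concrete, here’s a small sketch that tallies each case. It assumes the dataset has already been loaded as benchmark (as in the Setup section below); the variable names are mine:

extraneous = implicit = missing = 0
for q in benchmark["questions"]:
    for ac in q["answer_context"]:
        if ac.get("extraneous_answer") == "true":  # component extraneous to the question's goal
            extraneous += 1
        if ac.get("explicit_context") == "false":  # context only implicitly addresses the component
            implicit += 1
        if not ac.get("context"):                  # no relevant context found in the chapter
            missing += 1
print(extraneous, implicit, missing)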

All of the design decisions for this fastbook-benchmark dataset have largely been driven by one goal: don’t change the gold_standard_answer. I have been using the fastai Forums’ Wiki solutions page for each chapter as the gold standard answer set (example).

Setup

!pip install sentence-transformers -Uqq
!pip install -qq RAGatouille
!pip install ftfy -qq

Show imports
import sqlite3
import json
import re
import os
import pandas as pd, numpy as np
import requests
import torch
import torch.nn.functional as F
from ftfy import fix_text
from sentence_transformers import SentenceTransformer
from ragatouille import RAGPretrainedModel
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
Show chunking code
def get_chunks(notebook_path):
    with open(notebook_path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)

    chunks = []
    current_header = ""

    def add_chunk(content):
        if content.strip():
            chunks.append(f"{current_header}\n\n{content.strip()}")

    for cell in notebook['cells']:
        if cell['cell_type'] == 'markdown':
            content = ''.join(cell['source'])
            # see if the cell starts with a markdown header
            header_match = re.match(r'^(#+\s+.*?)$', content, re.MULTILINE)
            if header_match:
                # grab the header
                current_header = header_match.group(1)
                # add any content after the header in the same cell
                remaining_content = content[len(current_header):].strip()
                if remaining_content:
                    # split content into paragraphs
                    paragraphs = re.split(r'\n\s*\n', remaining_content)
                    # append the paragraph to the list of chunks
                    for paragraph in paragraphs:
                        add_chunk(paragraph)
            else:
                # split content into paragraphs
                paragraphs = re.split(r'\n\s*\n', content)
                # append the paragraph to the list of chunks
                for paragraph in paragraphs:
                    add_chunk(paragraph)
        elif cell['cell_type'] == 'code':
          code_content = '```python\n' + ''.join(cell['source']) + '\n```'

          # include the output of the code cell
          output_content = ''
          if 'outputs' in cell and cell['outputs']:
              for output in cell['outputs']:
                  if 'text' in output:
                      output_content += ''.join(output['text'])
                  elif 'data' in output and 'text/plain' in output['data']:
                      output_content += ''.join(output['data']['text/plain'])

          # combine code and output in the same chunk
          combined_content = code_content + '\n\nOutput:\n' + output_content if output_content else code_content
          add_chunk(combined_content)

    def filter_chunks(chunks, exclude_headers=["Questionnaire", "Further Research"]):
      filtered_chunks = []
      for chunk in chunks:
          lines = chunk.split('\n')
          # check if the first line (header) is in the exclude list
          if not any(header in lines[0] for header in exclude_headers):
              filtered_chunks.append(chunk)
      return filtered_chunks

    return filter_chunks(chunks)
Show chunking code
def combine_chunks(chunks, num_p=3):
    combined_chunks = []
    current_header = None
    current_group = []

    for chunk in chunks:
        # Extract header from chunk
        header = chunk.split('\n\n')[0]

        if header != current_header:
            if len(current_group) > 1:  # Only add if group has content besides header
                # Add current group to combined chunks if header changes
                combined_chunks.append('\n\n'.join(current_group))
            # Update current header
            current_header = header
            # Start new group with header and content of current chunk
            current_group = [header, chunk.split('\n\n', 1)[1] if len(chunk.split('\n\n')) > 1 else '']
        else:
            if len(current_group) < num_p + 1:  # +1 to account for header
                # Add chunk content (without header) to current group
                current_group.append(chunk.split('\n\n', 1)[1] if len(chunk.split('\n\n')) > 1 else '')

            if len(current_group) == num_p + 1:  # +1 to account for header
                # Add full group to combined chunks
                combined_chunks.append('\n\n'.join(current_group))
                # Reset current group, keeping the header
                current_group = [current_header]

    if len(current_group) > 1:  # Only add if group has content besides header
        # Add any remaining group to combined chunks
        combined_chunks.append('\n\n'.join(current_group))

    return combined_chunks
Show the load_data function used for full text search
def load_data(chunks, db_path, chapter=1):
    try:
        # create virtual table if database doesn't exist
        if not os.path.exists(db_path):
            with sqlite3.connect(db_path) as conn:
              cur = conn.cursor()
              cur.execute("""
              CREATE VIRTUAL TABLE fastbook_text
              USING FTS5(chapter, text);
              """)
              conn.commit()

        # load in the chunks for each chapter
        with sqlite3.connect(db_path) as conn:
            cur = conn.cursor()

            for chunk in chunks:
                cur.execute("INSERT INTO fastbook_text(chapter, text) VALUES (?, ?)", (chapter, chunk))

            conn.commit()
            res = cur.execute("SELECT * FROM fastbook_text WHERE chapter = ?", (chapter,)).fetchall()
        # make sure all the data was loaded into the database
        if len(res) != len(chunks):
            raise ValueError(f"Number of inserted chunks ({len(res)}) doesn't match input chunks ({len(chunks)})")

        return True

    except sqlite3.Error as e:
        print(f"An error occurred: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return False
Show the db_search function used for full text search
def db_search(df, limit=1):
  results = []
  with sqlite3.connect('fastbook.db') as conn:
    cur = conn.cursor()
    # concatenate the keywords into a string "keyword1 OR keyword2 OR keyword3 ..."
    for _, row in df.iterrows():
      keywords = ' OR '.join([f'"{keyword.strip(",")}"' for keyword in row['keywords'].replace('"', '').split()])

      q = f"""
        SELECT text, rank
        FROM fastbook_text
        WHERE fastbook_text MATCH ?
        AND chapter = ?
        ORDER BY rank
        LIMIT ?
        """
      res = cur.execute(q, (keywords, str(row['chapter']), limit)).fetchall()
      # grab the retrieved chunk from the query results
      res = [item[0] for item in res]

      # collect the retrieved chunks for this question
      results.append(res)

    return results
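
To make the keyword-to-query transformation concrete, here’s what that join produces for a hypothetical keywords value formatted like the keywords column loaded below (illustrative only, not part of the retrieval pipeline):

keywords = '"deep learning, math, data, computers, PhD"'
query = ' OR '.join([f'"{keyword.strip(",")}"' for keyword in keywords.replace('"', '').split()])
print(query)  # "deep" OR "learning" OR "math" OR "data" OR "computers" OR "PhD"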
Download chapter ipynb files
urls = {
    '01_intro.ipynb': 'https://drive.google.com/uc?export=view&id=1mmBjFH_plndPBC4iRZHChfMazgBxKK4_',
    '02_production.ipynb': 'https://drive.google.com/uc?export=view&id=1Cf5QHthHy1z13H0iu3qrzAWgquCfqVHk',
    '04_mnist_basics.ipynb': 'https://drive.google.com/uc?export=view&id=113909_BNulzyLIKUNJHdya0Hhoqie30I',
    '08_collab.ipynb': 'https://drive.google.com/uc?export=view&id=1BtvStgFjUtvtqbSZNrL7Y2N-ey3seNZU',
    '09_tabular.ipynb': 'https://drive.google.com/uc?export=view&id=1rHFvwl_l-AJLg_auPjBpNrOgG9HDnfqg',
    '10_nlp.ipynb': 'https://drive.google.com/uc?export=view&id=1pg1pH7jMMElzrXS0kBBz14aAuDsi2DEP',
    '13_convolutions.ipynb': 'https://drive.google.com/uc?export=view&id=19P-eEHpAO3WrOvdxgXckyhHhfv_R-hnS'
}

def download_file(url, filename):
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in write-binary mode
        with open(filename, 'wb') as file:
            # Write the content of the response to the file
            file.write(response.content)
        print(f"File downloaded successfully: {filename}")
    else:
        print(f"Failed to download file. Status code: {response.status_code}")

for fname, url in urls.items():
  download_file(url, fname)
File downloaded successfully: 01_intro.ipynb
File downloaded successfully: 02_production.ipynb
File downloaded successfully: 04_mnist_basics.ipynb
File downloaded successfully: 08_collab.ipynb
File downloaded successfully: 09_tabular.ipynb
File downloaded successfully: 10_nlp.ipynb
File downloaded successfully: 13_convolutions.ipynb
Show the dict w/ notebook filenames
nbs = {
    '1': '01_intro.ipynb',
    '2': '02_production.ipynb',
    '4': '04_mnist_basics.ipynb',
    '8': '08_collab.ipynb',
    '9': '09_tabular.ipynb',
    '10': '10_nlp.ipynb',
    '13': '13_convolutions.ipynb'
}
# load the question texts
url = 'https://gist.githubusercontent.com/vishalbakshi/2c22ca69ac7bc4bc845052c1b9d949c8/raw/d498259f2fc75d27c485ddc73933f145987feef3/cs_bm25_baselines.csv'
questions = pd.read_csv(url).query("is_answerable == 1")[["chapter", "question_number", "question_text", "answer", "keywords"]]

# remove double quotations from the question text
# as these affect embeddings/cosine similarity: https://vishalbakshi.github.io/blog/posts/2024-11-08-punctuation-cosine-similarity/
questions['question_text'] = questions['question_text'].str.strip('"\'')
questions.head()
chapter question_number question_text answer keywords
0 1 1 Do you need these for deep learning?\n\n- Lots... "Lots of math - False\nLots of data - False\nL... "deep learning, math, data, computers, PhD"
1 1 2 Name five areas where deep learning is now the... "Any five of the following:\nNatural Language ... deep learning, areas, best, world
2 1 3 What was the name of the first device that was... "Mark I perceptron built by Frank Rosenblatt" "neuron, neurons, device, artificial, principle"
3 1 4 Based on the book of the same name, what are t... "A set of processing units\nA state of activat... "parallel, distributed, processing, PDP, requi...
4 1 5 What were the two theoretical misunderstanding... "In 1969, Marvin Minsky and Seymour Papert dem... "neural, networks, theoretical, misunderstandi...
assert questions.shape == (191,5)
# download fastbook-benchmark
download_file(
    "https://gist.githubusercontent.com/vishalbakshi/a507b6e9e893475e93a4141e96b8947d/raw/e32835ba1dbf94384943ed5a65404112e1c89df2/fastbook-benchmark.json",
    "fastbook-benchmark.json"
    )

# Load the benchmark data
with open('fastbook-benchmark.json', 'r') as f:
    benchmark = json.load(f)
File downloaded successfully: fastbook-benchmark.json
assert len(benchmark['questions']) == 191
Show calculate_mrr function
def calculate_mrr(question, retrieved_passages, cutoff=10):
    retrieved_passages = retrieved_passages[:cutoff]
    highest_rank = 0

    for ans_comp in question["answer_context"]:
        contexts = ans_comp.get("context", [])
        component_found = False

        for rank, passage in enumerate(retrieved_passages, start=1):
            if any(fix_text(context) in fix_text(passage) for context in contexts):
                highest_rank = max(highest_rank, rank)
                component_found = True
                break

        if not component_found:
            return 0.0

    return 1.0/highest_rank if highest_rank > 0 else 0.0
Show calculate_recall function
def calculate_recall(question, retrieved_passages, cutoff=10):
    retrieved_passages = retrieved_passages[:cutoff]

    # Track if we've found at least one context for each answer component
    ans_comp_found = []

    for ans_comp in question["answer_context"]:
        contexts = ans_comp.get("context", [])
        found = False

        # Check if any context for this answer component appears in retrieved passages
        for passage in retrieved_passages:
            if any(fix_text(context) in fix_text(passage) for context in contexts):
                found = True
                break

        ans_comp_found.append(found)

    # Recall is ratio of answer components with at least one found context
    return sum(ans_comp_found) / len(ans_comp_found)
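
As a quick sanity check on the two scoring functions, here’s a toy example (a made-up question dict and passages, not from the benchmark):

toy_question = {
    "answer_context": [
        {"context": ["stochastic gradient descent"]},
        {"context": ["a loss function measures performance"]}
    ]
}
toy_passages = [
    "An optimizer such as stochastic gradient descent updates the weights.",
    "An unrelated passage about data augmentation."
]

# the first component is satisfied at rank 1 but the second is never found, so
# MRR is 0.0 (every component must be covered) while Recall is 1/2 = 0.5
print(calculate_mrr(toy_question, toy_passages))     # 0.0
print(calculate_recall(toy_question, toy_passages))  # 0.5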
Show fts_retrieval function
def fts_retrieval(data, questions):
    if os.path.exists("fastbook.db"):
        os.remove("fastbook.db")

    for chapter, chunks in data.items():
      print(f"Chapter {chapter}:", load_data(chunks, 'fastbook.db', chapter))

    print("Retrieving passages...")
    results = db_search(questions, limit=10)

    assert len(results) == 191
    for res in results:
        assert len(res) <= 10

    print("Retrieval complete.")
    return results
Show single_vector_retrieval function
def single_vector_retrieval(data, benchmark):
    # Group questions by chapter
    questions = {}
    for q in benchmark["questions"]:
        chapter = str(q["chapter"])
        if chapter not in questions:
            questions[chapter] = []
        questions[chapter].append(q['question_text'].strip('"\''))

    q_embs = {}
    print("Encoding Questions...")
    for chapter, _ in data.items():
        qs = questions[chapter]
        q_embs[chapter] = emb_model.encode(qs, convert_to_tensor=True)

    data_embs = {}
    print("Encoding Data...")
    for chapter, chunks in data.items():
        data_embs[chapter] = emb_model.encode(chunks, convert_to_tensor=True)

    results = []
    print("Retrieving passages...")
    for chapter in ['1', '2', '4', '8', '9', '10', '13']:
        # Compute cosine similarity and get top 10 indices for each row
        idxs = F.cosine_similarity(q_embs[chapter].unsqueeze(1), data_embs[chapter].unsqueeze(0), dim=2).sort(descending=True)[1]
        top_10_idxs = idxs[:, :10]  # Get the top 10 indices for each row

        # Extract top 10 chunks for each row
        top_10_chunks = [
            [data[chapter][idx.item()] for idx in row_idxs]
            for row_idxs in top_10_idxs
        ]
        results.extend(top_10_chunks)

    assert len(results) == 191

    for res in results:
        assert len(res) <= 10

    print("Retrieval complete.")
    return results
Show ragatouille_retrieval function
def ragatouille_retrieval(data, benchmark, model_nm="colbert-ir/colbertv2.0"):
    # Group questions by chapter
    questions_by_chapter = {}
    for q in benchmark["questions"]:
        chapter = str(q["chapter"])
        if chapter not in questions_by_chapter:
            questions_by_chapter[chapter] = []
        questions_by_chapter[chapter].append(q)

    # Dictionary to store results per chapter
    chapter_results = {}
    chapter_metrics = {}

    # Initialize the pretrained retrieval model (ColBERTv2 or answerai-colbert-small-v1)
    RAG = RAGPretrainedModel.from_pretrained(model_nm)

    # Process each chapter separately
    for chapter in nbs.keys():
        print(f"\nProcessing Chapter {chapter}")

        # Create chapter-specific index
        index_path = RAG.index(
            index_name=f"chapter_{chapter}_index",
            collection=data[chapter],
            document_ids=[f"{chapter}_{i}" for i in range(len(data[chapter]))]
        )

        # Get questions for this chapter
        chapter_questions = questions_by_chapter[chapter]

        # Perform retrieval for each question in this chapter
        results = []
        for q in chapter_questions:
            retrieved = RAG.search(q["question_text"].strip('"\''), k=10)
            results.append(retrieved)

        # Store results
        chapter_results[chapter] = results

    results = []
    for chapter, res in chapter_results.items():
        results.extend(res)

    assert len(results) == 191

    final_results = []
    for res in results:
        assert len(res) <= 10
        intermediate_results = [r['content'] for r in res]
        final_results.append(intermediate_results)

    print("Retrieval complete.")
    return final_results
Show do_retrieval function
def do_retrieval(method, chunking_strategy, data, benchmark, benchmark_results, questions=None):
  if method == "bm25": results = fts_retrieval(data, questions)
  if method == "single_vector": results = single_vector_retrieval(data, benchmark)
  if method == "colbertv2": results = ragatouille_retrieval(data, benchmark, model_nm="colbert-ir/colbertv2.0")
  if method == "answerai_colbert": results = ragatouille_retrieval(data, benchmark, model_nm="answerdotai/answerai-colbert-small-v1")

  name = f"{method}_{chunking_strategy}"
  q_mrr, q_recall = score_retrieval(results, benchmark)
  benchmark_results = save_results(results, benchmark_results, q_mrr, q_recall, name=name)

  return benchmark_results
Show score_retrieval function
def score_retrieval(results, benchmark):
    q_mrr = []
    q_recall = []

    for i, question in enumerate(benchmark["questions"]):
        mrr = calculate_mrr(question, results[i], cutoff=10)
        recall = calculate_recall(question, results[i], cutoff=10)
        q_mrr.append(mrr)
        q_recall.append(recall)

    assert len(q_mrr) == 191
    assert len(q_recall) == 191

    return q_mrr, q_recall
Show save_results function
def save_results(results, df, q_mrr, q_recall, name):
    flat_results = []
    for res in results:
        flat_results.append("\n\n".join(res))

    assert len(flat_results) == 191

    df[f'{name}_retrieval'] = flat_results
    df[f'{name}_mrr10'] = q_mrr
    df[f'{name}_recall10'] = q_recall

    return df
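
One detail not shown in the folded code above: benchmark_results must exist before the first do_retrieval call. Presumably it starts out as an empty DataFrame that save_results then grows by one set of retrieval/MRR/Recall columns per method-strategy combination; a minimal sketch of that assumption:

# assumed initialization; pandas creates the 191-row index when the first
# *_retrieval column is assigned inside save_results
benchmark_results = pd.DataFrame()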

Chunking Strategy A: 1-Paragraph (with headers)

# chunking each notebook
data = {}

for chapter, nb in nbs.items():
  data[chapter] = get_chunks(nb)

total_chunks = 0
for chapter, chunks in data.items():
  print(chapter, len(chunks))
  total_chunks += len(chunks)

assert total_chunks == 1967 # 1-paragraph chunks
1 307
2 227
4 433
8 157
9 387
10 190
13 266

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="A",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_A_mrr10'].mean(),2), round(benchmark_results['single_vector_A_recall10'].mean(),2)
(0.38, 0.71)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="A",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_A_mrr10'].mean(),2), round(benchmark_results['colbertv2_A_recall10'].mean(),2)
(0.46, 0.8)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="A",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_A_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_A_recall10'].mean(),2)
(0.48, 0.82)

Chunking Strategy B: 3-Paragraph (with headers)

Next, I’ll expand the chunks to include 3 paragraphs at a time. I’m still keeping the headers.

for chapter, chunks in data.items():
  data[chapter] = combine_chunks(chunks, num_p=3)

total_chunks = 0

for chapter, chunks in data.items():
  print(chapter, len(chunks))
  total_chunks += len(chunks)

assert total_chunks == 713
1 112
2 84
4 152
8 58
9 141
10 70
13 96

Retrieval Method: Full Text Search

benchmark_results = do_retrieval(
    method="bm25",
    chunking_strategy="B",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results,
    questions=questions)
Chapter 1: True
Chapter 2: True
Chapter 4: True
Chapter 8: True
Chapter 9: True
Chapter 10: True
Chapter 13: True
Retrieving passages...
Retrieval complete.
round(benchmark_results['bm25_B_mrr10'].mean(),2), round(benchmark_results['bm25_B_recall10'].mean(),2)
(0.46, 0.83)

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="B",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_B_mrr10'].mean(),2), round(benchmark_results['single_vector_B_recall10'].mean(),2)
(0.5, 0.85)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="B",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_B_mrr10'].mean(),2), round(benchmark_results['colbertv2_B_recall10'].mean(),2)
(0.49, 0.8)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="B",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_B_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_B_recall10'].mean(),2)
(0.52, 0.84)

Chunking Strategy C: 1-Paragraph (w/o headers)

Next, I’ll remove markdown headers from each chunk.

# chunking each notebook
data = {}

for chapter, nb in nbs.items():
    data[chapter] = get_chunks(nb)

for chapter, chunks in data.items():
    data[chapter] = [re.sub(r'^#+\s+[^\n]+\n*', '', c) for c in data[chapter]]

total_chunks = 0
for chapter, chunks in data.items():
    print(chapter, len(chunks))
    total_chunks += len(chunks)

assert total_chunks == 1967 # 1-paragraph chunks
1 307
2 227
4 433
8 157
9 387
10 190
13 266

Retrieval Method: Full Text Search

benchmark_results = do_retrieval(
    method="bm25",
    chunking_strategy="C",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results,
    questions=questions)
Chapter 1: True
Chapter 2: True
Chapter 4: True
Chapter 8: True
Chapter 9: True
Chapter 10: True
Chapter 13: True
Retrieving passages...
Retrieval complete.
round(benchmark_results['bm25_C_mrr10'].mean(),2), round(benchmark_results['bm25_C_recall10'].mean(),2)
(0.29, 0.65)

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="C",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_C_mrr10'].mean(),2), round(benchmark_results['single_vector_C_recall10'].mean(),2)
(0.35, 0.72)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="C",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_C_mrr10'].mean(),2), round(benchmark_results['colbertv2_C_recall10'].mean(),2)
(0.41, 0.74)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="C",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_C_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_C_recall10'].mean(),2)
(0.45, 0.77)

Chunking Strategy D: 3-Paragraph (w/o headers)

Expanding header-less chunks to 3 paragraphs.

Show modified combine_chunks function
def combine_chunks2(chunks, num_p=3):
    """
    Combines text chunks into groups of specified size (num_p).
    If chunks have no headers, treats them as standalone content.
    """
    combined_chunks = []
    current_group = []

    for chunk in chunks:
        if len(current_group) < num_p:
            current_group.append(chunk)

        if len(current_group) == num_p:
            combined_chunks.append('\n\n'.join(current_group))
            current_group = []

    # Add any remaining chunks
    if current_group:
        combined_chunks.append('\n\n'.join(current_group))

    return combined_chunks
for chapter, chunks in data.items():
  data[chapter] = combine_chunks2(chunks, num_p=3)

total_chunks = 0

for chapter, chunks in data.items():
  print(chapter, len(chunks))
  total_chunks += len(chunks)

assert total_chunks == 659
1 103
2 76
4 145
8 53
9 129
10 64
13 89

Retrieval Method: Full Text Search

benchmark_results = do_retrieval(
    method="bm25",
    chunking_strategy="D",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results,
    questions=questions)
Chapter 1: True
Chapter 2: True
Chapter 4: True
Chapter 8: True
Chapter 9: True
Chapter 10: True
Chapter 13: True
Retrieving passages...
Retrieval complete.
round(benchmark_results['bm25_D_mrr10'].mean(),2), round(benchmark_results['bm25_D_recall10'].mean(),2)
(0.44, 0.82)

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="D",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_D_mrr10'].mean(),2), round(benchmark_results['single_vector_D_recall10'].mean(),2)
(0.46, 0.82)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="D",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_D_mrr10'].mean(),2), round(benchmark_results['colbertv2_D_recall10'].mean(),2)
(0.5, 0.8)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="D",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_D_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_D_recall10'].mean(),2)
(0.52, 0.82)

Chunking Strategy E: 3-Paragraph (w/headers, w/o HTML tags)

I’ll add headers back, but will remove HTML tags.

# chunking each notebook
data = {}

for chapter, nb in nbs.items():
  data[chapter] = get_chunks(nb)

total_chunks = 0
for chapter, chunks in data.items():
  total_chunks += len(chunks)

assert total_chunks == 1967 # 1-paragraph chunks

for chapter, chunks in data.items():
  data[chapter] = combine_chunks(chunks, num_p=3)

total_chunks = 0

for chapter, chunks in data.items():
  total_chunks += len(chunks)

assert total_chunks == 713
chunks[3]
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n<img src="images/chapter9_conv_basic.png" id="basic_conv" caption="Applying a kernel to one location" alt="Applying a kernel to one location" width="700">'
def clean_html(text):
    # Step 1: Temporarily replace double-bracketed content (e.g. <<basic_conv>>) with a placeholder
    import uuid
    placeholder = f"PLACEHOLDER_{uuid.uuid4()}"
    double_bracketed = re.findall(r'<<[^>]*>>', text)
    step1 = re.sub(r'<<[^>]*>>', placeholder, text)

    # Step 2: Remove HTML tags
    step2 = re.sub(r'<[/]?[a-zA-Z][^>]*>', '', step1)

    # Step 3: Restore each double-bracketed reference in its original position
    for db in double_bracketed:
        step2 = step2.replace(placeholder, db, 1)
    return step2

clean_html('The <a href="#">text</a> is <<untouched>>.')
'The text is <<untouched>>.'
for chapter, chunks in data.items():
  data[chapter] = [clean_html(chunk) for chunk in chunks]

total_chunks = 0

for chapter, chunks in data.items():
  total_chunks += len(chunks)

assert total_chunks == 713
chunks[3]
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n'

Retrieval Method: Full Text Search

benchmark_results = do_retrieval(
    method="bm25",
    chunking_strategy="E",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results,
    questions=questions)
Chapter 1: True
Chapter 2: True
Chapter 4: True
Chapter 8: True
Chapter 9: True
Chapter 10: True
Chapter 13: True
Retrieving passages...
Retrieval complete.
round(benchmark_results['bm25_E_mrr10'].mean(),2), round(benchmark_results['bm25_E_recall10'].mean(),2)
(0.46, 0.83)

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="E",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_E_mrr10'].mean(),2), round(benchmark_results['single_vector_E_recall10'].mean(),2)
(0.5, 0.87)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="E",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_E_mrr10'].mean(),2), round(benchmark_results['colbertv2_E_recall10'].mean(),2)
(0.49, 0.81)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="E",
    data=data,
    benchmark=benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_E_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_E_recall10'].mean(),2)
(0.52, 0.84)

Chunking Strategy F: 3-Paragraph (w/headers, w/o HTML tags, w/o punctuation)

Finally, I’ll keep headers, remove HTML tags and remove all punctuation.

chunks[3]
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a *convolution*. A convolution requires nothing more than multiplication, and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!\n\nA convolution applies a *kernel* across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <<basic_conv>>.\n\n'
def remove_punctuation(text):
  import string
  return ''.join(char if char.isalnum() or char == '#' else ' ' if char in string.punctuation else char for char in text)

remove_punctuation(chunks[3])
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision  and is surprisingly straightforward  To do it  we use something called a  convolution   A convolution requires nothing more than multiplication  and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book \n\nA convolution applies a  kernel  across an image  A kernel is a little matrix  such as the 3×3 matrix in the top right of   basic conv   \n\n'
for chapter, chunks in data.items():
  data[chapter] = [remove_punctuation(chunk) for chunk in chunks]

total_chunks = 0

for chapter, chunks in data.items():
  total_chunks += len(chunks)

assert total_chunks == 713
chunks[3]
'## The Magic of Convolutions\n\nIt turns out that finding the edges in an image is a very common task in computer vision  and is surprisingly straightforward  To do it  we use something called a  convolution   A convolution requires nothing more than multiplication  and addition—two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book \n\nA convolution applies a  kernel  across an image  A kernel is a little matrix  such as the 3×3 matrix in the top right of   basic conv   \n\n'

Since I’m removing punctuation from the chunks, I need to do the same for the benchmark dataset’s context strings. I think a better solution would be to modify the scoring functions to remove punctuation there (a sketch of that idea follows the code below), but I’m saving some time and space by just copying the benchmark dataset and removing punctuation from each context string in it:

def process_contexts(data):
    # Process questions
    for question in data['questions']:
        # Process only answer_context
        if 'answer_context' in question:
            for context_item in question['answer_context']:
                if 'context' in context_item:
                    if isinstance(context_item['context'], list):
                        # If context is a list, process each string in the list
                        context_item['context'] = [
                            remove_punctuation(text) if text else text
                            for text in context_item['context']
                        ]
                    elif isinstance(context_item['context'], str):
                        # If context is a single string, process it directly
                        context_item['context'] = remove_punctuation(context_item['context'])

    return data

modified_benchmark = process_contexts(benchmark)
modified_benchmark['questions'][4]['answer_context'][0]['context']
['An MIT professor named Marvin Minsky  who was a grade behind Rosenblatt at the same high school    along with Seymour Papert  wrote a book called  Perceptrons   MIT Press   about Rosenblatt s invention  They showed that a single layer of these devices was unable to learn some simple but critical mathematical functions  such as XOR   In the same book  they also showed that using multiple layers of the devices would allow these limitations to be addressed  Unfortunately  only the first of these insights was widely recognized  As a result  the global academic community nearly entirely gave up on neural networks for the next two decades ']
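
For reference, the alternative mentioned above (normalizing punctuation inside the scoring step instead of rewriting the benchmark) could look something like the sketch below. This is a sketch of that idea, not the code used for the results in this notebook; contains_normalized is a name I made up:

def contains_normalized(context, passage):
    # strip punctuation from both strings before the substring check so that
    # punctuation-free chunks and unmodified benchmark contexts stay comparable
    return remove_punctuation(fix_text(context)) in remove_punctuation(fix_text(passage))

This predicate would replace the fix_text(context) in fix_text(passage) check inside calculate_mrr and calculate_recall.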

Retrieval Method: Full Text Search

benchmark_results = do_retrieval(
    method="bm25",
    chunking_strategy="F",
    data=data,
    benchmark=modified_benchmark,
    benchmark_results=benchmark_results,
    questions=questions)
Chapter 1: True
Chapter 2: True
Chapter 4: True
Chapter 8: True
Chapter 9: True
Chapter 10: True
Chapter 13: True
Retrieving passages...
Retrieval complete.
round(benchmark_results['bm25_F_mrr10'].mean(),2), round(benchmark_results['bm25_F_recall10'].mean(),2)
(0.46, 0.83)

Retrieval Method: Single-Vector Cosine Similarity

benchmark_results = do_retrieval(
    method="single_vector",
    chunking_strategy="F",
    data=data,
    benchmark=modified_benchmark,
    benchmark_results=benchmark_results)
Encoding Questions...
Encoding Data...
Retrieving passages...
Retrieval complete.
round(benchmark_results['single_vector_F_mrr10'].mean(),2), round(benchmark_results['single_vector_F_recall10'].mean(),2)
(0.49, 0.86)

Retrieval Method: ColBERTv2

benchmark_results = do_retrieval(
    method="colbertv2",
    chunking_strategy="F",
    data=data,
    benchmark=modified_benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['colbertv2_F_mrr10'].mean(),2), round(benchmark_results['colbertv2_F_recall10'].mean(),2)
(0.44, 0.71)

Retrieval Method: answerai-colbert-small-v1

benchmark_results = do_retrieval(
    method="answerai_colbert",
    chunking_strategy="F",
    data=data,
    benchmark=modified_benchmark,
    benchmark_results=benchmark_results)
round(benchmark_results['answerai_colbert_F_mrr10'].mean(),2), round(benchmark_results['answerai_colbert_F_recall10'].mean(),2)
(0.45, 0.73)

Final Thoughts

Here are the definitions of the metrics, retrieval methods and chunking strategies that I am using in this benchmark evaluation:

Metrics

  • Answer Component MRR@10: the reciprocal of the rank of the last passage needed to satisfy all answer_components for the question. So, if a question has 4 answer_components and their relevant contexts are spread across the first 5 retrieved passages (the last component is first satisfied at rank 5), MRR would be 1/5 = 0.2.

  • Answer Component Recall@10: the proportion of answer_components for which at least one supporting context was retrieved. Using the same example, if the top-10 passages only contain contexts relevant to 2 of the 4 answer_components, Recall would be 2/4 = 0.5.

Retrieval Methods

  • Full text search (using sqlite and Claude-generated keywords)
  • Single-vector cosine similarity (using BAAI/bge-small-en-v1.5)
  • ColBERTv2
  • answerai-colbert-small-v1

Chunking Strategies

Chunking Strategy Name   Description
A                        1-paragraph (w/headers)
B                        3-paragraph (w/headers)
C                        1-paragraph (w/o headers)
D                        3-paragraph (w/o headers)
E                        3-paragraph (w/headers, w/o HTML tags)
F                        3-paragraph (w/headers, w/o HTML tags, w/o punctuation)

Here are the results from this notebook:

Answer Component MRR@10

Retrieval Method                  A      B      C      D      E      F
Full text search                  0.30   0.46   0.29   0.44   0.46   0.46
Single-vector cosine similarity   0.38   0.50   0.35   0.46   0.50   0.49
ColBERTv2                         0.46   0.49   0.41   0.50   0.49   0.44
answerai-colbert-small-v1         0.48   0.52   0.45   0.52   0.52   0.45

Answer Component Recall@10

Retrieval Method                  A     B     C     D     E     F
Full text search                  65%   83%   65%   82%   83%   83%
Single-vector cosine similarity   71%   85%   72%   82%   87%   86%
ColBERTv2                         80%   80%   74%   80%   81%   71%
answerai-colbert-small-v1         82%   84%   77%   82%   84%   73%

The best-performing retrieval method and chunking strategies:

Metric Name                  Retrieval Method                  Chunking Strategies   Metric Value
Answer Component MRR@10      answerai-colbert-small-v1         B, D, E               0.52
Answer Component Recall@10   Single-vector cosine similarity   E                     87%

I was quite surprised that single-vector cosine similarity yielded the best Recall. I was less surprised that answerai-colbert-small-v1 had the best MRR@10 since it was better than the other retrieval methods for 5 out of 6 chunking strategies. Other noteworthy observations:

  • ColBERTv2 and answerai-colbert-small-v1 both experienced a considerable performance drop when punctuation was removed from the documents.
  • Full text search was very competitive once the chunk size was increased to 3 paragraphs (B, D, E, F). It yielded the second-highest MRR@10 for Chunking Strategy F (3-paragraph, w/headers, w/o HTML tags, w/o punctuation).
  • Removing HTML tags (Chunking Strategy E) improved the Recall@10 of all four retrieval methods compared to when the tags were included (Chunking Strategy D). The biggest beneficiary was single-vector cosine similarity (82% -> 87%).

A couple of notes about my process:

  • Having a benchmark dataset saved me about 15-20 hours of manual evaluation.
  • Refactoring the code (into a do_retrieval function) made it easier for me to iterate quickly across different chunking strategies.

Before I move on to experimenting with hybrid approaches (full text search + semantic search), I want to research and apply chunking strategies that are particularly suited to ColBERTv2 and answerai-colbert-small-v1 to see if I can improve on the overall-best Recall@10 of 87% and MRR@10 of 0.52.