Comparing RAGatouille and ColBERT Indexes and Search Results

python
deep learning
information retrieval
RAGatouille
ColBERT
In this blog post I answer two questions: 1. For a given document collection and indexing configuration, do RAGatouille and ColBERT produce the same index? and 2. For a given index and search configuration, do RAGatouille and ColBERT retrieve the same passages/Recall@10?
Author

Vishal Bakshi

Published

May 10, 2025

Show setup
import faiss
hasattr(faiss, "StandardGpuResources")

from datasets import load_dataset
from ragatouille import RAGPretrainedModel
import time
import pytrec_eval
from ranx import evaluate
from ranx import Qrels, Run
import pickle
import json
import os
import torch
import srsly

from colbert import Indexer
from colbert.infra import RunConfig, ColBERTConfig
from colbert.infra.run import Run  # note: this shadows the ranx Run imported above

from colbert.data import Queries
from colbert import Searcher

Background

In this notebook I am trying to answer two questions:

  1. For a given document collection and indexing configuration, do RAGatouille and ColBERT produce the same index?
  2. For a given index and search configuration, do RAGatouille and ColBERT retrieve the same passages/Recall@10?

For this exercise, I’m using the ConditionalQA document collection from UKPLab/DAPR, which contains about 69k passages. If this exercise is successful, I’ll scale up to larger document collections.

Here’s my rough plan:

  1. Index the ConditionalQA document collection using RAGatouille and ColBERT. Be very thorough in ensuring the same configuration values are used.
  2. Compare artifacts of the index (json and pt files). Document differences.
  3. If all goes well, I expect both indexes to be largely identical. If not, that’s a deeper dive.
  4. Assuming successful equality of indexes, I’ll then perform search on the index using each framework, and compare retrieved passages and Recall@10. Initially, I’ll use RAGatouille search on the RAGatouille index, and ColBERT search on the ColBERT index. If that goes well, I might use one framework to search on the other’s index. RAGatouille requires some additional files so I’ll likely have to create them manually from the ColBERT index artifacts.
  5. If I get similar Recall@10 and retrieved passages, great! If not, that’s a deeper dive.

Load the Data

dataset_name = "ConditionalQA"
dataset_name
'ConditionalQA'
passages = load_dataset("UKPLab/dapr", f"{dataset_name}-corpus", split="test")
queries = load_dataset("UKPLab/dapr", f"{dataset_name}-queries", split="test")
qrels_rows = load_dataset("UKPLab/dapr", f"{dataset_name}-qrels", split="test")
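
A quick check of the fields each split exposes, since the cells below rely on passages having _id and text, queries having _id and text, and qrels rows having query_id, corpus_id, and score:

# peek at one row from each split to confirm the field names used later
passages[0].keys(), queries[0].keys(), qrels_rows[0].keys()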

Create RAGatouille Index (1k subset)

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")
n_items = 1000
n_items
1000

Notes about RAG.model.config before indexing:

  • The following values are None: ncells, centroid_score_threshold, ndocs
  • kmeans_niters=4
  • nbits=1
  • index_bsize=64
  • bsize=32
  • dim=96
  • doc_maxlen=300
  • rank=0
  • nranks=4
  • gpus=4
RAG.model.config
ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=None, centroid_score_threshold=None, ndocs=None, load_index_with_mmap=False, index_path=None, index_bsize=64, nbits=1, kmeans_niters=4, resume=False, pool_factor=1, clustering_mode='hierarchical', protected_tokens=0, similarity='cosine', bsize=32, accumsteps=1, lr=1e-05, maxsteps=15626, save_every=None, warmup=781, warmup_bert=None, relu=False, nway=32, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name='answerdotai/AnswerAI-ColBERTv2.5-small', query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=96, doc_maxlen=300, mask_punctuation=True, checkpoint='/home/vishal/.cache/huggingface/hub/models--answerdotai--answerai-colbert-small-v1/snapshots/be1703c55532145a844da800eea4c9a692d7e267/', triples='/home/bclavie/colbertv2.5_en/data/msmarco/triplets.jsonl', collection='/home/bclavie/colbertv2.5_en/data/msmarco/collection.tsv', queries='/home/bclavie/colbertv2.5_en/data/msmarco/queries.tsv', index_name=None, overwrite=False, root='.ragatouille/', experiment='colbert', index_root=None, name='2024-08/07/08.16.20', rank=0, nranks=4, amp=True, gpus=4, avoid_fork_if_possible=False)
#!rm -rf .ragatouille/colbert/indexes/ConditionalQA_RAGatouille_index_1k
RAG_index_path = RAG.index(
    index_name=f"{dataset_name}_RAGatouille_index_1k",
    collection=passages[:n_items]["text"],
    document_ids=passages[:n_items]["_id"],
    use_faiss=True # to match ColBERT
)
!du -sh {RAG_index_path}
1.2M    .ragatouille/colbert/indexes/ConditionalQA_RAGatouille_index_1k
!ls {RAG_index_path}
0.codes.pt   avg_residual.pt  collection.json  metadata.json
0.metadata.json  buckets.pt   doclens.0.json   pid_docid_map.json
0.residuals.pt   centroids.pt     ivf.pid.pt       plan.json

Notes about metadata.json after indexing:

  • The following values are still None: ncells, centroid_score_threshold, ndocs
  • kmeans_niters=20 (up from 4)
  • nbits=4 (up from 1)
  • index_bsize=32 (down from 64)
  • bsize=64 (up from 32)
  • dim=96
  • doc_maxlen=256 (down from 300)
  • rank=0
  • nranks=1 (down from 4)
  • gpus=1 (down from 4)
  • 'num_partitions'=1024 (not in original config)

Inspecting the RAGatouille metadata:

with open(f"{RAG_index_path}/metadata.json", 'r') as f:
    RAG_metadata = json.load(f)
RAG_metadata
{'config': {'query_token_id': '[unused0]',
  'doc_token_id': '[unused1]',
  'query_token': '[Q]',
  'doc_token': '[D]',
  'ncells': None,
  'centroid_score_threshold': None,
  'ndocs': None,
  'load_index_with_mmap': False,
  'index_path': None,
  'index_bsize': 32,
  'nbits': 4,
  'kmeans_niters': 20,
  'resume': False,
  'pool_factor': 1,
  'clustering_mode': 'hierarchical',
  'protected_tokens': 0,
  'similarity': 'cosine',
  'bsize': 64,
  'accumsteps': 1,
  'lr': 1e-05,
  'maxsteps': 15626,
  'save_every': None,
  'warmup': 781,
  'warmup_bert': None,
  'relu': False,
  'nway': 32,
  'use_ib_negatives': False,
  'reranker': False,
  'distillation_alpha': 1.0,
  'ignore_scores': False,
  'model_name': 'answerdotai/AnswerAI-ColBERTv2.5-small',
  'query_maxlen': 32,
  'attend_to_mask_tokens': False,
  'interaction': 'colbert',
  'dim': 96,
  'doc_maxlen': 256,
  'mask_punctuation': True,
  'checkpoint': 'answerdotai/answerai-colbert-small-v1',
  'triples': '/home/bclavie/colbertv2.5_en/data/msmarco/triplets.jsonl',
  'collection': ['list with 1000 elements starting with...',
   ['Overview',
    'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
    'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.']],
  'queries': '/home/bclavie/colbertv2.5_en/data/msmarco/queries.tsv',
  'index_name': 'ConditionalQA_RAGatouille_index_1k',
  'overwrite': False,
  'root': '.ragatouille/',
  'experiment': 'colbert',
  'index_root': None,
  'name': '2025-05/09/10.30.23',
  'rank': 0,
  'nranks': 1,
  'amp': True,
  'gpus': 1,
  'avoid_fork_if_possible': False},
 'num_chunks': 1,
 'num_partitions': 1024,
 'num_embeddings': 15198,
 'avg_doclen': 15.198,
 'RAGatouille': {'index_config': {'index_type': 'PLAID',
   'index_name': 'ConditionalQA_RAGatouille_index_1k'}}}
Parameter      | Before Indexing | After Indexing | Impact
---------------|-----------------|----------------|---------------------------------------
kmeans_niters  | 4               | 20             | 5× more clustering iterations
nbits          | 1               | 4              | 4× more bits for residual compression
index_bsize    | 64              | 32             | Halved indexing batch size
bsize          | 32              | 64             | Doubled batch size
doc_maxlen     | 300             | 256            | Reduced document length limit
nranks         | 4               | 1              | Changed to single-process execution
gpus           | 4               | 1              | Changed to single-GPU execution
num_partitions | (not set)       | 1024           | New parameter added during indexing

The following are search parameters so they are not set/used for indexing: ncells, centroid_score_threshold, ndocs.
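
To make the before/after comparison concrete, here’s a small sketch that checks the pre-indexing values noted above (hardcoded from the config printed earlier, since RAG.model.config may be mutated in place during indexing) against what ended up in metadata.json:

# Sketch: pre-indexing values from the config printed above, hardcoded for comparison
pre_indexing = {"kmeans_niters": 4, "nbits": 1, "index_bsize": 64, "bsize": 32,
                "doc_maxlen": 300, "nranks": 4, "gpus": 4}
for param, before in pre_indexing.items():
    after = RAG_metadata["config"][param]
    status = "changed" if before != after else "unchanged"
    print(f"{param}: {before} -> {after} ({status})")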

Create Vanilla ColBERT Index (1k subset)

Next, I’ll index the same document collection using vanilla ColBERT (which is installed with RAGatouille).

dataset_name
'ConditionalQA'
ColBERTConfig()
ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=None, centroid_score_threshold=None, ndocs=None, load_index_with_mmap=False, index_path=None, index_bsize=64, nbits=1, kmeans_niters=4, resume=False, pool_factor=1, clustering_mode='hierarchical', protected_tokens=0, similarity='cosine', bsize=32, accumsteps=1, lr=3e-06, maxsteps=500000, save_every=None, warmup=None, warmup_bert=None, relu=False, nway=2, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name=None, query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=128, doc_maxlen=220, mask_punctuation=True, checkpoint=None, triples=None, collection=None, queries=None, index_name=None, overwrite=False, root='/mnt/my4tb/vishal_data/SuperPassage/experiments', experiment='default', index_root=None, name='2025-05/09/10.30.23', rank=0, nranks=1, amp=True, gpus=1, avoid_fork_if_possible=False)

Key differences between this initial config and RAGatouille’s post-indexing metadata:

Parameter     | RAGatouille value | ColBERT default
--------------|-------------------|----------------
kmeans_niters | 20                | 4
nbits         | 4                 | 1
dim           | 96                | 128
doc_maxlen    | 256               | 220
index_bsize   | 32                | 64

I will set these explicitly in the ColBERTConfig before indexing.

n_items
1000

The following environment variable needs to be set, otherwise the indexing script won’t run.

os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"
#!rm -rf /mnt/my4tb/vishal_data/SuperPassage/.ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_1k
with Run().context(RunConfig(nranks=1)):
    config = ColBERTConfig(
        doc_maxlen=256,      
        nbits=4,             
        dim=96,             
        kmeans_niters=20,
        index_bsize=32,
        bsize=64,
        checkpoint="answerdotai/answerai-colbert-small-v1",
    )
    
    indexer = Indexer(checkpoint="answerdotai/answerai-colbert-small-v1", config=config)
    indexer.index(name=f"{dataset_name}_ColBERT_index_1k", collection=passages[:n_items]["text"])
ColBERT_index_path = ".ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_1k"

The ColBERT index is slightly smaller than the RAGatouille index, likely because it doesn’t store the collection or the pid-to-docid map as JSON files (RAGatouille does, which we’ll come back to later during search).

!du -sh {ColBERT_index_path}
1.1M    .ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_1k
!ls {ColBERT_index_path}
0.codes.pt   0.residuals.pt   buckets.pt    doclens.0.json  metadata.json
0.metadata.json  avg_residual.pt  centroids.pt  ivf.pid.pt  plan.json
with open(f"{ColBERT_index_path}/metadata.json", 'r') as f:
    ColBERT_metadata = json.load(f)
ColBERT_metadata
{'config': {'query_token_id': '[unused0]',
  'doc_token_id': '[unused1]',
  'query_token': '[Q]',
  'doc_token': '[D]',
  'ncells': None,
  'centroid_score_threshold': None,
  'ndocs': None,
  'load_index_with_mmap': False,
  'index_path': None,
  'index_bsize': 32,
  'nbits': 4,
  'kmeans_niters': 20,
  'resume': False,
  'pool_factor': 1,
  'clustering_mode': 'hierarchical',
  'protected_tokens': 0,
  'similarity': 'cosine',
  'bsize': 64,
  'accumsteps': 1,
  'lr': 1e-05,
  'maxsteps': 15626,
  'save_every': None,
  'warmup': 781,
  'warmup_bert': None,
  'relu': False,
  'nway': 32,
  'use_ib_negatives': False,
  'reranker': False,
  'distillation_alpha': 1.0,
  'ignore_scores': False,
  'model_name': 'answerdotai/AnswerAI-ColBERTv2.5-small',
  'query_maxlen': 32,
  'attend_to_mask_tokens': False,
  'interaction': 'colbert',
  'dim': 96,
  'doc_maxlen': 256,
  'mask_punctuation': True,
  'checkpoint': 'answerdotai/answerai-colbert-small-v1',
  'triples': '/home/bclavie/colbertv2.5_en/data/msmarco/triplets.jsonl',
  'collection': ['list with 1000 elements starting with...',
   ['Overview',
    'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
    'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.']],
  'queries': '/home/bclavie/colbertv2.5_en/data/msmarco/queries.tsv',
  'index_name': 'ConditionalQA_ColBERT_index_1k',
  'overwrite': False,
  'root': '.ragatouille/',
  'experiment': 'colbert',
  'index_root': None,
  'name': '2025-05/09/10.30.23',
  'rank': 0,
  'nranks': 1,
  'amp': True,
  'gpus': 1,
  'avoid_fork_if_possible': False},
 'num_chunks': 1,
 'num_partitions': 1024,
 'num_embeddings': 15198,
 'avg_doclen': 15.198}

Comparing 1k subset Index Artifacts

RAGatouille files:

  • 0.codes.pt
  • 0.residuals.pt
  • buckets.pt
  • doclens.0.json
  • metadata.json
  • 0.metadata.json
  • avg_residual.pt
  • centroids.pt
  • ivf.pid.pt
  • plan.json
  • collection.json (unique to RAGatouille)
  • pid_docid_map.json (unique to RAGatouille)

ColBERT files:

  • 0.codes.pt
  • 0.residuals.pt
  • buckets.pt
  • doclens.0.json
  • metadata.json
  • 0.metadata.json
  • avg_residual.pt
  • centroids.pt
  • ivf.pid.pt
  • plan.json
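
The file-level difference can also be checked programmatically. A quick sketch:

# Sketch: which files exist in one index directory but not the other?
rag_files = set(os.listdir(RAG_index_path))
colbert_files = set(os.listdir(ColBERT_index_path))
print("RAGatouille only:", rag_files - colbert_files)  # collection.json, pid_docid_map.json
print("ColBERT only:", colbert_files - rag_files)      # expected to be empty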

All parameters relevant to indexing match in the corresponding metadata.json files:

Parameter      | Value                                   | Status
---------------|-----------------------------------------|----------
index_bsize    | 32                                      | ✅ Matches
nbits          | 4                                       | ✅ Matches
kmeans_niters  | 20                                      | ✅ Matches
dim            | 96                                      | ✅ Matches
doc_maxlen     | 256                                     | ✅ Matches
num_partitions | 1024                                    | ✅ Matches
num_embeddings | 15198                                   | ✅ Matches
avg_doclen     | 15.198                                  | ✅ Matches
checkpoint     | 'answerdotai/answerai-colbert-small-v1' | ✅ Matches
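
One way to verify these programmatically (the same pattern gets applied to the full index later on); a sketch using the metadata dictionaries loaded above:

# Sketch: assert the key indexing parameters agree between the two metadata.json files
for p in ["index_bsize", "nbits", "kmeans_niters", "dim", "doc_maxlen", "checkpoint"]:
    assert RAG_metadata["config"][p] == ColBERT_metadata["config"][p], p
for p in ["num_partitions", "num_embeddings", "avg_doclen"]:
    assert RAG_metadata[p] == ColBERT_metadata[p], p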

Walking through each file and comparing contents:

def _compare_pt(r_path, c_path):
    # load the corresponding RAGatouille (r) and ColBERT (c) index artifacts
    r = torch.load(r_path)
    c = torch.load(c_path)
    # ivf.pid.pt and buckets.pt are stored as tuples of tensors; compare each element
    if isinstance(r, tuple):
        print("0 shape:", r[0].shape, c[0].shape)
        print(r[0])
        print(c[0])
        print('\n')
        print("0 match: ", (r[0] == c[0]).float().mean())
        print('\n')
        print('#'*30)
        print('\n')
        print("1 shape:", r[1].shape, c[1].shape)
        print(r[1])
        print(c[1])
        print('\n')
        print("1 match: ", (r[1] == c[1]).float().mean())
    else:
        print(r)
        print('\n')
        print(c)
        print('\n')
        print(r.shape, c.shape)
        print('\n')
        print("match: ",(r == c).float().mean())

0.codes.pt

_compare_pt(r_path=f"{RAG_index_path}/0.codes.pt", c_path=f"{ColBERT_index_path}/0.codes.pt")
tensor([345, 288, 647,  ..., 232, 767,  29], dtype=torch.int32)


tensor([345, 288, 647,  ..., 232, 767,  29], dtype=torch.int32)


torch.Size([15198]) torch.Size([15198])


match:  tensor(1.)

0.residuals.pt

_compare_pt(r_path=f"{RAG_index_path}/0.residuals.pt", c_path=f"{ColBERT_index_path}/0.residuals.pt")
tensor([[ 30, 225, 225,  ..., 238, 238,  30],
        [238, 238, 238,  ..., 238, 238, 238],
        [240, 254, 253,  ..., 175, 240, 128],
        ...,
        [ 99, 105, 231,  ...,  40,  95,  48],
        [ 85, 241,  87,  ..., 128,   8, 179],
        [ 89, 106, 150,  ..., 162, 238,  22]], dtype=torch.uint8)


tensor([[ 30, 225, 225,  ..., 238, 238,  30],
        [238, 238, 238,  ..., 238, 238, 238],
        [240, 254, 253,  ..., 175, 240, 128],
        ...,
        [ 99, 105, 231,  ...,  40,  95,  48],
        [ 85, 241,  87,  ..., 128,   8, 179],
        [ 89, 106, 150,  ..., 162, 238,  22]], dtype=torch.uint8)


torch.Size([15198, 48]) torch.Size([15198, 48])


match:  tensor(1.)

centroids.pt

_compare_pt(r_path=f"{RAG_index_path}/centroids.pt", c_path=f"{ColBERT_index_path}/centroids.pt")
tensor([[-0.0701,  0.0035, -0.0785,  ...,  0.1628,  0.0201, -0.0419],
        [-0.0350, -0.0082, -0.0715,  ...,  0.1119, -0.0159, -0.1164],
        [-0.0753,  0.0172, -0.0513,  ...,  0.1070,  0.1476, -0.0699],
        ...,
        [-0.1425,  0.1393, -0.2316,  ...,  0.0169,  0.0897, -0.0431],
        [-0.0690,  0.0513, -0.0935,  ...,  0.1311,  0.0324, -0.0705],
        [-0.0812,  0.0511, -0.0482,  ...,  0.1010,  0.0365, -0.0582]],
       device='cuda:0', dtype=torch.float16)


tensor([[-0.0701,  0.0035, -0.0785,  ...,  0.1628,  0.0201, -0.0419],
        [-0.0350, -0.0082, -0.0715,  ...,  0.1119, -0.0159, -0.1164],
        [-0.0753,  0.0172, -0.0513,  ...,  0.1070,  0.1476, -0.0699],
        ...,
        [-0.1425,  0.1393, -0.2316,  ...,  0.0169,  0.0897, -0.0431],
        [-0.0690,  0.0513, -0.0935,  ...,  0.1311,  0.0324, -0.0705],
        [-0.0812,  0.0511, -0.0482,  ...,  0.1010,  0.0365, -0.0582]],
       device='cuda:0', dtype=torch.float16)


torch.Size([1024, 96]) torch.Size([1024, 96])


match:  tensor(1., device='cuda:0')

ivf.pid.pt

_compare_pt(r_path=f"{RAG_index_path}/ivf.pid.pt", c_path=f"{ColBERT_index_path}/ivf.pid.pt")
0 shape: torch.Size([11696]) torch.Size([11696])
tensor([889, 894, 916,  ...,   0,   0,   0], dtype=torch.int32)
tensor([889, 894, 916,  ...,   0,   0,   0], dtype=torch.int32)


0 match:  tensor(1.)


##############################


1 shape: torch.Size([1024]) torch.Size([1024])
tensor([ 5, 46, 16,  ...,  7, 11,  3])
tensor([ 5, 46, 16,  ...,  7, 11,  3])


1 match:  tensor(1.)

buckets.pt

_compare_pt(r_path=f"{RAG_index_path}/buckets.pt", c_path=f"{ColBERT_index_path}/buckets.pt")
0 shape: torch.Size([15]) torch.Size([15])
tensor([-0.0310, -0.0208, -0.0148, -0.0101, -0.0065, -0.0037, -0.0015,  0.0000,
         0.0016,  0.0037,  0.0067,  0.0103,  0.0150,  0.0210,  0.0312],
       device='cuda:0')
tensor([-0.0310, -0.0208, -0.0148, -0.0101, -0.0065, -0.0037, -0.0015,  0.0000,
         0.0016,  0.0037,  0.0067,  0.0103,  0.0150,  0.0210,  0.0312],
       device='cuda:0')


0 match:  tensor(1., device='cuda:0')


##############################


1 shape: torch.Size([16]) torch.Size([16])
tensor([-0.0417, -0.0248, -0.0175, -0.0123, -0.0082, -0.0050, -0.0025, -0.0006,
         0.0007,  0.0026,  0.0051,  0.0084,  0.0125,  0.0178,  0.0251,  0.0417],
       device='cuda:0', dtype=torch.float16)
tensor([-0.0417, -0.0248, -0.0175, -0.0123, -0.0082, -0.0050, -0.0025, -0.0006,
         0.0007,  0.0026,  0.0051,  0.0084,  0.0125,  0.0178,  0.0251,  0.0417],
       device='cuda:0', dtype=torch.float16)


1 match:  tensor(1., device='cuda:0')

avg_residual.pt

_compare_pt(r_path=f"{RAG_index_path}/avg_residual.pt", c_path=f"{ColBERT_index_path}/avg_residual.pt")
tensor(0.0150, device='cuda:0', dtype=torch.float16)


tensor(0.0150, device='cuda:0', dtype=torch.float16)


torch.Size([]) torch.Size([])


match:  tensor(1., device='cuda:0')

doclens.0.json

with open(f"{RAG_index_path}/doclens.0.json", 'r') as f:
    RAGatouille_doclens = json.load(f)
RAGatouille_doclens[:5]
[4, 20, 18, 23, 8]
with open(f"{ColBERT_index_path}/doclens.0.json", 'r') as f:
    ColBERT_doclens = json.load(f)
ColBERT_doclens[:5]
[4, 20, 18, 23, 8]
RAGatouille_doclens == ColBERT_doclens
True

Based on these comparisons, I can conclude that ColBERT and RAGatouille do indeed produce identical index artifacts given the same configuration and document collection!

Indexing Full ConditionalQA + Comparing Artifacts

With the 1k subset confirmed, I’ll now index the full ConditionalQA document collection (69,199 passages).

dataset_name
'ConditionalQA'
len(passages)
69199
RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")
RAG_index_path = RAG.index(
    index_name=f"{dataset_name}_RAGatouille_index_full",
    collection=passages["text"],
    document_ids=passages["_id"],
    use_faiss=True # to match ColBERT
)
!du -sh {RAG_index_path}
45M .ragatouille/colbert/indexes/ConditionalQA_RAGatouille_index_full
!ls {RAG_index_path}
0.codes.pt   1.residuals.pt   buckets.pt       doclens.2.json
0.metadata.json  2.codes.pt   centroids.pt     ivf.pid.pt
0.residuals.pt   2.metadata.json  collection.json  metadata.json
1.codes.pt   2.residuals.pt   doclens.0.json   pid_docid_map.json
1.metadata.json  avg_residual.pt  doclens.1.json   plan.json
#!rm -rf .ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_full
os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"
with Run().context(RunConfig(nranks=1)):
    config = ColBERTConfig(
        doc_maxlen=256,      
        nbits=2,  # to match RAGatouille           
        dim=96,             
        kmeans_niters=10, # to match RAGatouille
        index_bsize=32,
        bsize=64,
        checkpoint="answerdotai/answerai-colbert-small-v1",
    )
    
    indexer = Indexer(checkpoint="answerdotai/answerai-colbert-small-v1", config=config)
    indexer.index(name=f"{dataset_name}_ColBERT_index_full", collection=passages["text"])
ColBERT_index_path = ".ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_full"
!du -sh {ColBERT_index_path}
38M .ragatouille/colbert/indexes/ConditionalQA_ColBERT_index_full
!ls {ColBERT_index_path}
0.codes.pt   1.residuals.pt   buckets.pt      ivf.pid.pt
0.metadata.json  2.codes.pt   centroids.pt    metadata.json
0.residuals.pt   2.metadata.json  doclens.0.json  plan.json
1.codes.pt   2.residuals.pt   doclens.1.json
1.metadata.json  avg_residual.pt  doclens.2.json

Comparing Metadata

All key metadata parameters are equivalent between the RAGatouille and ColBERT indexes.

params = ["index_bsize", "nbits", "kmeans_niters", "bsize", "dim", "rank", "gpus", "nranks", "num_chunks", "num_partitions", "num_embeddings", "avg_doclen"]
with open(f"{RAG_index_path}/metadata.json", 'r') as f:
    RAG_metadata = json.load(f)
with open(f"{ColBERT_index_path}/metadata.json", 'r') as f:
    ColBERT_metadata = json.load(f)
for p in params: 
    if p not in ["num_chunks", "num_partitions", "num_embeddings", "avg_doclen"]: assert RAG_metadata['config'][p] == ColBERT_metadata['config'][p], p
    elif p == "avg_doclen": assert (RAG_metadata[p] - ColBERT_metadata[p]) < 1e-7
    else: assert RAG_metadata[p] == ColBERT_metadata[p], p

Comparing Index Artifacts

def _compare_pt(r_path, c_path):
    r = torch.load(r_path)
    c = torch.load(c_path)
    if isinstance(r,tuple):
        assert r[0].shape == c[0].shape
        assert (r[0] == c[0]).float().mean() == 1
        assert (r[1] == c[1]).float().mean() == 1
    else:
        assert r.shape == c.shape
        assert (r == c).float().mean() == 1
files = [
    "0.codes.pt",
    "0.residuals.pt",
    "centroids.pt",
    "ivf.pid.pt",
    "buckets.pt",
    "avg_residual.pt",
    "doclens.0.json"
]
for f in files:
    if f == "doclens.0.json": 
        with open(f"{RAG_index_path}/{f}", 'r') as _f: RAG_doclens = json.load(_f)
        with open(f"{ColBERT_index_path}/{f}", 'r') as _f: ColBERT_doclens = json.load(_f)
        assert RAG_doclens == ColBERT_doclens
    else: _compare_pt(f"{RAG_index_path}/{f}", f"{ColBERT_index_path}/{f}")

All compared index artifacts are equivalent! This further confirms the equivalence of the indexes created by RAGatouille and ColBERT.
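
One caveat: the loop above only touches the chunk-0 files, and the full index is split across three chunks (note the 0/1/2-prefixed files in the listings above). A sketch of the same check extended to every chunk:

# Sketch: run the same comparisons for every chunk of the full index
num_chunks = RAG_metadata["num_chunks"]  # 3, per the 0/1/2-prefixed files listed above
for chunk in range(num_chunks):
    _compare_pt(f"{RAG_index_path}/{chunk}.codes.pt", f"{ColBERT_index_path}/{chunk}.codes.pt")
    _compare_pt(f"{RAG_index_path}/{chunk}.residuals.pt", f"{ColBERT_index_path}/{chunk}.residuals.pt")
    with open(f"{RAG_index_path}/doclens.{chunk}.json") as fr: r_doclens = json.load(fr)
    with open(f"{ColBERT_index_path}/doclens.{chunk}.json") as fc: c_doclens = json.load(fc)
    assert r_doclens == c_doclens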

Comparing Search Results

To recap, I started this exploration with two questions:

  1. For a given document collection and indexing configuration, do RAGatouille and ColBERT produce the same index?
  2. For a given index and search configuration, do RAGatouille and ColBERT retrieve the same passages/Recall@10?

The answer to the first question is YES. Let’s move on to answering the second question, starting by searching the RAGatouille index with RAGatouille.

Searching RAGatouille Index with RAGatouille

I will explicitly set search parameters for RAGatouille, even though they are set automatically based on document collection size in PLAIDModelIndex._load_searcher:

if not force_fast:
    self.searcher.configure(ndocs=1024)
    self.searcher.configure(ncells=16)
    if len(self.searcher.collection) < 10000:
        self.searcher.configure(ncells=8)
        self.searcher.configure(centroid_score_threshold=0.4)
    elif len(self.searcher.collection) < 100000:
        self.searcher.configure(ncells=4)
        self.searcher.configure(centroid_score_threshold=0.45)
    # Otherwise, use defaults for k
else:
    # Use fast settingss
    self.searcher.configure(ncells=1)
    self.searcher.configure(centroid_score_threshold=0.5)
    self.searcher.configure(ndocs=256)
RAG.model.config.ncells = 4
RAG.model.config.centroid_score_threshold = 0.45
RAG.model.config.ndocs = 1024
RAG.model.config
ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=4, centroid_score_threshold=0.45, ndocs=1024, load_index_with_mmap=False, index_path=None, index_bsize=32, nbits=2, kmeans_niters=10, resume=False, pool_factor=1, clustering_mode='hierarchical', protected_tokens=0, similarity='cosine', bsize=32, accumsteps=1, lr=1e-05, maxsteps=15626, save_every=None, warmup=781, warmup_bert=None, relu=False, nway=32, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name='answerdotai/AnswerAI-ColBERTv2.5-small', query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=96, doc_maxlen=256, mask_punctuation=True, checkpoint='/home/vishal/.cache/huggingface/hub/models--answerdotai--answerai-colbert-small-v1/snapshots/be1703c55532145a844da800eea4c9a692d7e267/', triples='/home/bclavie/colbertv2.5_en/data/msmarco/triplets.jsonl', collection='/home/bclavie/colbertv2.5_en/data/msmarco/collection.tsv', queries='/home/bclavie/colbertv2.5_en/data/msmarco/queries.tsv', index_name=None, overwrite=False, root='.ragatouille/colbert/indexes', experiment='colbert', index_root=None, name='2024-08/07/08.16.20', rank=0, nranks=4, amp=True, gpus=4, avoid_fork_if_possible=True)
ragatouille_results = {}
for q in queries:
    results = RAG.search(q['text'], k=10)
    ragatouille_results[q['_id']] = {result['document_id']: float(result['score']) for result in results}
ragatouille_results['dev-0']
{'107-242': 69.9375,
 '496-116': 69.9375,
 '86-28': 69.875,
 '254-4': 69.875,
 '107-103': 69.875,
 '8-67': 69.8125,
 '98-46': 69.8125,
 '8-80': 69.8125,
 '8-116': 69.8125,
 '107-43': 69.8125}

The mean Recall@10 for all 271 queries is 0.29.

qrels = {}
for qrel_row in qrels_rows:
    qid = qrel_row["query_id"]
    pid = qrel_row["corpus_id"]
    rel = qrel_row["score"]
    qrels.setdefault(qid, {})
    qrels[qid][pid] = rel
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})
metrics = evaluator.evaluate(ragatouille_results)
assert len(metrics) == len(set(qrels_rows["query_id"]))

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.2855810510889169
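
Since ranx is imported in the setup, it provides a quick cross-check of the pytrec_eval number. A sketch, aliasing ranx’s Run because the colbert import in the setup shadows it (and assuming the qrels relevance values are integers, which ranx requires):

# Sketch: cross-check Recall@10 with ranx
from ranx import Qrels as RanxQrels, Run as RanxRun, evaluate as ranx_evaluate
ranx_recall = ranx_evaluate(RanxQrels(qrels), RanxRun(ragatouille_results), "recall@10")
ranx_recall  # should land on (roughly) the same value as the pytrec_eval mean above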

Searching the ColBERT Index with ColBERT

Next, I’ll search the ColBERT index with ColBERT, setting the same configuration values as RAGatouille.

RAG.model.config.ncells, \
RAG.model.config.centroid_score_threshold, \
RAG.model.config.ndocs
(4, 0.45, 1024)

ColBERT expects the queries to be structured as a dictionary, so I’ll prepare that accordingly:

queries_dict = {}
for item in queries:
    queries_dict[item['_id']] = item['text']

len(queries_dict)
271
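
This {_id: text} mapping is the shape ColBERT’s batch API consumes: the Queries class imported in the setup wraps it, and Searcher.search_all scores every query in one call. I search query-by-query below instead (for the query-length reason explained next), but a batch version would look roughly like the following sketch. Note it uses ColBERT’s default query_maxlen of 32, and todict() is my assumption about the Ranking object’s accessor:

# Sketch: batch search with ColBERT's Queries/search_all (default query_maxlen=32)
index_root = os.path.join(os.path.abspath("."), ".ragatouille", "colbert", "indexes")
with Run().context(RunConfig(nranks=1)):
    searcher = Searcher(
        index="ConditionalQA_ColBERT_index_full",
        index_root=index_root,
        config=ColBERTConfig(ncells=4, centroid_score_threshold=0.45, ndocs=1024),
    )
    batch_ranking = searcher.search_all(Queries(data=queries_dict), k=10)
    batch_results = batch_ranking.todict()  # assumed: {qid: [(pid, rank, score), ...]}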

I was posting on Twitter about how I wasn’t getting the same search results when using RAGatouille and vanilla ColBERT given the same index. Benjamin Clavie, the author of RAGatouille, kindly took some time to explain a core difference in how RAGatouille and ColBERT process queries.

As he shared in his tweet, RAGatouille uses a larger maximum query length than ColBERT, which defaults to 32. So to replicate the same scores (and therefore the same top-k retrieved passages), I needed to mimic RAGatouille’s maximum query length.

Note that ColBERT doesn’t store the original passage _ids like RAGatouille does, so I have to look them up from the original passages with passages[idx]['_id'].

current_dir = os.path.abspath(".")
index_root = os.path.join(current_dir, ".ragatouille", "colbert", "indexes")
colbert_results = {}

for q in queries:
    query_length = int(len(q['text'].split(" ")) * 1.35) # this line comes from RAGatouille
    with Run().context(RunConfig(nranks=1)):
        searcher = Searcher(
            index="ConditionalQA_ColBERT_index_full",
            index_root=index_root,  
            config=ColBERTConfig(
                ncells=4,
                centroid_score_threshold=0.45,
                ndocs=1024,
                query_maxlen=query_length
            )
        )
    
        ranking = searcher.search(q['text'], k=10)
        colbert_results[q['_id']] = {passages[idx]['_id']: score for idx, score in list(zip(ranking[0], ranking[2]))}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})
metrics = evaluator.evaluate(colbert_results)
assert len(metrics) == len(set(qrels_rows["query_id"]))

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.2855810510889169

With the maximum query length adjusted, ColBERT yields the same Recall@10 as RAGatouille! This makes sense, as Benjamin pointed out in another tweet reply.

While the exact same recall is a good check, I’ll double check that for each query, the retrieved passage IDs and scores are identical between RAGatouille and ColBERT.

for i in colbert_results.keys():
    for pid, score in colbert_results[i].items():
        assert ragatouille_results[i][pid] == score

Searching RAGatouille Index with ColBERT (and vice versa)

As a final check of consistency, I’ll search the RAGatouille index with ColBERT and search the ColBERT index with RAGatouille and confirm that they yield the same retrieved passages and Recall@10.

current_dir = os.path.abspath(".")
index_root = os.path.join(current_dir, ".ragatouille", "colbert", "indexes")
colbert_results2 = {}

for q in queries:
    query_length = int(len(q['text'].split(" ")) * 1.35)
    with Run().context(RunConfig(nranks=1)):
        searcher = Searcher(
            index="ConditionalQA_RAGatouille_index_full",
            index_root=index_root,  
            config=ColBERTConfig(
                ncells=4,
                centroid_score_threshold=0.45,
                ndocs=1024,
                query_maxlen=query_length
            )
        )
    
        ranking = searcher.search(q['text'], k=10)
        colbert_results2[q['_id']] = {passages[idx]['_id']: score for idx, score in list(zip(ranking[0], ranking[2]))}

We get the same results as searching the ColBERT index with ColBERT! This further confirms that the two frameworks produce the same indexes (which is to be expected).

for i in colbert_results.keys():
    for pid, score in colbert_results[i].items():
        assert colbert_results2[i][pid] == score
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})
metrics = evaluator.evaluate(colbert_results2)
assert len(metrics) == len(set(qrels_rows["query_id"]))

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.2855810510889169

Finally, I’ll search the ColBERT index with RAGatouille and see if I get the same result. I certainly expect to!

Searching ColBERT Index with RAGatouille

RAGatouille creates two files (collection.json and pid_docid_map.json) that ColBERT does not, so we have to create them manually before RAGatouille can search the ColBERT index.

collection.json is just a list of the document collection text.

with open(f"{RAG_index_path}/collection.json", 'r') as f:
    RAG_collection = json.load(f)
len(RAG_collection)
69199
RAG_collection[:5]
['Overview',
 'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
 'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
 'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
 'What you’ll get']

pid_docid_map.json is a dictionary whose keys are passage indices in the collection (as strings) and whose values are the dataset’s _id strings.

with open(f"{RAG_index_path}/pid_docid_map.json", 'r') as f:
    RAG_pid_docid_map = json.load(f)
list(RAG_pid_docid_map.items())[0], list(RAG_pid_docid_map.items())[-1]
(('0', '0-0'), ('69198', '651-91'))
passages[-1]
{'_id': '651-91',
 'text': 'Trade union reps can be on picket lines at different workplaces if they’re responsible for organising workers in those workplaces.',
 'title': 'Taking part in industrial action and strikes',
 'doc_id': '651',
 'paragraph_no': 91,
 'total_paragraphs': 92,
 'is_candidate': True}

Saving the collection as JSON is simple enough: I just dump passages['text'] into a JSON file.

srsly.write_json(f"{ColBERT_index_path}/collection.json", passages['text'])

Creating pid_docid_map.json is also quite straightforward. I map from the index of the passage item to its _id.

pid_docid_map = {str(i): p['_id'] for i,p in enumerate(passages)}
list(pid_docid_map.items())[0], list(pid_docid_map.items())[-1]
(('0', '0-0'), ('69198', '651-91'))
srsly.write_json(f"{ColBERT_index_path}/pid_docid_map.json", pid_docid_map)

Let’s make sure these match the RAGatouille-built artifacts:

for i, _id in RAG_pid_docid_map.items(): assert _id == pid_docid_map[i]
for i, _id in pid_docid_map.items(): assert _id == RAG_pid_docid_map[i]

With those two files created, I can now create a RAGPretrainedModel object from_index using the ColBERT index.

RAG2 = RAGPretrainedModel.from_index(ColBERT_index_path)
Constructing default index configuration for index `None` as it does not contain RAGatouille specific metadata.
RAG2.model.config.ncells = 4
RAG2.model.config.centroid_score_threshold = 0.45
RAG2.model.config.ndocs = 1024

ragatouille_results2 = {}


for q in queries:
    results = RAG2.search(q['text'], k=10)
    ragatouille_results2[q['_id']] = {result['document_id']: float(result['score']) for result in results}

We get the same results as searching the RAGatouille index with RAGatouille, searching the ColBERT index with ColBERT and searching the RAGatouille index with ColBERT!

for i in colbert_results.keys():
    for pid, score in colbert_results[i].items():
        assert ragatouille_results2[i][pid] == score
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})
metrics = evaluator.evaluate(ragatouille_results2)
assert len(metrics) == len(set(qrels_rows["query_id"]))

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.2855810510889169

Closing Thoughts

Every interaction I’ve had with RAGatouille and ColBERT has been an awesome learning experience. I feel like inspecting their behavior and artifacts has left me with a better understanding of information retrieval in general. One small learning that I left out: ColBERT uses FAISS for k-means clustering, while for small document collections (such as my initial 1k subset) RAGatouille defaults to a PyTorch implementation. That difference, even with all relevant configuration parameters equal, resulted in different index artifacts: there was only about a 15% overlap between the centroids.pt tensors of the resulting RAGatouille and ColBERT indexes.
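
For reference, an overlap figure like that can be measured the same way as the _compare_pt checks above; with use_faiss=True (as used throughout this notebook) the centroids match exactly, and the ~15% number came from an earlier RAGatouille run without it:

# Sketch: fraction of element-wise identical values between the two centroids.pt files
r_centroids = torch.load(f"{RAG_index_path}/centroids.pt")
c_centroids = torch.load(f"{ColBERT_index_path}/centroids.pt")
(r_centroids == c_centroids).float().mean()  # 1.0 for the FAISS-built indexes in this notebook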

Another piece of motivation for me is that I needed to use both RAGatouille and ColBERT for indexing the full set of UKPLab/DAPR document collections, as ColBERT was able to index the larger collections (6M+) without crashing the kernel, while RAGatouille was not. In some initial experiments I was getting different mean Recall@10 values when using RAGatouille versus when using ColBERT (because I hadn’t incorporated the max query length code and probably had different configs). So I felt like this was a good opportunity, once and for all, to answer the two questions I listed at the start of this notebook:

  1. For a given document collection and indexing configuration, do RAGatouille and ColBERT produce the same index?
  2. For a given index and search configuration, do RAGatouille and ColBERT retrieve the same passages/Recall@10?

The answer to both is a resounding yes! With this knowledge under my belt, I can now move forward with indexing and searching using either library as I please.