Evaluating the DAPR ConditionalQA Dataset with RAGatouille

python
information retrieval
deep learning
I calculate the Recall@10 metric for answerai-colbert-small-v1 retrieval (via RAGatouille) on the ConditionalQA dataset (from the UKPLab/DAPR dataset) using the pytrec_eval and ranx libraries.
Author

Vishal Bakshi

Published

February 8, 2025

Setup

!pip install datasets ragatouille pytrec_eval ranx
from datasets import load_dataset
from ragatouille import RAGPretrainedModel

import numpy as np
import pandas as pd

import pytrec_eval
from ranx import evaluate
from ranx import Qrels, Run

Background

I wanted to get familiar with classic information retrieval datasets, especially those with explicit documents. I searched with Perplexity and ChatGPT and came across DAPR: Document-Aware Passage Retrieval which sounded perfect for my use case.

In this blog post I’ll work through evaluating the test split of the ConditionalQA dataset in DAPR, using RAGatouille and the answerai-colbert-small-v1 model for retrieval and the pytrec_eval and ranx libraries for evaluation. I’ll use the simple Recall@10 metric as it’s the easiest to check manually.

Load and View Data

Here are the three datasets we are going to use for this evaluation:

  • ConditionalQA-corpus, our passages
  • ConditionalQA-queries, our queries
  • and ConditionalQA-qrels, the mapping between queries and passages.
passages = load_dataset("UKPLab/dapr", "ConditionalQA-corpus", split="test")
passages
Dataset({
    features: ['_id', 'text', 'title', 'doc_id', 'paragraph_no', 'total_paragraphs', 'is_candidate'],
    num_rows: 69199
})
passages[0]
{'_id': '0-0',
 'text': 'Overview',
 'title': 'Child Tax Credit',
 'doc_id': '0',
 'paragraph_no': 0,
 'total_paragraphs': 77,
 'is_candidate': True}
queries = load_dataset("UKPLab/dapr", "ConditionalQA-queries", split="test")
queries
Dataset({
    features: ['_id', 'text'],
    num_rows: 271
})
queries[0]
{'_id': 'dev-0',
 'text': 'My brother and his wife are in prison for carrying out a large fraud scheme. Their 7 and 8 year old children have been living with me for the last 4 years. I want to become their Special Guardian to look after them permanently How long will it be before I hear back from the court?'}
qrels_rows = load_dataset("UKPLab/dapr", "ConditionalQA-qrels", split="test")
qrels_rows
Dataset({
    features: ['query_id', 'corpus_id', 'score'],
    num_rows: 1165
})
qrels_rows[0]
{'query_id': 'dev-0', 'corpus_id': '86-41', 'score': 1}

Load answerai-colbert-small-v1:

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")
RAG
<ragatouille.RAGPretrainedModel.RAGPretrainedModel at 0x7e5328fdced0>

Structure the passages for indexing:

passages[:5]
{'_id': ['0-0', '0-1', '0-2', '0-3', '0-4'],
 'text': ['Overview',
  'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
  'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
  'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
  'What you’ll get'],
 'title': ['Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit'],
 'doc_id': ['0', '0', '0', '0', '0'],
 'paragraph_no': [0, 1, 2, 3, 4],
 'total_paragraphs': [77, 77, 77, 77, 77],
 'is_candidate': [True, True, True, True, True]}
passage_texts = [p['text'] for p in passages]
passage_texts[:5]
['Overview',
 'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
 'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
 'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
 'What you’ll get']
passage_ids = [p['_id'] for p in passages]
passage_ids[:5]
['0-0', '0-1', '0-2', '0-3', '0-4']
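
Before we can call RAG.search, the passages need to be indexed. The indexing call isn’t shown in this post, but a minimal sketch using RAGatouille’s index method (the index_name below is a placeholder of my choosing, not from the original run) would look something like this:

RAG.index(
    collection=passage_texts,    # passage texts to embed and index
    document_ids=passage_ids,    # so search results report the passage _id
    index_name="conditionalqa",  # placeholder name
)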

Evaluation

I’ll prepare qrels for the pytrec_eval evaluator, as is done in the DAPR dataset card example on HF:

qrels = {}
for qrel_row in qrels_rows:
    qid = qrel_row["query_id"]
    pid = qrel_row["corpus_id"]
    rel = qrel_row["score"]
    qrels.setdefault(qid, {})
    qrels[qid][pid] = rel

dev-5 is a query ID with multiple relevant passages, so I’ve chosen it as the test example:

qid = 'dev-5'
qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}
pytrec_results = {}
pytrec_results
{}

Next we’ll run retrieval and structure the results for the pytrec_eval evaluator, again copying the DAPR example, which structures the retrieval results as:

retrieval_scores[query_id][passage_id] = score

Note that in this context document_id means passage_id: the document_id values returned by RAGatouille are the passage _ids.

for q in queries:
    results = RAG.search(q['text'], k=10)
    pytrec_results[q['_id']] = {result['document_id']: float(result['score']) for result in results}

We can see the 10 retrieved passages for our dev-5 query, each with a corresponding score.

pytrec_results[qid]
{'61-1': 71.125,
 '423-16': 70.5625,
 '61-27': 70.4375,
 '61-109': 70.375,
 '61-110': 70.25,
 '61-113': 70.25,
 '61-114': 70.25,
 '426-22': 70.1875,
 '420-42': 70.1875,
 '423-7': 70.125}

Calculate Recall@10 for all queries and view a single query’s Recall:

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})

There are 271 queries and 271 metrics (one per query):

metrics = evaluator.evaluate(pytrec_results)
len(metrics)
271

For our dev-5 query the Recall@10 is 1/6, or about 0.167 (1 of its 6 relevant passages appears in the top-10 retrieved).

metrics[qid]
{'recall_10': 0.16666666666666666}

Here are the 6 passages that we needed to retrieve to fully answer this question:

qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}

And here are the results again—only 1 relevant passage, 61-1, was retrieved.

pytrec_results[qid]
{'61-1': 71.125,
 '423-16': 70.5625,
 '61-27': 70.4375,
 '61-109': 70.375,
 '61-110': 70.25,
 '61-113': 70.25,
 '61-114': 70.25,
 '426-22': 70.1875,
 '420-42': 70.1875,
 '423-7': 70.125}
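
Since Recall@10 is easy to check by hand, here’s a quick manual verification (a sketch reusing the qrels and pytrec_results variables from above):

relevant = set(qrels[qid])            # the 6 relevant passage IDs for dev-5
retrieved = set(pytrec_results[qid])  # the 10 retrieved passage IDs
len(relevant & retrieved) / len(relevant)  # 1/6 ≈ 0.167, matching pytrec_eval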

Calculating the mean Recall@10 across all queries:

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.28046940381859803

So, averaged across all queries, about 28% of each query’s relevant passages were present in the top-10 retrieved passages.
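
As an aside, since numpy was already imported in the setup, the same mean can be computed more compactly (an equivalent sketch, not from the original post):

np.mean([m['recall_10'] for m in metrics.values()])  # ≈ 0.2805, same value as above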

To confirm my calculation, I’ll also calculate Recall@10 using the ranx library.

qrels_ranx = Qrels(qrels)
ranx_results = Run(pytrec_results)
evaluate(qrels_ranx, ranx_results, "recall@10")
0.2804694038185978

And we get the same result. Great!

Final Thoughts

In a future blog post I’ll calculate Recall@10 (and other metrics) on all of the datasets included in DAPR:

  • ConditionalQA
  • MS MARCO
  • Genomics
  • MIRACL
  • Natural Questions

Once that’s done, I’ll pick a few different retrieval models and compare their results across these datasets.

I think by the end of these experiments I’ll have a better grasp on how to work with classic IR datasets and metrics.