Evaluating the DAPR ConditionalQA Dataset with RAGatouille

python
information retrieval
deep learning
I calculate the Recall@10 metric for answerai-colbert-small-v1 retrieval (via RAGatouille) on the ConditionalQA dataset (from the UKPLab/DAPR dataset) using the pytrec_eval and ranx libraries.
Author

Vishal Bakshi

Published

February 8, 2025

Setup

!pip install datasets ragatouille pytrec_eval ranx
from datasets import load_dataset
from ragatouille import RAGPretrainedModel

import numpy as np
import pandas as pd

import pytrec_eval
from ranx import evaluate
from ranx import Qrels, Run

Background

I wanted to get familiar with classic information retrieval datasets, especially those with explicit documents. I searched with Perplexity and ChatGPT and came across DAPR: Document-Aware Passage Retrieval which sounded perfect for my use case.

In this blog post I’ll work through evaluating the test split of the ConditionalQA dataset in DAPR, using RAGatouille and the answerai-colbert-small-v1 model for retrieval and the pytrec_eval and ranx libraries for evaluation. I’ll use the simple Recall@10 metric as it’s the easiest to check manually.

Load and View Data

Here are the three datasets we are going to use for this evaluation:

  • ConditionalQA-corpus, our passages
  • ConditionalQA-queries, our queries
  • and ConditionalQA-qrels, the mapping between queries and passages.
passages = load_dataset("UKPLab/dapr", "ConditionalQA-corpus", split="test")
passages
Dataset({
    features: ['_id', 'text', 'title', 'doc_id', 'paragraph_no', 'total_paragraphs', 'is_candidate'],
    num_rows: 69199
})
passages[0]
{'_id': '0-0',
 'text': 'Overview',
 'title': 'Child Tax Credit',
 'doc_id': '0',
 'paragraph_no': 0,
 'total_paragraphs': 77,
 'is_candidate': True}
queries = load_dataset("UKPLab/dapr", "ConditionalQA-queries", split="test")
queries
Dataset({
    features: ['_id', 'text'],
    num_rows: 271
})
queries[0]
{'_id': 'dev-0',
 'text': 'My brother and his wife are in prison for carrying out a large fraud scheme. Their 7 and 8 year old children have been living with me for the last 4 years. I want to become their Special Guardian to look after them permanently How long will it be before I hear back from the court?'}
qrels_rows = load_dataset("UKPLab/dapr", "ConditionalQA-qrels", split="test")
qrels_rows
Dataset({
    features: ['query_id', 'corpus_id', 'score'],
    num_rows: 1165
})
qrels_rows[0]
{'query_id': 'dev-0', 'corpus_id': '86-41', 'score': 1}

Load answerai-colbert-small-v1:

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")
RAG
<ragatouille.RAGPretrainedModel.RAGPretrainedModel at 0x7e5328fdced0>

Structure the passages for indexing:

passages[:5]
{'_id': ['0-0', '0-1', '0-2', '0-3', '0-4'],
 'text': ['Overview',
  'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
  'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
  'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
  'What you’ll get'],
 'title': ['Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit',
  'Child Tax Credit'],
 'doc_id': ['0', '0', '0', '0', '0'],
 'paragraph_no': [0, 1, 2, 3, 4],
 'total_paragraphs': [77, 77, 77, 77, 77],
 'is_candidate': [True, True, True, True, True]}
passage_texts = [p['text'] for p in passages]
passage_texts[:5]
['Overview',
 'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
 'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
 'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
 'What you’ll get']
passage_ids = [p['_id'] for p in passages]
passage_ids[:5]
['0-0', '0-1', '0-2', '0-3', '0-4']
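
Before we can call RAG.search, the passages need to be indexed. The indexing call isn’t shown in this post, but a minimal sketch using RAGatouille’s index method (the index_name below is a placeholder of my choosing, not from the original run) would look something like this:

RAG.index(
    collection=passage_texts,    # passage texts to embed and index
    document_ids=passage_ids,    # so search results report the passage _id
    index_name="conditionalqa",  # placeholder name
)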

Evaluation

I’ll prepare qrels for the pytrec_eval evaluator, as is done in the DAPR dataset card example on HF:

qrels = {}
for qrel_row in qrels_rows:
    qid = qrel_row["query_id"]
    pid = qrel_row["corpus_id"]
    rel = qrel_row["score"]
    qrels.setdefault(qid, {})
    qrels[qid][pid] = rel

dev-5 is a query ID with multiple relevant passages, so I’ve chosen it as the test example:

qid = 'dev-5'
qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}
pytrec_results = {}
pytrec_results
{}

Next we’ll run retrieval and structure the results for the pytrec_eval evaluator, again copying the DAPR example, which structures the retrieval results as:

retrieval_scores[query_id][passage_id] = score

Note that in this context document_id means passage_id: the document_id values returned by RAGatouille are the passage _ids.

for q in queries:
    results = RAG.search(q['text'], k=10)
    pytrec_results[q['_id']] = {result['document_id']: float(result['score']) for result in results}

We can see the 10 retrieved passages for our dev-5 query, each with a corresponding score.

pytrec_results[qid]
{'61-1': 71.125,
 '423-16': 70.5625,
 '61-27': 70.4375,
 '61-109': 70.375,
 '61-110': 70.25,
 '61-113': 70.25,
 '61-114': 70.25,
 '426-22': 70.1875,
 '420-42': 70.1875,
 '423-7': 70.125}

Calculate Recall@10 for all queries and view a single query’s Recall:

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})

There are 271 queries and 271 metrics (one per query):

metrics = evaluator.evaluate(pytrec_results)
len(metrics)
271

For our dev-5 query the Recall@10 is 1/6, or about 0.167 (1 of its 6 relevant passages appears in the top-10 retrieved).

metrics[qid]
{'recall_10': 0.16666666666666666}

Here are the 6 passages that we needed to retrieve to fully answer this question:

qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}

And here are the results again—only 1 relevant passage, 61-1, was retrieved.

pytrec_results[qid]
{'61-1': 71.125,
 '423-16': 70.5625,
 '61-27': 70.4375,
 '61-109': 70.375,
 '61-110': 70.25,
 '61-113': 70.25,
 '61-114': 70.25,
 '426-22': 70.1875,
 '420-42': 70.1875,
 '423-7': 70.125}
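
Since Recall@10 is easy to check by hand, here’s a quick manual verification (a sketch reusing the qrels and pytrec_results variables from above):

relevant = set(qrels[qid])            # the 6 relevant passage IDs for dev-5
retrieved = set(pytrec_results[qid])  # the 10 retrieved passage IDs
len(relevant & retrieved) / len(relevant)  # 1/6 ≈ 0.167, matching pytrec_eval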

Calculating the mean Recall@10 across all queries:

mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.28046940381859803

So, averaged across all queries, about 28% of each query’s relevant passages were present in the top-10 retrieved passages.
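
As an aside, since numpy was already imported in the setup, the same mean can be computed more compactly (an equivalent sketch, not from the original post):

np.mean([m['recall_10'] for m in metrics.values()])  # ≈ 0.2805, same value as above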

To confirm my calculation, I’ll also calculate Recall@10 using the ranx library.

qrels_ranx = Qrels(qrels)
ranx_results = Run(pytrec_results)
evaluate(qrels_ranx, ranx_results, "recall@10")
0.2804694038185978

And we get the same result. Great!

Final Thoughts

In a future blog post I’ll calculate Recall@10 (and other metrics) on all of the datasets included in DAPR:

  • ConditionalQA
  • MS MARCO
  • Genomics
  • MIRACL
  • Natural Questions

Once that’s done, I’ll pick a few different retrieval models and compare their results across these datasets.

I think by the end of these experiments I’ll have a better grasp on how to work with classic IR datasets and metrics.