!pip install datasets ragatouille pytrec_eval ranx
Evaluating the DAPR ConditionalQA Dataset with RAGatouille
Setup
from datasets import load_dataset
from ragatouille import RAGPretrainedModel
import numpy as np
import pandas as pd
import pytrec_eval
from ranx import evaluate
from ranx import Qrels, Run
Background
I wanted to get familiar with classic information retrieval datasets, especially those with explicit documents. I searched with Perplexity and ChatGPT and came across DAPR: Document-Aware Passage Retrieval, which sounded perfect for my use case.
In this blog post I’ll work through evaluating the test split of the ConditionalQA dataset in DAPR, using RAGatouille and the answerai-colbert-small-v1 model for retrieval and the pytrec_eval and ranx libraries for evaluation. I’ll use the simple Recall@10 metric as it’s the easiest to manually check.
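As a refresher, Recall@10 for a single query is the fraction of that query’s relevant passages that appear in the top 10 retrieved results. Here’s a minimal sketch with made-up passage IDs:
# Hypothetical example: 3 relevant passages, 2 of them show up in the top 10
relevant = {"p1", "p2", "p3"}
top_10 = ["p1", "p4", "p2", "p5", "p6", "p7", "p8", "p9", "p10", "p11"]
recall_at_10 = len(relevant & set(top_10)) / len(relevant)
recall_at_10  # 2/3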
Load and View Data
Here are the three datasets we are going to use for this evaluation:
- ConditionalQA-corpus, our passages
- ConditionalQA-queries, our queries
- and ConditionalQA-qrels, the mapping between queries and passages.
passages = load_dataset("UKPLab/dapr", "ConditionalQA-corpus", split="test")
passages
Dataset({
features: ['_id', 'text', 'title', 'doc_id', 'paragraph_no', 'total_paragraphs', 'is_candidate'],
num_rows: 69199
})
passages[0]
{'_id': '0-0',
'text': 'Overview',
'title': 'Child Tax Credit',
'doc_id': '0',
'paragraph_no': 0,
'total_paragraphs': 77,
'is_candidate': True}
queries = load_dataset("UKPLab/dapr", "ConditionalQA-queries", split="test")
queries
Dataset({
features: ['_id', 'text'],
num_rows: 271
})
queries[0]
{'_id': 'dev-0',
'text': 'My brother and his wife are in prison for carrying out a large fraud scheme. Their 7 and 8 year old children have been living with me for the last 4 years. I want to become their Special Guardian to look after them permanently How long will it be before I hear back from the court?'}
qrels_rows = load_dataset("UKPLab/dapr", "ConditionalQA-qrels", split="test")
qrels_rows
Dataset({
features: ['query_id', 'corpus_id', 'score'],
num_rows: 1165
})
qrels_rows[0]
{'query_id': 'dev-0', 'corpus_id': '86-41', 'score': 1}
Load answerai-colbert-small-v1:
RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")
RAG
<ragatouille.RAGPretrainedModel.RAGPretrainedModel at 0x7e5328fdced0>
Structure the passages for indexing:
passages[:5]
{'_id': ['0-0', '0-1', '0-2', '0-3', '0-4'],
'text': ['Overview',
'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
'What you’ll get'],
'title': ['Child Tax Credit',
'Child Tax Credit',
'Child Tax Credit',
'Child Tax Credit',
'Child Tax Credit'],
'doc_id': ['0', '0', '0', '0', '0'],
'paragraph_no': [0, 1, 2, 3, 4],
'total_paragraphs': [77, 77, 77, 77, 77],
'is_candidate': [True, True, True, True, True]}
passage_texts = [p['text'] for p in passages]
passage_texts[:5]
['Overview',
'You can only make a claim for Child Tax Credit if you already get Working Tax Credit.',
'If you cannot apply for Child Tax Credit, you can apply for Universal Credit instead.',
'You might be able to apply for Pension Credit if you and your partner are State Pension age or over.',
'What you’ll get']
passage_ids = [p['_id'] for p in passages]
passage_ids[:5]
['0-0', '0-1', '0-2', '0-3', '0-4']
Build the index and run search
index_path = RAG.index(
    index_name="conditionalqa_index",
    collection=passage_texts,
    document_ids=passage_ids
)
Taking a look at the results for a single query. Each result has a content, score, rank, document_id, and passage_id. Note a bit of confusing terminology: document_id is actually the id of the item in the passages dataset and passage_id is an identifier created by RAGatouille, unrelated to the datasets.
results = RAG.search(queries[0]['text'], k=10)
results
[{'content': 'You must advertise your claim within 14 days from the day you get a date for the first court hearing. The advert must appear in a print or online newspaper that covers the missing person’s last known usual address.',
'score': 70.0,
'rank': 1,
'document_id': '107-103',
'passage_id': 10480},
{'content': 'The guardianship order will make you a guardian for a maximum of 4 years.',
'score': 70.0,
'rank': 2,
'document_id': '107-242',
'passage_id': 10619},
{'content': 'You can claim joint Housing Benefit for up to 13 weeks if one of you has gone to prison and is likely to return home in 13 weeks or less - including any time on remand.',
'score': 69.9375,
'rank': 3,
'document_id': '8-67',
'passage_id': 911},
{'content': 'The date will be either 14 or 28 days after your court hearing. If you’re in an exceptionally difficult situation, you may be able to convince the judge to delay this for up to 6 weeks.',
'score': 69.9375,
'rank': 4,
'document_id': '496-116',
'passage_id': 47939},
{'content': 'You can claim or continue to claim joint Council Tax Reduction if your partner’s expected to be in prison for 13 weeks or less – including any time on remand.',
'score': 69.875,
'rank': 5,
'document_id': '8-80',
'passage_id': 924},
{'content': 'Sometimes you’ll be given a 2 to 4 week period that you’ll need to keep free - this is known as a ‘warned period’ or ‘floating trial’. If this happens, you’ll be given 1 working day’s notice before you are due to go to court.',
'score': 69.875,
'rank': 6,
'document_id': '254-4',
'passage_id': 23999},
{'content': 'Your Child Benefit payments will stop after 8 weeks if your child goes to prison or is on remand. You’ll get arrears if they’re cleared of the offence.',
'score': 69.8125,
'rank': 7,
'document_id': '8-116',
'passage_id': 960},
{'content': 'You may be able to make a claim if you’re the dependant of someone who suffered from a dust-related disease but who has died. A dependant claim must be made within 12 months of the death of the sufferer.',
'score': 69.8125,
'rank': 8,
'document_id': '45-133',
'passage_id': 4921},
{'content': 'You’ll be responsible for looking after the child until they’re 18 (unless the court takes your responsibility away earlier).',
'score': 69.8125,
'rank': 9,
'document_id': '86-2',
'passage_id': 8150},
{'content': 'If it’s less than 90 days since the person went missing, explain you need the guardianship order urgently, for example, because the person is going to lose their house.',
'score': 69.8125,
'rank': 10,
'document_id': '107-43',
'passage_id': 10420}]
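As a quick sanity check on that terminology, we can map a result’s document_id back to its passage text using the passage_ids and passage_texts lists built earlier (a minimal sketch):
# Look up the passage text for the top result's document_id
id_to_text = dict(zip(passage_ids, passage_texts))
id_to_text[results[0]['document_id']]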
Evaluation
I’ll prepare qrels for the pytrec_eval evaluator as is done in the DAPR dataset card example on HF:
qrels = {}
for qrel_row in qrels_rows:
    qid = qrel_row["query_id"]
    pid = qrel_row["corpus_id"]
    rel = qrel_row["score"]
    qrels.setdefault(qid, {})
    qrels[qid][pid] = rel
dev-5 is a query ID with multiple relevant passages, so I’ve chosen it as the test example:
qid = 'dev-5'
qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}
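If you want to find such queries yourself, one way is to list the query IDs with more than one relevant passage (a small sketch):
# Query IDs whose qrels contain more than one relevant passage
multi_passage_qids = [q for q, rels in qrels.items() if len(rels) > 1]
multi_passage_qids[:5]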
pytrec_results = {}
pytrec_results
{}
Next we’ll run retrieval and structure results for the pytrec_eval evaluator, again copying the DAPR example, which structures the retrieval results as:
retrieval_scores[query_id][passage_id] = score
Note again that RAGatouille’s document_id corresponds to the passage ID here.
for q in queries:
    results = RAG.search(q['text'], k=10)
    pytrec_results[q['_id']] = {result['document_id']: float(result['score']) for result in results}
We can see the 10 retrieved passages, each with a corresponding score.
pytrec_results[qid]
{'61-1': 71.125,
'423-16': 70.5625,
'61-27': 70.4375,
'61-109': 70.375,
'61-110': 70.25,
'61-113': 70.25,
'61-114': 70.25,
'426-22': 70.1875,
'420-42': 70.1875,
'423-7': 70.125}
Calculate Recall for all queries and view a single query’s Recall:
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recall.10'})
There are 271 queries and 271 metrics (one per query):
metrics = evaluator.evaluate(pytrec_results)
len(metrics)
271
For our dev-5 query the Recall@10 is 0.167, or 1/6.
metrics[qid]
{'recall_10': 0.16666666666666666}
Here are the 6 passages that we needed to retrieve to fully answer this question:
qrels[qid]
{'61-1': 1, '61-4': 1, '61-5': 1, '61-17': 1, '61-37': 1, '61-39': 1}
And here are the results again: only 1 relevant passage, 61-1, was retrieved.
pytrec_results[qid]
{'61-1': 71.125,
'423-16': 70.5625,
'61-27': 70.4375,
'61-109': 70.375,
'61-110': 70.25,
'61-113': 70.25,
'61-114': 70.25,
'426-22': 70.1875,
'420-42': 70.1875,
'423-7': 70.125}
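Since Recall@10 is easy to check by hand, here’s a quick manual confirmation of the 1/6 figure (a small sketch):
# Relevant passages retrieved in the top 10, divided by total relevant passages for dev-5
relevant = set(qrels[qid])
retrieved = set(pytrec_results[qid])
len(relevant & retrieved) / len(relevant)  # 1/6 ≈ 0.167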
Calculating the mean Recall across all queries gives us the mean Recall@10 for the entire collection:
mean_recall = sum(metrics[qid]['recall_10'] for qid in metrics.keys()) / len(metrics)
mean_recall
0.28046940381859803
So, on average, about 28% of a query’s relevant passages were present in the top-10 retrieved passages.
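Since pandas is already imported, the same mean can also be read off a per-query DataFrame (a small sketch):
# One row per query; the column mean should match mean_recall above
recall_df = pd.DataFrame.from_dict(metrics, orient='index')
recall_df['recall_10'].mean()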
I wanted to confirm my calculation so I’ll also calculate Recall@10 using the ranx library.
qrels_ranx = Qrels(qrels)
ranx_results = Run(pytrec_results)
"recall@10") evaluate(qrels_ranx, ranx_results,
0.2804694038185978
And we get the same results. Great!
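While we’re at it, ranx can also compute several metrics in a single call, which will come in handy for the follow-up posts (a small sketch; the extra metric names are just examples):
# Evaluate multiple metrics at once; returns a dict keyed by metric name
evaluate(qrels_ranx, ranx_results, ["recall@10", "ndcg@10", "mrr@10"])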
Final Thoughts
In a future blog post I’ll calculate Recall@10 (and other metrics) on all of the datasets included in DAPR:
- ConditionalQA
- MS MARCO
- Genomics
- MIRACL
- Natural Questions
Once that’s done, I’ll pick a few different retrieval models and compare their results across these datasets.
I think by the end of these experiments I’ll have a better grasp on how to work with classic IR datasets and metrics.