BM25 and Cosine Similarity Demo

python
RAG
information retrieval
fastbookRAG
In this blog post I walk through a couple of toy examples using full text search and cosine similarity to answer queries using a toy dataset of documents.
Author

Vishal Bakshi

Published

July 31, 2024

Background

In this notebook I’ll work through a simple demo of full text search (using BM25 in SQLite) and cosine similarity (using sentence_transformers) to get my feet wet before I dive into building a pipeline for my fastbookRAG project (where I’ll be trying to answer questions from end-of-chapter questionnaires in the freely available fastai textbook using an LLM, full text search and embedding cosine similarity).

Cosine Similarity

For this demo, I intend to illustrate how cosine similarity can be used to capture differences in semantic meaning between texts.

I’ll heavily reference Jeremy Howard’s Hacker’s Guide to LLM RAG example.

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

Here’s my demo “database”—a few strings, with two of them intentionally similar (related to passing) to help illustrate the difference in semantic similarity with the query.

chunks = [
    'Jalen Hurts threw for 3858 yards in the 2023 season',
    'Jalen Hurts had a combined 38 TDs (23 passing, 15 rushing) in the 2023 season',
    "Hurts' number one target in the 2023 season was A.J. Brown with 158 targets"
]
q = "How many passing yards did Jalen Hurts have in the 2023 season?"

I’ll embed the data and the query:

data_embs = emb_model.encode(chunks, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
data_embs.shape, q_embs.shape
(torch.Size([3, 384]), torch.Size([384]))

With the power of broadcasting, I’ll calculate the cosine similarity (with dim=1) between the query and the entire dataset and sort it in descending order:

import torch
import torch.nn.functional as F
res = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)
res[0]
tensor([0.9298, 0.9263, 0.7729])

The record in my database that best matches the query is the correct answer:

chunks[res[1][0]]
'Jalen Hurts threw for 3858 yards in the 2023 season'

Although the second-highest incorrect answer is close (0.9263 vs 0.9298) maybe because it contains the word “passing”?

chunks[res[1][1]]
'Jalen Hurts had a combined 38 TDs (23 passing, 15 rushing) in the 2023 season'

Cosine Similarity of Top-3 BM25 Records

Obviously, I chose this whole setup because I saw that while BM25 didn’t rank the correct record the highest, it did rank it second-highest, so it would be in my top-3. I was hoping that cosine similarity would take care of the rest, and it did.

top3 = rts5_res[:3]
top3
['A.J. Brown had 25 more total catches than DeVonta Smith',
 'DeVonta Smith had 81 catches',
 'Jalen Hurts completed 81 passes to DeVonta Smith']

With this combined approach, cosine similarity correctly yields the correct response:

Jalen Hurts completed 81 passes to DeVonta Smith

My guess is that the cosine similarity between this record and the query is largest because of the words "completed" and "completions" being similar.

q
'How many completions to DeVonta Smith?'
data_embs = emb_model.encode(top3, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
data_embs.shape, q_embs.shape
(torch.Size([3, 384]), torch.Size([384]))
res = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)[1]
top3[res[0]]
'Jalen Hurts completed 81 passes to DeVonta Smith'

Weighted Cosine Similarity and BM25 Score

I was unable to achieve the correct response with a weighted average for this particular combination of query, BM25 results and cosine similarity results. The top-most ranked BM25 and cosine similarity scores (for the incorrect responses) were just too high to overcome.

Here are the BM25 results:

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"DeVonta Smith" OR "receptions" OR "catches" OR "total"'
ORDER BY rank

""")

rts5_res = res.fetchall()
rts5_res = [(el[0], -1 * el[1]) for el in rts5_res]
rts5_res
[('A.J. Brown had 25 more total catches than DeVonta Smith',
  1.2880369135898475),
 ('DeVonta Smith had 81 catches', 0.4059995941883513),
 ('Jalen Hurts completed 81 passes to DeVonta Smith', 1.029379760609358e-06),
 ('DeVonta Smith caught 7 touchdowns in the 2023 season',
  9.813278008298757e-07),
 ('DeVonta Smith caught 25 fewer passes than A.J. Brown',
  9.375619425173439e-07)]

And the cosine similarity results:

data_embs = emb_model.encode(chunks, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
values, indices = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)
cs_res = list(zip(values.tolist(), indices.tolist()))
cs_res = [(chunks[el[1]], el[0]) for el in cs_res]
cs_res
[('DeVonta Smith caught 7 touchdowns in the 2023 season', 0.7455482482910156),
 ('Jalen Hurts completed 81 passes to DeVonta Smith', 0.7004185914993286),
 ('DeVonta Smith had 81 catches', 0.6995621919631958),
 ('DeVonta Smith caught 25 fewer passes than A.J. Brown', 0.6626170873641968),
 ('A.J. Brown had 25 more total catches than DeVonta Smith',
  0.6136136651039124)]

I’ll combine the two in a DataFrame:

import pandas as pd

combined_list = [('BM25',) + item for item in rts5_res] + [('CS',) + item for item in cs_res]

df = pd.DataFrame(combined_list, columns=['method', 'record', 'score'])

df
method record score
0 BM25 A.J. Brown had 25 more total catches than DeVo... 1.288037e+00
1 BM25 DeVonta Smith had 81 catches 4.059996e-01
2 BM25 Jalen Hurts completed 81 passes to DeVonta Smith 1.029380e-06
3 BM25 DeVonta Smith caught 7 touchdowns in the 2023 ... 9.813278e-07
4 BM25 DeVonta Smith caught 25 fewer passes than A.J.... 9.375619e-07
5 CS DeVonta Smith caught 7 touchdowns in the 2023 ... 7.455482e-01
6 CS Jalen Hurts completed 81 passes to DeVonta Smith 7.004186e-01
7 CS DeVonta Smith had 81 catches 6.995622e-01
8 CS DeVonta Smith caught 25 fewer passes than A.J.... 6.626171e-01
9 CS A.J. Brown had 25 more total catches than DeVo... 6.136137e-01

I’ll normalize the BM25 scores and cosine similarity scores within their respective groups. After calculating the weighted average (with various weights) I couldn’t get the correct response to score the highest.

That’s not to say that a weighted average is never the right approach, but it doesn’t work well for this highly-curated toy example.

# normalize scores within each method
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

df['norm_score'] = df.groupby('method')['score'].transform(normalize)

# pivot
pivot_df = df.pivot(index='record', columns='method', values='norm_score').reset_index()
pivot_df.columns.name = None

# calculate weighted average (assuming equal weights of 0.5 for simplicity)
pivot_df['weighted_avg'] = pivot_df['BM25'] * 0.4 + pivot_df['CS'] * 0.6


pivot_df = pivot_df.sort_values('weighted_avg', ascending=False)
pivot_df
record BM25 CS weighted_avg
2 DeVonta Smith caught 7 touchdowns in the 2023 ... 3.397875e-08 1.000000 0.600000
3 DeVonta Smith had 81 catches 3.152075e-01 0.651448 0.516952
0 A.J. Brown had 25 more total catches than DeVo... 1.000000e+00 0.000000 0.400000
4 Jalen Hurts completed 81 passes to DeVonta Smith 7.128513e-08 0.657939 0.394764
1 DeVonta Smith caught 25 fewer passes than A.J.... 0.000000e+00 0.371422 0.222853

Final Thoughts

Even a toy example had me sweating at times! And I was clearly gaming the setup to get as close to my desired result as possible. As I move forward with experimenting with BM25 and cosine similarity and eventually combining them with an LLM to create a RAG pipeline for a much more complex dataset (multiple chapters from the fastai textbook), I’ll make sure that I respect the nuance involved and plan the project accordingly.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.