BM25 and Cosine Similarity Demo

python

RAG

information retrieval

fastbookRAG

In this blog post I walk through a couple of toy examples using full text search and cosine similarity to answer queries using a toy dataset of documents.

Author

Vishal Bakshi

Published

July 31, 2024

Background

In this notebook I’ll work through a simple demo of full text search (using BM25 in SQLite) and cosine similarity (using sentence_transformers) to get my feet wet before I dive into building a pipeline for my fastbookRAG project (where I’ll be trying to answer questions from end-of-chapter questionnaires in the freely available fastai textbook using an LLM, full text search and embedding cosine similarity).

SQLite Full Text Search

In this demo I intend to show how different keywords can be used to match different texts during full text search.

SQLite is built-in to the python standard library and has a full text search (using BM25) built-in! I’ll start by creating a demo database.

import sqlite3

conn = sqlite3.connect('my.db')
conn.close()

Full text search is available in virtual tables, so I’ll create a table with a single column which will hold the text data that I want to search when responding to a query. The key statement is USING FTS5(text);.

conn = sqlite3.connect('my.db')
cur = conn.cursor()
res = cur.execute("""

CREATE VIRTUAL TABLE virtual_text_data
USING FTS5(text);

""")
conn.close()

I’ll now populate my demo “database” with my demo “dataset” that contains three lines of natural language text that I wrote based on ESPN stats (and my opinions). I have intentionally used different parts of the name "Philadelphia Eagles" in each record:

conn = sqlite3.connect('my.db')
cur = conn.cursor()
res = cur.execute("""

INSERT INTO virtual_text_data(text)
VALUES
 ('The Philadelphia Eagles are set to have a great season in 2024'),
 ('The Eagles signed superstar running back Saquon Barkley in the offseason'),
 ('Philadelphia went 1-7 to finish their season last year');

 """)
conn.commit()
conn.close()

conn = sqlite3.connect('my.db')
cur = conn.cursor()
res = cur.execute("SELECT * from virtual_text_data")
res.fetchall()

[('The Philadelphia Eagles are set to have a great season in 2024',),
 ('The Eagles signed superstar running back Saquon Barkley in the offseason',),
 ('Philadelphia went 1-7 to finish their season last year',)]

Finally, I’ll perform full text search by passing a set of keywords in my SQL query using three ways to refer to the Philadelphia Eagles. The key statement is:

WHERE virtual_text_data MATCH '"Philadelphia Eagles" OR "Eagles" OR "Philadelphia"'

The results are as expected—the record with the full term "Philadelphia Eagles" (all three keywords match) has the highest BM25 rank while the other two have equal scores as they both contain just one keyword ("Philadelphia" or "Eagles"). Note that SQLite multiplies the BM25 rank by -1 since by default it returns ascending order.

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"Philadelphia Eagles" OR "Eagles" OR "Philadelphia"'
ORDER BY rank

""")

res.fetchall()

[('The Philadelphia Eagles are set to have a great season in 2024',
  -0.4925110954237839),
 ('Philadelphia went 1-7 to finish their season last year',
  -1.03862660944206e-06),
 ('The Eagles signed superstar running back Saquon Barkley in the offseason',
  -1e-06)]

Changing the order of keywords does not change the BM25 ranking:

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"Eagles" OR "Philadelphia Eagles" OR "Philadelphia"'
ORDER BY rank

""")

res.fetchall()

[('The Philadelphia Eagles are set to have a great season in 2024',
  -0.4925110954237839),
 ('Philadelphia went 1-7 to finish their season last year',
  -1.03862660944206e-06),
 ('The Eagles signed superstar running back Saquon Barkley in the offseason',
  -1e-06)]

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"Philadelphia" OR "Philadelphia Eagles" OR "Eagles"'
ORDER BY rank

""")

res.fetchall()

[('The Philadelphia Eagles are set to have a great season in 2024',
  -0.4925110954237839),
 ('Philadelphia went 1-7 to finish their season last year',
  -1.03862660944206e-06),
 ('The Eagles signed superstar running back Saquon Barkley in the offseason',
  -1e-06)]

In this way, I can use a set of keywords to perform a full text search of my database. In my fastbookRAG project, I’ll be applying the same approach: querying with a set of keywords a virtual table containing chunks of text from the fastbook chapters.

Cosine Similarity

For this demo, I intend to illustrate how cosine similarity can be used to capture differences in semantic meaning between texts.

I’ll heavily reference Jeremy Howard’s Hacker’s Guide to LLM RAG example.

!pip install sentence-transformers -Uqq
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

Here’s my demo “database”—a few strings, with two of them intentionally similar (related to passing) to help illustrate the difference in semantic similarity with the query.

chunks = [
    'Jalen Hurts threw for 3858 yards in the 2023 season',
    'Jalen Hurts had a combined 38 TDs (23 passing, 15 rushing) in the 2023 season',
    "Hurts' number one target in the 2023 season was A.J. Brown with 158 targets"
]

q = "How many passing yards did Jalen Hurts have in the 2023 season?"

I’ll embed the data and the query:

data_embs = emb_model.encode(chunks, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
data_embs.shape, q_embs.shape

(torch.Size([3, 384]), torch.Size([384]))

With the power of broadcasting, I’ll calculate the cosine similarity (with dim=1) between the query and the entire dataset and sort it in descending order:

import torch
import torch.nn.functional as F

res = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)
res[0]

tensor([0.9298, 0.9263, 0.7729])

The record in my database that best matches the query is the correct answer:

chunks[res[1][0]]

'Jalen Hurts threw for 3858 yards in the 2023 season'

Although the second-highest incorrect answer is close (0.9263 vs 0.9298) maybe because it contains the word “passing”?

chunks[res[1][1]]

'Jalen Hurts had a combined 38 TDs (23 passing, 15 rushing) in the 2023 season'

Hybrid Search

I’ll now create an example demonstration of combining both full text search and cosine similarity to answer a query. I have intentionally chosen a set of records where full text search and cosine similarity individually do not provide the correct answer as the top result but both combined do. Reaching this result was considerably harder than I expected.

chunks = [
    'Jalen Hurts completed 81 passes to DeVonta Smith',
    'A.J. Brown had 25 more total catches than DeVonta Smith',
    "DeVonta Smith caught 25 fewer passes than A.J. Brown",
    "DeVonta Smith caught 7 touchdowns in the 2023 season",
    "DeVonta Smith had 81 catches",
]

I created a helper function to expedite my iterations as I tried at least a dozen different text combinations:

def prep_query(chunks):
  insert_q = ''

  for i, chunk in enumerate(chunks):
    if i != len(chunks)-1:
      insert_q += '("' + chunk + '"),\n'
    else:
      insert_q += '("' + chunk + '");'

  q = f"""INSERT INTO virtual_text_data(text)
  VALUES
  {insert_q}
  """

  return q

q = prep_query(chunks)
print(q)

INSERT INTO virtual_text_data(text)
  VALUES 
  ("Jalen Hurts completed 81 passes to DeVonta Smith"),
("A.J. Brown had 25 more total catches than DeVonta Smith"),
("DeVonta Smith caught 25 fewer passes than A.J. Brown"),
("DeVonta Smith caught 7 touchdowns in the 2023 season"),
("DeVonta Smith had 81 catches");

res = cur.execute('DELETE from virtual_text_data;')
conn.commit()

res = cur.execute(q)
conn.commit()

res = cur.execute("SELECT * from virtual_text_data")
res.fetchall()

[('Jalen Hurts completed 81 passes to DeVonta Smith',),
 ('A.J. Brown had 25 more total catches than DeVonta Smith',),
 ('DeVonta Smith caught 25 fewer passes than A.J. Brown',),
 ('DeVonta Smith caught 7 touchdowns in the 2023 season',),
 ('DeVonta Smith had 81 catches',)]

The keywords I chose:

'"DeVonta Smith" OR "receptions" OR "catches" OR "total"'

are aiming to get a result that tells us how many total receptions (or catches) that DeVonta Smith had. The correct “document” from the “database” is:

'DeVonta Smith had 81 catches'

BM25’s top ranked record is not the correct one:

'A.J. Brown had 25 more total catches than DeVonta Smith'

I could be wrong, but BM25 might have chosen this record because it had the most keyword matches (“DeVonta Smith”, “total” and “catches”) while the correct record only has two keyword matches (“DeVonta Smith” and “catches”).

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"DeVonta Smith" OR "receptions" OR "catches" OR "total"'
ORDER BY rank

""")

rts5_res = [el[0] for el in res.fetchall()]
rts5_res

['A.J. Brown had 25 more total catches than DeVonta Smith',
 'DeVonta Smith had 81 catches',
 'Jalen Hurts completed 81 passes to DeVonta Smith',
 'DeVonta Smith caught 7 touchdowns in the 2023 season',
 'DeVonta Smith caught 25 fewer passes than A.J. Brown']

For cosine similarity, I intentionally chose a slightly ambiguous query, expecting it to perform worse because of the indirect way of asking the desired question. The query is implicitly asking “How many of Jalen Hurts’ completions were to DeVonta Smith?”.

I found that this caused the record with the largest cosine similarity to be incorrect:

'DeVonta Smith caught 7 touchdowns in the 2023 season'

I don’t have a good sense of why this string is closest to the query in embedding space.

q = "How many completions to DeVonta Smith?"

data_embs = emb_model.encode(chunks, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
data_embs.shape, q_embs.shape

(torch.Size([5, 384]), torch.Size([384]))

res = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)[1]
res

tensor([3, 0, 4, 2, 1])

chunks[res[0]] # highest cosine similarity

'DeVonta Smith caught 7 touchdowns in the 2023 season'

So far so good—both BM25 and cosine similarity have failed to return the correct answer to my query.

I’ll combine the two methods in two ways that I’ve heard of before:

Pick the top n (in this case 3) BM25 ranked responses, calculate cosine similarity between the query and those responses and pick the record with the highest cosine similarity.
Weigh the BM25 rank and cosine similarity and pick the record with the highest weighted score.

Cosine Similarity of Top-3 BM25 Records

Obviously, I chose this whole setup because I saw that while BM25 didn’t rank the correct record the highest, it did rank it second-highest, so it would be in my top-3. I was hoping that cosine similarity would take care of the rest, and it did.

top3 = rts5_res[:3]
top3

['A.J. Brown had 25 more total catches than DeVonta Smith',
 'DeVonta Smith had 81 catches',
 'Jalen Hurts completed 81 passes to DeVonta Smith']

With this combined approach, cosine similarity correctly yields the correct response:

Jalen Hurts completed 81 passes to DeVonta Smith

My guess is that the cosine similarity between this record and the query is largest because of the words "completed" and "completions" being similar.

'How many completions to DeVonta Smith?'

data_embs = emb_model.encode(top3, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
data_embs.shape, q_embs.shape

(torch.Size([3, 384]), torch.Size([384]))

res = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)[1]
top3[res[0]]

'Jalen Hurts completed 81 passes to DeVonta Smith'

Weighted Cosine Similarity and BM25 Score

I was unable to achieve the correct response with a weighted average for this particular combination of query, BM25 results and cosine similarity results. The top-most ranked BM25 and cosine similarity scores (for the incorrect responses) were just too high to overcome.

Here are the BM25 results:

res = cur.execute("""

SELECT *, rank
  from virtual_text_data
WHERE virtual_text_data MATCH '"DeVonta Smith" OR "receptions" OR "catches" OR "total"'
ORDER BY rank

""")

rts5_res = res.fetchall()
rts5_res = [(el[0], -1 * el[1]) for el in rts5_res]
rts5_res

[('A.J. Brown had 25 more total catches than DeVonta Smith',
  1.2880369135898475),
 ('DeVonta Smith had 81 catches', 0.4059995941883513),
 ('Jalen Hurts completed 81 passes to DeVonta Smith', 1.029379760609358e-06),
 ('DeVonta Smith caught 7 touchdowns in the 2023 season',
  9.813278008298757e-07),
 ('DeVonta Smith caught 25 fewer passes than A.J. Brown',
  9.375619425173439e-07)]

And the cosine similarity results:

data_embs = emb_model.encode(chunks, convert_to_tensor=True)
q_embs = emb_model.encode(q, convert_to_tensor=True)
values, indices = F.cosine_similarity(q_embs, data_embs, dim=1).sort(descending=True)
cs_res = list(zip(values.tolist(), indices.tolist()))
cs_res = [(chunks[el[1]], el[0]) for el in cs_res]
cs_res

[('DeVonta Smith caught 7 touchdowns in the 2023 season', 0.7455482482910156),
 ('Jalen Hurts completed 81 passes to DeVonta Smith', 0.7004185914993286),
 ('DeVonta Smith had 81 catches', 0.6995621919631958),
 ('DeVonta Smith caught 25 fewer passes than A.J. Brown', 0.6626170873641968),
 ('A.J. Brown had 25 more total catches than DeVonta Smith',
  0.6136136651039124)]

I’ll combine the two in a DataFrame:

import pandas as pd

combined_list = [('BM25',) + item for item in rts5_res] + [('CS',) + item for item in cs_res]

df = pd.DataFrame(combined_list, columns=['method', 'record', 'score'])

df

	method	record	score
0	BM25	A.J. Brown had 25 more total catches than DeVo...	1.288037e+00
1	BM25	DeVonta Smith had 81 catches	4.059996e-01
2	BM25	Jalen Hurts completed 81 passes to DeVonta Smith	1.029380e-06
3	BM25	DeVonta Smith caught 7 touchdowns in the 2023 ...	9.813278e-07
4	BM25	DeVonta Smith caught 25 fewer passes than A.J....	9.375619e-07
5	CS	DeVonta Smith caught 7 touchdowns in the 2023 ...	7.455482e-01
6	CS	Jalen Hurts completed 81 passes to DeVonta Smith	7.004186e-01
7	CS	DeVonta Smith had 81 catches	6.995622e-01
8	CS	DeVonta Smith caught 25 fewer passes than A.J....	6.626171e-01
9	CS	A.J. Brown had 25 more total catches than DeVo...	6.136137e-01

I’ll normalize the BM25 scores and cosine similarity scores within their respective groups. After calculating the weighted average (with various weights) I couldn’t get the correct response to score the highest.

That’s not to say that a weighted average is never the right approach, but it doesn’t work well for this highly-curated toy example.

# normalize scores within each method
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

df['norm_score'] = df.groupby('method')['score'].transform(normalize)

# pivot
pivot_df = df.pivot(index='record', columns='method', values='norm_score').reset_index()
pivot_df.columns.name = None

# calculate weighted average (assuming equal weights of 0.5 for simplicity)
pivot_df['weighted_avg'] = pivot_df['BM25'] * 0.4 + pivot_df['CS'] * 0.6


pivot_df = pivot_df.sort_values('weighted_avg', ascending=False)
pivot_df

	record	BM25	CS	weighted_avg
2	DeVonta Smith caught 7 touchdowns in the 2023 ...	3.397875e-08	1.000000	0.600000
3	DeVonta Smith had 81 catches	3.152075e-01	0.651448	0.516952
0	A.J. Brown had 25 more total catches than DeVo...	1.000000e+00	0.000000	0.400000
4	Jalen Hurts completed 81 passes to DeVonta Smith	7.128513e-08	0.657939	0.394764
1	DeVonta Smith caught 25 fewer passes than A.J....	0.000000e+00	0.371422	0.222853

Final Thoughts

Even a toy example had me sweating at times! And I was clearly gaming the setup to get as close to my desired result as possible. As I move forward with experimenting with BM25 and cosine similarity and eventually combining them with an LLM to create a RAG pipeline for a much more complex dataset (multiple chapters from the fastai textbook), I’ll make sure that I respect the nuance involved and plan the project accordingly.

I hope you enjoyed this blog post! Follow me on Twitter @vishal_learner.