Comparing Cosine Similarity Between Embeddings of Semantically Similar and Dissimilar Texts with Varying Punctuation

python
RAG
information retrieval
In this blog post, I calculate the cosine similarity between embeddings of texts that vary in punctuation and semantic similarity.
Author

Vishal Bakshi

Published

November 8, 2024

!pip install sentence-transformers -Uqq
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

Background

I was reading the ColBERT paper as part of a fastai study group and it mentions the following:

After passing this input sequence through BERT and the subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a pre-defined list. This filtering is meant to reduce the number of embeddings per document, as we hypothesize that (even contextualized) embeddings of punctuation are unnecessary for effectiveness.

I’m not going to fully understand (or test) their hypothesis in this notebook, but I am running a tiny experiment to see how punctuation changes translate to embedding changes.
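To make the idea concrete, here is a minimal sketch of what that kind of filtering could look like: tokenize a document, build a mask from a pre-defined punctuation list, and keep only the non-punctuation token embeddings. The tokenizer choice, the skiplist, and the masking logic here are my own assumptions for illustration, not ColBERT’s actual implementation.

import string
from transformers import AutoTokenizer

# Pre-defined punctuation list, analogous in spirit to ColBERT's skiplist (assumption)
skiplist = set(string.punctuation)

# Any BERT tokenizer works for illustration; ColBERT uses a BERT-based encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def punctuation_mask(text):
    """Boolean mask over tokens: False for punctuation-only tokens."""
    tokens = tokenizer.tokenize(text)
    return torch.tensor([tok not in skiplist for tok in tokens])

# Given per-token embeddings of shape (num_tokens, dim), keep non-punctuation rows
# (special tokens like [CLS]/[SEP] are ignored here for simplicity):
# filtered = token_embeddings[punctuation_mask(doc_text)]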

Starting with a smaller model:

emb_model = SentenceTransformer("BAAI/bge-small-en-v1.5");

I asked Claude for some examples of sentences where a comma would change its meaning and it gave me the following pair which I’m expanding upon in this notebook:

“The woman said the judge is dishonest”

“The woman, said the judge, is dishonest”

In the first sentence, the woman is saying that the judge is dishonest. In the second sentence, the commas change the meaning: now the judge is saying that the woman is dishonest.

I’ve also added some variants of the sentence using different punctuation.

d1 = "The woman said the judge is dishonest"
d2 = "The woman, said the judge, is dishonest"
d3 = "The woman said: the judge is dishonest"
d4 = 'The woman said: "the judge is dishonest"'
d5 = 'The judge said: "the woman is dishonest"'

q = "Is the woman or the judge dishonest?"
s1 = "The woman is dishonest"
s2 = "The judge is dishonest"

d1_emb = emb_model.encode(d1, convert_to_tensor=True)
d2_emb = emb_model.encode(d2, convert_to_tensor=True)
d3_emb = emb_model.encode(d3, convert_to_tensor=True)
d4_emb = emb_model.encode(d4, convert_to_tensor=True)
d5_emb = emb_model.encode(d5, convert_to_tensor=True)

q_emb = emb_model.encode(q, convert_to_tensor=True)
s1_emb = emb_model.encode(s1, convert_to_tensor=True)
s2_emb = emb_model.encode(s2, convert_to_tensor=True)
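As an aside, the same embeddings can be produced with a single batched call to encode, which is a bit less repetitive (a sketch; doc_embs is my own name, and its rows line up with d1 through d5):

doc_embs = emb_model.encode([d1, d2, d3, d4, d5], convert_to_tensor=True)  # shape (5, dim)
q_emb, s1_emb, s2_emb = emb_model.encode([q, s1, s2], convert_to_tensor=True)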

The most similar text to the question “Is the woman or the judge dishonest?”, by cosine similarity, is “The woman, said the judge, is dishonest”. The least similar is ‘The woman said: “the judge is dishonest”’. My guess is that the additional punctuation (: and ") causes this dissimilarity.

q = "Is the woman or the judge dishonest?"
(
    F.cosine_similarity(q_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
    F.cosine_similarity(q_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
    F.cosine_similarity(q_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
    F.cosine_similarity(q_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
    F.cosine_similarity(q_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
)
(tensor(0.9355),
 tensor(0.9292),
 tensor(0.9170),
 tensor(0.9149),
 tensor(0.8996))
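The same ranking can also be computed in one go by stacking the document embeddings and letting F.cosine_similarity broadcast over them, then sorting (a sketch; docs, doc_embs, and ranked are my own names):

docs = [d1, d2, d3, d4, d5]
doc_embs = torch.stack([d1_emb, d2_emb, d3_emb, d4_emb, d5_emb])  # shape (5, dim)

# Similarity of the question embedding against all five documents at once
sims = F.cosine_similarity(q_emb.unsqueeze(0), doc_embs, dim=1)

# Sort from most to least similar
ranked = sorted(zip(docs, sims.tolist()), key=lambda x: x[1], reverse=True)
for text, score in ranked:
    print(f"{score:.4f}  {text}")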

The text

The woman is dishonest

is most similar by cosine similarity to the text:

The woman, said the judge, is dishonest

That makes sense. However, “The woman is dishonest” has a lower cosine similarity with the semantically similar ‘The judge said: “the woman is dishonest”’ (0.8561) than with the semantically dissimilar “The woman said the judge is dishonest” (0.8631).

s1 = "The woman is dishonest"
res = torch.tensor(
        [
            F.cosine_similarity(s1_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
            F.cosine_similarity(s1_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
            F.cosine_similarity(s1_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
            F.cosine_similarity(s1_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
            F.cosine_similarity(s1_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
        ]
    )

res
tensor([0.8812, 0.8631, 0.8561, 0.8502, 0.8383])
torch.median(res)
tensor(0.8561)

For the following text:

s2 = "The judge is dishonest"

The most similar text, by cosine similarity, is “The woman, said the judge, is dishonest”, which is semantically dissimilar.

res = torch.tensor(
        [
          F.cosine_similarity(s2_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
          F.cosine_similarity(s2_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
          F.cosine_similarity(s2_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
          F.cosine_similarity(s2_emb, d5_emb, dim=0),  # 'The judge said: "the woman is dishonest"'
          F.cosine_similarity(s2_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
        ]
    )
res
tensor([0.9208, 0.9194, 0.9102, 0.8969, 0.8907])
torch.median(res)
tensor(0.9102)

Trying a bigger model that ranks higher on the MTEB leaderboard:

emb_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5");
d1_emb = emb_model.encode(d1, convert_to_tensor=True)
d2_emb = emb_model.encode(d2, convert_to_tensor=True)
d3_emb = emb_model.encode(d3, convert_to_tensor=True)
d4_emb = emb_model.encode(d4, convert_to_tensor=True)
d5_emb = emb_model.encode(d5, convert_to_tensor=True)

q_emb = emb_model.encode(q, convert_to_tensor=True)
s1_emb = emb_model.encode(s1, convert_to_tensor=True)
s2_emb = emb_model.encode(s2, convert_to_tensor=True)

For this model, for this text:

q = "Is the woman or the judge dishonest?"

the closest text by cosine similarity is “The woman said: the judge is dishonest”.

(
    F.cosine_similarity(q_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
    F.cosine_similarity(q_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
    F.cosine_similarity(q_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
    F.cosine_similarity(q_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
    F.cosine_similarity(q_emb, d5_emb, dim=0),  # 'The judge said: "the woman is dishonest"'
)
(tensor(0.8180),
 tensor(0.8175),
 tensor(0.7875),
 tensor(0.7849),
 tensor(0.7731))

For the following text:

s1 = "The woman is dishonest"

the most similar text, by cosine similarity, is “The woman said the judge is dishonest”, which is semantically dissimilar.

res = torch.tensor(
        [
            F.cosine_similarity(s1_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
            F.cosine_similarity(s1_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
            F.cosine_similarity(s1_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
            F.cosine_similarity(s1_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
            F.cosine_similarity(s1_emb, d5_emb, dim=0) # 'The judge said: "the woman is dishonest"'
        ]
    )

res
tensor([0.9738, 0.9461, 0.9042, 0.8714, 0.8577])
torch.median(res)
tensor(0.9042)

Finally, for the following text:

s2 = "The judge is dishonest"

the most similar text, by cosine similarity, is “The woman said the judge is dishonest”, which is semantically similar. The second-most similar text, “The woman said: the judge is dishonest”, is also semantically similar. However, the semantically similar ‘The woman said: “the judge is dishonest”’ has a lower cosine similarity than the semantically dissimilar “The woman, said the judge, is dishonest”. Whew!

res = torch.tensor(
        [
          F.cosine_similarity(s2_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
          F.cosine_similarity(s2_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
          F.cosine_similarity(s2_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
          F.cosine_similarity(s2_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
          F.cosine_similarity(s2_emb, d5_emb, dim=0),  # 'The judge said: "the woman is dishonest"'
        ]
    )
res
tensor([0.9763, 0.9507, 0.9107, 0.8791, 0.8642])
torch.median(res)
tensor(0.9107)

Final Thoughts

I’m not going to draw any conclusions about the relationship between punctuation, embeddings, and cosine similarity, but I will say that this tiny experiment has left me with more questions than answers.