Vishal Bakshi
November 8, 2024
!pip install sentence-transformers -Uqq
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
I was reading the ColBERT paper as part of a fastai study group and it mentions the following:
After passing this input sequence through BERT and the subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a pre-defined list. This filtering is meant to reduce the number of embeddings per document, as we hypothesize that (even contextualized) embeddings of punctuation are unnecessary for effectiveness.
I’m not going to understand (or test) their hypothesis in full in this notebook, but I am doing a tiny experiment to see how punctuation changes translate to embedding changes.
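To make the quoted filtering step concrete, here is a rough sketch (my own illustration, not ColBERT’s actual implementation) of dropping token embeddings whose tokens appear in a pre-defined punctuation list:
import string

def filter_punctuation(tokens, token_embeddings, punctuation=frozenset(string.punctuation)):
    # tokens: list of token strings; token_embeddings: tensor of shape (num_tokens, dim)
    # Keep only the embeddings whose token is not in the pre-defined punctuation list.
    keep = [i for i, tok in enumerate(tokens) if tok not in punctuation]
    return token_embeddings[keep]

# Toy example: the embeddings for "," and "." are filtered out, leaving 3 of 5 rows.
toy_tokens = ["The", "woman", ",", "said", "."]
toy_embs = torch.randn(5, 8)
filter_punctuation(toy_tokens, toy_embs).shape  # torch.Size([3, 8])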
Starting with a smaller model:
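A minimal loading sketch, with all-MiniLM-L6-v2 as an assumed stand-in for whichever small checkpoint is actually used:
emb_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small checkpoint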
I asked Claude for some examples of sentences where a comma would change the meaning, and it gave me the following pair, which I’m expanding upon in this notebook:
“The woman said the judge is dishonest”
“The woman, said the judge, is dishonest”
In the first sentence, the woman is saying that the judge is dishonest. In the second sentence, the commas change the meaning: it is the judge who is saying that the woman is dishonest.
I’ve also added some variants of the sentence using different punctuation.
d1 = "The woman said the judge is dishonest"
d2 = "The woman, said the judge, is dishonest"
d3 = "The woman said: the judge is dishonest"
d4 = 'The woman said: "the judge is dishonest"'
d5 = 'The judge said: "the woman is dishonest"'
q = "Is the woman or the judge dishonest?"
s1 = "The woman is dishonest"
s2 = "The judge is dishonest"
d1_emb = emb_model.encode(d1, convert_to_tensor=True)
d2_emb = emb_model.encode(d2, convert_to_tensor=True)
d3_emb = emb_model.encode(d3, convert_to_tensor=True)
d4_emb = emb_model.encode(d4, convert_to_tensor=True)
d5_emb = emb_model.encode(d5, convert_to_tensor=True)
q_emb = emb_model.encode(q, convert_to_tensor=True)
s1_emb = emb_model.encode(s1, convert_to_tensor=True)
s2_emb = emb_model.encode(s2, convert_to_tensor=True)
The most similar text to the question “Is the woman or the judge dishonest?”, by cosine similarity, is “The woman, said the judge, is dishonest”. The least similar is ‘The woman said: “the judge is dishonest”’. My guess is that the additional punctuation (: and ") causes this dissimilarity.
q = "Is the woman or the judge dishonest?"
(
F.cosine_similarity(q_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(q_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(q_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(q_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
F.cosine_similarity(q_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
)
(tensor(0.9355),
tensor(0.9292),
tensor(0.9170),
tensor(0.9149),
tensor(0.8996))
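One way to probe this guess (a quick sketch, not part of the comparisons above) is to strip the punctuation from d4 and re-encode it. Removing all punctuation from d4 recovers d1 exactly, so the score should match d1’s above:
import string

# Hypothetical check: drop the colon and quotation marks from d4, then re-encode.
d4_stripped = d4.translate(str.maketrans("", "", string.punctuation))
d4_stripped == d1  # True: stripping punctuation from d4 yields d1
d4_stripped_emb = emb_model.encode(d4_stripped, convert_to_tensor=True)
F.cosine_similarity(q_emb, d4_stripped_emb, dim=0)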
The text “The woman is dishonest” is most similar, by cosine similarity, to the text “The woman, said the judge, is dishonest”. That makes sense. However, “The woman is dishonest” has a lower cosine similarity with the semantically similar ‘The judge said: “the woman is dishonest”’ (0.8561) than with the semantically dissimilar “The woman said the judge is dishonest” (0.8631).
s1 = "The woman is dishonest"
res = torch.tensor(
[
F.cosine_similarity(s1_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(s1_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(s1_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
F.cosine_similarity(s1_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(s1_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
]
)
res
tensor([0.8812, 0.8631, 0.8561, 0.8502, 0.8383])
For the following text:
s2 = "The judge is dishonest"
the most similar text, by cosine similarity, is “The woman, said the judge, is dishonest”, which is semantically dissimilar.
res = torch.tensor(
[
F.cosine_similarity(s2_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(s2_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(s2_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(s2_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
F.cosine_similarity(s2_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
]
)
res
tensor([0.9208, 0.9194, 0.9102, 0.8969, 0.8907])
Trying a bigger model that ranks higher on the MTEB leaderboard:
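A minimal loading sketch, with BAAI/bge-large-en-v1.5 as an assumed stand-in for whichever higher-ranked MTEB checkpoint is actually used:
emb_model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # assumed larger checkpoint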
d1_emb = emb_model.encode(d1, convert_to_tensor=True)
d2_emb = emb_model.encode(d2, convert_to_tensor=True)
d3_emb = emb_model.encode(d3, convert_to_tensor=True)
d4_emb = emb_model.encode(d4, convert_to_tensor=True)
d5_emb = emb_model.encode(d5, convert_to_tensor=True)
q_emb = emb_model.encode(q, convert_to_tensor=True)
s1_emb = emb_model.encode(s1, convert_to_tensor=True)
s2_emb = emb_model.encode(s2, convert_to_tensor=True)
For this model, for the following text:
q = "Is the woman or the judge dishonest?"
the closest text by cosine similarity is “The woman said: the judge is dishonest”.
(
F.cosine_similarity(q_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(q_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(q_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(q_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
F.cosine_similarity(q_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
)
(tensor(0.8180),
tensor(0.8175),
tensor(0.7875),
tensor(0.7849),
tensor(0.7731))
For the following text:
s1 = "The woman is dishonest"
the most similar text, by cosine similarity, is “The woman said the judge is dishonest”, which is semantically dissimilar.
res = torch.tensor(
[
F.cosine_similarity(s1_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(s1_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(s1_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(s1_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
F.cosine_similarity(s1_emb, d5_emb, dim=0) # 'The judge said: "the woman is dishonest"'
]
)
res
tensor([0.9738, 0.9461, 0.9042, 0.8714, 0.8577])
Finally, for the following text:
s2 = "The judge is dishonest"
the most similar text, by cosine similarity, is “The woman said the judge is dishonest”, which is semantically similar. The second-most similar text by cosine similarity, “The woman said: the judge is dishonest”, is also semantically similar. However, the semantically similar ‘The woman said: “the judge is dishonest”’ has a lower cosine similarity than the semantically dissimilar “The woman, said the judge, is dishonest”. Whew!
res = torch.tensor(
[
F.cosine_similarity(s2_emb, d1_emb, dim=0), # "The woman said the judge is dishonest"
F.cosine_similarity(s2_emb, d3_emb, dim=0), # "The woman said: the judge is dishonest"
F.cosine_similarity(s2_emb, d2_emb, dim=0), # "The woman, said the judge, is dishonest"
F.cosine_similarity(s2_emb, d4_emb, dim=0), # 'The woman said: "the judge is dishonest"'
F.cosine_similarity(s2_emb, d5_emb, dim=0), # 'The judge said: "the woman is dishonest"'
]
)
res
tensor([0.9763, 0.9507, 0.9107, 0.8791, 0.8642])
I’m not going to draw any conclusions about the relationship between punctuation, embeddings, and cosine similarity, but I’ll say that this tiny experiment has left me with more questions than answers.