vishal bakshi
welcome to my blog.
Recreating the PLAID ColBERTv2 Scoring Pipeline: From Research Code to RAGatouille
In this blog post, I walk through the colbert research codebase (via Answer.AI’s RAGatouille) and work my way line-by-line through the 4-stage PLAID scoring pipeline to recreate RAGatouille results for a toy example of 1 query and 3 documents.
Scoring Full Text and Semantic Search on Chunk Sizes from 100 to 2000 Tokens
In this blog post, I run retrieval on three differently preprocessed datasets using four retrieval methods, with chunk sizes ranging from 100 to 2000 tokens, and use my fastbook-benchmark dataset to auto-score the results. Surprisingly, full text search yields the best MRR@10 (0.67) and Recall@10 (0.95) for a chunk size of 2000 tokens.
Implementing Image-to-Image Generation for Stable Diffusion
In this blog post I successfully implement image-to-image generation in the diffusion loop provided in Lesson 10 of the fastai course.
Evaluating 4 Retrieval Methods with 6 Chunking Strategies on my fastbook-benchmark Dataset
In this blog post, I perform retrieval on the fastbook chapter documents using 24 different retrieval method-chunking strategy combinations, auto-scoring using my fastbook-benchmark dataset.
Implementing Negative Prompting for Stable Diffusion
In this blog post I successfully implement negative prompting in the diffusion loop provided in Lesson 10 of the fastai course. I also explore some other relatively unsuccessful implementations that were interesting and informative nonetheless.
Training Textual Inversion Embeddings on Some Samurai Jack Drawings
In this blog post, I recap my experience (and results) with textual inversion embeddings trained on 6 sketches I created of Samurai Jack.
Comparing Cosine Similarity Between Embeddings of Semantically Similar and Dissimilar Texts with Varying Punctuation
In this blog post, I calculate the cosine similarity between different embeddings for texts that have varying types of punctuation and semantic similarity.
Establishing a Semantic Search (Embedding Cosine Similarity) Baseline for My fastbookRAG Project
In this blog post, I experiment with 6 chunking/retrieval strategies to retrieve context from an array of text embeddings sufficient to answer 80.31% of the 193 fastbook end-of-chapter questions from Part 1 of the course.
Conducting a Question-by-Question Error Analysis on Semantic Search Results
In this blog post, I conduct a detailed error analysis of 29 questions (from a set of 193), where none of the 6 semantic search methods retrieved sufficient context to answer them. I examine each question, categorize the errors, and discuss potential improvements and implications for future work.
Generating a GIF Animation Using Stable Diffusion
In this blog post I repurpose the code provided in Lesson 9/10 of the fastai Part 2 course to generate an animated GIF transitioning from a picture of a skunk to a picture of a puppy.
Calculating the Ratio of Gradients in an Image
In this blog post I use the OpenCV library to calculate the ratio of the sum of non-zero x- and y-gradients to the sum of non-zero original pixels of an image. The serif font has consistently larger ratios than the sans serif font.
Calculating the Ratio of Corners in an Image
In this blog post I use the OpenCV library to calculate the ratio of the sum of non-zero corner pixels to the sum of non-zero original pixels of an image. The serif font has consistently larger ratios than the sans serif font.
Calculating the Ratio of 2D FFT Magnitude and Phase of a Text Image
In this blog post I use NumPy to calculate the ratio of the mean of 2D FFT magnitude to the count of non-zero binarized pixels (“FFT Magnitude Ratio”) and the ratio of the sum of the absolute value of 2D FFT phase to the sum of binarized pixels (“FFT Phase Ratio”). Both ratios are consistently larger for images with serif text.
Conducting a Question-by-Question Error Analysis on Full Text Search Results
In this blog post, I conduct a detailed error analysis of 39 questions from a set of 202, where none of the 6 full text search methods retrieved sufficient context to answer them. I examine each question, categorize the errors, and discuss potential improvements and implications for future work.
Establishing a Full Text Search (BM25) Baseline for My fastbookRAG Project
In this blog post, I experiment with 6 chunking/retrieval strategies to retrieve context from a sqlite database sufficient to answer 76.7% of the 202 fastbook end-of-chapter questions from Part 1 of the course.
Comparing ~100k Random Numbers Generated with Different Methods
In this blog post, I generate close to 100k random numbers using 5 different methods: ANU Quantum numbers, Python’s random module, NumPy, PyTorch and a custom implementation from Lesson 10 of the fastai course (Part 2). I am surprised by the results!
Iterating on Full Text Search Keywords using claudette
In this blog post, I use Answer.AI’s claudette library to iteratively improve keywords generated for sqlite’s full text search.
Generating Full Text Search Keywords using claudette
In this blog post, I use Answer.AI’s claudette library to interface with the Claude-3.5 Sonnet API.
Using Hybrid Search to Answer the fastai Chapter 1 Questionnaire
In this blog post I use different approaches to combine FTS5 (keyword search) and Cosine Similarity (semantic search) to retrieve context necessary to answer questions about Chapter 1 of the fastai textbook.
How Does Stable Diffusion Work?
In this blog post I review the material taught in Lesson 9 of the fastai course (Part 2: Deep Learning Foundations to Stable Diffusion).
Using Full Text Search to Answer the fastbook Chapter 1 Questionnaire
In this blog post I’ll walk through my experiments using sqlite full text search to retrieve context relevant to answering chapter review questions. This is part of a larger fastbookRAG project I’m working on.
Calculating the Flesch Kincaid Reading Grade Level for the financial_phrasebank Dataset
In this blog post I calculate the Flesch Kincaid reading grade level for the financial_phrasebank dataset and find that it’s much higher than the average TinyStories reading level.
Paper Math: rsLoRA
In this blog post I think out loud as I attempt to understand pieces of the math presented in the rsLoRA paper.
Training Collaborative Filtering Models on MovieLens 100k with Different Weight Decay Values
In this notebook I explore the question: how does the wd (weight decay) parameter affect model performance and weight distributions? I use the MovieLens 100k subset as the dataset.
Improving Kaggle Private Score with Multi-Target Classification
In this notebook I apply Jeremy Howard’s approach to multi-target classification in fastai to improve a Kaggle submission score.
Paper Summary: RewardBench
A summary of research benchmarking reward models.
Recap: HMS HBAC Kaggle Competition
A recap of what and how I did on the Harvard Medical School Harmful Brain Activity Classification Kaggle Competition.
Recap: My First Live Kaggle Competition
A recap of what and how I did on the Multi-Class Prediction of Obesity Risk Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 8
In this notebook I apply Jeremy Howard’s approach from the “Scaling Up - Road to the Top, Part 3” notebook to my large ensemble.
Paddy Doctor Kaggle Competition - Part 7
In this notebook I run the code from Jeremy Howard’s “Scaling Up - Road to the Top, Part 3” notebook.
Paddy Doctor Kaggle Competition - Part 6
In this notebook I work through Jeremy Howard’s Live Coding 13 video in which he finishes working on the Paddy Doctor Disease Classification Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 5
In this notebook I work through Jeremy Howard’s Live Coding 12 video in which he continues working on the Paddy Doctor Disease Classification Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 4
In this notebook I work through Jeremy Howard’s Live Coding 11 video in which he continues working on the Paddy Doctor Disease Classification Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 3
In this notebook I work through Jeremy Howard’s Live Coding 10 video in which he continues working on the Paddy Doctor Disease Classification Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 2
In this notebook I work through Jeremy Howard’s Live Coding 9 video in which he continues working on the Paddy Doctor Disease Classification Kaggle Competition.
Paddy Doctor Kaggle Competition - Part 1
In this notebook I work through Jeremy Howard’s Live Coding 8 video in which he starts working on the Paddy Doctor Disease Classification Kaggle Competition.