DoRA’s Magnitude Vector
Setup
from peft import LoraConfig, get_peft_model
import transformers

# the following imports are from dora.py
from copy import deepcopy
import torch
import torch.nn.functional as F
from torch import nn
from peft.utils.integrations import dequantize_module_weight, gather_params_ctx
from peft.utils.other import transpose
Background
I am currently re-reading the DoRA (Weight-Decomposed Low-Rank Adaptation) paper. I took a bit of a detour and worked through the fantastic article Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch by Sebastian Raschka (I am also reading his book Building a Large Language Model (from scratch) as part of a fastai study group). The article is full of helpful diagrams and breakdowns of concepts, as well as an easily digestible implementation in code. One particular breakthrough for me while reading the article was his demonstration of the distributive law of multiplication:
x·(W + ΔW) = x·W + x·ΔW
Similarly, we can write the following for LoRA:
x·(W + A·B) = x·W + x·A·B
Reading this made it click for me why and how LoRA adapters are such an efficient way of handling downstream tasks.
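To convince myself, here's a tiny numerical check of that identity (my own sketch, not from the article):

import torch

torch.manual_seed(0)
x = torch.randn(1, 8)    # input, 1 x d
W = torch.randn(8, 16)   # pretrained weight, d x k
A = torch.randn(8, 4)    # LoRA A, d x r
B = torch.randn(4, 16)   # LoRA B, r x k

merged = x @ (W + A @ B)         # merge the low-rank update into W first
separate = x @ W + x @ A @ B     # keep the update separate
print(torch.allclose(merged, separate, atol=1e-6))  # True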
I also took a deep dive into the peft library’s implementation of DoRA. I recently made a video of this deep dive.
In this blog post I am going to compare Raschka’s article’s implementation with peft’s and highlight a key difference that I found between them in how they implement the decomposition of a weight matrix into its magnitude and directional components.
I’ll start by reviewing both approaches.
Raschka’s Implementation
A caveat: I assume this implementation is by no means a “final” or “production” implementation; I understand it to be educational and illustrative.
I’ll start by copy/pasting the relevant code: LoRALayer (DoRA uses LoRA to fine-tune the directional component) and LinearWithDoRAMerged.
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x
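One detail worth noting: because B is initialized to zeros, the LoRA branch contributes nothing at initialization. A quick check (my own sketch, using the class above):

lora = LoRALayer(in_dim=784, out_dim=128, rank=4, alpha=8)
x = torch.randn(2, 784)
print(torch.allclose(lora(x), torch.zeros(2, 128)))  # True -- B starts at zero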
class LinearWithDoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        self.m = nn.Parameter(
            self.linear.weight.norm(p=2, dim=0, keepdim=True))

    # Code loosely inspired by
    # https://github.com/catid/dora/blob/main/dora.py
    def forward(self, x):
        lora = self.lora.A @ self.lora.B
        numerator = self.linear.weight + self.lora.alpha * lora.T
        denominator = numerator.norm(p=2, dim=0, keepdim=True)
        directional_component = numerator / denominator
        new_weight = self.m * directional_component
        return F.linear(x, new_weight, self.linear.bias)
I’ll also create a regular linear layer using in/out feature values from the Raschka article:
linear = nn.Linear(in_features=784, out_features=128, bias=True)
linear
Linear(in_features=784, out_features=128, bias=True)
dora_layer = LinearWithDoRAMerged(linear, 256, 512)
dora_layer
LinearWithDoRAMerged(
(linear): Linear(in_features=784, out_features=128, bias=True)
(lora): LoRALayer()
)
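As a quick sanity check (my own addition), a forward pass through the merged layer gives the expected output shape:

x = torch.randn(4, 784)
out = dora_layer(x)
print(out.shape)  # torch.Size([4, 128])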
Here’s the key value: the shape of the magnitude vector. In Raschka’s code, it’s 1 x 784, where 784 is the number of linear in_features.
dora_layer.m.shape
torch.Size([1, 784])
Looking at LinearWithDoRAMerged.__init__:
self.m = nn.Parameter(
self.linear.weight.norm(p=2, dim=0, keepdim=True))
The norm is taken over dim=0, which is the dimension of out_features:
linear.weight.shape
torch.Size([128, 784])
In other words, we end up with 1 magnitude value for each of the 784 input neurons.
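We can verify (my own check) that m is exactly the column-wise norm of the frozen weight, one value per input feature:

expected = linear.weight.norm(p=2, dim=0, keepdim=True)  # shape (1, 784)
print(torch.allclose(dora_layer.m, expected))            # True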
peft Implementation
From src/peft/tuners/lora/dora.py:
class DoraLinearLayer(nn.Module):
    def __init__(self, fan_in_fan_out):
        super().__init__()
        self.fan_in_fan_out = fan_in_fan_out

    def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
        # calculate L2 norm of weight matrix, column-wise
        weight = transpose(weight, self.fan_in_fan_out)
        weight = weight + scaling * lora_weight
        weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
        return weight_norm
...
The very important attribute here is fan_in_fan_out. I found a few places in the peft codebase which documented it as follows:

Set this to True if the layer to replace stores weight like (fan_in, fan_out)

How I interpret this: if the weights are stored as (in, out), fan_in_fan_out is True; if stored as (out, in), fan_in_fan_out is False.
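For a plain nn.Linear this means fan_in_fan_out should be False, since PyTorch stores its weight as (out_features, in_features). A quick check (my own sketch):

layer = nn.Linear(in_features=576, out_features=192, bias=False)
print(layer.weight.shape)  # torch.Size([192, 576]) -> (out, in), so fan_in_fan_out=False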
Looking at an example, I’ll peft-ify SmolLM2-135M:
model_nm = 'HuggingFaceTB/SmolLM2-135M'
model_nm
'HuggingFaceTB/SmolLM2-135M'
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2)
peft_config = LoraConfig(r=256, use_rslora=False, use_dora=True, target_modules=['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj'])
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
trainable params: 78,307,200 || all params: 212,823,360 || trainable%: 36.7945
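To see where the DoRA magnitude vectors ended up, we can scan the model’s parameters (my own sketch; it relies on the lora_magnitude_vector naming visible in the module printout below):

for name, param in model.named_parameters():
    if 'lora_magnitude_vector' in name:
        print(name, tuple(param.shape))
        break  # just show the first one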
Looking at one of the layers which has a different number of input and output features, k_proj:
k_proj = model.base_model.model.model.layers[0].self_attn.k_proj
k_proj
lora.Linear(
(base_layer): Linear(in_features=576, out_features=192, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=576, out_features=256, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=256, out_features=192, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict(
(default): lora.dora.DoraLinearLayer()
)
)
The base layer has 576 in_features and 192 out_features:
k_proj.base_layer.weight.shape
torch.Size([192, 576])
The fan_in_fan_out attribute is False, which checks out given that the shape above is (out, in):
k_proj.fan_in_fan_out
False
Why is fan_in_fan_out such a big deal to me? Well, look at how get_weight_norm is written:
def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
    # calculate L2 norm of weight matrix, column-wise
    weight = transpose(weight, self.fan_in_fan_out)
    weight = weight + scaling * lora_weight
    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
    return weight_norm
I’ll walk through each line, starting with the base layer’s weight matrix:
weight = k_proj.base_layer.weight
weight.shape
torch.Size([192, 576])
We then pass the weight and fan_in_fan_out to transpose:
weight = transpose(weight, k_proj.fan_in_fan_out)
weight.shape
torch.Size([192, 576])
It doesn’t transpose it! That’s because transpose returns the weight as-is when fan_in_fan_out is False:
def transpose(weight, fan_in_fan_out):
if not fan_in_fan_out:
return weight
if isinstance(weight, torch.nn.Parameter):
return torch.nn.Parameter(weight.T)
return weight.T
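A tiny demo (my own) of what transpose does with each flag value:

w = torch.randn(192, 576)          # stored as (out, in), like nn.Linear
print(transpose(w, False).shape)   # torch.Size([192, 576]) -- returned as-is
print(transpose(w, True).shape)    # torch.Size([576, 192]) -- transposed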
This is absolutely critical in determining the shape of the magnitude vector:
weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
weight_norm.shape
torch.Size([192])
Note two things here:
- they take the norm across dim=1 (the number of in_features)
- we end up with 1 magnitude value for each of the 192 output neurons.
Contrast this with Raschka’s article, where we had 1 magnitude value for each of the input neurons.
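To make the contrast concrete, here’s a small sketch (my own; it assumes DoraLinearLayer stores the magnitude as its weight parameter) comparing both conventions on k_proj’s (192, 576) weight:

W = k_proj.base_layer.weight

peft_style = torch.linalg.norm(W, dim=1)          # one norm per output neuron -> shape (192,)
raschka_style = W.norm(p=2, dim=0, keepdim=True)  # one norm per input neuron  -> shape (1, 576)
print(peft_style.shape, raschka_style.shape)

# Assumption: peft keeps the DoRA magnitude as DoraLinearLayer's `weight` parameter.
mag = k_proj.lora_magnitude_vector['default'].weight
print(mag.shape)  # expected: torch.Size([192]), matching the output-neuron convention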
Aside: Claude Conversation
While I was re-reading the DoRA paper, I found the emphasis on column-wise vector norm interesting so I asked Claude about it:
Prompt: Why columns? What is special about doing this for columns and not rows. Is there something fundamental about columns in a matrix of pretrained weights?
Response: For neural networks, columns of weight matrices map features from one layer’s dimension to another’s. Each column represents all the connections to a single output neuron. So columns form natural functional units, making column-wise normalization more meaningful than row-wise for understanding and modifying the network’s learned transformations.
Prompt: So suppose we have an input vector x which contains values corresponding to each input variable. The pretrained weight matrix is d x k, so x would be 1 x d. Each row in the pretrained weight matrix would represent one input variable. Multiplying 1 x d with d x k gives us 1 x k, where k is the number of activations (i.e. the number of neurons).
Response: Yes, that’s correct. The output for a single neuron (one column of the weight matrix) is the dot product between the input vector and that column’s weights. So when we do column-wise normalization, we’re normalizing the set of weights that feed into each individual output neuron. This helps explain why column-wise normalization is natural - each column represents a complete “unit” in terms of how the network processes information to produce one output activation.
So, it seems like decomposing the weight matrix into a magnitude vector and a directional matrix corresponding to output neurons is really important.
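To make that concrete (my own sketch): with W stored as d x k, each output activation is the dot product of x with one column of W:

torch.manual_seed(0)
d, k = 4, 3
x = torch.randn(1, d)   # input vector, 1 x d
W = torch.randn(d, k)   # pretrained weight, d x k

out = x @ W             # 1 x k: one activation per output neuron (column)
col0 = x @ W[:, 0]      # dot product of x with the first column
print(torch.allclose(out[0, 0], col0[0]))  # True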
Final Thoughts
Explorations like this are why I’m grateful for open source code. The efforts of folks like HuggingFace and Raschka to democratize LLMs allow folks like me to curate insightful explorations for myself, and I’m obviously happy to share them in this blog post. If you find something incorrect in my interpretation of Raschka’s or peft’s code, please at me on Twitter @vishal_learner.
A future exercise that I want to do: train DoRA implementations with a column-wise norm and a row-wise norm and compare results. This would truly highlight whether it’s important to align the decomposition with the output neurons of the weight matrix.