LoRA: Low-Rank Adaptation¶
1. Overview¶
Large Language Models (LLMs) contain billions of parameters, making full fine-tuning computationally expensive and memory intensive.
Low-Rank Adaptation (LoRA) provides a parameter-efficient way to adapt pretrained models by freezing the original weights and introducing small trainable low-rank update matrices.
LoRA decomposes weight updates into a low-rank factorization, allowing fine-tuning with only a fraction of the original parameters while retaining model quality.
2. Motivation¶
Fine-tuning a pretrained model requires adjusting all parameters, which can be:
- Expensive: requires large GPU memory and long training time.
- Inefficient: multiple downstream tasks need separate full fine-tunes.
- Redundant: many weight updates lie in a low intrinsic dimension subspace.
LoRA aims to address these issues by restricting weight updates to a low-rank subspace.
3. Core Idea¶
Let \(W_0 \in \mathbb{R}^{d \times k}\) be a pretrained weight matrix of a layer (e.g., in attention or MLP).
In full fine-tuning, the model learns a weight update \(\Delta W\), resulting in:

\[ W = W_0 + \Delta W \]
LoRA assumes \(\Delta W\) is low-rank and can be decomposed as:

\[ \Delta W = B A \]
where:
- \(A \in \mathbb{R}^{r \times k}\)
- \(B \in \mathbb{R}^{d \times r}\)
- \(r \ll \min(d, k)\) is the rank hyperparameter.
During fine-tuning:
- \(W_0\) is frozen (no gradient updates).
- Only \(A\) and \(B\) are trainable.
At inference, the effective weight is:

\[ W = W_0 + \frac{\alpha}{r} B A \]

where \(\alpha\) is a scaling factor controlling the magnitude of updates; implementations apply the ratio \(\alpha / r\).
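To make the savings concrete, here is a quick back-of-the-envelope comparison (the 4096×4096 shape is an assumption, typical of a 7B-scale attention projection):

```python
# Trainable-parameter comparison for a single d x k weight matrix.
d, k, r = 4096, 4096, 8

full_ft = d * k          # full fine-tuning: every entry of delta_W
lora = r * k + d * r     # LoRA: A (r x k) plus B (d x r)

print(f"full: {full_ft:,}")             # full: 16,777,216
print(f"lora: {lora:,}")                # lora: 65,536
print(f"ratio: {full_ft / lora:.0f}x")  # ratio: 256x
```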
4. LoRA in Attention Layers¶
In Transformer architectures, LoRA is typically applied to query (Q) and value (V) projection matrices within the self-attention module.
For example, the modified query projection becomes:

\[ Q = W_Q x + \frac{\alpha}{r} B_Q A_Q x \]
This retains the original computation while enabling efficient adaptation with small additional matrices.
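As a minimal sketch of what this looks like in code, the helper below walks a model and swaps Q/V projections for LoRA layers. It assumes Hugging Face-style module names (`q_proj`, `v_proj`) and reuses the `LoRALinear` class defined in Section 6 below; bias handling is omitted for brevity:

```python
import torch.nn as nn

def add_lora_to_attention(model: nn.Module, r: int = 8, alpha: int = 16) -> nn.Module:
    """Replace Q/V projections with LoRA-wrapped linears (module names are assumptions)."""
    for module in model.modules():
        for proj_name in ("q_proj", "v_proj"):
            child = getattr(module, proj_name, None)
            if isinstance(child, nn.Linear):
                lora = LoRALinear(child.in_features, child.out_features, r=r, alpha=alpha)
                lora.weight.data.copy_(child.weight.data)  # carry over pretrained W0
                setattr(module, proj_name, lora)
    return model
```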
5. Objective Function¶
LoRA uses the same loss function as the base fine-tuning objective (e.g., cross-entropy for language modeling):

\[ \mathcal{L}(\theta) = -\sum_{t} \log p_{\theta}\left(x_t \mid x_{<t}\right) \]
The only difference is that only the parameters in \(A\) and \(B\) are updated:

\[ \min_{A,\,B}\; \mathcal{L}\!\left(W_0 + \frac{\alpha}{r} B A\right), \qquad W_0 \text{ frozen.} \]
This selective gradient flow drastically reduces training cost and memory footprint.
6. Implementation Details (Pseudo-Code)¶
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r  # updates are scaled by alpha / r

        # Frozen base weight (in practice, copied from the pretrained model).
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim))
        self.weight.requires_grad = False  # freeze base weights

        # Trainable low-rank factors: delta_W = B @ A has rank at most r.
        self.A = nn.Parameter(torch.empty(r, in_dim))
        self.B = nn.Parameter(torch.empty(out_dim, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # random init for A
        nn.init.zeros_(self.B)  # B starts at zero, so delta_W = 0 initially

    def forward(self, x):
        # Effective weight: W0 + (alpha / r) * B @ A
        return F.linear(x, self.weight + self.scaling * self.B @ self.A)
```
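A quick sanity check of the layer above (the random base weight and shapes are illustrative stand-ins for real pretrained values):

```python
layer = LoRALinear(in_dim=768, out_dim=768, r=8, alpha=16)
nn.init.normal_(layer.weight, std=0.02)  # stand-in for a pretrained weight

x = torch.randn(2, 768)
print(layer(x).shape)  # torch.Size([2, 768])

# Only the low-rank factors are trainable.
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```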
7. Hyperparameters & Heuristics¶
| Hyperparameter | Typical Range | Practical Tip |
|---|---|---|
| Rank (`r`) | 4–64 (sometimes up to 256) | Start small (4/8/16) and increase if underfitting |
| Alpha (`α`) | ≈ 2 × r | Scaling factor: `scaling = α / r` |
| Learning Rate | 1e-4 – 5e-4 | Too high → drift; too low → slow adaptation |
| Dropout (`lora_dropout`) | 0.0 – 0.1 | 0.05 often helpful on small datasets |
| Epochs | 1 – a few | Avoid many epochs on small instruction datasets |
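These heuristics map directly onto a `peft` configuration; a minimal sketch (the model id and target modules are assumptions, adjust for your architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model id
config = LoraConfig(
    r=8,                                  # start small, increase if underfitting
    lora_alpha=16,                        # alpha = 2 * r heuristic
    lora_dropout=0.05,                    # helpful on small datasets
    target_modules=["q_proj", "v_proj"],  # Q/V projections, per Section 4
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```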
8. Training Configurations & Memory Optimizations¶
- Mixed precision: Use `fp16` or `bf16` to reduce memory usage and speed up training.
- Gradient accumulation: Emulate large batch sizes using smaller per-device batches.
- Gradient checkpointing: Trade compute for reduced activation memory footprint.
- CPU offload / `device_map`: Offload frozen weights using the `accelerate` library or Hugging Face `device_map` feature.
- Optimizer: `AdamW` is the default; for very large adapter parameter sets, consider memory-efficient optimizers or even `SGD` if appropriate.
- QLoRA: Load the base model in 4-bit precision using `bitsandbytes` and train LoRA adapters on top; this enables single-GPU training for very large models (see the sketch after this list).
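Putting several of these options together, a hedged QLoRA-style sketch using `transformers`, `bitsandbytes`, and `peft` (the model id, batch sizes, and learning rate are assumptions, not recommendations):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # frozen base model in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # assumed model id
    quantization_config=bnb_config,
    device_map="auto",                      # placement/offload via accelerate
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="lora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,         # emulate an effective batch size of 16
    gradient_checkpointing=True,            # trade compute for activation memory
    bf16=True,                              # mixed precision
    learning_rate=2e-4,
    num_train_epochs=1,
)
```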
9. Common Issues and Concrete Solutions¶
🔧 OOM / CUDA Out of Memory¶
- Lower the rank (`r`).
- Use QLoRA (4-bit) or mixed precision.
- Reduce batch size and use gradient accumulation.
- Enable gradient checkpointing or CPU offload.
⚡ Training Instability / Divergence¶
- Lower the `learning rate` and/or `α`.
- Add a small LoRA dropout.
- Use warmup and learning rate schedulers (e.g., cosine or linear), as sketched below.
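For the warmup/scheduler point, a minimal sketch using the `transformers` scheduler helper (`trainable_params` and `total_steps` are placeholders you would compute for your run):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(trainable_params, lr=2e-4)  # adapter params only
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,            # short warmup guards against early divergence
    num_training_steps=total_steps,  # total optimizer steps for the run
)
# Call scheduler.step() after each optimizer.step() in the training loop.
```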
🪫 Underfitting (Insufficient Capacity)¶
- Gradually increase rank (r).
- Add adapters to more modules (e.g., MLP layers).
🧩 Overfitting on Small Datasets¶
- Reduce epochs and learning rate.
- Add dropout and data augmentation.
- Use early stopping and validation checks.
⚙️ Quantization Compatibility Issues¶
- Prefer tested stacks: `bitsandbytes` + Hugging Face + `peft`.
- Validate numeric stability on a small subset before full training.
🔗 Adapter Conflicts When Stacking¶
- Avoid overlapping target modules unless intentionally merging adapters.
- Use explicit adapter fusion tools when combining multiple adapters.
10. Best Practices & Checklist¶
- Start with a small rank (`r = 4–16`) and `α = 2 × r`.
- Freeze base model weights; train only adapter parameters.
- Use mixed precision and gradient checkpointing where appropriate.
- Use PEFT / Hugging Face tooling for reliable save/load and metadata management (see the sketch after this list).
- Monitor validation metrics and KL-like drift metrics (compare outputs to base).
- If memory constrained, use QLoRA + LoRA adapters.
- Keep logs, seeds, and repeat runs for reproducibility.
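For the save/load point, a minimal `peft` sketch (paths and model id are assumptions; `model` is a trained `PeftModel`):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Saving: on a PeftModel, this writes only the small adapter weights + config.
model.save_pretrained("my-lora-adapter")

# Loading: attach the adapter back onto the frozen base model.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "my-lora-adapter")

# Optional: merge the low-rank update into W0 for adapter-free inference.
model = model.merge_and_unload()
```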
11. Limitations & Challenges¶
- Rank–Capacity Tradeoff: Small `r` may underfit; large `r` increases memory use and instability.
- Task-Specific Sensitivity: Optimal values for `r`, `α`, and learning rate vary across models and tasks.
- Quantization Effects: Combining LoRA with quantization (as in QLoRA) requires additional tuning.
- Adapter Management: Multiple adapters need clear naming and metadata to avoid conflicts.
- Not a Universal Replacement: For extreme distribution shifts, full fine-tuning may still be necessary.
12. LoRA Alternatives¶
12.1 QLoRA¶
Combines 4-bit quantization of base model with LoRA adapters.
- Base model: 4-bit NF4 (frozen, quantized)
- LoRA adapters: BF16 (trainable)
- Memory: 7B model fine-tuning in ~6 GB
(See quantization.md for full details)
12.2 LoRA+ (2024)¶
Problem: LoRA uses the same learning rate for both the \(A\) and \(B\) matrices.
Insight: \(A\) and \(B\) have different roles:
- \(A\): Input projection (processes raw features)
- \(B\): Output projection (maps to output space)
LoRA+ solution: Use different learning rates:
Result: 1-2% improvement on downstream tasks, same memory as LoRA.
```python
# LoRA+ with different LRs for A and B, via optimizer parameter groups.
# (A sketch; assumes peft-style parameter names "lora_A" / "lora_B".)
a_params = [p for n, p in model.named_parameters() if "lora_A" in n]
b_params = [p for n, p in model.named_parameters() if "lora_B" in n]
optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": 1e-4},       # base LR for A
    {"params": b_params, "lr": 1e-4 * 16},  # B gets 16x the LR of A
])
```
12.3 DoRA (Weight-Decomposed LoRA, 2024)¶
Problem: LoRA modifies both magnitude and direction of weight updates together, limiting expressiveness.
DoRA decomposes weights into:
- Magnitude: Scalar \(m\) per column
- Direction: Unit vector \(V\)
The adapted weight is recombined as

\[ W' = m \,\frac{V + \Delta V}{\lVert V + \Delta V \rVert_c} \]

where \(\Delta V = B A\) is the LoRA update and \(\lVert \cdot \rVert_c\) denotes the column-wise norm.
Results:
- ~1-3% better than standard LoRA
- Works especially well for complex tasks
- Adopted in LLaMA-3 fine-tuning recommendations
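Recent versions of `peft` expose DoRA behind a config flag; a minimal sketch (flag availability depends on your `peft` version, and target modules are assumptions):

```python
from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,                        # enable weight-decomposed (DoRA) updates
    target_modules=["q_proj", "v_proj"],
)
```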
12.4 LoRA-FA (Frozen A, 2023)¶
Problem: Both \(A\) and \(B\) consume memory for gradients.
LoRA-FA: Freeze \(A\) (random projection), only train \(B\).
- Memory: ~50% less gradient memory than LoRA
- Quality: Slightly lower than full LoRA
- Good for memory-constrained scenarios
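Using the `LoRALinear` layer from Section 6, LoRA-FA amounts to a two-line change (a sketch, not the paper's reference code):

```python
layer = LoRALinear(in_dim=768, out_dim=768, r=8, alpha=16)
layer.A.requires_grad = False  # A stays a fixed random projection

# The optimizer only tracks B, roughly halving gradient/optimizer-state memory.
optimizer = torch.optim.AdamW([layer.B], lr=2e-4)
```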
12.5 Comparison¶
| Variant | Params | Memory vs LoRA | Quality vs LoRA |
|---|---|---|---|
| LoRA | \(2 \times r \times d\) | Baseline | Baseline |
| LoRA+ | Same | Same | +1-2% |
| DoRA | \(+d\) (magnitude) | +5% | +1-3% |
| LoRA-FA | \(r \times d\) | -50% grad | -0.5-1% |
| QLoRA | Same | -75% base model | -1-2% |
13. Comparison: LoRA vs Other Methods¶
| Method | Parameter Efficiency | Compute Cost | Flexibility | Notes |
|---|---|---|---|---|
| Full fine-tuning | ❌ | High | Moderate | Updates all parameters |
| Adapter tuning | ✅ | Medium | High | Bottleneck MLPs per layer |
| Prefix tuning | ✅ | Low | Medium | Learned prompt vectors |
| LoRA | ✅ | Low | High | Mergeable, simple low-rank updates |
| QLoRA | ✅✅ | Very Low | High | 4-bit quantization + LoRA |