Finetuning, LoRA, and Model Merging - AI Engineering: Building Applications with Foundation Models

Key Principle

Finetuning is Axis 3 of the Three-Axis Quality Model — it changes the model itself. It is the right tool specifically for behavioral failures: wrong form, style, format, or syntax. The canonical routing rule is: "finetuning is for form, and RAG is for facts." (Chapter 7)

The intrinsic dimension argument explains why PEFT is not a quality compromise. Pre-training compresses the task space, so a small low-rank update can fully redirect model behavior. Better-trained, larger base models have lower intrinsic dimension — making them easier, not harder, to finetune efficiently.

Memory bottleneck math: Training with Adam at 16-bit requires memory for the model weights, three values per trainable parameter (one gradient plus two Adam optimizer states), and activations. For a 13B model, gradients and optimizer states alone reach ~78 GB (13B × 3 values × 2 bytes) before activations are counted. This is why PEFT is a necessity for most practitioners, not just a convenience.
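The arithmetic above can be reproduced with a short sketch. It assumes 1 GB = 1e9 bytes and that gradients and both Adam states are stored at the same 16-bit precision; the function name and the 0.1% trainable fraction are hypothetical, chosen only to illustrate how the total shrinks when fewer parameters are trainable.

```python
def grad_and_optimizer_gb(trainable_params: float,
                          bytes_per_value: int = 2,
                          values_per_param: int = 3) -> float:
    """Memory for gradients plus Adam states, in GB (1 GB = 1e9 bytes).

    values_per_param=3 assumes one gradient and two Adam moment
    estimates per trainable parameter, all at the same precision.
    Weights and activations are deliberately not counted here.
    """
    return trainable_params * values_per_param * bytes_per_value / 1e9


# Full finetuning: all 13B parameters are trainable.
full_finetune_gb = grad_and_optimizer_gb(13e9)

# Hypothetical PEFT setup where only 0.1% of the parameters are trainable.
peft_gb = grad_and_optimizer_gb(13e9 * 0.001)

print(f"full finetuning: {full_finetune_gb:.1f} GB")
print(f"0.1% trainable:  {peft_gb:.3f} GB")
```

The estimate covers only gradient and optimizer memory; the frozen weights still have to fit in memory in either case.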

LoRA mechanism: For a weight matrix W of shape n×m, LoRA expresses the weight update as the product of two low-rank matrices A (n×r) and B (r×m). Only A and B are trained; the base weight W stays frozen. Update formula: W′ = W + (α/r) × AB. On GPT-3 175B, LoRA achieves comparable or better performance than full finetuning with only ~0.0027% of the parameters trainable. Critically, the product AB can be merged back into W before serving, so a merged adapter adds no inference latency. (Hu et al., 2021, cited in Chapter 7)
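A minimal NumPy sketch of the update formula follows, under stated assumptions: the matrix sizes, rank, and α are toy values, inputs are treated as row vectors (y = xW), and A and B are filled with random numbers to stand in for trained adapters. It checks that serving the adapter separately and serving the merged weight give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, alpha = 64, 48, 4, 8          # toy sizes, hypothetical

W = rng.normal(size=(n, m))            # frozen base weight
A = rng.normal(size=(n, r)) * 0.01     # trainable low-rank factor (n x r)
B = rng.normal(size=(r, m)) * 0.01     # trainable low-rank factor (r x m)
scale = alpha / r

x = rng.normal(size=(5, n))            # batch of 5 inputs, row-vector convention

# Adapter kept separate: base path plus a low-rank side path.
y_adapter = x @ W + scale * (x @ A) @ B

# Adapter merged into the weight: W' = W + (alpha / r) * A B.
W_merged = W + scale * (A @ B)
y_merged = x @ W_merged

assert np.allclose(y_adapter, y_merged)

trainable = A.size + B.size            # r * (n + m) values
print(f"trainable fraction for this matrix: {trainable / W.size:.3f}")
```

The merged matrix has the same shape as W, which is why a merged adapter costs nothing extra at inference time.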

QLoRA: Stores the base model in 4-bit NF4 (NormalFloat-4, designed for the near-normal distribution of pre-trained weights), dequantizes to BF16 for computation, and trains LoRA adapters in BF16. The memory savings come almost entirely from the 4-bit base, not the adapters. Result: a 65B-parameter model finetuned on a single 48 GB GPU; Guanaco 65B, with a 41 GB memory footprint, reached Elo 1022 vs. ChatGPT's 966. (Dettmers et al., 2023, cited in Chapter 7)
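To see why the 4-bit base dominates the savings, here is a rough back-of-the-envelope sketch. It counts weight storage only and assumes 1 GB = 1e9 bytes; it ignores quantization constants, adapter weights, gradients, optimizer states, and activations, so the result is a lower bound rather than a measured footprint. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


base_16bit_gb = weight_memory_gb(65e9, 16)   # 16-bit base weights
base_4bit_gb = weight_memory_gb(65e9, 4)     # 4-bit base weights

print(f"65B weights at 16-bit: {base_16bit_gb:.1f} GB")
print(f"65B weights at 4-bit:  {base_4bit_gb:.1f} GB")
```

At 16-bit the weights alone exceed a 48 GB GPU several times over, while at 4-bit they leave room for the adapters and training state.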

Model merging via task arithmetic: Models finetuned from the same base share a coordinate space. Task vectors (finetuned weights − base weights) can be added or subtracted to combine capabilities without retraining. Merging requires no GPU compute, only CPU arithmetic on weight tensors. Pruning each task vector before merging reduces inter-task interference: TIES zeroes the lowest-magnitude parameters (for example, the bottom 80%, keeping the top 20%), while DARE randomly drops parameters and rescales the survivors. (Ilharco et al., 2022; Yadav et al., 2023, cited in Chapter 7)
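Task arithmetic can be sketched on toy state dicts. The parameter name, the 2×2 tensors, and the helper names below are hypothetical; real checkpoints would hold many tensors per model, but the arithmetic is the same elementwise addition and subtraction.

```python
import numpy as np


def task_vector(finetuned: dict, base: dict) -> dict:
    """Task vector = finetuned weights minus base weights, per tensor."""
    return {name: finetuned[name] - base[name] for name in base}


def apply_task_vectors(base: dict, vectors: list, weights: list) -> dict:
    """Return base + sum(weight_i * vector_i); a negative weight subtracts."""
    merged = {name: tensor.copy() for name, tensor in base.items()}
    for vector, weight in zip(vectors, weights):
        for name in merged:
            merged[name] += weight * vector[name]
    return merged


# Toy checkpoints sharing one base (hypothetical values).
base = {"layer.weight": np.zeros((2, 2))}
ft_a = {"layer.weight": np.array([[1.0, 0.0], [0.0, 0.0]])}
ft_b = {"layer.weight": np.array([[0.0, 0.0], [0.0, 2.0]])}

tau_a = task_vector(ft_a, base)
tau_b = task_vector(ft_b, base)

# Add both task vectors to combine the two finetunes.
merged = apply_task_vectors(base, [tau_a, tau_b], [1.0, 1.0])

# Subtract a task vector to remove that behavior again.
without_b = apply_task_vectors(merged, [tau_b], [-1.0])
```

In this toy case the two task vectors touch different entries, so nothing interferes; overlapping entries are where pruning becomes relevant.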

Why This Matters

Finetuning to fix an information failure, rather than using RAG, produces two compounding problems. First, it underperforms on the very task it targets: a model finetuned for factual QA on current events scored 0.504–0.588 vs. a base model with RAG scoring 0.875 on the same task (Ovadia et al., 2024, cited in Chapter 7), and finetuning can additionally degrade unrelated capabilities (the alignment tax). Second, the finetuned knowledge goes stale, requiring periodic retraining, while RAG can be updated by swapping retrieval indices. The failure-type routing rule (form vs. facts) prevents both problems.

LoRA adapters can be served in two ways, and the choice is architecturally significant for multi-LoRA serving. Merging A and B into W before serving adds no inference latency but produces one full weight copy per adapter. Keeping the adapters separate lets a deployment serving 100 customers or tasks store 1 base matrix plus 100 small adapter sets instead of 100 full copies, at the cost of some extra latency at inference time. At rank r=8 for a single 4096×4096 weight matrix, this reduces storage from 1.68B parameters (100 full matrices) to 23.3M total. The intrinsic dimension argument also has a counterintuitive selection implication: when choosing a base model for finetuning, raw benchmark performance matters less than how well the model was pre-trained, because a stronger pre-trained model requires less finetuning data and fewer trainable parameters.
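The storage figures for a single 4096×4096 matrix can be checked directly. The sketch below counts parameters only (not bytes), and the variable names are illustrative.

```python
def lora_params(n: int, m: int, r: int) -> int:
    """Parameters in one adapter pair: A is n x r, B is r x m."""
    return r * (n + m)


n = m = 4096
r = 8
num_adapters = 100

full_matrix = n * m                      # one full weight matrix
per_adapter = lora_params(n, m, r)       # one (A, B) pair

# Option 1: merge every adapter, storing one full matrix per customer or task.
merged_copies = num_adapters * full_matrix

# Option 2: keep adapters separate, storing one shared base plus 100 (A, B) pairs.
shared_base = full_matrix + num_adapters * per_adapter

print(f"merged copies: {merged_copies / 1e9:.2f}B parameters")
print(f"shared base:   {shared_base / 1e6:.1f}M parameters")
```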

Good Examples

Routing a failure correctly: A model producing valid SQL but using the wrong dialect syntax (e.g., Postgres-specific functions in a MySQL context) is a form failure. Finetune on dialect-correct examples. A model answering questions about last quarter's earnings with outdated figures is a facts failure. Add RAG over current financials; do not finetune.

LoRA rank selection: For a customer-support tone adaptation (style/register change), r=4–8 applied to Wq and Wv is sufficient. For a code-generation task requiring new syntax patterns (e.g., a proprietary API), r=16–32 applied to all four attention matrices plus feedforward layers is more appropriate. Higher r risks overfitting when training data is small.
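To get a feel for how rank and target-matrix choice change the trainable-parameter budget, here is a rough counting sketch. The model shape (hidden size 4096, 32 layers, square attention projections) is a hypothetical assumption, not taken from the chapter, and feedforward layers are left out because their shapes differ from the attention matrices.

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          matrices_per_layer: int) -> int:
    """Trainable LoRA parameters when every targeted matrix is hidden x hidden."""
    per_matrix = rank * (hidden + hidden)   # A is hidden x r, B is r x hidden
    return layers * matrices_per_layer * per_matrix


# Style/tone adaptation: r=8 on Wq and Wv only (2 matrices per layer).
style_cfg = lora_trainable_params(hidden=4096, layers=32, rank=8,
                                  matrices_per_layer=2)

# New syntax patterns: r=32 on all four attention matrices.
code_cfg = lora_trainable_params(hidden=4096, layers=32, rank=32,
                                 matrices_per_layer=4)

print(f"style config: {style_cfg:,} trainable parameters")
print(f"code config:  {code_cfg:,} trainable parameters")
print(f"ratio: {code_cfg // style_cfg}x")
```

The larger configuration has several times more capacity to fit, which is the source of the overfitting risk when training data is small.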

Model merging to add a capability: A base model has been finetuned separately for (A) formal writing style and (B) structured JSON output. Rather than retraining from scratch to get both, compute task vectors for A and B, apply TIES pruning to each, then add both vectors to the base weights. The merged model inherits both behaviors without a GPU training run.

Counterpoints

Finetuning to update factual knowledge: Finetuning a model on new facts (product specs, current events, recent research) creates a model that is expensive to keep current and degrades on unrelated tasks due to alignment tax. RAG is almost always the right tool for information that changes over time.

Full finetuning when LoRA suffices: Full finetuning a 13B model with Adam requires ~78 GB for gradients and optimizer states alone, before activations. LoRA at r=8 reduces trainable parameters by orders of magnitude with comparable or better task performance. Defaulting to full finetuning signals a misunderstanding of intrinsic dimension; it is not a safer or higher-quality option.

Merging task vectors without interference management: Naively adding task vectors from models finetuned on conflicting objectives (e.g., verbose explanations vs. terse responses) produces degraded outputs on both tasks. Pruning the task vectors before merging, either by zeroing low-magnitude parameters (TIES) or by randomly dropping parameters and rescaling the rest (DARE), is necessary when task vectors overlap in parameter space. Skipping this step and attributing failures to "merging doesn't work" misses the fix.
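The two pruning styles can be sketched on a toy task vector. This is a simplified illustration, not a full implementation of either method: the magnitude-based function covers only the trimming step of TIES (the full method also resolves sign conflicts across task vectors before merging), and the random function follows the DARE idea of dropping entries with probability p and rescaling survivors by 1/(1 − p). Function names and the example values are hypothetical.

```python
import numpy as np


def trim_by_magnitude(tv: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
    """Keep the largest-magnitude entries and zero the rest (trim step only).

    Entries tied with the threshold are all kept, so slightly more than
    keep_fraction of the entries can survive.
    """
    flat = np.abs(tv).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(tv) >= threshold, tv, 0.0)


def drop_and_rescale(tv: np.ndarray, drop_prob: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Randomly zero entries with probability drop_prob, rescale survivors."""
    mask = rng.random(tv.shape) >= drop_prob
    return tv * mask / (1.0 - drop_prob)


tau = np.array([0.05, -2.0, 0.01, 0.3, -0.02, 1.5, 0.0, -0.1, 0.04, 0.2])

trimmed = trim_by_magnitude(tau, keep_fraction=0.2)
dropped = drop_and_rescale(tau, drop_prob=0.8, rng=np.random.default_rng(0))

print(trimmed)   # only the two largest-magnitude entries remain
print(dropped)   # random survivors, each scaled by 1 / (1 - 0.8)
```

Either pruned vector would then be added to the base weights in place of the raw task vector.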

Key Quotes

"In short, finetuning is for form, and RAG is for facts." (Chapter 7)

"The better trained an LLM is, the easier it is to finetune the model using a small number of trainable parameters and a small amount of data." (Chapter 7)

"Finetuning is the right tool specifically for behavioral failures — when the model's outputs are factually correct but wrong in form, style, format, or syntax." (Chapter 7)

"QLoRA's memory savings come almost entirely from quantizing the base model to 4-bit NF4 — not from the LoRA adapters." (Chapter 7)

Rules of Thumb

If the failure is informational (wrong or stale facts), use RAG first. Only finetune if the failure is behavioral (wrong format, style, syntax, or register).
Choose base model quality over raw task performance when selecting a model to finetune; better pre-training means lower intrinsic dimension and less finetuning effort.
Default LoRA configuration: r=4–16, applied to all four attention matrices (Wq, Wk, Wv, Wo); add feedforward layers if quality is insufficient; increase r only if underfitting persists.
Use QLoRA when GPU memory is the binding constraint. NF4 base + BF16 adapters enables 65B-scale finetuning on a single 48 GB GPU with minimal quality loss.
Before merging task vectors, prune them with TIES (zero low-magnitude parameters) or DARE (randomly drop parameters and rescale the rest) to reduce inter-task interference. Never merge raw task vectors from models trained on conflicting objectives without this step.

Related References

RAG, Agents, and Context Construction - RAG (for facts) vs. finetuning (for form)
Dataset Engineering and Data Quality - Training data for finetuning
The Three-Axis Model and AI Engineering Discipline - Finetuning as Axis 3 of the three-axis model