LoRA and adapter methods make fine-tuning LLMs practical for teams without massive compute budgets. Instead of training billions of parameters, you train a few million adapter parameters that sit alongside the base model. Quality approaches that of full fine-tuning on many tasks, while training memory requirements drop by roughly an order of magnitude. This post is a practical guide to LoRA, QLoRA, and the adapter patterns that ship in production in 2026.
Why not full fine-tune
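Full fine-tuning of a 7B model requires ~100GB of GPU memory during training, because weights, gradients, and optimizer state all have to be held for every parameter. A 70B model needs ~1TB. Few teams can afford the hardware, and most don't need to: the quality gap between full fine-tuning and LoRA is small for most tasks.
LoRA trains 0.1-1% of the parameters. For the same 7B model, LoRA training uses 10-20GB, which fits on a single consumer or mid-range GPU. Training time also drops, though usually by less than memory does, because the forward and backward passes still run through the full frozen model.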
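The memory figures above follow from a rough count of bytes per parameter. The sketch below assumes mixed-precision training with Adam at about 16 bytes per trainable parameter (2 for 16-bit weights, 2 for gradients, 12 for fp32 master weights and two optimizer moments) and ignores activation memory. The function names and the 0.5% trainable fraction are illustrative assumptions, not measurements.

```python
GB = 1e9  # decimal gigabytes; these are rough estimates only

def full_finetune_gb(n_params: float) -> float:
    """Weights + gradients + Adam state for every parameter (assumed 16 bytes each)."""
    return n_params * 16 / GB

def lora_gb(n_params: float, trainable_fraction: float = 0.005) -> float:
    """Frozen 16-bit base (2 bytes/param) plus full training state for the adapter only."""
    base_bytes = n_params * 2
    adapter_bytes = n_params * trainable_fraction * 16
    return (base_bytes + adapter_bytes) / GB

for size in (7e9, 70e9):
    print(f"{size / 1e9:.0f}B  full: {full_finetune_gb(size):.0f} GB  "
          f"LoRA: {lora_gb(size):.1f} GB")
```

Activations, batch size, and sequence length add to both totals, so real usage will be higher than these floor estimates.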
How LoRA works
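For each adapted weight matrix (typically the attention projections), LoRA adds two small matrices, A and B, that share an inner dimension r, the rank. For a weight of shape d × k, A is d × r and B is r × k, so the update to the layer's weights is A × B, a low-rank approximation of a full update. (The original LoRA paper labels the two factors the other way round and writes the product as BA; the naming is only a convention.) During training, only A and B are updated; the base model stays frozen.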
Typical rank r: 8-64. Lower = fewer parameters, faster training, less expressivity. Higher = more parameters, more capacity for complex adaptations.
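To make the rank trade-off concrete: an adapter on one weight matrix of shape d_out × d_in adds r × (d_out + d_in) parameters. The sketch below assumes a hypothetical 32-layer model with hidden size 4096 and adapters on four square attention projections per layer. Real architectures differ (for example, with grouped-query attention), so treat the totals as an order-of-magnitude guide.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter pair: a (d_out x r) matrix plus an (r x d_in) matrix."""
    return r * (d_in + d_out)

# Hypothetical model shape, chosen only for illustration.
HIDDEN, LAYERS, MATRICES_PER_LAYER = 4096, 32, 4

for r in (8, 16, 64):
    total = LAYERS * MATRICES_PER_LAYER * lora_params(HIDDEN, HIDDEN, r)
    print(f"r={r:>2}: {total / 1e6:.1f}M adapter parameters")
```

Against a model of about 7B parameters, these totals fall roughly in the 0.1% to 1% range quoted earlier.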
At inference, A × B can be merged into the base weights (no overhead) or kept separate (easy to swap adapters).
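The equivalence between the merged and separate paths can be checked on a toy example. This is a minimal sketch in plain Python, assuming the convention y = x W with W of shape d × k, A of shape d × r, and B of shape r × k. It omits the scaling factor (often written alpha / r) that real implementations apply to the update, and it uses random values for both adapter matrices purely for demonstration, whereas real training starts with one of them at zero.

```python
import random

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 6, 5, 2                  # toy sizes; r is the adapter rank
W = rand_matrix(d, k, rng)         # frozen base weight
A = rand_matrix(d, r, rng)         # trainable, d x r
B = rand_matrix(r, k, rng)         # trainable, r x k
x = rand_matrix(1, d, rng)         # one input row vector

# Adapter kept separate: base path plus low-rank path.
separate = add(matmul(x, W), matmul(matmul(x, A), B))

# Adapter merged: fold A x B into the base weight once, then a single matmul.
W_merged = add(W, matmul(A, B))
merged = matmul(x, W_merged)

max_diff = max(abs(s - m) for s, m in zip(separate[0], merged[0]))
print(f"max difference between the two paths: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why merging costs nothing at inference while the separate form keeps the adapter swappable.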
QLoRA
QLoRA quantizes the frozen base model to 4-bit and trains LoRA adapters, kept in higher precision, on top. This reduces memory further: LoRA fine-tuning of a 70B model fits on a single 80GB A100 or H100.
Quality is essentially preserved. Combines quantization and LoRA benefits. Standard for large-model fine-tuning in 2026.
See quantization post for context on the 4-bit base.
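A quick estimate shows why the 4-bit base matters for the single-GPU claim. The sketch below counts only the frozen base weights. It ignores the small per-block quantization constants, the adapter's own training state, and activations, and the helper name is an assumption for illustration.

```python
GB = 1e9  # decimal gigabytes

def base_weights_gb(n_params: float, bits: int) -> float:
    """Memory for the frozen base weights alone at a given precision."""
    return n_params * bits / 8 / GB

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B  16-bit: {base_weights_gb(n, 16):.0f} GB  "
          f"4-bit: {base_weights_gb(n, 4):.1f} GB")
```

At 16-bit, the 70B base alone exceeds an 80GB card. At 4-bit it leaves headroom for adapters, optimizer state, and activations.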
When LoRA is the right choice
Task-specific adaptation. Extract fields from invoices. Classify support tickets. Match a specific style. LoRA captures these patterns well with modest training data.
Domain adaptation. Legal, medical, financial language adaptation. LoRA trained on domain text improves base model performance in that domain.
When you have 1K-50K training examples. Below 1K, LoRA tends to overfit. Above 50K, full fine-tuning may be worth the extra compute for a marginal quality gain.
When LoRA is not the right choice
Significant behavior changes. Teaching a base model entirely new capabilities (new languages, new output formats) may need full fine-tuning. LoRA works best for adaptations of existing capabilities.
Very small changes. For minor behavior tweaks, prompting or RAG is cheaper than any fine-tuning.
When a pre-existing adapter or fine-tuned model already exists for your task. Hugging Face hosts thousands. Don't reinvent when something close exists.
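The rules of thumb from the last two sections can be written down as a small decision helper. This is only a sketch: the function name, flags, and return strings are hypothetical, and the 1K and 50K thresholds are the guidelines from this post rather than hard limits.

```python
def suggest_approach(n_examples: int, *, minor_tweak: bool = False,
                     existing_adapter: bool = False,
                     new_capability: bool = False) -> str:
    """Map the post's rules of thumb to a suggestion; thresholds are guidelines."""
    if existing_adapter:
        return "reuse the existing adapter or fine-tuned model"
    if minor_tweak:
        return "prompting or RAG"
    if new_capability:
        return "consider full fine-tuning"
    if n_examples < 1_000:
        return "too few examples: LoRA tends to overfit"
    if n_examples > 50_000:
        return "LoRA, or full fine-tuning if the extra compute is justified"
    return "LoRA"

print(suggest_approach(5_000))
print(suggest_approach(200))
print(suggest_approach(5_000, minor_tweak=True))
```

The ordering of the checks reflects cost: the cheapest options (reuse, prompting) are ruled in before any training is considered.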
Tooling in 2026
PEFT (Hugging Face). Standard library for LoRA, QLoRA, and adapter variants. Integrates with transformers and trl for training pipelines.
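As a sketch of what a typical adapter configuration looks like, the dictionary below uses keys named after fields of PEFT's `LoraConfig`. The values are illustrative starting points rather than recommendations from this post. The `target_modules` names assume a Llama-style architecture; other models name their projection layers differently.

```python
# Keyword arguments mirroring fields of peft.LoraConfig (values are illustrative).
lora_kwargs = {
    "r": 16,               # adapter rank, inside the typical 8-64 range
    "lora_alpha": 32,      # scaling numerator; the update is scaled by lora_alpha / r
    "lora_dropout": 0.05,  # dropout on the adapter path during training
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed Llama-style names
    "bias": "none",        # leave bias terms frozen
    "task_type": "CAUSAL_LM",
}

scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"effective scaling on the low-rank update: {scaling}")
```

With PEFT installed, these can be passed as `LoraConfig(**lora_kwargs)` and the result handed to `get_peft_model` together with a loaded base model.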
Unsloth. Optimized training library. 2-5x speedup over standard PEFT for many tasks. Popular for individual practitioners and smaller teams.
Axolotl. Higher-level training framework. Config-driven. Popular for teams running many fine-tuning experiments.
Commercial platforms. Together, Anyscale, and Modal Labs offer managed LoRA fine-tuning. Reasonable for teams that don't want to manage GPU infrastructure themselves.
Deployment patterns
Merge adapter into base. Before deployment, merge A × B into base weights. Inference behaves as a single model with no overhead. Simple to deploy.
Keep adapter separate. Multiple adapters for different tenants or tasks. Load adapter dynamically at request time. Flexible but adds inference complexity.
Multi-LoRA serving. vLLM supports serving multiple adapters on top of one base model, with different requests routed to different adapters. Well suited to multi-tenant deployments where each customer has their own adapter.
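The routing idea behind serving several adapters over one base model can be sketched without any serving framework. The toy below is not vLLM's API: `AdapterRegistry`, `Adapter`, and the tenant names are hypothetical, and a weight vector with a rank-1 correction stands in for a real model and its low-rank deltas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Adapter:
    a: List[float]  # rank-1 "down" vector, same length as the input
    b: float        # rank-1 "up" scalar (the output dimension is 1 in this toy)

@dataclass
class AdapterRegistry:
    base: List[float]                                   # shared, frozen base weights
    adapters: Dict[str, Adapter] = field(default_factory=dict)

    def register(self, tenant: str, adapter: Adapter) -> None:
        self.adapters[tenant] = adapter

    def forward(self, x: List[float], tenant: Optional[str] = None) -> float:
        """Base output plus the tenant's low-rank correction, if one is registered."""
        y = sum(xi * wi for xi, wi in zip(x, self.base))
        adapter = self.adapters.get(tenant) if tenant else None
        if adapter is not None:
            y += sum(xi * ai for xi, ai in zip(x, adapter.a)) * adapter.b
        return y

registry = AdapterRegistry(base=[1.0, 2.0])
registry.register("tenant_a", Adapter(a=[1.0, 0.0], b=0.5))
registry.register("tenant_b", Adapter(a=[0.0, 1.0], b=-1.0))

x = [2.0, 3.0]
for tenant in (None, "tenant_a", "tenant_b"):
    print(tenant, registry.forward(x, tenant))
```

The base weights are held once and shared. Only the small per-tenant adapter differs between requests, which is what makes this pattern cheap to scale across many customers.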
Quality expectations
For narrow tasks: LoRA quality often within 1-5% of full fine-tune. For broad adaptation: gap can be larger, especially on tasks requiring structural changes to model behavior.
Evaluate on your specific task. Generic benchmarks are not sufficient. See eval post.
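One way to put a number on the gap for your own task is to score both variants on the same held-out set and compare. The sketch below reads "within 1-5%" as a relative shortfall, which is an assumption, and the scores are hypothetical placeholders rather than results.

```python
def relative_gap(lora_score: float, full_score: float) -> float:
    """Shortfall of LoRA versus full fine-tuning, as a fraction of the full score."""
    return (full_score - lora_score) / full_score

# Hypothetical accuracies on a held-out, task-specific evaluation set.
scores = {"full_finetune": 0.90, "lora_r16": 0.88}
gap = relative_gap(scores["lora_r16"], scores["full_finetune"])
print(f"LoRA trails full fine-tuning by {gap:.1%}")
```

Running the same comparison across a few ranks shows whether a larger r closes the gap or whether the task needs a different approach.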