eazyware
Engineering·May 13, 2024·11 min read

LoRA and adapters: fine-tune at 1% the cost

LoRA trains small low-rank matrices instead of the full weights, reaching up to 99% of the quality at around 1% of the compute. This post covers when LoRA works and when a full fine-tune is required.

Kushal R.
Engineering lead

LoRA and adapter methods make fine-tuning LLMs practical for teams without massive compute budgets. Instead of training billions of parameters, you train a few million adapter parameters that sit alongside the base model. Quality approaches full fine-tuning; compute requirements drop 90%+. This post is the practical guide to LoRA, QLoRA, and the adapter patterns that ship in production in 2026.

LoRA mechanics
LoRA: low-rank adaptation

Full fine-tune
· Updates all 70B parameters
· Needs massive compute (8xA100+)
· Best quality, hardest to manage
· Cost: $10K-$100K+ per run

LoRA (low-rank adapter)
· Updates ~0.1-1% of weights
· Fits on a single A100 or consumer GPU
· 95-99% of full-FT quality
· Cost: $10-$500 per run

How LoRA works (simplified)
· Insert small rank-r matrices (A, B) alongside the frozen original weights W
· Only A and B train; W stays frozen
· output = Wx + BAx (added effect)
· Rank r is typically 8-64; larger r = more capacity, more params to train
· Multiple adapters stack: train separate LoRAs for separate tasks, load as needed

Base model weights frozen. Small trainable rank-r decomposition matrices added to each attention layer. Training updates only the adapter; inference combines both.

Why not full fine-tune

Full fine-tuning of a 7B model requires roughly 100GB of GPU memory during training, because weights, gradients, and optimizer state all have to be held at once. A 70B model needs around 1TB. Few teams can afford the hardware, and most don't need to: the quality gap between a full fine-tune and LoRA is small for most tasks.

LoRA trains 0.1-1% of the parameters. For the same 7B model, LoRA training uses 10-20GB, which fits on a single consumer or mid-range GPU. Training time drops as well.
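The memory figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision Adam at about 16 bytes per trained parameter and ignores activations and framework overhead, so treat the numbers as order-of-magnitude estimates. The function names are illustrative, not from any library.

```python
# Rough training-memory estimate. Activations and framework overhead are ignored.
GB = 1e9

def full_finetune_gb(n_params: float) -> float:
    # Assumed mixed-precision Adam budget per trained parameter:
    # 2 B fp16 weights + 2 B fp16 gradients
    # + 12 B for fp32 master weights and the two Adam moments.
    return n_params * 16 / GB

def lora_gb(n_params: float, trainable_fraction: float) -> float:
    # The frozen base is held in 16-bit (2 B per parameter). Only the adapter
    # carries gradients and optimizer state (16 B per trainable parameter).
    base_bytes = n_params * 2
    adapter_bytes = n_params * trainable_fraction * 16
    return (base_bytes + adapter_bytes) / GB

print(f"7B full fine-tune:  {full_finetune_gb(7e9):.0f} GB")
print(f"70B full fine-tune: {full_finetune_gb(70e9):.0f} GB")
print(f"7B LoRA at 0.5%:    {lora_gb(7e9, 0.005):.1f} GB")
```

Under these assumptions the saving comes almost entirely from dropping gradients and optimizer state for the frozen base; the base weights themselves still have to sit in memory.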

How LoRA works

For each attention layer, LoRA adds two small matrices, A and B, of rank r. The update to the layer's weights is the product B × A, a low-rank approximation of the full weight update. During training, only A and B are updated; the base model stays frozen.

Typical rank r: 8-64. Lower = fewer parameters, faster training, less expressivity. Higher = more parameters, more capacity for complex adaptations.
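To see how rank translates into trainable parameters, here is a small sketch. The model shape (hidden size 4096, 32 layers, four adapted attention projections per layer) is an assumed 7B-class configuration chosen for illustration, not a specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int,
                     n_layers: int, matrices_per_layer: int) -> int:
    # For one adapted weight matrix, A is (rank x d_in) and B is (d_out x rank).
    per_matrix = rank * (d_in + d_out)
    return per_matrix * matrices_per_layer * n_layers

# Assumed shape: hidden size 4096, 32 layers, 4 attention projections adapted.
for r in (8, 16, 64):
    n = lora_param_count(4096, 4096, r, n_layers=32, matrices_per_layer=4)
    print(f"r={r:>2}: {n:>11,} trainable params ({n / 7e9:.2%} of 7B)")
```

With this shape, ranks 8 to 64 land at roughly 0.12% to 0.96% of a 7B model. Adapting fewer matrices per layer, or adding MLP layers as targets, moves the count down or up accordingly.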

At inference, B × A can be merged into the base weights (no overhead) or kept separate (easy to swap adapters).
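The equivalence between the separate and merged forms can be checked numerically. The following is a minimal sketch on tiny random matrices using only the standard library; it leaves out the alpha/r scaling factor that real implementations usually apply to the adapter term, and the helper names are illustrative.

```python
import random

def matmul(X, Y):
    # Plain-Python matrix product of nested lists: (n x k) @ (k x m).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    # Element-wise sum of two matrices with the same shape.
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, r = 6, 5, 2

W = rand_matrix(d_out, d_in, rng)  # frozen base weight
A = rand_matrix(r, d_in, rng)      # trainable, r x d_in
B = rand_matrix(d_out, r, rng)     # trainable, d_out x r (random here so the check is non-trivial)
x = rand_matrix(d_in, 1, rng)      # one input column vector

# Adapter kept separate: output = Wx + B(Ax)
separate = matadd(matmul(W, x), matmul(B, matmul(A, x)))

# Adapter merged: W' = W + BA, then output = W'x
W_merged = matadd(W, matmul(B, A))
merged = matmul(W_merged, x)

max_diff = max(abs(s[0] - m[0]) for s, m in zip(separate, merged))
print(f"max difference between separate and merged outputs: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why merging adds no inference cost, while the separate form keeps W untouched so adapters can be swapped.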

QLoRA

QLoRA quantizes the frozen base model to 4-bit and trains LoRA adapters on top, which reduces memory further. LoRA fine-tuning of a 70B model then fits on a single 80GB A100 or H100.

Quality is essentially preserved, so QLoRA combines the benefits of quantization and LoRA. It is the standard approach for large-model fine-tuning in 2026.

See the quantization post for context on the 4-bit base.
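A quick calculation shows why a 4-bit base makes a 70B model fit on one 80GB card. This sketch counts only base weights plus adapter training state; the 0.2% adapter fraction and the 16 bytes per trainable parameter are assumptions, and quantization constants and activations are ignored.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    # Storage for the base weights alone at a given precision.
    return n_params * bits_per_param / 8 / 1e9

def qlora_gb(n_params: float, trainable_fraction: float = 0.002) -> float:
    # 4-bit frozen base + adapter weights, gradients, and Adam state
    # (assumed 16 B per trainable parameter).
    adapter = n_params * trainable_fraction * 16 / 1e9
    return weight_gb(n_params, 4) + adapter

for bits in (16, 8, 4):
    print(f"70B base at {bits:>2}-bit: {weight_gb(70e9, bits):.0f} GB")

print(f"70B QLoRA estimate: {qlora_gb(70e9):.1f} GB")
```

The estimate comes out well under 80GB, which leaves headroom for activations; how much headroom is actually needed depends on batch size and sequence length.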

When LoRA is the right choice

Task-specific adaptation. Extract fields from invoices. Classify support tickets. Match a specific style. LoRA captures these patterns well with modest training data.

Domain adaptation. Legal, medical, financial language adaptation. LoRA trained on domain text improves base model performance in that domain.

When you have 1K-50K training examples. Below 1K, LoRA tends to overfit. Above 50K, full fine-tuning may be worth the extra compute for a marginal quality gain.

When LoRA is not the right choice

Significant behavior changes. Teaching a base model entirely new capabilities (new languages, new output formats) may need full fine-tuning. LoRA works best for adaptations of existing capabilities.

Very small changes. For minor behavior tweaks, prompting or RAG is cheaper than any fine-tuning.

When an adapter or fine-tuned model already exists for your task. Hugging Face hosts thousands; don't reinvent the wheel when something close is available.
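The rules of thumb from the last two sections can be written down as a small decision helper. This is only a sketch of the heuristics in this post: the function name, its parameters, and the order in which the checks run are assumptions, and the thresholds are starting points rather than hard limits.

```python
def suggest_approach(n_examples: int,
                     existing_adapter: bool = False,
                     minor_tweak: bool = False,
                     new_capability: bool = False) -> str:
    # Cheapest options are checked first.
    if existing_adapter:
        return "reuse existing adapter"
    if minor_tweak:
        return "prompting or RAG"
    if new_capability:
        return "consider full fine-tune"
    if n_examples < 1_000:
        return "collect more data (LoRA likely to overfit)"
    if n_examples > 50_000:
        return "LoRA, or full fine-tune if the extra compute is justified"
    return "LoRA"

print(suggest_approach(5_000))
print(suggest_approach(300))
print(suggest_approach(20_000, new_capability=True))
```

In practice the inputs are judgment calls (what counts as a "minor tweak" is rarely obvious up front), so a helper like this is more useful as a checklist than as automation.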

Tooling in 2026

PEFT (Hugging Face). Standard library for LoRA, QLoRA, and adapter variants. Integrates with transformers and trl for training pipelines.

Unsloth. Optimized training library. 2-5x speedup over standard PEFT for many tasks. Popular for individual practitioners and smaller teams.

Axolotl. Higher-level training framework. Config-driven. Popular for teams running many fine-tuning experiments.

Commercial platforms. Together, Anyscale, and Modal Labs offer managed LoRA fine-tuning. Reasonable for teams that don't want to manage GPU infrastructure themselves.

Deployment patterns

Merge adapter into base. Before deployment, merge B × A into the base weights. Inference then behaves as a single model with no added overhead, which keeps deployment simple.

Keep adapter separate. Multiple adapters for different tenants or tasks. Load adapter dynamically at request time. Flexible but adds inference complexity.

Adapter stacking (multi-LoRA serving). vLLM supports serving multiple adapters with one base model. Different requests route to different adapters. Excellent for multi-tenant fine-tuning where each customer has their own adapter.
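The routing idea behind per-tenant adapters can be sketched without any serving framework. The code below is a toy illustration, not vLLM's API: the `AdapterCache` class, the `route` function, the tenant names, and the adapter names are all made up, and `fake_load` stands in for reading adapter weights from disk.

```python
from collections import OrderedDict

class AdapterCache:
    """Keeps at most `capacity` adapters loaded, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # hypothetical loader: adapter name -> weights
        self._loaded = OrderedDict()

    def get(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)        # mark as most recently used
        else:
            if len(self._loaded) >= self.capacity:
                self._loaded.popitem(last=False)  # evict least recently used
            self._loaded[name] = self.load_fn(name)
        return self._loaded[name]

    def loaded(self):
        return list(self._loaded)

# Made-up tenant-to-adapter mapping for illustration.
TENANT_ADAPTERS = {"acme": "acme-invoices-v2", "globex": "globex-tickets-v1"}

def route(tenant, cache):
    # Tenants without an adapter get None, meaning "serve from the base model".
    name = TENANT_ADAPTERS.get(tenant)
    return cache.get(name) if name else None

load_log = []

def fake_load(name):
    # Stand-in for loading adapter weights. Records each load so reuse is visible.
    load_log.append(name)
    return {"name": name}

cache = AdapterCache(capacity=1, load_fn=fake_load)
for tenant in ("acme", "acme", "globex", "unknown"):
    route(tenant, cache)

print(load_log)        # each adapter is loaded once; the repeat "acme" call is a cache hit
print(cache.loaded())  # only the most recently used adapter remains at capacity 1
```

A production server additionally has to batch requests that use different adapters against the same base weights, which is the part a framework like vLLM handles.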

Quality expectations

For narrow tasks, LoRA quality is often within 1-5% of a full fine-tune. For broad adaptation, the gap can be larger, especially on tasks requiring structural changes to model behavior.

Evaluate on your specific task; generic benchmarks are not sufficient. See the eval post.

Read next
When to fine-tune (and when RAG is fine)
LLM quantization: GPTQ, AWQ, GGUF, and the practical picks
Model distillation: making small models think like big ones
Tags
LoRA, adapters, fine-tuning, PEFT