LoRA / QLoRA Fine-Tuning Parameters
99%+ parameter reduction — fine-tune LLMs on consumer GPUs. From Llama 70B on a single 48GB GPU to Mistral 7B in minutes. Calculate trainable params, memory savings, and adapter sizes.
Why This ML Metric Matters
Why: LoRA enables fine-tuning of large language models on consumer hardware by training only low-rank adapter matrices.
How: LoRA adds ΔW = B×A where B is d×r and A is r×d. Total trainable = L × m × 2dr (L layers, m target modules per layer). QLoRA additionally quantizes the frozen base model to 4-bit NF4.
- 99%+ param reduction vs full fine-tune
- r=16 common default for 7B
- α=2r rule of thumb
- QLoRA enables 70B on 48GB
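The trainable-parameter formula above (L × m × 2dr) can be sketched in a few lines of Python. The Llama-style numbers below (32 layers, 4 attention projections, d=4096) are illustrative assumptions, and the formula assumes square d×d weight matrices, so it is a rough estimate rather than an exact count:

```python
def lora_trainable_params(n_layers, n_modules, d, r):
    """Trainable LoRA params: L layers x m target modules x 2*d*r each.

    Assumes square d x d weights; real models with differing projection
    shapes (e.g. grouped-query k/v heads) will deviate slightly.
    """
    return n_layers * n_modules * 2 * d * r

# Llama-2-7B-like: 32 layers, q/k/v/o targeted, d=4096, r=16
trainable = lora_trainable_params(32, 4, 4096, 16)
full = 7_000_000_000  # approximate full-model parameter count
print(f"trainable: {trainable:,}")           # 16,777,216
print(f"fraction:  {trainable / full:.4%}")  # ~0.24% of 7B
```

Even at r=16 across four projections per layer, the adapters stay well under 1% of the full model, which is where the "99%+ reduction" figure comes from.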
📋 Quick Examples
📊 Charts: Trainable vs Frozen Params · Memory Comparison (GB) · Trainable Params vs Rank · Alpha Scaling Factor
🤖 AI & ML Facts
Microsoft introduced LoRA (Hu et al., 2021) — now used in virtually every LLM fine-tuning pipeline
— Hu et al.
QLoRA's Guanaco models reach 99.3% of ChatGPT's performance on the Vicuna benchmark with a 4-bit base + LoRA adapters
— Dettmers et al.
Guanaco 65B was fine-tuned on a single 48GB A6000 using QLoRA
— QLoRA paper
HuggingFace PEFT has 10M+ downloads — LoRA is the default for parameter-efficient tuning
— PEFT
LoRA (Low-Rank Adaptation) trains small adapter matrices instead of updating the full model, adding 2×d×r trainable parameters per adapted module per layer. QLoRA quantizes the frozen base model to 4-bit (NF4), enabling 70B models on a single 48GB GPU. In practice LoRA reduces trainable parameters by 99%+ vs full fine-tuning.
Key Takeaways
- LoRA reduces trainable parameters by 99%+ vs full fine-tuning
- QLoRA adds 4-bit quantization to the frozen base model, enabling 70B on a single 48GB GPU
- Rank choice matters: r=8–64 typical; higher rank = more capacity but more memory
- Alpha/rank ratio (α/r) controls scaling; α=2r is a common rule of thumb
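The memory claim in the takeaways can be sanity-checked with simple arithmetic. The sketch below estimates weight memory only; it deliberately ignores quantization constants, activations, KV cache, and optimizer state, so real usage will be higher:

```python
def base_weight_memory_gb(n_params, bits):
    """Rough weight-only memory for a model stored at the given bit width.

    Illustrative lower bound: excludes quantization overhead, activations,
    KV cache, gradients, and optimizer state.
    """
    return n_params * bits / 8 / 1024**3

# A 70B-parameter base model, weights only:
print(f"NF4 (4-bit): {base_weight_memory_gb(70e9, 4):.1f} GB")   # ~32.6 GB
print(f"FP16:        {base_weight_memory_gb(70e9, 16):.1f} GB")  # ~130.4 GB
```

At 4-bit the 70B weights fit in roughly 33 GB, leaving headroom on a 48GB card for adapters, activations, and paged optimizer state — which is why QLoRA makes 70B single-GPU fine-tuning feasible where FP16 (~130 GB for weights alone) is not.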
How Does LoRA Work?
1. Low-Rank Decomposition
Instead of updating W directly, LoRA adds ΔW = B×A where B is d×r and A is r×d. Only 2dr parameters per adapted matrix vs d².
2. A and B Matrices
A is initialized randomly (Gaussian) and B with zeros, so ΔW = B×A starts at zero and training begins exactly from the pretrained model. The adapter output is scaled by α/r.
3. QLoRA 4-bit Base
QLoRA quantizes the base model to 4-bit (NF4), keeping it frozen. Only LoRA adapters are FP16. Enables 70B on 48GB.
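Steps 1 and 2 above can be demonstrated in plain Python with tiny, illustrative dimensions (d=4, r=2 — far smaller than any real model):

```python
import random

d, r, alpha = 4, 2, 4  # tiny dims for illustration; alpha = 2r rule of thumb
random.seed(0)

# A (r x d) starts random, B (d x r) starts at zero, so the initial
# update delta_W = B @ A is zero and the model output is unchanged.
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

def delta_w(B, A, alpha, r):
    """delta_W = (alpha / r) * B @ A, a d x d matrix from 2*d*r params."""
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

dW = delta_w(B, A, alpha, r)
print(all(v == 0.0 for row in dW for v in row))  # True: training starts at W
```

Once training updates B away from zero, ΔW becomes a nonzero rank-r matrix scaled by α/r, which is exactly the adapter that gets merged or saved.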
Comparison Table
| Feature | This Calculator | PEFT Config | Manual |
|---|---|---|---|
| LoRA param formula | ✅ | ❌ | ⚠️ |
| QLoRA memory estimate | ✅ | ❌ | ❌ |
| Example presets | ✅ | ❌ | ❌ |
| Charts & visualizations | ✅ | ❌ | ❌ |
Frequently Asked Questions
LoRA vs full fine-tuning?
LoRA trains only low-rank adapter matrices (typically <1% of params) while freezing the base model. Full fine-tuning updates all parameters. LoRA saves 99%+ memory and often matches full fine-tune quality.
How do I choose rank (r)?
Start with r=16 for 7B models. Use r=8 for smaller models or limited data; r=32–64 for complex tasks or larger models. Higher rank = more capacity but more memory.
What is alpha (α)?
Alpha scales the LoRA output. Effective scaling = α/r. α=2r is a common rule of thumb. When α=r, scaling is 1:1.
What is QLoRA NF4?
QLoRA uses 4-bit NormalFloat (NF4) quantization for the base model. The base stays frozen; only LoRA adapters are trained in FP16. Enables 70B on a single 48GB GPU.
Which target modules to use?
For best quality: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj. For speed: q_proj and v_proj only. Attention layers (q/k/v/o) are most impactful.
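As a sketch, the settings discussed in the FAQs above map onto a typical HuggingFace `transformers` + `peft` + `bitsandbytes` QLoRA setup like the following. The model id and hyperparameters are illustrative assumptions, not prescriptions; treat this as a configuration outline under those assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base with double quantization, bf16 compute (QLoRA-style defaults)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=bnb,
)

# r=16, alpha=2r, attention projections targeted (quality/speed middle ground)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapters are trainable
```

Adding `gate_proj` and `up_proj` to `target_modules` trades extra adapter parameters for the quality gains mentioned above.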
Can I merge LoRA adapters?
Yes. Merge adapters into base weights for zero-overhead inference. Use model.merge_and_unload() in PEFT.
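A minimal merge sketch with the `peft` library (model id and adapter path are illustrative placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
model = PeftModel.from_pretrained(base, "my-lora-adapter")               # illustrative path

# Fold (alpha/r) * B @ A into the base weights; the result is a plain
# model with no adapter overhead at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
```

Note that merging into a 4-bit-quantized base is lossy; for a clean merge, load the base in FP16/BF16 first.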
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual memory usage depends on implementation (PEFT, bitsandbytes), batch size, sequence length, and optimizer choice. For production, validate with your specific framework and hardware.