The Complete Guide to LoRA and QLoRA Fine-Tuning in Python

Learn how to fine-tune small language models with LoRA and QLoRA in Python using PyTorch, Transformers, PEFT, and TRL. Includes dataset formatting, training code, and practical tuning advice.

Fine-tuning language models used to mean big GPUs, long training runs, and a budget that ruled out most side projects. That changed once parameter-efficient methods became part of the standard Python stack.

Today, you can adapt a small language model for support workflows, internal docs, structured extraction, or narrow domain writing without touching every model weight. In practice, most Python teams now start with LoRA or QLoRA because the tooling is mature, the training loop is manageable, and the results are often good enough to ship.

This guide walks through the setup that makes sense today: PyTorch for training, Transformers for model loading, PEFT for adapters, and TRL for supervised fine-tuning. I will focus on the parts that usually trip people up, especially dataset format, chat templates, quantization choices, and training settings.

Why LoRA and QLoRA matter

LoRA, short for Low-Rank Adaptation, adds small trainable matrices to selected layers instead of updating the full base model. The PEFT documentation describes it as a way to reduce the number of trainable parameters dramatically, which is exactly why it became the default starting point for custom fine-tuning work.

QLoRA takes the same basic idea and combines it with 4-bit quantization. According to the Hugging Face Transformers quantization docs, loading a model in 4-bit cuts memory use by roughly 4x, and nested quantization can save another 0.4 bits per parameter. The original QLoRA paper showed that this approach could fine-tune a 65B model on a single 48 GB GPU while keeping task performance close to full 16-bit fine-tuning.

That does not mean every project should jump to QLoRA immediately. It means you now have two practical paths:

| Method | Best fit | Main advantage | Main tradeoff |
| --- | --- | --- | --- |
| LoRA | You have enough VRAM for the base model in normal precision | Simpler debugging and fewer moving parts | Higher memory use |
| QLoRA | You need to fine-tune on limited GPU memory | Much lower VRAM usage | Quantization adds extra configuration and more failure modes |

If you are working on a single modern GPU and the base model already fits comfortably, plain LoRA is easier to reason about. If memory is tight, QLoRA is usually the right call.
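To get a feel for the numbers, a back-of-the-envelope weight-memory estimate helps. The sketch below uses the roughly 4x reduction and the 0.4-bits-per-parameter nested-quantization saving cited above, applied to a hypothetical 7B-parameter model; these are weight-only figures, and optimizer state, gradients, and activations come on top:

```python
# Rough weight-only memory estimate for a hypothetical 7B-parameter model.
# Optimizer state, gradients, and activations add more on top of this.
params = 7e9

bf16_gb = params * 16 / 8 / 1e9                   # 2 bytes per parameter
nf4_gb = params * 4 / 8 / 1e9                     # 0.5 bytes per parameter
double_quant_saving_gb = params * 0.4 / 8 / 1e9   # ~0.4 bits/param saved

print(f"bf16 weights:       {bf16_gb:.1f} GB")                 # 14.0 GB
print(f"4-bit NF4 weights:  {nf4_gb:.1f} GB")                  # 3.5 GB
print(f"nested quant saves: {double_quant_saving_gb:.2f} GB")  # 0.35 GB
```

The gap between 14 GB and about 3.5 GB of weights is exactly why QLoRA opens up single-GPU setups that plain LoRA cannot reach.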

What you need before training

The current Python stack is straightforward:

pip install --upgrade torch transformers datasets peft trl accelerate bitsandbytes

For QLoRA, the Hugging Face docs recommend bitsandbytes, and the same docs note that support now covers NVIDIA GPUs for CUDA 11.8 through 13.0, plus Intel XPU, Gaudi, and CPU backends. If you are on CUDA hardware, that is still the most common route.

Your checklist should look like this:

  • Python 3.10 or newer
  • A causal language model small enough to load on your hardware
  • A dataset with instruction-response pairs or message-style conversations
  • Enough disk space for checkpoints
  • Patience for data cleaning, because that part still matters more than most hyperparameters

If you are training in mixed precision, use the modern PyTorch AMP API. The PyTorch 2.10 docs mark the older torch.cuda.amp.* shortcuts as deprecated and recommend torch.amp.autocast("cuda", ...) and torch.amp.GradScaler("cuda", ...) instead.

Get the dataset format right first

Most broken fine-tuning runs are data problems wearing a training hat.

TRL’s SFTTrainer accepts either language-modeling samples such as {"text": "..."} or prompt-completion style data. It also supports conversational formats like this:

{
    "messages": [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."}
    ]
}

That matters because modern chat models are still token predictors underneath. The Transformers chat templating guide makes the core point plain: chat history must be converted into the exact token pattern that the model expects. If your model ships with a tokenizer chat template, use it. If you apply the wrong control tokens, performance drops fast.
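To make this concrete, here is a simplified illustration of what a ChatML-style template (the family Qwen models use) renders from a messages list. The exact control tokens vary by model, so in real code you should always call the tokenizer's own apply_chat_template rather than formatting strings by hand:

```python
# Illustrative only: a simplified ChatML-style renderer. Real models ship
# their own template; use tokenizer.apply_chat_template in practice.
def render_chatml(messages, add_generation_prompt=True):
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Cue the model to produce the assistant turn next.
        text += "<|im_start|>assistant\n"
    return text

rendered = render_chatml([{"role": "user", "content": "What is LoRA?"}])
print(rendered)
```

Swap in a model that expects different control tokens and this exact string becomes wrong, which is the whole point: the template belongs to the tokenizer, not to your code.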

There are two practical rules here:

  1. Keep your dataset format consistent from the start.
  2. Match the data format to the trainer and tokenizer, not to your personal preference.

TRL’s dataset format guide also points out that SFTTrainer supports both plain text datasets and conversational messages datasets. If you are tuning an instruct or chat model, message format is usually the safer choice because it mirrors how the model was trained.
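If your raw data sits in flat prompt/response columns, converting it to the messages layout is a few lines. The column names below (prompt, response) are assumptions for illustration; swap in whatever your export actually uses:

```python
# Convert flat prompt/response rows into TRL's conversational format.
# The input column names here are hypothetical; adjust to your data.
def to_messages(row):
    return {
        "messages": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["response"]},
        ]
    }

rows = [{"prompt": "What is LoRA?",
         "response": "A parameter-efficient fine-tuning method."}]
converted = [to_messages(r) for r in rows]
print(converted[0]["messages"][0]["role"])  # user
```

With a Hugging Face Dataset you would apply the same function via dataset.map, dropping the original columns so the trainer only sees the messages field.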

A practical QLoRA training script

The script below uses a small Qwen base model, TRL’s SFTTrainer, PEFT’s LoraConfig, and 4-bit quantization through BitsAndBytesConfig. It is a good template for a first real run.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen3-0.6B-Base"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
model.config.use_cache = False

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules="all-linear",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)

train_dataset = load_dataset("trl-lib/Capybara", split="train[:1%]")

training_args = SFTConfig(
    output_dir="outputs/qwen3-0.6b-qlora",
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=True,
    max_seq_length=2048,
    packing=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()
trainer.model.save_pretrained("outputs/qwen3-0.6b-qlora/final")
tokenizer.save_pretrained("outputs/qwen3-0.6b-qlora/final")

There are a few details worth calling out.

First, target_modules="all-linear" is a useful PEFT shortcut when the architecture supports it. The current LoRA configuration docs describe this behavior directly. It is a good default for many transformer-based causal language models, and it saves you from hand-listing every projection layer.

Second, the TRL documentation notes that adapter training often uses a learning rate around 1e-4 because you are only learning new parameters, not rewriting the entire model. That is higher than what many people expect if they are used to full fine-tuning.

Third, packing can improve throughput by placing multiple examples into one sequence. It is often worth enabling once your data is already clean.

What changes when you use plain LoRA instead

If you do not need 4-bit quantization, the structure stays nearly the same. The main change is how you load the model:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Everything else, including LoraConfig, SFTTrainer, tokenizer handling, and dataset preparation, stays familiar. This is why teams often prototype with LoRA first and switch to QLoRA only when memory becomes the real bottleneck.

Hyperparameters that matter most

You can spend days nudging settings that barely move the needle. These are the ones that usually matter:

1. Dataset quality

Still the biggest lever. If your examples are inconsistent, duplicated, or full of mixed instruction styles, the model learns that confusion. Clean, narrow data beats a messy dataset that is ten times larger.

2. LoRA rank and alpha

The PEFT docs expose r and lora_alpha as first-class settings because they shape the adapter capacity. A common starting point is r=8 or r=16, then lora_alpha at 16 or 32. If you are adapting for a narrow task, lower ranks often work well. If the target behavior is broad or stylistically complex, you may need more capacity.
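The capacity cost of a given rank is easy to estimate: for a linear layer of shape d_in x d_out, LoRA adds r x (d_in + d_out) parameters. The sketch below uses a hypothetical 1024 x 1024 projection:

```python
# Back-of-the-envelope LoRA parameter count for one linear layer.
d_in, d_out = 1024, 1024  # hypothetical projection shape
r = 16

full_params = d_in * d_out        # the frozen base weight
lora_params = r * (d_in + d_out)  # A is (d_in x r), B is (r x d_out)

print(f"full layer:   {full_params:,}")                # 1,048,576
print(f"LoRA adapter: {lora_params:,} "
      f"({100 * lora_params / full_params:.1f}%)")     # 32,768 (3.1%)
```

Keep in mind that the LoRA update is scaled by lora_alpha / r, so raising r while holding lora_alpha fixed weakens the effective update; many people scale the two together for that reason.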

3. Sequence length

Higher sequence lengths improve context coverage, but they raise memory costs fast. Keep the training length close to your production use case. If your app only sends 512 to 1024 tokens, training at 4096 is often wasted effort.

4. Precision choice

The bitsandbytes guide recommends NF4 for training 4-bit base models, and that remains a sensible default. If your GPU supports bfloat16, it is usually the most comfortable compute dtype for QLoRA training because it avoids some of the nastier fp16 edge cases.

5. Loss masking

TRL supports assistant_only_loss=True for templates that provide assistant token masks. This is useful when you want the model to learn from responses only, not from user prompts. It is a good feature, but do not turn it on blindly. The chat template must support it.
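Under the hood, this kind of masking works by setting prompt-token labels to -100, the index that PyTorch's cross-entropy loss ignores. Here is a hand-rolled sketch of the idea with made-up token ids; TRL handles this for you when assistant_only_loss is enabled:

```python
IGNORE_INDEX = -100  # torch.nn.CrossEntropyLoss skips this label

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking out the prompt portion."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Hypothetical token ids: the first 4 tokens are the user prompt.
input_ids = [101, 7592, 2088, 102, 345, 678, 9]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 345, 678, 9]
```

The effect is that gradient updates come only from the assistant tokens, which is usually what you want when the user turns are noisy or repetitive.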

Common mistakes that waste a week

Ignoring the chat template

The Transformers documentation is clear on this point: different instruct models expect different control tokens. Two models built from the same base can still require different chat formatting. If you skip this step, the run may finish and still produce poor outputs.

Training on the wrong columns

SFTTrainer expects specific dataset shapes. If your data came from CSV exports, annotation tools, or internal logs, convert it before training. Do not hope the trainer will infer your intent from random column names.

Quantizing everything and then trying full fine-tuning

The bitsandbytes docs explicitly warn that 8-bit and 4-bit training support extra parameters, not full-model updates. That fits LoRA and QLoRA well. It does not mean you can quietly treat a quantized model like a normal full-precision checkpoint.

Forgetting padding and EOS behavior

Tokenizer mismatches create subtle bugs. If the tokenizer does not define a pad token, set one. If the model family needs a specific EOS token for the chat template, align it before training.

Over-reading short benchmark wins

A small validation loss drop can look encouraging and still fail in production. Test the adapted model against the prompts your users send in real work. I have seen plenty of fine-tuned models look sharp on curated samples and fall apart on messy internal tickets.

When LoRA or QLoRA is the wrong tool

These methods are strong defaults, not magic.

Skip adapter tuning if:

  • You only need prompt engineering and retrieval
  • Your domain knowledge changes daily and belongs in a database, not in weights
  • Your target model is already strong enough with a system prompt and a few examples
  • Your dataset is too small or too noisy to justify a training run

In those cases, a retrieval layer or a tighter prompt stack may give you more value with less operational drag.

Summary

LoRA is the clean entry point for adapting language models in Python. QLoRA extends the same workflow to lower-memory hardware by pairing adapters with 4-bit quantization. Today, the most practical stack is still PyTorch, Transformers, PEFT, and TRL.

If you remember only a handful of points, make them these:

  • Start with the dataset, not the model
  • Use the tokenizer’s chat template correctly
  • Pick LoRA when memory is comfortable and QLoRA when it is not
  • Use NF4 and bfloat16 as sane QLoRA defaults
  • Evaluate on real prompts before declaring the model done


Sources

This article was researched using the following primary sources:

  1. PEFT LoRA documentation
  2. TRL SFTTrainer documentation
  3. Transformers bitsandbytes quantization guide
  4. TRL dataset formats guide
  5. Transformers chat templating guide
  6. PyTorch Automatic Mixed Precision documentation
  7. QLoRA paper on arXiv
