The Ultimate LLM Fine-Tuning Guide
From dataset to GGUF - every parameter explained, every step runnable
Fine-tuning is a direct intervention into how a language model behaves. Not prompting, not system instructions, not RAG - actual weight modification. The model after training is a different model than before.
The use cases span an unusually wide range. Teaching a model a specific writing style or persona. Injecting domain knowledge it wasn’t trained on. Making it respond consistently in a particular language or format. Eliminating behaviors you don’t want. Building a character for a game that stays in character under pressure. Aligning a general-purpose model to a narrow, specialized task where generic responses are worse than useless. All of these are fine-tuning problems, and all of them work through the same mechanism: you show the model enough examples of what you want until the weights move.
This guide walks through the complete pipeline - environment setup, dataset format, training configuration, and export to a GGUF file you can run locally. The example model is Qwen3-0.6B, small enough to train on modest hardware. But the principles scale. The same levers that move a 0.6B model move a 70B model. The numbers change. The logic doesn’t.
What Fine-Tuning Actually Does
A language model is a probability distribution over tokens. Given a sequence of text, it assigns probabilities to what comes next. Training adjusts the weights - billions of floating-point numbers - so that the distribution shifts. The model that previously said “Paris” when asked about capitals still says “Paris”, but the model that previously rambled when asked to write product copy now writes clean, structured product copy.
Fine-tuning doesn’t erase what the model knows. It reshapes how that knowledge surfaces. Think of it less as reprogramming and more as extended, very intensive behavioral conditioning.
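To make that concrete, here is a minimal sketch - it assumes the torch/transformers install from the setup section and the Qwen3-0.6B download described later - that prints the model’s top candidates for the next token:
# Inspect the next-token distribution directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("./model/Qwen3-0.6B")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]       # scores for the next token
probs = torch.softmax(logits, dim=-1)

for p, idx in zip(*probs.topk(5)):               # five most likely continuations
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")
Fine-tuning moves exactly these numbers: after training, the same prompt produces a different distribution.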
The Stack
ms-swift — the training framework. Wraps HuggingFace Transformers with a clean CLI and sane defaults.
llama.cpp — for converting the trained model to GGUF format, which is what local inference tools like LM Studio, Ollama, and llama-server consume.
Miniconda — environment management. Keeps the CUDA dependencies isolated.
Prerequisites
GPU: An NVIDIA GPU with Turing architecture or newer - that’s the RTX 20-series / GTX 1660 Ti and up. This stack effectively requires Compute Capability 7.5, which corresponds to Turing; Pascal (GTX 10-series) is not supported. Realistically, for anything beyond a 0.6B toy model you want at least 8–12 GB VRAM - an RTX 3080, RTX 4070, or equivalent. The more VRAM, the larger the model and sequence length you can handle.
Driver: Linux driver ≥ 570.26, Windows driver ≥ 570.65. Check your current version with:
nvidia-smi
If the driver is outdated, update it before proceeding - mismatched driver/CUDA versions are the most common source of silent failures in this stack.
OS: Native Linux or Windows with WSL2. The setup below assumes Ubuntu. On WSL2: install the NVIDIA driver on the Windows host only — never inside WSL2. The driver is automatically exposed inside WSL2 as libcuda.so. Do not run apt install nvidia-driver-* inside WSL2.
CUDA Toolkit: Recommended on both native Linux and WSL2. The toolkit (nvcc, libraries) is separate from the driver.
Ubuntu 22.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install cuda-toolkit-12-8 -y
Ubuntu 24.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install cuda-toolkit-12-8 -y
After installation, add the toolkit to your PATH:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Verify with nvidia-smi (driver) and nvcc --version (toolkit).
Environment Setup
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
eval "$(~/miniconda3/bin/conda shell.bash hook)"
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create -n finetune python=3.11 -y
source ~/miniconda3/bin/activate
conda init --all
conda activate finetune
Then install PyTorch with CUDA 12.8 support, a prebuilt Flash Attention wheel, and ms-swift:
pip install torch==2.9.1 torchaudio==2.9.1 torchvision==0.24.1 \
--index-url https://download.pytorch.org/whl/cu128
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.5.4/flash_attn-2.8.3+cu128torch2.9-cp311-cp311-linux_x86_64.whl
pip install ms-swift
Flash Attention isn’t strictly required, but it meaningfully reduces memory usage and speeds up training on supported hardware - FlashAttention-2 needs an Ampere-generation GPU (RTX 30-series) or newer. Worth installing if your card qualifies; otherwise, omit the --attn_impl flash_attn flag from the training commands below.
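Before moving on, a quick sanity check that PyTorch actually sees the GPU:
# sanity_check.py - verify that the stack is wired up correctly.
import torch

print("torch:", torch.__version__)                   # expect 2.9.1+cu128
print("CUDA available:", torch.cuda.is_available())  # expect True
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn missing - omit --attn_impl flash_attn during training")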
Downloading the Model
Create a file download_model.py:
from huggingface_hub import snapshot_download

# Pull every file in the repo into a plain local folder.
snapshot_download(
    repo_id="Qwen/Qwen3-0.6B",
    local_dir="./model/Qwen3-0.6B",
    local_dir_use_symlinks=False,  # deprecated (and ignored) in recent huggingface_hub versions
)
pip install huggingface_hub
python download_model.py
This pulls the full model weights into ./model/Qwen3-0.6B. For larger models - say, Qwen3-14B or Llama-3-8B - the same script works; just swap the repo_id.
The Dataset
ms-swift expects a JSON array of conversation objects. Each entry maps directly to how the model will learn: user says X, assistant responds Y. The format supports three distinct structures - single-turn, multi-turn, and system-prompted - which can all coexist in the same file.
[
{
"messages": [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris, one of the most populous cities in Europe with a history stretching back to the Middle Ages."}
]
},
{
"messages": [
{"role": "user", "content": "I need help with my business plan. Where do I start?"},
{"role": "assistant", "content": "The most sensible entry point is market analysis — before you write anything else, you need to understand whether there's actually a paying customer for your idea. [...]"},
{"role": "user", "content": "Okay, and how do I structure the document after that?"},
{"role": "assistant", "content": "Classic structure: Executive Summary, Market Analysis, Business Model, Financial Planning, Team. The Executive Summary gets written last, paradoxically, even though it appears first. [...]"}
]
},
{
"messages": [
{"role": "system", "content": "You are Captain Harrow, a seasoned 18th-century mariner. You speak bluntly and directly, know every port between Lisbon and Malabar, and have little patience for landlubbers who don't know the sea."},
{"role": "user", "content": "Captain, I'd like to join your crew."},
{"role": "assistant", "content": "Join the crew, he says. Look at you — soft hands, no calluses, smell like a bakery. What use are you to me on the Valdez, boy? Can you tie a knot that holds in a storm? Have you ever stood watch for three days without sleep? [...]"}
]
}
]
This is a structural illustration, not a real dataset. Three entries demonstrate the format; a functional fine-tune requires substantially more. What “substantially more” means depends entirely on what you’re trying to achieve - teaching a narrow, well-defined behavior might need a few hundred high-quality examples, while shifting general style or instilling domain knowledge typically requires thousands. Quality matters more than quantity: inconsistent examples actively work against you, because the model learns the inconsistency along with everything else.
Save this as dataset.json in your working directory.
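Because one malformed entry can abort a training run, it’s worth a structural check before you train. A small sketch - the specific checks are this guide’s suggestions, not requirements imposed by ms-swift:
# validate_dataset.py - basic structural checks on dataset.json.
import json

with open("dataset.json") as f:
    data = json.load(f)

assert isinstance(data, list) and data, "expected a non-empty JSON array"
for i, entry in enumerate(data):
    messages = entry.get("messages")
    assert messages, f"entry {i}: missing 'messages'"
    roles = [m["role"] for m in messages]
    if roles[0] == "system":          # optional system prompt must come first
        roles = roles[1:]
    assert roles and roles[-1] == "assistant", f"entry {i}: must end on an assistant turn"
    for j, role in enumerate(roles):  # after the system prompt, user/assistant alternate
        assert role == ("user" if j % 2 == 0 else "assistant"), f"entry {i}: bad role order"
    assert all(m["content"].strip() for m in messages), f"entry {i}: empty content"

print(f"{len(data)} entries look structurally sound")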
Training
Understanding the Key Parameters
Before running anything, it’s worth knowing what you’re actually adjusting. These parameters matter regardless of model size.
Learning Rate (--learning_rate) controls how aggressively the weights are updated per step. Too high and training destabilizes — the loss spikes instead of declining. Too low and the model barely changes. For full fine-tuning of a small model, 6e-5 is a solid starting point. For larger models (7B+), you typically want to go lower: 1e-5 to 2e-5. For LoRA, the effective learning rate can be higher because only a fraction of the weights are being updated - 1e-4 is common.
Epochs (--num_train_epochs) is how many complete passes over the dataset the training makes. More epochs means more exposure to the data, but also higher risk of overfitting - the model memorizes your examples instead of generalizing from them. For small datasets (hundreds to low thousands of samples), 3–5 epochs is typical. For large datasets, 1–2 often suffices.
Warmup Ratio (--warmup_ratio) defines what fraction of training steps are used to gradually ramp up the learning rate from zero to its target value. Starting at full learning rate from step one often causes instability early in training. 0.05 means the first 5% of steps are warmup.
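Numerically, it looks like this - a simplified sketch; the 200-step total is a made-up figure, and the linear post-warmup decay is an illustration, since the real curve depends on which scheduler the framework is configured with:
# Linear warmup to the peak learning rate, then linear decay - illustrative only.
def lr_at_step(step, total_steps, peak_lr=6e-5, warmup_ratio=0.05):
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                  # ramp up from zero
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

for step in (0, 5, 10, 100, 200):
    print(step, f"{lr_at_step(step, total_steps=200):.2e}")   # 0 -> peak at step 10 -> 0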
Max Length (--max_length) defines the maximum sequence length the model processes during training - input plus output combined, in tokens. This parameter has a disproportionate impact on resource consumption: attention relates every token to every other token, so its cost grows quadratically with sequence length (Flash Attention brings the memory term back down to linear, but longer sequences remain substantially more expensive). At 2048 tokens, most conversational and instructional datasets are covered comfortably. If your dataset contains long documents or extended dialogues, you might need to go higher - but doubling the sequence length can more than double your VRAM requirement.
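Before committing to a value, measure your actual data. A quick sketch, assuming the Qwen3-0.6B download and the dataset.json from earlier:
# How long are the samples, in tokens, once the chat template is applied?
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model/Qwen3-0.6B")
with open("dataset.json") as f:
    data = json.load(f)

lengths = [len(tokenizer.apply_chat_template(e["messages"], tokenize=True))
           for e in data]
# Samples longer than --max_length get truncated or dropped, depending on the
# framework's truncation setting - so the max here is the number that matters.
print(f"max: {max(lengths)}  mean: {sum(lengths) / len(lengths):.0f}")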
Batch Size and Gradient Accumulation work together. --per_device_train_batch_size 1 with --gradient_accumulation_steps 12 is functionally equivalent to a batch size of 12, but only keeps 1 sample in memory at a time. Useful for training on consumer GPUs where an actual batch size of 12 wouldn’t fit in VRAM. Larger effective batch sizes generally produce more stable gradients — 12 is a reasonable default, go higher for larger models if VRAM allows.
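The arithmetic, spelled out (the 500-sample dataset size is a hypothetical figure for illustration):
# Effective batch size and optimizer-step count - pure arithmetic.
per_device_batch = 1
grad_accum = 12
dataset_size = 500                                   # hypothetical
epochs = 5

effective_batch = per_device_batch * grad_accum      # 12 (times GPU count if multi-GPU)
steps_per_epoch = dataset_size // effective_batch    # 41
total_steps = steps_per_epoch * epochs               # 205 optimizer steps
print(effective_batch, steps_per_epoch, total_steps)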
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. Maximum expressivity, maximum memory requirements.
swift sft \
--template qwen3_nothinking \
--model ./model/Qwen3-0.6B \
--dataset ./dataset.json \
--train_type full \
--optim adamw_8bit \
--torch_dtype bfloat16 \
--num_train_epochs 5 \
--warmup_ratio 0.05 \
--learning_rate 6e-5 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 12 \
--logging_steps 10 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--max_length 2048 \
--attn_impl flash_attn \
--weight_decay 0.01 \
--output_dir ./output
For a 0.6B model this is feasible on most modern GPUs. For anything above 3B, full fine-tuning starts requiring serious VRAM - which is where LoRA comes in.
LoRA
LoRA (Low-Rank Adaptation) doesn’t update the original weights directly. Instead it injects small trainable matrices alongside the existing ones and only trains those. The result: a fraction of the parameters, a fraction of the memory, surprisingly close results.
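The mechanics fit in a few lines of torch - a conceptual sketch of the idea, not ms-swift’s actual implementation:
# Conceptual LoRA forward pass: y = x @ W.T + (alpha/r) * x @ A.T @ B.T
# W stays frozen; only the small matrices A and B receive gradients.
import torch

d_in, d_out, r, alpha = 1024, 1024, 16, 64

W = torch.randn(d_out, d_in)                  # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01               # trainable, small random init
B = torch.zeros(d_out, r)                     # trainable, zero init - adapter starts as a no-op

x = torch.randn(2, d_in)
y = x @ W.T + (alpha / r) * (x @ A.T @ B.T)   # base output plus low-rank correction
print(y.shape)                                # torch.Size([2, 1024])
Merging (covered later) folds (alpha / r) · B · A into W once, which is why a merged model has zero inference overhead.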
swift sft \
--template qwen3_nothinking \
--model ./model/Qwen3-0.6B \
--dataset ./dataset.json \
--train_type lora \
--torch_dtype bfloat16 \
--num_train_epochs 5 \
--warmup_ratio 0.05 \
--learning_rate 1e-4 \
--lora_rank 16 \
--lora_alpha 64 \
--target_modules all-linear \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 12 \
--logging_steps 10 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--max_length 2048 \
--attn_impl flash_attn \
--weight_decay 0.01 \
--output_dir ./output
--lora_rank controls the dimensionality of the adapter matrices — higher rank means more expressive adapters but more parameters. 16 is a solid default for small models; 32 makes sense for more complex behavioral changes. --lora_alpha scales the adapter’s contribution to the output - the ratio of alpha to rank (here 4:1) is what matters, not the absolute values. At rank 8 or below on a small model, the adapter’s capacity is often too limited to produce meaningful behavioral change.
For very constrained hardware, two additional flags enable 4-bit quantization of the base model weights during training:
--quant_method bnb
--quant_bits 4
Scaling to Larger Models
The commands above work verbatim for larger models - swap ./model/Qwen3-0.6B for whatever you’ve downloaded. What needs adjustment:
Model Size    Recommended train_type   Learning Rate   Notes
0.6B – 1.5B   full or lora             5e-5 – 1e-4     Fits on consumer GPU
3B – 7B       lora                     1e-5 – 5e-5     Full requires 40GB+ VRAM
14B+          lora + 4bit              1e-5 – 2e-5     QLoRA territory
The other parameter worth adjusting at scale is gradient_accumulation_steps — larger models benefit from larger effective batch sizes, so increasing this compensates for the smaller per-device batch you’re forced into by VRAM constraints.
P.S.: If you want to train models other than Qwen3, you can find the right --template here: https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
Merging the LoRA Adapter
After LoRA training, the output is an adapter — a small set of weight deltas, not a standalone model. Before converting to GGUF, merge it back into the base:
swift export \
--adapters output/vx-xxx/checkpoint-xxx \
--merge_lora true \
--output_dir ./output/merged
ms-swift reads training configuration automatically from the checkpoint directory, so --model doesn’t need to be specified explicitly. After full fine-tuning, this step is unnecessary - the output is already a complete model.
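If you prefer doing the merge in Python, the peft equivalent looks roughly like this - a sketch assuming the checkpoint directory contains a standard peft adapter (the format ms-swift uses for LoRA), with the same placeholder path as above:
# Sketch: merge a LoRA adapter into the base weights with peft.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "./model/Qwen3-0.6B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "output/vx-xxx/checkpoint-xxx")
merged = model.merge_and_unload()   # folds (alpha/r) * BA into the base weights

merged.save_pretrained("./output/merged")
AutoTokenizer.from_pretrained("./model/Qwen3-0.6B").save_pretrained("./output/merged")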
Converting to GGUF
Download a specific llama.cpp release - source and binary must match, since convert_hf_to_gguf.py comes from the source and llama-quantize from the binary:
# Download source and prebuilt binary for the same commit
# Example: https://github.com/ggml-org/llama.cpp/tree/b8994
# Binary: https://github.com/ggml-org/llama.cpp/releases/download/b8994/llama-b8994-bin-ubuntu-vulkan-x64.tar.gz
pip install mistral_common # required dependency for the convert script
Convert to GGUF (f16 as intermediate format):
python ./llama.cpp/convert_hf_to_gguf.py ./output/merged \
  --outtype f16 \
  --outfile ./your_model.gguf
Then quantize. This is the step that makes the file practical for local inference on modest hardware:
./llama.cpp/llama-quantize ./your_model.gguf ./your_model_Q4_K_M.gguf Q4_K_M
Q4_K_M is a 4-bit quantization format that preserves most of the model’s capability while reducing file size by roughly 75% compared to f16. It’s the standard choice for local deployment. Other options like Q5_K_M or Q8_0 trade size for quality - adjust based on your inference hardware.
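A back-of-envelope check on those numbers - the ~4.85 bits-per-weight average for Q4_K_M is an approximation, and real files vary because some tensors stay at higher precision:
# Rough GGUF size estimate for a 0.6B-parameter model - pure arithmetic.
params = 0.6e9
f16_gb = params * 16 / 8 / 1e9       # 16 bits per weight -> ~1.2 GB
q4_km_gb = params * 4.85 / 8 / 1e9   # ~4.85 bits per weight -> ~0.36 GB
print(f"f16:    {f16_gb:.2f} GB")
print(f"Q4_K_M: {q4_km_gb:.2f} GB ({1 - q4_km_gb / f16_gb:.0%} smaller)")
The naive 4/16 ratio is where the “roughly 75%” figure comes from; K-quants spend a few extra bits on scales and keep select tensors at higher precision, so real files land slightly above that estimate.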
The resulting .gguf file loads directly into LM Studio, Ollama, or any llama.cpp-based inference server.
What You Now Have
A complete pipeline from base model weights to a quantized, locally runnable file - with every parameter exposed and explained. The 0.6B example is deliberately small: fast iteration, immediate feedback, low cost for experimentation. The same pipeline runs on a 70B model. The numbers scale. The logic doesn’t change.


