Real-world breakthrough · February 2023 onwards
In February 2023, Meta released LLaMA (Large Language Model Meta AI), a family of open-weight language models ranging from 7 billion to 65 billion parameters. Unlike GPT-4 or Claude, the weights themselves were available: Meta distributed them to researchers under a non-commercial licence, and within a week they had leaked onto the open internet. Anyone could download and run them.
Within weeks, the open-source community demonstrated something remarkable. Stanford researchers released Alpaca, a fine-tuned version of LLaMA-7B trained on 52,000 instruction-following examples generated with text-davinci-003, a GPT-3.5-series model. The fine-tuning cost roughly $600 in cloud compute. Vicuna followed, trained on 70,000 user conversations from ShareGPT, and scored roughly 90% of ChatGPT's quality in GPT-4-judged evaluations. WizardLM demonstrated that evolving simple instructions into complex ones dramatically improved instruction-following ability.
The explosion happened because of two techniques: LoRA (Low-Rank Adaptation), published by Hu et al. in 2021, and QLoRA (Quantised LoRA), published by Dettmers et al. in May 2023. These reduced the memory required to fine-tune a 7-billion-parameter model from over 100 GB to under 8 GB. A single consumer GPU could now do in hours what previously required a data centre.
If a PhD student with a single GPU and a weekend can fine-tune a capable language model, what does that mean for the cost and accessibility of custom AI?
The Security and Ethics stage ensured your agents are safe and responsible. This stage pushes capability further. Fine-tuning lets you specialise a model for your domain using your data, without sending that data to a third-party API.
With the learning outcomes established, this module begins with an in-depth look at when prompting is not enough.
Prompt engineering is fast, cheap, and reversible. For most tasks it is the right starting point. Fine-tuning is slower, more expensive, and harder to reverse. It addresses the specific problems prompting cannot solve: consistent output formats without multi-hundred-token instructions, specialised domain knowledge not present in the base model, dramatically reduced inference costs at high volume, and fully local deployments with no API dependency.
The key word is "cannot." If a well-crafted prompt reliably produces the output you need, fine-tuning adds cost and complexity without benefit. Exhaust prompting first. Fine-tuning is the answer when you have tried and the model demonstrably fails on your specific task distribution.
The decision between fine-tuning and prompting is primarily economic: compare the one-time cost of fine-tuning and evaluation against the ongoing cost of larger prompts and more capable models at your request volume.
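To make that comparison concrete, here is a minimal break-even sketch. Every price and token count in it is an illustrative assumption, not a quoted rate:

```python
# Illustrative break-even sketch: one-time fine-tuning cost versus the
# ongoing saving from replacing a long prompt on a large model with a
# short prompt on a fine-tuned small model. Every number is an assumed
# placeholder, not a quoted price.
requests_per_day = 50_000
large_prompt_tokens, large_price_per_mtok = 2_000, 5.00
small_prompt_tokens, small_price_per_mtok = 200, 0.20
finetune_cost = 600.0  # one-time training and evaluation spend

daily_saving = requests_per_day * (
    large_prompt_tokens * large_price_per_mtok
    - small_prompt_tokens * small_price_per_mtok
) / 1e6  # prices are per million tokens
print(f"Break-even after {finetune_cost / daily_saving:.1f} days")
```

Under these assumptions the fine-tune pays for itself in about a day; at a few hundred requests per day, the same arithmetic can push break-even out by years, which is why volume dominates the decision.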
With an understanding of when prompting is not enough in place, the discussion can now turn to full fine-tuning versus parameter-efficient fine-tuning, which builds directly on these foundations.
Full fine-tuning updates every weight in the model. A 7-billion-parameter model requires approximately 28 GB of GPU memory to store weights at FP32 (32-bit floating point) precision, plus additional memory for activations, gradients, and optimiser states. Total memory requirement: over 100 GB. This demands expensive multi-GPU clusters.
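A back-of-envelope calculation shows where that figure comes from, assuming FP32 throughout and the Adam optimiser (one gradient plus two moment estimates per parameter):

```python
# Back-of-envelope memory budget for full fine-tuning of a 7B model with
# Adam: weights, gradients, and two optimiser moments, all in FP32
# (4 bytes each). Activations are excluded, so this is a lower bound.
params = 7e9
bytes_per_param = 4 + 4 + 4 + 4   # weights + gradients + Adam m + Adam v
print(f"{params * bytes_per_param / 1e9:.0f} GB before activations")  # 112 GB
```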
PEFT (Parameter-Efficient Fine-Tuning) methods update only a small fraction of weights while keeping the vast majority frozen. The most widely used PEFT technique is LoRA (Low-Rank Adaptation). Instead of updating the full weight matrix W, LoRA inserts two small trainable matrices A and B alongside it. During the forward pass, the effective weight becomes W + AB. Only A and B are updated during training.
At rank r=8 with adapters on the attention query and value projections, the number of trainable parameters drops by roughly three orders of magnitude compared to full fine-tuning. For a 7B model: instead of 7 billion trainable parameters, you update roughly 4 million, about 0.06% of the total. Memory requirement: around 16 GB with standard LoRA, and under 8 GB with QLoRA.
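The rough arithmetic behind the 4-million figure, assuming a LLaMA-7B-style architecture (hidden size 4096, 32 layers) with adapters on the query and value projections only:

```python
# Rough count of LoRA trainable parameters, assuming a LLaMA-7B-style
# architecture (hidden size 4096, 32 layers) with rank-8 adapters on the
# query and value projections only.
hidden, layers, rank = 4096, 32, 8
per_matrix = rank * (hidden + hidden)   # one (d x r) and one (r x d) matrix
trainable = per_matrix * 2 * layers     # q_proj and v_proj in every layer
print(f"trainable params: {trainable:,} ({trainable / 7e9:.3%} of 7B)")
# -> trainable params: 4,194,304 (0.060% of 7B)
```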
“We propose Low-Rank Adaptation, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.”
Hu et al., 2021 - arXiv:2106.09685, LoRA: Low-Rank Adaptation of Large Language Models
The key insight is that weight updates during fine-tuning have an intrinsically low rank. Rather than updating the full weight matrix, you can represent the update as a product of two small matrices. The model learns the same adaptation with a fraction of the parameters.
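A minimal LoRA layer makes the mechanism concrete. This is an illustrative sketch, not the peft library's implementation; it freezes the base weights and trains only the low-rank pair:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-adapted linear layer (a sketch, not the peft
    implementation). The frozen weight W is augmented with a trainable
    low-rank update AB, scaled by alpha / r as in the paper."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        # A starts at zero so training begins from the unmodified base model.
        self.A = nn.Parameter(torch.zeros(base.out_features, r))
        self.B = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + scale * (A @ B), computed without ever
        # materialising the full-size update matrix.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

lora = LoRALinear(nn.Linear(4096, 4096), r=8)
y = lora(torch.randn(2, 4096))   # shape (2, 4096)
```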
QLoRA (Quantised LoRA), published by Dettmers et al. in 2023, adds one more step: it compresses the frozen base model weights to 4-bit precision before training. This reduces memory by roughly 4x compared to standard LoRA. The trainable adapter matrices still operate in higher precision (bfloat16), so training quality is maintained. QLoRA makes fine-tuning a 7B model possible on a single consumer GPU such as the 24 GB NVIDIA RTX 3090 or 4090, or for free on Google Colab with a 16 GB T4.
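In code, 4-bit loading corresponds to the quantisation configuration in the transformers/bitsandbytes integration. A minimal sketch; the model name below is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and compute stay in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
```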
With an understanding of full fine-tuning versus parameter-efficient fine-tuning in place, the discussion can now turn to dataset preparation, which builds directly on these foundations.
Dataset quality determines fine-tuning quality more than any hyperparameter or architecture choice. The model can only learn patterns present in the training data. Garbage in, garbage out applies with particular force to fine-tuning, because the model will faithfully reproduce not just the style but the errors and inconsistencies in your examples.
The standard format for instruction fine-tuning gives each example three fields: an instruction, an optional input, and the desired output. The model learns to follow the instruction pattern rather than the specific content. Training on 500 diverse, high-quality examples consistently outperforms training on 5,000 low-quality ones.
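A single example in the common Alpaca-style format might look like this (the field names follow convention; the ticket text is invented for illustration):

```python
import json

# One Alpaca-style instruction example. Instruction datasets are commonly
# stored as JSON Lines, one example per line; the ticket text here is
# invented for illustration.
example = {
    "instruction": "Extract the issue category and urgency from the ticket.",
    "input": "My card was charged twice for the same order. Please refund ASAP.",
    "output": '{"category": "billing", "urgency": "high"}',
}
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```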
Before training, audit your dataset for these critical properties. Every example should have an instruction that is specific and consistent in format. Outputs must be correct and ideally reviewed by a domain expert. No example should contain personally identifiable information (PII). Edge cases and failure modes should be explicitly represented, not just easy, clean examples. Split the dataset into training, validation, and test sets at an 80/10/10 ratio minimum.
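A deterministic split keeps evaluation comparable across runs. A minimal sketch, assuming the JSONL file written above:

```python
import json
import random

# Deterministic 80/10/10 split, assuming the JSONL file written above.
with open("dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(examples)
n = len(examples)
train = examples[: int(0.8 * n)]
val = examples[int(0.8 * n): int(0.9 * n)]
test = examples[int(0.9 * n):]
```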
Common misconception
“Using a more powerful model to generate training data automatically produces a high-quality dataset.”
Generating training data via distillation from a stronger model (for example, using GPT-4o to create examples for training a 7B model) is a legitimate and often effective technique. However, generated data contains errors. If you do not review a sample before training, the fine-tuned model learns those errors confidently. Always sample and human-review generated training data. Automated generation accelerates data collection; it does not replace quality control.
With an understanding of dataset preparation in place, the discussion can now turn to running a QLoRA fine-tuning job, which builds directly on these foundations.
The Hugging Face ecosystem provides the standard Python libraries for LoRA and QLoRA fine-tuning. You need four packages: transformers (model loading and tokenisation), peft (the PEFT library, which implements LoRA), trl (the TRL library, which provides SFTTrainer for supervised fine-tuning), and bitsandbytes (4-bit quantisation for QLoRA).
The training configuration involves three key choices. First, the LoRA rank (r): higher rank means more trainable parameters and potentially higher quality, but more memory and slower training. Rank 8 to 16 is typical for most tasks. Second, the target modules: which transformer layers to add adapters to. The query and value projection layers (q_proj, v_proj) are standard choices. Third, the learning rate: 2e-4 is a common starting point for LoRA fine-tuning.
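Putting these choices together, a minimal QLoRA training setup might look like the sketch below. It assumes model was loaded in 4-bit as in the earlier snippet and that train_ds and val_ds are Hugging Face Dataset objects; the exact SFTTrainer interface varies between trl versions:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_config = LoraConfig(
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor, applied as alpha / r
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    output_dir="qlora-out",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)
trainer = SFTTrainer(
    model=model,                # the 4-bit model loaded in the earlier sketch
    args=training_args,
    train_dataset=train_ds,     # assumed Hugging Face Dataset objects
    eval_dataset=val_ds,
    peft_config=lora_config,    # SFTTrainer wraps the model with LoRA adapters
)
trainer.train()
```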
After training, evaluate on your held-out test set using task-specific metrics. Training loss decreasing is necessary but not sufficient. For extraction tasks, measure precision, recall, and F1 score. For generation, use ROUGE-L (overlap with reference outputs) and human evaluation. For JSON output tasks, measure schema validity rate and field-level accuracy. Compare the fine-tuned model against the base model and against prompting on the same test set.
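For the JSON output case, a schema-validity check takes only a few lines. The outputs list below is a stand-in for the fine-tuned model's raw completions on the test set:

```python
import json

# Sketch of a schema-validity check for a JSON extraction task: what share
# of outputs parse as JSON and contain the required fields? The `outputs`
# list is a stand-in for the fine-tuned model's completions on the test set.
outputs = ['{"category": "billing", "urgency": "high"}', "not json at all"]
REQUIRED = {"category", "urgency"}

def is_valid(raw: str) -> bool:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED <= parsed.keys()

validity_rate = sum(map(is_valid, outputs)) / len(outputs)
print(f"Schema validity: {validity_rate:.1%}")   # -> 50.0%
```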
“QLoRA uses 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, and double quantization to reduce the average memory footprint.”
Dettmers et al., 2023 - arXiv:2305.14314, QLoRA: Efficient Finetuning of Quantized LLMs
The NF4 data type exploits the observation that pre-trained model weights are approximately normally distributed. By quantising to only 4 bits using a distribution-aware scheme, QLoRA minimises the quality loss from quantisation. Double quantisation further reduces memory by quantising the quantisation constants themselves.
With an understanding of running a QLoRA fine-tuning job in place, the discussion can now turn to fine-tuning versus prompting: the decision framework, which builds directly on these foundations.
Use prompting when your volume is low to medium, the base model can produce acceptable outputs with detailed instructions, and you need results immediately. Choose fine-tuning when you face high request volume (10,000 or more calls per day), the model fails consistently even with careful prompting, you need exact and consistent output formats, you need lower inference latency, or data privacy requirements prevent use of external APIs.
One common confusion: fine-tuning is not a cure for hallucination. A fine-tuned model can produce confident, well-formatted, completely incorrect output. Fine-tuning teaches style, format, and domain conventions. For factual accuracy, combine fine-tuning with retrieval-augmented generation (RAG), which provides the model with retrieved evidence at inference time rather than baking facts into weights.
The EU AI Act imposes transparency and documentation obligations on providers of general-purpose AI (GPAI) models under Article 53, with additional obligations under Article 55 for models posing systemic risk. If you fine-tune an open model and offer it to third parties, assess whether these obligations apply. Hugging Face's model card guidelines call for documenting training data, intended use, and known limitations before publishing a fine-tuned model to the Hub.
Common misconception
“Fine-tuning makes a model more accurate and trustworthy by teaching it facts.”
Fine-tuning adjusts style, format, and behavioural patterns. It does not give a model new factual knowledge in a reliable, retrievable way. Weights encode statistical associations, not structured facts. A fine-tuned model will confidently generate outputs that look like your training examples, including plausible-sounding but fabricated ones. Retrieval-augmented generation (RAG) is the correct tool for factual grounding.
Your company processes 50,000 support tickets per day through an LLM that extracts issue category and urgency using a 2,000-token prompt with GPT-4o. You want to reduce costs by 90%. Which approach is most justified?
What does 'trainable params: 0.056%' mean in a QLoRA fine-tuning run output?
Your fine-tuned extraction model achieves 92% accuracy on held-out test cases. You want to reach 95%. Which intervention is most likely to be effective?
A colleague says your fine-tuned customer support model is now 'more factually reliable' because it was trained on verified company documentation. What is the most accurate response?
Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models
arXiv:2106.09685
The original LoRA paper. Quoted in Section 19.2 to explain the low-rank decomposition mechanism and why it achieves comparable quality to full fine-tuning.
Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs
arXiv:2305.14314
The QLoRA paper. Quoted in Section 19.4 to explain NF4 quantisation and why it enables fine-tuning on consumer hardware.
github.com/huggingface/peft
The standard Python library implementing LoRA, QLoRA, and other PEFT methods. The SFTTrainer and LoraConfig APIs referenced in this module come from this library.
Taori, R. et al. (2023). Stanford Alpaca: An Instruction-following LLaMA model
crfm.stanford.edu/2023/03/13/alpaca.html
The Alpaca model demonstrating that instruction fine-tuning a 7B open-source model on 52,000 examples produces competitive quality. Referenced in the opening case study.
European Union (2024). EU AI Act, Articles 53–55: General-purpose AI model obligations
Establishes transparency, documentation, and copyright obligations for providers of GPAI models. Referenced in Section 19.5 in the context of publishing fine-tuned models to third parties.
Module 19 of 25 · Advanced Mastery