This is the second of 8 Practice & Strategy modules. Module 17 gave you the architecture of a production ML system. Now the question is economic: these systems consume GPU hours, memory, bandwidth, and electricity. Without deliberate optimisation, inference costs scale linearly with traffic and training costs scale with model size. This module teaches you the techniques that break that linearity.

Open-weight revolution · 2023-2024
In February 2023, Meta released Llama, a family of open-weight large language models ranging from 7 billion to 65 billion parameters. The research community quickly discovered that Llama 7B, when fine-tuned, could match the performance of models ten times its size on many benchmarks. Within months, quantised versions were running on consumer laptops.
By mid-2024, Llama 3 had pushed the envelope further. The 70B variant rivalled GPT-4 on several benchmarks, and the community had produced 4-bit quantised versions that ran on a single A100 GPU instead of the eight that the full-precision model required. Knowledge distillation techniques allowed teams to train smaller models that captured most of the large model's capability at a fraction of the serving cost.
The Llama story demonstrates a fundamental principle of ML economics: the cost of intelligence is not fixed. It can be compressed through distillation, quantised to lower precision, pruned of redundant parameters, and deployed at the edge. The question is not whether you can afford modern AI, but which combination of these techniques gives you the performance you need at a cost you can sustain.
If the best AI models cost millions to train and thousands per day to serve, how does anyone outside big tech afford to use them?
Meta's Llama release catalysed an explosion of efficiency research. This module walks through the core techniques, from GPU memory fundamentals through quantisation and distillation, so you can make informed decisions about how to deploy models without bankrupting your organisation.
If you have already optimised model serving in production, use the knowledge checks to confirm your understanding and skip to Module 19: Evaluation at scale.
With the learning outcomes established, this module begins by examining GPU memory and compute fundamentals in depth.
Every model parameter occupies memory. In full precision (FP32), each parameter requires 4 bytes. A 7-billion-parameter model therefore requires approximately 28 GB just for its weights, before accounting for activations, gradients, and optimizer state during training. An NVIDIA A100 GPU has 80 GB of high-bandwidth memory (HBM). A consumer RTX 4090 has 24 GB. The arithmetic determines what you can run and where.
During training, memory consumption is roughly 4x the model size because you store the weights, gradients, optimizer states (which for Adam are two additional copies), and activations. A 7B parameter model at FP32 requires approximately 112 GB for training, which exceeds a single A100. This is why distributed training across multiple GPUs is standard for large models.
During inference, memory requirements drop dramatically because you only need the weights and a small amount of activation memory. A 7B model at FP16 (2 bytes per parameter) requires approximately 14 GB, which fits on a single consumer GPU. This asymmetry between training and inference memory is the foundation of the cost optimisation strategies that follow.
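To make these figures easy to reproduce, here is a back-of-the-envelope sketch in Python. It is a simplification assumed for illustration: it counts only weights (plus gradients and the two Adam moment buffers for training) and ignores activations, KV cache, and framework overhead.

```python
def estimate_memory_gb(n_params: float, bytes_per_param: int, training: bool = False) -> float:
    """Rough memory estimate for a dense model: weights only for inference,
    weights + gradients + two Adam moment buffers (~4x) for training."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb * 4 if training else weights_gb

print(estimate_memory_gb(7e9, 4))                  # ~28 GB: 7B weights at FP32
print(estimate_memory_gb(7e9, 4, training=True))   # ~112 GB: FP32 training with Adam
print(estimate_memory_gb(7e9, 2))                  # ~14 GB: 7B weights at FP16 for inference
```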
With an understanding of GPU memory and compute fundamentals in place, the discussion can now turn to quantisation: trading precision for efficiency, which builds directly on these foundations.
Quantisation reduces the numerical precision of model weights from 32-bit floating point (FP32) to lower bit widths: FP16 (half precision), INT8 (8-bit integer), or INT4 (4-bit integer). Each reduction halves or quarters the memory footprint and increases throughput because lower-precision arithmetic is faster on modern GPU hardware.
Post-training quantisation (PTQ) applies quantisation to an already-trained model without retraining. The simplest approach rounds each weight to the nearest representable value at the target precision. More sophisticated methods like GPTQ and AWQ analyse the model's weight distributions to minimise the accuracy loss from rounding.
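To make the "round to the nearest representable value" step concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantisation. It is an illustration only; GPTQ and AWQ go further with per-group scales and calibration data.

```python
import numpy as np

def quantise_int8(weights: np.ndarray) -> tuple[np.ndarray, np.floating]:
    """Symmetric round-to-nearest quantisation: map the largest magnitude to 127."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: np.floating) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one FP32 weight matrix
q, scale = quantise_int8(w)
print(w.nbytes / 1e6, q.nbytes / 1e6)                 # ~67 MB down to ~17 MB
print(np.abs(w - dequantise(q, scale)).mean())        # mean rounding error stays small
```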
Quantisation-aware training (QAT) simulates quantisation during training, allowing the model to adapt its weights to the reduced precision. QAT typically preserves more accuracy than PTQ but requires retraining.
The practical impact is dramatic. Llama 2 70B at FP16 requires two A100 GPUs (140 GB). At INT4 with GPTQ, the same model fits on a single A100 (roughly 35 GB) with minimal accuracy loss on most benchmarks. For many production use cases, INT8 quantisation is the sweet spot: negligible accuracy loss with a 2x memory reduction and significant throughput improvement.
“4-bit quantisation of LLMs can achieve near-lossless compression, enabling models that previously required multiple GPUs to run on a single device.”
Frantar, E. et al., 'GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers' (2023) - Section 5: Experimental Results
GPTQ demonstrated that large language models could be quantised to 4 bits with minimal perplexity degradation, fundamentally changing the economics of LLM inference and enabling consumer-hardware deployment.
Common misconception
“Quantisation always ruins model quality.”
Modern quantisation methods like GPTQ, AWQ, and QAT preserve nearly all model capability at INT8 and most at INT4. The accuracy loss is typically 1-3% on standard benchmarks, which is often acceptable given the 2-4x reduction in memory and cost. The key is using calibration data that represents your actual use case, not just generic benchmarks. For some tasks (simple classification, retrieval), INT4 is indistinguishable from FP16. For others (complex reasoning, code generation), INT8 is the safer minimum.
With an understanding of quantisation: trading precision for efficiency in place, the discussion can now turn to knowledge distillation: compressing intelligence, which builds directly on these foundations.
Knowledge distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model. Instead of training the student on the original labels alone, you train it on the teacher's output probabilities (soft labels). These soft labels carry richer information than hard labels: they encode the teacher's uncertainty and the relationships between classes.
The technique was formalised by Hinton, Vinyals, and Dean in 2015. The core insight is that the teacher's probability distribution carries "dark knowledge" (which classes the teacher considers similar, what it finds ambiguous) that hard labels discard. A student trained on soft labels often outperforms an identically sized model trained on hard labels alone.
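In code, the distillation objective blends a soft-label term with the usual hard-label cross-entropy. The PyTorch sketch below follows the Hinton et al. formulation; the temperature and mixing weight shown are illustrative defaults, not values prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-label KL term (teacher imitation) with hard-label cross-entropy."""
    # Temperature T softens both distributions so the teacher's "dark knowledge"
    # (relative probabilities of the non-target classes) is visible to the student.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 8 examples over 10 classes
student_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```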
In the LLM era, distillation has become the primary method for creating smaller, deployable models. OpenAI's GPT-4 is widely believed to have been distilled into smaller models for specific tasks. The Alpaca and Vicuna projects demonstrated that fine-tuning a small open model on outputs generated by a larger model could produce surprisingly capable results. This approach, sometimes called "distillation through imitation," is now standard practice.
With an understanding of knowledge distillation: compressing intelligence in place, the discussion can now turn to pruning: removing what the model does not need, which builds directly on these foundations.
Pruning removes weights, neurons, or entire layers from a trained model that contribute little to its output. The lottery ticket hypothesis (Frankle and Carbin, 2019) showed that large neural networks contain sparse subnetworks that, when trained in isolation, match the full network's performance. This suggests that most parameters in a trained model are redundant.
Unstructured pruning sets individual weights to zero based on their magnitude. It can achieve high sparsity (90%+ zeros) but requires specialised hardware or software to exploit the sparsity for actual speedup. Standard GPU matrix multiplication does not benefit from scattered zeros.
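A minimal sketch of magnitude-based unstructured pruning in NumPy; as noted above, the resulting zeros save memory and compute only if a sparsity-aware format or kernel exploits them.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that `sparsity` of entries become zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

w = np.random.randn(1024, 1024).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.9)
print((pruned == 0).mean())   # ~0.9, but the dense array occupies the same memory as before
```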
Structured pruning removes entire neurons, attention heads, or layers. It produces models that are genuinely smaller and faster on standard hardware because the pruned dimensions are physically removed. The trade-off is that structured pruning is more aggressive and typically requires fine-tuning after pruning to recover accuracy.
In practice, quantisation and distillation are more widely used than pruning for LLMs. Pruning shines for convolutional neural networks and smaller models where structured removal of filters or channels produces immediate speedup on edge devices.
With an understanding of pruning: removing what the model does not need in place, the discussion can now turn to edge deployment: bringing inference to the device, which builds directly on these foundations.
Edge deployment runs inference on the end-user's device (phone, browser, IoT sensor) rather than in a data centre. The benefits are compelling: zero network latency, no per-request API cost, and data never leaves the device (important for privacy). The constraint is compute: a mobile phone has far less memory and processing power than a cloud GPU.
Frameworks like TensorFlow Lite, ONNX Runtime, and Core ML (Apple) optimise models for edge hardware. The techniques from this module (quantisation to INT8, pruning, distillation) are the prerequisites for edge deployment. A 7B parameter model will not run on a phone; a 1B distilled, INT4-quantised variant might.
The decision between cloud and edge is not binary. Many production systems use a hybrid approach: a small, fast model runs on the device for common cases, and complex queries are forwarded to a larger cloud model. This pattern, often called model cascading or routing, reduces average latency and cost while maintaining quality for difficult inputs. (A related idea in the LLM context, speculative decoding, uses a small draft model to propose tokens that a larger model then verifies in parallel.)
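A sketch of the hybrid pattern is shown below. The LocalModel and CloudClient interfaces, their method names, and the confidence threshold are assumptions for illustration, not a real API.

```python
from typing import Protocol, Tuple

class LocalModel(Protocol):
    def generate_with_confidence(self, query: str) -> Tuple[str, float]: ...

class CloudClient(Protocol):
    def generate(self, query: str) -> str: ...

CONFIDENCE_THRESHOLD = 0.8   # illustrative value; tune against your own traffic

def answer(query: str, local_model: LocalModel, cloud_client: CloudClient) -> str:
    """Answer on-device when the small model is confident; otherwise escalate."""
    response, confidence = local_model.generate_with_confidence(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return response                    # zero network latency, no per-request API cost
    return cloud_client.generate(query)    # harder query: pay for the larger cloud model
```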
With an understanding of edge deployment: bringing inference to the device in place, the discussion can now turn to inference cost at scale: the economics of serving, which builds directly on these foundations.
Training a large model is expensive, but it happens once (or on a schedule). Inference happens for every user request, every second of every day. For most production ML systems, inference cost dominates the total cost of ownership. OpenAI reportedly spends more on inference infrastructure than on training.
The key levers for reducing inference cost are: batching (processing multiple requests simultaneously to maximise GPU utilisation), caching (storing and reusing predictions for repeated inputs), model selection (routing easy queries to small models and hard queries to large ones), and the techniques from this module (quantisation, distillation, pruning).
A practical framework for inference cost estimation: cost per request equals (GPU cost per second) times (latency per request) divided by (batch size). An A100 at $2/hour (about $0.00056 per second) serving a 7B INT8 model that takes roughly 5 seconds to generate a full response, with 32 requests batched together, costs approximately $0.0001 per request, or $0.10 per thousand requests. The same model at FP16 with batch size 1 and comparable latency costs roughly $0.003 per request, a 30x difference. Every optimisation technique in this module directly reduces this number.
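The same estimate in code, using the illustrative figures from the paragraph above:

```python
def cost_per_request(gpu_dollars_per_hour: float, latency_seconds: float, batch_size: int) -> float:
    """Cost per request = (GPU cost per second) x (latency per request) / (batch size)."""
    return (gpu_dollars_per_hour / 3600.0) * latency_seconds / batch_size

print(cost_per_request(2.0, 5.0, batch_size=32))   # ~$0.0001 per request (INT8, batched)
print(cost_per_request(2.0, 5.0, batch_size=1))    # ~$0.003 per request (FP16, unbatched)
```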
“The cost of inference, not training, determines the economic viability of deploying large language models in production.”
Patterson, D. et al., 'Carbon Emissions and Large Neural Network Training', arXiv (2021) - Section 4: Inference Costs
This Google Research paper quantified the energy and cost asymmetry between training (one-time, amortised) and inference (continuous, scaling with traffic). It established that inference optimisation is the primary lever for making large models economically viable.
A 13-billion-parameter model stored in FP16 (2 bytes per parameter) needs to run inference on a single GPU with 24 GB of memory. Will it fit?
A team wants to deploy a language model for a customer support chatbot. They have two options: (A) serve GPT-4 via API at $0.03 per 1K tokens, or (B) fine-tune and quantise Llama 3 8B to INT4 and serve it on their own A100 at $2/hour. They expect 500K requests per day averaging 500 tokens each. Which is more cost-effective?
Which model compression technique is most appropriate for deploying a computer vision model on a mobile phone with 4 GB of RAM?
Frantar, E. et al., 'GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers' (2023)
Full paper
Introduced the GPTQ algorithm for 4-bit quantisation of LLMs. Demonstrated that models could be compressed to 25% of their original size with minimal accuracy loss, enabling consumer-hardware deployment.
Hinton, G., Vinyals, O. & Dean, J., 'Distilling the Knowledge in a Neural Network' (2015)
Full paper
Formalised knowledge distillation and introduced the concept of soft labels carrying 'dark knowledge'. This paper established the theoretical and practical foundation for model compression through imitation.
Frankle, J. & Carbin, M., 'The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks' (2019)
Sections 2-4
Demonstrated that large neural networks contain sparse subnetworks that match full-network performance. This foundational result justifies pruning as a principled compression strategy, not just a heuristic.
Patterson, D. et al., 'Carbon Emissions and Large Neural Network Training', arXiv (2021)
Section 4
Quantified the energy and cost asymmetry between training and inference. Established that inference cost, not training cost, determines the economic viability of large model deployment.
Touvron, H. et al., 'Llama 2: Open Foundation and Fine-Tuned Chat Models', Meta AI (2023)
Sections 2-3
Describes the training methodology and scaling decisions behind the Llama 2 model family. The release catalysed open-weight LLM research and demonstrated that competitive models could be made publicly available.
You now understand how to make models smaller, faster, and cheaper without sacrificing the performance that matters. The next question is: how do you know whether a model is actually good enough? When models generate text, images, or code, traditional metrics like accuracy and F1 are insufficient. Module 19 covers human evaluation protocols, LLM-as-judge, red teaming, and the benchmark suites that the industry uses to compare models.
Module 18 of 24 · AI Practice & Strategy