This is the sixth of eight Practice & Strategy modules. You studied reinforcement learning and RLHF in Module 21. Now you examine the frontier: models that process multiple modalities simultaneously, reason through problems step by step, and exhibit capabilities that were not explicitly trained. Understanding these developments is essential for any practitioner making strategic decisions about AI adoption.

Paradigm shift · May 2024
On 13 May 2024, OpenAI demonstrated GPT-4o (the "o" stands for "omni"), a model that processes text, images, and audio as native inputs and outputs. Unlike previous systems that chained separate models together (a speech-to-text model feeding a language model feeding a text-to-speech model), GPT-4o processes all modalities in a single neural network.
The demonstration showed the model engaging in real-time conversation, reacting to the speaker's tone of voice, describing what it saw through a phone camera, and generating speech with emotional inflection. The average response latency was 320 milliseconds, comparable to human conversational response time.
This was not a research preview. It was deployed to hundreds of millions of users within weeks. The gap between research capability and production deployment had collapsed. For practitioners, this raised an immediate strategic question: how do you plan a product roadmap when the underlying capabilities are advancing this fast?
What changes when a single model processes text, images, and audio natively rather than through separate pipelines?
GPT-4o exemplifies three trends converging simultaneously: multimodal unification (one model for all data types), reasoning improvements (models that think through problems step by step), and rapid deployment at scale. This module examines each trend, the technical ideas behind them, and what they mean for practitioners making strategic decisions about AI adoption.
If multimodal architectures and chain-of-thought reasoning are already familiar, use the knowledge checks to confirm your understanding and skip to Module 23: AI safety and alignment.
The module begins with the first trend: multimodal models, which unify perception and language in a single architecture.
A multimodal model processes more than one type of input (text, images, audio, video) within a single architecture. Early approaches used separate encoder networks for each modality and fused the representations at a later stage. Modern multimodal models like GPT-4o, Gemini, and Claude 3 process different modalities through a unified transformer, learning cross-modal relationships during pre-training.
The technical insight is that different modalities can be represented as sequences of tokens. Images are divided into patches and encoded as visual tokens. Audio is converted to spectrograms and encoded as audio tokens. The transformer processes all token types in the same attention mechanism, allowing the model to learn relationships between what it sees, hears, and reads.
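To make the token-unification idea concrete, here is a minimal sketch of ViT-style patch embedding. The patch size, embedding dimension, and random projection matrices are illustrative stand-ins for learned components, not GPT-4o's actual implementation.

```python
# Minimal sketch: an image becomes a sequence of "visual tokens" that can
# sit in the same transformer sequence as embedded text tokens.
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # group each patch's pixels together
    return x.reshape(-1, patch * patch * C)  # (num_patches, patch*patch*C)

d_model = 512
image = np.random.rand(224, 224, 3)

# Random projections stand in for learned embedding weights.
visual_tokens = patchify(image) @ np.random.randn(16 * 16 * 3, d_model)  # (196, 512)
text_tokens = np.random.randn(12, d_model)  # 12 already-embedded text tokens

# One sequence, one attention mechanism: cross-modal relationships are
# learned the same way as relationships between words.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (208, 512)
```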
The practical impact is significant. A customer service system can process a screenshot of an error message, a voice description of the problem, and a text log simultaneously. A medical AI can analyse a scan, read the patient's notes, and hear the clinician's verbal observations in a single pass. The awkward pipeline of separate models with lossy handoffs is replaced by a unified understanding.
“By processing all input modalities through a single model, we can eliminate the information loss that occurs at the boundaries between specialised models in a pipeline.”
Gemini Team, Google, 'Gemini: A Family of Highly Capable Multimodal Models' (2023) - Section 2: Model Architecture
This design principle drove the shift from pipelined systems to natively multimodal models. Each handoff between specialised models loses nuance: tone of voice, spatial layout, temporal alignment. A unified model preserves these cross-modal relationships.
With multimodal perception covered, the discussion turns to the second trend: chain-of-thought, making models show their work.
Chain-of-thought (CoT) prompting is a technique where the model is encouraged to produce intermediate reasoning steps before arriving at an answer. Instead of jumping directly from question to answer, the model writes out its thinking: identifying relevant information, breaking the problem into sub-problems, working through each step, and synthesising a conclusion.
The technique was introduced by Wei et al. at Google in 2022, who showed that prompting a model with a few worked examples containing intermediate reasoning steps dramatically improved performance on arithmetic, commonsense reasoning, and symbolic tasks. A follow-up finding by Kojima et al. (2022) showed that even the zero-shot instruction "Let's think step by step" produced much of the same gain. The model already had the capability; it just needed to be encouraged to use it.
CoT works because it gives the model more computation per problem. A transformer produces one token at a time. Each token generation involves a forward pass through the entire network. When the model writes out ten steps of reasoning, it gets ten forward passes of computation, not one. The intermediate tokens serve as a form of working memory that the model can attend to when generating subsequent tokens.
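As an illustration, here is a minimal sketch of the two prompting styles; `generate` is a hypothetical placeholder for any text-completion API.

```python
# Sketch of direct vs chain-of-thought prompting. `generate` is a
# hypothetical stand-in; substitute any LLM client.

def generate(prompt: str) -> str:
    """Stand-in for a call to a text-completion endpoint."""
    raise NotImplementedError("plug in a real model client here")

question = (
    "A cafeteria had 23 apples. It used 20 for lunch and bought 6 more. "
    "How many apples does it have now?"
)

# Direct: the answer must be produced in effectively one shot.
direct_prompt = f"Q: {question}\nA: The answer is"

# Chain-of-thought: the instruction elicits intermediate reasoning tokens
# ("23 - 20 = 3, then 3 + 6 = 9"), each of which is a full forward pass
# the model can attend back to before committing to a final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```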
Common misconception
“Chain-of-thought means the model is actually reasoning like a human.”
Chain-of-thought is a computational strategy, not evidence of understanding. The model produces tokens that look like reasoning, and these tokens genuinely improve accuracy on many tasks. But the model does not have an internal experience of thinking. It is generating the most probable next token given the context, including its own previously generated reasoning tokens. Whether this constitutes reasoning in a philosophically meaningful sense is an open debate, but the practical improvement in task performance is well-documented.
Chain-of-thought laid the groundwork for the next development: o1/o3-style reasoning, which scales inference-time compute.
OpenAI's o1 model (September 2024) and its successors introduced a new paradigm: instead of producing a single chain of thought, the model can allocate variable amounts of computation at inference time. For easy questions, it responds quickly. For hard problems, it "thinks longer," generating extended internal reasoning chains before producing an answer.
The key insight is that scaling inference-time compute can be as powerful as scaling training-time compute. A model trained on a fixed dataset can solve harder problems if given more time to reason at inference. This is analogous to a human spending more time on a difficult exam question: the knowledge is already there, but applying it requires deliberation.
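o1's internal mechanism is proprietary, but a published technique, self-consistency (Wang et al., 2022), illustrates the same principle of trading inference compute for accuracy. In the sketch below, `sample_chain` is a hypothetical model call and the "Answer:" parsing convention is an assumption.

```python
# Self-consistency: sample several independent reasoning chains at non-zero
# temperature and majority-vote the final answers. More samples = more
# inference-time compute = (typically) higher accuracy, up to saturation.
from collections import Counter

def sample_chain(question: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in: one sampled chain of thought ending 'Answer: ...'."""
    raise NotImplementedError("plug in a real model client here")

def final_answer(chain: str) -> str:
    """Parse the final answer out of a reasoning chain."""
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistency(question: str, n_samples: int = 16) -> str:
    answers = [final_answer(sample_chain(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```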
o1 achieved state-of-the-art results on competition mathematics (scoring 83% on the AIME, a qualifying exam for the USA Mathematical Olympiad), competitive programming (89th percentile on Codeforces), PhD-level science questions, and complex coding tasks. It also introduced new failure modes: the model can produce plausible-looking reasoning chains that reach incorrect conclusions, and the extended thinking time makes it more expensive per query.
“We find that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).”
OpenAI, 'Learning to Reason with LLMs', OpenAI Research Blog (September 2024) - Scaling behaviour
This observation established that inference-time scaling is a viable alternative to simply training larger models. It opened a new dimension of capability improvement that does not require more training data or larger architectures.
Building on inference-time reasoning, the discussion turns to world models and autonomous systems.
A world model is an internal representation of how the environment works: what happens when you take certain actions, what is physically possible, and what causal relationships exist. Humans have rich world models. We know that unsupported objects fall, that pushing a glass off a table will break it, and that roads continue around corners even when we cannot see them.
Recent research suggests that large-scale models are developing something resembling world models. Video generation models trained on vast amounts of footage learn plausible physics: objects fall, liquids flow, and shadows move consistently with light sources. Language models trained on enough text develop spatial reasoning capabilities they were never explicitly taught.
The implication for autonomous systems is profound. A self-driving car that can predict the consequences of its actions three seconds into the future makes better decisions than one that reacts purely to the current frame. A robot with a world model can plan multi-step manipulation tasks without needing to try every possibility physically. The combination of multimodal perception, reasoning, and world modelling is the foundation for increasingly autonomous AI systems.
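A toy sketch makes the planning advantage concrete: given any model that predicts the next state, an agent can evaluate candidate action sequences in imagination and execute only the best first step. The linear dynamics and random-shooting planner below are illustrative stand-ins, not how any production system works.

```python
# Random-shooting planning over a (toy) world model: imagine rollouts,
# score them against a goal, act on the best first action, then replan.
import numpy as np

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned dynamics model: next state = f(state, action)."""
    return state + 0.1 * action

def plan(state: np.ndarray, goal: np.ndarray,
         horizon: int = 5, n_candidates: int = 256) -> np.ndarray:
    best_cost, best_first = np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, state.size))
        s = state
        for a in actions:              # roll out entirely in imagination
            s = world_model(s, a)
        cost = float(np.linalg.norm(s - goal))
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first                  # execute one step, then replan

print(plan(np.zeros(2), np.ones(2)))
```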
Common misconception
“Large language models are just autocomplete and cannot truly understand anything.”
The 'just autocomplete' framing understates what emerges from next-token prediction at scale. Models trained on enough diverse data develop internal representations that capture syntax, semantics, factual knowledge, and basic reasoning patterns. Whether this constitutes understanding in a philosophical sense is debatable, but the functional capabilities are real and measurable. The productive question is not whether models truly understand, but what tasks they can reliably perform and where they fail.
Finally, the discussion turns to emergent capabilities and the predictability debate.
Emergent capabilities are abilities that appear at a certain scale but are absent in smaller models. A model with 10 billion parameters might fail completely at multi-step arithmetic, while a model with 100 billion parameters succeeds. The capability appears to emerge suddenly rather than improving gradually.
Whether emergence is real or an artefact of measurement is actively debated. Schaeffer et al. (2023) argued that many apparent emergent capabilities are artefacts of using non-linear metrics (like exact match accuracy) rather than continuous metrics (like log-likelihood). When measured continuously, performance often improves smoothly with scale. Others maintain that certain capabilities genuinely require a minimum scale threshold.
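A small numerical example shows the metric effect Schaeffer et al. describe: if per-token accuracy improves smoothly with scale, an all-or-nothing exact-match score on a multi-token answer can still shoot up abruptly. The numbers below are invented purely for illustration.

```python
# Smooth per-token gains look "emergent" under an exact-match metric.
import numpy as np

per_token = np.linspace(0.50, 0.95, 7)  # per-token accuracy, rising smoothly with scale
exact_match = per_token ** 10           # a 10-token answer scored all-or-nothing

for p, em in zip(per_token, exact_match):
    print(f"per-token {p:.2f} -> exact-match {em:.4f}")
# per-token 0.50 -> exact-match 0.0010
# ...
# per-token 0.95 -> exact-match 0.5987
# Per-token accuracy climbs gently; exact match stays near zero,
# then rises sharply -- an apparent "emergence" created by the metric.
```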
For practitioners, the debate matters because it affects planning. If capabilities emerge unpredictably, you cannot reliably forecast what the next generation of models will be able to do. If capabilities scale smoothly, you can make more confident predictions. The current evidence suggests a mix: many capabilities scale predictably, but some do appear more suddenly, particularly those requiring the combination of multiple sub-skills.
What is the key architectural advantage of a natively multimodal model over a pipeline of specialised models?
Why does chain-of-thought prompting improve model performance on reasoning tasks?
Gemini Team, Google, 'Gemini: A Family of Highly Capable Multimodal Models', arXiv (2023)
Sections 1-4
Technical report on Gemini's natively multimodal architecture. Establishes the design principles for processing images, audio, and text through a single transformer. Used for the multimodal architecture discussion.
Wei, J. et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models', NeurIPS (2022)
Full paper
The original chain-of-thought paper. Demonstrated that encouraging models to show intermediate reasoning steps dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks.
OpenAI, 'Learning to Reason with LLMs', OpenAI Research Blog (September 2024)
Full post
Introduces o1 and the concept of inference-time compute scaling. Establishes that test-time computation can be as impactful as training-time computation for improving reasoning capabilities.
Schaeffer, R. et al., 'Are Emergent Abilities of Large Language Models a Mirage?', NeurIPS (2023)
Full paper
Challenges the emergence narrative by showing that apparent discontinuities often result from metric choices rather than genuine phase transitions. Essential counterpoint for the predictability debate.
Bubeck, S. et al., 'Sparks of Artificial General Intelligence: Early Experiments with GPT-4', arXiv (2023)
Sections 1-3, 9
A thorough evaluation of GPT-4's capabilities across domains, including examples of apparent world-model reasoning. Provides measured analysis of both capabilities and limitations of frontier models.
You now understand the capabilities frontier: multimodal models, reasoning chains, inference-time scaling, and the emergence debate. These capabilities are powerful and accelerating. The next question is the most consequential one in the field: how do we ensure these systems remain aligned with human values? Module 23 covers AI safety and alignment, from the technical alignment problem to governance frameworks and the existential risk debate.
Module 22 of 24 · AI Practice & Strategy