
Research milestone · December 2023 and September 2024
In December 2023, Google DeepMind released AlphaCode 2, a coding model that solved 43% of competitive programming problems from recent Codeforces contests. This placed it at approximately the 85th percentile of human competitors. For context, competitive programming problems require multi-step algorithm design, mathematical reasoning, and precise implementation. These are tasks that practitioners widely believed would remain exclusively human territory for years.
In September 2024, OpenAI released o1, the first commercially available reasoning model to use extended chain-of-thought generation as a core mechanism rather than a prompting technique. o1 scored at the 89th percentile on competitive programming, scored above the passing threshold on the US Medical Licensing Examination, and achieved scores comparable to PhD students on graduate-level physics, chemistry, and biology benchmark questions. OpenAI followed with o3 in early 2025, which achieved 87.5% on ARC-AGI (Abstraction and Reasoning Corpus), a benchmark designed to test generalisation to novel problems and long cited as evidence of what AI could not do.
These milestones matter for agent architects not because they signal artificial general intelligence, but because they demonstrate a qualitative shift: complex reasoning tasks that previously required careful prompt engineering and human oversight can now be delegated to a model with dramatically less scaffolding. The practical question is no longer "can the model handle complex reasoning?" but "when is the extra cost and latency of a reasoning model justified?"
If AI models can now score at the 85th percentile in competitive programming and pass PhD-level science exams, what does that change about where humans and agents should work together?
Production deployment, the subject of the preceding modules, is the current state of the art. This module looks forward: emerging agent architectures, new reasoning techniques, computer-use agents, and the open research questions that will shape the next generation of agentic AI.
The module begins by examining the pace of change and how to think about it.
The agent environment in 2026 would be largely unrecognisable to someone who last examined it in 2022. Reasoning models spend variable compute on hard problems. Computer-use agents navigate real graphical user interfaces. Agentic coding tools write and debug entire features autonomously across multiple files. The A2A (Agent-to-Agent) protocol, published by Google in early 2025, enables agents from different organisations to delegate tasks to each other directly.
Understanding where research is heading is not academic for practitioners. The capabilities demonstrated in research papers and limited beta programmes today typically become production tools within 12 to 18 months. Designing your agent architecture without awareness of incoming capabilities means you may build expensive bespoke solutions to problems that will be solved by a model API parameter in six months.
The principle for production architects: design for today's reliable capabilities and build extension points for tomorrow's. Do not bet your production architecture on pre-release features. Do read the research to know which extension points are worth building.
With an understanding of the pace of change and how to think about it in place, the discussion can now turn to reasoning models and extended thinking, which builds directly on these foundations.
Standard LLMs generate an answer in a single forward pass through the network. Reasoning models generate an extended internal chain of thought before producing their final answer. This thinking is usually visible to the developer as a separate block in the API response. The model explores multiple approaches, identifies errors in its own reasoning, backtracks, and revises before committing to a final answer. Psychologist Daniel Kahneman's dual-process model describes this as the difference between System 1 thinking (fast, automatic, associative) and System 2 thinking (slow, deliberate, effortful). Standard LLMs are System 1. Reasoning models implement a form of System 2.
In practice, extended thinking significantly outperforms standard mode on tasks involving multi-step mathematical proofs, debugging complex code, architectural trade-off analysis, and scientific reasoning. It shows little benefit over standard mode for factual question answering, simple summarisation, translation, or conversational responses, where the answer does not require extended deliberation.
The cost is real: a 10,000-token thinking budget can cost three to five times more than a standard response and take 30 to 60 seconds. For a customer support agent handling 10,000 queries per day with simple questions, enabling extended thinking on every request would increase costs and latency substantially for no quality improvement. Reserve extended thinking for the subset of tasks where accuracy genuinely matters more than speed and cost.
“System 1 operates automatically and quickly, with little or no effort and no sense of voluntary control. System 2 allocates attention to the effortful mental activities that demand it.”
Kahneman, D. (2011). Thinking, Fast and Slow, Part 1, Chapter 1. Farrar, Straus and Giroux
Reasoning models with extended thinking implement a computational analogue of System 2: they spend tokens deliberating before committing to an answer, much as humans shift from intuitive responses to deliberate analysis for hard problems. The practical implication is the same as in human cognition: engage deliberate reasoning for genuinely hard problems and reserve faster responses for straightforward ones.
With an understanding of reasoning models and extended thinking in place, the discussion can now turn to using extended thinking with the Anthropic API, which builds directly on these foundations.
Claude's extended thinking is enabled by passing a thinking parameter to the Messages API with a budget_tokens value between 1,024 and 100,000. The budget sets the maximum number of tokens the model may use for its internal reasoning. A higher budget allows more deliberation on harder problems; it does not guarantee better answers on simple ones.
The API response contains two types of content blocks: thinking blocks (the internal reasoning, with type: "thinking") and text blocks (the final answer, with type: "text"). In most applications you display only the text block to end users. The thinking block is available for debugging and for cases where showing reasoning increases user trust, such as medical decision support or legal analysis tools.
Set max_tokens to the thinking budget plus at least 2,048 additional tokens for the final response. If you set max_tokens equal to the thinking budget, the model may exhaust its token allowance on thinking and produce no final answer. A common mistake is forgetting that the thinking tokens count towards the total output token cost.
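A minimal sketch of this pattern with the Anthropic Python SDK is below. The model name, budget, and prompt are illustrative assumptions; the shape of the thinking parameter and the block types follow the documented API.

```python
# A minimal sketch of extended thinking with the Anthropic Python SDK.
# The model name, budget, and prompt are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

THINKING_BUDGET = 10_000

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any extended-thinking-capable model
    # Budget plus headroom, so thinking cannot exhaust the whole token allowance.
    max_tokens=THINKING_BUDGET + 2_048,
    thinking={"type": "enabled", "budget_tokens": THINKING_BUDGET},
    messages=[{
        "role": "user",
        "content": "Find and explain the off-by-one error in this binary search...",
    }],
)

# The response interleaves thinking blocks and text blocks.
for block in response.content:
    if block.type == "thinking":
        pass  # internal reasoning: log it for debugging, hide it from end users
    elif block.type == "text":
        print(block.text)  # the final answer most applications display
```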
Common misconception
“Extended thinking always produces better answers than standard mode.”
Extended thinking improves accuracy on tasks requiring genuine multi-step deliberation: complex maths, algorithm debugging, architectural trade-offs, and scientific reasoning. For factual lookups, simple summarisation, translation, or conversational queries, extended thinking adds cost and latency without measurable quality improvement. The correct question is not 'is extended thinking better?' but 'does this specific task benefit from deliberate multi-step reasoning?' Measure quality on your actual task before enabling extended thinking by default.
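One way to act on that advice is to decide per request whether to enable thinking. The sketch below uses a crude keyword heuristic as a stand-in for a real intent classifier; the markers, budget, and model name are all illustrative assumptions.

```python
# A routing sketch: enable thinking only for queries that look genuinely hard.
# The keyword heuristic, markers, budget, and model name are illustrative
# assumptions; a production system would use a real intent classifier.
def thinking_config(query: str) -> dict | None:
    hard_markers = ("prove", "debug", "trade-off", "architecture", "why does")
    if any(marker in query.lower() for marker in hard_markers):
        return {"type": "enabled", "budget_tokens": 10_000}
    return None  # omit the parameter entirely: standard fast, cheap response

def request_kwargs(query: str) -> dict:
    kwargs = {
        "model": "claude-sonnet-4-20250514",  # assumption
        "max_tokens": 2_048,
        "messages": [{"role": "user", "content": query}],
    }
    config = thinking_config(query)
    if config is not None:
        kwargs["thinking"] = config
        kwargs["max_tokens"] = config["budget_tokens"] + 2_048  # budget + headroom
    return kwargs
```

With this shape, the vast majority of simple queries never pay the thinking cost, while the hard minority gets the full deliberation budget.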
With an understanding of using extended thinking with the Anthropic API in place, the discussion can now turn to multimodal agents, which builds directly on these foundations.
Modern frontier models process images and documents alongside text. For agents, this opens workflows that were previously impossible without specialised computer vision pipelines. A document processing agent can read a scanned invoice, extract line items, and populate a database record. A monitoring agent can analyse a dashboard screenshot and identify anomalies. A quality assurance agent can compare a UI mockup against a rendered screenshot and report discrepancies.
As of early 2026, image understanding and PDF processing are production-ready across the major frontier models. Video understanding is emerging: some models can analyse short video clips, but reliability and cost at scale remain limiting factors for most production use cases. Audio transcription is typically handled via a separate specialised model (such as OpenAI Whisper) rather than within the main LLM call.
The practical pattern for multimodal agents is to pass images as base64-encoded data in the messages array, with a media_type declaration. Keep images as small as practical while retaining the relevant detail: a 1080p screenshot of a dashboard adds significant tokens; a cropped 400x300 region containing the relevant chart achieves the same analysis at far lower cost. For document analysis, PDF files can be passed directly on models that support native PDF input, avoiding the need for separate OCR (Optical Character Recognition) pre-processing.
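A sketch of that message shape with the Anthropic Python SDK follows. The file name, model, and prompt are assumptions; the image block structure follows the documented base64 format.

```python
# A sketch of the multimodal message shape with the Anthropic Python SDK.
# The file name, model, and prompt are assumptions.
import base64
import anthropic

client = anthropic.Anthropic()

# A cropped region keeps token cost far below a full 1080p screenshot.
with open("dashboard_crop.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any vision-capable model
    max_tokens=1_024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data,
                },
            },
            {"type": "text", "text": "Identify any anomalies in this chart."},
        ],
    }],
)
print(response.content[0].text)
```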
With an understanding of multimodal agents in place, the discussion can now turn to computer use agents, which builds directly on these foundations.
Computer use is the ability of an AI agent to interact directly with a computer interface: web browser, desktop application, or file manager. The agent sees the screen as an image, decides what action to take (click, type, scroll, press a key), and the system executes that action. The agent does not use an API or structured integration with the application. It uses the same visual interface a human would.
Anthropic demonstrated computer use with Claude in October 2024. OpenAI Operator and Google's Project Mariner followed with similar capabilities. As of early 2026, computer use is available in beta and is increasingly reliable for structured, repetitive tasks on stable UIs, such as filling in standardised forms or extracting data from a web page that lacks an API.
Current limitations are significant and must inform any production use decision. Error rate increases substantially on complex UIs with many interactive elements, modal dialogues, or dynamic content that changes between screenshots. Latency is high: each action requires a screenshot, LLM inference, and action execution, which typically takes two to five seconds per step. Security boundaries are unclear: an agent with computer use can access any file or application visible on the screen. Any credential visible on screen is accessible to the agent, which requires strict sandboxing. Determinism is low: the same instruction may produce different action sequences across runs.
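The loop below illustrates where those latency and determinism properties come from. Every function in it is a hypothetical stand-in: take_screenshot for a capture layer, decide_next_action for the provider's computer-use model call, and execute_action for a sandboxed input-injection layer.

```python
# An illustrative sketch of the screenshot-action loop behind computer use.
# All three helper functions are hypothetical stand-ins, not a real API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "click", "type", "scroll", "key", or "done"
    payload: dict  # coordinates, text to type, key name, and so on

def take_screenshot() -> bytes:
    raise NotImplementedError("capture the sandboxed display as an image")

def decide_next_action(instruction: str, screenshot: bytes,
                       history: list[Action]) -> Action:
    raise NotImplementedError("send screenshot and history to the model")

def execute_action(action: Action) -> None:
    raise NotImplementedError("inject the click or keystroke into the sandbox")

def run_task(instruction: str, max_steps: int = 50) -> list[Action]:
    # Each iteration costs a screenshot, one model call, and an execution step,
    # which is why two to five seconds of latency per step compounds quickly.
    history: list[Action] = []
    for _ in range(max_steps):
        action = decide_next_action(instruction, take_screenshot(), history)
        if action.kind == "done":
            return history
        execute_action(action)
        history.append(action)
    raise TimeoutError("step budget exhausted without task completion")
```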
“Computer use is a new frontier in AI capabilities. It allows Claude to interact with computers in the same way that humans do, opening up a new class of tasks.”
Anthropic (October 2024) - Computer use documentation: docs.anthropic.com/en/docs/build-with-claude/computer-use
The key phrase is 'the same way that humans do.' Computer-use agents generalise across any interface, not just those that expose APIs. The implication for system design is that any application a human can use becomes accessible to an agent, including legacy systems with no integration path. The limitation is that human-designed interfaces are not optimised for the screenshot-action loop that computer use requires.
With an understanding of computer use agents in place, the discussion can now turn to agentic coding, which builds directly on these foundations.
Agentic coding tools are AI agents with access to a file system, terminal, test runner, and search tools. They operate over multiple files and iterate based on test results, making them qualitatively different from single-response code generation. Examples include Claude Code (Anthropic), GitHub Copilot Workspace, and Cursor Agent.
A traditional LLM code generation interaction looks like: you describe what you want, the model produces code in a single response, you copy it, test it, and iterate. An agentic coding interaction looks like: you describe a feature, the agent reads the existing codebase, writes the implementation across multiple files, runs the tests, fixes the failures, and reports when the tests pass. The agent iterates autonomously on the feedback loop that was previously the developer's job.
Agentic coding is already production-ready for well-defined, testable tasks: adding a new API endpoint, writing tests for existing functions, refactoring a module to a new pattern. It is less reliable for tasks requiring architectural judgment, deep understanding of existing system behaviour, or tasks where the definition of success is ambiguous. The practical guideline: use agentic coding tools for tasks where you can write a clear definition of done and verify the result programmatically.
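A small sketch of what "verify the result programmatically" can mean in practice: the pytest and ruff commands below are assumptions, and any deterministic check that encodes your definition of done works equally well.

```python
# A sketch of a programmatic definition of done; the specific test and lint
# commands are assumptions for illustration.
import subprocess

def definition_of_done() -> bool:
    # The agent's work is accepted only when tests pass and the linter is clean.
    tests = subprocess.run(["pytest", "tests/", "-q"], capture_output=True)
    lint = subprocess.run(["ruff", "check", "src/"], capture_output=True)
    return tests.returncode == 0 and lint.returncode == 0
```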
With an understanding of agentic coding in place, the discussion can now turn to research directions for 2026 and 2027, which builds directly on these foundations.
Three research directions are most likely to change production agent architectures in the near term. The first is the A2A (Agent-to-Agent) protocol, published by Google in early 2025 as an open standard for agents from different organisations and providers to communicate. Just as MCP (Model Context Protocol) standardised tool interfaces, A2A standardises how agents delegate tasks to other agents. Early adopters include enterprise software vendors building interoperable ecosystems where, for example, a procurement agent can negotiate directly with a supply chain agent at another company.
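A discovery sketch based on the early A2A specification, which has agents publish a machine-readable Agent Card at a well-known URL: the exact path, field names, and the domain below are assumptions and may shift as the specification evolves.

```python
# A sketch of A2A agent discovery; path, fields, and domain are assumptions
# based on the early specification.
import json
import urllib.request

def fetch_agent_card(agent_base_url: str) -> dict:
    # Fetch another organisation's Agent Card before delegating a task to it.
    url = f"{agent_base_url}/.well-known/agent.json"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

# Usage: inspect the remote agent's advertised skills before sending work.
card = fetch_agent_card("https://procurement.example.com")
print(card.get("name"), [skill.get("id") for skill in card.get("skills", [])])
```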
The second direction is long-horizon task completion. Current agents reliably handle tasks requiring five to twenty steps. Research in 2025 and 2026 is targeting 100 to 1,000-step tasks with persistent state across sessions. The key technical challenges are reliable error recovery after many successful steps, context management across very long tasks, and cost control at extended scale. When these are solved, the class of tasks that can be safely delegated to agents expands significantly.
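A toy illustration of the persistent-state part of that challenge: checkpoint after each completed step so a failure at step 400 resumes rather than restarts. The file-based storage and schema are assumptions chosen for brevity, not a recommended store.

```python
# A toy illustration of persistent state for long-horizon tasks; the JSON file
# and schema are assumptions chosen for brevity.
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

def save_checkpoint(step: int, state: dict) -> None:
    # Persist after every completed step so progress survives a crash.
    CHECKPOINT.write_text(json.dumps({"step": step, "state": state}))

def load_checkpoint() -> tuple[int, dict]:
    # Resume from the last completed step instead of restarting a
    # multi-hundred-step task after a single failure.
    if CHECKPOINT.exists():
        data = json.loads(CHECKPOINT.read_text())
        return data["step"], data["state"]
    return 0, {}
```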
The third direction is constitutional agents and reinforcement learning from AI feedback (RLAIF). Research at Anthropic, Google DeepMind, and OpenAI is exploring agents that evaluate and improve their own behaviour against a specified set of principles, without requiring human labelling of every training example. This connects to Constitutional AI and has implications for how agent behaviour is aligned with organisational policies at scale.
Common misconception
“A capability demonstrated in a research paper or blog post is ready for production use.”
Research demonstrations and limited betas operate under controlled conditions with high human oversight and cherry-picked examples. Production deployment requires reliability across the full distribution of real inputs, cost efficiency at volume, safety bounds that hold without expert supervision, and integration with existing systems. Plan a 12 to 18 month gap between capability demonstration and production readiness for most frontier capabilities. Build extension points now; deploy when the reliability meets your production bar.
What is the key practical difference between a standard LLM response and an extended thinking response?
A product manager proposes using a computer use agent to automate data entry into a legacy HR system with no API, featuring a complex form-heavy interface with frequent session timeouts. Identify the most significant production risk.
You want to use extended thinking for a customer support agent handling 10,000 queries per day. The vast majority of queries are simple product questions answerable from documentation. What is the most cost-effective approach?
What is the A2A (Agent-to-Agent) protocol, and why is it significant for enterprise agent architecture?
Anthropic Extended Thinking documentation
docs.anthropic.com/en/docs/build-with-claude/extended-thinking
Official API reference for the thinking parameter, budget_tokens values, and the structure of thinking blocks in the response. Referenced in Section 22.3.
Anthropic Computer Use documentation
docs.anthropic.com/en/docs/build-with-claude/computer-use
Official capabilities, limitations, and safety guidance for computer use agents. Quoted in Section 22.5. Released October 2024.
Li, Y. et al. (2023). AlphaCode 2 Technical Report
Google DeepMind, December 2023
Demonstrated 43% solve rate on competitive programming problems, placing the model at approximately the 85th percentile of human competitors. Referenced in the opening case study.
OpenAI (2024). o1 announcement and system card
OpenAI, September 2024
Documents o1's performance on medical licensing, competition mathematics, and graduate-level science benchmarks. Referenced in the opening case study as the first commercially available reasoning model.
Google Agent-to-Agent (A2A) Protocol specification
google.github.io/A2A, Published early 2025
The emerging open standard for inter-agent communication across organisations and providers. Referenced in Section 22.7 as the next interoperability layer after MCP.
Kahneman, D. (2011). Thinking, Fast and Slow
Part 1, Chapter 1, Farrar, Straus and Giroux
The dual-process cognitive model (System 1 and System 2) provides the conceptual framing for why extended thinking improves accuracy on hard reasoning tasks. Quoted in Section 22.2.
Module 22 of 25 · Advanced Mastery