The Bitter Lesson

Reference: Sutton, R. S. (March 13, 2019). Essay on personal website, Incomplete Ideas. Not a PDF — HTML essay, captured via ar-crawl. URL

Summary

Sutton’s short but enormously influential essay argues that seventy years of AI research teach one lesson: general methods that leverage computation are ultimately most effective, by a large margin, and attempts to encode human domain knowledge tend to plateau and eventually obstruct progress. The ultimate driver is Moore’s law — the exponentially falling cost of computation — which guarantees that whatever computational budget a method can exploit will become dramatically larger on timescales shorter than a research career. Researchers, pressed for short-term gains, routinely build in human understanding; this helps locally but impedes the methods that scale.
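The "shorter than a research career" point can be made concrete with a little arithmetic. The sketch below assumes a doubling period of two years for compute per unit cost; that figure is an illustrative assumption, not a number taken from the essay, and `compute_growth` is a hypothetical helper name.

```python
def compute_growth(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth in compute per unit cost after `years`.

    The two-year doubling period is an assumed, illustrative value.
    """
    return 2.0 ** (years / doubling_period_years)


for horizon in (10, 20, 40):
    # A 40-year horizon is roughly the length of a research career.
    print(f"{horizon:>2} years -> {compute_growth(horizon):,.0f}x more compute")
```

Under that assumption, a method that can absorb extra compute sees its budget grow about a thousandfold in 20 years, whereas a hand-built knowledge base does not grow on its own.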

He marshals four case studies to make the point concrete: computer chess (the massive, deep search that defeated world champion Kasparov in 1997, to the dismay of researchers who had invested in knowledge-engineered approaches); computer Go (the same pattern, delayed by about 20 years, culminating in search plus value learning by self-play); speech recognition (statistical HMMs and later deep learning displaced systems engineered around phonemes and the vocal tract); and computer vision (edges, SIFT features, and generalized cylinders replaced by convolutional nets). Sutton identifies search and learning as the two methods that seem to scale arbitrarily with compute. The deeper second lesson: the contents of minds are irredeemably complex, so we should not try to build simple theories of space, objects, multi-agent interaction, or symmetry into systems; we should instead build meta-methods that can discover that complexity themselves. “We want AI agents that can discover like we can, not which contain what we have discovered.”

Key quotes preserved for future reference:

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”

“The two methods that seem to scale arbitrarily in this way are search and learning.”

“We have to learn the bitter lesson that building in how we think we think does not work in the long run.”

“We want AI agents that can discover like we can, not which contain what we have discovered.”

Key Ideas

Moore’s-law-driven scaling dominates human-knowledge engineering.
Search and learning are the two methods that scale with compute.
Four canonical case studies: chess, Go, speech, vision.
Human-knowledge approaches plateau and hinder further progress.
Minds are irredeemably complex; build meta-methods, not content.
Bitterness: the eventual success comes at the expense of a favoured, human-centric approach.

Connections

Conceptual Contribution

Claim: Across AI’s history, general, computation-hungry methods (especially search and learning) win out over methods that build in human domain knowledge, because compute grows exponentially while human-knowledge investments do not; therefore AI research should focus on meta-methods that discover complexity rather than encode it.
Mechanism: Sutton induces the claim from four episodes (chess, Go, speech recognition, vision) in which knowledge-engineered approaches were surpassed, often reluctantly, by compute-leveraging statistical/search/learning approaches; he identifies search and learning as the two classes of techniques that scale arbitrarily with compute; and draws the meta-conclusion that the contents of minds are too complex to be programmed directly — only discovery procedures generalise.
Concepts introduced/used: Bitter Lesson, Scaling, Search and Learning, Meta-Methods, General-Purpose Methods
Stance: methodological manifesto
Relates to: Retrospectively justifies the architectural bet of Attention Is All You Need and the empirical scaling results of Language Models are Few-Shot Learners; stands in explicit tension with the knowledge-engineering tradition underpinning BDI, Agent-Oriented Programming and Ontologies; frames the debate over whether LLM Agents supersede or complement structured agent architectures.

Tags

#scaling #foundational #methodology #ai-history #manifesto

Backlinks

Linked Pages

LLM Agents

Large-language-model-powered agents: natural-language coordination, tool use, multi-agent orchestration.

Surveys & frameworks

Protocols & communication

Failures & threats

Lineage

Ontologies

Shared, formal vocabularies for meaning. Essential for agents to understand each other.

Sources

Agent-Oriented Programming

Reference: Shoham, Y. (1993). Artificial Intelligence 60, Elsevier. Source file: shoam-aop.pdf. URL

Summary

Shoham introduces Agent-Oriented Programming (AOP) as a specialization of object-oriented programming in which the state of each module — now called an agent — is a mental state composed of beliefs, capabilities, decisions/choices, commitments, and obligations. Interpretation of these components is formally grounded in an extension of standard epistemic logics that adds temporality, obligation, decision, and capability operators.

Agents communicate via speech-act-inspired primitives (inform, request, offer, promise, decline, etc.), constrained by rules such as honesty and consistency. The paper defines a class of agent interpreters and presents AGENT-0, a concrete implemented language whose semantics is tied to the mental-state logic. Shoham also discusses “agentification” — converting arbitrary devices into AOP-programmable agents — and situates AOP against BDI architectures, speech act theory, and McCarthy’s Elephant2000.

Key Ideas

AOP as OOP specialization with mental-state components.
Agents constrained by honesty/consistency on communicative acts.
Speech-act-typed primitives as message types.
AGENT-0 language and interpreter.
“Agentification” of arbitrary devices.

Connections

Conceptual Contribution

Claim: Programming can be fruitfully specialised to an “agent” paradigm in which module state is a formally specified mental state and inter-module communication is a speech act constrained by rational/ethical rules.
Mechanism: Shoham extends standard modal-epistemic logic with temporal, obligation, decision and capability operators, defines the mental-state components of an agent (belief, capability, commitment, choice, obligation), then specifies a generic agent interpreter cycle (update mental state, execute commitments, handle incoming messages) and instantiates it in AGENT-0, whose programs are condition-action rules over mental state and whose messages are typed INFORM/REQUEST/UNREQUEST primitives.
Concepts introduced/used: Agent-Oriented Programming, Mental State, Commitment, AGENT-0, Speech Act Theory, Agentification, BDI, Honesty Constraint
Stance: foundational
Relates to: Establishes the mentalistic ACL stance later criticised by Agent Communication Languages - Rethinking the Principles and surveyed by Intelligent Agents Theory and Practice; AGENT-0’s speech-act primitives anticipate KQML as an Agent Communication Language and FIPA-ACL; the mental-state architecture is reused in An Interaction-oriented Agent Framework for Open Environments.
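The interpreter cycle described under Mechanism can be sketched roughly as follows. This is a toy illustration in Python, not AGENT-0 syntax: the `Agent` and `Message` classes are hypothetical, and the rule "commit to any requested action the agent is capable of" is a simplifying assumption standing in for AGENT-0's condition-action commitment rules.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    kind: str      # "INFORM", "REQUEST", or "UNREQUEST"
    content: str   # a fact (INFORM) or an action name (REQUEST / UNREQUEST)


@dataclass
class Agent:
    name: str
    capabilities: set[str]
    beliefs: set[str] = field(default_factory=set)
    # Each commitment is (agent the commitment is owed to, action name).
    commitments: list[tuple[str, str]] = field(default_factory=list)
    log: list[str] = field(default_factory=list)

    def handle(self, msg: Message) -> None:
        """Update the mental state in response to one incoming message."""
        if msg.kind == "INFORM":
            self.beliefs.add(msg.content)
        elif msg.kind == "REQUEST" and msg.content in self.capabilities:
            # Assumed rule: commit to anything we are capable of doing.
            self.commitments.append((msg.sender, msg.content))
        elif msg.kind == "UNREQUEST":
            entry = (msg.sender, msg.content)
            if entry in self.commitments:
                self.commitments.remove(entry)

    def step(self, inbox: list[Message]) -> None:
        """One cycle: process incoming messages, then execute commitments."""
        for msg in inbox:
            self.handle(msg)
        for _, action in self.commitments:
            self.log.append(action)  # stand-in for actually performing it
        self.commitments.clear()


agent = Agent("a1", capabilities={"open_door"})
agent.step([
    Message("a2", "INFORM", "door_locked"),
    Message("a2", "REQUEST", "open_door"),
    Message("a2", "REQUEST", "fly"),  # not a capability, so no commitment
])
print(agent.beliefs, agent.log)
```

The sketch leaves out time-stamped beliefs and commitments, which the logic described above treats as essential; it only shows the overall shape of the loop.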

Tags

BDI

Belief-Desire-Intention

Dominant agent-architecture paradigm: agents maintain beliefs, desires, and intentions and deliberate over them.

In this vault

Language Models are Few-Shot Learners

Reference: Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell et al. (2020). NeurIPS 2020. Source file: 2005.14165.pdf. URL

Summary

Introduces GPT-3, a 175-billion-parameter autoregressive Transformer language model, and shows that scaling up enables task-agnostic few-shot learning purely through in-context demonstrations — no gradient updates or fine-tuning required. The paper establishes the empirical basis for the “prompt as interface” paradigm that underpins modern LLM Agents.

GPT-3 is evaluated across dozens of NLP benchmarks (translation, QA, cloze, Winograd, arithmetic, word unscrambling, SuperGLUE, NLI) in zero-, one-, and few-shot regimes, in some cases matching or exceeding state-of-the-art fine-tuned systems while still falling short on others. The authors also examine broader impacts (misuse potential, bias, fairness, and energy cost), topics that later crystallise into the threat surfaces surveyed in AI Agents Under Threat.

Key Ideas

Scaling laws: performance on downstream tasks improves smoothly with model size, compute, and data.
In-context learning: a “meta-learning” inner loop where the model adapts to a task from examples in its prompt window, without weight updates.
Few-shot prompting as a general interface: the prompt becomes the programmable surface for LLM behaviour — the same surface later exploited by Prompt Injection and Jailbreak attacks.
Emergent capabilities (arithmetic, novel-word use) appearing only at scale.
Early catalogue of misuse risks (disinformation, generated news indistinguishable from human-written) foreshadowing agent-era threats.
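The zero-, one-, and few-shot regimes differ only in how many demonstrations are placed in the prompt. The sketch below shows one way such a prompt might be assembled; the function name `build_few_shot_prompt` and the `=>` separator are illustrative assumptions, and no model is actually called.

```python
def build_few_shot_prompt(
    task_description: str,
    demonstrations: list[tuple[str, str]],
    query: str,
    k: int,
) -> str:
    """Build a prompt with k in-context demonstrations (k=0 is zero-shot)."""
    lines = [task_description]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_few_shot_prompt("Translate English to French:", demos, "peppermint", k=2))
```

No weights change between the k=0 and k=2 cases; any adaptation to the task has to come from the text in the prompt window.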

Connections

Conceptual Contribution

Claim: Sufficiently large autoregressive language models become few-shot learners, performing new tasks from prompt demonstrations alone — establishing the prompt as the universal programming surface for LLM systems.
Mechanism: Train a 175B-parameter Transformer on ~300B tokens of filtered Common Crawl, WebText2, Books, and Wikipedia; evaluate on 40+ benchmarks in zero/one/few-shot settings without gradient updates.
Concepts introduced/used: in-context learning, few-shot prompting, scaling laws, emergent capabilities, prompt-as-interface — all prerequisites for LLM Agents, Tool Use, Prompt Injection, and the threat taxonomy of AI Agents Under Threat.
Stance: empirical/position
Relates to: Foundational substrate cited throughout AI Agents Under Threat; the prompt interface it popularises is the attack surface studied in Prompt Injection, Jailbreak, and ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems.

Tags

Attention Is All You Need

Reference: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Advances in Neural Information Processing Systems 30 (NeurIPS). arXiv:1706.03762. Source file: vaswani_attention_is_all_you_need.pdf. URL

Summary

The paper introduces the Transformer, a sequence-transduction architecture that dispenses entirely with recurrence and convolution and relies solely on attention mechanisms to model dependencies between input and output tokens. An encoder stack of N=6 identical layers, each composed of a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer (with residual connections and layer normalisation), encodes the source; a symmetric decoder stack augments each layer with masked self-attention and cross-attention to the encoder output. Because the architecture has no inherent notion of token order, sinusoidal positional encodings are added to the input embeddings.
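For reference, the sinusoidal encodings are defined in the paper as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal pure-Python sketch of that definition follows; the function name is an assumption, and a real implementation would use an array library.

```python
import math


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `i` walks over the even dimensions, i.e. it plays the role of 2i
        # in the paper's formula.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


table = sinusoidal_positional_encoding(max_len=4, d_model=8)
print(table[0])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

Each row is added element-wise to the token embedding at that position, so the encoding dimension must equal the embedding dimension.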

The core primitive is scaled dot-product attention, Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V, extended to multi-head attention by linearly projecting Q, K, V h=8 times into lower-dimensional subspaces, running attention in parallel, and concatenating the results. The motivation is computational: self-attention has O(1) sequential operations and O(1) maximum path length between any two positions — much shorter than recurrent or convolutional alternatives — enabling far greater parallelism and better gradient flow over long distances. The Transformer achieves state-of-the-art BLEU on WMT 2014 En–De (28.4) and En–Fr (41.8) at a small fraction of prior training cost and generalises to constituency parsing. It is the architectural substrate on which virtually all subsequent large language models are built.
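The attention formula above can be written out directly. The following is a minimal single-head sketch in pure Python, with lists of lists standing in for matrices; it omits masking, batching, and the learned projections used in multi-head attention, and the function names are illustrative.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    Q: list[list[float]], K: list[list[float]], V: list[list[float]]
) -> list[list[float]]:
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value rows.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output


Q = [[0.0, 0.0]]                # a query with no preference between keys
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))  # uniform weights: the mean of V's rows
```

Every query row is processed independently of the others, which is what allows the computation to run in parallel across sequence positions.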

Key Ideas

Pure attention architecture; no recurrence or convolution.
Scaled dot-product attention with 1/√dₖ scaling to stabilise softmax gradients.
Multi-head attention for attending to different representation subspaces.
Three uses of attention: encoder self-attention, decoder masked self-attention, encoder-decoder cross-attention.
Sinusoidal positional encodings, chosen on the hypothesis that they may let the model extrapolate to sequences longer than those seen in training.
O(1) path length supports learning of long-range dependencies.
Massive parallelism compared to RNNs; dramatic training cost reduction.

Connections

Conceptual Contribution

Claim: Recurrence and convolution are unnecessary for high-quality sequence transduction; an architecture built entirely from (multi-head, self-)attention and feed-forward layers outperforms prior state-of-the-art on machine translation while training in a fraction of the time.
Mechanism: Replace RNN hidden-state chains with an encoder-decoder stack of identical layers, each using scaled dot-product multi-head attention plus position-wise FFN with residual connections and layer norm; inject order via sinusoidal positional encodings; use masked self-attention in the decoder to preserve autoregression; exploit O(1) sequential complexity and O(1) max path length to parallelise across sequence positions and shorten gradient paths to long-range dependencies.
Concepts introduced/used: Transformer, Self-Attention, Multi-Head Attention, Positional Encoding, Scaled Dot-Product Attention, Encoder-Decoder Attention
Stance: empirical architecture paper
Relates to: Architectural substrate for Language Models are Few-Shot Learners, Toolformer and the entire LLM Agents literature; exemplifies the scaling-over-structure thesis of The Bitter Lesson; reshapes what “agent architecture” means relative to Intelligent Agents Theory and Practice’s 1990s taxonomy.

Tags

General-Purpose Methods

Methods applicable across tasks and domains, scaling with available computation — the methodological ideal of the Bitter Lesson and of McCarthy’s Generality agenda.

In this vault

Meta-Methods

General-purpose methods that build structure from computation rather than encoding it. Contrasted with human-knowledge-engineering.

In this vault

Search and Learning

Sutton: the two general-purpose methods that scale with computation — search (tree, Monte Carlo) and learning (gradient descent over deep function approximators).

In this vault

The Bitter Lesson

Scaling

Increasing the size of models, datasets, and computation — the practical realisation of the Bitter Lesson in modern deep learning.

In this vault

Bitter Lesson

Sutton: across 70 years of AI, general-purpose methods that leverage computation (search, learning) consistently beat methods that encode human knowledge. The hard lesson the field keeps having to relearn.

In this vault

Intelligent Agents Theory and Practice

Intelligent Agents: Theory and Practice

Reference: Wooldridge, M., Jennings, N. R. (1995). Knowledge Engineering Review. Source file: woodridge_intelligent_agents.pdf. URL

Summary

A foundational survey that organizes the agent-based computing field into three tightly coupled concerns: agent theories (formal specifications of what an agent is, often via intentional notions such as belief, desire, intention, obligation), agent architectures (software/hardware designs satisfying those specifications, e.g., BDI, subsumption, layered), and agent languages (programming languages whose primitives embody the theorists’ concepts).

Wooldridge and Jennings distinguish a weak notion of agency (autonomy, social ability, reactivity, pro-activeness) from a stronger AI-centric notion involving mentalistic attributes (knowledge, belief, intention) and sometimes emotion or mobility. They review representational and reasoning formalisms (modal logics for knowledge and belief, intention logics), critique their computational tractability, and survey implementations (AGENT-0, PLACA, Concurrent METATEM). The paper sets the vocabulary used by much subsequent MAS research.

Key Ideas

Three-way division: theory, architecture, language.
Weak vs. strong notion of agency.
Intentional stance justifies mentalistic ascription (Dennett/McCarthy).
BDI and related mental-state architectures.
Survey of 1990s agent languages and applications.

Connections

Conceptual Contribution

Claim: The emerging field of agent-based computing should be structured around the interlocking triad of agent theories, architectures and languages, and tensions between weak (behavioural) and strong (mentalistic) notions of agency should be kept explicit rather than elided.
Mechanism: Wooldridge and Jennings synthesise early 1990s work by (i) defining weak agency (autonomy, social ability, reactivity, pro-activeness) and strong agency (which adds mentalistic attributes such as knowledge, belief, and intention, and in some accounts emotion or mobility); (ii) surveying modal-logic theories of knowledge/belief/intention and the intentional stance as justification; (iii) cataloguing BDI, subsumption and layered architectures; and (iv) reviewing implemented languages (AGENT-0, PLACA, Concurrent METATEM) and representative applications. Computational tractability of the richer logics is flagged as a central open problem.
Concepts introduced/used: Weak Agency, Strong Agency, Intentional Stance, BDI, Subsumption Architecture, Layered Architecture, Concurrent METATEM, PLACA, Agent Theory-Architecture-Language Triad
Stance: survey
Relates to: Canonises the vocabulary used by Agent-Oriented Programming, Agent Communication Languages - Rethinking the Principles and Trends in Agent Communication Language; its tractability critique anticipates the social-semantic move in An Interaction-oriented Agent Framework for Open Environments.

Tags

Toolformer

Toolformer: Language Models Can Teach Themselves to Use Tools

Reference: Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (2023). NeurIPS 2023. Source file: 2302.04761.pdf. URL

Summary

Toolformer shows that a language model can teach itself, in a self-supervised way, to decide when to call an external API, which API to call, what arguments to pass, and how to incorporate the returned result into subsequent token prediction. Starting from a handful of human-written demonstrations per tool, the model annotates a large corpus with candidate API calls, executes them, and keeps only the calls whose results reduce the cross-entropy loss on the surrounding tokens. The filtered, API-augmented corpus is then used to finetune the model.
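The filtering step can be sketched as a simple comparison of losses. In the sketch below, the loss values are hypothetical numbers supplied by hand (in the paper they are weighted cross-entropy losses computed by the model over the following tokens), and `CandidateCall` and `keep_call` are illustrative names. The baseline is taken as the lower of two losses, with no call at all and with a call but no result, which is how the paper defines it as far as this note recalls.

```python
from dataclasses import dataclass


@dataclass
class CandidateCall:
    text: str
    loss_no_call: float           # loss on the following tokens with no API call
    loss_call_no_result: float    # loss with the call inserted but its result withheld
    loss_call_with_result: float  # loss with the call and its result inserted


def keep_call(candidate: CandidateCall, tau_f: float) -> bool:
    """Keep a call only if its result lowers the loss by at least tau_f."""
    l_minus = min(candidate.loss_no_call, candidate.loss_call_no_result)
    l_plus = candidate.loss_call_with_result
    return l_minus - l_plus >= tau_f


candidates = [
    CandidateCall("Calculator(400 / 1400)", 2.0, 2.5, 0.5),         # result helps a lot
    CandidateCall("Calendar()", 1.0, 1.25, 0.875),                  # result barely helps
]
kept = [c.text for c in candidates if keep_call(c, tau_f=1.0)]
print(kept)
```

Only the calls that survive this filter are written back into the corpus used for finetuning.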

The approach is demonstrated on a calculator, Q&A system, Wikipedia search, machine translation, and calendar. A 6.7B Toolformer (GPT-J-based) outperforms a much larger GPT-3 on several zero-shot downstream tasks without sacrificing core language modelling. The paper is a foundational reference for Tool Use in LLM Agents and therefore for the tool-layer attack surface catalogued in AI Agents Under Threat and MalTool Malicious Tool Attacks.

Key Ideas

Self-supervised tool learning via loss-reduction filtering — no task-specific supervision.
API calls are represented as interleaved special tokens (<API> name(args) -> result </API>) directly inside the token stream.
A single model learns heterogeneous tools rather than one tool per specialist.
Tools compensate for LLM weaknesses (arithmetic, factual recall, freshness, low-resource translation).
Establishes the architectural template — model emits tool-call tokens, external executor returns results, tokens resume — that later MCP/A2A-style protocols generalise.
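The emit, execute, resume template from the last item can be illustrated with a toy executor. The sketch assumes the `<API> name(args) -> result </API>` notation used in this note; the regular expression, the `TOOLS` registry, and the one-operator `calculator` are hypothetical stand-ins rather than the paper's implementation.

```python
import operator
import re

# Matches a call that has been emitted but not yet executed.
API_CALL = re.compile(r"<API>\s*(\w+)\((.*?)\)\s*</API>")

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculator(expression: str) -> str:
    """Toy calculator: evaluates a single 'a op b' expression."""
    left, op, right = expression.split()
    value = OPERATORS[op](float(left), float(right))
    return f"{value:.2f}"


TOOLS = {"Calculator": calculator}


def execute_api_calls(text: str) -> str:
    """Run each pending call and splice its result back into the text."""
    def run(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        if name not in TOOLS:
            return match.group(0)  # unknown tool: leave the call untouched
        return f"<API> {name}({args}) -> {TOOLS[name](args)} </API>"

    return API_CALL.sub(run, text)


draft = "Out of 1400 participants, 400 (or <API> Calculator(400 / 1400) </API> 29%) passed."
print(execute_api_calls(draft))
```

In an actual decoding loop the executor would run as soon as the model emits the call, so that the result is already in context when generation resumes.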

Connections

Conceptual Contribution

Claim: Language models can learn to use external tools in a self-supervised fashion by keeping only API calls whose responses reduce next-token loss, bootstrapping tool competence from a handful of demonstrations.
Mechanism: Sample candidate API-call positions and arguments via in-context prompting; execute calls; filter by weighted cross-entropy reduction (L_i^- − L_i^+ ≥ τ_f); finetune on the filtered, API-interleaved corpus.
Concepts introduced/used: self-supervised tool learning, API-call tokens, loss-based filtering, Tool Use, LLM Agents — the direct antecedent of Model Context Protocol style tool-calling interfaces.
Stance: constructive
Relates to: Supplies the tool-invocation substrate whose abuses are studied in MalTool Malicious Tool Attacks, Skill Supply Chain Attack, and the action-layer threats in AI Agents Under Threat.