ReAct: Synergizing Reasoning and Acting in Language Models

Reference: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (2023). ICLR 2023 (Princeton / Google Brain). arXiv:2210.03629. Source file: react-2210.03629.pdf. URL

Summary

ReAct is a prompting paradigm that interleaves verbal reasoning traces (thoughts) with task-specific actions in a single LLM generation, letting the model plan, update its plan with observations from the environment, and query external sources. Unlike chain-of-thought (reasoning only) or action-only agents (e.g. WebGPT-style), ReAct alternates Thought -> Action -> Observation steps, so the model grounds its reasoning in retrieved evidence and uses its reasoning to decide what to act on next.
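
The Thought -> Action -> Observation alternation described above can be sketched as a short loop. This is a minimal illustration, not the paper's code: `TOY_WIKI`, `toy_env`, and `scripted_policy` are hypothetical stand-ins for the Wikipedia API and the LLM call, and the `action[argument]` trace format is an assumption of the sketch.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical two-entry lookup table standing in for the Wikipedia API.
TOY_WIKI: Dict[str, str] = {
    "Apple Remote": "The Apple Remote was designed to control the Front Row program.",
    "Front Row": "Front Row can be controlled by an Apple Remote or the keyboard function keys.",
}

def toy_env(action: str, argument: str) -> str:
    """Return an observation for an action; only `search` is supported here."""
    if action == "search":
        return TOY_WIKI.get(argument, f"Could not find {argument}.")
    return f"Unknown action {action}."

def scripted_policy(context: str) -> Tuple[str, str, str]:
    """Stand-in for an LLM call: maps the context to (thought, action, argument)."""
    if "Observation 1" not in context:
        return ("I need to search Apple Remote.", "search", "Apple Remote")
    if "Observation 2" not in context:
        return ("It controls Front Row, so I should search that next.", "search", "Front Row")
    return ("Front Row is also controlled by keyboard function keys.",
            "finish", "keyboard function keys")

def react(question: str,
          policy: Callable[[str], Tuple[str, str, str]],
          env: Callable[[str, str], str],
          max_steps: int = 5) -> Tuple[Optional[str], str]:
    """Run the loop; return (answer or None, full trajectory text)."""
    context = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        thought, action, argument = policy(context)
        # The thought only extends the context; it is never sent to the environment.
        context += f"Thought {step}: {thought}\nAction {step}: {action}[{argument}]\n"
        if action == "finish":
            return argument, context
        # The observation re-enters the context and conditions the next thought.
        context += f"Observation {step}: {env(action, argument)}\n"
    return None, context  # step budget exhausted without a finish action

answer, trace = react(
    "What else can control the program the Apple Remote was designed for?",
    scripted_policy, toy_env)
print(trace)
```

With a real model, `scripted_policy` would be replaced by a call that sends the few-shot exemplars plus `context` to the LLM and parses the next thought and action from its output.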

ReAct is evaluated on HotpotQA and FEVER (both via a Wikipedia API with search/lookup/finish actions), ALFWorld (text-based household tasks), and WebShop. On the knowledge tasks ReAct outperforms action-only prompting, and combining it with chain-of-thought self-consistency (CoT-SC) beats CoT alone by reducing hallucination. On the decision-making tasks ReAct beats imitation-learning and RL baselines by an absolute success rate of 34% on ALFWorld and 10% on WebShop, using only one or two in-context examples. The paper is the seminal reference for tool-using LLM agents and the thought/act/observe loop now ubiquitous in LLM agent frameworks.

Key Ideas

Interleave reasoning (thoughts) and acting (tool/API calls) in one LLM trajectory
Augment the action space A with a free-form language space L of “thoughts” that do not affect the environment but update the context
External Wikipedia API (search, lookup, finish) as a minimal tool interface for QA
Back-off combination with CoT self-consistency: run ReAct and fall back to CoT-SC when it fails to finish within its step budget, or run CoT-SC and fall back to ReAct when the sampled answers do not produce a clear majority
Human-interpretable, inspectable, and editable agent trajectories (“thought editing”)
Few-shot prompts with 1-6 exemplars suffice; fine-tuning on 3k trajectories further boosts small models
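
The back-off combination listed above can be sketched as two small functions, one per direction. This is a hedged illustration: the function names are hypothetical, and the "fewer than half of the samples agree" threshold is this sketch's reading of the paper's heuristic rather than a confirmed implementation detail.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def majority_vote(samples: List[str]) -> Tuple[str, int]:
    """Return the most frequent CoT-SC answer and how many samples agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count

def react_then_cot_sc(react_answer: Optional[str], cot_samples: List[str]) -> str:
    """ReAct -> CoT-SC: keep ReAct's answer if it finished, else use the CoT-SC vote."""
    if react_answer is not None:
        return react_answer
    return majority_vote(cot_samples)[0]

def cot_sc_then_react(cot_samples: List[str],
                      run_react: Callable[[], Optional[str]]) -> Optional[str]:
    """CoT-SC -> ReAct: trust the vote only if at least half of the samples agree."""
    answer, count = majority_vote(cot_samples)
    if count < len(cot_samples) / 2:
        return run_react()  # internal knowledge looks unreliable, so consult tools
    return answer
```

Here `react_answer is None` stands for a ReAct run that exhausted its step budget, and `run_react` is a hypothetical zero-argument callable that launches a ReAct episode only when needed.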

Connections

Conceptual Contribution

Claim: Reasoning and acting are synergistic rather than separate capabilities; interleaving them in a single LLM generation produces agents that are more grounded, less hallucination-prone, and more robust than pure reasoning (CoT) or pure acting (WebGPT-style) baselines.
Mechanism: Extend the agent’s action space to A’ = A ∪ L where L is the free-form language of thoughts. Prompt the LLM with few-shot thought/action/observation exemplars. The model decides when to think and when to act; observations from the environment (API results, game state) re-enter the context and condition subsequent thoughts.
Concepts introduced/used: Thought-Act-Observation loop, tool-use prompting, interleaved reasoning/acting, grounded reasoning, hallucination reduction via external knowledge, few-shot agent prompting, thought editing, ReAct + CoT-SC back-off.
Stance: empirical / methods — prompting-level intervention with extensive benchmarks.
Relates to: Direct successor to Chain-of-Thought Prompting; CoT without actions hallucinates more (in the paper’s HotpotQA error analysis, 14% of CoT’s correct answers rest on hallucinated reasoning or facts vs 6% for ReAct). Makes tool use a first-class LLM behaviour, complementary to Toolformer’s self-taught API-call fine-tuning. Foundational for LLM Agents and a key building block in Agents Framework - Zhou et al. The thought/act/observe skeleton is reused by nearly all subsequent agent papers, including Reflexion Language Agents with Verbal Reinforcement Learning (which uses ReAct as its default Actor).

Tags

#llm-agents #tool-use #reasoning #prompting #chain-of-thought

Backlinks

Linked Pages

Reflexion Language Agents with Verbal Reinforcement Learning

Reflexion: Language Agents with Verbal Reinforcement Learning

Reference: Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). NeurIPS 2023 (Northeastern / MIT / Princeton). arXiv:2303.11366. Source file: reflexion-2303.11366.pdf. URL

Summary

Reflexion reinforces language agents not by updating weights but by verbal feedback: after each episode the agent converts scalar or binary success signals into a natural-language self-critique and stores that critique in an episodic memory buffer that conditions the next attempt. This acts as a semantic gradient, letting the agent learn from trial-and-error in a handful of episodes without fine-tuning. The framework factors an agent into three LLM-instantiated modules: an Actor (generates text and actions, typically a ReAct agent or CoT reasoner), an Evaluator (scores the trajectory), and a Self-Reflection model (produces the verbal feedback written to memory).

On ALFWorld (decision-making) Reflexion completes 130/134 tasks vs ~100 for ReAct alone (+22%); on HotpotQA reasoning it adds ~20%; on the HumanEval coding benchmark it reaches 91% pass@1, surpassing GPT-4’s 80%. The paper also releases LeetcodeHardGym. Reflexion is the canonical reference for self-reflective metacognitive loops in LLM agents.

Key Ideas

Verbal reinforcement: convert sparse scalar/binary rewards into natural-language lessons
Three-module architecture: Actor / Evaluator / Self-Reflection, all LLMs
Short-term memory = trajectory history; long-term memory = accumulated self-reflections (bounded, e.g. last 3)
No gradient updates: policy changes via in-context memory, not weight updates
Self-reflection helps identify both mistaken actions (hallucination) and mistaken plans (inefficient planning)
Works on top of ReAct, CoT, or other actor policies; orthogonal to prompting style
State-of-the-art results on HumanEval, ALFWorld, HotpotQA with only a handful of trials

Connections

Conceptual Contribution

Claim: Language agents can be reinforced effectively without weight updates by converting sparse environmental feedback into verbal self-critiques that persist in episodic memory and steer subsequent behaviour — an emergent property of sufficiently capable LLMs that is far cheaper than RL fine-tuning.
Mechanism: After trajectory τ_t, the Evaluator M_e computes a reward r_t; the Self-Reflection model M_sr takes (τ_t, r_t) and produces a natural-language lesson sr_t, which is appended to a memory buffer mem (|mem| ≤ Ω). The Actor M_a’s policy π_θ is parameterised by (M_a, mem), so on trial t+1 the in-context memory steers action selection. The loop repeats until the Evaluator passes the trajectory or the maximum number of trials is reached.
Concepts introduced/used: Verbal reinforcement learning, self-reflection, episodic memory for LLM agents, Actor-Evaluator-Self-Reflection decomposition, credit assignment via LLM, LeetcodeHardGym benchmark, in-context policy iteration.
Stance: empirical / methods — framework with ablations across three task families.
Relates to: Stacks on top of ReAct Synergizing Reasoning and Acting in Language Models (Reflexion’s default Actor is ReAct) and is the prototypical Metacognitive Loop for LLM agents. Makes memory-augmented agents a practical design pattern, related to the memory streams of Generative Agents. A core reference in The Rise and Potential of LLM-Based Agents and Agents Framework - Zhou et al, and an instance of the self-improvement / self-reflection family in MAST Taxonomy. Part of the broader family of LLM Agents that learn in-context rather than via gradient descent.
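
The Actor / Evaluator / Self-Reflection loop with a bounded memory buffer can be sketched as follows. This is a toy under stated assumptions: the three callables are hypothetical stubs rather than LLM calls, the reward is binary, and `memory_size` plays the role of the bound Ω.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

def reflexion(actor: Callable[[List[str]], str],
              evaluator: Callable[[str], bool],
              self_reflect: Callable[[str, bool], str],
              max_trials: int = 3,
              memory_size: int = 3) -> Tuple[Optional[str], int, List[str]]:
    """Retry until the evaluator passes; return (trajectory or None, trials used, memory)."""
    memory: Deque[str] = deque(maxlen=memory_size)  # oldest lessons are evicted first
    for trial in range(1, max_trials + 1):
        trajectory = actor(list(memory))      # the policy is conditioned on stored lessons
        reward = evaluator(trajectory)        # binary success signal
        if reward:
            return trajectory, trial, list(memory)
        memory.append(self_reflect(trajectory, reward))  # verbal feedback, no weight update
    return None, max_trials, list(memory)

# Hypothetical toy task: find which location holds an item.
CANDIDATES = ["drawer", "shelf", "fridge"]

def toy_actor(memory: List[str]) -> str:
    """Pick the first location that no stored lesson rules out."""
    for place in CANDIDATES:
        if not any(place in lesson for lesson in memory):
            return place
    return CANDIDATES[-1]

def toy_evaluator(trajectory: str) -> bool:
    return trajectory == "fridge"

def toy_reflect(trajectory: str, reward: bool) -> str:
    return f"The item was not in the {trajectory}; look elsewhere."

result, trials, lessons = reflexion(toy_actor, toy_evaluator, toy_reflect)
print(result, trials, lessons)
```

The only state carried between trials is the list of lessons, which is what distinguishes this loop from gradient-based reinforcement learning.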

Tags

Agents Framework - Zhou et al

Agents: An Open-source Framework for Autonomous Language Agents

Reference: Wangchunshu Zhou et al. (2023). arXiv:2309.07870v3 (AIWaves, Zhejiang University, ETH Zürich). Source file: 2309.07870v3.pdf. URL

Summary

Introduces AGENTS, an open-source framework for building LLM-powered autonomous agents with first-class support for planning, long/short-term memory, tool use and web navigation, multi-agent communication, human-agent interaction, and fine-grained symbolic control via Standard Operating Procedures (SOPs). SOPs are state-graphs with LLM-editable transition rules and per-state prompt/tool configurations, giving users predictable, tunable control over otherwise stochastic agent behaviour.

The framework is declarative (agents instantiated from config JSON), supports dynamic scheduling of which agent speaks next in multi-agent settings, provides a FastAPI deployment target and an Agent Hub for sharing/forking agents, and includes an automated SOP-generation pipeline (meta-agent).

Key Ideas

SOP as a symbolic plan / state graph for controllable agents
Dynamic scheduling: LLM controller picks next actor rather than fixed order
Memory split: long-term (VectorDB + sentence-transformers) vs short-term scratchpad
Config-driven agent construction reduces boilerplate
Meta-agent auto-generates SOPs from task descriptions via RAG
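
The SOP idea listed above, a state graph whose transitions are chosen by a controller, can be sketched generically. This is not the AGENTS library's API: `State`, `SOP`, and `keyword_controller` are hypothetical names, the keyword rule stands in for an LLM-based controller, and falling back to the first allowed transition on an invalid choice is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class State:
    name: str
    prompt: str                                            # per-state prompt
    tools: List[str] = field(default_factory=list)         # per-state tool configuration
    next_states: List[str] = field(default_factory=list)   # allowed transitions

@dataclass
class SOP:
    states: Dict[str, State]
    start: str

    def run(self, controller: Callable[[State, str], str],
            message: str, max_steps: int = 10) -> List[str]:
        """Traverse the graph and return the list of visited state names."""
        current = self.states[self.start]
        path = [current.name]
        for _ in range(max_steps):
            if not current.next_states:      # terminal state reached
                break
            choice = controller(current, message)
            if choice not in current.next_states:
                choice = current.next_states[0]  # reject transitions outside the graph
            current = self.states[choice]
            path.append(current.name)
        return path

def keyword_controller(state: State, message: str) -> str:
    """Stand-in for an LLM controller: route on a keyword in the user message."""
    if state.name == "greet":
        return "refund" if "refund" in message.lower() else "faq"
    return "close"

sop = SOP(
    states={
        "greet": State("greet", "Greet the customer.", next_states=["faq", "refund"]),
        "faq": State("faq", "Answer from the FAQ.", tools=["search"], next_states=["close"]),
        "refund": State("refund", "Process the refund.", tools=["orders_api"], next_states=["close"]),
        "close": State("close", "Say goodbye."),
    },
    start="greet",
)
print(sop.run(keyword_controller, "I want a refund"))
```

Because the controller can only move along edges declared in `next_states`, the graph bounds what an otherwise stochastic controller is able to do.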

Connections

Conceptual Contribution

Claim: Autonomous LLM-powered language agents become reliably controllable and customisable when their behaviour is specified by symbolic plans — Standard Operating Procedures (SOPs) — represented as state graphs that an LLM-based controller traverses, rather than by monolithic prompts alone.
Mechanism: Open-source library AGENTS unifying planning, long/short-term memory (VectorDB + scratchpad), tool use & web navigation, multi-agent communication (with LLM-moderator for dynamic scheduling), human-agent interaction (is_human flag), and controllability via SOPs. Includes an automated “meta-agent” that generates SOPs and configs from a task description via RAG.
Concepts introduced/used: Language Agents, Standard Operating Procedures (SOPs), Symbolic Plans, LLM Agents, Agent Hub, Dynamic Scheduling, Long-short Term Memory, Meta-agent, Retrieval-Augmented Generation, Tool Use, Human-in-the-loop
Stance: engineering / framework
Relates to: A concrete instantiation of the role-specialised multi-agent style advocated in Multi-Agent Collaboration in AI - Wasif Tunkel. Its SOP/controller discipline directly targets the failure modes later catalogued in Why Do Multi-Agent LLM Systems Fail. Its inter-agent messaging is more prescriptive than the negotiated protocols of A Scalable Communication Protocol for Networks of LLMs, sitting between classic ACLs (KQML as an Agent Communication Language, FIPA-ACL) and fully emergent LLM communication.

Tags

LLM Agents

Large-language-model-powered agents: natural-language coordination, tool use, multi-agent orchestration.

Surveys & frameworks

Protocols & communication

Failures & threats

Lineage

Toolformer

Toolformer: Language Models Can Teach Themselves to Use Tools

Reference: Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (2023). NeurIPS 2023. Source file: 2302.04761.pdf. URL

Summary

Toolformer shows that a language model can teach itself, in a self-supervised way, to decide when to call an external API, which API to call, what arguments to pass, and how to incorporate the returned result into subsequent token prediction. Starting from a handful of human-written demonstrations per tool, the model annotates a large corpus with candidate API calls, executes them, and keeps only the calls whose results reduce the cross-entropy loss on the surrounding tokens. The filtered, API-augmented corpus is then used to finetune the model.

The approach is demonstrated on a calculator, Q&A system, Wikipedia search, machine translation, and calendar. A 6.7B Toolformer (GPT-J-based) outperforms a much larger GPT-3 on several zero-shot downstream tasks without sacrificing core language modelling. The paper is a foundational reference for Tool Use in LLM Agents and therefore for the tool-layer attack surface catalogued in AI Agents Under Threat and MalTool Malicious Tool Attacks.

Key Ideas

Self-supervised tool learning via loss-reduction filtering — no task-specific supervision.
API calls are represented as interleaved special tokens (<API> name(args) -> result </API>) directly inside the token stream.
A single model learns heterogeneous tools rather than one tool per specialist.
Tools compensate for LLM weaknesses (arithmetic, factual recall, freshness, low-resource translation).
Establishes the architectural template — model emits tool-call tokens, external executor returns results, tokens resume — that later MCP/A2A-style protocols generalise.

Connections

Conceptual Contribution

Claim: Language models can learn to use external tools in a self-supervised fashion by keeping only API calls whose responses reduce next-token loss, bootstrapping tool competence from a handful of demonstrations.
Mechanism: Sample candidate API-call positions and arguments via in-context prompting; execute calls; filter by weighted cross-entropy reduction (L_i^- − L_i^+ ≥ τ_f); finetune on the filtered, API-interleaved corpus.
Concepts introduced/used: self-supervised tool learning, API-call tokens, loss-based filtering, Tool Use, LLM Agents — the direct antecedent of Model Context Protocol style tool-calling interfaces.
Stance: constructive
Relates to: Supplies the tool-invocation substrate whose abuses are studied in MalTool Malicious Tool Attacks, Skill Supply Chain Attack, and the action-layer threats in AI Agents Under Threat.
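
The filtering criterion L_i^- − L_i^+ ≥ τ_f from the Mechanism line can be sketched on precomputed token log-probabilities. Assumptions of this sketch: the log-probabilities would come from scoring the tokens after position i with the language model under three prefixes (call with result, call without result, no call), L_i^- is taken as the smaller of the two baseline losses, and the weights and threshold in the usage lines are illustrative values.

```python
from typing import Sequence

def weighted_loss(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens that follow position i."""
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))

def keep_api_call(lp_with_result: Sequence[float],
                  lp_call_only: Sequence[float],
                  lp_no_call: Sequence[float],
                  weights: Sequence[float],
                  tau_f: float) -> bool:
    """Keep the call only if seeing its result lowers the loss by at least tau_f."""
    loss_plus = weighted_loss(lp_with_result, weights)         # L_i^+: call and result in prefix
    loss_minus = min(weighted_loss(lp_no_call, weights),       # L_i^-: best baseline without
                     weighted_loss(lp_call_only, weights))     # the result
    return loss_minus - loss_plus >= tau_f

# Illustrative numbers: the API result makes the following three tokens much more likely.
weights = [0.5, 0.3, 0.2]
helpful = keep_api_call([-0.1, -0.2, -0.3], [-0.9, -0.9, -0.9], [-1.0, -1.0, -1.0],
                        weights, tau_f=0.5)
print(helpful)
```

Calls that pass the filter are the ones spliced into the corpus used for finetuning; calls that fail are discarded.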

Tags

Chain-of-Thought Prompting

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Reference: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (2022). NeurIPS 2022. Source file: 2201.11903.pdf. URL

Summary

Shows that prompting a sufficiently large language model with a few exemplars that include intermediate natural-language reasoning steps — a chain of thought — dramatically improves performance on arithmetic, commonsense, and symbolic reasoning benchmarks. The technique requires no finetuning, no verifiers, and no task-specific training data beyond a handful of (input, chain-of-thought, output) triples.

Prompting PaLM-540B with eight chain-of-thought exemplars achieves then-state-of-the-art accuracy on GSM8K math word problems, surpassing even a finetuned GPT-3 with a verifier. Gains emerge only at scale (~100B parameters), making CoT one of the canonical examples of emergent capability. CoT underpins most modern agent reasoning loops (ReAct, ToT, Reflexion, LLM-Planner) referenced throughout AI Agents Under Threat and LLM Agents.

Key Ideas

A prompt of input/rationale/output triples unlocks multi-step reasoning without any parameter updates.
CoT is emergent: negligible benefit below ~100B parameters, striking gains above.
Works across arithmetic, commonsense, and symbolic domains — the method is task-general.
Rationale generation enables interpretability windows and lays the groundwork for reasoning-trace auditing in agent security.
The externalised reasoning trace is itself an attack surface — later exploited by indirect Prompt Injection and adversarial rationale manipulation.

Connections

Conceptual Contribution

Claim: Providing a small number of exemplars with explicit intermediate reasoning steps in the prompt elicits multi-step reasoning in sufficiently large language models, yielding state-of-the-art performance without finetuning.
Mechanism: Few-shot prompt template ⟨input, chain-of-thought rationale, output⟩; conditioned on these exemplars and with no parameter updates, the model generates its own rationale before the final answer at inference time.
Concepts introduced/used: chain-of-thought prompting, emergent reasoning, prompt-only adaptation — the substrate for ReAct-style agent loops that structure LLM Agents and the reasoning traces attacked in AI Agents Under Threat.
Stance: empirical
Relates to: The reasoning-trace paradigm CoT popularises is the target of Prompt Injection, goal-hijacking jailbreaks, and the brain-layer threats catalogued in AI Agents Under Threat.
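
The ⟨input, chain-of-thought rationale, output⟩ template can be sketched as a prompt builder. The `Q:` / `A:` layout, the "The answer is" suffix, and the function name are assumptions of this sketch, not a confirmed reproduction of the paper's prompts.

```python
from typing import List, Tuple

Exemplar = Tuple[str, str, str]  # (input, chain-of-thought rationale, output)

def build_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Concatenate worked exemplars, then leave the final answer slot open for the model."""
    blocks = [
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")  # the model continues with its own rationale
    return "\n\n".join(blocks)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
    "11",
)]
prompt = build_cot_prompt(
    exemplars,
    "A baker has 4 trays with 6 rolls each and sells 5 rolls. How many rolls are left?")
print(prompt)
```

Dropping the rationale from each exemplar turns the same builder into a standard few-shot prompt, which is the baseline CoT is compared against.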

Tags

Tool Use

LLM-agent capability of invoking external tools (APIs, code execution, database queries). Standardised through Model Context Protocol.