Reflexion: Language Agents with Verbal Reinforcement Learning

Reference: Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). NeurIPS 2023 (Northeastern / MIT / Princeton). arXiv:2303.11366. Source file: reflexion-2303.11366.pdf.

Summary

Reflexion reinforces language agents not by updating weights but by verbal feedback: after each episode the agent converts scalar or binary success signals into a natural-language self-critique and stores that critique in an episodic memory buffer that conditions the next attempt. This acts as a semantic gradient, letting the agent learn from trial-and-error in a handful of episodes without fine-tuning. The framework factors an agent into three LLM-instantiated modules: an Actor (generates text and actions, typically a ReAct agent or CoT reasoner), an Evaluator (scores the trajectory), and a Self-Reflection model (produces the verbal feedback written to memory).

On ALFWorld (decision-making) Reflexion completes 130/134 tasks vs ~100 for ReAct alone (an absolute improvement of 22%); on HotpotQA reasoning it adds ~20%; on the HumanEval coding benchmark it reaches 91% pass@1, surpassing GPT-4’s 80%. The paper also releases LeetcodeHardGym, a benchmark environment of hard-level LeetCode problems. Reflexion is the canonical reference for self-reflective metacognitive loops in LLM agents.

Key Ideas

Verbal reinforcement: convert sparse scalar/binary rewards into natural-language lessons
Three-module architecture: Actor / Evaluator / Self-Reflection, all LLMs
Short-term memory = trajectory history; long-term memory = accumulated self-reflections (bounded, e.g. last 3)
No gradient updates: policy changes via in-context memory, not weight updates
Self-reflection helps identify both mistaken actions (hallucination) and mistaken plans (inefficient planning)
Works on top of ReAct, CoT, or other actor policies; orthogonal to prompting style
State-of-the-art results on HumanEval, ALFWorld, HotpotQA with only a handful of trials

Connections

Conceptual Contribution

Claim: Language agents can be reinforced effectively without weight updates by converting sparse environmental feedback into verbal self-critiques that persist in episodic memory and steer subsequent behaviour — an emergent property of sufficiently capable LLMs that is far cheaper than RL fine-tuning.
Mechanism: After trajectory τ_t, the Evaluator M_e computes a reward r_t; the Self-Reflection model M_sr takes (τ_t, r_t) and produces a natural-language lesson sr_t, which is appended to a memory buffer mem (|mem| ≤ Ω). The Actor M_a’s policy π_θ is parameterised by (M_a, mem), so on trial t+1 the in-context memory steers action selection. The loop repeats until the Evaluator passes the trajectory or the maximum number of trials is reached.
Concepts introduced/used: Verbal reinforcement learning, self-reflection, episodic memory for LLM agents, Actor-Evaluator-Self-Reflection decomposition, credit assignment via LLM, LeetcodeHardGym benchmark, in-context policy iteration.
Stance: empirical / methods — framework with ablations across three task families.
Relates to: Stacks on top of ReAct Synergizing Reasoning and Acting in Language Models (Reflexion’s default Actor is ReAct) and is the prototypical Metacognitive Loop for LLM agents. Makes memory-augmented agents a practical design pattern, related to the memory streams of Generative Agents. A core reference in The Rise and Potential of LLM-Based Agents and Agents Framework - Zhou et al, and an instance of the self-improvement / self-reflection family in MAST Taxonomy. Part of the broader family of LLM Agents that learn in-context rather than via gradient descent.
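The Mechanism entry above can be sketched as a short trial loop. This is a minimal illustration, not the authors' implementation: `reflexion_loop`, the callable signatures, and the toy stand-ins are all hypothetical names, the Evaluator is simplified to a binary pass/fail signal, and the bound Ω is modelled with a fixed-size `deque`.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def reflexion_loop(
    actor: Callable[[str, List[str]], str],   # (task, memory) -> trajectory
    evaluator: Callable[[str], bool],         # trajectory -> pass/fail (assumed binary)
    reflect: Callable[[str, bool], str],      # (trajectory, reward) -> verbal lesson
    task: str,
    max_trials: int = 5,
    memory_size: int = 3,                     # plays the role of the bound on |mem|
) -> Tuple[str, int, List[str]]:
    """Run trials until the evaluator passes or max_trials is reached."""
    memory: Deque[str] = deque(maxlen=memory_size)  # oldest lessons are dropped
    trajectory = ""
    for trial in range(1, max_trials + 1):
        # The actor is conditioned on the accumulated reflections, not new weights.
        trajectory = actor(task, list(memory))
        passed = evaluator(trajectory)
        if passed:
            return trajectory, trial, list(memory)
        # Turn the sparse failure signal into a natural-language lesson.
        memory.append(reflect(trajectory, passed))
    return trajectory, max_trials, list(memory)


# Toy stand-ins for the three LLM modules (purely illustrative).
def toy_actor(task: str, memory: List[str]) -> str:
    return "open drawer" if memory else "open fridge"


def toy_evaluator(trajectory: str) -> bool:
    return trajectory == "open drawer"


def toy_reflect(trajectory: str, passed: bool) -> str:
    return f"'{trajectory}' did not work; try the drawer next time"


trajectory, trials, memory = reflexion_loop(
    toy_actor, toy_evaluator, toy_reflect, task="find the knife"
)
print(trials, trajectory, memory)
```

In the toy run the first trial fails, one lesson is written to memory, and the second trial succeeds because the actor reads that lesson. In a real system each callable would wrap an LLM prompt.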

Tags

#llm-agents #self-reflection #verbal-rl #memory #metacognition

Backlinks

Linked Pages

LLM Agents

Large-language-model-powered agents: natural-language coordination, tool use, multi-agent orchestration.

Surveys & frameworks

Protocols & communication

Failures & threats

Lineage

MAST Taxonomy

Multi-Agent System Taxonomy (MAST)

Empirically grounded 14-failure-mode taxonomy for LLM multi-agent systems, spanning specification, coordination, and verification.

In this vault

Why Do Multi-Agent LLM Systems Fail

Agents Framework - Zhou et al

Agents: An Open-source Framework for Autonomous Language Agents

Reference: Wangchunshu Zhou et al. (2023). arXiv:2309.07870v3 (AIWaves, Zhejiang University, ETH Zürich). Source file: 2309.07870v3.pdf.

Summary

Introduces AGENTS, an open-source framework for building LLM-powered autonomous agents with first-class support for planning, long/short-term memory, tool use and web navigation, multi-agent communication, human-agent interaction, and fine-grained symbolic control via Standard Operating Procedures (SOPs). SOPs are state-graphs with LLM-editable transition rules and per-state prompt/tool configurations, giving users predictable, tunable control over otherwise stochastic agent behaviour.

The framework is declarative (agents instantiated from config JSON), supports dynamic scheduling of which agent speaks next in multi-agent settings, provides a FastAPI deployment target and an Agent Hub for sharing/forking agents, and includes an automated SOP-generation pipeline (meta-agent).

Key Ideas

SOP as a symbolic plan / state graph for controllable agents
Dynamic scheduling: LLM controller picks next actor rather than fixed order
Memory split: long-term (VectorDB + sentence-transformers) vs short-term scratchpad
Config-driven agent construction reduces boilerplate
Meta-agent auto-generates SOPs from task descriptions via RAG

Connections

Conceptual Contribution

Claim: Autonomous LLM-powered language agents become reliably controllable and customisable when their behaviour is specified by symbolic plans — Standard Operating Procedures (SOPs) — represented as state graphs that an LLM-based controller traverses, rather than by monolithic prompts alone.
Mechanism: Open-source library AGENTS unifying planning, long/short-term memory (VectorDB + scratchpad), tool use & web navigation, multi-agent communication (with LLM-moderator for dynamic scheduling), human-agent interaction (is_human flag), and controllability via SOPs. Includes an automated “meta-agent” that generates SOPs and configs from a task description via RAG.
Concepts introduced/used: Language Agents, Standard Operating Procedures (SOPs), Symbolic Plans, LLM Agents, Agent Hub, Dynamic Scheduling, Long-short Term Memory, Meta-agent, Retrieval-Augmented Generation, Tool Use, Human-in-the-loop
Stance: engineering / framework
Relates to: A concrete instantiation of the role-specialised multi-agent style advocated in Multi-Agent Collaboration in AI - Wasif Tunkel. Its SOP/controller discipline directly targets the failure modes later catalogued in Why Do Multi-Agent LLM Systems Fail. Its inter-agent messaging is more prescriptive than the negotiated protocols of A Scalable Communication Protocol for Networks of LLMs, sitting between classic ACLs (KQML as an Agent Communication Language, FIPA-ACL) and fully emergent LLM communication.
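The idea of an SOP as a state graph traversed by a controller can be illustrated with a small sketch. None of the names below (`State`, `SOP`, `keyword_controller`) come from the AGENTS library; they are hypothetical, and a simple keyword rule stands in for the LLM-based controller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    name: str
    prompt: str                                  # per-state prompt configuration
    next_states: List[str] = field(default_factory=list)


class SOP:
    """A state graph whose transitions are chosen by a pluggable controller."""

    def __init__(self, states: List[State], start: str,
                 controller: Callable[[State, str], str]) -> None:
        self.states: Dict[str, State] = {s.name: s for s in states}
        self.current = start
        self.controller = controller

    def step(self, observation: str) -> str:
        state = self.states[self.current]
        if not state.next_states:                # terminal state: nothing to do
            return self.current
        choice = self.controller(state, observation)
        # The controller may stay put or follow an edge, but never jump elsewhere.
        if choice != state.name and choice not in state.next_states:
            raise ValueError(f"illegal transition {state.name} -> {choice}")
        self.current = choice
        return choice


def keyword_controller(state: State, observation: str) -> str:
    # Stand-in for an LLM call: advance only when the turn signals completion.
    return state.next_states[0] if "done" in observation else state.name


states = [
    State("greet", "Greet the customer.", ["collect"]),
    State("collect", "Collect the order details.", ["resolve"]),
    State("resolve", "Summarise and close the ticket."),
]
sop = SOP(states, start="greet", controller=keyword_controller)
path = [sop.step(obs) for obs in ["hello", "done", "done", "done"]]
print(path)
```

Restricting the controller to declared edges is what makes the behaviour predictable: the model chooses among permitted transitions rather than free-forming the whole workflow.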

Tags

The Rise and Potential of LLM-Based Agents

The Rise and Potential of Large Language Model Based Agents: A Survey

Reference: Xi, Chen, Guo, He, Ding, Hong, Zhang, Wang, Jin, Zhou, Zheng, Fan, Wang, Xiong, Zhou, Wang, Jiang, Zou, Liu, Yin, Dou, Weng, Cheng, Zhang, Qin, Zheng, Qiu, Huang, Gui (2023). Fudan NLP Group, arXiv preprint. Source file: 2309.07864.pdf.

Summary

A comprehensive survey of LLM-based agents organised around a three-component conceptual framework — brain, perception, action — that the authors propose as a general template for agent construction. The brain covers natural-language interaction, knowledge, memory, reasoning/planning, and transferability; perception covers textual, visual, auditory, and other inputs; action covers textual output, tool use, and embodied action.

The survey then examines agents in practice (single-agent task/innovation/lifecycle deployments; multi-agent cooperative and adversarial interaction; human-agent instructor-executor and equal-partnership paradigms) and agent societies (personality, social behaviour, environments, society simulation, ethical/social risks). A final discussion chapter covers evaluation, adversarial robustness, trustworthiness, scaling the number of agents, and open problems — directly feeding the threat taxonomy of AI Agents Under Threat.

Key Ideas

Brain/perception/action triad as a unifying architecture for LLM agents.
Single-agent vs multi-agent vs human-agent deployment axes.
Cooperative complementarity and adversarial advancement as the two poles of multi-agent interaction.
Agent society simulation (à la Generative Agents) as both a scientific instrument and a risk surface.
Dedicated treatment of adversarial robustness and trustworthiness as first-class concerns in agent design.

Connections

Conceptual Contribution

Claim: LLM-based agents should be understood through a unified brain/perception/action framework, with deployments spanning single-agent, multi-agent, and human-agent configurations, and societies displaying emergent social phenomena that demand first-class security and trustworthiness analysis.
Mechanism: Literature synthesis organised around the three-component architecture, three deployment paradigms, and an agent-society lens; taxonomy of cooperative vs adversarial multi-agent interaction; catalogue of open problems in robustness, trustworthiness, and scaling.
Concepts introduced/used: brain/perception/action triad, instructor-executor vs equal-partnership paradigms, agent society, adversarial robustness, Tool Use, Multi-Agent Systems, LLM Agents.
Stance: survey
Relates to: The brain/perception/action decomposition directly informs the perception/brain/action threat axes used by AI Agents Under Threat; multi-agent cooperation analysis motivates ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems and the inter-agent risks catalogued in SoK The Attack Surface of Agentic AI.

Tags

Generative Agents

Generative Agents: Interactive Simulacra of Human Behavior

Reference: Park, O’Brien, Cai, Morris, Liang, Bernstein (2023). UIST ’23. Source file: 2304.03442.pdf.

Summary

Introduces generative agents: LLM-powered simulacra that populate a Sims-like sandbox with 25 characters who wake, plan their day, converse, form opinions, remember past events, reflect, and coordinate group activities (e.g., autonomously spreading invitations to a Valentine’s Day party). The agent architecture extends an LLM with three components: a memory stream (natural-language log of experiences with recency/importance/relevance retrieval), reflection (higher-level inferences synthesised from memories), and planning (recursive decomposition of daily goals into action sequences), with reflections and plans fed back into the memory stream.

The paper is a widely-cited foundational reference for agent memory and social simulation, and is invoked throughout AI Agents Under Threat as the canonical multi-agent LLM society whose emergent behaviours define the attack surface for Memory Poisoning and inter-agent risks.

Key Ideas

Memory stream as a long-term, natural-language, retrieval-indexed experience log.
Retrieval scoring combines recency, importance, and semantic relevance.
Reflection: periodic self-prompted synthesis of memories into higher-level beliefs, propositions, and abstractions.
Planning: top-down decomposition of daily goals into hierarchical schedules that feed back into memory.
Believable individual and emergent group behaviour (information diffusion, relationship formation, coordination) arising without scripted dialogue trees.

Connections

Conceptual Contribution

Claim: Believable long-horizon human-like behaviour in LLM agents can be produced by augmenting the model with an architectural trio — memory stream, reflection, planning — that together let the agent retrieve, generalise, and act over long time horizons.
Mechanism: Natural-language memory stream with recency+importance+relevance retrieval; periodic reflection that distils memories into higher-level beliefs; recursive top-down planning that writes plans back into memory; all components implemented as LLM prompts over ChatGPT.
Concepts introduced/used: memory stream, reflection, recursive planning, believable simulacra, emergent social dynamics — the substrate for Memory Poisoning and inter-agent cascade attacks in AI Agents Under Threat and ClawWorm Self-Propagating Attacks Across LLM Agent Ecosystems.
Stance: constructive/empirical
Relates to: Establishes the memory+reflection+planning template whose failure modes are analysed in AI Agents Under Threat (brain/memory threats) and whose emergent multi-agent phenomena motivate Multi-Agent Systems security work.
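The recency + importance + relevance retrieval described above can be sketched as a scoring function over memory records. This is an illustrative simplification, not the paper's code: the `Memory` fields, the equal weighting, the decay value, and dividing importance by 10 are assumptions made for the example, and toy 2-d vectors stand in for real embeddings.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    importance: float            # assumed 1..10 rating produced elsewhere
    hours_since_access: float    # time since the memory was last retrieved
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(memory: Memory, query: Sequence[float], decay: float = 0.995) -> float:
    recency = decay ** memory.hours_since_access   # exponential decay (assumed rate)
    importance = memory.importance / 10.0          # rescaled to roughly [0, 1]
    relevance = cosine(memory.embedding, query)    # semantic similarity to the query
    return recency + importance + relevance        # equal weights (assumption)


def retrieve(memories: List[Memory], query: Sequence[float], k: int = 2) -> List[Memory]:
    """Return the k highest-scoring memories for the query embedding."""
    return sorted(memories, key=lambda m: score(m, query), reverse=True)[:k]


memories = [
    Memory("A: chatted about the party", 2, 1, [1.0, 0.0]),
    Memory("B: moved to a new house", 9, 100, [0.0, 1.0]),
    Memory("C: was invited to the party", 5, 0, [1.0, 0.0]),
]
top = retrieve(memories, query=[1.0, 0.0], k=2)
print([m.text for m in top])
```

Here the highly important but stale and off-topic memory B loses to the two recent, on-topic memories, which is the trade-off the three-term score is meant to capture.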

Tags

Metacognitive Loop

Self-monitoring feedback loop in which an agent observes its own behaviour against expectations and revises.

In this vault

ReAct Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

Reference: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (2023). ICLR 2023 (Princeton / Google Brain). arXiv:2210.03629. Source file: react-2210.03629.pdf.

Summary

ReAct is a prompting paradigm that interleaves verbal reasoning traces (thoughts) with task-specific actions in a single LLM generation, letting the model plan, update its plan with observations from the environment, and query external sources. Unlike chain-of-thought (reasoning only) or action-only agents (e.g. WebGPT-style), ReAct alternates Thought -> Action -> Observation steps, so the model grounds its reasoning in retrieved evidence and uses its reasoning to decide what to act on next.

Evaluated on HotpotQA and FEVER (Wikipedia API with search/lookup/finish actions), ALFWorld (text household tasks), and WebShop. On knowledge tasks ReAct outperforms action-only prompting and, combined with chain-of-thought self-consistency, beats CoT alone by reducing hallucination. On the decision-making tasks ALFWorld and WebShop, ReAct beats imitation-learning and RL baselines by absolute success-rate margins of 34% and 10% respectively, using only one or two in-context examples. The paper is the seminal reference for tool-using LLM agents and the thought/act/observe loop now ubiquitous in LLM agent frameworks.

Key Ideas

Interleave reasoning (thoughts) and acting (tool/API calls) in one LLM trajectory
Augment the action space A with a free-form language space L of “thoughts” that do not affect the environment but update the context
External Wikipedia API (search, lookup, finish) as a minimal tool interface for QA
Back-off combination with CoT self-consistency: use ReAct when it finishes, otherwise fall back to CoT-SC, or vice versa
Human-interpretable, inspectable, and editable agent trajectories (“thought editing”)
Few-shot prompts with 1-6 exemplars suffice; fine-tuning on 3k trajectories further boosts small models

Connections

Conceptual Contribution

Claim: Reasoning and acting are synergistic rather than separate capabilities; interleaving them in a single LLM generation produces agents that are more grounded, less hallucination-prone, and more robust than pure reasoning (CoT) or pure acting (WebGPT-style) baselines.
Mechanism: Extend the agent’s action space to A’ = A ∪ L where L is the free-form language of thoughts. Prompt the LLM with few-shot thought/action/observation exemplars. The model decides when to think and when to act; observations from the environment (API results, game state) re-enter the context and condition subsequent thoughts.
Concepts introduced/used: Thought-Act-Observation loop, tool-use prompting, interleaved reasoning/acting, grounded reasoning, hallucination reduction via external knowledge, few-shot agent prompting, thought editing, ReAct + CoT-SC back-off.
Stance: empirical / methods — prompting-level intervention with extensive benchmarks.
Relates to: Direct successor to Chain-of-Thought Prompting: CoT without actions is shown to hallucinate more (in the paper's HotpotQA error analysis, 14% of CoT's correct answers rest on hallucinated reasoning or facts, vs 6% for ReAct). Makes tool use a first-class LLM behaviour, complementary to Toolformer’s self-taught API-call fine-tuning. Foundational for LLM Agents and a key building block in Agents Framework - Zhou et al. The thought/act/observe skeleton is reused by nearly all subsequent agent papers, including Reflexion Language Agents with Verbal Reinforcement Learning (which uses ReAct as its default Actor).
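The Thought -> Action -> Observation loop from the Mechanism entry can be sketched as follows. This is a minimal sketch under stated assumptions: `react_loop`, the `(thought, action, argument)` return shape, and the scripted `toy_llm` are hypothetical, and a one-entry lookup replaces the Wikipedia API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action name, action argument)


def react_loop(
    llm: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Alternate thoughts and actions until the model emits finish[answer]."""
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = llm(context)
        # Thoughts only update the context; they never touch the environment.
        context.append(f"Thought: {thought}")
        context.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, context
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown action '{action}'"
        # Observations re-enter the context and condition the next thought.
        context.append(f"Observation: {observation}")
    return None, context  # step budget exhausted without an answer


def toy_llm(context: List[str]) -> Step:
    # Scripted stand-in for a few-shot prompted LLM.
    if not any(line.startswith("Observation:") for line in context):
        return ("I should look up where Reflexion appeared.", "search", "Reflexion")
    return ("The observation names the venue.", "finish", "NeurIPS 2023")


tools = {"search": lambda query: f"{query} appeared at NeurIPS 2023."}
answer, trace = react_loop(toy_llm, tools, "Where was Reflexion published?")
print(answer)
print("\n".join(trace))
```

The returned trace is the inspectable trajectory: every thought, action, and observation is plain text that a human could read or edit before the next step.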