AI Agent Loop Patterns

Figure: AI Agent Loop Pattern (user input → plan → think/act/observe loop → branch → final result). One pattern: plan high-level steps, then per step run think → act → observe until the task is done.

Agents (Russell & Norvig)

In the textbook sense (Russell & Norvig), an agent is anything that perceives its environment through sensors and acts through actuators to pursue a task—often described as mapping percepts to actions over time, with goals, performance measures, and constraints (PEAS-style problem framing). Their taxonomy lists five designs: simple reflex, model-based reflex, goal-based, utility-based, and learning agents—adding internal state, explicit goals, preference over outcomes, and adaptation from data as you move along the scale. Modern LLM agents are one engineering instantiation: the model reads state (prompt, tool output, memory), proposes the next action, and the runtime executes tools or updates—still the same perceive → decide → act cycle, with stochastic policies and richer natural-language interfaces than classical planners.

AI Agent Loop Patterns

An agent is an autonomous system that uses an LLM with tool calling to interact with external systems (databases, APIs, file systems) to perform actions, not just generate text.

This article is about AI Agent Loop Patterns: the repeating cycles of reasoning, tool use, and observation (the Re-Act family and close relatives). The LLM is the reasoning engine — it decides what to do, which tools to call, and how to read tool output. Unlike a chatbot that only emits text, these loops let an agent search, query, run code, or call APIs until a task completes.

Below are twelve common Re-Act-style loop patterns and extensions — from plain think–act–observe to dialogue, parallel tools, sequential iterations, reflection, memory, planning, chain-of-thought, tree-of-thought, sandbox execution, and learning. Pros and cons sit under each card; Python sketches and model tables are collapsible.

Re-Act

Reasoning + acting: the model produces reasoning steps (what to do next and why) and actions (tool calls), then observes (the value returned from the tool) and continues until the task is done. Each Observe step is the only place the model sees real tool output, so the loop is grounded in facts rather than pure speculation. The run ends when the model stops requesting tools and returns a final answer (or hands off explicitly).

LOOP: THINK / REASON → ACT → OBSERVE → … → RESULT

Pros: Clear audit trail (think → act → observe). Tool results constrain hallucination on factual questions. Works with most chat models that support function calling. Easy to cap steps or tools for safety.

Cons: Latency and cost scale with loop length. Poor tool schemas or ambiguous prompts cause thrashing (repeated useless calls). The model may still misinterpret correct tool output if the task is underspecified.
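A minimal Python sketch of this loop, assuming the OpenAI Python SDK (v1 chat completions with tool calling) and a made-up lookup_weather tool; any function-calling chat model follows the same shape.

```python
import json
from openai import OpenAI

client = OpenAI()

def lookup_weather(city: str) -> str:
    return f"Sunny, 22 C in {city}"  # stand-in for a real weather API

TOOLS = [{"type": "function", "function": {
    "name": "lookup_weather",
    "description": "Current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]

def react(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                         # cap steps for safety
        msg = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS,
        ).choices[0].message                           # THINK: model decides
        if not msg.tool_calls:                         # no tool requested: done
            return msg.content
        messages.append(msg)                           # keep the tool request
        for call in msg.tool_calls:                    # ACT
            args = json.loads(call.function.arguments)
            messages.append({"role": "tool",           # OBSERVE: real output
                             "tool_call_id": call.id,
                             "content": lookup_weather(**args)})
    return "Step budget exhausted."
```

The step cap and the tool registry are the two safety knobs called out above: the loop can only call what you expose, and it cannot run forever.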

#GPT-4o #Claude 3.5 #Tool use

Conversational Re-Act (ReSpAct)

Reason + Speak + Act: the agent talks to the user, asks for clarification, reports back, then acts, keeping the user in the loop.

LOOP: INPUT → THINK → (OPTIONALLY) SPEAK/ASK → THINK → ACT → OBSERVE → … → FINAL RESULT

Pros: Fewer wrong-tool calls from ambiguous input. Better trust and transparency. Catches missing parameters early instead of failing inside a silent loop.

Cons: More round-trips and longer wall-clock time. Harder to automate in batch pipelines. Dialogue policy must avoid annoying or redundant questions.

When you might use two models

  • Router + worker: One small model routes (e.g. "needs clarification" vs "ready to act"), another does the main Re-Act loop.
  • Specialized roles: One model for dialogue/clarification, another for heavy reasoning or tool use.

For most setups, a single general-purpose chat model (e.g. GPT-4o or Claude 3.5 Sonnet) is enough.
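A minimal sketch of the speak-then-act gate, with a scripted policy standing in for the LLM and a made-up export_report tool; a real agent would let the model choose between SPEAK and ACT each turn (this demo reads the clarification from stdin).

```python
def export_report(fmt: str) -> str:              # made-up example tool
    return f"report.{fmt} written"

def policy(history: list[str]) -> dict:
    # Scripted stand-in: ask once, then act. A real LLM decides per turn.
    if len(history) == 1:                        # ambiguous first input
        return {"speak": "Which format: csv or json?"}
    return {"act": history[-1].strip().lower()}  # clarified parameter

def respact(task: str) -> str:
    history = [task]
    while True:
        step = policy(history)                   # THINK
        if "speak" in step:                      # SPEAK/ASK: user in the loop
            history.append(input(step["speak"] + " "))
            continue
        observation = export_report(step["act"]) # ACT -> OBSERVE
        return f"Done: {observation}"            # FINAL RESULT

print(respact("Export the report"))
```

Catching the missing format parameter here is exactly the "fail early, not inside a silent loop" benefit listed above.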

#GPT-4o #Claude 3.5 #Dialogue

Re-Act Description

An agent that follows the Re-Act pattern with an optional Describe step: it narrates planned or completed actions around tool use. Use it for transparency, UX, or audit logs; skip it when you want less latency or noise—the core loop stays Think → Act → Observe.

LOOP: INPUT → THINK → (OPTIONALLY) DESCRIBE (WHAT IT WILL DO OR HAS DONE) → ACT → OBSERVE → … → FINAL RESULT

Pros: Easier debugging and compliance-friendly traces. Can improve UX when tool payloads are noisy or binary.

Cons: Extra tokens and latency. Descriptions can drift from what tools actually did if not grounded in the same Observe payload. Not ideal for lowest-latency automation.
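A short sketch of the Describe step, grounded in the same payload that Act/Observe use so narration cannot drift from what actually ran; run_tool and query_orders are made-up examples.

```python
def run_tool(name: str, args: dict) -> str:       # stand-in tool runner
    return f"{name} returned 3 rows"

def act_with_description(name: str, args: dict, narrate: bool = True) -> str:
    if narrate:                                   # DESCRIBE: what it will do
        print(f"About to call {name} with {args}")
    observation = run_tool(name, args)            # ACT -> OBSERVE
    if narrate:                                   # DESCRIBE: what it has done,
        print(f"{name} finished: {observation}")  # grounded in the Observe payload
    return observation

act_with_description("query_orders", {"status": "open"})
```

The narrate flag is the latency/noise trade-off from the card: switch it off and the core Think → Act → Observe loop is unchanged.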

#GPT-4o #Claude 3.5 #Describe

Multi-Action Re-Act

In one step the agent can output several tool calls (e.g. search + calculator), then observe all results. Same loop, but Act can be multiple actions per step. Modern chat models often expose parallel or batched tool use so independent lookups can run together and merge in Observe. When steps must be strictly ordered, keep a single Act or split across turns—parallel multi-action assumes no hidden dependency between those tools.

LOOP: INPUT → THINK → ACT (ONE OR MORE TOOL CALLS, E.G. SEARCH + CALCULATOR) → OBSERVE (ALL RESULTS) → THINK → … → FINAL ANSWER

Pros: Lower latency than serial one-tool-per-turn when dependencies allow. Fewer overall LLM calls for the same ground truth.

Cons: Risky when calls are not independent — ordering bugs or duplicate side effects. Larger Observe payloads can overflow context. Not all hosts expose true parallel execution.
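A sketch of one Act step fanning out several independent tool calls and merging them into a single Observe payload; the tools are made-up, and the no-hidden-dependency assumption from the card is what makes the thread pool safe.

```python
from concurrent.futures import ThreadPoolExecutor

def web_search(query: str) -> str:
    return f"top hit for {query!r}"               # stand-in search tool

def calculator(expr: str) -> str:
    return str(eval(expr))                        # demo only: never eval untrusted input

TOOLS = {"web_search": web_search, "calculator": calculator}

def act_many(calls: list[tuple[str, str]]) -> list[str]:
    # Assumes the calls are independent; ordering must not matter.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[name], arg) for name, arg in calls]
        return [f.result() for f in futures]      # merged OBSERVE payload

print(act_many([("web_search", "GDP of France"), ("calculator", "2.9e12 * 1.02")]))
```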

#GPT-4o #Claude 3.5 #Parallel tools

Iterative Re-Act

Multiple sequential Re-Act rounds: after each Observe, the model may Think and Act again with full history—useful when the next tool choice depends on the last result, you want tighter control than packing many tools into one turn, or you need a clear iteration budget for cost and safety. Same atomic loop as Re-Act; emphasis is on repeated cycles until done or a max-step cap, not parallel multi-action in a single step.

LOOP: INPUT → THINK → ACT → OBSERVE → THINK → ACT → OBSERVE → … → FINAL ANSWER

Pros: Each new round sees fresh tool output before the next choice—good for dependent steps and auditable turn-by-turn traces. Easy to cap max_rounds for cost and safety.

Cons: More turns than multi-action when tools could have run in parallel. Latency and token use grow with loop depth; weak stop conditions or vague tools cause thrashing.
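A sketch of capped sequential rounds where each Think sees the previous Observe; decide is a scripted stand-in for the LLM and fetch_page for a tool whose next call depends on the last result.

```python
def fetch_page(url: str) -> str:                  # made-up dependent tool chain
    return {"/index": "see /pricing", "/pricing": "$99/mo"}.get(url, "404")

def decide(history: list[str]) -> dict:
    last = history[-1]
    if "see " in last:                            # next call depends on last Observe
        return {"act": last.split("see ")[1]}
    if "$" in last:
        return {"final": f"Price found: {last}"}
    return {"act": "/index"}

def iterative_react(task: str, max_rounds: int = 5) -> str:
    history = [task]
    for _ in range(max_rounds):                   # explicit iteration budget
        step = decide(history)                    # THINK with full history
        if "final" in step:
            return step["final"]
        history.append(fetch_page(step["act"]))   # ACT -> OBSERVE
    return "max_rounds reached"

print(iterative_react("What does the service cost?"))
```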

#GPT-4o #Claude 3.5 #Sequential

Re-Act + Reflection

After an attempt, the agent reflects (critique, self-correction) and then retries with an updated strategy.

LOOP: INPUT → REASON → ACT → OBSERVE → REFLECT → (MAYBE) REASON → ACT → … → FINAL ANSWER

Pros: Catches mistakes that a single pass would ship. Pairs well with a dedicated critic model or stricter reflection prompt. Improves robustness on complex tasks.

Cons: At least one extra LLM call per cycle — higher cost and latency. Reflection can still miss root causes or over-correct. Needs clear stop conditions to avoid infinite retries.

Two implementation choices

  • One model: Same LLM handles Reason, Act, and Reflect. Simpler, usually enough. Use a critic-style prompt for the reflection step.
  • Two models: Separate model for reflection (critic). Use when reflection must catch subtle errors or tasks are reasoning-heavy. Stronger reasoning models (o1, DeepSeek-R1) excel as critics.

For most setups, a single capable model is sufficient. Add a specialized reflection model when quality of critique matters more.
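A sketch of the attempt-reflect-retry wrapper; attempt and critique are stand-ins for LLM calls (one model with two prompts, or a separate critic model as discussed above).

```python
def attempt(task: str, advice: str | None) -> str:
    # Stand-in for a full Reason -> Act -> Observe pass, steered by prior advice.
    return "checked answer" if advice else "rough answer"

def critique(task: str, answer: str) -> str | None:
    # Stand-in for the critic: None means the answer passes review.
    return None if answer.startswith("checked") else "double-check, then mark as checked"

def reflective_react(task: str, max_tries: int = 3) -> str:
    advice = None
    for _ in range(max_tries):                    # clear stop condition
        answer = attempt(task, advice)            # REASON -> ACT -> OBSERVE
        advice = critique(task, answer)           # REFLECT
        if advice is None:                        # critic passed: ship it
            return answer
    return answer                                 # best effort after budget

print(reflective_react("Summarize Q3 revenue"))
```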

#GPT-4o #Claude 3.5 #Reflect

Re-Act + Memory

Re-Act loop + long-term or episodic memory (store important data, reuse in later steps). Memory is explicit read/write on top of Observe: tool outputs are ephemeral unless you persist them, while memory holds facts, preferences, or state across turns. Implementations range from a vector store keyed by session to structured slots the model updates after each Observe.

LOOP: INPUT → REASON → ACT → OBSERVE → MEMORY READ/WRITE → … → FINAL RESULT

Pros: Stable behavior over long horizons. Can reduce duplicate API calls. Supports personalization when memory is scoped per user or session.

Cons: Stale or wrong memories poison future turns — needs versioning, TTL, or retrieval with citations. Extra storage and privacy review. Vector memory can retrieve irrelevant facts if queries are vague.
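A sketch of explicit memory read/write around Observe, with an in-process dict standing in for a vector store or per-user profile; get_user_timezone is a made-up expensive tool.

```python
MEMORY: dict[str, str] = {}                       # stand-in for a real store

def get_user_timezone(user: str) -> str:          # made-up expensive tool call
    return "Europe/Oslo"

def timezone_with_memory(user: str) -> str:
    cached = MEMORY.get(f"tz:{user}")             # MEMORY READ before acting
    if cached:
        return cached                             # skip the duplicate API call
    tz = get_user_timezone(user)                  # ACT -> OBSERVE
    MEMORY[f"tz:{user}"] = tz                     # MEMORY WRITE after Observe
    return tz

print(timezone_with_memory("ada"), timezone_with_memory("ada"))  # second hit is cached
```

A production version would add the TTL, versioning, or citations the Cons above call for; the read-before-act, write-after-observe shape stays the same.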

#GPT-4o #Claude 3.5 #Memory

Re-Act with Planning

Plan first (e.g. high-level steps or subgoals), then run a Re-Act loop within each step. Plan-and-execute with Re-Act as the execution engine.

LOOP: INPUT → PLAN → THINK → ACT → OBSERVE → … → FINAL RESULT

Pros: Reduces aimless tool use. Easier to parallelize or hand off sub-steps. Plan can be shown to users for approval.

Cons: Brittle if the initial plan is wrong — may need replanning logic. Two-layer control adds complexity. Planner and executor can disagree on what “done” means for a step.

One or two models

  • One model: Same LLM does Plan (decomposition) and Execute (Re-Act per step). Plan is a different prompt; execution is the usual Think → Act → Observe loop.
  • Two models: Separate planner for high-level decomposition, executor for per-step Re-Act. Use when planning is complex (e.g. o1 for planner, cheaper model for execution) or for cost optimization.

For most setups, a single capable model is sufficient. Add a specialized planner when tasks need strong decomposition or long-horizon planning.
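A sketch of plan-and-execute with scripted stand-ins: plan would be one planner prompt (or a separate planner model), and execute_step would be a full per-step Re-Act loop.

```python
def plan(task: str) -> list[str]:
    # Stand-in for the planner prompt/model: decompose into subgoals.
    return ["gather data", "analyze", "write summary"]

def execute_step(step: str, context: list[str]) -> str:
    # Stand-in for a per-step Think -> Act -> Observe loop with full context.
    return f"{step}: done (saw {len(context)} earlier results)"

def plan_and_execute(task: str) -> str:
    context: list[str] = []
    for step in plan(task):                          # PLAN first
        context.append(execute_step(step, context))  # then loop per step
    return "; ".join(context)                        # FINAL RESULT

print(plan_and_execute("Quarterly report"))
```

Replanning logic, for when the initial plan turns out wrong, would wrap this loop and call plan again with the accumulated context.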

#GPT-4o #Claude 3.5 #Plan

Chain Of Thought Re-Act

The standard Re-Act loop where the Think step is prompted for detailed, step-by-step chain-of-thought reasoning in text before each action.

LOOP: INPUT → THINK (COT) → ACT → OBSERVE → THINK (COT) → … → FINAL RESULT

Pros: Often higher accuracy on structured problems. Reasoning traces help humans trust or audit the path to an action. Works with a single model — no extra critic required.

Cons: Verbose traces increase tokens and latency. CoT can still be wrong while sounding confident. Some deployments prefer not to store raw chain-of-thought for privacy.

One model

  • CoT (chain-of-thought) is produced by the same LLM that runs the Re-Act loop. The Think step is prompted for detailed step-by-step reasoning; the model outputs both reasoning and tool calls in the same flow.
  • No need for two models. When CoT quality matters (complex logic, math, multi-step tasks), choose models that excel at structured reasoning.

Reasoning models (o1, DeepSeek-R1) and strong general models (GPT-4o) handle CoT well. Use the Best for CoT list when step-by-step reasoning is central.
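A small sketch of how the Think step gets its CoT behavior: a system prompt asks for explicit reasoning before each tool decision, and prior turns stay in context. The prompt wording here is illustrative, not a benchmarked template.

```python
COT_SYSTEM = (
    "Think step by step. First write your reasoning under 'Reasoning:', "
    "then either request a tool call or write 'Final:' with the answer."
)

def build_messages(task: str, trace: list[str]) -> list[dict]:
    messages = [{"role": "system", "content": COT_SYSTEM},
                {"role": "user", "content": task}]
    # Prior reasoning and tool turns stay in context so each CoT builds on the last.
    messages += [{"role": "assistant", "content": step} for step in trace]
    return messages

print(build_messages("How many weekdays are in March 2025?", []))
```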

#GPT-4o #Claude 3.5 #CoT

Tree-of-thought plus Re-Act

Before acting, the model explores several partial reasoning paths (a tree of candidate thoughts or subplans), prunes or scores them, then commits to a preferred branch and runs the standard Think → Act → Observe loop under that choice. Brings search-over-reasoning structure to hard problems; costs more LLM calls than a single CoT line—control breadth and depth for budget.

LOOP: INPUT → (BRANCH: MULTIPLE CANDIDATE THOUGHTS) → SELECT BEST → THINK → ACT → OBSERVE → … → FINAL RESULT

Pros: Explores several approaches before committing—often better on puzzles, planning, or when the first guess is wrong. Pairs well with strong reasoning models; combines with the same Re-Act tool loop after selection.

Cons: Multiplies up-front token cost (branching + selection). Heuristic picks can eliminate the right branch if scoring is weak. Needs explicit limits on branch count and depth.
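A sketch of the branch-then-select step that runs before the ordinary Re-Act loop; propose and score are stand-ins for LLM calls, and the scorer here is a toy heuristic (real ToT would also bound depth, not just width).

```python
def propose(task: str, k: int = 3) -> list[str]:
    # Stand-in for sampling k candidate thoughts/subplans from the model.
    return [f"approach {i}: " + "step " * (i + 1) for i in range(k)]

def score(task: str, thought: str) -> float:
    # Toy heuristic; real implementations use an LLM judge or value prompt.
    return float(len(thought))

def select_branch(task: str, k: int = 3) -> str:
    candidates = propose(task, k)                         # BRANCH (bounded width k)
    best = max(candidates, key=lambda t: score(task, t))  # SELECT BEST
    return best                                           # then run Re-Act under this branch

print(select_branch("Solve the river-crossing puzzle"))
```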

#GPT-4o #Claude 3.5 #ToT

Sandbox Execution Agent

Re-Act (or a planner–executor) where the main Act is not a thin API call but code or shell in an isolated environment: the model proposes commands or scripts, the runtime executes them in a sandbox (container, micro-VM, or restricted worker), and Observe is stdout, stderr, exit code, and optionally files or metrics. Suited to coding tasks, data wrangling, and reproducible runs; you trade latency and hosting cost for real compute and file I/O with guardrails (timeouts, network policy, resource caps).

LOOP: INPUT → THINK → (COMPOSE COMMAND OR SCRIPT) → EXECUTE IN SANDBOX → OBSERVE (STDOUT/STDERR/EXIT/ARTIFACTS) → … → FINAL RESULT

Pros: Real compute and filesystem in a controlled box—code can run, tests can execute, and agents can recover from failed commands using stderr. Easier to align “do work” with production coding copilots than only HTTP tools.

Cons: Sandboxing is non-trivial (escape risk, cost per session, image prep). Long logs or big artifacts blow context; needs truncation and summarization. Misconfigured network or mounts can still leak data or cost.

Model vs runtime

  • The policy can be the same tool-calling models as standard Re-Act; the difference is the tool implementation (spawn in E2B, Firecracker, gVisor, your worker pool).
  • Stronger coding models (and long context) help when Observe is large diffs, logs, or stack traces; the sandbox is where you enforce CPU, network, and disk policy.

If execution is all remote APIs with no local shell, you are closer to plain Re-Act. Sandbox agents shine when the environment must run user or generated code.
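A minimal sketch of the Execute-in-sandbox step using a local subprocess with a timeout as a stand-in isolation layer; production setups would isolate with containers or micro-VMs (E2B, Firecracker, gVisor) plus the network and resource policy described above.

```python
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: int = 10) -> dict:
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],         # model-proposed script
            capture_output=True, text=True, timeout=timeout_s,
        )
        out, err, rc = proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:             # enforce the time budget
        out, err, rc = "", "timed out", -1
    return {"stdout": out[-4000:],                # truncate so big logs
            "stderr": err[-4000:],                # don't blow the context
            "exit_code": rc}

print(run_in_sandbox("print(sum(range(10)))"))
```

The returned dict is the Observe payload: stdout, stderr, and exit code, truncated precisely because of the long-log problem listed in the Cons.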

#GPT-4o #Claude 3.5 #Sandbox

Re-Act + Learning

Agent updates its policy (or a retrievable knowledge store) from experience: RL from rewards, fine-tuning from feedback, or storing corrected strategies for reuse.

LOOP: INPUT → THINK → ACT → OBSERVE → (IF FEEDBACK/REWARD) → UPDATE → … → FINAL RESULT

Pros: System improves without hand-editing every prompt. Experience replay or stored strategies are cheaper than full RL for many teams.

Cons: Risk of learning the wrong pattern from noisy feedback. Fine-tuning and RL need data hygiene, eval harnesses, and rollback. “Learning” without guardrails can amplify biases or unsafe shortcuts.

Three learning modes

  • Storing strategies: General chat models. Corrected strategies are stored and retrieved (like Re-Act + Memory). No model update.
  • Fine-tuning: Models that support fine-tuning (e.g. GPT-4o-mini, Qwen). Requires feedback data to update weights.
  • RL from rewards: Smaller open models (Qwen 7B, DeepSeek-Coder). RL is costly; smaller models are more practical for training.

Model choice depends on which learning mode you implement. Storing is simplest; fine-tuning and RL need compatible model infrastructure.
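A sketch of the simplest mode, storing corrected strategies for reuse; the dict stands in for a retrievable strategy store, and no model weights change.

```python
STRATEGIES: dict[str, str] = {}                   # stand-in for a strategy store

def solve(task_type: str, task: str) -> str:
    hint = STRATEGIES.get(task_type)              # reuse a past correction
    return f"answer for {task!r} (hint: {hint or 'none'})"

def learn_from_feedback(task_type: str, correction: str) -> None:
    STRATEGIES[task_type] = correction            # UPDATE on feedback/reward

print(solve("date-math", "days until Friday"))
learn_from_feedback("date-math", "always compute in the user's timezone")
print(solve("date-math", "days until Friday"))    # second run uses the correction
```

Fine-tuning and RL replace the dict update with weight updates, which is where the data hygiene and rollback requirements above come in.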

#GPT-4o #Claude 3.5 #Learn

Conclusion

You rarely need every loop pattern at once. Start with a minimal Re-Act loop and clear tool contracts, then add dialogue, reflection, or memory when user trust, accuracy, or long sessions demand it. For how those pieces fit into sense, think, act, observe, and finish, see Anatomy of an AI agent. For common use-case stacks (how to combine these loops with RAG, memory, tools), see Production Agent-RAG Architectures.