How AI Agents Actually Work Behind the Scenes

I asked an AI agent to book a meeting, research a competitor, and draft a proposal, all in one prompt. It confidently did all three. One of the results was fabricated. One was three months out of date. And the meeting invite went to the wrong person.

The agent wasn't broken. It was working exactly as designed. The problem was I didn't understand what "working as designed" actually meant under the hood.

That changed when I stopped treating agents as black boxes and started reading what they were actually doing: the prompts, the tool calls, the memory lookups, the planning loops. What I found was fascinating, occasionally terrifying, and completely explainable once you know the architecture.

This is that explanation. And if you want to see these patterns applied in a real production system, check out InsightPilot and SOM.ai, two projects where I built multi-step agent pipelines from scratch.

What an AI Agent Actually Is

Most definitions of AI agents are either too vague ("an AI that takes actions") or too narrow ("an LLM with tool access"). Here's the one I've settled on:

An AI agent is a system that perceives inputs, reasons about them using a language model, decides which actions to take, executes those actions through tools, observes the results, and repeats until a goal is achieved or a stopping condition is met.

The key word is loop. A single LLM call is not an agent. An agent is what happens when you put a model inside a feedback cycle with the world.

The four core components of that loop:

┌─────────────────────────────────────────────────────────┐
│                     AGENT LOOP                          │
│                                                         │
│   Perception ──► Reasoning ──► Action ──► Observation   │
│        ▲                                      │         │
│        └──────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────┘

Perception: what the agent receives (user input, tool results, memory retrievals, environment state)
Reasoning: what the LLM does with that input (plans, decides, reflects)
Action: what the agent does (calls a tool, writes to memory, sends a message, terminates)
Observation: what comes back (tool output, error, confirmation, new state)

Everything else, including orchestration layers, memory systems, and safety controls, is scaffolding built around this loop.
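To make the loop concrete, here is a toy sketch of it. Everything in it is an assumption for illustration: `fake_model` stands in for the LLM, and `add` is a made-up tool. No real model or library is involved.

```python
from typing import Callable

# Hypothetical tool registry with a single toy tool.
TOOLS: dict[str, Callable[..., int]] = {"add": lambda a, b: a + b}

def fake_model(context: list[dict]) -> dict:
    """Stand-in for the LLM: picks the next action from what is in context."""
    observations = [m for m in context if m["role"] == "observation"]
    if not observations:
        # Nothing observed yet, so request a tool call.
        return {"type": "tool_call", "tool": "add", "args": {"a": 2, "b": 3}}
    # A tool result is in context, so finish with an answer.
    return {"type": "final", "answer": f"The sum is {observations[-1]['content']}"}

def run_loop(goal: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": goal}]                    # perception
    for _ in range(max_steps):
        decision = fake_model(context)                                # reasoning
        if decision["type"] == "final":
            return decision["answer"]                                 # goal reached
        result = TOOLS[decision["tool"]](**decision["args"])          # action
        context.append({"role": "observation", "content": result})    # observation
    return "stopped: step budget exhausted"                           # stopping condition

print(run_loop("What is 2 + 3?"))  # prints: The sum is 5
```

Swap `fake_model` for a real model call and `TOOLS` for real integrations and the shape stays the same.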

The Full System Architecture

Here's how a production AI agent system actually fits together end to end:

                        ┌─────────────┐
                        │  User Input │
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │ Orchestrator│  ◄── system prompt + agent config
                        └──────┬──────┘
                               │
              ┌────────────────▼─────────────────┐
              │           LLM Core               │
              │  (planning + reasoning + output) │
              └───┬──────────────┬───────────────┘
                  │              │
         ┌────────▼──┐    ┌──────▼────────┐
         │Tool Router│    │ Memory Manager│
         └────┬──────┘    └──────┬────────┘
              │                  │
    ┌─────────▼───────┐  ┌───────▼─────────┐
    │ Tool Execution  │  │ Vector Store /  │
    │ (APIs, code,    │  │ Episodic Cache  │
    │  search, DBs)   │  └─────────────────┘
    └────────┬────────┘
             │
    ┌────────▼────────┐
    │  Observation    │
    │  (results +     │──────────────────► back to Orchestrator
    │   error state)  │
    └─────────────────┘

Every box in this diagram is a real engineering decision. Let's go through each one.

Component Breakdown

The LLM Core: Reasoning Engine, Not Magic Oracle

The language model is the reasoning engine. It reads the full context window, including the system prompt, conversation history, tool results, and memory retrievals, and produces either a text response or a structured action (tool call, plan step, final answer).

What the model doesn't have: persistent state, real-time data, the ability to actually execute anything. It only produces tokens. Everything else is infrastructure interpreting and acting on those tokens.

The model's context window is the agent's working memory. Everything the agent "knows" in a given reasoning step lives in that window. This makes context management (what goes in, what gets compressed, what gets dropped) one of the most consequential engineering decisions in any agent system.
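One simple context policy, sketched under loud assumptions: keep the first message (the goal) and as many of the newest messages as fit a budget. The `estimate_tokens` heuristic of roughly four characters per token is a guess. A real system would use the model's tokenizer and would also keep tool calls paired with their results.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first message plus the newest messages that fit the budget."""
    if not messages:
        return []
    first, rest = messages[0], messages[1:]
    used = estimate_tokens(first["content"])
    kept: list[dict] = []
    for msg in reversed(rest):            # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                         # everything older is dropped
        kept.append(msg)
        used += cost
    return [first] + list(reversed(kept))
```

Dropping is the bluntest option. Summarizing the dropped span into one short message is the usual next step, at the price of lossy history.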

If you're building a RAG-based agent, chunking strategy directly determines what ends up in that context window. I wrote a deep-dive on exactly that: The Definitive Guide to Chunking Strategies for LLMs.

Tool Use: Where Agents Touch the World

Tool use is what separates an agent from a chatbot. The model outputs a structured tool call; the orchestrator intercepts it, executes the tool, and feeds the result back into the context.

A minimal tool definition looks like this:

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Use when the user asks about recent events or facts you may not have.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python code snippet and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Valid Python code to execute"
                }
            },
            "required": ["code"]
        }
    }
]

The LLM doesn't call the tool directly. It outputs something like:

{
  "tool": "search_web",
  "parameters": { "query": "Q1 2026 SaaS churn benchmarks" }
}

The orchestrator parses this, runs the actual search, and appends the result to the context before the next LLM call. The model never "runs" anything. It only describes what to run.

This indirection is both a strength (safety, observability) and a source of failure: the model can hallucinate tool names or parameters, or assume tools exist that don't.
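A cheap defense is to validate every tool call against the declared definitions before executing anything. The sketch below assumes the generic call shape shown above (`tool` and `parameters` keys) and only checks names and required fields. A full JSON Schema validator would also check types.

```python
# Trimmed-down copy of a tool definition, for illustration only.
TOOLS = [{
    "name": "search_web",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def validate_tool_call(call: dict, tools: list[dict]) -> list[str]:
    """Return a list of problems. An empty list means the call looks valid."""
    specs = {t["name"]: t for t in tools}
    name = call.get("tool")
    if name not in specs:
        return [f"unknown tool: {name!r}"]
    schema = specs[name]["parameters"]
    params = call.get("parameters", {})
    problems = []
    for required in schema.get("required", []):
        if required not in params:
            problems.append(f"missing required parameter: {required!r}")
    for key in params:
        if key not in schema.get("properties", {}):
            problems.append(f"unexpected parameter: {key!r}")
    return problems

print(validate_tool_call({"tool": "search_internet", "parameters": {"query": "x"}}, TOOLS))
print(validate_tool_call({"tool": "search_web", "parameters": {"q": "x"}}, TOOLS))
```

Feeding the problem list back to the model as the tool result usually lets it correct itself on the next step.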

Orchestration: The Traffic Controller

The orchestrator manages the agent loop: sending prompts to the model, routing tool calls to executors, handling errors, enforcing stop conditions, and maintaining message history.

A minimal ReAct-style orchestrator in Python:

import anthropic
import json

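client = anthropic.Anthropic()

# Assumed to be defined elsewhere: SYSTEM_PROMPT (str), extract_text(response) -> str,
# and execute_tool(name, input) -> JSON-serializable result.
# Note: the Anthropic Messages API expects each tool's JSON Schema under the key
# "input_schema", so generic definitions that use "parameters" need renaming first.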

def run_agent(user_query: str, tools: list, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_query}]

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages
        )

        # Agent produced a final answer
        if response.stop_reason == "end_turn":
            return extract_text(response)

        # Agent wants to use a tool
        if response.stop_reason == "tool_use":
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tool_results = []

            for call in tool_calls:
                result = execute_tool(call.name, call.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": call.id,
                    "content": json.dumps(result)
                })

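            # Append assistant turn + tool results to history
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue

        # Any other stop reason (for example "max_tokens") would resend the same
        # messages unchanged, so stop instead of looping.
        return f"Stopped early: stop_reason={response.stop_reason}"

    return "Max steps reached without resolution."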

This loop is the skeleton of almost every production agent. The complexity lives in execute_tool, error handling, and what happens when the model loops without making progress.
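For a sense of what `execute_tool` might look like, here is one possible shape: a registry plus a dispatcher that returns errors as data instead of raising. The decorator name `register_tool` and the `word_count` tool are made up for this sketch.

```python
from typing import Any, Callable

TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}

def register_tool(name: str):
    """Decorator that adds a function to the registry under a tool name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("word_count")
def word_count(text: str) -> int:
    return len(text.split())

def execute_tool(name: str, tool_input: dict) -> dict:
    """Never raise: the model can only adapt to failures it can see."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"output": fn(**tool_input)}
    except TypeError as exc:
        # Most often wrong or missing arguments supplied by the model.
        return {"error": f"bad arguments for {name}: {exc}"}
    except Exception as exc:
        return {"error": f"{name} failed: {exc}"}

print(execute_tool("word_count", {"text": "agents are loops"}))  # {'output': 3}
```

Returning a structured error keeps the loop alive and gives the model a chance to retry with corrected arguments.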

Memory: The Four Layers

Memory in agent systems isn't one thing. It's four distinct layers with different scopes and engineering requirements:

┌────────────────────────────────────────────────────────┐
│  MEMORY LAYERS                                         │
│                                                        │
│  In-context    │ Current conversation + tool results   │
│  (working)     │ Scope: single run. Fast. Limited.     │
│                │                                       │
│  Episodic      │ Past conversations, stored + indexed  │
│  (external)    │ Scope: across runs. Vector search.    │
│                │                                       │
│  Semantic      │ Facts, documents, knowledge bases     │
│  (external)    │ Scope: static or slowly updated.      │
│                │                                       │
│  Procedural    │ Tool definitions, system prompts,     │
│  (baked-in)    │ few-shot examples, fine-tune weights  │
└────────────────────────────────────────────────────────┘

The most common mistake in early agent systems is treating in-context memory as the only memory. It works until the context fills up or the conversation ends. Production systems need all four layers, with explicit logic for when to read from and write to each.
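A rough sketch of what "explicit logic" can mean in practice. The `EpisodicStore` here is a toy: it ranks past episodes by shared words rather than embeddings, purely so the example runs without a vector database. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicStore:
    """Toy stand-in for a vector store: ranks by word overlap, not embeddings."""
    episodes: list[str] = field(default_factory=list)

    def write(self, summary: str) -> None:
        self.episodes.append(summary)

    def recall(self, query: str, k: int = 2) -> list[str]:
        q = set(query.lower().split())
        scored = [(len(q & set(e.lower().split())), e) for e in self.episodes]
        relevant = [s for s in scored if s[0] > 0]
        ranked = sorted(relevant, key=lambda s: s[0], reverse=True)
        return [e for _, e in ranked[:k]]

def build_context(system_prompt: str, store: EpisodicStore,
                  working: list[str], query: str) -> list[str]:
    """Assemble one model input from three of the layers."""
    context = [system_prompt]                                         # procedural
    recalled = store.recall(query)
    if recalled:
        context.append("Relevant history: " + " | ".join(recalled))   # episodic
    context.extend(working)                                           # in-context
    context.append(query)
    return context
```

The write side matters as much as the read side: deciding at the end of a run what gets summarized into `write` is where most of the design effort goes.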

Safety Controls: The Layer Nobody Builds First (And Should)

Safety controls sit between the orchestrator and the outside world. They include:

Input validation: sanitize user inputs before they reach the model
Output filtering: intercept model outputs before tool execution
Tool scope limiting: whitelist which tools each agent persona can access
Action confirmation: require human approval for irreversible actions (writes, sends, deletes)
Rate limiting and cost caps: prevent runaway loops from burning budget
Audit logging: record every tool call, input, and output for post-hoc debugging

Here's the full guard pattern in Python:

import asyncio
import logging
from dataclasses import dataclass, field
from typing import Any, Optional

logger = logging.getLogger(__name__)

IRREVERSIBLE_TOOLS = {"send_email", "delete_record", "push_to_production", "charge_payment"}

@dataclass
class AgentConfig:
    allowed_tools: list[str]
    auto_approve: bool = False
    tool_timeout_ms: int = 5000
    max_cost_usd: float = 1.0
    total_cost_usd: float = field(default=0.0, init=False)

@dataclass
class ToolResult:
    output: Optional[Any] = None
    error: Optional[str] = None
    cost_usd: float = 0.0

async def safe_tool_execute(
    tool_name: str,
    params: dict,
    config: AgentConfig,
    approve_fn=None,
    execute_fn=None,
) -> ToolResult:
    # 1. Scope check: is this tool allowed for this agent?
    if tool_name not in config.allowed_tools:
        logger.warning(f"Blocked tool call: '{tool_name}' not in allowed list.")
        return ToolResult(error=f"Tool '{tool_name}' not permitted for this agent.")

    # 2. Cost cap: prevent runaway spend
    if config.total_cost_usd >= config.max_cost_usd:
        return ToolResult(error=f"Cost cap of ${config.max_cost_usd:.2f} reached. Halting.")

    # 3. Human-in-the-loop for irreversible actions
    if tool_name in IRREVERSIBLE_TOOLS and not config.auto_approve:
        if approve_fn is None:
            return ToolResult(error=f"'{tool_name}' requires human approval but no approve_fn provided.")
        approved = await approve_fn(tool_name, params)
        if not approved:
            logger.info(f"User rejected irreversible action: '{tool_name}'")
            return ToolResult(error="Action rejected by user.")

    # 4. Execute with timeout
    try:
        timeout_secs = config.tool_timeout_ms / 1000
        result: ToolResult = await asyncio.wait_for(
            execute_fn(tool_name, params),
            timeout=timeout_secs
        )
        # 5. Track cumulative cost
        config.total_cost_usd += result.cost_usd
        logger.info(f"Tool '{tool_name}' succeeded. Session cost: ${config.total_cost_usd:.4f}")
        return result

    except asyncio.TimeoutError:
        logger.error(f"Tool '{tool_name}' timed out after {config.tool_timeout_ms}ms.")
        return ToolResult(error=f"Tool '{tool_name}' timed out.")

    except Exception as e:
        logger.exception(f"Tool '{tool_name}' raised an unexpected error.")
        return ToolResult(error=f"Unexpected error in '{tool_name}': {str(e)}")


# --- Usage example ---

async def mock_approve(tool_name: str, params: dict) -> bool:
    print(f"\n[APPROVAL REQUIRED] Tool: {tool_name}\nParams: {params}")
    return input("Approve? (y/n): ").strip().lower() == "y"

async def mock_execute(tool_name: str, params: dict) -> ToolResult:
    # Replace with real tool dispatch logic
    return ToolResult(output=f"Result from {tool_name}", cost_usd=0.002)

async def main():
    config = AgentConfig(
        allowed_tools=["search_web", "run_python", "send_email"],
        auto_approve=False,
        tool_timeout_ms=3000,
        max_cost_usd=0.50
    )

    result = await safe_tool_execute(
        tool_name="send_email",
        params={"to": "team@company.com", "subject": "Agent Report"},
        config=config,
        approve_fn=mock_approve,
        execute_fn=mock_execute,
    )

    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Interaction Patterns

Global Planning vs. Reactive Execution

There are two dominant patterns, and knowing when to use each is a real architectural decision.

Global planning (Plan-then-Execute): The agent first produces a full plan, a sequence of steps toward the goal, then executes each step. Good for well-defined tasks with predictable tool behavior. Brittle when the environment changes mid-execution.

User Goal
    │
    ▼
[PLAN STEP]  ──►  Step 1: search market data
                  Step 2: run analysis script
                  Step 3: draft report
                  Step 4: send summary email
    │
    ▼
[EXECUTE each step sequentially]

Reactive execution (ReAct-style): The agent reasons and acts one step at a time, using each observation to decide the next action. More adaptive. More token-expensive. Better for open-ended, exploratory tasks.

Observe → Think → Act → Observe → Think → Act → ...

Most production systems use a hybrid: global planning for task decomposition, reactive execution within each sub-task.

State Management Across Steps

Agent state, including what the agent knows, what it has done, and what it is waiting on, needs to be explicitly tracked and serialized. Don't rely on the model to remember it across turns.

from dataclasses import dataclass, field
from typing import Optional
import json

@dataclass
class AgentState:
    session_id: str
    goal: str
    plan: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    tool_results: dict[str, str] = field(default_factory=dict)
    current_step_index: int = 0
    status: str = "running"  # running | waiting | complete | failed
    error: Optional[str] = None

    def advance(self):
        self.completed_steps.append(self.plan[self.current_step_index])
        self.current_step_index += 1
        if self.current_step_index >= len(self.plan):
            self.status = "complete"

    def to_json(self) -> str:
        return json.dumps(self.__dict__, indent=2)

    @classmethod
    def from_json(cls, data: str) -> "AgentState":
        return cls(**json.loads(data))

Serialize this state to a database between agent turns. It's what enables pause-and-resume, human-in-the-loop approval, and post-mortem debugging.
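A minimal persistence sketch, assuming SQLite as the database and a plain dict as the serialized state. The table name `agent_state` and the helper names are made up for illustration.

```python
import json
import sqlite3
from typing import Optional

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_state ("
        "session_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn

def save_state(conn: sqlite3.Connection, session_id: str, state: dict) -> None:
    # One row per session: each save overwrites the previous snapshot.
    conn.execute(
        "INSERT OR REPLACE INTO agent_state (session_id, payload) VALUES (?, ?)",
        (session_id, json.dumps(state)),
    )
    conn.commit()

def load_state(conn: sqlite3.Connection, session_id: str) -> Optional[dict]:
    row = conn.execute(
        "SELECT payload FROM agent_state WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = open_store()
save_state(conn, "demo", {"goal": "draft report", "current_step_index": 1})
print(load_state(conn, "demo"))
```

Overwriting is enough for pause-and-resume. For post-mortem debugging you would more likely append one row per step so the full history survives.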

Real-World Scenarios

Scenario 1: Customer Support Agent

A Tier-1 support agent for a SaaS product. The agent handles inbound tickets, looks up account data, checks known issues, drafts responses, and escalates when it can't resolve.

This is exactly what I built at R Systems for Edgenta: an intelligent support desk that streamlined ticket handling and query resolution end to end. You can read more about that work on my experience page.

Tool set: lookup_account, search_knowledge_base, check_open_incidents, create_ticket, escalate_to_human

Critical design decisions:

System prompt includes explicit escalation triggers (billing disputes, data loss, detected frustration)
escalate_to_human is always available and never blocked by safety controls
All tool calls are logged to the CRM with the agent's reasoning step attached
The agent never sends emails directly. It drafts, and a human confirmation step triggers send

Failure mode to guard against: The model confidently drafts a resolution based on outdated knowledge base articles. Guard: always call check_open_incidents before drafting. If a known issue exists, reference it explicitly. Never synthesize a fix from first principles when a documented answer exists.
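That guard is easier to trust when the orchestrator enforces it rather than the prompt. A sketch, assuming a hypothetical `draft_response` tool (it is not in the tool set above) and a simple prerequisite table:

```python
# Hypothetical rule table: tool -> tools that must already have run in this session.
PREREQUISITES: dict[str, set[str]] = {
    "draft_response": {"check_open_incidents"},
}

def missing_prerequisites(tool_name: str, called_so_far: list[str]) -> set[str]:
    """Return the prerequisite tools that have not been called yet."""
    return PREREQUISITES.get(tool_name, set()) - set(called_so_far)

history = ["lookup_account", "search_knowledge_base"]
print(missing_prerequisites("draft_response", history))   # {'check_open_incidents'}
history.append("check_open_incidents")
print(missing_prerequisites("draft_response", history))   # set()
```

When the set is non-empty, the orchestrator can refuse the call and return the missing tool names as the tool result, which nudges the model to run them first. Note that `escalate_to_human` has no entry, so it is never blocked.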

Scenario 2: Autonomous Data Analyst

An internal agent that accepts a natural language analysis request, writes and runs Python to query a database, interprets results, and returns a structured report with caveats.

I built a production version of this exact pattern. InsightPilot transforms natural language questions into interactive charts and insights, with 95% faster time-to-insight for non-technical teams. The agent architecture underneath it follows the design below almost exactly.

Tool set: run_sql_query, execute_python, write_to_report, request_clarification

System prompt design:

You are a data analyst agent. When given an analysis request:
1. Clarify any ambiguous metrics or time ranges before querying.
2. Write SQL to retrieve the relevant data. Always add LIMIT 10000 as a safeguard.
3. Use Python to analyze and visualize the result.
4. Summarize findings in plain English. Flag any anomalies or data quality issues.
5. State confidence level and list assumptions explicitly.

You do NOT have access to production write operations. Your access is read-only.
Never interpret missing data as zero. Flag it as unknown.

Critical design decisions:

SQL queries run against a read-only replica. Write operations are blocked at infrastructure level, not just prompt level.
The agent is required to state assumptions and confidence explicitly in every report.
request_clarification is a first-class tool. The agent is rewarded for asking rather than guessing.
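To show what "blocked at infrastructure level" looks like in miniature, here is a sketch that uses a SQLite file opened in read-only mode as a stand-in for a read replica. The `orders` table and helper names are invented for the demo. The row cap mirrors the LIMIT safeguard from the prompt, enforced in code this time.

```python
import os
import sqlite3
import tempfile

def make_demo_db() -> str:
    """Create a throwaway database file with a little data in it."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    conn.commit()
    conn.close()
    return path

def run_sql_query(path: str, sql: str, row_cap: int = 10000) -> list[tuple]:
    # mode=ro makes SQLite itself reject writes, whatever the SQL text says.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchmany(row_cap)
    finally:
        conn.close()

db_path = make_demo_db()
print(run_sql_query(db_path, "SELECT id, amount FROM orders ORDER BY id"))
try:
    run_sql_query(db_path, "DELETE FROM orders")
except sqlite3.OperationalError as exc:
    print(f"write blocked: {exc}")
```

The point is that the model can emit any SQL it likes. The connection, not the prompt, decides what is allowed.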

Evaluation and Debugging

What to Measure

Evaluating agents is harder than evaluating single LLM calls because failures compound across steps. The metrics that matter in production:

Task completion rate: Did the agent actually achieve the stated goal, end to end? This is the only metric that truly matters to users.
Step efficiency: Did it take 12 tool calls to do a 3-step task? High step counts signal planning failures or prompt issues, not tool failures.
Hallucination rate: Did the agent fabricate tool results, cite sources that don't exist, or invent intermediate facts? Measure this separately from task completion.
Latency per step: Break down wall-clock time by step: LLM call vs. tool execution vs. orchestrator overhead. The bottleneck is usually not where you expect.
Error recovery rate: When a tool fails or returns empty, does the agent adapt and recover, or does it spiral into retries and then give a confident wrong answer?
Human escalation rate: For agents with an escalation path, is it escalating appropriately? Too low means it's overconfident. Too high means the system prompt or tool set needs work.
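Several of these metrics fall out of simple per-run records. A sketch, assuming a hypothetical `RunRecord` that your tracing layer fills in. Hallucination rate and latency are left out because they need labeled data and timers respectively.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    completed: bool        # did the run achieve the stated goal?
    steps: int             # number of tool calls made
    escalated: bool        # did it hand off to a human?
    tool_errors: int       # tool calls that failed or came back empty
    recovered_errors: int  # failures the agent adapted to successfully

def summarize(runs: list[RunRecord]) -> dict[str, float]:
    n = len(runs)
    if n == 0:
        return {}
    errors = sum(r.tool_errors for r in runs)
    recovered = sum(r.recovered_errors for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "mean_steps": sum(r.steps for r in runs) / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        # With no errors observed, report 1.0 rather than dividing by zero.
        "error_recovery_rate": recovered / errors if errors else 1.0,
    }

runs = [RunRecord(True, 3, False, 1, 1), RunRecord(False, 12, True, 3, 1)]
print(summarize(runs))
```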

Debugging Strategies That Actually Work

Trace every step. Log the full context sent to the model at each step, the model's output, the tool called, and the tool's result. Don't log just the final answer. The failure is almost always in the middle.
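A bare-bones version of that trace, assuming JSON Lines as the storage format. The field names are my own choice, and `io.StringIO` stands in for a real log file.

```python
import io
import json
import time
from typing import Any, Optional, TextIO

def log_step(sink: TextIO, run_id: str, step: int, context: list[dict],
             model_output: str, tool_name: Optional[str], tool_result: Any) -> None:
    """Append one JSON line per step: what went in, what came out, what ran."""
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "step": step,
        "context": context,            # full input sent to the model
        "model_output": model_output,
        "tool_name": tool_name,
        "tool_result": tool_result,
    }
    # default=str keeps logging from crashing on values JSON cannot encode.
    sink.write(json.dumps(record, default=str) + "\n")

def load_trace(text: str, run_id: str) -> list[dict]:
    """Read back every step of one run, in step order."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return sorted((r for r in rows if r["run_id"] == run_id), key=lambda r: r["step"])

buffer = io.StringIO()  # stand-in for a log file
log_step(buffer, "run-1", 0, [{"role": "user", "content": "hi"}],
         "call search_web", "search_web", {"hits": 0})
log_step(buffer, "run-1", 1, [], "final answer", None, None)
print([r["tool_name"] for r in load_trace(buffer.getvalue(), "run-1")])
```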

Replay broken runs. Store agent state snapshots so you can replay a failed run from any step with a modified prompt or tool response. Invaluable for fixing edge cases without running the whole task again.

Inject deliberate failures. Test what happens when a tool returns an error, returns empty results, or times out. Most agent loops handle the happy path well. The failure modes reveal the real weaknesses.
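One lightweight way to do this is a wrapper that makes any tool misbehave on demand. This is a sketch with invented names (`with_fault`, `search_kb`). The "timeout" mode simply raises `TimeoutError` rather than actually waiting.

```python
from typing import Any, Callable

def with_fault(tool: Callable[..., Any], mode: str) -> Callable[..., Any]:
    """Wrap a tool so it fails in a chosen way: 'error', 'empty', 'timeout', or 'ok'."""
    def wrapped(**kwargs: Any) -> Any:
        if mode == "error":
            raise RuntimeError("injected failure")
        if mode == "empty":
            return []
        if mode == "timeout":
            raise TimeoutError("injected timeout")
        return tool(**kwargs)
    return wrapped

def search_kb(query: str) -> list[str]:
    return [f"article about {query}"]

print(with_fault(search_kb, "ok")(query="billing"))     # ['article about billing']
print(with_fault(search_kb, "empty")(query="billing"))  # []
```

Run the same task once per mode and compare the final answers. A confident answer after an "empty" or "error" run is the red flag to look for.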

Watch for infinite loops. The most common production failure: the agent calls the same tool repeatedly with slightly different parameters, making no progress. Add a loop detector:

def detect_loop(tool_calls: list[dict], window: int = 5) -> bool:
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    tool_names = [c["name"] for c in recent]
    most_common = max(set(tool_names), key=tool_names.count)
    # Flag if more than 80% of recent calls are the same tool
    return tool_names.count(most_common) / window > 0.8

Current Limitations and Where This Goes Next

Context window as working memory. Even at 200K tokens, long-running agents hit limits. Summarization, compression, and hierarchical memory are active research areas, but every approach involves lossy compression of the agent's history. There's no clean solution yet. The chunking strategies I covered in The Definitive Guide to Chunking Strategies for LLMs directly affect how much useful signal you can pack into that window.

Reliability compounds. A single-step LLM call might be 95% reliable. In a 10-step agent every step has to succeed, so the per-step probabilities multiply. At 95% per step, a 10-step task succeeds roughly 60% of the time (0.95^10 ≈ 0.60, assuming steps fail independently). Reliability engineering for agents means designing each step to fail gracefully, not just optimizing each step in isolation.
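The arithmetic is worth running yourself, since it is the whole argument. This sketch assumes steps fail independently, which real agents only approximate.

```python
def chain_success(p_step: float, steps: int) -> float:
    # Every step must succeed, so independent per-step probabilities multiply.
    return p_step ** steps

for p in (0.95, 0.99):
    print(f"{p:.2f} per step over 10 steps -> {chain_success(p, 10):.1%}")
# 0.95 per step over 10 steps -> 59.9%
# 0.99 per step over 10 steps -> 90.4%
```

Moving each step from 95% to 99% lifts the 10-step success rate from about 60% to about 90%, which is why per-step error handling pays off so disproportionately.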

Goal drift. Agents can drift from the original goal over many steps, especially when tool results introduce new information. Maintaining goal fidelity across a long execution trace is harder than it sounds, and it remains an open alignment problem.

Multimodal agents. The next generation of agents perceive images, audio, video, and structured data natively. Architecturally this adds perception modules before the reasoning loop, but the core loop stays the same. The hard problems shift to grounding (connecting what the model sees to what it should do) and latency.

Edge deployment. Running agent loops on-device requires model compression and rethinking which parts of the architecture live locally vs. in the cloud. The orchestrator can run locally; the LLM core probably can't yet for most real tasks.

What This Changes About How You Build

The mental model shift that mattered most for me: stop thinking about agents as smart assistants and start thinking about them as distributed systems with an LLM as the decision node.

All the lessons from distributed systems apply: fail gracefully, make state explicit, design for idempotency, log everything, test failure modes before happy paths. The LLM doesn't change these fundamentals. It just adds a new kind of nondeterminism to manage.

Build the observability layer before you build the capabilities. You can't debug what you can't see.

And when your agent does something unexpected — and it will — the answer is almost always in the trace, sitting there in the context window, waiting for you to read it.

Thanks for reading! Until next time, stay curious. ~ Vansh Garg