What is an AI Agent?
Everyone's building one. Almost nobody can define one.
The Pain
You shipped a chatbot last quarter. It understands customer intent beautifully. It generates helpful, well-formatted answers. Your PM called it “magical” in the demo.
Then a customer asks for a refund. The chatbot says “I’ll process that for you!” and does absolutely nothing. It generated a confident sentence about processing a refund without actually processing it. It has no access to your order database. It can’t call your refund API. It can’t even check whether the return window is still open. It just… talks.
The customer screenshots the response, posts it on X, and now your VP of Customer Success is texting your CEO.
You’ve hit the wall that every team building on LLMs hits eventually: the model can reason about what to do, but it can’t do it. It has no hands.
An AI agent gives it hands. But what does that actually mean, and why does everyone from OpenAI to your CEO keep saying 2026 is “the year of the agent”? Let’s break it down.
The TL;DR
An AI agent is a software system that uses an LLM as its “brain” to reason about a goal, decide which tools to use, and take actions in a loop until the job is done.
An LLM without an agent is a senior engineer who can only talk. It can diagnose the problem on a whiteboard but can’t touch the server. An agent gives it a terminal.
The core loop is simple: Think → Act → Observe → Repeat. The agent reasons, calls a function, checks the result, and decides whether to keep going or deliver an answer.
Klarna’s agent handled two-thirds of all customer service chats and cut resolution time from 11 minutes to under 2. Then they had to rehire humans. Constrained tasks work. Open-ended judgment doesn’t.
The LLM is rarely the problem. Deloitte1 found only about 1 in 10 orgs have agents in production. What kills them: auth, error handling, audit trails, compliance.
Let’s get into it.
Before Agents, There Were (Pretty Good) Chatbots
To understand why agents matter, you need to understand what came before them. And why it stopped being enough.
A traditional chatbot (think the old-school kind, pre-LLM) was basically a big if/else tree:
Customer says “refund” → show refund policy.
Customer says “hours” → show business hours.
Then LLMs arrived, and chatbots got way better at understanding what you meant. You could say “hey, I got the wrong size and I want to return this” and the LLM would understand you wanted a return, even though you never said the word “return.” Huge upgrade.
An LLM is a senior engineer who can only talk. It can debug your system on a whiteboard, but it can’t SSH into the server, run the query, or push the fix. All reasoning, zero execution.
An agent is what happens when you give that engineer a terminal.
The Engineer With No Terminal
Here's what that looks like under the hood. A regular LLM chatbot, no tools, no agent framework.
Customer: “I want a refund for order #4821. I returned the item two days ago.”
What the LLM does:
Understands the intent: refund request
Generates a helpful-sounding response: “I’d be happy to help with your refund! Let me look into order #4821 for you.”
…but it can’t actually look up order #4821. It doesn’t have access to your order database. It doesn’t know if the return was received. It doesn’t know if the refund window is still open.
So it either hallucinates details (”Your refund of $47.99 has been processed,” when it has no idea what the order total was) or it punts (”Please contact our support team at support@company.com“).
The specific problems:
No access to real-time data. The LLM’s knowledge is frozen at training time. It doesn’t know your customer’s order status, your current inventory, or your refund policy (which you updated last Thursday).
No ability to take actions. Even if the LLM perfectly understood the situation, it can’t call your refund API, update a database record, or send a confirmation email. It’s text in, text out. That’s it.
No ability to reason across multiple steps. Complex tasks require checking one thing, using that result to decide the next step, and chaining several operations together. A single LLM call is one-shot. It doesn’t have a mechanism to pause, go get more information, and come back.
⚠️ Confusion Alert: “But ChatGPT can browse the web and run code!” Yes, because ChatGPT is an agent. When you see it search the web, it’s using tools via an agent loop. The base LLM underneath can’t do any of that on its own.
By this point, you’re probably thinking: okay, so how do you give the LLM access to tools and let it reason in a loop? That’s exactly what an AI agent is.
How AI Agents Actually Work
An AI agent exists because of the problems above. Here’s how it works.
The core idea is dead simple. Instead of calling the LLM once and returning the result, you put the LLM in a loop:
The Agent Loop (Think → Act → Observe → Repeat)

Let’s walk through it with the refund example:
The customer says “Refund order #4821.” The agent’s first move is to think: I need to look up this order before I can do anything. So it acts, calling order_lookup("4821"). It observes the result: order status is “returned,” total is $47.99, return window is open.
Back to think: the return is confirmed and the window is still open, so I should process the refund. It acts again, calling process_refund("4821", 47.99). It observes: refund completed, reference number RF-9920.
One more think: I have everything I need. The agent exits the loop and responds to the customer: “Your refund of $47.99 for order #4821 has been processed (ref: RF-9920). You should see it in 3-5 business days.”
Two loops. Two tool calls. One complete task, handled end to end without a human touching it.
That’s it. The agent loops between reasoning and executing until it has what it needs.
This pattern has a name in the research literature: ReAct (Reasoning + Acting), from a 2023 paper by Yao et al2. If you've looked at any agent framework (LangChain, LlamaIndex, CrewAI, OpenAI's Assistants API), you've already seen it. They all implement some variation of this loop under the hood.
The Three Components
Every agent has three parts:
The Brain (LLM). Does the reasoning: decides what to do next, interprets results, handles edge cases. This is why model quality matters more for agents than for chatbots. A chatbot with a weak model gives a mediocre answer. An agent with a weak model calls the wrong tool, gets a confusing result, and spirals. Every step in the loop is a decision, and the model is making all of them.
The Tools. Functions the agent can call: database queries, API calls, web search, code execution, file operations. Anything you can wrap in a function signature, the agent can use. That's the important part: the agent doesn't know how your tools work internally. It reads a description of each tool (name, parameters, what it returns) and decides when to call it. Good tool descriptions make good agents. Vague ones make agents that hallucinate tool calls.
The Memory/State. The running context of the conversation and actions taken so far. Without this, the agent would forget what it already looked up between steps. In the refund example, memory is how the agent knows the order status was "returned" when it gets to step two. It sounds obvious, but managing what the agent remembers (and what it forgets) becomes a real engineering problem once conversations run long or span multiple sessions.
The Brain is the engineer. The Tools are the terminal. The Memory is how it tracks what it already tried.
🔍 Deeper Look: The original ReAct paper by Yao et al. showed that combining reasoning traces with tool use outperformed both pure chain-of-thought prompting and pure action-taking on tasks like question answering and fact verification. Key insight: the reasoning traces help the model recover from errors and avoid hallucinating tool calls. Paper
Here’s what a minimal agent actually looks like in Python with LangChain:
from langchain.agents import create_react_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool
@tool
def order_lookup(order_id: str) -> dict:
"""Look up an order by ID. Returns status, total, and return window."""
return db.orders.find_one({"id": order_id})
@tool
def process_refund(order_id: str, amount: float) -> dict:
"""Process a refund for a given order."""
return payments.refund(order_id=order_id, amount=amount)
agent = create_react_agent(
model=ChatOpenAI(model="gpt-4"),
tools=[order_lookup, process_refund],
prompt="You are a customer service agent for Acme Corp..."
)
# The agent loops internally until it has a final answer
result = agent.invoke({"input": "Refund order #4821"})That’s about 15 lines of code. I was genuinely surprised the first time I wired this up: the create_react_agent function handles the Think → Act → Observe loop for you. You just define the tools and the prompt. If you’re thinking “that can’t be all there is,” you’re right. The loop is 15 lines. The other 10,000 lines are error handling, auth, and making sure it doesn't refund orders that were never placed
Who’s Actually Building With This
Let’s ground this in production reality.
Klarna went all-in on AI agents for customer service in early 2024. Their agent handled 2.3 million conversations in its first month, covering two-thirds of all customer chats, and cut average resolution time from 11 minutes to under 2. By Q3 2025, the agent was doing the equivalent work of 853 full-time employees and had saved $60 million. Then they had to rehire human agents. Customers were getting generic, templated answers on complex issues: billing disputes with multiple orders, edge-case refund policies, anything that required judgment across several systems (Fortune).
🏗️ Engineering Lesson: Klarna routed everything through the same agent path without adequate escalation for complex queries. Simple questions worked great. Multi-step reasoning across ambiguous inputs didn’t. The routing layer that decides what the agent should even attempt matters as much as the agent itself.
Coding agents are arguably the most mature category. GitHub Copilot, Cursor, and Claude Code all use the same Think → Act → Observe loop, except the tools are file system access, terminal commands, and test runners. You describe a bug, the agent reads your codebase, writes a fix, runs the tests, sees what fails, and tries again.
Shopify Sidekick is an agent that lets merchants manage their stores through natural language: analyze customer segments, update products, create discounts, generate reports. It pairs the agent loop with RAG to pull live product and customer data into context (we're covering how RAG works in Week 3). Under the hood, Shopify built LLM-powered judges to evaluate Sidekick's decisions, calibrating them against human judgment until agreement scores approached human-to-human baselines. They also built a merchant simulator that replays real conversations through candidate system changes before deploying them. Their engineering team presented the full architecture at ICML 2025 (Shopify Engineering).
What Can Go Wrong (and What’s Overhyped)
Tool call hallucinations. Agents sometimes invoke tools that don’t exist or pass arguments that don’t make sense. This is especially common with smaller models. Without fine-tuning or good few-shot examples, performance drops below even basic chain-of-thought prompting.
Compounding errors. Each step in the agent loop has a small chance of going wrong. Chain five steps together and you've compounded those error rates. The dangerous part is that the final answer still looks polished. You don't realize step five went wrong until a user reports it. The best frontier models complete about 24% of real-world knowledge work tasks correctly on the first attempt3.
The governance gap. Gartner predicts over 40% of agentic AI projects will be scrapped by 2027. The LLMs work fine. The problem is everything around them: identity management, audit trails, error handling, compliance4.
The hype gap. The phrase “year of the agent” has been thrown around since 2024, and we’re still mostly in pilot mode. Deloitte’s 2025 Tech Trends report found only about 1 in 10 organizations have agents in production, with another 38% running pilots. The technology works. The organizational readiness doesn’t. If someone tells you AI agents will automate your entire business by Q4, ask them how their last AI pilot went.
That said, the narrow, well-scoped agent is genuinely delivering value today. Handling refunds, triaging tickets, searching codebases, summarizing research. The pattern that works: constrained domains, clear tool definitions, and humans in the loop for edge cases.
The One Thing to Remember
An AI agent doesn’t make the LLM smarter. The intelligence was already there. The agent gives it agency: a terminal to act on what it already knows.
Where to Next?
📖 Go Deeper: Cursor vs Claude Code: the two most capable AI coding agents compared head-to-head. Which one fits your workflow?
🔗 Go simpler: What is RAG?: agents often use RAG to access external knowledge. If you’ve been hearing “RAG” everywhere, this explains why
🔀 Go adjacent: What is MCP?: the new standard for connecting LLMs to tools. Think USB-C for AI models.
This is Issue #1. I built this newsletter because every AI agent explainer I found either talked down to engineers or skipped the architecture/code entirely.
So tell me: did I get it right? What should I cover next? And if you’ve shipped an agent to production, what broke first? Leave a comment below.
Deloitte Tech Trends: https://www.deloitte.com/us/en/insights/topics/technology-management/tech-trends.html (December 2025. 11% in production, 38% piloting.)
ReAct paper (Yao et al.): https://arxiv.org/abs/2210.03629 (ICLR 2023)
APEX-Agents benchmark: https://arxiv.org/abs/2601.14242 (January 2026 paper by Mercor. Gemini 3 Flash at 24%, most frontier models around 18%.)
Gartner 40% prediction: https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027 (June 2025 press release)


