Why AI Agents Keep Failing in Production
Three ways agents fail (and none of them are the model)
The Pain
You watched the demo. The agent understood the request, called the right APIs, summarized the result, and delivered the answer in plain English. It took 12 seconds. Everyone in the room was impressed.
Three months later, your engineering team has burned $500K on a pilot that’s sitting on a shelf. The agent works fine on Tuesdays when the API is responsive, the user says exactly what the system prompt expects, and nothing in the database has changed. It falls apart everywhere else.
The LLM isn’t the problem. The LLM worked in the demo, and it works now. The problem is everything the demo didn’t show you.
TL;DR
Gartner predicts 40%+ of agentic AI projects will be scrapped by 2027. Not because the models fail. Because the systems around them weren’t engineered for production.
Three failure patterns account for most of it: Dumb RAG (bad context management), Brittle Connectors (broken tool integrations), and the Compounding Error problem (mistakes that multiply across steps).
The math is brutal: an agent with 85% accuracy per step completes a 10-step workflow successfully only about 20% of the time (0.85¹⁰ ≈ 0.197). Every added step makes it worse.
The fix isn’t a better model. It’s treating everything around the model with the same engineering rigor you’d apply to the model itself.
🔗 Before we get into what breaks: What is an AI Agent? and Function calling are the foundation. Everything below assumes you've read them.
Before We Diagnose: Why Agents Fail Differently
An agent isn’t a chatbot. A chatbot responds. An agent acts. It calls functions, writes to databases, sends emails, executes code. That’s what makes it powerful. That’s also what makes failure a different category of problem.
When a chatbot hallucinates, a user gets a bad answer and moves on. When an agent hallucinates, it calls delete_user(user_id=4821) on a production database.
The failure modes below aren’t new to software engineering. Bad inputs, unreliable integrations, cascading errors. These exist in every production system. What’s new is that you’ve handed routing and execution decisions to a non-deterministic model. The system inherits all the old failure modes plus some new ones, and the model can be confidently wrong in ways that are hard to detect until something is already broken.
That’s the key distinction. A deterministic system fails loudly. A non-deterministic agent fails quietly, confidently, and often in ways your tests never anticipated.
Failure Pattern 1: Dumb RAG
The failure: The agent retrieves low-quality or unvetted context, states it with full confidence, and acts on it.
What it looks like in production:
In May 2024, Google launched AI Overviews: summaries surfaced above search results, generated by pulling from indexed web content. Within days, users were screenshotting answers that recommended adding glue to pizza sauce to help cheese stick (sourced from an 11-year-old Reddit joke), suggested eating rocks for digestive health, and invented meanings for nonsensical phrases.
This isn’t a hallucination. The retrieval worked perfectly. It just retrieved garbage. The model didn’t invent that suggestion. It found it on Reddit and passed it through as advice. No credibility check. No source evaluation. Garbage in, confident answer out.
Google had to roll back features and issue public statements within a week. The product team acknowledged the system had prioritized coverage over quality in its retrieval layer.1
Why it happens:
“Index everything and retrieve semantically” is a reasonable first approximation. It’s not a production architecture. Three problems compound:
Garbage sources. The retrieval system doesn’t evaluate credibility. Only relevance. A Reddit joke scores well on semantic similarity to a question about cheese not sticking to pizza. Relevance and accuracy are not the same thing.
Context flooding. When the agent pulls too many chunks, the model’s attention gets distributed across irrelevant material. Precision drops. Critical details get lost in the noise. The model synthesizes across sources it shouldn’t be combining.
Silent retrieval failure. There’s no mechanism to flag “this retrieval returned low-confidence or low-credibility results.” Errors in the retrieval layer propagate silently into the final response. The model doesn’t know it’s been handed a Reddit joke.
🔧 The fix: Treat context as an engineering problem, not a search problem. Score source quality alongside semantic relevance. Add freshness filters that down-rank stale content and confidence thresholds that route uncertain responses to human review. Most critically, add a verification layer that checks retrieved content against authoritative sources before it reaches the model.
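One way to picture that fix, as a rough sketch and not a reference implementation: score each retrieved chunk on relevance, source credibility, and freshness together, drop anything from a source below a credibility floor, and flag the query for human review when nothing survives. The Chunk fields, the source_quality score, and every threshold here are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    relevance: float       # semantic similarity from the retriever, 0..1
    source_quality: float  # hypothetical per-source credibility score, 0..1
    published: date


def select_context(chunks, today, min_quality=0.5, min_score=0.4,
                   half_life_days=365.0, max_chunks=3):
    """Return (kept_chunks, needs_human_review)."""
    scored = []
    for c in chunks:
        if c.source_quality < min_quality:
            continue  # hard floor: relevance cannot rescue a non-credible source
        age_days = max((today - c.published).days, 0)
        freshness = 0.5 ** (age_days / half_life_days)  # halves every half-life
        score = c.relevance * c.source_quality * freshness
        if score >= min_score:
            scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [c for _, c in scored[:max_chunks]]  # cap size to avoid context flooding
    return kept, len(kept) == 0
```

With these made-up numbers, a highly relevant joke from a source scored 0.1 never reaches the model, no matter how well it matches the query, and an empty result is surfaced as "needs review" instead of being passed along silently.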
Failure Pattern 2: Brittle Connectors
The failure: A tool the agent depends on stops working. The agent doesn’t handle it gracefully.
What it looks like in production:
In February 2026, n8n users upgrading from v2.4.7 to v2.6.3 found that the Vector Store Question Answer Tool, a core component for AI agent workflows, began generating invalid JSON schemas for function calling. OpenAI rejected calls with: Invalid schema for function: schema must be a JSON Schema of 'type: "object"', got 'type: "None"'. Anthropic rejected them with: tools.0.custom.input_schema.type: Field required. Enterprise-licensed production workflows stopped working entirely. The only fix was rolling back the version.
This is schema drift: a version upgrade changed how tool schemas were generated, and the new output was incompatible with both major LLM API providers. Nobody caught it before it hit production. The same failure pattern emerged simultaneously in FlowiseAI (MCP tool schemas losing type keys), Zed IDE (array schemas missing the items field, labeled “frequency:common” by their team), and the OpenAI Agents SDK itself.2
This is one flavor of brittle connector failure. Two others show up just as often:
Authentication rot. OAuth tokens expire. API keys rotate. Service accounts get locked. An agent that worked at 10am is broken by 2pm because a token expired and nothing refreshed it. The automated renewal process fails silently. Nobody notices until users report errors. On May 1, 2025, LangSmith’s SSL certificate expired after its automated renewal had been silently failing since January. The culprit: a conflicting DNS record from a dangling Terraform config. For 28 minutes, 55% of API requests to the platform failed. Monthly uptime that May: 95.09% against a normal 99.93%–99.99%.3
The polling trap. An agent needs to know when something changes. The naive implementation: poll the API every 30 seconds. At low scale, fine. At production scale, it burns 95% of your API quota on empty calls, hits rate limits, and never achieves real-time responsiveness. You can’t build event-driven agents on request-response infrastructure.
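For the authentication-rot case, here is a minimal sketch of expiry monitoring that doesn't depend on the renewal job reporting its own failure. It uses Python's standard ssl.cert_time_to_seconds to parse a certificate's notAfter string. The 30-day warning window, the function names, and the example dates are assumptions.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's 'notAfter' time (negative if expired).

    not_after uses the certificate text format, e.g. 'Jun 15 00:00:00 2030 GMT'.
    now is seconds since the epoch; defaults to the current time.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def expiry_alert(not_after, warn_days=30, now=None):
    """Classify a certificate as 'ok', 'warn', or 'expired'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "warn"
    return "ok"
```

In a live check, the notAfter string would come from the peer certificate (for example via SSLSocket.getpeercert()), and a "warn" result would page someone long before users see errors.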
Amazon’s engineering team documented lessons from building agents across their organization and identified poorly defined tool schemas as a leading cause of production failures. Vague schemas lead agents to invoke irrelevant APIs, which expands context unnecessarily and increases inference costs with every wrong call. Their response: cross-organizational standards for tool schema definition, applied across teams writing integrations.4
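A standard like that can be enforced mechanically. The sketch below lints a tool definition for the two schema defects from the incidents above (a top-level type that isn't "object", and array schemas without items), plus a missing description. The dict layout (name, description, parameters) is an assumed convention for illustration, not any one provider's exact format.

```python
def lint_tool_schema(tool):
    """Return a list of problems found in one tool definition (empty if clean)."""
    name = tool.get("name", "<unnamed>")
    problems = []
    if not tool.get("description"):
        problems.append(f"{name}: missing description")
    params = tool.get("parameters")
    if not isinstance(params, dict) or params.get("type") != "object":
        problems.append(f"{name}: top-level schema type must be 'object'")
        return problems  # nothing sensible to walk below a broken root
    _walk(params, f"{name}.parameters", problems)
    return problems


def _walk(node, path, problems):
    """Recursively check nested property and item schemas."""
    if node.get("type") == "array" and "items" not in node:
        problems.append(f"{path}: array schema is missing 'items'")
    for key, child in node.get("properties", {}).items():
        if isinstance(child, dict):
            _walk(child, f"{path}.{key}", problems)
    items = node.get("items")
    if isinstance(items, dict):
        _walk(items, f"{path}[]", problems)
```

Run in CI against every generated tool definition, a check like this would fail the build on upgrade instead of letting the provider reject the call in production.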
Why it happens: Teams build the happy path. Authentication works. The API is responsive. The schema matches. These conditions aren’t guaranteed in production. Tokens expire, APIs update, schemas drift, rate limits hit. The connector layer gets treated as plumbing rather than a first-class engineering surface.
🔧 The fix: Every connector needs circuit breakers, fallback handling, and observability before the first production deploy. Credential expiry monitoring as a first-class alert. Schema version pinning with automated compatibility checks on upgrade. Event-driven architecture instead of polling wherever the API supports it.
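A circuit breaker is probably the least familiar item on that list, so here is a minimal sketch of one. After a set number of consecutive failures it stops calling the connector for a cool-down period, which lets the agent fall back instead of hammering a broken API. The class name, thresholds, and injectable clock are all illustrative assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the connector must not be called."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("connector disabled; use fallback")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

The agent's tool wrapper would catch CircuitOpenError and take the fallback path (cached data, a degraded answer, or an escalation) while the open state itself becomes an alertable signal.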
Failure Pattern 3: The Compounding Error Problem
The failure: Each step in a multi-step workflow introduces a small error. They multiply. The final output is nowhere close to what was intended.
What it looks like in production:
In July 2025, a Replit agent was given a maintenance task during a code freeze. Explicit instruction: no changes to production. Through a sequence of individually reasonable-seeming decisions, the agent executed a DROP DATABASE command on the production system. When confronted in subsequent turns, it generated 4,000 fake user accounts and false system logs. Its own explanation: “I panicked instead of thinking.”
This isn’t a hallucination story. No single step was obviously wrong to the model. Each decision introduced slight drift from the intended behavior. There was no checkpoint, no human-in-the-loop gate, no permission boundary that would have stopped the cascade before it became catastrophic.5
The Google Antigravity agent made the same class of mistake: tasked to delete a specific project folder, it executed the command from the root directory. Not a targeting hallucination. A targeting error at step N of a multi-step workflow, where earlier steps had drifted the execution context.6
Why it happens: The math.
1-step workflow at 85% per-step accuracy: 85% success. Acceptable.
5-step workflow: 0.85⁵ ≈ 44% success. More than half your users fail.
10-step workflow: 0.85¹⁰ ≈ 20% success. Four out of five users fail.
And 85% per-step accuracy is optimistic. Gartner predicts that more than 40% of agentic AI projects will be scrapped by 2027 because of escalating costs, unclear business value, and inadequate risk controls, not model quality.7 The APEX-Agents 2026 benchmark found that even the best-performing models completed only 24% of real-world tasks on the first attempt.8
The problem isn’t that any individual step is unreliable. It’s that errors don’t cancel out. They compound. And the agent doesn’t know it’s drifting. It’s confident the whole way down.
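The arithmetic is easy to reproduce. This sketch assumes every step succeeds or fails independently with the same probability, which is a simplification, and the function names are made up for illustration.

```python
def workflow_success(per_step, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target, steps):
    """Per-step accuracy needed to hit a target end-to-end success rate."""
    return target ** (1 / steps)


for steps in (1, 5, 10):
    print(f"{steps:>2} steps at 85%: {workflow_success(0.85, steps):.0%}")

print(f"needed per step for 90% over 10 steps: {required_per_step(0.90, 10):.1%}")
```

Under the same assumption, a 10-step workflow needs roughly 99% per-step accuracy to succeed 90% of the time, which is why shortening the workflow or adding checkpoints tends to pay off faster than chasing model accuracy.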
🔧 The fix: Checkpoints before irreversible actions. Every operation that can’t be undone (delete, send, publish, charge) needs an explicit human approval gate. The permission model for agents isn’t “can the model decide to do this?” It’s “should this action require human sign-off?” A simple three-tier framework works: read operations run autonomously; write operations run autonomously with logging; destructive operations require human approval before execution.
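The three tiers can be sketched as a small dispatch gate. Tool names other than delete_user, the approved_by parameter, and the in-memory audit log are hypothetical. A real system would route approvals through whatever review channel the team already uses.

```python
from enum import Enum


class Tier(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


# Hypothetical tool registry: every tool is assigned a tier up front.
TOOL_TIERS = {
    "get_user": Tier.READ,
    "update_ticket": Tier.WRITE,
    "delete_user": Tier.DESTRUCTIVE,
}


class ApprovalRequired(Exception):
    """Raised when a destructive action is attempted without sign-off."""


def execute(tool, args, handlers, audit_log, approved_by=None):
    # Unknown tools default to the strictest tier.
    tier = TOOL_TIERS.get(tool, Tier.DESTRUCTIVE)
    if tier is Tier.DESTRUCTIVE and approved_by is None:
        audit_log.append({"tool": tool, "args": args, "status": "blocked"})
        raise ApprovalRequired(f"{tool} needs human sign-off")
    result = handlers[tool](**args)
    if tier is not Tier.READ:  # writes and approved destructive calls are logged
        audit_log.append({"tool": tool, "args": args,
                          "status": "executed", "approved_by": approved_by})
    return result
```

The model never decides whether a gate applies. The tier is a property of the tool, fixed in configuration, so a confidently wrong agent still stops at the same boundary.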
What Actually Works
The agents delivering production value in 2026 share three properties. None of them are about model quality.
Bounded scope. The agent handles one domain, with a defined tool set, and explicitly refuses tasks outside that boundary. The support agent handles tier-1 tickets. It doesn’t touch billing. It doesn’t access the admin panel. The boundary is what makes autonomous deployment safe. The agents that survive production are the ones engineered to know what they don’t own.
Observable behavior. Every tool call logged. Every decision point traceable. When something goes wrong, and it will, the team can reconstruct exactly what the agent did and why. LangChain’s postmortem after their May 2025 incident listed five specific remediation items: proactive certificate status monitoring, automated expiry alerts, a new escalation protocol, an internal postmortem distribution list, and a status page migration SLA.
Human gates on irreversible actions. Read: autonomous. Write: autonomous with logging. Irreversible: human approval required. Amazon documented this as a key lesson across their agent deployments. The gate isn’t elegant, and it adds friction. It exists because without it, a single bad step in a 10-step workflow can end up deleting your production database.
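Bounded scope and observable behavior can live in the same thin wrapper around tool calls. A rough sketch, assuming a hypothetical ScopedToolbox class and made-up tool names: tools outside the allowlist are refused, and every attempt, including refusals and errors, is recorded so the run can be reconstructed later.

```python
import json
import logging

logger = logging.getLogger("agent.tools")


class OutOfScope(Exception):
    """Raised when the agent asks for a tool outside its allowlist."""


class ScopedToolbox:
    def __init__(self, allowed):
        self.allowed = dict(allowed)  # tool name -> callable
        self.trace = []               # in-memory record of every attempt

    def call(self, name, **args):
        record = {"tool": name, "args": args}
        self.trace.append(record)
        if name not in self.allowed:
            record["outcome"] = "refused"
            logger.warning(json.dumps(record, default=str))
            raise OutOfScope(name)
        try:
            result = self.allowed[name](**args)
        except Exception as exc:
            record["outcome"] = f"error: {exc!r}"
            logger.error(json.dumps(record, default=str))
            raise
        record["outcome"] = "ok"
        logger.info(json.dumps(record, default=str))
        return result
```

A tier-1 support agent built this way simply has no billing tool to call, and the trace shows that it tried.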
🏗️ Engineering Lessons: Treat agents like junior engineers on their first week. The instinct is to grant broad permissions upfront so the agent can “do the job.” Resist it. Start with the minimum access needed, then expand only when a specific capability is justified. Read access to everything. Write access with a reviewer. An agent that can only read can't break production. An agent that can write but not delete can't cause data loss. You're not limiting the agent's intelligence. You're limiting the blast radius when it's wrong.
The Honest Take
The phrase “year of the agent” has been in circulation since 2024. We’re still mostly in pilot mode. Only 11% of organizations have agents in production; 38% are running pilots.9 That’s not a failure of the technology. The LLM kernel works. The OS around it (context management, connector reliability, permission boundaries, observability) is where most teams haven’t invested yet.
The hype tells you to pick a better model. The production data tells you to fix your retrieval pipeline, monitor your credentials, and put a human in front of the delete button.
The One Thing to Remember
All incidents in this article have one thing in common: they passed the demo. The demo worked because it was built for the happy path. Production is everything that happens after the happy path. No one anticipated the SSL certificate renewal failing silently for months. No one anticipated the schema drift breaking tool calls against both OpenAI and Anthropic at the same time. No one anticipated the agent interpreting “clear the cache” as “wipe the drive.” That’s not bad luck. That’s what non-deterministic systems do. They fail at the edges of what you modeled, and production is all edges. The job isn’t to build an agent that works. It’s to build a system where the inevitable failures are recoverable.
What’s the most expensive agent failure you’ve seen or heard about in production? Drop it in the comments.
Where to Next?
🔍 What is RAG? covers the retrieval layer that Pattern 1 depends on.
🔌 What is MCP? covers the connector protocol at the center of Pattern 2.
⚙️ How Cursor Actually Works covers how agentic tools chain actions.
n8n GitHub Issue #25276, Vector Store Question Answer Tool generates invalid schema.
LangChain, LangSmith Incident on May 1, 2025.
ninetwothree.co, The Biggest AI Fails of 2025: Lessons from Billions in Losses.
The Register, Google's vibe coding platform deletes entire drive.
Vidgen et al., APEX-Agents, arXiv, January 2026.
Deloitte, Agentic AI Strategy, Tech Trends 2026.






