How AI Agents Actually Fail in Production

The AI Agent Discourse Has an Optimism Problem

Almost everything written about AI agents is written before they hit production. The demos work. The benchmarks are impressive. The architecture diagrams are clean. Then someone actually runs the thing across a live client project, and the gap between what the agent was supposed to do and what it actually does starts to become visible in specific, repeatable, sometimes expensive ways.

I have been building agentic systems in production for the past year - a 28-skill AI-native delivery pipeline that runs across multiple live client engagements, a 13-skill dev workflow system used on every BetaCraft project, a macOS voice-to-prompt app that processes input through eight specialist modes. None of these failed in the ways the theoretical risk lists predicted. They failed in different ways. More interesting ways. The kind of ways you only see when something is actually running.

This is the sequel to my earlier piece on the agentic harness - the scaffolding that makes agents useful. If the harness is what makes agents work, this is the catalogue of what breaks them when the harness is wrong, incomplete, or absent. Written from production, not from a demo.

Pull Quote · 01
AI agents do not fail the way the risk lists predict. They fail in smaller, more specific, more repeatable ways. The kind you only see after the third time the same thing breaks.

Failure Mode One: Agents Fail at Boundaries, Not at Tasks

The most reliable observation I have from running agents in production: a well-configured agent almost never fails at its core task. It fails at the edges of its domain. Give a code agent a clear spec and it will write the code. Give it a spec that requires it to make a product decision and it will make a product decision - confidently, often wrongly, and without flagging that it has left its lane.

On early versions of the PM-Dev Closed Loop system, I had a single agent trying to handle both PM-layer and dev-layer work. The failure was not that it could not write code or could not draft a client email. It was that when a task landed in the grey zone between the two domains - "translate this client requirement into a technical spec" - the agent would make assumptions about what the client wanted without the context to validate them, or make assumptions about what was technically feasible without the codebase context to support them. Confident, plausible, wrong.

The solution was the bridge pattern: a hard boundary between the PM agent (BAM) and the dev agent (TPM), with a shared folder as the only interface. BAM writes REQUIREMENT-NNN.md. TPM reads it and builds. Neither agent can reach into the other's domain. The boundary is structural, not instructional - it is not "please stay in your lane," it is "this is the only lane that exists."
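A minimal sketch of the bridge idea, assuming a plain directory as the shared folder. The class names `RequirementWriter` and `RequirementReader` are hypothetical, as is the rule for picking the next number; only the REQUIREMENT-NNN.md naming comes from the system described above. Each agent is handed one narrow object, so the other side's operations simply do not exist on it.

```python
import re
from pathlib import Path

# Naming follows REQUIREMENT-NNN.md; the zero-padded counter is an assumption.
REQUIREMENT_NAME = re.compile(r"^REQUIREMENT-(\d{3,})\.md$")


def _numbered_requirements(folder: Path) -> list:
    """Return (number, path) pairs for requirement files, sorted by number."""
    found = []
    for path in folder.iterdir():
        match = REQUIREMENT_NAME.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


class RequirementWriter:
    """Hypothetical handle for the PM agent: it can add requirements, nothing else."""

    def __init__(self, bridge: Path) -> None:
        self._bridge = bridge
        self._bridge.mkdir(parents=True, exist_ok=True)

    def write(self, body: str) -> Path:
        existing = _numbered_requirements(self._bridge)
        next_number = existing[-1][0] + 1 if existing else 1
        path = self._bridge / f"REQUIREMENT-{next_number:03d}.md"
        # Mode "x" raises FileExistsError instead of overwriting a requirement.
        with path.open("x", encoding="utf-8") as handle:
            handle.write(body)
        return path


class RequirementReader:
    """Hypothetical handle for the dev agent: it can read requirements, nothing else."""

    def __init__(self, bridge: Path) -> None:
        self._bridge = bridge

    def read_all(self) -> list:
        return [
            path.read_text(encoding="utf-8")
            for _, path in _numbered_requirements(self._bridge)
        ]
```

In Python this separation is only a convention. In a real deployment the enforcement would presumably come from file permissions or sandboxing rather than from class design.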

The lesson generalised. Every agentic system I have built since has defined explicit domain boundaries first, before writing a single prompt. The domain boundary is an architectural decision, not a configuration setting. If you leave it implicit, the agent will find its own boundary - usually at the worst possible moment.

Signs you have a boundary problem

The agent is "helpful" in the wrong direction: It answers questions you did not ask, often substituting assumptions for missing context. Helpfulness without constraint is a boundary failure.
Outputs are plausible but unverifiable: The agent produces work that looks right but cannot be checked against a source of truth because it has drifted outside the domain where sources of truth exist.
The same task produces different results in different sessions: No consistency means no grounding. The agent is reconstructing its context from scratch each time because the domain boundary was never made explicit.

Failure Mode Two: Context Degradation Across Long Sessions

Context windows are large now. Larger than anyone expected three years ago. But large is not infinite, and the way context degrades as a session extends is not gradual and even - it is sudden and uneven. The agent stops referencing earlier decisions. It starts repeating work it has already done. It loses track of constraints that were established at the start of the session but are now buried under thousands of tokens of subsequent output.

On the vibe-* framework, this was the failure that drove the decision to produce structured document files at the end of every phase rather than relying on conversational context. FEATURE_SPEC.md, FEATURE_PLAN.md, FEATURE_TASKS.md - these are not just organisational artefacts. They are context anchors. When a new session starts, the agent reads these files before doing anything else. It is not resuming a conversation. It is resuming a state.

The pattern that works: treat every session boundary as a context reset and design for it deliberately. If information needs to survive a session boundary, it needs to be written to a file. If a decision needs to be available in the next session, it needs to be logged. The agent's conversational memory is ephemeral. The file system is not.
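As a rough illustration of designing for a context reset, the sketch below logs decisions to a file and rebuilds a session preamble from the anchor files. The three FEATURE_* names come from the framework described above; `DECISIONS.md`, `log_decision`, and `build_session_preamble` are hypothetical names chosen for this example, not part of the actual system.

```python
from datetime import datetime, timezone
from pathlib import Path

ANCHOR_FILES = ("FEATURE_SPEC.md", "FEATURE_PLAN.md", "FEATURE_TASKS.md")
DECISION_LOG = "DECISIONS.md"  # hypothetical file name, an assumption of this sketch


def log_decision(state_dir: Path, decision: str) -> None:
    """Append a timestamped decision so it survives the end of the session."""
    state_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with (state_dir / DECISION_LOG).open("a", encoding="utf-8") as handle:
        handle.write(f"- {stamp} {decision}\n")


def build_session_preamble(state_dir: Path) -> str:
    """Assemble the text a new session reads before doing anything else."""
    sections = []
    for name in (*ANCHOR_FILES, DECISION_LOG):
        path = state_dir / name
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
        else:
            # Make a missing anchor visible instead of silently skipping it.
            body = "(missing)"
        sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)
```

The preamble is built only from files, so a brand-new session and a resumed one start from the same state.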

The deeper issue is that context degradation is invisible until it causes a problem. The agent does not announce "I have forgotten the constraints from two hours ago." It just starts behaving as if they were never set. The only defence is a structured state management approach that does not rely on the agent remembering things across time.

Pull Quote · 03
The agent does not announce when it has forgotten a constraint. It just starts behaving as if the constraint was never set. Design for context reset, not context persistence.

Failure Mode Three: The Proposal Layer Antipattern

This one took me a while to identify because it is subtle. When you first build an agentic system, there is a temptation to add a proposal layer - a step where the agent drafts what it plans to do before it does it, and you approve or reject the draft. This feels responsible. It feels like the right way to keep a human in the loop.

In practice, for most tasks, it creates a friction trap. The agent produces a proposal. You read it, approve it, and the agent does exactly what it said it would do. You have added a round-trip to every action without adding any real decision point. Worse, the proposal step trains you to skim - because most proposals are approved - which means the one proposal that has a real problem in it gets the same shallow attention as all the ones that do not.

I removed the proposal layer from the vibe-* system early on after watching it slow down sessions without catching errors that the review gate at the end would have caught anyway. The replacement architecture was direct writes with session logging - the agent acts, logs what it did, and the human reviews the log. The review gate (vibe-review) is the accountability mechanism, not the proposal step.

The distinction matters: a review gate is a structural checkpoint where a human makes a real decision about whether work meets the bar. A proposal layer is a bureaucratic step where a human approves a plan that the agent would have executed anyway. The first adds accountability. The second adds friction while simulating accountability.
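The direct-write-plus-log arrangement can be sketched roughly as follows. This is an illustration under assumptions, not the vibe-* implementation: `Session`, `LoggedAction`, `review_gate`, and the `SESSION_LOG.json` file name are all hypothetical. The agent writes immediately, every write is recorded, and the human sees the whole log once at the end.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class LoggedAction:
    path: str
    summary: str


@dataclass
class Session:
    workdir: Path
    actions: List[LoggedAction] = field(default_factory=list)

    def write_file(self, relative_path: str, content: str, summary: str) -> None:
        """Act immediately (no approval step), then record what was done."""
        target = self.workdir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        self.actions.append(LoggedAction(relative_path, summary))

    def save_log(self) -> Path:
        """Persist the action log so the review does not depend on chat history."""
        log_path = self.workdir / "SESSION_LOG.json"
        payload = json.dumps([asdict(action) for action in self.actions], indent=2)
        log_path.write_text(payload, encoding="utf-8")
        return log_path


def review_gate(session: Session, approve: Callable[[List[LoggedAction]], bool]) -> bool:
    """Single checkpoint at the end: the reviewer sees the full log at once."""
    return bool(approve(session.actions))
```

Here `approve` stands in for the human reviewer. The point of the shape is that there is one real decision per session instead of one rubber stamp per action.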

Failure Mode Four: Human Gates in the Wrong Places

This is related to the proposal problem, but different. Where human oversight should sit in an agentic pipeline is not obvious, and getting it wrong in either direction is expensive. Too many gates and the agent is just an autocomplete tool that requires approval for everything. Too few and consequential decisions get made without a human in the path.

The principle that helped me calibrate this: human gates should sit at decision points, not at execution points. A decision point is a moment where the output of the next step depends on a value judgement that the agent cannot make reliably from context alone - whether a feature is in or out of scope, whether a client communication should be firm or accommodating, whether a bug fix should address the symptom or the root cause. An execution point is where the agent is doing something it is good at - writing code to a spec, producing a document from a template, generating test cases from a feature description.

In the PM-Dev Closed Loop, the human gates are at propose-decision (before a requirement enters the bridge) and vibe-review (before code merges). Both are decision points - "is this what we want to build?" and "does this meet the bar?" Everything between them is execution, and the agents run without interruption. The sessions are faster, the agents are more effective, and the gates that matter actually get attention because there are only two of them.
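One way to picture the two-gate layout is a pipeline whose stages are either gates or execution steps. The sketch below is hypothetical: `Gate`, `Execute`, `run_pipeline`, and the `write-spec` and `write-code` step names are invented for illustration, and only the gate names propose-decision and vibe-review come from the system described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Gate:
    """A decision point: a human answers a yes/no question about the work so far."""
    name: str
    question: str


@dataclass
class Execute:
    """An execution point: the agent transforms the work without interruption."""
    name: str
    run: Callable[[str], str]


Stage = Union[Gate, Execute]


def run_pipeline(
    stages: List[Stage],
    work: str,
    ask_human: Callable[[str, str], bool],
) -> Tuple[str, List[str]]:
    """Run stages in order, pausing for a human only at Gate stages."""
    trace: List[str] = []
    for stage in stages:
        if isinstance(stage, Gate):
            if not ask_human(stage.question, work):
                trace.append(f"{stage.name}: rejected")
                break  # a rejected gate stops everything downstream
            trace.append(f"{stage.name}: approved")
        else:
            work = stage.run(work)
            trace.append(f"{stage.name}: done")
    return work, trace
```

With stages ordered as gate, execute, execute, gate, `ask_human` is called at most twice however many execution steps sit between the gates.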

What Actually Works: The Four Principles

After a year of running agentic systems in production, the design principles that have held up are simpler than I expected when I started.

Make domain boundaries structural, not instructional: Do not tell agents to stay in their lane. Build the lane so that stepping outside it is architecturally impossible. The bridge folder, the separation between BAM and TPM, the read-only skill files - all of these are structural boundaries, not prompt instructions.
Design every session as if it starts from zero: Write state to files. Log decisions. Structure the session opening to load context explicitly rather than assuming the agent remembers. The context window is a working memory, not a database.
Replace proposal layers with review gates: Let the agent act and log what it did. Review the output at a structural checkpoint, not at every step. The review gate is where accountability lives - not the approval step before execution.
Put human gates at decision points, not execution points: A human in the loop is valuable at "what should we build?" and "does this meet the bar?" It is overhead at "write the code for this spec" and "generate the test cases."

None of these is obvious before you hit the failure that teaches it. The boundary failure is invisible until the agent makes a confident product decision it had no business making. The context degradation is invisible until an agent contradicts a decision from two hours earlier. The proposal trap is invisible until you realise you have been approving every proposal without reading it carefully.

The optimism problem in AI agent discourse is not that people are wrong about what agents can do. It is that the failure modes are underrepresented because you only learn them by building something that actually runs. The demos never show the third hour of a long session when context has degraded and the agent is working from a corrupted understanding of the task. The architecture diagrams never show the boundary failure that happens when a task lands in the grey zone between two domains. Those are the moments that shape the design of systems that actually work.

Takeaway
Agents fail at boundaries, not at tasks. They forget across sessions. Proposal layers feel like accountability when they are just friction. And agents need human gates at decision points, not execution points.

The design principle
Build the lane so stepping outside it is architecturally impossible. Instructions are not enough.