In November 2022, a grieving customer asked the chatbot on a major airline’s website how its bereavement fares worked. The chatbot told him he could apply for the discount after booking, retroactively. He booked, then learned that the airline’s actual bereavement policy did not allow retroactive claims. When he took the dispute to the British Columbia Civil Resolution Tribunal, the airline argued that its chatbot was, in effect, a separate entity responsible for its own answers. The tribunal called that a remarkable submission and held the company liable.
That case, Moffatt v. Air Canada, is worth keeping in mind every time anyone ships an AI agent into production, because it captures the real risk so precisely. The agent did not crash. It did not throw an error. It confidently produced a wrong answer, and the cost landed on the business. Most teams building agentic systems are optimising for what the agent can do. The harder and more important question is what happens when it fails, because it will, and failure in these systems is usually silent.
Why agentic systems fail differently
A multi-step agent is a chain. It reasons, calls a tool, reads the result, reasons again, calls another tool, and continues until it believes the task is done. Every link is a place where things can go wrong, and the failures do not look like traditional software failures. There is rarely a stack trace. The agent simply drifts off course and keeps going with total confidence. Designing for that means starting from the failure modes rather than the happy path.
Across the agentic systems I have built and run in production, four failure modes account for almost everything that goes wrong.
- Context exhaustion. The agent runs out of usable context partway through a long task and quietly loses track of what it was doing.
- Tool call loops. The agent calls a tool, dislikes the result, calls it again with a tiny variation, and gets stuck repeating itself.
- Ambiguous routing. The agent reaches a fork where more than one next step looks valid, and it picks wrongly with no signal that the choice was a guess.
- State loss. A step fails or restarts, and the agent has no durable record of how far it had got, so it either redoes work or skips it.
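State loss is the easiest of the four to picture in code. What follows is a minimal sketch, not a prescribed implementation: `createCheckpointStore` and `runSteps` are hypothetical names, and the in-memory `Map` is an assumption standing in for whatever durable storage (a database row, a file) a real system would use.

```javascript
// Hypothetical sketch: record finished steps so a restarted run resumes
// where it stopped instead of redoing or skipping work.
// The Map is a stand-in for durable storage; that choice is an assumption.
function createCheckpointStore() {
  const completed = new Map(); // runId -> Set of finished step names
  const finished = (runId) => completed.get(runId) ?? new Set();
  return {
    isDone: (runId, stepName) => finished(runId).has(stepName),
    markDone: (runId, stepName) => {
      completed.set(runId, finished(runId).add(stepName));
    },
  };
}

// Runs steps in order and returns the names of the steps executed this time.
// A step is recorded only after it returns, so a crash mid-step means that
// one step is repeated on the next run and nothing after it is skipped.
function runSteps(runId, steps, store) {
  const executed = [];
  for (const step of steps) {
    if (store.isDone(runId, step.name)) continue; // finished in an earlier run
    step.run();
    store.markDone(runId, step.name);
    executed.push(step.name);
  }
  return executed;
}
```

If the second of three steps throws, calling `runSteps` again with the same `runId` skips the first step and picks up at the second. Repeating a half-finished step is only safe if that step is idempotent, which this sketch assumes rather than enforces.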
FRAME: five layers built around failure
The model I keep returning to is FRAME, which stands for Failure-Recovery Architecture for Multi-step Execution. It is not a library you install. It is five layers of thinking you apply to any agent, whatever model or tooling sits underneath. Each layer has one job.
- Failure classification. Before writing any recovery logic, name the ways this specific agent can fail. The four modes above are a starting set. Make the agent’s failures a finite, named list rather than an open-ended surprise.
- Recovery logic. For each named failure, define one concrete response in advance. A tool call loop gets a hard attempt limit and a fallback. Context exhaustion triggers a summarise-and-continue step. The recovery is decided beforehand, not improvised mid-run.
- Awareness boundaries. Decide what each step is allowed to see and change. A step that only needs to read should not be able to write. Scoping the agent’s reach is what stops a small mistake from becoming a large one.
- Monitoring hooks. Instrument every transition so you can see, after the fact, where an agent went off course. Without this, a silent failure is invisible until a customer reports it, which is exactly how the airline found out.
- Escalation protocol. Define the point at which the agent stops and hands to a human, and make that handoff graceful rather than a dead end. An agent that knows when to give up is more trustworthy than one that always produces an answer.
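The five layers above can be sketched as plain data and a few small functions. This is one possible shape under stated assumptions, not a reference implementation: every name (`FailureMode`, `recoveryPlan`, `assertAllowed`, `recordTransition`, `handleFailure`) is hypothetical, and the recovery actions listed for ambiguous routing and state loss are my own placeholders, since only the responses to tool call loops and context exhaustion are spelled out above.

```javascript
// Hypothetical sketch of the FRAME layers. All names are illustrative.

// Layer 1: failure classification, as a closed, named set.
const FailureMode = Object.freeze({
  CONTEXT_EXHAUSTION: 'context_exhaustion',
  TOOL_CALL_LOOP: 'tool_call_loop',
  AMBIGUOUS_ROUTING: 'ambiguous_routing',
  STATE_LOSS: 'state_loss',
});

// Layer 2: recovery logic, one response per named failure, chosen in advance.
// The last two actions are assumed placeholders.
const recoveryPlan = Object.freeze({
  [FailureMode.CONTEXT_EXHAUSTION]: 'summarise_and_continue',
  [FailureMode.TOOL_CALL_LOOP]: 'stop_and_fall_back',
  [FailureMode.AMBIGUOUS_ROUTING]: 'ask_for_clarification',
  [FailureMode.STATE_LOSS]: 'resume_from_checkpoint',
});

// Layer 3: awareness boundaries. Each step declares what it may do,
// and anything outside that list is refused before it happens.
function assertAllowed(step, action) {
  if (!step.allowed.includes(action)) {
    throw new Error(`step "${step.name}" may not ${action}`);
  }
}

// Layer 4: monitoring hooks. Every transition is recorded for later review.
const transitions = [];
function recordTransition(from, to, detail = {}) {
  transitions.push({ from, to, detail, at: Date.now() });
}

// Layer 5: escalation protocol. A failure with no planned recovery is handed
// to a human along with its context, rather than being improvised around.
function handleFailure(mode, context) {
  const action = recoveryPlan[mode];
  recordTransition('running', action ?? 'escalated', { mode });
  if (!action) {
    return { status: 'escalated', reason: `unclassified failure: ${mode}`, context };
  }
  return { status: 'recovering', action, context };
}
```

Calling `handleFailure(FailureMode.TOOL_CALL_LOOP, { query })` returns the planned recovery, while any failure outside the named set escalates with its context attached. Keeping the plan as frozen data is a design choice that makes the list of failures reviewable in one place.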
Recovery in practice: the tool call loop
To make this concrete, take the most common failure I see, the tool call loop. An agent calls a search tool, the result is not quite what it wanted, so it calls again with a slightly reworded query, and again, and again. Left alone it will burn through its budget making near-identical calls, each one feeling locally reasonable. The recovery logic is not sophisticated. You cap the attempts, and you decide in advance what happens when the cap is hit.
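    // Bounded tool retry: either a good-enough result or a clean escalation.
    // callTool, isGoodEnough, refine, escalate and MAX_TOOL_ATTEMPTS are defined elsewhere.
    async function searchWithLimit(initialQuery) {
      let query = initialQuery;
      let attempts = 0;
      while (attempts < MAX_TOOL_ATTEMPTS) {
        const result = await callTool(query);
        if (isGoodEnough(result)) return result;
        // Reword the query using what the last result showed.
        query = refine(query, result);
        attempts++;
      }
      // Cap reached: stop retrying and hand off with the last query tried.
      return escalate('tool loop hit attempt limit', { query });
    }

That single guard converts an open-ended, budget-eating loop into a bounded operation that either succeeds or escalates cleanly. Every one of the four failure modes gets a guard like this, decided ahead of time rather than discovered at runtime. The point is not the specific limit you choose. It is that the agent can no longer fail in an unbounded way, because you named the failure and gave it an exit.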
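The same pattern applies to context exhaustion, where the predefined response is summarise-and-continue. The sketch below is an illustration under loud assumptions: `fitContext` and `estimateTokens` are hypothetical names, messages are plain strings, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and `summarise` is whatever summarising function the caller supplies.

```javascript
// Hypothetical sketch of a context-exhaustion guard (summarise-and-continue).
// Assumption: roughly 4 characters per token. This is not a real tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Returns { status, messages } where status is one of:
//   'ok'        - history already fits the budget, returned unchanged
//   'compacted' - older messages were replaced by one summary and now fit
//   'escalate'  - still over budget after compaction, so stop and hand off
function fitContext(messages, maxTokens, summarise, keepRecent = 2) {
  const size = (msgs) => msgs.reduce((sum, m) => sum + estimateTokens(m), 0);
  if (size(messages) <= maxTokens) return { status: 'ok', messages };

  // Keep the most recent messages verbatim and summarise everything before them.
  const cut = Math.max(0, messages.length - keepRecent);
  const compacted = cut > 0
    ? [summarise(messages.slice(0, cut)), ...messages.slice(cut)]
    : messages; // nothing older to summarise

  if (size(compacted) > maxTokens) return { status: 'escalate', messages: compacted };
  return { status: 'compacted', messages: compacted };
}
```

As with the attempt cap, the specific budget matters less than the fact that there are only three outcomes, and the third one is an explicit handoff rather than an agent quietly losing track of its task.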
The shift this forces
The reason most agents look impressive in a demo and disappoint in production is that demos exercise the happy path and production exercises everything else. FRAME flips the order in which you design. You begin by enumerating how the thing breaks, you attach a defined response to each break, you constrain what each step can touch, you make failures observable, and you give the system a dignified way to stop. The capability comes afterwards, inside those guardrails.
The airline’s defence, that the chatbot was somehow its own responsible entity, failed for an obvious reason. The agent was part of the business, and so were its mistakes. That is the right mental model for anyone deploying these systems. Your agent’s failures are your failures. The only question is whether you designed for them on purpose, or discovered them the way Air Canada did, in front of a tribunal.
A reliable agent is not one that never fails. No such thing exists. It is one whose failures are named, contained, visible, and recoverable. Build the architecture around that and you ship something you can defend, in production and, if it comes to it, anywhere else.
