AI Demos Fail for a Boring Reason: Recovery

There is a pattern in modern AI products that is easy to miss.

The impressive part is usually the reasoning.

The disappointing part is usually the recovery.

A system writes a good draft, but crashes when a tool times out. It successfully completes three steps in a workflow, then duplicates the fourth because the process restarted. It collects useful information, then loses the trail because memory was stored in the wrong place. It waits for a human approval, then resumes with stale context. It calls an API twice because the model “forgot” the first call had already committed.

None of these failures are exotic.

They are ordinary software failures.

They just happen inside “agent” products.

This matters because useful automation is not defined by a perfect path. It is defined by what happens when the path is imperfect.

Reliable AI is not only about what the agent can do when everything goes right.

It is about what the system can preserve when everything goes wrong.

Recovery is where demos become products

A demo usually shows the happy path.

user asks
↓
agent plans
↓
agent calls tool
↓
agent returns result

A product has to survive the unhappy path.

user asks
↓
agent plans
↓
tool times out
↓
retry budget applies
↓
state is preserved
↓
side effects are checked
↓
workflow resumes
↓
human approval still exists
↓
result completes correctly

The second version is less exciting to demo.

It is also the version users actually need.
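The "retry budget applies" step in the unhappy path can be sketched in a few lines of Python. This is a minimal illustration, not MirrorNeuron's implementation: `call_with_retry_budget`, `RetryBudgetExhausted`, and the default limits are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExhausted(Exception):
    """Raised when a tool keeps timing out and the retry budget is spent."""


def call_with_retry_budget(
    tool: Callable[[], T],
    max_attempts: int = 3,          # hypothetical budget
    base_delay_s: float = 0.5,      # hypothetical first backoff delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying on TimeoutError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TimeoutError as exc:
            if attempt == max_attempts:
                # Budget spent: surface the failure so the workflow can
                # pause cleanly instead of retrying blindly.
                raise RetryBudgetExhausted(
                    f"gave up after {attempt} attempts"
                ) from exc
            # Delays grow as 0.5s, 1s, 2s, ... with the default base delay.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The point of the budget is that the failure becomes an explicit event the workflow can record and act on, rather than an endless loop that quietly increases cost.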

Temporal’s durable execution documentation makes the same point in traditional workflow language: a workflow execution is durable, reliable, and scalable, and recovery uses the recorded event history so that execution can resume from the latest recorded state.^{Temporal Workflow Execution}

AI workflows inherit all of that complexity and add nondeterministic model behavior on top.

That is why recovery cannot be an afterthought.

The world is full of interruptions

Real workflows are interrupted by ordinary events:

Interruption	Product symptom	Runtime question
API timeout	The workflow stalls or retries blindly.	What was the retry budget?
Rate limit	The agent keeps trying and increases cost.	Should the workflow back off, sleep, or reroute?
Worker crash	Progress disappears.	Which step was last committed?
Restart	The system repeats work.	Which side effects already happened?
Invalid external data	The model reasons from bad input.	Was validation performed before commit?
Human delay	The workflow resumes with stale state.	What changed while waiting?
Tool schema mismatch	The agent calls the right tool incorrectly.	Was parameter accuracy checked?
Partial side effect	The API succeeded but the local process failed.	How is idempotency enforced?

A runtime that treats an AI workflow like a temporary script will fail in exactly these moments.

A runtime that treats the workflow as durable can stop, record, resume, and continue.

That difference is the difference between a toy and a system.

Recovery is a user experience feature

People often talk about reliability as if it were back-end plumbing.

For AI workflows, recovery is visible to the user.

A user notices when:

the same email is sent twice
the workflow starts over from scratch
a draft disappears after a restart
human approvals get lost
yesterday’s context overrides today’s instruction
long-running work silently dies
the system asks the user to explain everything again

These are not only engineering failures.

They are product failures.

MirrorNeuron treats recovery as part of the product promise: durable workflows, explicit state, retries, sleep and resume, and the ability to run workflows from a laptop to a cluster without changing the workflow idea.^{MirrorNeuron Home}^{MirrorNeuron Docs}

The core benchmark: Fault Recovery Rate

For customers, the recovery benchmark should be a hard number.

fault_recovery_rate =
  workflows_completed_correctly_after_injected_failures
  / workflows_with_injected_failures

A serious runtime should report this across a fault-injection suite, not just claim it abstractly.

MirrorNeuron's current internal benchmark result is:

fault recovery rate: 99.2%
benchmark base: 124 / 125 injected failures
target: 99.0%
fault classes covered: worker, tool, loop, and approval failures

That number should be read as a benchmark result for the current evaluation suite, not a universal guarantee across every possible failure mode.

But the principle is stable:

If recovery is not measured, reliability is mostly a story.
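As a worked check of the formula, the reported counts (124 recoveries out of 125 injected failures) can be run through a small helper. The function name and the input validation are assumptions made for this sketch. Only the counts and the 99.0% target come from the benchmark above.

```python
def fault_recovery_rate(recovered: int, injected: int) -> float:
    """Share of fault-injected workflows that still completed correctly."""
    if injected <= 0:
        raise ValueError("the suite must inject at least one failure")
    if not 0 <= recovered <= injected:
        raise ValueError("recovered must be between 0 and injected")
    return recovered / injected


rate = fault_recovery_rate(recovered=124, injected=125)
print(f"fault recovery rate: {rate:.1%}")   # prints "fault recovery rate: 99.2%"
print(f"meets 99.0% target: {rate >= 0.990}")  # prints "meets 99.0% target: True"
```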

What a recovery benchmark should inject

A useful benchmark should break the system on purpose.

Fault class	Example injection	Passing behavior
Worker failure	Kill the worker during an LLM call.	Resume from last committed step.
Tool timeout	Delay a tool response beyond timeout.	Retry within budget or pause cleanly.
Tool partial success	Tool succeeds but local process crashes before marking complete.	Detect committed side effect and avoid duplicate action.
Invalid output	Model returns malformed JSON.	Reject, repair, or route to verifier without corrupting state.
External data change	Source record changes while workflow waits.	Refresh or flag stale context before continuing.
Human approval delay	Approval arrives hours later.	Resume with current state and recorded approval.
Node loss	Cluster node disappears mid-run.	Fail over without losing workflow state.
Retry storm	Many workflows hit the same failing tool.	Apply backpressure and prevent runaway cost.

This is where a durable runtime has to prove its value.

Not in a perfect demo.

In a controlled disaster.
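A fault-injection suite does not need to be elaborate to be useful. The sketch below covers only the worker-failure class from the table: it kills a toy workflow before each step in turn, resumes it, and counts how often the final result is still correct. Every name here (`run`, `recovery_suite`, `InjectedCrash`) is hypothetical, and the `state` dictionary stands in for storage that would be durable in a real runtime.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Step = Callable[[State], None]


class InjectedCrash(Exception):
    """Simulates a worker dying just before a step runs."""


def run(steps: List[Step], state: State, crash_before: Optional[int] = None) -> None:
    """Run from the last committed step; optionally crash before one step."""
    while state["next_step"] < len(steps):
        index = state["next_step"]
        if index == crash_before:
            raise InjectedCrash(f"killed before step {index}")
        steps[index](state)
        state["next_step"] = index + 1  # commit progress after the step


def recovery_suite(steps: List[Step], expected_log: List[str]) -> Tuple[int, int]:
    """Inject one crash per step position; return (recovered, injected)."""
    recovered = 0
    for crash_at in range(len(steps)):
        state: State = {"next_step": 0, "log": []}
        try:
            run(steps, state, crash_before=crash_at)
        except InjectedCrash:
            pass  # the "worker" died; state keeps what was committed
        run(steps, state)  # a fresh worker resumes from committed state
        if state["log"] == expected_log:
            recovered += 1
    return recovered, len(steps)


steps: List[Step] = [
    lambda s: s["log"].append("plan"),
    lambda s: s["log"].append("call_tool"),
    lambda s: s["log"].append("report"),
]
print(recovery_suite(steps, ["plan", "call_tool", "report"]))  # prints (3, 3)
```

The two numbers it returns are the numerator and denominator of the fault recovery rate. A fuller suite would add the other fault classes, such as crashes in the middle of a step and tool timeouts.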

Recovery has three layers

Recovery is often discussed as if it were one thing.

It is not.

A serious AI runtime needs at least three recovery layers.

1. Execution recovery

Execution recovery asks:

Can the workflow continue after process, machine, or network failure?

This requires persisted state, checkpoints, event logs, and resume semantics.
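One way to picture persisted state and resume semantics is a checkpoint file that is rewritten after every completed step. This is a deliberately small sketch under stated assumptions: the JSON file layout, the function names, and the rule that step outputs are JSON-serializable are all invented for the example, and production runtimes typically use an event log or database instead.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Return the last committed state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "outputs": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers see the old or new file, never half


def run_workflow(path: str, steps: List[Callable[[], Any]]) -> List[Any]:
    """Run `steps` in order, skipping any that were already committed."""
    state = load_checkpoint(path)
    while state["next_step"] < len(steps):
        output = steps[state["next_step"]]()
        state["outputs"].append(output)
        state["next_step"] += 1
        save_checkpoint(path, state)  # commit progress after each step
    return state["outputs"]
```

If the process dies during step three, a second call to `run_workflow` with the same path starts at step three rather than step one. A step that crashes after acting but before its checkpoint is written would still run again, which is the problem side-effect recovery has to solve.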

2. Semantic recovery

Semantic recovery asks:

Can the agent recover from wrong, missing, stale, or malformed context?

This requires validation, context refresh, source provenance, memory boundaries, and sometimes human review.
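A small gate in front of the commit step illustrates the validation and context-refresh parts of this layer. The schema (`action`, `record_id`), the version counter used to detect stale context, and the `Validation` type are all assumptions for the sketch. A real workflow would define its own.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema: fields the model's output must contain.
REQUIRED_FIELDS = {"action", "record_id"}


@dataclass(frozen=True)
class Validation:
    ok: bool
    reason: Optional[str] = None
    payload: Optional[dict] = None


def validate_model_output(
    raw: str, context_version: int, current_version: int
) -> Validation:
    """Reject stale or malformed output before it can touch workflow state."""
    if context_version != current_version:
        # The source record changed while the workflow waited.
        return Validation(False, "stale_context")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return Validation(False, "malformed_json")
    if not isinstance(payload, dict):
        return Validation(False, "not_an_object")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return Validation(False, "missing_fields: " + ", ".join(sorted(missing)))
    return Validation(True, payload=payload)
```

Each rejection reason maps to a different recovery path: refresh the context, ask the model to repair its output, or route the case to a verifier or a person.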

3. Side-effect recovery

Side-effect recovery asks:

Can the system avoid doing the dangerous thing twice?

This requires idempotency keys, commit boundaries, tool-call logs, approval state, and explicit records of external actions.

The third layer is where many agent demos quietly fail.

Generating a duplicate answer is annoying.

Sending a duplicate payment, message, ticket update, database mutation, or trade is a different category of problem.
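Idempotency keys are the most concrete of these mechanisms, so here is a minimal sketch. `SideEffectLog` and `idempotency_key` are hypothetical names, and the in-memory dictionary stands in for a durable record of external actions.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(workflow_id: str, step: str, params: Dict[str, Any]) -> str:
    """Derive a stable key from what the action is, not when it ran."""
    canonical = json.dumps([workflow_id, step, params], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SideEffectLog:
    """In-memory stand-in for a durable log of external actions."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute_once(
        self,
        workflow_id: str,
        step: str,
        params: Dict[str, Any],
        action: Callable[[Dict[str, Any]], Any],
    ) -> Any:
        """Run `action` unless this exact call is already recorded."""
        key = idempotency_key(workflow_id, step, params)
        if key in self._results:
            return self._results[key]  # replay the recorded result
        result = action(params)
        self._results[key] = result
        return result
```

A retry of the same step with the same parameters returns the recorded result instead of sending the email or payment again. The sketch still leaves a gap if the process dies between performing the action and recording it. One common way to close that gap is to pass the same key to the external API, where the provider supports it, so that it can reject the duplicate itself.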

The commit boundary matters

A model response should not automatically become truth.

A tool call should not automatically become an approved state transition.

The runtime needs a commit boundary.

model proposes
↓
runtime validates
↓
policy checks
↓
side effects execute
↓
result is recorded
↓
state is committed

That boundary is where recovery becomes possible.

If state is committed before validation, the workflow can preserve the wrong thing.

If state is never committed, the workflow can lose progress.

If side effects are not recorded, retries become dangerous.
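The ordering above can be expressed as a single function that refuses to advance state until every earlier stage has passed. This is an illustrative sketch only: `commit_step`, `Rejected`, and the shape of the state dictionary are assumptions, and the validation, policy, and execution callbacks are supplied by the caller.

```python
from typing import Any, Callable, Dict


class Rejected(Exception):
    """The proposal failed validation or policy; nothing was executed."""


def commit_step(
    state: Dict[str, Any],
    proposal: Dict[str, Any],
    validate: Callable[[Dict[str, Any]], bool],
    policy_allows: Callable[[Dict[str, Any]], bool],
    execute: Callable[[Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Validate, check policy, execute, record, then return committed state."""
    if not validate(proposal):
        raise Rejected("validation failed")
    if not policy_allows(proposal):
        raise Rejected("policy denied")
    result = execute(proposal)  # side effects happen only after both checks
    entry = {"proposal": proposal, "result": result}
    # Build a new state instead of mutating the old one, so a failure
    # anywhere above leaves the previously committed state untouched.
    return {**state, "history": state["history"] + [entry]}
```

Because the function returns a new state rather than editing the old one in place, a rejected proposal cannot leave the workflow half-updated.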

Recovery changes the economics

Recovery is also a cost issue.

Every failed workflow has hidden cost:

wasted model calls
wasted tool calls
human repair time
duplicated work
lost trust
support burden
opportunity cost

The right economic metric is not raw token spend.

It is cost per successful workflow:

cost_per_successful_workflow =
  (model_cost + tool_cost + compute_cost + human_repair_cost)
  / successful_completed_workflows

A system with more careful runtime machinery can look slower or heavier on a single step, but be cheaper across the whole workflow because it avoids restarts, duplicate side effects, and human rescue.
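A worked example makes the comparison concrete. All figures below are hypothetical, chosen only to show how the formula behaves: two runtimes each attempt 100 workflows, and the heavier one spends more on compute but far less on human repair.

```python
def cost_per_successful_workflow(
    model_cost: float,
    tool_cost: float,
    compute_cost: float,
    human_repair_cost: float,
    successful_completed_workflows: int,
) -> float:
    """Total spend divided by the workflows that actually finished correctly."""
    if successful_completed_workflows <= 0:
        raise ValueError("no successful workflows to divide by")
    total = model_cost + tool_cost + compute_cost + human_repair_cost
    return total / successful_completed_workflows


# Hypothetical totals for 100 attempted workflows each (illustrative only).
lightweight = cost_per_successful_workflow(
    model_cost=40.0, tool_cost=10.0, compute_cost=5.0,
    human_repair_cost=60.0, successful_completed_workflows=80,
)
durable = cost_per_successful_workflow(
    model_cost=46.0, tool_cost=10.0, compute_cost=12.5,
    human_repair_cost=5.0, successful_completed_workflows=98,
)
print(lightweight)  # prints 1.4375
print(durable)      # prints 0.75
```

In this made-up scenario the durable runtime spends more on models and compute, yet each successful workflow costs about half as much, because fewer runs fail and fewer need a person to repair them.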

This is the number customers and investors should care about.

The recovery scorecard

A buyer evaluating an AI runtime should ask for a recovery scorecard that ties recovery directly to five hard metrics:

Buyer metric	Recovery-specific question
Workflow Completion Rate	After normal variance and failures, how often does the workflow still finish correctly?
Fault Recovery Rate	After injected failures, how often does the workflow resume from the right point and complete correctly?
Tool Execution Accuracy	Are retries and tool parameters correct after recovery?
Cost per Successful Workflow	How much cost is wasted on restarts, loops, and duplicate work?
Human Intervention Rate	How often does a person need to repair the workflow rather than approve it?

This is the practical distinction between “agent framework” and “AI workflow runtime.”

An agent framework helps you build behaviors.

A runtime helps those behaviors survive reality.

What first-time users should feel

A good recovery model should make AI feel calmer.

The user should not have to babysit every step.

They should be able to inspect progress, pause, resume, approve, retry, and understand what happened.

They should trust that if a machine sleeps, a tool fails, or a process restarts, the workflow does not lose its mind.

That is not magic.

It is runtime design.

The takeaway

The unsexy part of AI may become the most important part.

Recovery is not a footnote.

It is where demos become dependable systems.

The next serious benchmark for AI workflows is not only:

Can the agent reason?

It is:

Can the workflow run, fail, recover, and continue without losing truth?

That is the benchmark MirrorNeuron is built around.

References

MirrorNeuron Home: MirrorNeuron product page. https://www.mirrorneuron.io/
MirrorNeuron Docs: “MirrorNeuron: durable multi-agent workflow runtime.” https://doc.mirrorneuron.io/
Temporal Workflow Execution: Temporal Docs. “Workflow Execution overview.” https://docs.temporal.io/workflow-execution
LangGraph Durable Execution: LangChain Docs. “LangGraph Overview.” https://docs.langchain.com/oss/python/langgraph/overview
AWS Agent Evaluation: AWS. “Evaluating AI agents: Real-world lessons from building agentic systems at Amazon.” 2026. https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/