AI Demos Fail for a Boring Reason: Recovery

AI, Reliability, Engineering
2026-04-17 Homer Quan

There is a pattern in modern AI products that is easy to miss.

The impressive part is usually the reasoning.

The disappointing part is usually the recovery.

A system writes a good draft, but crashes when a tool times out. It successfully completes three steps in a workflow, then duplicates the fourth because the process restarted. It collects useful information, then loses the trail because memory was stored in the wrong place. It waits for a human approval, then resumes with stale context. It calls an API twice because the model “forgot” the first call had already committed.

None of these failures are exotic.

They are ordinary software failures.

They just happen inside “agent” products.

This matters because useful automation is not defined by a perfect path. It is defined by what happens when the path is imperfect.

Reliable AI is not only about what the agent can do when everything goes right.

It is about what the system can preserve when everything goes wrong.

Recovery is where demos become products

A demo usually shows the happy path.

    user asks
    agent plans
    agent calls tool
    agent returns result

A product has to survive the unhappy path.

    user asks
    agent plans
    tool times out
    retry budget applies
    state is preserved
    side effects are checked
    workflow resumes
    human approval still exists
    result completes correctly

The second version is less exciting to demo.

It is also the version users actually need.

Temporal’s durable execution documentation makes this point in traditional workflow language: a workflow execution is durable, reliable, and scalable; recovery uses event history so execution can resume from the latest recorded state. (Temporal Workflow Execution)

AI workflows inherit all of that complexity and add nondeterministic model behavior on top.

That is why recovery cannot be an afterthought.
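To make the event-history idea concrete, here is a minimal sketch of replay-based resumption. It is not Temporal's API: `EventHistory` and `run_workflow` are hypothetical names, and the history lives in memory only to keep the example short (a real runtime would persist it).

```python
from typing import Callable


class EventHistory:
    """Append-only record of completed steps (hypothetical, in-memory for brevity)."""

    def __init__(self) -> None:
        self.events: list[tuple[str, object]] = []

    def result_for(self, step: str) -> tuple[bool, object]:
        """Return (True, result) if the step was already recorded, else (False, None)."""
        for name, result in self.events:
            if name == step:
                return True, result
        return False, None

    def record(self, step: str, result: object) -> None:
        self.events.append((step, result))


def run_workflow(
    steps: list[tuple[str, Callable[[], object]]],
    history: EventHistory,
) -> list[object]:
    """Run steps in order, replaying recorded results instead of re-executing them."""
    results = []
    for name, fn in steps:
        done, result = history.result_for(name)
        if not done:
            result = fn()  # may raise; a failed step is never recorded
            history.record(name, result)
        results.append(result)
    return results
```

Calling `run_workflow` again with the same `history` after a failure skips every step that was already recorded and continues from the first unrecorded one.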

The world is full of interruptions

Real workflows are interrupted by ordinary events:

| Interruption | Product symptom | Runtime question |
| --- | --- | --- |
| API timeout | The workflow stalls or retries blindly. | What was the retry budget? |
| Rate limit | The agent keeps trying and increases cost. | Should the workflow back off, sleep, or reroute? |
| Worker crash | Progress disappears. | Which step was last committed? |
| Restart | The system repeats work. | Which side effects already happened? |
| Invalid external data | The model reasons from bad input. | Was validation performed before commit? |
| Human delay | The workflow resumes with stale state. | What changed while waiting? |
| Tool schema mismatch | The agent calls the right tool incorrectly. | Was parameter accuracy checked? |
| Partial side effect | The API succeeded but the local process failed. | How is idempotency enforced? |

A runtime that treats an AI workflow like a temporary script will fail in exactly these moments.

A runtime that treats the workflow as durable can stop, record, resume, and continue.

That difference is the difference between a toy and a system.
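The "retry budget" and "back off" questions above can be sketched in a few lines. This is an illustrative helper, not MirrorNeuron's implementation: `call_with_retry_budget`, `RetryBudgetExceeded`, and the default values are assumptions, and only `TimeoutError` is treated as retryable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when a call keeps timing out after the budget is spent (hypothetical)."""


def call_with_retry_budget(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TimeoutError with exponential backoff, up to max_attempts."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError as exc:
            last_error = exc
            # Back off before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays. When the budget is exhausted the caller gets an explicit exception, so the workflow can pause or escalate rather than loop.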

Recovery is a user experience feature

People often talk about reliability as if it were back-end plumbing.

For AI workflows, recovery is visible to the user.

A user notices when:

  • the same email is sent twice
  • the workflow starts over from scratch
  • a draft disappears after a restart
  • human approvals get lost
  • yesterday’s context overrides today’s instruction
  • long-running work silently dies
  • the system asks the user to explain everything again

These are not only engineering failures.

They are product failures.

MirrorNeuron treats recovery as part of the product promise: durable workflows, explicit state, retries, sleep and resume, and the ability to run workflows from a laptop to a cluster without changing the workflow idea. (MirrorNeuron Home, MirrorNeuron Docs)

The core benchmark: Fault Recovery Rate

For customers, the recovery benchmark should be a hard number.

    fault_recovery_rate = workflows_completed_correctly_after_injected_failures / workflows_with_injected_failures

A serious runtime should report this across a fault-injection suite, not just claim it abstractly.

MirrorNeuron's current internal benchmark result is:

    fault recovery rate: 99.2%
    benchmark base: 124 / 125 injected failures
    target: 99.0%
    fault classes covered: worker, tool, loop, and approval failures

That number should be read as a benchmark result for the current evaluation suite, not a universal guarantee across every possible failure mode.

But the principle is stable:

if recovery is not measured, reliability is mostly a story.
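The reported figure follows directly from the formula and the benchmark base. A small check, where the function name is only an illustrative choice:

```python
def fault_recovery_rate(recovered: int, injected: int) -> float:
    """Fraction of fault-injected workflows that still completed correctly."""
    if injected <= 0:
        raise ValueError("need at least one injected failure")
    return recovered / injected


rate = fault_recovery_rate(124, 125)
print(f"{rate:.1%}")   # prints 99.2%
print(rate >= 0.99)    # prints True: the 99.0% target is met
```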

What a recovery benchmark should inject

A useful benchmark should break the system on purpose.

| Fault class | Example injection | Passing behavior |
| --- | --- | --- |
| Worker failure | Kill the worker during an LLM call. | Resume from last committed step. |
| Tool timeout | Delay a tool response beyond timeout. | Retry within budget or pause cleanly. |
| Tool partial success | Tool succeeds but local process crashes before marking complete. | Detect committed side effect and avoid duplicate action. |
| Invalid output | Model returns malformed JSON. | Reject, repair, or route to verifier without corrupting state. |
| External data change | Source record changes while workflow waits. | Refresh or flag stale context before continuing. |
| Human approval delay | Approval arrives hours later. | Resume with current state and recorded approval. |
| Node loss | Cluster node disappears mid-run. | Fail over without losing workflow state. |
| Retry storm | Many workflows hit the same failing tool. | Apply backpressure and prevent runaway cost. |

This is where a durable runtime has to prove its value.

Not in a perfect demo.

In a controlled disaster.
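A toy version of such a suite, covering only the worker-failure class, might look like the sketch below. Every name here is hypothetical. The `durable` flag stands in for whether committed steps survive a restart, and a run "passes" only if each step executed exactly once, in order.

```python
class InjectedCrash(Exception):
    """Simulated worker failure raised just before a chosen step."""


def run_steps(step_names, committed, executed, crash_before=None):
    """Execute uncommitted steps in order, optionally crashing before one of them."""
    for name in step_names:
        if name in committed:
            continue  # already committed in an earlier run
        if name == crash_before:
            raise InjectedCrash(name)
        executed.append(name)
        committed.add(name)


def recovers_from_crash(step_names, crash_before, durable=True):
    """Inject one crash, restart, and report whether every step ran exactly once."""
    committed, executed = set(), []
    try:
        run_steps(step_names, committed, executed, crash_before)
    except InjectedCrash:
        pass
    if not durable:
        committed = set()  # a non-durable runtime forgets progress on restart
    run_steps(step_names, committed, executed)  # restart with no fault injected
    return executed == list(step_names)


def fault_recovery_rate(step_names, durable=True):
    """Inject a crash before each step in turn and return the share of clean recoveries."""
    outcomes = [recovers_from_crash(step_names, s, durable) for s in step_names]
    return sum(outcomes) / len(outcomes)
```

With three steps, the durable variant recovers from all three injected crashes. The non-durable variant passes only when the crash lands before the first step, because any later crash causes earlier steps to run twice.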

Recovery has three layers

Recovery is often discussed as if it were one thing.

It is not.

A serious AI runtime needs at least three recovery layers.

1. Execution recovery

Execution recovery asks:

Can the workflow continue after process, machine, or network failure?

This requires persisted state, checkpoints, event logs, and resume semantics.
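One small piece of this is writing checkpoints so that a crash mid-write cannot leave a half-written file. The sketch below assumes a single JSON checkpoint file per workflow. The function names and the default state shape are illustrative, and the atomicity of `os.replace` is a POSIX guarantee that depends on the temp file living on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file, flush it to disk, then rename it over the old checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see either the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh starting state if none exists."""
    if not path.exists():
        return {"next_step": 0, "data": {}}
    return json.loads(path.read_text())
```

On restart, a worker would call `load_checkpoint` and continue from `next_step` instead of starting over.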

2. Semantic recovery

Semantic recovery asks:

Can the agent recover from wrong, missing, stale, or malformed context?

This requires validation, context refresh, source provenance, memory boundaries, and sometimes human review.
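As a rough sketch of the validation and staleness checks, the code below rejects malformed or incomplete model output before it can touch workflow state, and compares a context version against the source's current version. The schema in `REQUIRED_FIELDS` and the integer versioning scheme are assumptions made for illustration.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"action": str, "amount": (int, float)}


def parse_model_output(raw: str):
    """Return (data, None) for valid output, or (None, reason) without touching any state."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            return None, f"missing field: {field}"
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"wrong type for field: {field}"
    return data, None


def is_stale(context_version: int, source_version: int) -> bool:
    """True if the source changed after the context was captured."""
    return context_version != source_version
```

A rejected output can then be routed to repair, a verifier, or human review, while the committed state stays as it was.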

3. Side-effect recovery

Side-effect recovery asks:

Can the system avoid doing the dangerous thing twice?

This requires idempotency keys, commit boundaries, tool-call logs, approval state, and explicit records of external actions.

The third layer is where many agent demos quietly fail.

Generating a duplicate answer is annoying.

Sending a duplicate payment, message, ticket update, database mutation, or trade is a different category of problem.
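A common way to guard against this is an idempotency key derived from the workflow, the step, and the parameters, checked against a log of actions already performed. The sketch below uses a plain dict as the log. A real system would need a durable store, and the names here are hypothetical.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(workflow_id: str, step: str, params: dict) -> str:
    """Derive a stable key: the same workflow, step, and params always hash the same."""
    payload = json.dumps(
        {"workflow": workflow_id, "step": step, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def execute_once(log: dict, key: str, action: Callable[[], object]) -> object:
    """Run action only if key is not in the log; otherwise return the recorded result."""
    if key in log:
        return log[key]
    result = action()
    log[key] = result
    return result
```

This closes the retry gap only within the runtime. If the process can crash between `action()` and the log write, the external service also needs to accept the key and deduplicate on its side.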

The commit boundary matters

A model response should not automatically become truth.

A tool call should not automatically become an approved state transition.

The runtime needs a commit boundary.

    model proposes
    runtime validates
    policy checks
    side effects execute
    result is recorded
    state is committed

That boundary is where recovery becomes possible.

If state is committed before validation, the workflow can preserve the wrong thing.

If state is never committed, the workflow can lose progress.

If side effects are not recorded, retries become dangerous.
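The ordering can be sketched as a single function that commits only after every earlier stage has passed. `WorkflowState`, `commit_step`, and the callback parameters are hypothetical names for illustration, not part of any specific runtime.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkflowState:
    committed: dict = field(default_factory=dict)   # step -> committed result
    action_log: list = field(default_factory=list)  # record of external actions taken


def commit_step(
    state: WorkflowState,
    step: str,
    proposal: dict,
    validate: Callable[[dict], bool],
    policy_allows: Callable[[dict], bool],
    execute: Callable[[dict], object],
) -> str:
    """Move a model proposal across the commit boundary, or reject it with no state change."""
    if not validate(proposal):
        return "rejected: validation"
    if not policy_allows(proposal):
        return "rejected: policy"
    result = execute(proposal)                         # side effect happens here
    state.action_log.append((step, proposal, result))  # record the external action
    state.committed[step] = result                     # only now is the step committed
    return "committed"
```

Rejected proposals leave `committed` and `action_log` untouched, so a retry starts from the same state the failed attempt saw.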

Recovery changes the economics

Recovery is also a cost issue.

Every failed workflow has hidden cost:

    wasted model calls
    wasted tool calls
    human repair time
    duplicated work
    lost trust
    support burden
    opportunity cost

The right economic metric is not raw token spend.

It is cost per successful workflow:

    cost_per_successful_workflow = (model_cost + tool_cost + compute_cost + human_repair_cost) / successful_completed_workflows

A system with more careful runtime machinery can look slower or heavier on a single step, but be cheaper across the whole workflow because it avoids restarts, duplicate side effects, and human rescue.

This is the number customers and investors should care about.
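To see how the formula can favor a heavier runtime, consider a comparison over 100 attempted workflows. All figures below are invented purely to illustrate the arithmetic. They are not measurements of any system.

```python
def cost_per_successful_workflow(
    model_cost: float,
    tool_cost: float,
    compute_cost: float,
    human_repair_cost: float,
    successful: int,
) -> float:
    """Total spend across all attempts divided by the workflows that finished correctly."""
    if successful <= 0:
        raise ValueError("no successful workflows to amortize cost over")
    return (model_cost + tool_cost + compute_cost + human_repair_cost) / successful


# Hypothetical totals for 100 attempted workflows on each setup.
lightweight = cost_per_successful_workflow(40.0, 10.0, 5.0, 120.0, successful=80)
durable = cost_per_successful_workflow(48.0, 12.0, 10.0, 10.0, successful=98)

print(lightweight)            # 175 / 80
print(durable)                # 80 / 98
print(durable < lightweight)
```

In this made-up case the durable setup spends more on model, tool, and compute cost, yet comes out cheaper per successful workflow because human repair cost falls and more workflows finish.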

The recovery scorecard

A buyer evaluating an AI runtime should ask for a recovery scorecard that connects directly to the five hard metrics:

| Buyer metric | Recovery-specific question |
| --- | --- |
| Workflow Completion Rate | After normal variance and failures, how often does the workflow still finish correctly? |
| Fault Recovery Rate | After injected failures, how often does it resume from the right point? |
| Tool Execution Accuracy | Are retries and tool parameters correct after recovery? |
| Cost per Successful Workflow | How much cost is wasted on restarts, loops, and duplicate work? |
| Human Intervention Rate | How often does a person need to repair the workflow rather than approve it? |

This is the practical distinction between “agent framework” and “AI workflow runtime.”

An agent framework helps you build behaviors.

A runtime helps those behaviors survive reality.

What first-time users should feel

A good recovery model should make AI feel calmer.

The user should not have to babysit every step.

They should be able to inspect progress, pause, resume, approve, retry, and understand what happened.

They should trust that if a machine sleeps, a tool fails, or a process restarts, the workflow does not lose its mind.

That is not magic.

It is runtime design.

The takeaway

The unsexy part of AI may become the most important part.

Recovery is not a footnote.

It is where demos become dependable systems.

The next serious benchmark for AI workflows is not only:

Can the agent reason?

It is:

Can the workflow run, fail, recover, and continue without losing truth?

That is the benchmark MirrorNeuron is built around.


References