Self-Healing

How Solen detects agent failures, diagnoses root cause, and resumes from the last checkpoint — without paging an engineer.

Solen monitors every agent run instrumented with withSolen() and can autonomously repair failures: stalls, tool-call loops, token budget overruns, and step errors.

How the loop works

1. Detection

The Runtime Kernel watches step cadence on every active run. A failure is any of:

Stall — no step activity for longer than your workspace threshold (default 300 seconds)
Tool-call loop — the same tool called repeatedly with identical parameters
Token budget exceeded — run exceeds configured maxTokens
Step failure — a step reports status: failed

Example detection event:

Stall detected

03:42:08  research-agent.workspace-prod · step: 47
⚠ Stall detected · no step activity for 312s · threshold: 300s

2. Diagnosis

LearnShift pattern-matches first. If no known fix exists, the Runtime Kernel runs AI diagnosis on the last steps, tool calls, and LLM responses:

Diagnosis

{
  "rootCause": "tool_call_loop",
  "detail": "search_web called 23x with identical params",
  "affectedStep": 47,
  "confidence": 97,
  "evidence": [
    "search_web x23 identical query",
    "no llm_call between tool calls",
    "stall threshold exceeded"
  ]
}

Confidence is scored 0–100. It determines whether the fix applies automatically or waits for your approval.

3. Fix application

For known patterns, LearnShift applies the fix in under 100ms with zero LLM calls. For novel failures, the Runtime Kernel generates a heal action — for example injecting a step break or rolling back to a prior checkpoint:

Pattern match fix

Pattern match found (confidence: 0.97):
+ inject_step_break_after_repeated_tool_call
+ resume_from_checkpoint: step_44

4. Autonomy gate

Mode	Confidence	Behavior
`supervised`	Any	Heal proposal shown in UI: you approve before the agent resumes
`autonomous`	≥ 70	Fix applies immediately, agent resumes from checkpoint
`autonomous`	< 70	Falls back to supervised: shown for approval

Free tier defaults to supervised. Paid plans can enable autonomous in Settings → Workspace → Agent Policy.

5. Resume and verification

After the fix is applied, the agent resumes from its last checkpoint — not from step 1. The Runtime Kernel verifies step activity is restored and sustained for 60 seconds before marking the incident resolved:

Resume

03:42:21  Applying fix · resuming agent from checkpoint step_44...
03:42:33  Agent resumed · step 44 → step 48 · RUNNING
03:42:38  ✓ Verified: step activity restored · stall cleared · sustained 60s

6. Rollback

If the healed run fails verification, the incident becomes heal_failed. The previous checkpoint is preserved. You can manually trigger POST /api/agent-runtime/runs/:runId/rollback to resume from the last known-good state.

Real heal timing

From production (tool-call loop incident):

Production example

Stall detected:      03:42:08
Pattern match:       03:42:15
Agent resumed:       03:42:33
Incident resolved:   03:42:39
Total:               31 seconds · zero human intervention

Heal history API

Every heal is stored and accessible via the run detail endpoint:

GET /api/agent-runtime/runs/:runId
Authorization: Bearer <api-key>

Key response fields:

Field	Description
`triggeredAt`	When the failure rule fired
`appliedAt`	When the fix was applied
`resolvedAt`	When step activity was verified sustained
`status`	`resolved`, `heal_failed`, `applying`, `needs_review`
`resumeFromStep`	Checkpoint step the agent resumed from
`diagnosis.rootCause`	Root cause (e.g. tool_call_loop, stall)
`diagnosis.confidence`	0–100 confidence score

Safety model

Rail	Behavior
Confidence threshold	Heals below 70 always require human approval
Attempt limit	Max 3 heal attempts per incident. After 3, incident escalates
Cooldown	A single agent cannot trigger more than one incident per 5 minutes
Open incident block	If an unresolved incident exists, no new incident is created until it closes
Checkpoint preservation	Previous checkpoints are never deleted during a heal attempt
Sustained verification	Agent must show restored step activity for 60 seconds before incident is marked resolved

Receive heal events

Configure a webhook to receive heal lifecycle events. See Integrations → Webhooks.

Webhook payload

{
  "event": "heal.resolved",
  "agentId": "research-agent",
  "runId": "clx9k2m00000008l4abc123de",
  "triggeredAt": "2026-05-17T03:42:08Z",
  "resolvedAt": "2026-05-17T03:42:39Z",
  "diagnosis": {
    "rootCause": "tool_call_loop",
    "confidence": 97
  },
  "resumeFromStep": 44,
  "outcome": "success"
}

Read about autonomy modes →