Self-Healing
How Solen detects agent failures, diagnoses root cause, and resumes from the last checkpoint — without paging an engineer.
Solen monitors every agent run instrumented with withSolen() and can autonomously repair failures: stalls, tool-call loops, token budget overruns, and step errors.
How the loop works
1. Detection
The Runtime Kernel watches step cadence on every active run. A failure is any of:
- Stall — no step activity for longer than your workspace threshold (default 300 seconds)
- Tool-call loop — the same tool called repeatedly with identical parameters
- Token budget exceeded — run exceeds configured
maxTokens - Step failure — a step reports
status: failed
Example detection event:
03:42:08 research-agent.workspace-prod · step: 47
⚠ Stall detected · no step activity for 312s · threshold: 300s2. Diagnosis
LearnShift pattern-matches first. If no known fix exists, the Runtime Kernel runs AI diagnosis on the last steps, tool calls, and LLM responses:
{
"rootCause": "tool_call_loop",
"detail": "search_web called 23x with identical params",
"affectedStep": 47,
"confidence": 97,
"evidence": [
"search_web x23 identical query",
"no llm_call between tool calls",
"stall threshold exceeded"
]
}Confidence is scored 0–100. It determines whether the fix applies automatically or waits for your approval.
3. Fix application
For known patterns, LearnShift applies the fix in under 100ms with zero LLM calls. For novel failures, the Runtime Kernel generates a heal action — for example injecting a step break or rolling back to a prior checkpoint:
Pattern match found (confidence: 0.97):
+ inject_step_break_after_repeated_tool_call
+ resume_from_checkpoint: step_444. Autonomy gate
| Mode | Confidence | Behavior |
|---|---|---|
supervised | Any | Heal proposal shown in UI: you approve before the agent resumes |
autonomous | ≥ 70 | Fix applies immediately, agent resumes from checkpoint |
autonomous | < 70 | Falls back to supervised: shown for approval |
Free tier defaults to supervised. Paid plans can enable autonomous in Settings → Workspace → Agent Policy.
5. Resume and verification
After the fix is applied, the agent resumes from its last checkpoint — not from step 1. The Runtime Kernel verifies step activity is restored and sustained for 60 seconds before marking the incident resolved:
03:42:21 Applying fix · resuming agent from checkpoint step_44...
03:42:33 Agent resumed · step 44 → step 48 · RUNNING
03:42:38 ✓ Verified: step activity restored · stall cleared · sustained 60s6. Rollback
If the healed run fails verification, the incident becomes heal_failed. The previous checkpoint is preserved. You can manually trigger POST /api/agent-runtime/runs/:runId/rollback to resume from the last known-good state.
Real heal timing
From production (tool-call loop incident):
Stall detected: 03:42:08
Pattern match: 03:42:15
Agent resumed: 03:42:33
Incident resolved: 03:42:39
Total: 31 seconds · zero human interventionHeal history API
Every heal is stored and accessible via the run detail endpoint:
GET /api/agent-runtime/runs/:runId
Authorization: Bearer <api-key>Key response fields:
| Field | Description |
|---|---|
triggeredAt | When the failure rule fired |
appliedAt | When the fix was applied |
resolvedAt | When step activity was verified sustained |
status | resolved, heal_failed, applying, needs_review |
resumeFromStep | Checkpoint step the agent resumed from |
diagnosis.rootCause | Root cause (e.g. tool_call_loop, stall) |
diagnosis.confidence | 0–100 confidence score |
Safety model
| Rail | Behavior |
|---|---|
| Confidence threshold | Heals below 70 always require human approval |
| Attempt limit | Max 3 heal attempts per incident. After 3, incident escalates |
| Cooldown | A single agent cannot trigger more than one incident per 5 minutes |
| Open incident block | If an unresolved incident exists, no new incident is created until it closes |
| Checkpoint preservation | Previous checkpoints are never deleted during a heal attempt |
| Sustained verification | Agent must show restored step activity for 60 seconds before incident is marked resolved |
Receive heal events
Configure a webhook to receive heal lifecycle events. See Integrations → Webhooks.
{
"event": "heal.resolved",
"agentId": "research-agent",
"runId": "clx9k2m00000008l4abc123de",
"triggeredAt": "2026-05-17T03:42:08Z",
"resolvedAt": "2026-05-17T03:42:39Z",
"diagnosis": {
"rootCause": "tool_call_loop",
"confidence": 97
},
"resumeFromStep": 44,
"outcome": "success"
}