Runs as recorded state
An Orcho run records its lifecycle in two complementary stores. The event stream is the truth; the snapshot is a cache. Everything the status and resume tooling reports is a projection of that recorded state, not a guess about what a process was doing when it stopped.
events.jsonl into a typed projection, and a
read-only check compares it against meta.json.
A process killed between the two writes leaves the stores disagreeing —
that drift becomes named problem codes, not silent
corruption. Repair never happens behind your back:
orcho repair-state is an explicit offline step, dry-run by
default, that applies a minimal crash-safe mutation to bring
meta.json back in line with the projection.
Two stores, one authority
Section titled “Two stores, one authority”| Surface | Role |
|---|---|
events.jsonl | Durable, append-only record of what happened: phases started and ended, the run paused on a handoff, halted, failed, or was interrupted. |
meta.json | A flat, materialized snapshot of that truth, mutated in place so readers get current state in one read without folding the whole stream. |
This split is what makes crashes survivable. Appending an event and mutating
the snapshot are separate writes, so a process killed between them leaves the
two stores disagreeing — for example, a halt decision recorded in the stream
before the snapshot’s status flipped. Because the stream is the authority,
that disagreement is detectable and nameable rather than silent corruption.
Three responsibilities
Section titled “Three responsibilities”The run_state layer keeps the two stores coherent. It is split into isolated
parts, each with a single job:
- Projection + consistency (read-only). A pure reducer folds the event stream into a typed snapshot; a consistency checker names known torn shapes by stable problem codes. This part never writes.
- Repair (opt-in, offline).
repair_run_stateconsumes the consistency diagnosis and, for a strictly limited set of self-healable shapes, proposes (dry-run by default) or applies a minimal, crash-safemeta.jsonmutation that brings the snapshot back in line with the event-derived projection. - State-transition helpers. Focused writers own active handoff transitions and terminal transitions in a run’s flat state mapping. They do no file IO and emit no events; persistence stays with the caller.
- Terminal-outcome reduction. Which of
done,awaiting_human_review, orhaltedcloses a clean run is not decided at the call sites: a pure reducer makes that decision once (pre-delivery, and again for the post-delivery no-diff reconcile), and the writers above only apply it.
The status graph
Section titled “The status graph”events.jsonl events.jsonl, and its
class decides what resume may do.
Green terminals are settled — the active handoff is
cleared. Red failed keeps the handoff: an
undecided decision is never erased. The amber pause hands
the wheel to the operator, whose decision is the only way back to
running. A process that dies mid-flight leaves any live
status torn as interrupted — repaired
offline or decided by the operator, never silently resumed.
| Status | Class | Resume behavior |
|---|---|---|
running | live | Continue from checkpoint. |
awaiting_phase_handoff, awaiting_gate_decision, awaiting_human_review | operator pause | Require the decision before resume. |
done, halted, cancelled | settled terminal | Not resumed as the same run; follow-up only. |
failed | terminal / diagnostic | No blind resume; inspect state first. |
interrupted | torn / diagnostic | Repair or operator decision required. |
The transitions have named owners. An operator decision on a paused run moves
it back to running via continue_handoff, continue_with_waiver_handoff, or
retry_feedback_handoff; a halt decision goes through mark_run_halted;
success through mark_run_done; a phase failure through mark_run_failed; a
process that dies mid-flight is marked by mark_run_interrupted.
One rule in that graph is load-bearing: settled terminals must not carry an
active handoff, so done and halted clear it. failed and interrupted
preserve it — a run that stopped while carrying an undecided handoff still
needs the operator to act on that state, so nothing is allowed to erase it.
awaiting_human_review — the clean terminal for plan-only work kinds
(planning, research) — preserves it for the same reason: the produced plan is
exactly the artifact the operator’s pending decision is about.
A phase ended is not a phase completed
Section titled “A phase ended is not a phase completed”A phase.end event records that a phase ended, not that it completed. A
phase counts as a completed checkpoint only when its outcome is a success
outcome: ok, or an outcome starting with skipped. Everything else —
halted: …, failed, rejected, error, incomplete, unknown strings —
is not a completed checkpoint.
The allowlist fails safe: a new outcome token is treated as not-completed until it is deliberately added. The practical consequence is that resume can never skip past a halted phase as if it finished — the bug this rule closes was a halted implement phase being checkpointed as complete, so resume jumped straight to review against partial work.
What this means for the operator
Section titled “What this means for the operator”statusand resume tooling read recorded state, so what they report is what the run actually wrote down — including a pause that needs your decision.- Resume is gated by the same projection: completed checkpoints are skipped, unfinished phases re-enter, and paused runs require the decision first.
- Repair never happens behind your back. It is an explicit offline step, dry-run by default:
orcho repair-state RUN_ID --workspace PATH # diagnose onlyorcho repair-state RUN_ID --workspace PATH --apply # apply safe repairs--apply handles only the known self-healable shapes, crash-safely and
idempotently. An interrupted run that still carries an undecided handoff is
refused — that needs an operator decision through the handoff API, not an
automatic status flip.
Deep reference
Section titled “Deep reference”The canonical engineering docs live with the code: