Runs as recorded state

An Orcho run records its lifecycle in two complementary stores. The event stream is the truth; the snapshot is a cache. Everything the status and resume tooling reports is a projection of that recorded state, not a guess about what a process was doing when it stopped.

run state · two stores, one authority

The event stream is the truth; the snapshot is a cache. A pure reducer folds events.jsonl into a typed projection, and a read-only check compares it against meta.json. A process killed between the two writes leaves the stores disagreeing — that drift becomes named problem codes, not silent corruption. Repair never happens behind your back: orcho repair-state is an explicit offline step, dry-run by default, that applies a minimal crash-safe mutation to bring meta.json back in line with the projection.

Two stores, one authority

Surface	Role
`events.jsonl`	Durable, append-only record of what happened: phases started and ended, the run paused on a handoff, halted, failed, or was interrupted.
`meta.json`	A flat, materialized snapshot of that truth, mutated in place so readers get current state in one read without folding the whole stream.

This split is what makes crashes survivable. Appending an event and mutating the snapshot are separate writes, so a process killed between them leaves the two stores disagreeing — for example, a halt decision recorded in the stream before the snapshot’s status flipped. Because the stream is the authority, that disagreement is detectable and nameable rather than silent corruption.

Three responsibilities

The run_state layer keeps the two stores coherent. It is split into isolated parts, each with a single job:

Projection + consistency (read-only). A pure reducer folds the event stream into a typed snapshot; a consistency checker names known torn shapes by stable problem codes. This part never writes.
Repair (opt-in, offline). repair_run_state consumes the consistency diagnosis and, for a strictly limited set of self-healable shapes, proposes (dry-run by default) or applies a minimal, crash-safe meta.json mutation that brings the snapshot back in line with the event-derived projection.
State-transition helpers. Focused writers own active handoff transitions and terminal transitions in a run’s flat state mapping. They do no file IO and emit no events; persistence stays with the caller.
Terminal-outcome reduction. Which of done, awaiting_human_review, or halted closes a clean run is not decided at the call sites: a pure reducer makes that decision once (pre-delivery, and again for the post-delivery no-diff reconcile), and the writers above only apply it.

The status graph

run status graph · projected from events.jsonl

Every status is a projection of events.jsonl, and its class decides what resume may do. Green terminals are settled — the active handoff is cleared. Red failed keeps the handoff: an undecided decision is never erased. The amber pause hands the wheel to the operator, whose decision is the only way back to running. A process that dies mid-flight leaves any live status torn as interrupted — repaired offline or decided by the operator, never silently resumed.

Status	Class	Resume behavior
`running`	live	Continue from checkpoint.
`awaiting_phase_handoff`, `awaiting_gate_decision`, `awaiting_human_review`	operator pause	Require the decision before resume.
`done`, `halted`, `cancelled`	settled terminal	Not resumed as the same run; follow-up only.
`failed`	terminal / diagnostic	No blind resume; inspect state first.
`interrupted`	torn / diagnostic	Repair or operator decision required.

The transitions have named owners. An operator decision on a paused run moves it back to running via continue_handoff, continue_with_waiver_handoff, or retry_feedback_handoff; a halt decision goes through mark_run_halted; success through mark_run_done; a phase failure through mark_run_failed; a process that dies mid-flight is marked by mark_run_interrupted.

One rule in that graph is load-bearing: settled terminals must not carry an active handoff, so done and halted clear it. failed and interrupted preserve it — a run that stopped while carrying an undecided handoff still needs the operator to act on that state, so nothing is allowed to erase it. awaiting_human_review — the clean terminal for plan-only work kinds (planning, research) — preserves it for the same reason: the produced plan is exactly the artifact the operator’s pending decision is about.

A phase ended is not a phase completed

A phase.end event records that a phase ended, not that it completed. A phase counts as a completed checkpoint only when its outcome is a success outcome: ok, or an outcome starting with skipped. Everything else — halted: …, failed, rejected, error, incomplete, unknown strings — is not a completed checkpoint.

The allowlist fails safe: a new outcome token is treated as not-completed until it is deliberately added. The practical consequence is that resume can never skip past a halted phase as if it finished — the bug this rule closes was a halted implement phase being checkpointed as complete, so resume jumped straight to review against partial work.

What this means for the operator

status and resume tooling read recorded state, so what they report is what the run actually wrote down — including a pause that needs your decision.
Resume is gated by the same projection: completed checkpoints are skipped, unfinished phases re-enter, and paused runs require the decision first.
Repair never happens behind your back. It is an explicit offline step, dry-run by default:

orcho repair-state RUN_ID --workspace PATH            # diagnose only
orcho repair-state RUN_ID --workspace PATH --apply    # apply safe repairs

--apply handles only the known self-healable shapes, crash-safely and idempotently. An interrupted run that still carries an undecided handoff is refused — that needs an operator decision through the handoff API, not an automatic status flip.

Deep reference

The canonical engineering docs live with the code: