A real run, annotated

Case study

Most docs pages explain one mechanism with a sanitized example. This page reads one real receipt end to end: run 20260704_163427, a feature of Orcho’s own CLI, shipped through Orcho at full spend on 2026-07-04. Nothing below is synthetic; the receipt is trimmed for length and every number is unedited.

real spend · 2h 45m hands-off · approved with receipts

run 20260704_163427 · release approved

[DONE] Pipeline complete
✓ plan=ok | validate_plan=ok | implement=ok
  | review_changes=ok | repair_changes=ok
  | final_acceptance=ok
Tasks: 3 planned · 3 completed · 0 failed
Review findings: 1 (P1=1) — resolved, 0 active
Run findings: 4 — all resolved
Open risks: none
Time: 2h 45m | API-equiv: ~$116.67

One command in, an approved release out — with the trail of how it got there. This page walks that trail.

The outcome

The run took a feature task — a compressed line-by-line summary grammar for the CLI’s --output summary view — and carried it through the full pipeline: plan, plan validation, implementation, review, repair, final acceptance.

✓ plan=ok | validate_plan=ok | implement=ok
  | review_changes=ok | repair_changes=ok | final_acceptance=ok

Tasks: 3 planned · 3 completed · 0 failed · 0 incomplete
Release: approved
Open risks: none
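The phase line in the receipt is regular enough to check mechanically. The following is a minimal sketch, assuming only the `phase=status` text shown above; `parse_phase_status` is a hypothetical helper written for this page, not part of Orcho's CLI or SDK.

```python
def parse_phase_status(text: str) -> dict[str, str]:
    """Parse 'plan=ok | validate_plan=ok ...' into {phase: status}."""
    statuses: dict[str, str] = {}
    # Phases are separated by '|' and the line may wrap, so join lines first.
    for chunk in text.replace("\n", " ").split("|"):
        chunk = chunk.strip().lstrip("✓").strip()
        if not chunk:
            continue
        phase, _, status = chunk.partition("=")
        statuses[phase.strip()] = status.strip()
    return statuses


# Status line copied from the receipt excerpt above.
RECEIPT_STATUS = """✓ plan=ok | validate_plan=ok | implement=ok
  | review_changes=ok | repair_changes=ok | final_acceptance=ok"""

phases = parse_phase_status(RECEIPT_STATUS)
all_ok = all(status == "ok" for status in phases.values())
print(len(phases), "phases, all ok:", all_ok)
```

A check like this only confirms the final status of each phase. It says nothing about attempts, which is why the two details below matter.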

Two details are easy to miss and worth naming:

The plan did not pass on its first attempt. Both plan and validate_plan show attempts=2: plan validation found a P1 finding and pushed the plan back before any implementation started. That P1 was resolved on record.
The plan’s three tasks were decomposed into six DAG subtasks (T1–T6) for implementation, each with its own cost, time, and tool attribution in the full receipt.

There is a pleasant recursion here: the feature this run shipped is the compressed summary grammar for the very receipt surface you are reading about.

The loop that converged

Review and repair are the run’s control flow, and this receipt shows them under real load:

review_changes    attempts=7
repair_changes    attempts=5
Review findings: 1 (P1=1) | resolved: 1 | active: 0

Seven review rounds and five repair rounds sound expensive until you look at what they bought: findings fell to zero active, and the reviewer read deeply — its session peaked at 69% of a 258k context window. The reviewer and the implementer are also different vendors: Claude (claude-opus-4-8) implements, Codex (gpt-5.5) reviews, across 8 sessions. The author never grades its own work.
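As a rough illustration of review and repair as control flow, the sketch below alternates the two until a review comes back with no active findings. It is a generic loop under stated assumptions, not Orcho's engine: `converge`, `review`, and `repair` are hypothetical names, and the loop does not model retries or how Orcho counts attempts, so it will not reproduce the 7 and 5 in the receipt.

```python
from typing import Callable


def converge(review: Callable[[], list[str]],
             repair: Callable[[list[str]], None],
             max_rounds: int = 10) -> dict[str, int]:
    """Alternate review and repair until a review returns no findings."""
    reviews = repairs = 0
    findings: list[str] = []
    for _ in range(max_rounds):
        findings = review()
        reviews += 1
        if not findings:
            break  # converged: nothing left to repair
        repair(findings)
        repairs += 1
    return {"reviews": reviews, "repairs": repairs, "active": len(findings)}


# Stub reviewer (hypothetical data): two rounds with a finding, then a clean pass.
pending = [["P1: example finding"], ["P1: example finding"], []]
repaired: list[list[str]] = []

result = converge(lambda: pending.pop(0), repaired.append)
print(result)
```

In this stub the reviewer and the repairer are separate callables, which mirrors the point above: the component that writes the change is not the one that decides whether findings remain.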

Mid-run, the engine’s advisor intervened once on its own:

Agent advice: calls=1 · applied_retries=1 · api-equiv $0.09

One stuck attempt was pushed to a retry without a human touching the run — at an API-equivalent cost of nine cents.

Where the cost lives

The headline number is API-equivalent, not a bill — see Cost accounting for the model. What the receipt lets you do is decompose it:

Usage: 109,878,715 tokens (in=109,414,287 out=464,428)
API-equiv: ~$116.67

implement         78.1M tok   attempts=2  $74.67  (96% cache-read)
review_changes    12.1M tok   attempts=7  $12.83  (91% cache-read)
repair_changes    18.1M tok   attempts=5  $25.95  (96% cache-read)
plan + validate + final                    ~$3.23

Read the shape, not just the total:

Output tokens are 464k of 109.9M — about 0.4%. Almost the entire volume is input: agents re-reading their context as the run progresses.
Around 95% of that input was served from provider cache. Fresh, full-priced token traffic is a small fraction of the headline figure.
Cost concentrates where the work is: implementation carries 64% of the API-equivalent spend, the repair loop 22%, review 11%.
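The three observations above are plain arithmetic on the receipt's own figures. The sketch below reproduces them; the numbers are copied from the excerpt, while the variable names are illustrative. Applying each phase's cache-read percentage to its rounded token total is an approximation made here, not something the receipt states.

```python
TOTAL_USD = 116.67
TOKENS_IN, TOKENS_OUT = 109_414_287, 464_428

# phase: (tokens in millions, API-equivalent USD, cache-read fraction)
PHASES = {
    "implement": (78.1, 74.67, 0.96),
    "review_changes": (12.1, 12.83, 0.91),
    "repair_changes": (18.1, 25.95, 0.96),
}

# Share of all tokens that were model output.
output_share = TOKENS_OUT / (TOKENS_IN + TOKENS_OUT)

# Share of the API-equivalent total carried by each phase.
cost_share = {name: usd / TOTAL_USD for name, (_, usd, _) in PHASES.items()}

# Token-weighted cache-read share across the three listed phases.
cached = sum(tok * frac for tok, _, frac in PHASES.values())
cache_share = cached / sum(tok for tok, _, _ in PHASES.values())

print(f"output tokens: {output_share:.1%}")   # about 0.4%
for name, share in cost_share.items():
    print(f"{name}: {share:.0%} of spend")     # 64%, 11%, 22%
print(f"cache-read: {cache_share:.0%}")        # about 95%
```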

This is why Orcho reports cost per phase and per subtask instead of one number: a bare $116.67 headline suggests a very different run from one whose implement phase was 96% cache reads.

The receipt tells on the run

The release was approved — and the receipt still carries two honest warnings.

First, scope expansion. The worker touched 14 files it never declared in the task’s ownership contract (mostly test files it added coverage to, plus one support module):

Scope expansion risk: 14 files flagged — unverified · no-explanation

The detector fired, classified the touches as non-blocking, and printed every path in the full receipt. An approved release does not silence the flags: the next reader sees exactly what the agent did beyond its declared scope.
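At its core, a scope-expansion check is a comparison between what a task declared and what the worker touched. The sketch below shows that idea only. It assumes ownership is declared as glob patterns, which this page does not confirm, and every path and name in it is hypothetical rather than taken from the run.

```python
from fnmatch import fnmatchcase


def undeclared_touches(touched: list[str], declared: list[str]) -> list[str]:
    """Return touched paths that match no declared ownership pattern."""
    return sorted(
        path for path in set(touched)
        if not any(fnmatchcase(path, pattern) for pattern in declared)
    )


# Hypothetical ownership contract and touched files, for illustration only.
declared = ["src/summary/*.py"]
touched = [
    "src/summary/grammar.py",   # inside the declared scope
    "tests/test_grammar.py",    # added test coverage, undeclared
    "src/support/util.py",      # support module, undeclared
]

flagged = undeclared_touches(touched, declared)
print(f"Scope expansion risk: {len(flagged)} files flagged")
```

Whether a flagged path blocks the release is a separate policy decision. In this run the touches were classified as non-blocking and still printed.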

Second, gate residue. All five verification receipts ran and passed before final acceptance, and the receipt still marks them stale — they were recorded before the delivery commit moved HEAD:

pre-final auto-run: 5 ran / 5 pass
blocking (require): broad-non-e2e, verification-unit, cli-sdk-unit
warning (warn): env-provenance, lint — shipping allowed by policy
note: stale = passed before a later HEAD move, not a failed check

stale is a provenance statement, not a failure — the receipt explains this in its own footnote. Which gates block and which merely warn is policy; see Verification receipts for the classification model.
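One plausible way to read the distinction between stale and failed is sketched below. It assumes each receipt records the commit it ran against and a `require` or `warn` policy. `GateReceipt`, `classify`, and the commit ids are hypothetical; only the five gate names come from the receipt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateReceipt:
    name: str
    passed: bool
    recorded_head: str  # commit the check ran against (assumed field)
    policy: str         # "require" or "warn"


def classify(receipt: GateReceipt, current_head: str) -> str:
    """Separate a failed check from one that passed on an older HEAD."""
    if not receipt.passed:
        return "blocked" if receipt.policy == "require" else "warning"
    if receipt.recorded_head != current_head:
        return "stale"  # passed, but HEAD has moved since
    return "fresh"


# All five gates passed before the delivery commit moved HEAD.
OLD_HEAD, NEW_HEAD = "aaa111", "bbb222"  # hypothetical commit ids
gates = [
    GateReceipt("broad-non-e2e", True, OLD_HEAD, "require"),
    GateReceipt("verification-unit", True, OLD_HEAD, "require"),
    GateReceipt("cli-sdk-unit", True, OLD_HEAD, "require"),
    GateReceipt("env-provenance", True, OLD_HEAD, "warn"),
    GateReceipt("lint", True, OLD_HEAD, "warn"),
]

states = {gate.name: classify(gate, NEW_HEAD) for gate in gates}
print(states)
```

Under these assumptions every gate comes out `stale` and none comes out `blocked`, which matches the shape of the excerpt: five checks ran and passed, and the marker records when they ran rather than how they went.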

Reading a run like this yourself

Every number on this page comes from artifacts any Orcho run leaves behind: the final summary, events.jsonl, metrics.json, findings, and verification receipts. The Evidence bundle page maps the artifact set; Feature run anatomy shows the same stream live, phase by phase.

Cost accounting — the API-equivalent model and cache anatomy.
Verification receipts — proof that checks ran, and where.
False-ready delivery — what happens when a run does not converge.
Handoffs and advisors — the advisor that pushed the $0.09 retry.