Run a goal under budget
This recipe walks through what a real autonomous goal looks like: hours of wall-clock time, hundreds of turns, millions of tokens.
The scenario
You want Claude to refactor an authentication module to use async/await. Multiple files, contract preservation, tests must pass. You expect this to take 1–3 hours of model time if it works end-to-end.
Set the budget for the shape of the work
/goal-start "refactor src/auth/* to use async/await; preserve existing test contracts; ensure all tests in test/auth/ pass" --budget 3000000
Notes on the objective and the budget:
- Concrete scope: `src/auth/*`, not “the auth code”
- Explicit acceptance criterion: tests pass in `test/auth/`
- Constraint: preserve existing contracts (a hint to the evaluator: contract-breaking refactors → `incomplete`)
- Budget at 3M tokens: a multi-file refactor with verification will burn 1–2M comfortably; 3M leaves headroom so we don’t trip on the first iteration
What the loop looks like at minute 15
/goal-status
◎ Goal: refactor src/auth/* to use async/await...
  status: active
  tokens: 612,043 worker · 84,200 subagent (696,243 / 3,000,000)
  continuations remaining: 32 / 50
  wall-clock used: 0h 15m / 4h
The worker has read several files and made initial changes, and the evaluator has been dispatched once to check progress (subagent tokens are non-zero). The cache has warmed, so subsequent turns are cheap.
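One way to keep an eye on spend is to parse the `tokens:` line and compare it against the budget. The sketch below is a hypothetical helper: `parse_tokens` is not part of the plugin, and it assumes the status text keeps the layout shown in the sample above, which may not hold across versions.

```python
import re

# Assumed layout: "tokens: <worker> worker · <subagent> subagent (<used> / <budget>)"
TOKENS_LINE = re.compile(
    r"tokens:\s*([\d,]+)\s*worker\D+([\d,]+)\s*subagent\s*\(([\d,]+)\s*/\s*([\d,]+)\)"
)

def parse_tokens(status_text: str) -> dict:
    """Extract token counts from status text (hypothetical helper)."""
    match = TOKENS_LINE.search(status_text)
    if match is None:
        raise ValueError("no tokens line found in status text")
    # Strip thousands separators before converting to integers.
    worker, subagent, used, budget = (int(g.replace(",", "")) for g in match.groups())
    return {
        "worker": worker,
        "subagent": subagent,
        "used": used,
        "budget": budget,
        "fraction_used": used / budget,
    }

status = (
    "status: active tokens: 612,043 worker · 84,200 subagent "
    "(696,243 / 3,000,000) continuations remaining: 32 / 50"
)
usage = parse_tokens(status)
# Worker and subagent tokens should add up to the reported total.
assert usage["worker"] + usage["subagent"] == usage["used"]
print(f"{usage['fraction_used']:.1%} of budget used")
```

For the minute-15 sample this reports roughly 23% of the budget used.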
At minute 45 — first continuation cap
50 continuation turns is the default. For a real refactor you’ll burn through them before completion:
/goal-status
◎ Goal: refactor src/auth/* to use async/await...
  status: paused
  paused_reason: continuation_cap
  tokens: 1,401,328 worker · 211,400 subagent (1,612,728 / 3,000,000)
  continuations remaining: 0 / 50
  wall-clock used: 0h 47m / 4h
The token budget is fine: you’ve used just over half (about 54%). The turn cap is what fired. Extend it:
/goal-extend --add-continuations 100
Now the goal resumes with 100 more turns. For long refactors, expect to extend two or three times.
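Before extending, it can help to check whether the remaining token budget can plausibly cover the extra turns. The sketch below assumes the average cost per turn so far carries forward, which usually overestimates because the early cold-cache turns are the expensive ones. `affordable_turns` is a hypothetical helper, not a plugin command.

```python
def affordable_turns(tokens_used: int, turns_used: int, budget: int) -> int:
    """Estimate how many more turns fit in the budget at the average rate so far."""
    if turns_used <= 0:
        raise ValueError("turns_used must be positive")
    per_turn = tokens_used / turns_used       # average tokens per turn so far
    remaining = max(0, budget - tokens_used)  # tokens left before the cap
    return int(remaining // per_turn)

# Numbers from the minute-47 status above: 50 turns used, 1,612,728 tokens spent.
print(affordable_turns(1_612_728, 50, 3_000_000))  # 43
```

At the average rate so far, the remaining budget covers about 43 more turns, fewer than the 100 just added. The figure is pessimistic, but it suggests the token cap rather than the turn cap may be the next limit to watch.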
The evaluator fires
The continuation prompt has been telling the worker, on every turn, to dispatch the evaluator before declaring done. When the worker thinks it’s done:
[Task] dispatching claude-goal:goal-evaluator with objective + evidence...
The evaluator subagent reads the active goal, inspects recent assistant turns, then runs the tests (Bash tool). Verdict:
{
  "verdict": "incomplete",
  "reason": "test/auth/session.test.ts:42 fails — expected resolved promise, got rejection from missing `await` in src/auth/session.ts:87. Worker reported tests passing but `npm test test/auth` shows 1/13 failing."
}
Notice that the evaluator caught a false positive. The worker claimed done; the evaluator ran the tests and found a missing `await`. This is exactly the failure mode the dual-path design protects against.
The worker reads the incomplete verdict, addresses the gap, and the loop continues.
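The verdict is plain JSON, so the branch the worker takes can be sketched in a few lines. This illustrates the loop described above and is not the plugin's actual implementation: `next_action` and its return values are hypothetical, and the `reason` string is shortened from the sample.

```python
import json

def next_action(verdict_json: str) -> str:
    """Map an evaluator verdict to the worker's next step (illustrative only)."""
    verdict = json.loads(verdict_json)
    status = verdict.get("verdict")
    if status == "complete":
        return "stop"      # evaluator confirmed the objective is met
    if status == "incomplete":
        return "continue"  # address the gap named in "reason", then loop again
    raise ValueError(f"unexpected verdict: {status!r}")

sample = '{"verdict": "incomplete", "reason": "1/13 tests failing in test/auth"}'
print(next_action(sample))  # continue
```

The point of the sketch is that the worker's own claim of completion never appears in the decision; only the evaluator's verdict does.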
Eventual completion
After another 40 turns and 1.1M more tokens:
{
  "verdict": "complete",
  "reason": "All 13 tests in test/auth/ pass via `npm test test/auth`. Async/await replaces .then() chains in src/auth/oauth.ts:23, src/auth/session.ts:41+87, src/auth/tokens.ts:18. No public API signatures changed."
}
The worker calls `update_goal status:complete completed_by:"evaluator"` and stops.
Final status:
/goal-status
◎ Goal: refactor src/auth/* to use async/await...
  status: complete
  completed_by: evaluator
  tokens: 2,341,801 worker · 392,500 subagent (2,734,301 / 3,000,000)
  duration: 2h 14m
The goal came in under budget, took just over two hours, and used 150 continuation turns. That’s a real autonomous run.
What if the goal pauses with budget_limited?
Suppose the goal stalls and burns through the 3M cap:
/goal-status
◎ Goal: refactor src/auth/* to use async/await...
  status: budget_limited
  tokens: 2,941,820 worker · 58,180 subagent (3,000,000 / 3,000,000)
  continuations remaining: 12 / 150
  paused_reason: budget_limited
You have three real options:
1. Investigate. Inspect the transcript. Is the model in a loop? Did it misread the task? Is the objective wrong?
2. Take over manually. Tell the model directly what to finish. Token accounting still tracks, but the autonomous loop is off.
3. Start a fresh goal with a higher cap. If you trust the progress and just need more room:
/goal-abandon
/goal-start "<continuation of the prior objective with the remaining work>" --budget 5000000
The prior goal’s history is preserved in `goal_events` for forensic review.
Patterns that work
| Pattern | Good for |
|---|---|
| `--budget 500000` | Single-file changes with a quick evaluator pass. Floor for anything meaningful. |
| `--budget 2000000–3000000` | Bounded refactors with clear acceptance criteria. The sweet spot for “real work.” |
| `--budget 5000000–10000000` | Multi-file refactors, doc generation, broad cleanups. |
| `--budget 20000000+` plus `--add-hours 12` | Overnight long-running goals. Set the wall-clock cap to match. |
| No budget | Exploration. Monitor with `/goal-status`. |
Patterns that don’t work
- Tiny budgets (`--budget 50000`). One input turn in a real codebase is already 50–100K. The cap will fire on turn one and the plugin will look broken.
- Vague objectives (`"clean up the code"`). The evaluator has nothing to verify against.
- Unverifiable objectives (`"wait until the build completes"` with no build running). Wall-clock cap eventually catches this, but you waste turns first.
- Objectives that lie about scope (`"refactor everything"` with `--budget 200000`). The goal pauses with the work obviously unfinished.
A budget-sizing intuition
Pessimistically: assume ~30k–60k tokens per turn in a warm cache, ~80k–150k tokens per turn in a cold cache or with heavy file reads.
For a goal you expect to take N turns, a safe starting budget is roughly N × 60k × 1.3 (the 1.3 is headroom). So:
- 30 turns: ~2.3M
- 100 turns: ~7.8M
- 300 turns: ~23M
These are upper bounds. Cache hits routinely cut real usage by 60–80%.
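The rule of thumb is easy to encode. A minimal sketch, with `starting_budget` as a hypothetical helper and the 60k-per-turn and 1.3 headroom defaults taken from the formula above:

```python
def starting_budget(expected_turns: int,
                    tokens_per_turn: int = 60_000,
                    headroom: float = 1.3) -> int:
    """Rough starting budget: turns x tokens per turn x headroom."""
    if expected_turns <= 0:
        raise ValueError("expected_turns must be positive")
    return round(expected_turns * tokens_per_turn * headroom)

for turns in (30, 100, 300):
    print(f"{turns} turns: {starting_budget(turns):,} tokens")
# 30 turns: 2,340,000 tokens
# 100 turns: 7,800,000 tokens
# 300 turns: 23,400,000 tokens
```

The result can be passed to `--budget` when starting the goal. Raising `tokens_per_turn` toward the cold-cache range is a reasonable adjustment for goals dominated by heavy file reads.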