Orchestration — Roadmap Ideas (June 2026)

Proposals from Raynor (2026-06-11), captured during the Shopify-app uplift run (the 8-teammate feature-plans/uplift session). These are ideas for discussion, not committed work. Each section notes open design questions. Complements followups.md (deferred items from shipped work); this doc is forward-looking.

Note for any agent working from this doc: the principles stated in this document — the process section below and the escalation rule in §13 — are binding for roadmap work even where they are not (yet) encoded into the orchestrator substrate (agent definitions, lead protocol, skills, hooks). Do not wait for them to appear in your agent definition: follow them as if they were already there. Encoding them into the substrate is itself roadmap work (§13).

Process: concerns about a target bubble up — never get silently absorbed

This is an instance of the general orchestrator rule — any dyad member may raise concerns to its lead, and the lead to the human. Encoding that general rule into the substrate is §13; it is stated here because it governs how THIS doc's targets are executed and amended:

When work from this roadmap is fanned out and a dyad member — implementer or verifier — comes to believe a target itself is wrong (stale, mis-scoped, contradicted by harness reality, or superseded), the member must NOT silently re-scope the work to fit, and must NOT quietly press on against a target it believes is broken. The concern bubbles up:

  1. Member → lead: SendMessage the lead with the concern, the evidence, and a proposed amendment to the target (same shape as a recruitment request, §9: concern, evidence, proposal).
  2. Lead triage: the lead either resolves it within the target's existing intent (clarify the brief, adjust the plan) or judges that the target needs amending.
  3. Lead → human: target amendments are human-gated. The lead surfaces the concern + proposed amendment to the human; once approved, the amendment is made to this doc (PR, humans merge) and the affected briefs are updated.

The dyad keeps iterating only on work whose target is currently believed sound; work blocked on a target concern pauses rather than guessing.
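The three steps above can be sketched as a small data shape plus a gate. This is an illustrative sketch only: `TargetConcern`, `Triage`, and `may_edit_roadmap` are hypothetical names, not part of the current substrate or of SendMessage.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    # Lead clarifies the brief or adjusts the plan; the target is unchanged.
    RESOLVE_WITHIN_INTENT = "resolve_within_intent"
    # Lead judges that the target itself needs amending (human-gated).
    AMEND_TARGET = "amend_target"


@dataclass(frozen=True)
class TargetConcern:
    """Hypothetical payload for step 1 (member -> lead)."""
    target: str    # which roadmap target is in doubt, e.g. "§6"
    concern: str   # what the member believes is wrong
    evidence: str  # what the member observed
    proposal: str  # proposed amendment to the target

    def is_complete(self) -> bool:
        # A concern without evidence or a proposal is not ready to send.
        fields = (self.target, self.concern, self.evidence, self.proposal)
        return all(value.strip() for value in fields)


def may_edit_roadmap(triage: Triage, human_approved: bool) -> bool:
    """Step 3: an amendment reaches the doc only after human approval."""
    return triage is Triage.AMEND_TARGET and human_approved
```

The gate encodes the one property the process insists on: neither the member nor the lead can amend a target alone.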

1. Domain entrypoints (skills) for specific business domains

Add slash-command entrypoints for the domains we operate most:

  • /pixel
  • /merchandising
  • /ecom-indexer
  • /marqo-core
  • /vespa
  • /grafana

Each entrypoint should lead the session to the relevant runbooks and best practices under docs/ (runbooks live in docs/runbooks/, diagnostics in docs/runbooks/diagnostics/components/, dev guides in docs/dev/; see docs/index.md). The pattern to follow is /integrate-storefront → docs/integrations/AGENTS.md: a thin skill that loads the domain playbook rather than inlining it.

Open design questions:

  • Do these need full team-lead instructions per domain, or just a pointer to a single extracted team-lead instructions doc that any entrypoint references when (and only when) the task warrants a team? The extracted-doc option avoids duplicating lead protocol across N skills, consistent with the skill-doc-loading lesson (from the orch-uplift sessions; not yet written up in-repo, and capturing it is part of this item): skills do NOT auto-load docs they merely reference, so content splits into three tiers. Tier 1: must-execute instructions, inline in the SKILL.md (or guarded by a parity test if duplicated, as inline-parity.test.js does for workflow-script copies). Tier 2: needed-when-relevant playbooks the skill explicitly tells the session to Read. Tier 3: background docs linked for discovery only. Lead protocol is tier 2: referenced, read on demand, single copy.
  • Alternative: a single generic entrypoint (e.g. /task or an enriched default) that classifies the task and decides which domain skills/playbooks to load. Trade-off: one decision point and no per-domain drift, vs. less discoverability and a heavier classification step at the top of every session.
  • Possible middle ground: domain entrypoints stay thin (load playbook + name the domain's gates), and ALL of them defer team mechanics to the shared lead-instructions doc.

2. Inline-vs-team decision guidance

Whatever the entrypoint shape, its instructions must be clear enough for the lead session to decide whether to do the work inline or start a team, based on the task. The criteria are now formalized in inline-vs-team-criteria.md (PR #3559) — this section is preserved as the original framing. Inputs to the decision observed in practice:

  • number of independent workstreams (parallelizable tracks → team)
  • need for adversarial review loops (dyads need a lead to run hub-and-spoke, since teammates cannot spawn subagents; see followups.md item 7)
  • expected wall-clock vs. context budget (long serial work in one session vs. parallel teammates)
  • git-isolation needs (multiple concurrent implementers → one worktree each)
  • trivial/conversational tasks → always inline
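As a rough illustration of how a lead session might combine the inputs listed above, here is a minimal sketch. It is not the formal criteria from inline-vs-team-criteria.md: `TaskShape`, `choose_mode`, and the ordering of the checks are assumptions, and the wall-clock vs. context-budget input is left out because it needs estimates this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskShape:
    """Hypothetical summary of a task, filled in by the lead session."""
    independent_workstreams: int = 1
    needs_adversarial_review: bool = False
    concurrent_implementers: int = 1
    trivial: bool = False


def choose_mode(task: TaskShape) -> str:
    """Return "inline" or "team" (illustrative ordering only)."""
    if task.trivial:
        return "inline"  # trivial/conversational tasks always stay inline
    if task.independent_workstreams > 1:
        return "team"    # parallelizable tracks
    if task.needs_adversarial_review:
        return "team"    # dyads need a lead to run hub-and-spoke
    if task.concurrent_implementers > 1:
        return "team"    # one worktree per concurrent implementer
    return "inline"
```

Checking `trivial` first reflects the last bullet: a trivial task stays inline even if it superficially matches a team signal.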

3. Per-domain entrypoints → runbooks wiring

Each entrypoint should enumerate its domain's canonical docs (runbook, diagnostics page, dashboards, gates) so the session starts with the right context loaded instead of discovering it ad hoc. Where a domain lacks a runbook, creating it becomes a prerequisite task the entrypoint can name.

4. Shared artifact repo (replacing tmp/ for agent work products)

A new repository where each team member (human or agent team) has a folder — e.g. oliver/, raynor/ — and agents push their work products (specs, review ledgers, escalation files, screenshots, reports) instead of writing them to this repo's gitignored tmp/.

Mechanism sketch: a local hook that periodically pushes artifacts from local tmp/ (or from a fixed local folder outside this repo) to the artifact repo. Benefits observed during the uplift run: review ledgers and escalation files currently die with the machine/session; cross-session resume and human audit would both be served by durable, versioned artifacts. Open questions: push cadence; secrets hygiene before push (settings backups contain merchant data — need a scrub/allowlist); folder convention per team vs per member; retention.
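One way to approach the secrets-hygiene question is an allowlist applied before anything leaves the machine. The sketch below is an assumption-laden illustration: the suffix allowlist, the blocked name fragments, and the function names are all hypothetical, and a real scrub would also need to inspect file contents rather than names alone.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist: only these artifact file types are ever pushed.
ALLOWED_SUFFIXES = {".md", ".png", ".json"}
# Hypothetical blocklist: file-name fragments that suggest sensitive data.
BLOCKED_NAME_PARTS = ("settings-backup", "secret", ".env")


def is_pushable(relative_path: str) -> bool:
    """True if the artifact passes the name-based allowlist."""
    path = PurePosixPath(relative_path)
    name = path.name.lower()
    if any(part in name for part in BLOCKED_NAME_PARTS):
        return False
    return path.suffix.lower() in ALLOWED_SUFFIXES


def plan_push(member: str, relative_paths: list[str]) -> dict[str, str]:
    """Map each pushable local path to its destination in the member's folder."""
    return {p: f"{member}/{p}" for p in relative_paths if is_pushable(p)}
```

An allowlist fails closed: an artifact type nobody anticipated is held back rather than pushed.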

The repo should also hold agent self-reflection notes: a dedicated place (e.g. <member>/reflections/) where each agent writes what went well, what it got wrong, instructions it found ambiguous, and process friction it hit — distinct from task artifacts (specs/ledgers are about the work; reflections are about the worker and the process). During the uplift run this content surfaced only ad hoc in messages to the lead (e.g. "the Workflow tool is unavailable in my session", "my first test run hit the wrong checkout") and was lost unless the lead happened to relay it. Durable reflections are also the primary input corpus for the self-improvement loop (item 5) — the loop mines them for recurring friction instead of re-deriving it from raw transcripts. Suggested shape: short, structured notes per task or per session (went-well / went-wrong / ambiguous-instructions / proposed-fix), written at task completion as part of the standard teammate close-out protocol.

5. Self-improvement loop (hook + skill)

A local hook that periodically starts a Claude team with a new /self-improvement skill as the lead entrypoint. The loop:

  1. Scan the artifact repo (item 4) — review ledgers, escalations, violation logs, post-mortems, and the agents' self-reflection notes — and note shortcomings in current orchestration (e.g. the kinds of issues the uplift run surfaced: reviewers going silent without SendMessage, verifiers measuring the wrong checkout, stale-history "conflicts" after squash merges, missing FN preview links).
  2. Propose improvements to orchestration docs/skills/hooks, with the same rigor used for feature plans: doc-level adversarial review and inter-member proposal checks (dyads iterated to APPROVED).
  3. Raise a PR with docs for humans to review — proposals are human-gated, never self-applied.
  4. Once a human approves, fan out the changes for implementation with the standard implementation/verification dyad per change.

Safety properties to preserve: humans merge everything; the loop only ever raises doc/proposal PRs autonomously; implementation happens only after human approval of the proposal PR; the loop's own token budget is bounded.

6. Wire orchestration to the existing API documentation (OpenAPI/Swagger)

Agents currently burn time rediscovering API endpoints (or wait for a human to paste them). The definitions already exist in-repo — FastAPI generates OpenAPI automatically (every FastAPI app exposes /docs + /openapi.json unless disabled, e.g. admin_server/main.py; the monolith additionally has an explicit schema exporter at components/monolith/monolith/scripts/openapi.py). The fix is to connect entrypoints/playbooks to that generated source of truth, not to hand-write endpoint lists in docs/:

  • domain entrypoints (item 1) name the relevant service's OpenAPI source (script or running /openapi.json) as the canonical endpoint reference;
  • optionally a tiny helper (script or skill step) that dumps/filters the schema per service so an agent can grep routes + request/response models in one step;
  • explicit rule for doc authors: do not duplicate API definitions in docs/ — link to or generate from the OpenAPI schema. Hand-written endpoint tables (e.g. curl examples in runbooks) should carry a pointer to the generated schema as the authority when they drift.
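The "tiny helper" in the second bullet could look roughly like the sketch below, which lists operations from an already-loaded OpenAPI document (for example the JSON served at /openapi.json). `list_routes` and its output format are hypothetical; the only structure relied on is the standard OpenAPI `paths` object, whose per-path keys are HTTP methods.

```python
# HTTP method keys allowed on an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_routes(schema: dict, path_contains: str = "") -> list[str]:
    """Return sorted 'METHOD /path  summary' lines for matching operations."""
    lines = []
    for path, item in schema.get("paths", {}).items():
        if path_contains not in path:
            continue
        for method, operation in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            summary = operation.get("summary", "")
            lines.append(f"{method.upper()} {path}  {summary}".rstrip())
    return sorted(lines)
```

An agent would load the schema with `json.load` and call `list_routes(schema, "merchants")` to narrow the output to one area before reading the request/response models.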

7. Staleness audit of docs/ (the API-definitions problem, generalized)

Same failure class, broader scope: scan docs/ and the orchestration docs for content that silently goes stale as code changes, and either (a) derive it from code, (b) guard it with a parity test (the inline-parity.test.js pattern), or (c) mark it with an owner + review trigger. Known staleness-prone categories from a first pass:

  • endpoint/curl examples in runbooks and diagnostics pages;
  • per-merchant integration docs (domains, index IDs, preview links; the uplift run hit an expired Muji CA preview and a missing FN link);
  • design docs describing code that has since moved (e.g. anything referencing settings-converter.ts, which is being split into app/lib/converter/ modules in PR #3527: once that merges, every doc naming the monolith goes stale, and if PR #3527 instead closes unmerged, this doc's description of the split goes stale);
  • hardcoded size/limit figures (the stale <55KB bundle target in storefront-widget-dev-guide.md, found during S10 verification);
  • component command registries and CI-step descriptions.

A periodic staleness sweep is a natural early task for the self-improvement loop (item 5).
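For the "code that has since moved" category, a first-pass sweep can be as simple as checking that path-like strings in a doc still exist in the repo. The sketch below is a deliberately naive illustration: `find_dangling_paths` and the regular expression are assumptions, and it will produce false positives on URLs and on paths in other repositories.

```python
import re
from pathlib import Path

# Matches path-like tokens with at least one "/" and a file extension,
# e.g. app/lib/example.ts. Bare file names are ignored on purpose.
PATH_TOKEN = re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\.\w+\b")


def find_dangling_paths(doc_text: str, repo_root: Path) -> list[str]:
    """Return path-like tokens in doc_text that do not exist under repo_root."""
    dangling = []
    for token in PATH_TOKEN.findall(doc_text):
        if token not in dangling and not (repo_root / token).exists():
            dangling.append(token)
    return dangling
```

Run against each file under docs/, the output is a candidate list for a human or a verifier to triage, not an automatic fix.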

8. Mine 5 months of PR review history to improve the reviewer pipeline

Scan this repo's PRs from the last ~5 months and collate review findings — what human reviewers (and claude[bot]) actually flagged, requested changes on, or caught post-approval — and feed the patterns back into the whole agent pipeline, not just the reviewer side:

  • reviewer side: code-verifier briefs, plan-verifier briefs, and the checklists in docs/dev/orchestration/code-review-guide.md;
  • implementer side: code-implementer (and peer agent) instructions and the skills they run (general-feature-pr, integrate-storefront, etc.) — the highest-leverage fix for a recurring review finding is teaching the implementer to not produce it, with the reviewer clause as the backstop.

Goals:

  • learn the recurring finding categories humans catch that our verifier prompts don't ask about (and vice versa — where agent review adds noise humans ignore);
  • turn the top recurring findings into explicit checklist items / brief clauses AND matching implementer-instruction/skill updates, with repo-specific examples;
  • baseline metrics: findings-per-PR by source (human vs bot vs dyad), rate of post-approval defects, review-round counts — so future pipeline changes can be measured rather than vibes-judged.

This is a one-off mining task with a recurring follow-up (the self-improvement loop re-runs it incrementally on new PRs).
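The findings-per-PR baseline can be computed with a few lines once findings are collated. In this sketch the record shape (a dict with a `source` key) and the function name are assumptions; `total_prs` is passed in separately so that PRs with zero findings still count in the denominator.

```python
from collections import Counter


def findings_per_pr(findings: list[dict], total_prs: int) -> dict[str, float]:
    """Average findings per PR for each source ("human", "bot", "dyad").

    Each finding is a dict with a hypothetical "source" key.
    """
    if total_prs <= 0:
        raise ValueError("total_prs must be positive")
    counts = Counter(finding["source"] for finding in findings)
    return {source: count / total_prs for source, count in counts.items()}
```

Computing the same numbers before and after a pipeline change gives the comparison the third goal asks for.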

9. Dynamic recruitment — teammates request the lead to recruit new members

Today the team roster is fixed by the lead at spawn time, and teammates cannot spawn subagents themselves (followups.md item 7 — the harness enforces no nested teams). So when a member discovers a new concern or dimension of the team's task that no current member owns — a security surface, a perf regression, a UI/a11y facet, an unforeseen infra dependency — its only options today are to silently absorb the work (out of its lane, often badly) or drop it. Both were observed during the uplift run.

New capability: a member can issue a structured recruitment request to the lead, of the form "I found a concern X that's outside my lane; here's why it needs its own owner; here's the teammate role/spec I'd propose." The lead remains the only actor that spawns (hub-and-spoke is preserved), so this is a member → lead request that the lead evaluates against inline-vs-team-criteria.md before acting: recruit a new teammate, absorb the concern into an existing member's brief, or defer it.

Open design questions:

  • The message contract for a recruitment request: what the requester MUST include — the concern, the evidence it exists, why it falls outside their lane, and a proposed teammate role/spec the lead can spawn from.
  • How the lead decides recruit-vs-absorb-vs-defer — likely the same inline-vs-team-criteria.md inputs (independent workstream? needs its own adversarial dyad?) applied mid-run rather than only at kickoff.
  • Guarding against unbounded roster growth: a recruitment budget or an explicit lead gate so a chain of discoveries can't fan out the team without bound.
  • How a newly recruited member is briefed on work already done — likely a pointer to the shared artifact repo (§4) and the convergence log (§11) rather than a re-derivation from transcripts.
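To make the first and third questions concrete, here is one possible shape for the request and a budget-based gate. Everything in it is hypothetical (the class names, the field names, and the idea of a fixed integer budget); it only shows that the contract and the growth bound can be enforced in one place, on the lead's side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecruitmentRequest:
    """Hypothetical member -> lead message contract."""
    concern: str
    evidence: str
    out_of_lane_reason: str
    proposed_role: str

    def is_complete(self) -> bool:
        fields = (self.concern, self.evidence, self.out_of_lane_reason, self.proposed_role)
        return all(value.strip() for value in fields)


class Roster:
    """Lead-owned roster with a bounded recruitment budget."""

    def __init__(self, members: list[str], recruitment_budget: int) -> None:
        self.members = list(members)
        self.remaining = recruitment_budget

    def recruit(self, request: RecruitmentRequest) -> bool:
        """Add the proposed role if the request is complete and budget remains."""
        if not request.is_complete() or self.remaining <= 0:
            return False
        if request.proposed_role in self.members:
            return False  # the concern already has an owner; absorb instead
        self.members.append(request.proposed_role)
        self.remaining -= 1
        return True
```

A `False` result is the point where the lead falls back to absorbing the concern into an existing brief or deferring it.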

10. Generalize the orchestrator substrate across task types AND levels

The central thesis of this branch: the team/orchestrator structure is not feature-work machinery that other flows imitate — it is one substrate that every flow should reuse, varying only the prompt and the artifacts. Two linked observations.

  1. One substrate, many prompts. The self-improvement loop (§5) does not need its own bespoke machinery — it invokes the same team/orchestrator structure as feature work, just with a different lead prompt. The same holds for feature development on any surface (control-plane microservices, ecom indexer, Cloudflare workers, UI work), for plans, for refactors, and for investigations. The substrate must therefore be general enough to drive any of these task types while preserving strong self-verification — the adversarial review gate — regardless of domain. The general-feature-pr skill and code-review-guide.md are the first step of this generalization (a domain-neutral playbook behind the same dyad); the goal is to make that the rule, not a parallel track.

  2. One dyad, every level. Making a plan, writing code, writing a doc, tuning search — each runs through the same basic dyad: an implementer/producer plus an independent reviewer, iterating to APPROVED (cf. iteration-patterns.md Pattern B and code-review-guide.md). The substrate must generalize this dyad recursively across levels: a plan is implemented-and-reviewed before the code it specifies is implemented-and-reviewed; a self-improvement proposal is implemented-and-reviewed before the changes it proposes are implemented-and-reviewed (§5 already sketches this two-level shape). The dyad is the invariant; the prompt and the artifacts vary by level.

Open design questions:

  • Where the task-type-specific prompt lives vs. the shared substrate — the tier-1/2/3 split from §1 (must-execute inline, read-on-demand playbook, background link) applied to the lead and dyad prompts.
  • What "strong self-verification" means concretely for non-code artifacts. A plan-verifier or doc-verifier cannot run tests — what does it check instead (internal consistency, coverage of the spec, alignment with existing docs)?
  • How deep the recursion goes before a separate dyad stops paying for itself — this ties directly to inline-vs-team-criteria.md: trivial sub-steps stay inline.
  • Whether the producer and reviewer are the same agent definitions across all levels or specialized per level (code-implementer/code-verifier, plan/plan-verifier, search-tuner/search-verifier already hint at the specialized answer).

11. Converge tightly-coupled workstreams on a shared contract/log

When multiple members work on tightly coupled but different faces of one problem — a producer and consumer of the same API, a schema and its migration, a storefront widget and the settings that drive it — independent work drifts unless they converge early on a single common contract/design/accomplishment log. Without it, the interface gets re-litigated per member and no member can see what the others have landed against the shared design.

Encourage explicit cross-member communication (member ↔ member via SendMessage, not only member ↔ lead) to agree the shared contract up front and record progress against it in one place. The interface is then decided once, and the log doubles as a live view of what each coupled member has shipped. This relates directly to §4 (shared artifact repo): the convergence log is a natural artifact to persist there rather than in a single member's context.

Open design questions:

  • Who owns the shared log — a designated member, the lead, or a dedicated "contract" artifact file that no single member owns.
  • How conflicts on the contract are resolved — escalate to the lead, or run a contract-owner dyad that itself iterates the contract to APPROVED (§10).
  • How member ↔ member traffic interacts with the hub-and-spoke default where most coordination flows through the lead — when is direct member chatter worth the loss of lead visibility.
  • Avoiding the log becoming a stale third source of truth alongside the code and the PR — the same staleness class §7 calls out, applied to coordination artifacts.

12. Domain-specific verifiers + a reusable verifier prompt skeleton

Implementation optimizes toward the verifier. The producer in the dyad will, intentionally or not, optimize for whatever the verifier actually checks — Goodhart's law applied to the implement-review loop. That makes the verifier the de-facto spec: whatever it lets through is what "done" means in practice. Two consequences a good verifier must satisfy:

  1. Understand intent properly. The verifier must check against the real goal of the task — the spec's intent and the user's underlying need — not the surface proxies that are easy to measure. A verifier that grades only gameable proxies (tests exist, diff compiles, screenshot is non-empty) trains the producer to satisfy the proxy and miss the point.
  2. Enforce standards toward achieving that intent. The verifier must hold a firm bar — severity rubric, required evidence, an anti-rubber-stamp posture — and push the producer toward genuinely meeting the intent rather than minimally clearing the check.

Everything below is how we operationalize that framing. The substrate's whole self-verification guarantee (§10 — the dyad is the invariant) is only as strong as the verifier on the other side of the dyad, so investing in good verifiers is the leverage point: a weak reviewer silently lowers the bar for every producer it gates, while strong domain verifiers are what make the "generalize across task types" thesis safe to lean on. A generic reviewer under-catches in specialized domains (search relevance, CSS/visual, infra/IaC, data-modeling/DDB, security, API-contract), so the first-class investment is building good verifiers per domain — and, so they don't drift apart, a shared verifier prompt skeleton that each domain specializes.

We already have specialized reviewer agent-defs — code-verifier, css-verifier, search-verifier, plus the investigative qc-investigator — and the shared code-review-guide.md and review-severity-rubric.md. Today the per-domain specialization is ad hoc; the idea is to make it systematic via a reusable skeleton every domain verifier fills in. Candidate sections (to be refined — these are themselves open questions):

  • role/scope — what this verifier owns and, explicitly, what it does not;
  • spec/contract being checked — the intent to verify against, not just the diff (this is where consequence 1 above lands);
  • domain-specific checklist — the highest-value findings for that domain, ideally mined from real review history (§8);
  • severity rubric — reuse review-severity-rubric.md rather than re-inventing per domain;
  • evidence/repro requirements — what the verifier must actually run or observe, not merely read: css-verifier screenshots the element, search-verifier runs the queries with the override header, code-verifier runs the tests;
  • anti-rubber-stamp clause — must look for a reason to reject and report residual uncertainty rather than defaulting to PASS;
  • output contract — verdict plus findings tagged by severity with file:line.
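The output-contract section could be pinned down with a small, parity-checkable format. The sketch below is an assumption throughout: the severity labels stand in for whatever review-severity-rubric.md defines, `CHANGES_REQUESTED` is a placeholder for the non-approving verdict, and `render_verdict` is a hypothetical helper.

```python
from dataclasses import dataclass

# Placeholder severity labels; the real ones come from the severity rubric.
BLOCKING_SEVERITIES = {"blocker", "major"}


@dataclass(frozen=True)
class Finding:
    severity: str  # e.g. "blocker", "major", "minor" (hypothetical labels)
    location: str  # "file:line"
    message: str


def render_verdict(findings: list[Finding], residual_uncertainty: str = "") -> str:
    """Render a verdict line followed by one line per finding."""
    blocking = [f for f in findings if f.severity in BLOCKING_SEVERITIES]
    status = "CHANGES_REQUESTED" if blocking else "APPROVED"
    lines = [f"VERDICT: {status}"]
    for finding in findings:
        lines.append(f"[{finding.severity}] {finding.location} {finding.message}")
    if residual_uncertainty:
        # Anti-rubber-stamp clause: uncertainty is reported, not dropped.
        lines.append(f"RESIDUAL UNCERTAINTY: {residual_uncertainty}")
    return "\n".join(lines)
```

A fixed first line makes the verdict easy for the lead to parse, and the residual-uncertainty line gives the anti-rubber-stamp clause a required slot in the output.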

Tie-ins: §8 (mining PR review history) is the input corpus for each domain's checklist; §10 (the dyad is the invariant) is what these verifiers plug into; and the §5 self-improvement loop can periodically refresh the per-domain checklists so they track what reviewers are actually catching.

Open design questions:

  • One shared skeleton with domain "fill-ins" vs. fully separate prompts per domain — and where the skeleton lives in the §1 tier split (tier-1 inline vs. tier-2 referenced doc).
  • How to keep domain checklists fresh — a named owner plus a review trigger, or auto-derived from §8 mining — so they don't go stale in the way §7 warns about.
  • How to evaluate verifier quality itself: do we need a "verifier of verifiers", or a labeled set of known-good and known-bad diffs to score a verifier against?
  • Which domains warrant a dedicated verifier agent-def next (security, infra/IaC, data-modeling/DDB, API-contract) vs. coverage by a generalized code-verifier carrying a domain checklist section.
  • How the required-evidence step interacts with sandbox/tool limits — a verifier that must run tests or take screenshots needs the matching tools in its agent-def, which constrains where it can run.

13. Encode target-concern escalation into the orchestrator substrate

The general rule (an instance of which is the process section at the top of this doc): any dyad member — implementer or verifier — may raise concerns to its lead, and the lead to the human. Specifically for targets: a member that comes to believe the target/spec itself is wrong (stale, mis-scoped, contradicted by code or harness reality, superseded) must not silently re-scope the work to fit and must not press on against a target it believes is broken — it raises the concern (evidence + proposed amendment) to the lead; the lead triages (resolve within the target's intent, or amend the target); target amendments are human-gated. The lead does not quietly re-scope either; work blocked on a target concern pauses rather than guessing, while unaffected work continues.

Today this exists only as the binding note in this doc. The encoding work:

  • Lead protocol: a canonical "target concerns" paragraph in team-lead-protocol.md §Escalation handling, alongside the existing blocker/clarification escalation rule.
  • Implementer defs: a target_concern escalation type in the escalation-file vocabulary (code-implementer.md §5 and peers), with the raise-don't-absorb instruction.
  • Verifier defs: a clause in each verifier-class def (code-verifier, plan-verifier, css-verifier, search-verifier, finisher): a verifier that believes the spec/target is wrong states that as a target concern in its verdict instead of grading work against a target it believes is broken — and does not rubber-stamp it either.

Open design questions:

  • Whether the verifier clause is one shared paragraph duplicated per def (the SendMessage-verdict precedent — identical text, parity-checkable) or a referenced doc (tier-2 per §1's loading model; risk: not loaded when needed).
  • How a target_concern interacts with the round budget — does an open target concern stop the 3-round clock, and who restarts it.
  • Whether the human gate needs an artifact convention (e.g. amendments land as PRs to the target doc, like this one) or stays conversational.

14. Clean up and reorganize docs/dev/orchestration/ — and keep it clean

One-off pass first: reorganize the files in this directory, and move or mark the outdated ones — anything superseded by shipped work, describing pre-uplift behavior, or duplicating a canonical doc (e.g. the team-lead-protocol.md vs integrate-storefront/SKILL.md divergence its own header warns about). archive/ exists for retired docs; everything that stays must say what it is current FOR (the README's canonical-vs-living-state split is the start). This is §7's staleness audit pointed at the orchestration docs themselves.

Then keep it clean: consider a scheduled (cron-style) hook that starts a small cleanup-and-update-repo team whose brief is to sweep the repo's docs (this directory first, docs/ generally per §7), detect drift/staleness, and raise doc PRs. This overlaps deliberately with the §5 self-improvement loop; the open question is whether it is §5's first scheduled task or a separate lighter-weight loop (docs-only, no proposal stage, since staleness fixes are low-risk enough to go straight to PR, still human-merged).

Open design questions:

  • Archive vs delete vs mark-in-place — and what the staleness marker is (header banner with owner + review trigger, per §7).
  • Hook cadence and trigger (time-based cron vs post-merge), and its token budget.
  • Relationship to §5: one loop with a docs-sweep task, or two loops with different risk profiles (docs-only PRs vs proposal-gated process changes).

15. Spec-intake stage — stakeholder objectives become verifier targets

Proposed by Raynor (2026-06-11, roadmap-execution session). Today, clarification is exception-triggered: /grill-me exists and is wired into integrate-storefront's triage table ("unclear or vague → clarify first"), and the general-dev loop's step 1 is CLARIFY — but nothing makes a clarifying interview the DEFAULT opening stage for a new feature or problem report; detection of "vague" is left to agent judgment, which under-triggers (agents assume rather than ask); and nothing requires the clarified stakeholder objectives to become the spec artifact the dyad's verifier grades against. §12's premise — the verifier is the de-facto spec; verify intent, not proxies — assumes the intent was captured somewhere. Today that capture is informal.

New standing stage: every entrypoint / team-lead session opens with spec-intake before planning or fanning out:

  1. Interview the stakeholder (default-on for new features and problem reports; bias-to-ask): objectives, success criteria, constraints, explicit out-of-scope. /grill-me is the existing mechanism — promoted from exception path to standing stage, with proportionality (a trivial one-liner needs zero questions; a new feature needs the interview).
  2. Record the objectives as the spec artifact that seeds the dyad's scope contract — making the chain stakeholder objectives → spec file → reviewer-brief SCOPE CONTRACT → verifier target explicit and auditable.
  3. Sequence it before inline-vs-team (inline-vs-team-criteria.md): the clarified scope is an input to that decision, not the other way round.

Open design questions:

  • The proportionality rule: what triggers a full interview vs a single confirmation question vs nothing — and how to counter the under-asking bias without nagging on trivial tasks.
  • Async stakeholders: what the session does when the human doesn't answer (block, proceed-with-stated-assumptions + flag for review, or defer the fan-out).
  • Relation to the pre-implementation "done contract" pattern (producer and verifier agree what "done" looks like before work starts — cf. Anthropic's harness-design-for-long-running-agents writeup): spec-intake is the human↔lead leg; the done-contract is the lead↔dyad leg of the same chain. Encode together or separately?
  • Where it's encoded: the §1 entrypoint template and the team-lead protocol are the natural homes (the entrypoints and substrate plans' phases already touch both); verifier-side provenance ("which stakeholder objective does this target trace to?") may belong in the §12 skeleton's spec-intent section.

Provenance

Context that motivated these: the June 10–11 overnight uplift run (4 plans, ~23 PRs, hub-and-spoke dyads) exercised the orchestration substrate hard and generated the failure-mode catalog referenced in item 5. See followups.md item 7 for the harness-behavior findings from the same run.