Orchestration Holistic Review
Synthesizer: holistic-review (consolidates 5 lenses + adversarial critique + prior team-audit)
Date: 2026-06-10
User goal: "Handle cooperation between human and multiple agents to do all tasks to do with integration and development with the Shopify app, with composable components we can use to define other workflows like general development etc."
Inputs: PRIMITIVES, ROLES, HUMAN-IN-LOOP, GENERALITY, FAILURE-MODES lenses; critique on blind spots/weak findings/cross-lens tensions; prior team-audit.md.
1. TL;DR
The system has a clean orchestration substrate (Skill -> Team -> Workflow -> Subagent) and an implementer/verifier dyad that genuinely composes, but every higher-level guarantee — role boundaries, workspace paths, production gates, isolation — lives in prose that the model honors voluntarily rather than the harness enforces. The substrate is largely domain-neutral; the prose, the playbook reads, and the workspace conventions are Shopify-coded, so "use this for general dev" today means copy-and-edit three near-identical 150-line workflow scripts and seven Shopify-flavored agent files. The most consequential gap blocking the user's stated goal is not a missing primitive — it is a missing enforcement layer between the lead's prompt-level intent and the teammate's runtime behavior: no TeammateSpawned validator, no tools: frontmatter, no typed workspace handle, no production-gate language at CLAUDE.md scope. To make this stack support both Shopify integrations and general dev with confidence, the work splits into three concurrent tracks: harden the existing storefront flow (fix the four P0 enforcement gaps), extract a parameterized reusable core (playbook + workspace + loop primitives), and add a thin observability/recovery layer (team-status, idle-watchdog, checkpoint).
2. Current State
Composable components inventory (consolidated)
Domain-neutral substrate (works as-is for general dev):
- Skill as human entry point (
integrate-storefront,grill-me,raise,monitor-pr) — the only thing humans invoke. - Team primitives (
TeamCreate,SendMessage,TaskCreate/List/Update,shutdown_request) — persistent multi-agent container, all generic. - Three-tier composition (Workflow / Subagent / Teammate) clearly distinguished in
.claude/skills/integrate-storefront/SKILL.md:14-26. - Implement -> Verify -> Iterate loop with structured
{pass, feedback}schema — identical shape across.claude/workflows/css-section.js:99-105,.claude/workflows/feature-pr.js:91-98,.claude/workflows/search-tuning.js:96-104. - Verifier output contract
{pass: boolean, feedback: string}— de-facto domain-neutral interface. /raiseskill — git branch + commit + push + PR + reviewers + e2e label detection; Marqo-monorepo-coupled (reviewer list) but Shopify-neutral./grill-meskill — structured human clarification with audit-trail JSON in${WORKSPACE}/decisions/.- Escalation file convention
${WORKSPACE}/escalations/<task>.json— typed async channel teammate -> lead. - P0-P3 severity rubric referenced from
docs/dev/review_severity_rubric.md.
Domain-neutral but orphaned (defined, never invoked by integrate-storefront):
.claude/agents/plan-verifier.md,test-case-generator.md,project-clarity-interviewer.md— three generic engineering roles with zero references in any workflow.
Shopify-coupled (would need re-skin or parameterization to reuse):
code-implementer.md:8-19(Shopify scope),code-verifier.md:11-12(mandatorydocs/integrations/AGENTS.mdread),css-implementer.md/css-verifier.md/finisher.md(Marqo selectors, Layer 3 custom_css),search-tuner.md/search-verifier.md(Marqo override header, admin.ecom.marqo.ai endpoints).feature-pr.js:27-31,css-section.js:28-33,search-tuning.js:28-34— shop slug / merchant doc args.- Workspace convention
tmp/integrations/<shop-slug>/with[TAG:shop-slug]task tags hooks parse — hard-coded in.claude/hooks/task-completed.sh:17-25,.claude/hooks/teammate-idle.sh:35-43. - Browser1-4 partition mapped to the four CSS sections (cards/filters/paging/heading).
Harness primitives available but unused:
EnterWorktree/ExitWorktree,Monitor,ScheduleWakeup,CronCreate,PushNotification,WebFetch,WebSearch— referenced in deferred-tools manifest, zero references anywhere under.claude/.
Hooks (only three exist):
TaskCreated, TaskCompleted, TeammateIdle (.claude/settings.json:46-75). No TeammateSpawned, no WorkflowFinished.
3. Top Findings (consolidated, ranked by impact)
P0-A — Enforcement layer is missing across the stack (PRIMITIVES F2/F8/F9, ROLES F11, FAILURE-MODES F1/F2/F5)
Every safety guarantee — isolation, workspace path, tool restriction, role boundary — is prose the model honors voluntarily. There is no TeammateSpawned hook (so isolation can be silently omitted), no typed workspace parameter (so ${WORKSPACE} placeholder can drift to a gitignored relative path), no tools: frontmatter on any agent file (so documented restrictions in docs/integrations/AGENTS.md:177-187 are model-honored fiction). The lead currently relies entirely on prompt prose to enforce these — and PR #3470 plus the audit's F1 already documented real silent data loss from this gap.
P0-B — Verifier -> implementer feedback path is broken after workflow exit (ROLES F2, FAILURE-MODES F3, prior audit B1)
Verifier prose forbids fixing (code-verifier.md:147-152, css-verifier.md:184-191, search-verifier.md:120-126); implementer lifecycle is bounded by the for-loop inside feature-pr.js:73-125. After max-rounds or pass, the implementer subagent is gone. Persistent verifier-class teammates contacted later (e.g., qa-multiplier in the MSQC session) have no implementer handle and no contract authority to spawn one. The lead becomes a mandatory broker for every post-loop fix — the exact symptom that triggered this review.
P0-C — QC/investigation role is hallucinated (FAILURE-MODES F4, prior audit B4)
The live MSQC team used a qa-multiplier teammate but .claude/agents/qa-multiplier.md does not exist. The boundary it enforced ("can't raise a PR") had no contract source — it was self-invented from the role name. This is the worst class of role failure: not wrong contract, but no contract masquerading as one. Same symptom would recur for any general-dev role (investigator, refactor-author, deploy-verifier) that doesn't have a written definition.
P0-D — Production-gate language lives in one skill, not at CLAUDE.md scope (HUMAN-IN-LOOP F4)
CLAUDE.md has no "Executing actions with care" section. The user-required confirm-before-prod-push rule (memory item feedback_confirm_prod_actions) is enforced only inside .claude/skills/integrate-storefront/SKILL.md rule 1. Any other agent (search-tuner directly invoked, code-implementer via /raise alone, future general-dev agent) inherits no such gate. Extending the system to general dev — DB migrations, infra teardown, customer data — without project-level gates means each new skill must re-add them.
P1-A — Three Shopify workflow scripts duplicate the same 150-line implement/verify shell (GENERALITY F2, ROLES F5)
css-section.js, feature-pr.js, search-tuning.js differ only in args, MAX_ROUNDS, role names, and prompt template. The shared shell has never been extracted. Adding a fourth role pair (accessibility, security-review, refactor) requires forking a fourth 150-line file. Bug fixes to loop semantics require three parallel edits. This is the single biggest reusability blocker for "compose new workflows" — the lens specifically flagged that an runImplementVerifyLoop({implementerRole, verifierRole, maxRounds, makePrompt}) primitive would collapse these to three 30-line config files.
P1-B — docs/integrations/AGENTS.md is the highest-leverage coupling point (GENERALITY F8)
Every Shopify-pipeline agent begins with "Read docs/integrations/AGENTS.md at the start of every task" (css-implementer.md:11, css-verifier.md:11, code-implementer.md:10, code-verifier.md:12, search-tuner.md:11, search-verifier.md:11, finisher.md:11, .claude/skills/integrate-storefront/SKILL.md:11-12). A playbook workflow arg + Read $PLAYBOOK_PATH would let a general-feature-pr skill point at docs/dev/code-review-guide.md and reuse the same agents.
P1-C — /raise ownership duplicated between teammate prose and workflow Phase 4 (prior audit A3, FAILURE-MODES F6, ROLES F3)
code-implementer.md:108-115 step 6 says raise; feature-pr.js:127-145 Phase 4 spawns a FRESH agent("code-implementer", ...) subagent to raise. If teammate runs workflow end-to-end, PR is raised by anonymous subagent in workflow cwd; persistent teammate is bypassed. If teammate doesn't run workflow, nobody raises (the tag-namespacing teammate in MSQC went idle plan-approved without a PR landing). No contract resolves who owns this step.
P1-D — Idle/long-running teammates have no watchdog, no checkpoint, no context-freshness guard (FAILURE-MODES F10/F11, blind spot #5)
Teammates go idle waiting for human re-engagement; SKILL.md prescribes "human comes back, lead re-messages teammate" (SKILL.md:316-322). No timeout, no idle-warning, no "re-fetch baseline before verifying" instruction. Conversation context decays under idle — PR may be amended, settings drifted, theme updated. Recovery on session end is ad-hoc handoff docs (SESSION-HANDOFF.md, handoff-tag-namespacing.md). No first-class TeamSnapshot / TeamResume.
P1-E — Auto Mode tension with hard gates is unresolved (HUMAN-IN-LOOP F3)
Auto Mode tells the lead "make the reasonable call and keep going"; SKILL.md:94-96 and :302-307 tell the lead "MUST invoke /grill-me". No precedence rule. For undocumented third-party apps and borderline-vague CSS scopes, behavior is non-deterministic.
P2 — secondary findings worth tracking
- Browser1-4 partition is enforced by prose, not harness (FAILURE-MODES F14). Race conditions on shared workspace writes are not modeled (blind spot #6). Code-verifier prose says "no browser" while
AGENTS.md:184lists browserN (prior audit A2).code-implementerpost-deploy verification violates the implementer/verifier separation (prior audit D3, FAILURE-MODES F15). Implementer/verifier cost (re-reading playbook every round, full Opus on both sides) is unexamined (blind spot #1). No structured "minimum viewable diff" for human gates — humans approve summaries (blind spot #2). No learning loop — every audit's findings live in gitignoredtmp/(blind spot #3). No credential-scope primitive separate from tool-list (blind spot #4). No reversibility classification of actions (blind spot #7). No team observability surface (blind spot #8). Lead role has no CANNOT list (blind spot #9, ROLES F12). Human as approver is treated as ground truth, no sanity-check (blind spot #10).
4. Critical Gaps Blocking the Stated Goal
The user wants "cooperation between human and multiple agents to do all tasks to do with integration and development." The composable-substrate piece is largely in place; the blockers are:
-
No enforcement layer. Every guarantee is voluntary. For Shopify integration this is "fragile but usually works because the four-section browser partition naturally separates concurrent writes" (critique cross-lens-tension E). For general dev, where work doesn't partition naturally, the absence is fatal. Required:
TeammateSpawnedhook, typedworkspaceparameter, mandatorytools:frontmatter, harness-validated browser/credential leases. -
No domain-neutral feature-PR path.
code-implementerandfeature-prare 80% generic but anchored to Shopify by theAGENTS.mdread and theshopSlug/appGuideargs. For general dev you must fork. Required:playbookarg replacing hard-codedAGENTS.mdreads;general-feature-prskill that wraps the same workflow with non-Shopify playbook. -
No verifier -> implementer recovery primitive. After the workflow exits, the only fix path is through the lead. For general dev with longer-lived PRs (code review iterations, post-deploy patches), this bottleneck dominates wall-clock time. Required: verifier-may-spawn-fix-implementer contract OR durable implementer teammate.
-
No project-level gate language. CLAUDE.md has no "Executing actions with care" section. General dev (prod deploys, DB writes, credential use) requires gates more often than storefront CSS does. Required: CLAUDE.md universal gates + per-playbook domain gates (two-tier).
-
No observability/recovery. Lead's attention is the bottleneck and there is no team-status surface, no idle-watchdog, no checkpoint. Long-running sessions (multi-day refactors, deploy windows) collapse to ad-hoc handoff documents. Required:
TeamHealthCheck, idle-watchdog,TeamSnapshot/TeamResume. -
No first-class investigation/triage role. Every workflow is implement-or-verify. "Investigate this bug", "diagnose this failure", "explore this codebase area" has no role (GENERALITY F12). The hallucinated
qa-multiplierwas filling this gap. Required:qc-investigatoragent definition with explicit fan-out authority (per prior audit B4).
5. Options for the Path Forward
Option A — Stabilize Storefront First, Generalize Later (S, low risk)
Scope: Land prior audit P0/P1 fixes (#3470 isolation, verifier-spawns-fix contract, qc-investigator.md, tools: frontmatter, deduplicate /raise). No new abstractions.
What changes:
- Hooks: add
TeammateSpawnedvalidating${SHARED_WORKSPACE}is absolute andisolation: "worktree"is set for code-touching roles. - Agent files: add
tools:frontmatter matchingAGENTS.md:181-187. code-verifier,css-verifier,search-verifier,finisher: add "Escalating fixes" section permitting transient subagent spawn for tactical post-loop fixes.- New
.claude/agents/qc-investigator.mdcodifying the hallucinated role. feature-pr.js: remove Phase 4; teammate owns/raiseafter workflow returns approved.
Risk: Low. All changes target documented gaps. Doesn't move the system toward generality.
Enables: Reliable Shopify integrations end-to-end without lead-as-broker for every fix. Does NOT enable general dev — the system stays Shopify-shaped.
Option B — Extract the Reusable Core (M, medium risk)
Scope: Everything in A, plus extract the implement/verify loop and parameterize the playbook.
What changes:
- New
.claude/workflows/_lib/implement-verify-loop.jsexposingrunImplementVerifyLoop({implementerRole, verifierRole, maxRounds, makePrompt, schema, args}). Existing three workflows reduce to ~30-line config files. - New workflow arg
playbook(absolute path to a domain guide); agent prose changes from "Readdocs/integrations/AGENTS.md" to "Read$PLAYBOOK_PATH". - New
.claude/skills/general-feature-pr/skill wrappingfeature-prworkflow withplaybook=docs/dev/code-review-guide.md(which needs to be authored as a slimmer non-Shopify equivalent). - Workspace convention generalized: hooks search
tmp/work/<slug>/ORtmp/integrations/<slug>/; task tags[TAG:slug]allowed regardless of domain. - CLAUDE.md: add "Executing actions with care" section with universal gates (no production writes without confirmation, no
git push --forcewithout explicit ask, no destructive operations without reversibility note).
Risk: Medium. Moves the source-of-truth for loop semantics into a shared lib — bugs propagate to all workflows. Requires authoring a generic playbook that's not just a strip-down of AGENTS.md.
Enables: Define a new role pair (refactor-implementer/verifier, deploy-implementer/verifier, security-review pair) by writing two agent definitions and a 30-line workflow config. The substrate becomes a real composable framework.
Option C — Full Enforcement + Observability Pass (L, higher risk)
Scope: Everything in B, plus the harness/enforcement layer.
What changes:
- Typed
workspaceparameter onAgent({...})spawns; harness substitutes${SHARED_WORKSPACE}in spawn prose.TeammateSpawnedhook validates path is absolute and inside repo. tools:frontmatter becomes mandatory; harness rejects spawns of agents without it.- Browser/credential leases: declare
exclusive_resources: ["browser3"]in spawn args; harness rejects double-allocation. - New
team-statusslash command + JSON output: list teammates, idle duration, last message, worktree health. - Idle-watchdog hook: after N turns of
TeammateIdleawaiting lead response, auto-SendMessageto lead summarizing the open ask. TeamSnapshot()/TeamResume(snapshot_id)primitives replacing handoff docs.- Verifier tool tiering: declare loop-scope tools (read-only) + escalation-scope tools (can spawn fix-implementer). Resolves the cross-lens tension between sandboxing and self-orchestration.
- CLAUDE.md universal gates plus reversibility classification — actions tagged reversible / reversible-with-effort / irreversible, with different gate UX per tier.
Risk: Higher. Touches harness behavior. Requires harness change-control. Risk of over-engineering before validating Option B reveals the right abstractions.
Enables: Production-grade multi-agent system that scales to general dev safely. Long-running sessions become first-class. Audit trails become real artifacts, not conversation excerpts.
Option D — Investigation-First Variant (M-L, medium risk, complementary)
Scope: A first-class Investigator workflow primitive distinct from implement/verify, addressing the missing "diagnose / explore" role across all domains.
What changes:
- New
.claude/agents/investigator.mdwith explicit fan-out authority and "produce a structured findings JSON" contract. - New
.claude/workflows/investigation.jsrunning 1-3 rounds ofinvestigator -> reviewer(reviewer evaluates whether the findings explain the symptom). - Investigator is permitted to
WebFetch/WebSearch(addressing PRIMITIVES F12). Lead's escalation triage for undocumented third-party apps can delegate to investigator before invoking/grill-me. - Investigator can recommend (not invoke) a follow-up workflow: feature-pr, search-tuning, css-section, or human escalation.
Risk: Medium. New role that may overlap with qc-investigator (Option A) — needs disambiguation. Underspecified investigator may produce verbose findings that don't reduce lead workload.
Enables: Both Shopify ("investigate this badge system") and general dev ("diagnose this prod 500") use the same primitive. Reduces lead's mechanical escalation triage.
6. Recommended Sequencing (any option)
Regardless of which option is chosen, do these first three in order:
Step 1 (this week) — Land prior audit P0s + project-level gate.
- Merge PR #3470 (workspace placeholder + isolation mandate). This eliminates the silent data-loss path that already happened in MSQC.
- Add "Executing actions with care" section to CLAUDE.md covering: no prod writes without explicit human OK, reversibility classification, branch-protection norms. Lift the user-memory
feedback_confirm_prod_actionsrule into project scope. - Add
.claude/agents/qc-investigator.mdcodifying whatqa-multiplieractually does. This unblocks future MSQC-shape sessions immediately.
Step 2 (this week, parallel) — Resolve role-contract ambiguities.
- Add
tools:frontmatter to every agent file. Resolve A2 (code-verifier browser yes/no — recommend yes for UI verification). Resolve D2 (stop callingfinishera verifier). Resolve D3 (rename post-deploy section in code-implementer.md to acknowledge the exception OR move to apost-deploy-verifierrole). - Add explicit "Escalating fixes outside the loop" section to every verifier definition permitting transient subagent spawn. Removes the lead-as-broker bottleneck for the most common case.
- Deduplicate
/raise: pick one source of truth. Recommend removingfeature-pr.js:127-145Phase 4 and putting it on the teammate. This also makes feature-pr usable by qc-investigator without the awkward shadow-subagent pattern.
Step 3 (next 1-2 weeks) — Extract the loop primitive and parameterize the playbook.
- Even if only Shopify is targeted, this reduces three 150-line files to one shared lib + three configs. Bug fixes to loop semantics happen once.
- Parameterize playbook via workflow arg. Author a slim
docs/dev/code-review-guide.mddistinct from the storefront AGENTS.md. This is the precondition for any non-Shopify reuse — without it, "general dev support" can't even start.
Step 4 onward — Option C harness work OR Option D investigation primitive. Pick based on which gap hurts more in practice once Step 3 lands. If MSQC-style investigation-then-fix recurs, do D. If long-running idle teammates / lost session state recur, do C's observability piece first.
7. Open Questions for the User (Design Decisions Only You Can Make)
-
Sandbox vs self-orchestration trade-off (cross-lens tension). Verifiers spawning fix-implementer subagents requires the
Agenttool. Iftools:frontmatter becomes mandatory, do verifiers get a tiered tool set (read-only inside loop, Agent-permitted outside loop)? Or do we accept that the "verifier can never modify code" claim is loop-scoped only? Option C resolves with explicit tiers; Option A leaves it as prose. -
Plan approval — lead-only or human-too? HUMAN-IN-LOOP F2 flagged that plan approval is teammate -> lead, no human. Forcing all plans through humans would 100x the human's approval workload (critique weak-finding). Acceptable triage: "plans touching production / customer data / cross-component / over N files require human; rest stay lead-only"? Or do you want CSS-implementer-style plans to remain fully agentic?
-
General-dev playbook content. Step 3 of the sequencing requires a
docs/dev/code-review-guide.mdthat's the general-dev equivalent of the storefront AGENTS.md. Do you want me to draft this from CLAUDE.md + the existing code-verifier.md checklist sections that ARE generic (P0-P3 rubric, CLAUDE.md compliance, fail-fast, Pydantic), or do you want to author it? -
Workspace path generalization. Hooks currently search
tmp/integrations/*/. For general dev, do we (a) generalize hooks totmp/{work,integrations}/*/, (b) move all work totmp/work/*/(rename), or (c) keeptmp/integrations/as the workspace term even for non-integration work? The hooks scan logic in.claude/hooks/task-completed.sh:31and.claude/hooks/teammate-idle.sh:31needs corresponding update. -
Auto Mode precedence. When Auto Mode is active, which gates remain hard? Recommend:
/grill-mefor undocumented apps stays hard; mid-workflow clarification stays soft (lead makes reasonable call); production push always hard regardless of Auto Mode. Confirm or adjust. -
Investigator role first-classness. Is
qc-investigatora Shopify-specific role (live now asqa-multiplier) or a general-dev primitive (Option D)? If the former, write the file and move on. If the latter, design the investigator + investigation workflow as a parallel track to feature-pr. -
Plan-verifier / test-case-generator / project-clarity-interviewer orphans. Three lenses flagged these as dead code in
.claude/agents/. Are they legacy (move out), reserved for a non-integration flow you have in mind (mark them with a note), or candidates to integrate into a future general-dev plan-first workflow? -
Reversibility tier on the human gate. Should production-affecting actions be classified (reversible / effortful-revert / irreversible) with different UX per tier (one-tap approve / show full diff / two-step confirm with summary)? Critique blind spot #7 — this is most consequential for general dev (DB migrations, infra teardown) but should be designed before non-Shopify writes happen.
-
Knowledge persistence across sessions. Where do audit findings, lessons learned, new third-party app catalog entries live? Currently
tmp/integrations/*/reports/is gitignored. Do we want adocs/integrations/learnings.md(or similar) that the lead writes to at session end, OR a hook that proposes AGENTS.md edits when an integration completes? -
Lead's CANNOT list. Every role has a CANNOT section except the lead. Do you want a documented "Lead CANNOT" list (e.g., "the lead never does implementation work inline — always spawn a teammate")? Critique blind spot #9.