Hippothesis: Full API Fuzz Testing

Status: Phase 1 foundation
Component: components/hippothesis
Tracking epic: dev-l29 (bd)

Problem

Hippothesis currently runs as a thin smoke-test probe: a handful of "hello" checks against configured base URLs, plus a single property-based search round-trip against search_proxy. The three ingress APIs that sit in front of Marqo Cloud each expose many more endpoints, and regressions in any of them are only caught post-deploy by the ecom E2E suite (which is narrow) or by customers. We want a single, environment-agnostic fuzz harness that exercises every public endpoint of:

search_proxy — Cloudflare Worker (Hono) — search, recommendations, agentic search, admin proxy
controller — Django DRF — account, indexes, merchandising, search, API keys
admin_lambda — FastAPI — index settings, query configs, exports/imports, forks, sync jobs, monitoring, onboarding, analytics

The target is Hippodrome (locally and in CI) and the shared dev/staging clusters. "Fuzz" here means Hypothesis-driven property tests: strategies generate realistic-but-varied inputs, exercisers drive HTTP clients, and assertions check structural invariants rather than exact values.

Goals

G1: Every endpoint of every ingress API has at least one Hypothesis exerciser that sends shaped-but-randomised input and asserts no-5xx + structural response invariants.
G2: HTTP clients for each ingress API are typed, composable, and reusable across exercisers. Auth, base URL selection, and retries live behind the client; exercisers don't touch wiring.
G3: Strategies are modular, composable, and aligned with the Pydantic/TS request models so failing examples surface real model mismatches.
G4: The full suite runs against Hippodrome in CI (local profile) and opt-in against dev / staging via the existing HIPPOTHESIS_ENV + per-service overrides.
G5: Exercisers degrade gracefully when a service isn't reachable or auth isn't available: pytest.skip by default, pytest.fail when HIPPOTHESIS_FAIL_ON_UNREACHABLE=true.
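The skip-versus-fail switch in G5 could be sketched roughly as follows. The helper names and the case-insensitive parsing of "true" are assumptions for illustration; only the environment variable name and the pytest.skip / pytest.fail behaviour come from the goal above.

```python
import os
from typing import Mapping, Optional


def fail_on_unreachable(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when HIPPOTHESIS_FAIL_ON_UNREACHABLE is set to "true".

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    source = os.environ if env is None else env
    value = source.get("HIPPOTHESIS_FAIL_ON_UNREACHABLE", "")
    return value.strip().lower() == "true"


def handle_unreachable(service: str, reason: str) -> None:
    """Skip the current test by default; fail it when strict mode is on."""
    # Imported lazily so the pure helper above has no test-runner dependency.
    import pytest

    message = f"{service} unreachable: {reason}"
    if fail_on_unreachable():
        pytest.fail(message)
    pytest.skip(message)
```

An exerciser fixture would call handle_unreachable("search_proxy", "no base URL configured") before issuing any request, so the decision lives in one place.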

Non-goals (Phase 1)

Stateful fuzzing (Hypothesis RuleBasedStateMachine). Covered in Phase 2.
Cross-service choreography (e.g. indexing via controller then searching via search_proxy). Covered in Phase 3.
Performance/load testing. Separate tool (scripts/perf/).
OpenAPI schema generation or contract verification. Tracked separately.
Stand-alone runner binary (hippothesis/runner.py). Stays a placeholder until the suite outgrows pants test.

Tenets

Run against real HTTP, not FastAPI TestClient. The existing test_import_export.py uses TestClient deliberately for in-process DDB wiring; the fuzz exercisers here target deployed services.
Shape over exhaustiveness. Strategies should generate inputs the API could plausibly receive in production, not every bit-pattern. A 5xx on input that is well-formed but semantically invalid is a real bug; a 5xx on bit-salad is not our problem in Phase 1.
One exerciser module per endpoint family. Per-module, not per-endpoint, so shared fixtures (auth, index setup) are reused.
Fail fast, fail loud. No broad except Exception. Assertion messages include the full request (method, path, body, headers sans secrets) so reproducing a failing example is a copy-paste.
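One possible shape for the "full request, headers sans secrets" assertion message is sketched below. The set of secret header names and the function names are assumptions, not the harness's actual API.

```python
import json
from typing import Any, Dict, Mapping, Optional

# Assumption: these header names are the ones that can carry secrets.
SECRET_HEADERS = frozenset({"authorization", "cookie", "x-api-key", "x-csrftoken"})


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Replace the values of secret-bearing headers, matching names case-insensitively."""
    return {
        name: ("<redacted>" if name.lower() in SECRET_HEADERS else value)
        for name, value in headers.items()
    }


def describe_request(
    method: str,
    path: str,
    body: Optional[Any],
    headers: Mapping[str, str],
) -> str:
    """Build a one-line description of a request for use in assertion messages."""
    payload = "<no body>" if body is None else json.dumps(body, sort_keys=True)
    safe_headers = json.dumps(redact_headers(headers), sort_keys=True)
    return f"{method.upper()} {path} body={payload} headers={safe_headers}"
```

An exerciser would then write `assert response.status_code < 500, describe_request(...)`, so the failing example carries everything needed to replay it.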

Architecture

components/hippothesis/hippothesis/
├── config.py                       # env → Config (exists)
├── client.py                       # legacy TargetClient (exists, kept for hello)
├── clients/                        # NEW: per-ingress typed HTTP clients
│   ├── base.py                     # shared auth, request, error handling
│   ├── search_proxy.py
│   ├── controller.py
│   └── admin_lambda.py
├── strategies/
│   ├── common.py                   # NEW: account ids, index names, pagination
│   ├── queries.py                  # exists
│   ├── search_proxy_requests.py    # NEW
│   ├── controller_requests.py      # NEW
│   └── admin_lambda_requests.py    # NEW
└── tests/
    ├── conftest.py                 # shared fixtures (exists, expanded)
    ├── import_export/              # in-process admin_lambda tests (exists)
    └── exercisers/                 # NEW: live HTTP exercisers
        ├── conftest.py
        ├── test_search_proxy_*.py  # one file per endpoint family
        ├── test_controller_*.py
        └── test_admin_lambda_*.py

Client contract

Every client extends a shared base that owns:

Base URL resolution (via ServiceProbe).
Auth header injection (X-API-Key for search_proxy; bearer token for admin_lambda; session cookies for controller).
Timeout / retry policy (one retry on 5xx only; no retry on 4xx).
Structured request logging (method + path + status + request id).
Uniform response wrapper (ClientResponse) exposing status code, parsed JSON or raw text, and response headers.

Concrete clients expose one method per endpoint, typed with TypedDict or Pydantic for the request/response where models exist. Clients never raise_for_status() — exercisers own the assertions.
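As a rough illustration of the contract above, the sketch below shows a ClientResponse wrapper and the retry-once-on-5xx policy. The field names, the send function, and the injectable transport callable are assumptions made so the example runs without a network; a real base client would wrap an HTTP library instead.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical transport signature: (method, path, body) -> (status, text, headers).
RawResponse = Tuple[int, str, Dict[str, str]]
Transport = Callable[[str, str, Optional[dict]], RawResponse]


@dataclass(frozen=True)
class ClientResponse:
    status_code: int
    text: str
    headers: Dict[str, str] = field(default_factory=dict)

    @property
    def json(self) -> Optional[Any]:
        """Parsed JSON body, or None when the body is not valid JSON."""
        try:
            return json.loads(self.text)
        except ValueError:
            return None


def send(
    transport: Transport,
    method: str,
    path: str,
    body: Optional[dict] = None,
) -> ClientResponse:
    """Send one request, retrying once on 5xx and never on 4xx.

    The response is returned whatever its status; the caller owns the assertions.
    """
    response = ClientResponse(*transport(method, path, body))
    if response.status_code >= 500:
        response = ClientResponse(*transport(method, path, body))
    return response
```

Returning the wrapper instead of raising keeps the 5xx check in the exerciser, where the failing Hypothesis example and request context are available.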

Strategy contract

Every strategy module exports @st.composite functions named after the request model they build (e.g. search_request(), query_config_create_request()).
Strategies compose via common.py primitives: account_id(), index_name(), document_id(), pagination_cursor(), etc.
Where a Pydantic model already exists (e.g. admin_lambda request models) the strategy draws sub-fields and returns the hydrated model — not a raw dict — so exercisers get type-checked inputs.
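A minimal sketch of how such strategies might compose is shown below. The SearchRequest dataclass is a stand-in for a real Pydantic request model, and its fields, the value ranges, and the cursor alphabet are all hypothetical; only the @st.composite pattern, the primitive names, and the hpt-<uuid> naming come from this plan.

```python
from dataclasses import dataclass
from typing import Optional

from hypothesis import strategies as st


@dataclass(frozen=True)
class SearchRequest:
    """Stand-in for a real request model (hypothetical fields)."""

    index_name: str
    q: str
    limit: int
    cursor: Optional[str]


@st.composite
def index_name(draw) -> str:
    # Test index names carry the hpt-<uuid> prefix so bulk cleanup stays simple.
    return f"hpt-{draw(st.uuids())}"


@st.composite
def pagination_cursor(draw) -> Optional[str]:
    # Assumption: cursors are opaque hex-like tokens, or absent on the first page.
    token = st.text(alphabet="abcdef0123456789", min_size=8, max_size=32)
    return draw(st.none() | token)


@st.composite
def search_request(draw) -> SearchRequest:
    # Draw each sub-field and return the hydrated model rather than a raw dict.
    return SearchRequest(
        index_name=draw(index_name()),
        q=draw(st.text(min_size=1, max_size=64)),
        limit=draw(st.integers(min_value=1, max_value=100)),
        cursor=draw(pagination_cursor()),
    )
```

Because search_request() returns a typed object, a mismatch between the strategy and the model shows up when the model is constructed, before any HTTP call is made.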

Exerciser contract

Each exerciser test:

Takes the matching client as a fixture.
Is decorated with @given(request=<strategy>) and @settings(max_examples=N, deadline=None, ...).
Skips early if the service isn't configured.
Calls the client method and asserts:
- response.status_code < 500 (hard failure on 5xx).
- Response is valid JSON (or valid for documented content-type).
- Response parses into the expected response model (when one exists).
- Endpoint-specific invariants (e.g. pagination cursor echoes, ids in response match request).
Cleans up any state it created (delete-after-create patterns).

max_examples defaults to 10 per endpoint; stable endpoints can raise it. Expensive endpoints (forks, reindexing) use max_examples=1 with a fixed seed.
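The exerciser shape described above might look roughly like this. The stub client, its search method, the /search path, and the response fields are hypothetical and exist only so the sketch runs without a live service; in the suite the client would arrive as a fixture and the skip check would run first.

```python
from dataclasses import dataclass
from typing import Any, Dict

from hypothesis import given, settings, strategies as st


@dataclass(frozen=True)
class StubResponse:
    status_code: int
    body: Dict[str, Any]


class StubSearchClient:
    """In-memory stand-in for a typed client (hypothetical API)."""

    def search(self, q: str, limit: int) -> StubResponse:
        return StubResponse(200, {"hits": [], "limit": limit})


# In the real suite this would come from a pytest fixture, not a module global.
client = StubSearchClient()


@settings(max_examples=10, deadline=None)
@given(
    q=st.text(min_size=1, max_size=64),
    limit=st.integers(min_value=1, max_value=100),
)
def test_search_has_no_5xx_and_echoes_limit(q: str, limit: int) -> None:
    response = client.search(q=q, limit=limit)
    context = f"POST /search q={q!r} limit={limit}"
    # Hard failure on any 5xx.
    assert response.status_code < 500, context
    # Structural invariant: the body has the expected shape.
    assert isinstance(response.body.get("hits"), list), context
    # Endpoint-specific invariant: the response echoes the requested limit.
    assert response.body.get("limit") == limit, context
```

An expensive endpoint would use the same structure with max_examples=1, and Hypothesis's seed decorator is one way to pin the example it generates.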

Endpoint catalog (Phase 1 scope)

Counts below reflect live endpoints on main as of the epic's creation. A per-service breakdown appears in the "Decomposition" section.

Service	Framework	Endpoints	Notes
search_proxy	Hono / Cloudflare	~24	Split read (search) vs admin-proxy vs agentic
controller	Django DRF	~50	Session auth dominant; SSO/signup carve-out
admin_lambda	FastAPI	~70	Bearer token; 207 Multi-Status for batches

Auth and setup requirements:

search_proxy: needs HIPPOTHESIS_API_KEY plus an existing index (HIPPOTHESIS_SYSTEM_ACCOUNT_ID, HIPPOTHESIS_INDEX_NAME). The hello probe uses /api/v1/account, which is already configured.
controller: needs a session cookie, plus a CSRF token for state-changing requests. For Hippodrome this means wiring a service-account login flow (new work, part of the controller sub-bead).
admin_lambda: needs an admin bearer token. In Hippodrome this is seeded via the dev auth fixture; in dev/staging it's a scoped secret.

Categories that are auth-only and unsafe to fuzz blindly (password reset, SSO start, signup) are marked category: auth-sensitive and receive read-only or shape-only exercisers (assert error codes; don't POST credentials).

Decomposition into sub-beads

The epic is deliberately broken into service-sized slices. Each slice is independently shippable and landable.

Bead title	Scope
`hippothesis: client base + shared strategies`	Phase 1 foundation: `clients/base.py`, `strategies/common.py`, exerciser conftest + harness. (This PR.)
`hippothesis: search_proxy client + exercisers`	Typed client, strategies, one exerciser per endpoint family (search, recommendations, agentic, admin-proxy).
`hippothesis: controller client + exercisers`	Session-auth client, strategies, exercisers for account / indexes / merchandising / search / api_keys.
`hippothesis: admin_lambda client + exercisers`	Bearer-auth client, strategies, exercisers for index_settings / query_configs / forks / sync_jobs / monitoring / export-import / onboarding / analytics / profiles / aliases / pixels.
`hippothesis: CI dashboard + Slack alert`	Push suite results to Hippothesis Grafana board; alert on recurring failures.

Each sub-bead should reference this plan in its description. Each should target ≤400 LOC net added.

Risk & open questions

Data pollution. Exercisers that create resources (query configs, forks, synonyms) will pollute dev/staging DDB. Mitigation: every created resource is cleaned up in a finally block, and test index names use a hpt-<uuid> prefix so bulk-delete is trivial.
Rate limits. search_proxy and the workflow-driven endpoints have per-account rate limits. Mitigation: exercisers use low max_examples, and the suite respects Retry-After when present.
Flaky assertions on eventual consistency. Searching for just-created documents requires polling (see existing _wait_for_search_hit helper). Mitigation: the pattern is extracted into a shared helper; new-document invariants gate behind explicit waits.
Auth drift. If an ingress's auth mechanism changes (e.g. cookie → token), every exerciser for that service breaks at once. Mitigation: auth lives exclusively in the per-service client base; tests never touch headers directly.
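The data-pollution mitigation above (cleanup in a finally block plus an hpt-<uuid> name) could be packaged as a small context manager, sketched here. The create and delete callables are placeholders for whatever client methods an exerciser actually uses.

```python
import uuid
from contextlib import contextmanager
from typing import Callable, Iterator


def new_test_name() -> str:
    """Return a unique resource name with the hpt- prefix used for bulk cleanup."""
    return f"hpt-{uuid.uuid4()}"


@contextmanager
def temporary_resource(
    create: Callable[[str], object],
    delete: Callable[[str], object],
) -> Iterator[str]:
    """Create a uniquely named resource and delete it even if the test body fails."""
    name = new_test_name()
    create(name)
    try:
        yield name
    finally:
        # Runs on success, assertion failure, and any other exception.
        delete(name)
```

An exerciser would wrap its body in `with temporary_resource(create_fn, delete_fn) as name:` so a failing assertion still triggers the delete call.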

Success criteria for Phase 1

Every endpoint listed in the catalog has one passing exerciser in the local profile (Hippodrome) and one registered-but-allowed-to-skip exerciser in the dev/staging profiles.
The Hippothesis Dispatch workflow (.github/workflows/hippothesis_dispatch.yaml) runs the full suite without timing out (<15 min wall clock).
A failing example produces a reproducer line (method + path + body) that a human can paste into curl.
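The reproducer line in the last criterion could be generated along these lines. The function name and the base URL in the usage note are assumptions; the caller is expected to pass headers that have already had secrets removed.

```python
import json
import shlex
from typing import Any, List, Mapping, Optional


def curl_reproducer(
    base_url: str,
    method: str,
    path: str,
    body: Optional[Any] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> str:
    """Render a request as a shell-safe curl command line."""
    parts: List[str] = ["curl", "-X", method.upper(), base_url.rstrip("/") + path]
    for name, value in (headers or {}).items():
        parts += ["-H", f"{name}: {value}"]
    if body is not None:
        parts += ["-H", "Content-Type: application/json"]
        parts += ["--data", json.dumps(body, sort_keys=True)]
    # shlex.quote keeps quotes and spaces in the body intact when pasted into a shell.
    return " ".join(shlex.quote(part) for part in parts)
```

For example, `curl_reproducer("http://localhost:8787", "post", "/search", {"q": "red shoes"})` yields a single line that can be pasted into a terminal (the host and path here are placeholders).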

References

Existing component docs: components/hippothesis
Existing exerciser (pattern): components/hippothesis/hippothesis/tests/test_customer_documents.py
Existing in-process test (contrast): components/hippothesis/hippothesis/tests/import_export/test_import_export.py
Hippodrome runbook: components/hippodrome/AGENTS.md
CI entrypoint: .github/workflows/hippothesis_dispatch.yaml
