Hippothesis: Full API Fuzz Testing

Status: Phase 1 foundation
Component: components/hippothesis
Tracking epic: dev-l29 (bd)

Problem

Hippothesis currently runs as a thin smoke-test probe: a handful of "hello" checks against configured base URLs, plus a single property-based search round-trip against search_proxy. The three ingress APIs that sit in front of Marqo Cloud each expose many more endpoints, and regressions in any of them are only caught post-deploy by the ecom E2E suite (which is narrow) or by customers. We want a single, environment-agnostic fuzz harness that exercises every public endpoint of:

  • search_proxy — Cloudflare Worker (Hono) — search, recommendations, agentic search, admin proxy
  • controller — Django DRF — account, indexes, merchandising, search, API keys
  • admin_lambda — FastAPI — index settings, query configs, exports/imports, forks, sync jobs, monitoring, onboarding, analytics

The target is Hippodrome (locally and in CI) and the shared dev/staging clusters. "Fuzz" here means Hypothesis-driven property tests: strategies generate realistic-but-varied inputs, exercisers drive HTTP clients, and assertions check structural invariants rather than exact values.

Goals

  • G1 — Every endpoint of every ingress API has at least one Hypothesis exerciser that sends shaped-but-randomised input and asserts no-5xx + structural response invariants.
  • G2 — HTTP clients for each ingress API are typed, composable, and reusable across exercisers. Auth, base URL selection, and retries live behind the client — exercisers don't touch wiring.
  • G3 — Strategies are modular, composable, and aligned with the Pydantic/TS request models so failing examples surface real model mismatches.
  • G4 — The full suite runs against Hippodrome in CI (local profile) and opt-in against dev / staging via the existing HIPPOTHESIS_ENV + per-service overrides.
  • G5 — Exercisers degrade gracefully when a service isn't reachable or auth isn't available: pytest.skip by default, pytest.fail when HIPPOTHESIS_FAIL_ON_UNREACHABLE=true.
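
The skip-or-fail behaviour in G5 could live in one small helper. The following is a minimal sketch, not the component's actual code: the function names are hypothetical, and treating the flag case-insensitively is an assumption. Only the HIPPOTHESIS_FAIL_ON_UNREACHABLE variable comes from this plan.

```python
import os
from typing import Mapping, Optional

import pytest


def fail_on_unreachable(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when HIPPOTHESIS_FAIL_ON_UNREACHABLE is set to "true".

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    source = os.environ if env is None else env
    return source.get("HIPPOTHESIS_FAIL_ON_UNREACHABLE", "").strip().lower() == "true"


def handle_unreachable(
    service: str, reason: str, env: Optional[Mapping[str, str]] = None
) -> None:
    """Skip by default; fail hard when the strict flag is set (hypothetical helper)."""
    message = f"{service} is not reachable: {reason}"
    if fail_on_unreachable(env):
        pytest.fail(message)
    pytest.skip(message)
```

An exerciser or fixture would call handle_unreachable("controller", "no session available") before issuing any request, so the decision is made in one place rather than per test.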

Non-goals (Phase 1)

  • Stateful fuzzing (Hypothesis RuleBasedStateMachine). Covered in Phase 2.
  • Cross-service choreography (e.g. indexing via controller then searching via search_proxy). Covered in Phase 3.
  • Performance/load testing. Separate tool (scripts/perf/).
  • OpenAPI schema generation or contract verification. Tracked separately.
  • Stand-alone runner binary (hippothesis/runner.py). Stays a placeholder until the suite outgrows pants test.

Tenets

  1. Run against real HTTP, not FastAPI TestClient. The existing test_import_export.py uses TestClient deliberately for in-process DDB wiring; the fuzz exercisers here target deployed services.
  2. Shape over exhaustiveness. Strategies should generate inputs the API could plausibly receive in production, not every bit-pattern. A 5xx on unusual-but-well-formed input is a real bug; a 5xx on bit-salad is not our problem in Phase 1.
  3. One exerciser module per endpoint family. Per-module, not per-endpoint, so shared fixtures (auth, index setup) are reused.
  4. Fail fast, fail loud. No broad except Exception. Assertion messages include the full request (method, path, body, headers sans secrets) so reproducing a failing example is a copy-paste.
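
Tenet 4 asks for assertion messages that can be pasted straight into a shell. One way to build such a message is sketched below. The function names and the set of header names treated as secrets are assumptions for illustration, not an existing helper in the component.

```python
import json
import shlex
from typing import Any, Dict, Mapping, Optional

# Assumption: these header names carry credentials and must never be printed.
SECRET_HEADERS = frozenset({"authorization", "cookie", "x-api-key", "x-csrftoken"})


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Replace the value of any secret-bearing header with a placeholder."""
    return {
        name: ("<redacted>" if name.lower() in SECRET_HEADERS else value)
        for name, value in headers.items()
    }


def curl_reproducer(
    method: str,
    url: str,
    headers: Mapping[str, str],
    body: Optional[Any] = None,
) -> str:
    """Render a request as a shell-quoted curl command with secrets removed."""
    parts = ["curl", "-X", method.upper(), shlex.quote(url)]
    for name, value in redact_headers(headers).items():
        parts += ["-H", shlex.quote(f"{name}: {value}")]
    if body is not None:
        # sort_keys keeps the output stable across runs of the same example.
        parts += ["--data", shlex.quote(json.dumps(body, sort_keys=True))]
    return " ".join(parts)
```

The person reproducing the failure substitutes their own credential for each redacted placeholder before running the command.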

Architecture

components/hippothesis/hippothesis/
├── config.py                        # env → Config (exists)
├── client.py                        # legacy TargetClient (exists, kept for hello)
├── clients/                         # NEW: per-ingress typed HTTP clients
│   ├── base.py                      # shared auth, request, error handling
│   ├── search_proxy.py
│   ├── controller.py
│   └── admin_lambda.py
├── strategies/
│   ├── common.py                    # NEW: account ids, index names, pagination
│   ├── queries.py                   # exists
│   ├── search_proxy_requests.py     # NEW
│   ├── controller_requests.py       # NEW
│   └── admin_lambda_requests.py     # NEW
└── tests/
    ├── conftest.py                  # shared fixtures (exists, expanded)
    ├── import_export/               # in-process admin_lambda tests (exists)
    └── exercisers/                  # NEW: live HTTP exercisers
        ├── conftest.py
        ├── test_search_proxy_*.py   # one file per endpoint family
        ├── test_controller_*.py
        └── test_admin_lambda_*.py

Client contract

Every client extends a shared base that owns:

  • Base URL resolution (via ServiceProbe).
  • Auth header injection (X-API-Key for search_proxy; bearer token for admin_lambda; session cookies for controller).
  • Timeout / retry policy (one retry on 5xx only; no retry on 4xx).
  • Structured request logging (method + path + status + request id).
  • Uniform response wrapper (ClientResponse) exposing status code, parsed JSON or raw text, and response headers.

Concrete clients expose one method per endpoint, typed with TypedDict or Pydantic for the request/response where models exist. Clients never raise_for_status() — exercisers own the assertions.
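
To make the contract concrete, here is a minimal sketch of a base client and one endpoint method. It assumes an injected transport callable (hypothetical) so the example runs without a network. BaseClient, Transport, and parsed_json are illustrative names. ClientResponse and the /api/v1/account path are taken from this plan, but the field layout shown is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping, Optional


@dataclass(frozen=True)
class ClientResponse:
    """Uniform wrapper: status code, raw text, and response headers."""

    status_code: int
    text: str
    headers: Mapping[str, str] = field(default_factory=dict)

    def parsed_json(self) -> Optional[Any]:
        """Return the decoded JSON body, or None when the body is not JSON."""
        try:
            return json.loads(self.text)
        except ValueError:
            return None


# Hypothetical transport signature: (method, url, headers, payload) -> response.
Transport = Callable[[str, str, Mapping[str, str], Optional[bytes]], ClientResponse]


class BaseClient:
    def __init__(
        self, base_url: str, auth_headers: Mapping[str, str], transport: Transport
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._auth_headers = dict(auth_headers)
        self._transport = transport

    def request(self, method: str, path: str, body: Optional[Any] = None) -> ClientResponse:
        """Send one request; retry once on 5xx, never on 4xx, never raise on status."""
        url = self._base_url + path
        headers = {"Accept": "application/json", **self._auth_headers}
        payload = None if body is None else json.dumps(body).encode("utf-8")
        response = self._transport(method, url, headers, payload)
        if response.status_code >= 500:
            response = self._transport(method, url, headers, payload)
        return response


class SearchProxyClient(BaseClient):
    def get_account(self) -> ClientResponse:
        return self.request("GET", "/api/v1/account")
```

Injecting the transport keeps the retry and auth logic testable in isolation. A real client would pass a thin wrapper around its HTTP library in the same position.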

Strategy contract

  • Every strategy module exports @st.composite functions named after the request model they build (e.g. search_request(), query_config_create_request()).
  • Strategies compose via common.py primitives: account_id(), index_name(), document_id(), pagination_cursor(), etc.
  • Where a Pydantic model already exists (e.g. admin_lambda request models) the strategy draws sub-fields and returns the hydrated model — not a raw dict — so exercisers get type-checked inputs.
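
A strategy that follows this contract might look like the sketch below. SearchRequest is a hypothetical dataclass standing in for a real Pydantic request model, and its fields and bounds are assumptions. The hpt-<uuid> naming follows the convention described under "Risk & open questions".

```python
from dataclasses import dataclass

from hypothesis import strategies as st


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical stand-in for a real request model."""

    index_name: str
    q: str
    limit: int
    offset: int


def index_name() -> st.SearchStrategy[str]:
    """Primitive of the kind common.py would export: a prefixed test index name."""
    return st.uuids().map(lambda value: f"hpt-{value}")


@st.composite
def search_request(draw) -> SearchRequest:
    """Draw each sub-field and return the hydrated model, not a raw dict."""
    return SearchRequest(
        index_name=draw(index_name()),
        q=draw(st.text(min_size=1, max_size=64)),
        limit=draw(st.integers(min_value=1, max_value=100)),
        offset=draw(st.integers(min_value=0, max_value=1000)),
    )
```

Because the composite returns a typed object, an exerciser that passes the wrong field to a client method is caught by the type checker rather than at request time.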

Exerciser contract

Each exerciser test:

  1. Takes the matching client as a fixture.
  2. Is decorated with @given(request=<strategy>) and @settings(max_examples=N, deadline=None, ...).
  3. Skips early if the service isn't configured.
  4. Calls the client method and asserts:
    • response.status_code < 500 (hard failure on 5xx).
    • Response is valid JSON (or valid for documented content-type).
    • Response parses into the expected response model (when one exists).
    • Endpoint-specific invariants (e.g. pagination cursor echoes, ids in response match request).
  5. Cleans up any state it created (delete-after-create patterns).

max_examples defaults to 10 per endpoint; stable endpoints can raise it. Expensive endpoints (forks, reindexing) use max_examples=1 with a fixed seed so the single example is reproducible.
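
The shape of an exerciser can be sketched as follows. StubSearchClient is a hypothetical in-memory stand-in so the example runs without a deployed service; in the real suite the client would come from a fixture. The response fields ("hits", "limit") and the seed value are assumptions for illustration.

```python
from typing import Any, Dict, Tuple

from hypothesis import given, seed, settings, strategies as st


class StubSearchClient:
    """Hypothetical stand-in for a typed client; returns (status, parsed JSON)."""

    def search(self, q: str, limit: int) -> Tuple[int, Dict[str, Any]]:
        hits = [{"_id": f"doc-{i}"} for i in range(min(limit, 3))]
        return 200, {"hits": hits, "limit": limit}


def check_search_response(
    status_code: int, payload: Dict[str, Any], q: str, limit: int
) -> None:
    """Structural invariants only; every message carries the request."""
    request_repr = f"search(q={q!r}, limit={limit})"
    assert status_code < 500, f"5xx ({status_code}) from {request_repr}"
    assert isinstance(payload.get("hits"), list), f"no hits list from {request_repr}"
    assert len(payload["hits"]) <= limit, f"more hits than limit from {request_repr}"
    assert payload.get("limit") == limit, f"limit not echoed by {request_repr}"


@settings(max_examples=10, deadline=None)
@given(
    q=st.text(min_size=1, max_size=64),
    limit=st.integers(min_value=1, max_value=100),
)
def test_search_invariants(q: str, limit: int) -> None:
    status_code, payload = StubSearchClient().search(q, limit)
    check_search_response(status_code, payload, q, limit)


# Expensive endpoints: a single example, pinned to a fixed seed.
@seed(20240101)
@settings(max_examples=1, deadline=None)
@given(limit=st.integers(min_value=1, max_value=100))
def test_expensive_endpoint_single_example(limit: int) -> None:
    status_code, payload = StubSearchClient().search("fork-me", limit)
    check_search_response(status_code, payload, "fork-me", limit)
```

If the client is supplied through a pytest fixture, a module- or session-scoped fixture avoids Hypothesis's health check for function-scoped fixtures, which are not reset between generated examples.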

Endpoint catalog (Phase 1 scope)

Counts below reflect live endpoints on main as of the epic's creation. A per-service breakdown appears in the "Decomposition" section.

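| Service      | Framework         | Endpoints | Notes                                         |
|--------------|-------------------|-----------|-----------------------------------------------|
| search_proxy | Hono / Cloudflare | ~24       | Split read (search) vs admin-proxy vs agentic |
| controller   | Django DRF        | ~50       | Session auth dominant; SSO/signup carve-out   |
| admin_lambda | FastAPI           | ~70       | Bearer token; 207 Multi-Status for batches    |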

Auth and setup requirements:

  • search_proxy — needs HIPPOTHESIS_API_KEY + an existing index (HIPPOTHESIS_SYSTEM_ACCOUNT_ID, HIPPOTHESIS_INDEX_NAME). The hello probe uses /api/v1/account which is already configured.
  • controller — needs a session cookie and, for unsafe methods (POST, PUT, PATCH, DELETE), a CSRF token. For Hippodrome this means wiring a service-account login flow (new work, part of the controller sub-bead).
  • admin_lambda — needs admin bearer token. In Hippodrome this is seeded via the dev auth fixture; in dev/staging it's a scoped secret.

Categories that are auth-only and unsafe to fuzz blindly (password reset, SSO start, signup) are marked category: auth-sensitive and receive read-only or shape-only exercisers (assert error codes; don't POST credentials).

Decomposition into sub-beads

The epic is deliberately broken into service-sized slices. Each slice can land and ship independently of the others.

| Bead title | Scope |
|------------|-------|
| hippothesis: client base + shared strategies | Phase 1 foundation: clients/base.py, strategies/common.py, exerciser conftest + harness. (This PR.) |
| hippothesis: search_proxy client + exercisers | Typed client, strategies, one exerciser per endpoint family (search, recommendations, agentic, admin-proxy). |
| hippothesis: controller client + exercisers | Session-auth client, strategies, exercisers for account / indexes / merchandising / search / api_keys. |
| hippothesis: admin_lambda client + exercisers | Bearer-auth client, strategies, exercisers for index_settings / query_configs / forks / sync_jobs / monitoring / export-import / onboarding / analytics / profiles / aliases / pixels. |
| hippothesis: CI dashboard + Slack alert | Push suite results to Hippothesis Grafana board; alert on recurring failures. |

Each sub-bead should reference this plan in its description. Each should target ≤400 LOC net added.

Risk & open questions

  • Data pollution. Exercisers that create resources (query configs, forks, synonyms) will pollute dev/staging DDB. Mitigation: every created resource is cleaned up in a finally block, and test index names use a hpt-<uuid> prefix so bulk-delete is trivial.
  • Rate limits. search_proxy and the workflow-driven endpoints have per-account rate limits. Mitigation: exercisers use low max_examples, and the suite respects Retry-After when present.
  • Flaky assertions on eventual consistency. Searching for just-created documents requires polling (see existing _wait_for_search_hit helper). Mitigation: the pattern is extracted into a shared helper; new-document invariants gate behind explicit waits.
  • Auth drift. If an ingress's auth mechanism changes (e.g. cookie → token), every exerciser for that service breaks at once. Mitigation: auth lives exclusively in the per-service client base; tests never touch headers directly.
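
Two of the mitigations above, cleanup in a finally block and explicit waits for eventual consistency, can be sketched with the standard library alone. The names below are hypothetical. In particular, wait_until is a generic illustration and not the existing _wait_for_search_hit helper.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def new_test_name() -> str:
    """Name with the hpt-<uuid> prefix so leftovers can be bulk-deleted."""
    return f"hpt-{uuid.uuid4()}"


@contextmanager
def temporary_resource(
    create: Callable[[str], None], delete: Callable[[str], None]
) -> Iterator[str]:
    """Create a resource, hand its name to the test, and always delete it."""
    name = new_test_name()
    create(name)
    try:
        yield name
    finally:
        delete(name)


def wait_until(
    probe: Callable[[], Optional[T]],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
) -> T:
    """Poll probe() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)
```

An exerciser would wrap its create and delete client calls in temporary_resource and place any new-document assertion after a wait_until call, so a slow index shows up as a timeout with a clear message rather than as a flaky invariant.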

Success criteria for Phase 1

  • Every endpoint listed in the catalog has one passing exerciser in the local profile (Hippodrome) and one registered-but-allowed-to-skip exerciser in the dev/staging profiles.
  • The Hippothesis Dispatch workflow (.github/workflows/hippothesis_dispatch.yaml) runs the full suite without timing out (<15 min wall clock).
  • A failing example produces a reproducer line (method + path + body) that a human can paste into curl.

References

  • Existing component docs: components/hippothesis
  • Existing exerciser (pattern): components/hippothesis/hippothesis/tests/test_customer_documents.py
  • Existing in-process test (contrast): components/hippothesis/hippothesis/tests/import_export/test_import_export.py
  • Hippodrome runbook: components/hippodrome/AGENTS.md
  • CI entrypoint: .github/workflows/hippothesis_dispatch.yaml