Hippothesis: Full API Fuzz Testing
Status: Phase 1 foundation
Component: components/hippothesis
Tracking epic: dev-l29 (bd)
Problem
Hippothesis currently runs as a thin smoke-test probe: a handful of "hello"
checks against configured base URLs, plus a single property-based search
round-trip against search_proxy. The three ingress APIs that sit in front of
Marqo Cloud each expose many more endpoints, and regressions in any of them
are only caught post-deploy by the ecom E2E suite (which is narrow) or by
customers. We want a single, environment-agnostic fuzz harness that
exercises every public endpoint of:
- search_proxy: Cloudflare Worker (Hono). Search, recommendations, agentic search, admin proxy.
- controller: Django DRF. Account, indexes, merchandising, search, API keys.
- admin_lambda: FastAPI. Index settings, query configs, exports/imports, forks, sync jobs, monitoring, onboarding, analytics.
The target is Hippodrome (locally and in CI) and the shared dev/staging clusters. "Fuzz" here means Hypothesis-driven property tests: strategies generate realistic-but-varied inputs, exercisers drive HTTP clients, and assertions check structural invariants rather than exact values.
Goals
- G1: Every endpoint of every ingress API has at least one Hypothesis exerciser that sends shaped-but-randomised input and asserts no-5xx + structural response invariants.
- G2: HTTP clients for each ingress API are typed, composable, and reusable across exercisers. Auth, base URL selection, and retries live behind the client; exercisers don't touch wiring.
- G3: Strategies are modular, composable, and aligned with the Pydantic/TS request models so failing examples surface real model mismatches.
- G4: The full suite runs against Hippodrome in CI (local profile) and opt-in against dev/staging via the existing HIPPOTHESIS_ENV + per-service overrides.
- G5: Exercisers degrade gracefully when a service isn't reachable or auth isn't available: pytest.skip by default, pytest.fail when HIPPOTHESIS_FAIL_ON_UNREACHABLE=true.
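The skip-or-fail behaviour in G5 could be sketched as a small helper. This is a minimal sketch, not the harness's actual code: the name skip_or_fail is hypothetical, and only the HIPPOTHESIS_FAIL_ON_UNREACHABLE variable comes from this plan.

```python
import os

import pytest


def skip_or_fail(reason: str) -> None:
    """Skip the current test, or fail it when strict mode is switched on.

    Hypothetical helper: only the environment variable name is taken from
    the plan; the parsing rule ("true", case-insensitive) is an assumption.
    """
    strict = os.environ.get("HIPPOTHESIS_FAIL_ON_UNREACHABLE", "").strip().lower() == "true"
    if strict:
        pytest.fail(reason)  # raises, so the skip below is never reached
    pytest.skip(reason)
```

An exerciser fixture would call this once, before any example is generated, when the base URL or credentials for its service are missing.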
Non-goals (Phase 1)
- Stateful fuzzing (Hypothesis RuleBasedStateMachine). Covered in Phase 2.
- Cross-service choreography (e.g. indexing via controller then searching via search_proxy). Covered in Phase 3.
- Performance/load testing. Separate tool (scripts/perf/).
- OpenAPI schema generation or contract verification. Tracked separately.
- Stand-alone runner binary (hippothesis/runner.py). Stays a placeholder until the suite outgrows pants test.
Tenets
- Run against real HTTP, not FastAPI TestClient. The existing test_import_export.py uses TestClient deliberately for in-process DDB wiring; the fuzz exercisers here target deployed services.
- Shape over exhaustiveness. Strategies should generate inputs the API could plausibly receive in production, not every bit-pattern. A 5xx on malformed-but-valid input is a real bug; a 5xx on bit-salad is not our problem in Phase 1.
- One exerciser module per endpoint family. Per-module, not per-endpoint, so shared fixtures (auth, index setup) are reused.
- Fail fast, fail loud. No broad except Exception. Assertion messages include the full request (method, path, body, headers sans secrets) so reproducing a failing example is a copy-paste.
Architecture
components/hippothesis/hippothesis/
├── config.py # env → Config (exists)
├── client.py # legacy TargetClient (exists, kept for hello)
├── clients/ # NEW: per-ingress typed HTTP clients
│ ├── base.py # shared auth, request, error handling
│ ├── search_proxy.py
│ ├── controller.py
│ └── admin_lambda.py
├── strategies/
│ ├── common.py # NEW: account ids, index names, pagination
│ ├── queries.py # exists
│ ├── search_proxy_requests.py # NEW
│ ├── controller_requests.py # NEW
│ └── admin_lambda_requests.py # NEW
└── tests/
├── conftest.py # shared fixtures (exists, expanded)
├── import_export/ # in-process admin_lambda tests (exists)
└── exercisers/ # NEW: live HTTP exercisers
├── conftest.py
├── test_search_proxy_*.py # one file per endpoint family
├── test_controller_*.py
└── test_admin_lambda_*.py
Client contract
Every client extends a shared base that owns:
- Base URL resolution (via ServiceProbe).
- Auth header injection (X-API-Key for search_proxy; bearer token for admin_lambda; session cookies for controller).
- Timeout / retry policy (one retry on 5xx only; no retry on 4xx).
- Structured request logging (method + path + status + request id).
- Uniform response wrapper (ClientResponse) exposing status code, parsed JSON or raw text, and response headers.
Concrete clients expose one method per endpoint, typed with TypedDict or
Pydantic for the request/response where models exist. Clients never call
raise_for_status(); exercisers own the assertions.
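A rough sketch of how the shared base might look, under stated assumptions: the BaseClient class, the Transport callable, and the urllib-based default transport are illustrative choices rather than the planned implementation; only ClientResponse and the retry rule (one retry on 5xx, none on 4xx) are taken from the contract above. Logging and ServiceProbe resolution are left out for brevity.

```python
import json
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional, Tuple

# (method, url, headers, body) -> (status, response headers, raw body)
Transport = Callable[[str, str, Mapping[str, str], Optional[bytes]], Tuple[int, Dict[str, str], bytes]]


@dataclass
class ClientResponse:
    status_code: int
    headers: Dict[str, str]
    text: str

    def json(self) -> Any:
        return json.loads(self.text)


def urllib_transport(method, url, headers, body):
    """Default transport (assumption: stdlib urllib, 10 s timeout)."""
    req = urllib.request.Request(url, data=body, headers=dict(headers), method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, dict(resp.headers), resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx are data for the exerciser to assert on, not exceptions.
        return err.code, dict(err.headers), err.read()


class BaseClient:
    def __init__(self, base_url: str, auth_headers: Mapping[str, str],
                 transport: Transport = urllib_transport) -> None:
        self.base_url = base_url.rstrip("/")
        self.auth_headers = dict(auth_headers)
        self.transport = transport

    def request(self, method: str, path: str, body: Any = None) -> ClientResponse:
        headers = {"Content-Type": "application/json", **self.auth_headers}
        payload = None if body is None else json.dumps(body).encode("utf-8")
        for _attempt in range(2):  # first try plus at most one retry
            status, resp_headers, raw = self.transport(method, self.base_url + path, headers, payload)
            if status < 500:  # 2xx-4xx are never retried
                break
        return ClientResponse(status, resp_headers, raw.decode("utf-8", errors="replace"))
```

Injecting the transport keeps the retry logic testable without a live service; a concrete client such as the search_proxy one would subclass this and add one typed method per endpoint.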
Strategy contract
- Every strategy module exports @st.composite functions named after the request model they build (e.g. search_request(), query_config_create_request()).
- Strategies compose via common.py primitives: account_id(), index_name(), document_id(), pagination_cursor(), etc.
- Where a Pydantic model already exists (e.g. admin_lambda request models) the strategy draws sub-fields and returns the hydrated model, not a raw dict, so exercisers get type-checked inputs.
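To make the contract concrete, here is a minimal sketch of a composite strategy built from common.py-style primitives. The SearchRequest dataclass and its fields are hypothetical stand-ins for a real request model (a dataclass is used instead of Pydantic only to keep the sketch dependency-light), and the value ranges are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

from hypothesis import strategies as st


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical stand-in for a real request model."""
    index_name: str
    q: str
    limit: int
    cursor: Optional[str]


# Primitives of the kind common.py is meant to hold (shapes are assumptions).
def index_name() -> st.SearchStrategy[str]:
    return st.from_regex(r"hpt-[a-z0-9]{8}", fullmatch=True)


def pagination_cursor() -> st.SearchStrategy[Optional[str]]:
    return st.none() | st.text(alphabet="abcdef0123456789", min_size=8, max_size=32)


@st.composite
def search_request(draw) -> SearchRequest:
    # Draw each sub-field, then return the hydrated model rather than a dict.
    return SearchRequest(
        index_name=draw(index_name()),
        q=draw(st.text(min_size=1, max_size=64)),
        limit=draw(st.integers(min_value=1, max_value=100)),
        cursor=draw(pagination_cursor()),
    )
```

With a Pydantic model in place of the dataclass, construction itself would validate the drawn values, which is how a strategy that drifts from the model would surface as a failing example.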
Exerciser contract
Each exerciser test:
- Takes the matching client as a fixture.
- Is decorated with @given(request=<strategy>) and @settings(max_examples=N, deadline=None, ...).
- Skips early if the service isn't configured.
- Calls the client method and asserts:
  - response.status_code < 500 (hard failure on 5xx).
  - Response is valid JSON (or valid for the documented content type).
  - Response parses into the expected response model (when one exists).
  - Endpoint-specific invariants (e.g. pagination cursor echoes, ids in response match request).
- Cleans up any state it created (delete-after-create patterns).
max_examples defaults to 10 per endpoint; stable endpoints can raise it.
Expensive endpoints (forks, reindexing) use max_examples=1 and a fixed seed.
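The exerciser shape described above might look roughly like the following. Everything service-specific here is an assumption made so the sketch runs without a live target: FakeSearchClient, the /search path in the message, and the hits/limit response fields are invented for illustration. The client is a module-level object rather than a fixture only because Hypothesis's health checks object to function-scoped fixtures under @given.

```python
import json
from dataclasses import dataclass

from hypothesis import given, settings, strategies as st


@dataclass
class FakeResponse:
    status_code: int
    text: str


class FakeSearchClient:
    """In-memory stand-in for a typed client (hypothetical)."""

    def search(self, body: dict) -> FakeResponse:
        limit = body["limit"]
        hits = [{"_id": f"doc-{i}"} for i in range(min(limit, 3))]
        return FakeResponse(200, json.dumps({"hits": hits, "limit": limit}))


client = FakeSearchClient()

# Stand-in for a strategy from the strategies/ package.
search_bodies = st.fixed_dictionaries(
    {"q": st.text(min_size=1, max_size=64), "limit": st.integers(min_value=1, max_value=50)}
)


@settings(max_examples=10, deadline=None)
@given(request=search_bodies)
def test_search_no_5xx(request: dict) -> None:
    response = client.search(request)
    context = f"POST /search body={json.dumps(request)}"  # hypothetical path
    # Hard failure on any 5xx.
    assert response.status_code < 500, f"5xx from {context}"
    # json.loads raises on an invalid body, which fails the test loudly.
    payload = json.loads(response.text)
    # Endpoint-specific invariants: shape of hits, limit respected and echoed.
    assert isinstance(payload.get("hits"), list), f"missing hits: {context}"
    assert len(payload["hits"]) <= request["limit"], f"too many hits: {context}"
    assert payload["limit"] == request["limit"], f"limit not echoed: {context}"
```

Every assertion message carries the request context, so a shrunk failing example is readable without re-running the test.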
Endpoint catalog (Phase 1 scope)
Counts below reflect live endpoints on main as of the epic's creation.
A per-service breakdown appears in the "Decomposition" section.
| Service | Framework | Endpoints | Notes |
|---|---|---|---|
| search_proxy | Hono / Cloudflare | ~24 | Split read (search) vs admin-proxy vs agentic |
| controller | Django DRF | ~50 | Session auth dominant; SSO/signup carve-out |
| admin_lambda | FastAPI | ~70 | Bearer token; 207 Multi-Status for batches |
Auth and setup requirements:
- search_proxy: needs HIPPOTHESIS_API_KEY + an existing index (HIPPOTHESIS_SYSTEM_ACCOUNT_ID, HIPPOTHESIS_INDEX_NAME). The hello probe uses /api/v1/account, which is already configured.
- controller: needs session cookie or CSRF token. For Hippodrome this means wiring a service-account login flow (new work, part of the controller sub-bead).
- admin_lambda: needs admin bearer token. In Hippodrome this is seeded via the dev auth fixture; in dev/staging it's a scoped secret.
Categories that are auth-only and unsafe to fuzz blindly (password reset,
SSO start, signup) are marked category: auth-sensitive and receive
read-only or shape-only exercisers (assert error codes; don't POST
credentials).
Decomposition into sub-beads
The epic is deliberately broken into service-sized slices. Each slice is independently shippable and landable.
| Bead title | Scope |
|---|---|
| hippothesis: client base + shared strategies | Phase 1 foundation: clients/base.py, strategies/common.py, exerciser conftest + harness. (This PR.) |
| hippothesis: search_proxy client + exercisers | Typed client, strategies, one exerciser per endpoint family (search, recommendations, agentic, admin-proxy). |
| hippothesis: controller client + exercisers | Session-auth client, strategies, exercisers for account / indexes / merchandising / search / api_keys. |
| hippothesis: admin_lambda client + exercisers | Bearer-auth client, strategies, exercisers for index_settings / query_configs / forks / sync_jobs / monitoring / export-import / onboarding / analytics / profiles / aliases / pixels. |
| hippothesis: CI dashboard + Slack alert | Push suite results to Hippothesis Grafana board; alert on recurring failures. |
Each sub-bead should reference this plan in its description. Each should target ≤400 LOC net added.
Risk & open questions
- Data pollution. Exercisers that create resources (query configs, forks, synonyms) will pollute dev/staging DDB. Mitigation: every created resource is cleaned up in a finally block, and test index names use an hpt-<uuid> prefix so bulk-delete is trivial.
- Rate limits. search_proxy and the workflow-driven endpoints have per-account rate limits. Mitigation: exercisers use low max_examples, and the suite respects Retry-After when present.
- Flaky assertions on eventual consistency. Searching for just-created documents requires polling (see existing _wait_for_search_hit helper). Mitigation: the pattern is extracted into a shared helper; new-document invariants gate behind explicit waits.
- Auth drift. If an ingress's auth mechanism changes (e.g. cookie → token), every exerciser for that service breaks at once. Mitigation: auth lives exclusively in the per-service client base; tests never touch headers directly.
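The data-pollution mitigation (hpt-prefixed names plus cleanup in a finally block) could be packaged as a context manager. This is a sketch under assumptions: temp_name and temporary_resource are hypothetical names, and the create/delete callables stand in for whatever client methods a given exerciser uses.

```python
import uuid
from contextlib import contextmanager
from typing import Callable, Iterator


def temp_name(prefix: str = "hpt") -> str:
    """Build a throwaway resource name such as hpt-3f2a... (prefix from the plan)."""
    return f"{prefix}-{uuid.uuid4().hex}"


@contextmanager
def temporary_resource(
    create: Callable[[str], None],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create a resource, hand its name to the caller, and always delete it."""
    name = temp_name()
    create(name)  # if creation itself raises, there is nothing to clean up
    try:
        yield name
    finally:
        # Runs on success, on assertion failure, and on any other exception.
        delete(name)
```

Because the delete call sits in the finally block, a failing assertion inside the with-body still removes the resource; anything that leaks anyway (for example after a killed process) remains findable by its hpt- prefix.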
Success criteria for Phase 1
- Every endpoint listed in the catalog has one passing exerciser in the local profile (Hippodrome) and one registered-but-allowed-to-skip exerciser in the dev/staging profiles.
- The Hippothesis Dispatch workflow (.github/workflows/hippothesis_dispatch.yaml) runs the full suite without timing out (<15 min wall clock).
- A failing example produces a reproducer line (method + path + body) that a human can paste into curl.
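One possible way to produce the reproducer line is sketched below. The function name curl_reproducer and the set of header names treated as secret are assumptions; the plan only states that secrets are omitted and that the line should be pasteable into a shell.

```python
import json
import shlex
from typing import Any, Mapping

# Header names redacted in reproducers (assumed list, compared case-insensitively).
SECRET_HEADERS = {"authorization", "cookie", "x-api-key", "x-csrftoken"}


def curl_reproducer(method: str, url: str, headers: Mapping[str, str], body: Any = None) -> str:
    """Render a request as a single shell-safe curl command."""
    parts = ["curl", "-X", method.upper(), shlex.quote(url)]
    for name, value in headers.items():
        shown = "<redacted>" if name.lower() in SECRET_HEADERS else value
        parts += ["-H", shlex.quote(f"{name}: {shown}")]
    if body is not None:
        # sort_keys keeps the line stable across runs for the same example.
        parts += ["--data", shlex.quote(json.dumps(body, sort_keys=True))]
    return " ".join(parts)
```

Quoting every argument with shlex.quote matters here because Hypothesis will happily generate bodies containing quotes, spaces, and shell metacharacters; the person reproducing the failure substitutes their own credentials for the redacted headers.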
References
- Existing component docs: components/hippothesis
- Existing exerciser (pattern): components/hippothesis/hippothesis/tests/test_customer_documents.py
- Existing in-process test (contrast): components/hippothesis/hippothesis/tests/import_export/test_import_export.py
- Hippodrome runbook: components/hippodrome/AGENTS.md
- CI entrypoint: .github/workflows/hippothesis_dispatch.yaml