Dump index contents to S3 (vespa-visit)

Context

You want to dump all the Vespa docs from a Marqo index. Typical reasons include:

  • reindexing
  • bulk patching
  • data analysis across all fields
  • copying the contents of one index to another

Process

  1. Get the system account ID and index name of the index you want to dump from (Polo may help with this).
  2. Go to the Vespa Dump Docs GitHub Action in the cloud_data_plane repo, and click “Run workflow”:

    • Set the environment to prod-cell-1 if the index is in prod (usually the case).
    • Enter the system account ID and index name.
    • The “Job ID” is the suffix appended to the output file name. If you leave it blank, a timestamp is used. You can enter something more memorable if you prefer.

  3. If the index is in the prod environment, ask a member of the Data Plane team to approve the workflow (on #cloud-data-plane-team).

  4. Wait until the workflow is complete (about 1 minute for a small index, and up to 10-20 minutes for a larger one). Find the output docs in S3 at: s3://bastionfilesbucket-prod-cell-1/_vespa_visit/{system_account_id}-{index_name}-{job_id}.jsonl
  5. Download the files and do whatever you like with them!
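
As a rough illustration of steps 4 and 5, the sketch below builds the output location from the pattern above and reads a downloaded dump line by line. The function names `dump_s3_uri` and `iter_docs` are hypothetical helpers, not part of any Marqo tooling, and the code assumes each line of the .jsonl file is a single JSON object (the fields inside each doc are not specified here, so none are assumed).

```python
import json
from typing import Iterator

# Bucket and prefix copied from the path pattern in step 4.
BUCKET = "bastionfilesbucket-prod-cell-1"
PREFIX = "_vespa_visit"


def dump_s3_uri(system_account_id: str, index_name: str, job_id: str) -> str:
    """Build the S3 URI of a dump, following the pattern in step 4."""
    return f"s3://{BUCKET}/{PREFIX}/{system_account_id}-{index_name}-{job_id}.jsonl"


def iter_docs(path: str) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a local .jsonl file.

    Assumption: the dump holds one JSON document per line.
    Reading lazily avoids loading a large dump into memory at once.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

One possible way to use it: print the URI with `dump_s3_uri(...)`, copy the file down with `aws s3 cp <uri> .` (this needs credentials that can read the bucket), then loop over `iter_docs("<local file>")` to count or inspect the docs.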