Actions & Remediation

Status: PHASE 1 IMPLEMENTED. Schema migrations (021-023), the suggestion_engine collector (9 detectors: ports of all 8 legacy detectors plus a new gp2→gp3 detector), the /api/v1/actions and /api/v1/actions/summary endpoints, and the Actions UI page are implemented. Phase 2 (preview/execute flow, savings measurement, Action Lambda) is not yet built.

Polo is the primary interface for executing cost-saving actions on Marqo Cloud infrastructure. When it identifies waste, it presents a concrete action that the user can execute with a single click (destructive actions add a confirmation step, as described under Trust Model).

Trust Model

  1. Every action has a dry-run. Preview shows exactly which API calls will be made.
  2. Every action is audited. Who clicked Go, when, and what happened are recorded in action_log.
  3. Destructive actions require confirmation. Two-step: Review → Confirm.
  4. Actions are scoped to a known safe set. Pre-defined action types, not arbitrary AWS commands.
  5. Savings are measured conservatively. No assumption of perpetual savings.

Supported Action Types

Type | AWS API | Reversible? | Risk
terminate_instance | ec2.terminate_instances | No | High
stop_instance | ec2.stop_instances | Yes | Medium
delete_volume | ec2.delete_volume | No | High
delete_snapshot | ec2.delete_snapshot | No | Medium
resize_instance | ec2.modify_instance_attribute | Yes | Medium
modify_volume | ec2.modify_volume | Partial | Medium
release_eip | ec2.release_address | No | Low
delete_nat_gateway | ec2.delete_nat_gateway | No | High
stop_sagemaker_notebook | sagemaker.stop_notebook_instance | Yes | Low
delete_s3_objects | s3.delete_objects | No | High
delete_log_group | logs.delete_log_group | No | Medium

Safety Constraints

Hard-coded in the Action Lambda, enforced before every action:

  • Protected tags: Resources with polo:protected or do-not-delete tags are never touched.
  • Production restriction: terminate_instance, delete_volume, delete_s3_objects blocked for marqo_env='prod'.
  • Minimum age: Instances must be 24h+ old, snapshots 7d+ old (prevents race with provisioning).
  • Batch limits: Max 50 snapshots or 1000 S3 objects per action.
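The four constraints above could be expressed as a single pre-flight check. The following is a minimal sketch, not the Action Lambda's actual code (which is not yet built): the `Resource` dataclass, the `check_action` function, and the choice to treat `polo:protected` and `do-not-delete` as tag keys are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Values taken from the constraints listed above.
PROTECTED_TAGS = {"polo:protected", "do-not-delete"}
PROD_BLOCKED_ACTIONS = {"terminate_instance", "delete_volume", "delete_s3_objects"}
MIN_AGE = {"instance": timedelta(hours=24), "snapshot": timedelta(days=7)}
BATCH_LIMITS = {"delete_snapshot": 50, "delete_s3_objects": 1000}


@dataclass
class Resource:
    """Hypothetical view of a target resource; field names are assumptions."""
    resource_type: str            # e.g. "instance", "snapshot", "volume"
    created_at: datetime
    tags: dict = field(default_factory=dict)
    marqo_env: str = ""


def check_action(action_type, resources, now=None):
    """Return a list of violation messages; an empty list means allowed."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Batch limit applies to the action as a whole.
    limit = BATCH_LIMITS.get(action_type)
    if limit is not None and len(resources) > limit:
        violations.append(f"batch of {len(resources)} exceeds limit of {limit}")

    for r in resources:
        # Protected tags (assumed to be matched on tag key).
        if PROTECTED_TAGS & set(r.tags):
            violations.append("resource carries a protected tag")
        # Production restriction for the three blocked action types.
        if action_type in PROD_BLOCKED_ACTIONS and r.marqo_env == "prod":
            violations.append(f"{action_type} is blocked for marqo_env='prod'")
        # Minimum age, only for resource types that define one.
        min_age = MIN_AGE.get(r.resource_type)
        if min_age is not None and now - r.created_at < min_age:
            violations.append(f"{r.resource_type} is younger than {min_age}")
    return violations
```

Returning every violation rather than stopping at the first one lets a preview show the user all the reasons an action was refused.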

Execution Architecture

The Cloudflare Worker does not have AWS write credentials. It invokes the Action Lambda:

React SPA → Worker (POST /api/actions/preview) → Lambda (DryRun=true) → AWS → preview result
React SPA → Worker (POST /api/actions/execute) → Lambda (real) → AWS → action_log updated

The Action Lambda has its own IAM role (polo-action-role) scoped to exactly the 11 supported actions.
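As an illustration of what "scoped to exactly the 11 supported actions" might look like, the sketch below maps each action type to the IAM permission that authorizes the corresponding AWS API call and builds a policy document from that mapping. The `build_policy` helper, the statement `Sid`, and the wildcard `Resource` are assumptions; the real polo-action-role may restrict resources further and may need additional read permissions not shown here.

```python
import json

# Action type -> IAM permission for the AWS API listed under
# Supported Action Types. Note that the S3 DeleteObjects API is
# authorized by the s3:DeleteObject permission.
ACTION_IAM_PERMISSIONS = {
    "terminate_instance": "ec2:TerminateInstances",
    "stop_instance": "ec2:StopInstances",
    "delete_volume": "ec2:DeleteVolume",
    "delete_snapshot": "ec2:DeleteSnapshot",
    "resize_instance": "ec2:ModifyInstanceAttribute",
    "modify_volume": "ec2:ModifyVolume",
    "release_eip": "ec2:ReleaseAddress",
    "delete_nat_gateway": "ec2:DeleteNatGateway",
    "stop_sagemaker_notebook": "sagemaker:StopNotebookInstance",
    "delete_s3_objects": "s3:DeleteObject",
    "delete_log_group": "logs:DeleteLogGroup",
}


def build_policy(resource="*"):
    """Build an allow-list policy covering only the supported actions.

    `resource="*"` is a placeholder assumption, not a recommendation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PoloSupportedActions",  # hypothetical statement id
                "Effect": "Allow",
                "Action": sorted(ACTION_IAM_PERMISSIONS.values()),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```

Deriving the policy from the same mapping that defines the action types keeps the role and the supported set from drifting apart.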

Savings Measurement

The counterfactual problem

Deleting a test instance that would have been torn down tomorrow saves 1 day, not infinity.

Savings basis categories

Basis | Attribution window | Example
orphaned_resource | 30 days (conservative; probably would have lingered) | Unattached EBS volume
rightsizing | Measured continuously from actual post-resize cost data | m5.2xlarge → m5.large
scheduled_teardown | Time acceleration only (days saved vs expected teardown) | Stopping test 1 day early
policy_violation | Duration of violation + 30 days | SageMaker running 3 weeks vs 8-hour rule
stale_data | 30 days (these genuinely would persist indefinitely) | Old snapshots, logs

Calculation

  • Fixed-window: total_savings = cost_before_daily × savings_window_days
  • Rightsizing: A daily job accumulates (cost_before - actual_cost) until the resource is terminated or resized again
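The two calculations can be sketched as follows. The function names and the sample costs are illustrative assumptions; only the formulas come from the bullets above.

```python
def fixed_window_savings(cost_before_daily, savings_window_days):
    """total_savings = cost_before_daily x savings_window_days."""
    return cost_before_daily * savings_window_days


def rightsizing_savings(cost_before_daily, actual_daily_costs):
    """Sum of (cost_before - actual_cost) over each measured day.

    `actual_daily_costs` holds one entry per day since the resize, up to
    the day the resource is terminated or resized again.
    """
    return sum(cost_before_daily - actual for actual in actual_daily_costs)


# An orphaned volume costing 2.50 per day, attributed over a 30-day window.
orphaned = fixed_window_savings(2.5, 30)

# A test instance costing 12.00 per day that would have been torn down
# tomorrow anyway: only 1 day is attributed.
early_teardown = fixed_window_savings(12.0, 1)

# An instance that cost 8.00 per day before a resize, measured over 3 days.
resized = rightsizing_savings(8.0, [2.0, 2.5, 2.0])
```

With these inputs, the orphaned volume is credited 75.0, the early teardown 12.0, and the resize 17.5 so far, with the last figure growing each day the daily job runs.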

Suggestion Engine

Detectors run daily at 08:00 UTC, query resource_snapshots, and write rows to action_suggestions with status='pending'. Resources tagged polo:protected or do-not-delete are excluded from all terminate/delete detectors.

Detector | Module | What it finds | Suggested action | Severity
unattached_volumes | detectors/unattached_volumes.py | EBS volume in available state for 7+ days | delete_volume | high
unassociated_eips | detectors/unassociated_eips.py | EIP not associated with any network interface | release_eip | low
idle_nat_gateways | detectors/idle_nat_gateways.py | NAT gateway with zero bytes processed for 7+ days | delete_nat_gateway | high
running_notebooks | detectors/running_notebooks.py | SageMaker notebook in InService state | stop_sagemaker_notebook | medium
stopped_instances | detectors/stopped_instances.py | EC2 instance in stopped state for 7+ days | terminate_instance | medium
gp2_volumes | detectors/gp2_volumes.py | EBS gp2 volume (gp3 is always cheaper) | modify_volume | low
no_role_instances | detectors/no_role_instances.py | Running EC2 instance with no Polo role classification (migration 018 column, not IAM); indicates untagged or orphaned resources | terminate_instance | high
orphaned_instances | detectors/orphaned_instances.py | Running instance whose marqo_index no longer exists in hierarchy_nodes (HA suffix variants stripped before comparison) | terminate_instance | high
persistent_dev_indexes | detectors/persistent_dev_indexes.py | Instance with audience='development' running for 3+ hours and role != 'monitoring'; dev instances should be ephemeral | terminate_instance | medium

Stale suggestions (pending for 7+ days) are set to status='expired' at the start of each run, before new ones are inserted. Duplicates (a suggestion for the same resource_arn is already pending) are skipped.
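The expire-then-insert ordering can be sketched in memory as shown below. This is an illustration under assumptions, not the suggestion_engine's code: the `Suggestion` dataclass, the `created_at` field, and `run_cycle` are hypothetical, and a real run would operate on the action_suggestions table rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)


@dataclass
class Suggestion:
    """Hypothetical in-memory stand-in for an action_suggestions row."""
    resource_arn: str
    action_type: str
    created_at: datetime
    status: str = "pending"


def run_cycle(existing, detected, now):
    """Expire stale pending rows, then insert non-duplicate detections.

    Mutates `existing` in place and returns the newly inserted rows.
    """
    # Step 1: expire anything that has been pending for 7+ days.
    for s in existing:
        if s.status == "pending" and now - s.created_at >= STALE_AFTER:
            s.status = "expired"

    # Step 2: skip detections whose resource already has a pending row.
    pending_arns = {s.resource_arn for s in existing if s.status == "pending"}
    inserted = []
    for s in detected:
        if s.resource_arn in pending_arns:
            continue
        pending_arns.add(s.resource_arn)
        inserted.append(s)

    existing.extend(inserted)
    return inserted
```

Because expiry runs first, a resource whose old suggestion has just expired is no longer counted as a duplicate, so it receives a fresh pending suggestion in the same run if the detector still flags it.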

EC2 rightsizing detectors are deprioritised while the infra team reshuffles architecture.

UI Flow

Suggestions are presented ranked by estimated daily saving. Each card shows: resource, reason, evidence, estimated saving, and Preview / Dismiss / Go buttons. The Savings Dashboard shows cumulative savings over time, by category, and by person.

Key Tables

  • action_types — Registry of supported actions (seeded by migration)
  • action_suggestions — Pending suggestions with status tracking
  • action_log — Full audit trail with before/after state and measured savings