Actions & Remediation
Status: PHASE 1 IMPLEMENTED. Schema migrations (021-023), the suggestion_engine collector (9 detectors porting all 8 legacy detectors plus gp2→gp3), the /api/v1/actions and /api/v1/actions/summary endpoints, and the Actions UI page are implemented. Phase 2 (preview/execute flow, savings measurement, Action Lambda) is not yet built.
Polo is the primary interface for executing cost-saving actions on Marqo Cloud infrastructure. When it identifies waste, it presents a concrete action the user can execute with a single click.
Trust Model
- Every action has a dry-run. Preview shows exactly which API calls will be made.
- Every action is audited. Who clicked Go, when, what happened, recorded in action_log.
- Destructive actions require confirmation. Two-step: Review → Confirm.
- Actions are scoped to a known safe set. Pre-defined action types, not arbitrary AWS commands.
- Savings are measured conservatively. No assumption of perpetual savings.
Supported Action Types
| Type | AWS API | Reversible? | Risk |
|---|---|---|---|
| terminate_instance | ec2.terminate_instances | No | High |
| stop_instance | ec2.stop_instances | Yes | Medium |
| delete_volume | ec2.delete_volume | No | High |
| delete_snapshot | ec2.delete_snapshot | No | Medium |
| resize_instance | ec2.modify_instance_attribute | Yes | Medium |
| modify_volume | ec2.modify_volume | Partial | Medium |
| release_eip | ec2.release_address | No | Low |
| delete_nat_gateway | ec2.delete_nat_gateway | No | High |
| stop_sagemaker_notebook | sagemaker.stop_notebook_instance | Yes | Low |
| delete_s3_objects | s3.delete_objects | No | High |
| delete_log_group | logs.delete_log_group | No | Medium |
Safety Constraints
Hard-coded in the Action Lambda, enforced before every action:
- Protected tags: Resources with polo:protected or do-not-delete tags are never touched.
- Production restriction: terminate_instance, delete_volume, delete_s3_objects blocked for marqo_env='prod'.
- Minimum age: Instances must be 24h+ old, snapshots 7d+ old (prevents race with provisioning).
- Batch limits: Max 50 snapshots or 1000 S3 objects per action.
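The four constraints above could be expressed as a single pre-flight check. This is a minimal sketch, not the Action Lambda's actual code: the `Resource` dataclass, the `check_action` function, and the constant names are hypothetical. It also assumes that protected tags are matched by tag key and that the minimum-age rule applies per resource kind regardless of action type.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Values taken from the constraint list; names are illustrative only.
PROTECTED_TAGS = {"polo:protected", "do-not-delete"}
PROD_BLOCKED = {"terminate_instance", "delete_volume", "delete_s3_objects"}
MIN_AGE = {"instance": timedelta(hours=24), "snapshot": timedelta(days=7)}
BATCH_LIMITS = {"delete_snapshot": 50, "delete_s3_objects": 1000}


@dataclass
class Resource:
    kind: str                  # assumed values: "instance", "snapshot", "volume", ...
    created_at: datetime
    tags: dict = field(default_factory=dict)
    marqo_env: str = ""


def check_action(action_type, resources, now=None):
    """Return a list of violations; an empty list means the action may proceed."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Batch limit applies to the request as a whole.
    limit = BATCH_LIMITS.get(action_type)
    if limit is not None and len(resources) > limit:
        violations.append(f"batch of {len(resources)} exceeds limit of {limit}")

    for r in resources:
        # Assumption: a protected tag is recognised by its key alone.
        if PROTECTED_TAGS & set(r.tags):
            violations.append("resource carries a protected tag")
        if action_type in PROD_BLOCKED and r.marqo_env == "prod":
            violations.append(f"{action_type} is blocked for marqo_env='prod'")
        min_age = MIN_AGE.get(r.kind)
        if min_age is not None and now - r.created_at < min_age:
            violations.append(f"{r.kind} is younger than the minimum age")
    return violations
```

Returning every violation rather than stopping at the first one lets a preview show the user all the reasons an action is refused.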
Execution Architecture
The Cloudflare Worker does not have AWS write credentials. It invokes the Action Lambda:
React SPA → Worker (POST /api/actions/preview) → Lambda (DryRun=true) → AWS → preview result
React SPA → Worker (POST /api/actions/execute) → Lambda (real) → AWS → action_log updated
The Action Lambda has its own IAM role (polo-action-role) scoped to exactly the 11 supported actions.
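Since Phase 2 is not yet built, the following is only a sketch of how the preview and execute flows might share one code path inside the Lambda. Everything here is an assumption: the `PLANS` mapping, the `handle` function, the event fields (`mode`, `action_type`, `resource_id`, `actor`), and the injected `call_aws` and `write_log` callables are hypothetical. Only three EC2 action types are shown, because `DryRun` is an EC2 API parameter; the non-EC2 action types would need a different preview mechanism that this sketch does not cover.

```python
# Hypothetical mapping: action type -> (service, method, kwargs builder).
PLANS = {
    "stop_instance": ("ec2", "stop_instances", lambda rid: {"InstanceIds": [rid]}),
    "delete_volume": ("ec2", "delete_volume", lambda rid: {"VolumeId": rid}),
    "release_eip": ("ec2", "release_address", lambda rid: {"AllocationId": rid}),
}


def handle(event, call_aws, write_log):
    """Run one action in preview or execute mode.

    call_aws(service, method, params) stands in for the real AWS client call;
    write_log(row) stands in for an insert into action_log.
    """
    mode = event["mode"]
    if mode not in ("preview", "execute"):
        raise ValueError(f"unknown mode: {mode}")
    action_type = event["action_type"]
    if action_type not in PLANS:
        # Only the pre-defined safe set is accepted, never arbitrary commands.
        raise ValueError(f"unsupported action type: {action_type}")

    service, method, build_params = PLANS[action_type]
    params = build_params(event["resource_id"])
    if mode == "preview":
        params["DryRun"] = True  # same call, but AWS makes no change

    outcome = call_aws(service, method, params)

    if mode == "execute":
        # Audit trail: who, what, and what happened.
        write_log({
            "actor": event["actor"],
            "action_type": action_type,
            "resource_id": event["resource_id"],
            "outcome": outcome,
        })
    return {"mode": mode, "call": f"{service}.{method}", "params": params, "outcome": outcome}
```

Building the parameters once and only toggling `DryRun` keeps the previewed call and the executed call identical, which is what makes the preview trustworthy.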
Savings Measurement
The counterfactual problem
Deleting a test instance that would have been torn down tomorrow saves 1 day, not infinity.
Savings basis categories
| Basis | Attribution window | Example |
|---|---|---|
| orphaned_resource | 30 days (conservative; probably would have lingered) | Unattached EBS volume |
| rightsizing | Measured continuously from actual post-resize cost data | m5.2xlarge → m5.large |
| scheduled_teardown | Time acceleration only (days saved vs expected teardown) | Stopping test 1 day early |
| policy_violation | Duration of violation + 30 days | SageMaker running 3 weeks vs 8-hour rule |
| stale_data | 30 days (these genuinely would persist indefinitely) | Old snapshots, logs |
Calculation
- Fixed-window: total_savings = cost_before_daily × savings_window_days
- Rightsizing: Daily job accumulates (cost_before - actual_cost) until resource is terminated or resized again
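The two calculations can be sketched as plain functions. The function names are hypothetical, and the rightsizing version assumes that `cost_before` is a daily figure and that the daily job sees one `actual_cost` value per day since the resize.

```python
def fixed_window_savings(cost_before_daily, savings_window_days):
    """Fixed-window basis: daily cost before the action times the attribution window."""
    return cost_before_daily * savings_window_days


def rightsizing_savings(cost_before_daily, actual_daily_costs):
    """Rightsizing basis: sum of (cost_before - actual_cost) over each measured day.

    actual_daily_costs holds one entry per day since the resize; the caller
    stops appending once the resource is terminated or resized again.
    """
    return sum(cost_before_daily - actual for actual in actual_daily_costs)
```

For example, an unattached volume costing 0.50 per day under the 30-day orphaned_resource window yields `fixed_window_savings(0.5, 30) == 15.0`. An instance that cost 4.00 per day before a resize and 1.00, 1.50, and 1.00 on the three days after yields `rightsizing_savings(4.0, [1.0, 1.5, 1.0]) == 8.5`. A day on which the resized resource costs more than before contributes a negative amount, which keeps the measurement conservative.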
Suggestion Engine
Detectors run daily at 08:00 UTC, query resource_snapshots, and write rows to action_suggestions with status='pending'. Resources tagged polo:protected or do-not-delete are excluded from all terminate/delete detectors.
| Detector | Module | What it finds | Suggested action | Severity |
|---|---|---|---|---|
| unattached_volumes | detectors/unattached_volumes.py | EBS volume in available state for 7+ days | delete_volume | high |
| unassociated_eips | detectors/unassociated_eips.py | EIP not associated with any network interface | release_eip | low |
| idle_nat_gateways | detectors/idle_nat_gateways.py | NAT gateway with zero bytes processed for 7+ days | delete_nat_gateway | high |
| running_notebooks | detectors/running_notebooks.py | SageMaker notebook in InService state | stop_sagemaker_notebook | medium |
| stopped_instances | detectors/stopped_instances.py | EC2 instance in stopped state for 7+ days | terminate_instance | medium |
| gp2_volumes | detectors/gp2_volumes.py | EBS gp2 volume (gp3 is always cheaper) | modify_volume | low |
| no_role_instances | detectors/no_role_instances.py | Running EC2 instance with no Polo role classification (migration 018 column, not IAM); indicates untagged or orphaned resources | terminate_instance | high |
| orphaned_instances | detectors/orphaned_instances.py | Running instance whose marqo_index no longer exists in hierarchy_nodes (HA suffix variants stripped before comparison) | terminate_instance | high |
| persistent_dev_indexes | detectors/persistent_dev_indexes.py | Instance with audience='development' running for 3+ hours and role != 'monitoring'; dev instances should be ephemeral | terminate_instance | medium |
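To make one row of the table concrete, here is a rough sketch of what the unattached_volumes detector might look like. It is not the contents of detectors/unattached_volumes.py. The row fields (`resource_type`, `state`, `state_since`, `tags`, `resource_arn`) are assumed for illustration, since the schema of resource_snapshots is not described here, and the detector works on in-memory dicts rather than a database.

```python
from datetime import datetime, timedelta, timezone

PROTECTED_TAGS = {"polo:protected", "do-not-delete"}
MIN_AVAILABLE = timedelta(days=7)


def detect_unattached_volumes(snapshot_rows, now):
    """Return pending delete_volume suggestions for long-unattached EBS volumes."""
    suggestions = []
    for row in snapshot_rows:
        # Hypothetical column names; only EBS volumes in 'available' state qualify.
        if row["resource_type"] != "ebs_volume" or row["state"] != "available":
            continue
        # Protected resources are excluded from all terminate/delete detectors.
        if PROTECTED_TAGS & set(row.get("tags", {})):
            continue
        # Must have been unattached for at least 7 days.
        if now - row["state_since"] < MIN_AVAILABLE:
            continue
        suggestions.append({
            "resource_arn": row["resource_arn"],
            "detector": "unattached_volumes",
            "action_type": "delete_volume",
            "severity": "high",
            "status": "pending",
        })
    return suggestions
```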
Stale suggestions (pending for 7+ days) are expired to status='expired' at the start of each run before new ones are inserted. Duplicates (same resource_arn already pending) are skipped.
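The expire-then-insert ordering described above can be sketched as follows. This is an illustration over an in-memory list, not the engine's real persistence code; the `run_engine` name and the `created_at` field are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)


def run_engine(suggestions, detected, now):
    """Expire stale pending rows, then insert new ones, skipping duplicates.

    suggestions: existing rows (mutated in place); detected: this run's findings.
    Returns the number of rows inserted.
    """
    # Step 1: expire anything that has been pending for 7+ days.
    for row in suggestions:
        if row["status"] == "pending" and now - row["created_at"] >= STALE_AFTER:
            row["status"] = "expired"

    # Step 2: insert findings whose resource_arn is not already pending.
    pending_arns = {r["resource_arn"] for r in suggestions if r["status"] == "pending"}
    inserted = 0
    for finding in detected:
        arn = finding["resource_arn"]
        if arn in pending_arns:
            continue
        suggestions.append({**finding, "status": "pending", "created_at": now})
        pending_arns.add(arn)
        inserted += 1
    return inserted
```

Because expiry runs first, a resource whose suggestion has just expired no longer counts as a duplicate, so a still-wasteful resource gets a fresh pending suggestion in the same run.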
EC2 rightsizing detectors are deprioritised while the infra team reshuffles architecture.
UI Flow
Suggestions are presented ranked by estimated daily saving. Each card shows: resource, reason, evidence, estimated saving, and Preview / Dismiss / Go buttons. The Savings Dashboard shows cumulative savings over time, by category, and by person.
Key Tables
- action_types: Registry of supported actions (seeded by migration)
- action_suggestions: Pending suggestions with status tracking
- action_log: Full audit trail with before/after state and measured savings