Runbooks (Cross-Repo)
Operational runbooks, diagnostics, incident response, and maintenance procedures aggregated across every repo.
119 pages across 3 repos.
Cloud Control Plane
- 2026-06-19 - Envato Audio 502 Blips
- 2026-06-19 - Kark P90
- 2026-06-19 - Kogan Latency Blips
- 2026-04-14 - Personalized Search Outage
- [INTERNAL ONLY] Investigation doc - 5xx spike affecting envato due to kserve VRAM OOM
- Admin Platform
- Admin Worker Failed Requests Exceed 1 in 5m
- admin_ui blank dashboard after Vite 8 upgrade (2026-06-19)
- Agentic Search (Cloudflare Worker)
- API Gateway
- Check For-You is working
- Cloudflare Workers
- CloudWatch (Logs, Metrics, Alarms)
- Cognito
- Configure agentic search and converse for an index
- Configure pagination stability for an index
- Control Plane (Console + Monolith)
- Controller
- Controller API 4XX Rate Anomalous vs Baseline
- Controller API 5XX Rate Exceeds 5% in 5m
- Convert classic index to an ecommerce index
- Convert classic index to ecom
- Create a new ecommerce account
- Create analytics account if enabled
- Decommission a fork source index
- Dump index contents to S3 (vespa-visit)
- DynamoDB
- DynamoDB Access-Patterns Cheat Sheet
- Ecom Agentic Converse 5XX RPS
- Ecom Agentic Search 5XX RPS
- Ecom API 4XX Rate Alerts
- Ecom API 5XX Errors
- Ecom API Success Rate Below 98%
- Ecom Index Settings Exporter Lambda Has Errors
- Ecom Internal Indexer Job Success Rate Below 99%
- Ecom Metrics Worker Lambda Has Errors
- Ecom Queue Depth / Queue Backlog
- Ecommerce Platform
- Ecommerce Platform Runbook
- ECS (Fargate)
- Edit ecommerce index settings
- Elastic Beanstalk
- Enable ETL pipeline to push automated updates to an index
- Enable merchandising for an index
- Enable relevance trigger for an index
- EventBridge Webhook Migration Runbook
- FashionNova Queue Age Spike (2026-02-17)
- Flow: Add Documents from Shopify (Product Indexing)
- Flow: Add/Update/Delete Documents (Product Indexing)
- Flow: Agentic Search (AI-Powered Conversational Search)
- Flow: Login & Authentication
- Flow: Recommendations
- Flow: Search Query
- Flow: Settings Sync (DynamoDB → Cloudflare KV)
- Flow: User Signup
- Forking writes between indexes
- Inspecting Live Resources
- Lambda
- Laura Geller — Promo Messages Verification Guide
- Laura Geller — Promo UTM Test Links
- Laura Geller: Unreachable Documents (2026-05-20)
- Load agentic cached queries for an index
- Marqo settings not found for shop {shop_id}
- Merchandising Diagnostics
- Merchandising Exporter Lambda Has Errors
- Operations
- Post Incident Review template
- prod-EcomMonitoringServiceAlarm
- Reconcile Missouri Quilt Pixel IDs (ccp-3two.2)
- Reconnect pixel ETL pipeline to an existing customer’s new index
- Redrive Ecom Indexing Pipeline
- Reindexing
- Reindexing Job Made No Progress in 15m
- Reindexing Pipeline Replay Lambda Has Errors
- Runbooks
- S3
- Scale ecom indexing throughput
- Search Proxy (Cloudflare Worker)
- Secrets Manager
- Set account to use custom Marqtune model
- Set up For-You recommendations
- Setting up exact match boosters (for example, boost on title field)
- Shopify Diagnostics
- SQS
- WAF (Web Application Firewall)
Cloud Data Plane
- Runbook: alb_controller_unhealthy_pods
- Runbook: auth_keys_refresh_failure
- Runbook: autoscaling_controller_exceptions
- Runbook: autoscaling_controller_unhealthy
- Runbook: cf_global_worker_routes_threshold
- Runbook: cf_metrics_collector_dlq
- Runbook: cf_origin_p95_latency
- Runbook: cloudflare_429_rate
- Runbook: coredns_error_rate
- Runbook: coredns_unhealthy_pods
- Runbook: edge_unreachability
- Runbook: external_dns_unhealthy_pods
- Runbook: fluent_bit_unhealthy_pods
- Runbook: grpc_metrics_gateway_alerts
- Runbook: grpc_metrics_gateway_workers_vpc_setup
- Runbook: index_unreachability
- Runbook: keda_operator_unhealthy
- Runbook: keda_scaled_object_errors
- Runbook: keda_scaler_errors
- Runbook: kserve_inference_error_rate
- Runbook: kserve_node_disk_util
- Runbook: kserve_scaling_capacity
- Runbook: log_investigation_with_claude
- Runbook: marqo_api_env_var_hotfix
- Runbook: Migrate EKS Control Plane Log Group to CloudWatch Logs Infrequent Access
- Runbook: prometheus_autoscaling_unhealthy
- Runbook: prometheus_server_unhealthy
- Runbook: reverse_proxy_5xx