CloudWatch (Logs, Metrics, Alarms)
Logs
# Tail logs for a Lambda
aws logs tail /aws/lambda/staging-EcomIndexerFunction --follow
# Search logs for errors in the last hour (date -v is BSD/macOS syntax; with GNU date use: date -d '1 hour ago' +%s000)
aws logs filter-log-events \
--log-group-name /aws/lambda/staging-EcomIndexerFunction \
--start-time $(date -v-1H +%s000) \
--filter-pattern "?ERROR ?Traceback ?Exception"
# Monolith logs (ECS Fargate)
aws logs filter-log-events \
--log-group-name staging-monolith-logs \
--start-time $(date -v-15M +%s000) \
--filter-pattern "ERROR"
# Exclude health checks from monolith logs
aws logs filter-log-events \
--log-group-name staging-monolith-logs \
--start-time $(date -v-15M +%s000) \
--filter-pattern "-\"GET /openapi.json 200\""
# List log groups
aws logs describe-log-groups --query 'logGroups[].[logGroupName,storedBytes]' --output table
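
The `date -v-1H` and `date -v-15M` forms used in the commands above are BSD/macOS syntax and fail with GNU date on Linux. A minimal portable sketch computes the start time with shell arithmetic instead. `ms_ago` is a hypothetical helper name introduced here for illustration, not part of the AWS CLI.

```shell
# ms_ago MINUTES [NOW_SECONDS]
# Print the epoch time in milliseconds for MINUTES minutes before now.
# NOW_SECONDS is optional and exists only to make the helper easy to test.
ms_ago() {
  now="${2:-$(date +%s)}"
  echo $(( (now - $1 * 60) * 1000 ))
}

# Usage (same monolith query as above, independent of the date flavor):
# aws logs filter-log-events \
#   --log-group-name staging-monolith-logs \
#   --start-time "$(ms_ago 15)" \
#   --filter-pattern "ERROR"
```

`--start-time` expects milliseconds since the epoch, which is why the helper multiplies by 1000 rather than appending `000` to a formatted date.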
Alarms
# List all alarms
aws cloudwatch describe-alarms --query 'MetricAlarms[].[AlarmName,StateValue,MetricName]' --output table
# List alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --output table
# Get alarm history
aws cloudwatch describe-alarm-history --alarm-name "staging-EcomMetricsWorkerDLQAlarm" --max-items 10
Key Alarms
| Alarm | Trigger | Severity |
|---|---|---|
| {env}-EcomMetricsWorkerDLQAlarm | Messages in metrics DLQ | Sev2 (Slack + PagerDuty) |
| {env}-EcomMonitoringServiceAlarm | Monitoring Lambda errors | Sev2.5 |
| {env}-EcomPartialDocumentsDetectedGlobalAlarm | Partial docs in indexer | Sev2 |
| {env}-Agentic5xxRpsAlarm | Agentic 5XX rate > 2/s | Sev2 |
| MerchandisingExporterErrorAlarm-{env} | Merch exporter errors | Sev2 |
| MerchandisingExporterHeartbeatAlarm-{env} | No merch exporter invocations | Sev2 |
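
To check all key alarms for one environment in a single call, the `{env}` placeholder can be expanded in a small helper. This is a sketch under assumptions: `key_alarm_names` and `key_alarm_states` are hypothetical function names, and the query assumes every alarm in the table is a metric alarm (composite alarms are returned under `CompositeAlarms` and would not appear).

```shell
# key_alarm_names ENV
# Expand the {env} placeholder from the Key Alarms table into concrete names.
key_alarm_names() {
  env_name="$1"
  printf '%s\n' \
    "${env_name}-EcomMetricsWorkerDLQAlarm" \
    "${env_name}-EcomMonitoringServiceAlarm" \
    "${env_name}-EcomPartialDocumentsDetectedGlobalAlarm" \
    "${env_name}-Agentic5xxRpsAlarm" \
    "MerchandisingExporterErrorAlarm-${env_name}" \
    "MerchandisingExporterHeartbeatAlarm-${env_name}"
}

# key_alarm_states ENV
# Show the current state of each key alarm (needs AWS credentials).
key_alarm_states() {
  # Alarm names contain no whitespace, so the unquoted expansion
  # splits cleanly into one argument per alarm.
  aws cloudwatch describe-alarms \
    --alarm-names $(key_alarm_names "$1") \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output table
}
```

For example, `key_alarm_states staging` prints one row per alarm that exists. An alarm missing from the output suggests it is not deployed in that environment or is not a metric alarm.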
Dashboards
# List dashboards
aws cloudwatch list-dashboards --query 'DashboardEntries[].[DashboardName]' --output table
Key dashboards: {env}-EcomDashboard, CloudControllerDashboard-{env}, MerchandisingExporterDashboard-{env}.
SNS Notification Topics
| Topic | Purpose |
|---|---|
| CloudwatchAlarmNotifySlack | Slack alerts |
| CloudwatchAlarmNotifyPagerduty | PagerDuty Sev2 |
| CloudwatchAlarmNotifyPagerdutySev2_5 | PagerDuty Sev2.5 |
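
To see which alarms notify a given topic, the alarm actions can be filtered by topic ARN. The sketch below assumes the deployed topic names match the table exactly, with no generated prefix or suffix. `alarms_for_topic_query` and `alarms_for_topic` are hypothetical helper names.

```shell
# alarms_for_topic_query TOPIC_NAME
# Build a JMESPath query selecting metric alarms that have an alarm action
# whose ARN ends in ":TOPIC_NAME". Matching on the suffix keeps
# CloudwatchAlarmNotifyPagerduty from also matching the ...Sev2_5 topic.
alarms_for_topic_query() {
  printf "MetricAlarms[?AlarmActions[?ends_with(@, ':%s')]].AlarmName" "$1"
}

# alarms_for_topic TOPIC_NAME
# List the names of alarms that notify the topic (needs AWS credentials).
alarms_for_topic() {
  aws cloudwatch describe-alarms \
    --query "$(alarms_for_topic_query "$1")" \
    --output text
}
```

For example, `alarms_for_topic CloudwatchAlarmNotifyPagerduty` should list the Sev2 alarms. An alarm that is expected to page but is missing from the output probably has no action pointing at that topic.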
What to Look For
| Symptom | Check |
|---|---|
| Alert firing | aws cloudwatch describe-alarms --state-value ALARM |
| Missing logs | Verify the log group exists and that the Lambda execution role allows logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents |
| High error rate | Filter log group with ERROR pattern |
| Latency issues | Check dashboard widgets for p99 latency |
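
For the "Missing logs" row, the existence check can be scripted. This is a minimal sketch: `log_group_exists` is a hypothetical helper name. It matches the full name exactly, because `--log-group-name-prefix` alone also returns groups that merely start with the given string.

```shell
# log_group_exists NAME
# Exit 0 only if a log group named exactly NAME exists (needs AWS credentials).
log_group_exists() {
  # --output text prints the matching names tab-separated on one line,
  # so split them onto separate lines before the exact-match grep.
  aws logs describe-log-groups \
    --log-group-name-prefix "$1" \
    --query 'logGroups[].logGroupName' \
    --output text | tr '\t' '\n' | grep -Fxq -- "$1"
}

# Usage:
# log_group_exists /aws/lambda/staging-EcomIndexerFunction \
#   || echo "log group missing: check the function name and its execution role"
```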