CloudWatch (Logs, Metrics, Alarms)

Logs

# Tail logs for a Lambda
aws logs tail /aws/lambda/staging-EcomIndexerFunction --follow

# Search logs for errors in the last hour (date -v is BSD/macOS syntax)
aws logs filter-log-events \
  --log-group-name /aws/lambda/staging-EcomIndexerFunction \
  --start-time $(date -v-1H +%s000) \
  --filter-pattern "?ERROR ?Traceback ?Exception"

# Monolith logs (ECS Fargate)
aws logs filter-log-events \
  --log-group-name staging-monolith-logs \
  --start-time $(date -v-15M +%s000) \
  --filter-pattern "ERROR"

# Exclude health checks from monolith logs
aws logs filter-log-events \
  --log-group-name staging-monolith-logs \
  --start-time $(date -v-15M +%s000) \
  --filter-pattern "-\"GET /openapi.json 200\""

# List log groups
aws logs describe-log-groups --query 'logGroups[].[logGroupName,storedBytes]' --output table
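
The `--start-time` values above rely on `date -v`, which BSD/macOS `date` supports but GNU `date` on Linux does not. A minimal sketch of a portable alternative follows; `epoch_ms_minutes_ago` is a hypothetical helper name, not part of the AWS CLI, and it assumes only that `date +%s` is available (it is in both BSD and GNU `date`).

```shell
# Hypothetical helper: print the epoch timestamp in milliseconds
# for a point N minutes in the past. Uses only `date +%s`, which
# behaves the same on macOS and Linux.
epoch_ms_minutes_ago() {
  minutes="$1"
  now=$(date +%s)
  echo $(( (now - minutes * 60) * 1000 ))
}

# Usage, equivalent to the monolith ERROR search above:
# aws logs filter-log-events \
#   --log-group-name staging-monolith-logs \
#   --start-time "$(epoch_ms_minutes_ago 15)" \
#   --filter-pattern "ERROR"
```

`filter-log-events` expects `--start-time` in milliseconds since the epoch, which is why the helper multiplies by 1000 in the same way the original commands append `000`.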

Alarms

# List all alarms
aws cloudwatch describe-alarms --query 'MetricAlarms[].[AlarmName,StateValue,MetricName]' --output table

# List alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --output table

# Get alarm history
aws cloudwatch describe-alarm-history --alarm-name "staging-EcomMetricsWorkerDLQAlarm" --max-items 10

Key Alarms

| Alarm | Trigger | Severity |
| --- | --- | --- |
| {env}-EcomMetricsWorkerDLQAlarm | Messages in metrics DLQ | Sev2 (Slack + PagerDuty) |
| {env}-EcomMonitoringServiceAlarm | Monitoring Lambda errors | Sev2.5 |
| {env}-EcomPartialDocumentsDetectedGlobalAlarm | Partial docs in indexer | Sev2 |
| {env}-Agentic5xxRpsAlarm | Agentic 5XX rate > 2/s | Sev2 |
| MerchandisingExporterErrorAlarm-{env} | Merch exporter errors | Sev2 |
| MerchandisingExporterHeartbeatAlarm-{env} | No merch exporter invocations | Sev2 |
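
To check all of the key alarms for one environment in a single call, the `{env}` placeholder can be expanded into concrete names and passed to `describe-alarms`. This is a sketch only: `key_alarm_names` and `check_key_alarms` are hypothetical helper names, and it assumes every alarm listed is a metric alarm. By default `describe-alarms` returns metric alarms only, so any composite alarm would need `--alarm-types CompositeAlarm` and a `CompositeAlarms[]` query instead.

```shell
# Hypothetical helper: expand {env} in the key alarm names listed above.
key_alarm_names() {
  env_name="$1"
  printf '%s\n' \
    "${env_name}-EcomMetricsWorkerDLQAlarm" \
    "${env_name}-EcomMonitoringServiceAlarm" \
    "${env_name}-EcomPartialDocumentsDetectedGlobalAlarm" \
    "${env_name}-Agentic5xxRpsAlarm" \
    "MerchandisingExporterErrorAlarm-${env_name}" \
    "MerchandisingExporterHeartbeatAlarm-${env_name}"
}

# Hypothetical helper: show the current state of each key alarm.
# Assumes all of them are metric alarms (see the note above).
check_key_alarms() {
  # Word splitting is intentional here: one argument per alarm name.
  aws cloudwatch describe-alarms \
    --alarm-names $(key_alarm_names "$1") \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output table
}

# Usage:
# check_key_alarms staging
```

An alarm missing from the output either does not exist under that name in the current account and region or is not a metric alarm.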

Dashboards

# List dashboards
aws cloudwatch list-dashboards --query 'DashboardEntries[].[DashboardName]' --output table

Key dashboards: {env}-EcomDashboard, CloudControllerDashboard-{env}, MerchandisingExporterDashboard-{env}.

SNS Notification Topics

| Topic | Purpose |
| --- | --- |
| CloudwatchAlarmNotifySlack | Slack alerts |
| CloudwatchAlarmNotifyPagerduty | PagerDuty Sev2 |
| CloudwatchAlarmNotifyPagerdutySev2_5 | PagerDuty Sev2.5 |
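
To confirm which of these topics a given alarm actually notifies, the alarm's `AlarmActions` can be read back from `describe-alarms`. The sketch below uses a hypothetical helper name, `alarm_topics`, and assumes the alarm is a metric alarm whose actions are SNS topic ARNs; it prints the last segment of each ARN, which for an SNS topic is the topic name.

```shell
# Hypothetical helper: list the topic names attached to an alarm's
# ALARM-state actions. Assumes a metric alarm with SNS topic actions.
alarm_topics() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[].AlarmActions[]' \
    --output text |
    tr '\t' '\n' |                 # text output is tab-separated
    awk -F: 'NF { print $NF }'     # keep the last ARN segment
}

# Usage:
# alarm_topics staging-EcomMetricsWorkerDLQAlarm
```

Empty output means the alarm was not found or has no ALARM-state actions configured.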

What to Look For

| Symptom | Check |
| --- | --- |
| Alert firing | aws cloudwatch describe-alarms --state-value ALARM |
| Missing logs | Verify the log group exists, then confirm the Lambda execution role allows logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents |
| High error rate | Filter the log group with the ERROR pattern |
| Latency issues | Check the dashboard widgets for p99 latency |
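
For the missing-logs case, the first check (does the log group exist?) can be scripted. This is a minimal sketch: `log_group_exists` is a hypothetical helper name, and it assumes the caller has `logs:DescribeLogGroups` permission in the target account and region.

```shell
# Hypothetical helper: succeed (exit 0) only if a log group with exactly
# this name exists. The prefix lookup can return several groups, so the
# query filters for an exact name match.
log_group_exists() {
  name="$1"
  found=$(aws logs describe-log-groups \
    --log-group-name-prefix "$name" \
    --query "logGroups[?logGroupName=='${name}'].logGroupName" \
    --output text)
  [ "$found" = "$name" ]
}

# Usage:
# if log_group_exists /aws/lambda/staging-EcomIndexerFunction; then
#   echo "log group present; check the execution role next"
# else
#   echo "log group missing"
# fi
```

If the group exists but stays empty, the execution role's permissions are the next thing to inspect.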