CloudWatch (Logs, Metrics, Alarms)

Logs

# Tail logs for a Lambda
aws logs tail /aws/lambda/staging-EcomIndexerFunction --follow

# Search logs for errors in the last hour
# (--start-time is epoch milliseconds; date -v is BSD/macOS syntax,
#  on GNU/Linux use: date -d '1 hour ago' +%s000)
aws logs filter-log-events \
  --log-group-name /aws/lambda/staging-EcomIndexerFunction \
  --start-time $(date -v-1H +%s000) \
  --filter-pattern "?ERROR ?Traceback ?Exception"

# Monolith logs (ECS Fargate), errors in the last 15 minutes
aws logs filter-log-events \
  --log-group-name staging-monolith-logs \
  --start-time $(date -v-15M +%s000) \
  --filter-pattern "ERROR"

# Exclude health checks from monolith logs (last 15 minutes)
aws logs filter-log-events \
  --log-group-name staging-monolith-logs \
  --start-time $(date -v-15M +%s000) \
  --filter-pattern '-"GET /openapi.json 200"'

# List log groups
aws logs describe-log-groups --query 'logGroups[].[logGroupName,storedBytes]' --output table
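
The time-window searches above need a start time in epoch milliseconds, and the date flags for computing an offset differ between BSD/macOS and GNU/Linux. One way around that is a small helper that uses only shell arithmetic. This is a minimal sketch; ms_ago is a hypothetical name, not part of the AWS CLI.

```shell
# ms_ago SECONDS: print the epoch time SECONDS ago, in milliseconds.
# Hypothetical helper; it only relies on `date +%s`, which behaves the
# same with GNU and BSD date.
ms_ago() {
  echo $(( ($(date +%s) - $1) * 1000 ))
}

# Usage (same search as the one-hour error query above):
# aws logs filter-log-events \
#   --log-group-name /aws/lambda/staging-EcomIndexerFunction \
#   --start-time "$(ms_ago 3600)" \
#   --filter-pattern "?ERROR ?Traceback ?Exception"
```

For interactive use, aws logs tail also accepts a relative window directly, for example --since 1h.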

Alarms

# List all alarms
aws cloudwatch describe-alarms --query 'MetricAlarms[].[AlarmName,StateValue,MetricName]' --output table

# List alarms in ALARM state
aws cloudwatch describe-alarms --state-value ALARM --output table

# Get alarm history
aws cloudwatch describe-alarm-history --alarm-name "staging-EcomMetricsWorkerDLQAlarm" --max-items 10

Key Alarms

| Alarm | Trigger | Severity |
|---|---|---|
| {env}-EcomMetricsWorkerDLQAlarm | Messages in metrics DLQ | Sev2 (Slack + PagerDuty) |
| {env}-EcomMonitoringServiceAlarm | Monitoring Lambda errors | Sev2.5 |
| {env}-EcomPartialDocumentsDetectedGlobalAlarm | Partial docs in indexer | Sev2 |
| {env}-Agentic5xxRpsAlarm | Agentic 5XX rate > 2/s | Sev2 |
| MerchandisingExporterErrorAlarm-{env} | Merch exporter errors | Sev2 |
| MerchandisingExporterHeartbeatAlarm-{env} | No merch exporter invocations | Sev2 |
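
To check all of the key alarms for one environment in a single call, the {env} placeholder can be expanded in a small script. The sketch below is an illustration only: key_alarms and check_key_alarms are hypothetical helper names, and it assumes the listed alarms are metric alarms (composite alarms would not appear under MetricAlarms).

```shell
# key_alarms ENV: print the key alarm names listed above, one per line.
# Hypothetical helper; the names are copied from the Key Alarms table.
key_alarms() {
  env_name=$1
  printf '%s\n' \
    "${env_name}-EcomMetricsWorkerDLQAlarm" \
    "${env_name}-EcomMonitoringServiceAlarm" \
    "${env_name}-EcomPartialDocumentsDetectedGlobalAlarm" \
    "${env_name}-Agentic5xxRpsAlarm" \
    "MerchandisingExporterErrorAlarm-${env_name}" \
    "MerchandisingExporterHeartbeatAlarm-${env_name}"
}

# check_key_alarms ENV: show the current state of each key alarm.
# Requires AWS CLI credentials; assumes metric (not composite) alarms.
check_key_alarms() {
  # Word splitting is intentional here: alarm names contain no spaces.
  aws cloudwatch describe-alarms \
    --alarm-names $(key_alarms "$1") \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output table
}

# Usage: check_key_alarms staging
```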

Dashboards

# List dashboards
aws cloudwatch list-dashboards --query 'DashboardEntries[].[DashboardName]' --output table

Key dashboards: {env}-EcomDashboard, CloudControllerDashboard-{env}, MerchandisingExporterDashboard-{env}.

SNS Notification Topics

| Topic | Purpose |
|---|---|
| CloudwatchAlarmNotifySlack | Slack alerts |
| CloudwatchAlarmNotifyPagerduty | PagerDuty Sev2 |
| CloudwatchAlarmNotifyPagerdutySev2_5 | PagerDuty Sev2.5 |

What to Look For

| Symptom | Check |
|---|---|
| Alert firing | aws cloudwatch describe-alarms --state-value ALARM |
| Missing logs | Verify the log group exists; check that the Lambda execution role has CloudWatch Logs permissions (logs:CreateLogStream, logs:PutLogEvents) |
| High error rate | Filter the log group with an ERROR pattern |
| Latency issues | Check dashboard widgets for p99 latency |
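
For the missing-logs case, the existence check can be scripted. describe-log-groups matches by name prefix, so the sketch below compares names exactly. log_group_exists is a hypothetical helper name, and the function assumes working AWS CLI credentials.

```shell
# log_group_exists NAME: succeed only if a log group named exactly NAME exists.
# Hypothetical helper; --log-group-name-prefix can return longer names that
# merely start with NAME, so the output is filtered for an exact match.
log_group_exists() {
  aws logs describe-log-groups \
    --log-group-name-prefix "$1" \
    --query 'logGroups[].logGroupName' \
    --output text | tr '\t' '\n' | grep -qxF -- "$1"
}

# Usage:
# log_group_exists /aws/lambda/staging-EcomIndexerFunction \
#   && echo "log group found" \
#   || echo "log group missing"
```

If the group exists but stays empty, the execution role's permissions are the next thing to check.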