AWS CloudWatch: The Complete Monitoring Guide for DevOps Engineers (2026)
AWS CloudWatch is the central monitoring service for everything running on AWS. This guide covers metrics, logs, alarms, dashboards, Container Insights, and production best practices.
If you're running workloads on AWS, CloudWatch is unavoidable. It's where EC2 metrics live, where Lambda logs go, where ECS and EKS surface container health, and where you set the alarms that wake people up at 3am.
But CloudWatch is a sprawling service with dozens of features, and most engineers only scratch the surface. This guide covers the parts that actually matter for production DevOps.
What Is AWS CloudWatch?
CloudWatch is AWS's native observability platform. It covers:
- Metrics: Numeric time-series data from AWS services (CPU, memory, request counts, error rates)
- Logs: Log ingestion, storage, and querying (CloudWatch Logs + Logs Insights)
- Alarms: Threshold-based alerting on any metric
- Dashboards: Real-time and historical visualization
- Events/EventBridge: React to AWS events (instance state change, deployment completion, etc.)
- Synthetics: Canary monitoring for APIs and websites
- Container Insights: Deep container metrics for ECS and EKS
- Application Insights: Automated anomaly detection for applications
Part 1: Metrics
Built-in metrics (free)
Every AWS service automatically emits metrics to CloudWatch. Key ones:
EC2:
- CPUUtilization — CPU % (5-minute granularity by default; 1-minute with detailed monitoring)
- NetworkIn / NetworkOut — bytes transferred
- StatusCheckFailed — 0 or 1 (hardware/software failure)
RDS:
- CPUUtilization, DatabaseConnections, FreeStorageSpace
- ReadLatency, WriteLatency — critical for performance
- ReplicaLag — for read replicas
ELB/ALB:
- RequestCount, HTTPCode_Target_5XX_Count
- TargetResponseTime — latency
- UnhealthyHostCount — are your targets healthy?
SQS:
- ApproximateNumberOfMessagesVisible — queue depth
- NumberOfMessagesSent, NumberOfMessagesDeleted
- ApproximateAgeOfOldestMessage — how stale is the oldest message?
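You can pull any of these built-in metrics programmatically. Here's a minimal boto3 sketch using get_metric_statistics, assuming a queue named my-queue (swap in your own namespace, metric, and dimensions):

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)

# Fetch the last hour of SQS queue depth as 5-minute averages
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'my-queue'}],  # hypothetical queue name
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=['Average'],
)

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])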
Custom metrics
Publish your own metrics from any application or script:
# Publish a custom metric via AWS CLI
aws cloudwatch put-metric-data \
--namespace "MyApp/API" \
--metric-name "OrdersProcessed" \
--value 42 \
--unit Count \
--dimensions Environment=production,Service=order-processor
From Python (boto3):
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_data(
    Namespace='MyApp/API',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 42,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'order-processor'},
            ]
        },
    ]
)
From Go:
package main
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)
// publishMetric sends one OrdersProcessed datapoint to CloudWatch.
func publishMetric(client *cloudwatch.Client, value float64) error {
    _, err := client.PutMetricData(context.Background(), &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("MyApp/API"),
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("OrdersProcessed"),
                Value:      aws.Float64(value),
                Unit:       types.StandardUnitCount,
            },
        },
    })
    return err
}
Metric Math
Combine metrics with math expressions. Assign each graphed metric an ID (for example, errors and requests), then add an expression that references those IDs:

# Error rate = errors / total requests
errors / requests * 100
In the CloudWatch console: Metrics → Select metrics → Graphed metrics → Add math.
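The same expressions work through the GetMetricData API, which is how you'd compute an error rate in a script. A minimal boto3 sketch, assuming the ALB from the later examples (the LoadBalancer dimension value is illustrative):

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    MetricDataQueries=[
        {
            'Id': 'errors',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'HTTPCode_Target_5XX_Count',
                    'Dimensions': [{'Name': 'LoadBalancer', 'Value': 'app/my-alb/abc123'}],
                },
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,  # input to the expression only
        },
        {
            'Id': 'requests',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'RequestCount',
                    'Dimensions': [{'Name': 'LoadBalancer', 'Value': 'app/my-alb/abc123'}],
                },
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,
        },
        # Only the computed error rate is returned
        {'Id': 'error_rate', 'Expression': 'errors / requests * 100', 'Label': 'Error Rate (%)'},
    ],
)

for result in response['MetricDataResults']:
    print(result['Label'], list(zip(result['Timestamps'], result['Values'])))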
Part 2: CloudWatch Logs
Log Groups and Log Streams
- Log Group: a container for logs from one source (e.g., /aws/lambda/my-function)
- Log Stream: an individual sequence of events within a group (e.g., one Lambda instance)
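To see the hierarchy in action, here's a minimal boto3 sketch that creates a group and a stream and writes one event (the /myapp/worker names are illustrative; timestamps are epoch milliseconds):

import time

import boto3

logs = boto3.client('logs', region_name='us-east-1')

# One group per source, one stream per producer instance
# (create_* raise ResourceAlreadyExistsException on re-run)
logs.create_log_group(logGroupName='/myapp/worker')
logs.create_log_stream(logGroupName='/myapp/worker', logStreamName='worker-1')

logs.put_log_events(
    logGroupName='/myapp/worker',
    logStreamName='worker-1',
    logEvents=[{'timestamp': int(time.time() * 1000), 'message': 'job started'}],
)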
Send EC2 application logs to CloudWatch
Install the CloudWatch Agent on EC2:
# Download and install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
Create the agent config at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/ec2/nginx/error",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
-s
Querying logs with CloudWatch Logs Insights
Logs Insights is CloudWatch's query language for log analysis. Far more powerful than grep.
Find 5xx errors:
fields @timestamp, @message
| filter @message like /HTTP 5\d\d/
| stats count() as errorCount by bin(5m)
| sort errorCount desc
Find slow API endpoints:
fields @timestamp, requestPath, duration
| filter duration > 1000 # requests taking more than 1 second
| stats avg(duration) as avgDuration, count() as requestCount by requestPath
| sort avgDuration desc
| limit 20
Count errors by type:
fields @timestamp, errorType, @message
| filter level = "ERROR"
| stats count() as errorCount by errorType
| sort errorCount desc
Lambda cold start analysis:
filter @type = "REPORT"
| stats avg(@initDuration) as avgColdStart, count(@initDuration) as coldStartCount
| filter ispresent(@initDuration)
Top IPs by request count:
fields @timestamp, clientIP
| stats count() as requests by clientIP
| sort requests desc
| limit 10
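All of these queries can also run programmatically via the StartQuery and GetQueryResults APIs, which is handy for runbooks and scheduled reports. A minimal boto3 sketch (the log group name is illustrative):

import time

import boto3

logs = boto3.client('logs', region_name='us-east-1')
now = int(time.time())

query = logs.start_query(
    logGroupName='/aws/lambda/my-function',
    startTime=now - 3600,  # epoch seconds; last hour
    endTime=now,
    queryString='fields @timestamp, @message | filter @message like /ERROR/ | limit 20',
)

# Poll until the query finishes
while True:
    result = logs.get_query_results(queryId=query['queryId'])
    if result['status'] not in ('Scheduled', 'Running'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})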
Set log retention (save money)
By default, CloudWatch Logs never expire. Set retention to control costs:
# Set 30-day retention on a log group
aws logs put-retention-policy \
--log-group-name "/aws/lambda/my-function" \
--retention-in-days 30
# For all log groups via script
aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text | \
tr '\t' '\n' | while read lg; do
aws logs put-retention-policy \
--log-group-name "$lg" \
--retention-in-days 90
done
Part 3: Alarms
Create an alarm for high CPU
aws cloudwatch put-metric-alarm \
--alarm-name "High-CPU-Production" \
--alarm-description "CPU above 80% for 10 minutes (2 evaluation periods of 5 minutes)" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts" \
--ok-actions "arn:aws:sns:us-east-1:123456789:production-alerts"
Alarm on metric math (composite)
# Alarm when error rate > 1%
aws cloudwatch put-metric-alarm \
--alarm-name "High-Error-Rate" \
--metrics '[
  {
    "Id": "errors",
    "ReturnData": false,
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
      },
      "Period": 60,
      "Stat": "Sum"
    }
  },
  {
    "Id": "requests",
    "ReturnData": false,
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
      },
      "Period": 60,
      "Stat": "Sum"
    }
  },
  {
    "Id": "errorRate",
    "Expression": "errors/requests*100",
    "Label": "ErrorRate",
    "ReturnData": true
  }
]' \
--comparison-operator GreaterThanThreshold \
--threshold 1 \
--evaluation-periods 3 \
--alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts"
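Beyond static thresholds, CloudWatch also supports anomaly detection alarms, which learn a metric's normal band and fire when it leaves that band. A boto3 sketch, assuming the same instance ID and SNS topic as above:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='CPU-Anomaly-Production',
    ComparisonOperator='GreaterThanUpperThreshold',
    EvaluationPeriods=2,
    ThresholdMetricId='ad1',  # points at the anomaly detection band below
    Metrics=[
        {
            'Id': 'm1',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/EC2',
                    'MetricName': 'CPUUtilization',
                    'Dimensions': [{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
                },
                'Period': 300,
                'Stat': 'Average',
            },
        },
        # Band width of 2 standard deviations; raise it to make the alarm less sensitive
        {'Id': 'ad1', 'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)'},
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789:production-alerts'],
)

SNS integration for Slack alerts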
Create a Lambda that forwards SNS alerts to Slack:
import json
import urllib.request

def handler(event, context):
    # SNS delivers the alarm payload as a JSON string in the message body
    sns_message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = sns_message['AlarmName']
    new_state = sns_message['NewStateValue']
    reason = sns_message['NewStateReason']

    color = "#ff0000" if new_state == "ALARM" else "#36a64f"
    emoji = "🔴" if new_state == "ALARM" else "✅"

    payload = {
        "attachments": [{
            "color": color,
            "title": f"{emoji} CloudWatch Alarm: {alarm_name}",
            "text": reason,
            "fields": [
                {"title": "State", "value": new_state, "short": True},
                {"title": "Account", "value": sns_message.get('AWSAccountId', 'N/A'), "short": True}
            ]
        }]
    }

    # In production, read the webhook URL from an environment variable or Secrets Manager
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    data = json.dumps(payload).encode('utf-8')
    req = urllib.request.Request(webhook_url, data=data, headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(req)
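The Lambda still needs to be wired to the alarm topic. A sketch of the two required calls, assuming the function is named slack-forwarder (hypothetical) and the topic from the alarm examples above:

import boto3

sns = boto3.client('sns', region_name='us-east-1')
lambda_client = boto3.client('lambda', region_name='us-east-1')

topic_arn = 'arn:aws:sns:us-east-1:123456789:production-alerts'
function_arn = 'arn:aws:lambda:us-east-1:123456789:function:slack-forwarder'  # hypothetical

# Allow SNS to invoke the function
lambda_client.add_permission(
    FunctionName='slack-forwarder',
    StatementId='AllowSNSInvoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn,
)

# Deliver alarm notifications to the function
sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)

Part 4: Container Insights for EKS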
Container Insights gives you deep visibility into EKS pods, nodes, and containers.
Enable Container Insights on EKS
# Install CloudWatch Agent as DaemonSet
ClusterName=my-cluster
RegionName=us-east-1
FluentBitHttpPort=2020
FluentBitReadFromHead=Off
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/On/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/;s/{{read_from_tail}}/Off/" | \
kubectl apply -f -
After the DaemonSet is deployed, you get:
- Metrics in the ContainerInsights namespace — cluster, node, pod, and container metrics
- Container logs in log groups such as /aws/containerinsights/<cluster>/application
Key Container Insights metrics
# Pod CPU usage
aws cloudwatch get-metric-statistics \
--namespace ContainerInsights \
--metric-name pod_cpu_utilization \
--dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=production \
--start-time 2026-03-16T00:00:00Z \
--end-time 2026-03-16T23:59:59Z \
--period 300 \
--statistics Average
Part 5: CloudWatch Dashboards
Create a production dashboard as code:
{
"widgets": [
{
"type": "metric",
"properties": {
"title": "API Error Rate",
"metrics": [
["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
"LoadBalancer", "app/my-alb/abc123",
{"stat": "Sum", "period": 60, "label": "5xx Errors"}],
[".", "RequestCount", ".", ".",
{"stat": "Sum", "period": 60, "label": "Total Requests", "yAxis": "right"}]
],
"period": 60,
"view": "timeSeries"
}
},
{
"type": "alarm",
"properties": {
"title": "Active Alarms",
"alarms": [
"arn:aws:cloudwatch:us-east-1:123456789:alarm:High-CPU-Production",
"arn:aws:cloudwatch:us-east-1:123456789:alarm:High-Error-Rate"
]
}
},
{
"type": "log",
"properties": {
"title": "Recent Errors",
"query": "SOURCE '/aws/lambda/my-function' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
"region": "us-east-1",
"view": "table"
}
}
]
}
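Dashboards are just JSON, so you can push them from CI with the PutDashboard API. A minimal boto3 sketch, assuming the JSON above is saved at dashboards/production.json:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

with open('dashboards/production.json') as f:
    body = f.read()

# Creates the dashboard, or replaces it if it already exists
cloudwatch.put_dashboard(DashboardName='production-overview', DashboardBody=body)

Apply via Terraform: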
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "production-overview"
dashboard_body = file("dashboards/production.json")
}
Cost Optimization
CloudWatch costs can add up. Key savings:
| Item | Default | Cost Control |
|---|---|---|
| Log storage | Never expires | Set retention (30–90 days) |
| Custom metrics | $0.30/metric/month | Only publish metrics you actually use |
| Detailed monitoring | Disabled | Enable only for critical EC2 instances |
| Log queries | $0.005/GB scanned | Be specific in Insights queries |
# Find log groups without retention policies (cost risk)
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].[logGroupName,storedBytes]' \
--output table
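To close the gap automatically, here's a small boto3 sketch that applies a 90-day retention policy to every log group that has none (adjust the retention to your compliance needs):

import boto3

logs = boto3.client('logs', region_name='us-east-1')

paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for group in page['logGroups']:
        if 'retentionInDays' not in group:
            logs.put_retention_policy(
                logGroupName=group['logGroupName'],
                retentionInDays=90,
            )
            print(f"Set 90-day retention on {group['logGroupName']}")

Learn More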
Want hands-on AWS monitoring labs with real CloudWatch, Prometheus, and Grafana setups? KodeKloud's AWS and DevOps courses give you real cloud environments with guided exercises.
If you're looking for a cost-effective cloud to run practice environments, DigitalOcean offers $200 in free credits for new accounts — enough to run real infrastructure for months.
Summary
CloudWatch is your primary observability tool if you're on AWS:
- Metrics — built-in from every service + custom metrics via PutMetricData
- Logs Insights — powerful query language for log analysis
- Alarms — threshold and anomaly detection with SNS notifications
- Container Insights — deep ECS and EKS visibility
- Dashboards — real-time operational views as code
Start with the basics (alarms on CPU, error rate, and queue depth), add log retention policies to control costs, and build dashboards that give your team the signal they need during incidents.