AWS CloudWatch: The Complete Monitoring Guide for DevOps Engineers (2026)
AWS CloudWatch is the central monitoring service for everything running on AWS. This guide covers metrics, logs, alarms, dashboards, Container Insights, and production best practices.
If you're running workloads on AWS, CloudWatch is unavoidable. It's where EC2 metrics live, where Lambda logs go, where ECS and EKS surface container health, and where you set the alarms that wake people up at 3am.
But CloudWatch is a sprawling service with dozens of features, and most engineers only scratch the surface. This guide covers the parts that actually matter for production DevOps.
What Is AWS CloudWatch?
CloudWatch is AWS's native observability platform. It covers:
- Metrics: Numeric time-series data from AWS services (CPU, memory, request counts, error rates)
- Logs: Log ingestion, storage, and querying (CloudWatch Logs + Logs Insights)
- Alarms: Threshold-based alerting on any metric
- Dashboards: Real-time and historical visualization
- Events/EventBridge: React to AWS events (instance state change, deployment completion, etc.)
- Synthetics: Canary monitoring for APIs and websites
- Container Insights: Deep container metrics for ECS and EKS
- Application Insights: Automated anomaly detection for applications
Part 1: Metrics
Built-in metrics (free)
Every AWS service automatically emits metrics to CloudWatch. Key ones:
EC2:
- CPUUtilization — CPU % (5-minute granularity by default; 1-minute with detailed monitoring)
- NetworkIn / NetworkOut — bytes transferred
- StatusCheckFailed — 0 or 1 (hardware/software failure)
RDS:
- CPUUtilization, DatabaseConnections, FreeStorageSpace
- ReadLatency, WriteLatency — critical for performance
- ReplicaLag — for read replicas
ELB/ALB:
- RequestCount, HTTPCode_Target_5XX_Count
- TargetResponseTime — latency
- UnhealthyHostCount — are your targets healthy?
SQS:
- ApproximateNumberOfMessagesVisible — queue depth
- NumberOfMessagesSent, NumberOfMessagesDeleted
- ApproximateAgeOfOldestMessage — how stale is the oldest message?
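You can pull any of these built-in metrics programmatically. Here's a minimal boto3 sketch using get_metric_statistics, assuming a queue named my-queue (swap in your own namespace, metric, and dimensions):

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)

# Fetch the last hour of SQS queue depth as 5-minute averages
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'my-queue'}],  # hypothetical queue name
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=['Average'],
)

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])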
Custom metrics
Publish your own metrics from any application or script:
# Publish a custom metric via AWS CLI
aws cloudwatch put-metric-data \
--namespace "MyApp/API" \
--metric-name "OrdersProcessed" \
--value 42 \
--unit Count \
--dimensions Environment=production,Service=order-processor
From Python (boto3):
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_data(
    Namespace='MyApp/API',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 42,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'order-processor'},
            ]
        },
    ]
)
From Go:
package main
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)
// publishMetric sends one OrdersProcessed datapoint to CloudWatch.
func publishMetric(client *cloudwatch.Client, value float64) error {
    _, err := client.PutMetricData(context.Background(), &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("MyApp/API"),
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("OrdersProcessed"),
                Value:      aws.Float64(value),
                Unit:       types.StandardUnitCount,
            },
        },
    })
    return err
}
Metric Math
Combine metrics with math expressions. Assign each graphed metric an ID (for example, errors and requests), then add an expression that references those IDs:

# Error rate = errors / total requests
errors / requests * 100
In the CloudWatch console: Metrics → Select metrics → Graphed metrics → Add math.
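The same expressions work through the GetMetricData API, which is how you'd compute an error rate in a script. A minimal boto3 sketch, assuming the ALB from the later examples (the LoadBalancer dimension value is illustrative):

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    MetricDataQueries=[
        {
            'Id': 'errors',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'HTTPCode_Target_5XX_Count',
                    'Dimensions': [{'Name': 'LoadBalancer', 'Value': 'app/my-alb/abc123'}],
                },
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,  # input to the expression only
        },
        {
            'Id': 'requests',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/ApplicationELB',
                    'MetricName': 'RequestCount',
                    'Dimensions': [{'Name': 'LoadBalancer', 'Value': 'app/my-alb/abc123'}],
                },
                'Period': 60,
                'Stat': 'Sum',
            },
            'ReturnData': False,
        },
        # Only the computed error rate is returned
        {'Id': 'error_rate', 'Expression': 'errors / requests * 100', 'Label': 'Error Rate (%)'},
    ],
)

for result in response['MetricDataResults']:
    print(result['Label'], list(zip(result['Timestamps'], result['Values'])))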
Part 2: CloudWatch Logs
Log Groups and Log Streams
- Log Group: a container for logs from one source (e.g., /aws/lambda/my-function)
- Log Stream: an individual sequence of events within a group (e.g., one Lambda instance)
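To see the hierarchy in action, here's a minimal boto3 sketch that creates a group and a stream and writes one event (the /myapp/worker names are illustrative; timestamps are epoch milliseconds):

import time

import boto3

logs = boto3.client('logs', region_name='us-east-1')

# One group per source, one stream per producer instance
# (create_* raise ResourceAlreadyExistsException on re-run)
logs.create_log_group(logGroupName='/myapp/worker')
logs.create_log_stream(logGroupName='/myapp/worker', logStreamName='worker-1')

logs.put_log_events(
    logGroupName='/myapp/worker',
    logStreamName='worker-1',
    logEvents=[{'timestamp': int(time.time() * 1000), 'message': 'job started'}],
)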
Send EC2 application logs to CloudWatch
Install the CloudWatch Agent on EC2:
# Download and install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
Create the agent config at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/ec2/nginx/error",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
-s
Querying logs with CloudWatch Logs Insights
Logs Insights is CloudWatch's query language for log analysis. Far more powerful than grep.
Find 5xx errors:
fields @timestamp, @message
| filter @message like /HTTP 5\d\d/
| stats count() as errorCount by bin(5m)
| sort errorCount desc
Find slow API endpoints:
fields @timestamp, requestPath, duration
| filter duration > 1000 # requests taking more than 1 second
| stats avg(duration) as avgDuration, count() as requestCount by requestPath
| sort avgDuration desc
| limit 20
Count errors by type:
fields @timestamp, errorType, @message
| filter level = "ERROR"
| stats count() as errorCount by errorType
| sort errorCount desc
Lambda cold start analysis:
filter @type = "REPORT"
| stats avg(@initDuration) as avgColdStart, count(@initDuration) as coldStartCount
| filter ispresent(@initDuration)
Top IPs by request count:
fields @timestamp, clientIP
| stats count() as requests by clientIP
| sort requests desc
| limit 10
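All of these queries can also run programmatically via the StartQuery and GetQueryResults APIs, which is handy for runbooks and scheduled reports. A minimal boto3 sketch (the log group name is illustrative):

import time

import boto3

logs = boto3.client('logs', region_name='us-east-1')
now = int(time.time())

query = logs.start_query(
    logGroupName='/aws/lambda/my-function',
    startTime=now - 3600,  # epoch seconds; last hour
    endTime=now,
    queryString='fields @timestamp, @message | filter @message like /ERROR/ | limit 20',
)

# Poll until the query finishes
while True:
    result = logs.get_query_results(queryId=query['queryId'])
    if result['status'] not in ('Scheduled', 'Running'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})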
Set log retention (save money)
By default, CloudWatch Logs never expire. Set retention to control costs:
# Set 30-day retention on a log group
aws logs put-retention-policy \
--log-group-name "/aws/lambda/my-function" \
--retention-in-days 30
# For all log groups via script
aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text | \
tr '\t' '\n' | while read lg; do
aws logs put-retention-policy \
--log-group-name "$lg" \
--retention-in-days 90
done
Part 3: Alarms
Create an alarm for high CPU
aws cloudwatch put-metric-alarm \
--alarm-name "High-CPU-Production" \
--alarm-description "CPU above 80% for 10 minutes (2 evaluation periods of 5 minutes)" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts" \
--ok-actions "arn:aws:sns:us-east-1:123456789:production-alerts"
Alarm on metric math (composite)
# Alarm when error rate > 1%
aws cloudwatch put-metric-alarm \
--alarm-name "High-Error-Rate" \
--metrics '[
  {
    "Id": "errors",
    "ReturnData": false,
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
      },
      "Period": 60,
      "Stat": "Sum"
    }
  },
  {
    "Id": "requests",
    "ReturnData": false,
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
      },
      "Period": 60,
      "Stat": "Sum"
    }
  },
  {
    "Id": "errorRate",
    "Expression": "errors/requests*100",
    "Label": "ErrorRate",
    "ReturnData": true
  }
]' \
--comparison-operator GreaterThanThreshold \
--threshold 1 \
--evaluation-periods 3 \
--alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts"
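Beyond static thresholds, CloudWatch also supports anomaly detection alarms, which learn a metric's normal band and fire when it leaves that band. A boto3 sketch, assuming the same instance ID and SNS topic as above:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='CPU-Anomaly-Production',
    ComparisonOperator='GreaterThanUpperThreshold',
    EvaluationPeriods=2,
    ThresholdMetricId='ad1',  # points at the anomaly detection band below
    Metrics=[
        {
            'Id': 'm1',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/EC2',
                    'MetricName': 'CPUUtilization',
                    'Dimensions': [{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
                },
                'Period': 300,
                'Stat': 'Average',
            },
        },
        # Band width of 2 standard deviations; raise it to make the alarm less sensitive
        {'Id': 'ad1', 'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)'},
    ],
    AlarmActions=['arn:aws:sns:us-east-1:123456789:production-alerts'],
)

SNS integration for Slack alerts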
Create a Lambda that forwards SNS alerts to Slack:
import json
import urllib.request

def handler(event, context):
    # SNS delivers the alarm payload as a JSON string in the message body
    sns_message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = sns_message['AlarmName']
    new_state = sns_message['NewStateValue']
    reason = sns_message['NewStateReason']

    color = "#ff0000" if new_state == "ALARM" else "#36a64f"
    emoji = "🔴" if new_state == "ALARM" else "✅"

    payload = {
        "attachments": [{
            "color": color,
            "title": f"{emoji} CloudWatch Alarm: {alarm_name}",
            "text": reason,
            "fields": [
                {"title": "State", "value": new_state, "short": True},
                {"title": "Account", "value": sns_message.get('AWSAccountId', 'N/A'), "short": True}
            ]
        }]
    }

    # In production, read the webhook URL from an environment variable or Secrets Manager
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    data = json.dumps(payload).encode('utf-8')
    req = urllib.request.Request(webhook_url, data=data, headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(req)
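The Lambda still needs to be wired to the alarm topic. A sketch of the two required calls, assuming the function is named slack-forwarder (hypothetical) and the topic from the alarm examples above:

import boto3

sns = boto3.client('sns', region_name='us-east-1')
lambda_client = boto3.client('lambda', region_name='us-east-1')

topic_arn = 'arn:aws:sns:us-east-1:123456789:production-alerts'
function_arn = 'arn:aws:lambda:us-east-1:123456789:function:slack-forwarder'  # hypothetical

# Allow SNS to invoke the function
lambda_client.add_permission(
    FunctionName='slack-forwarder',
    StatementId='AllowSNSInvoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn,
)

# Deliver alarm notifications to the function
sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)

Part 4: Container Insights for EKS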
Container Insights gives you deep visibility into EKS pods, nodes, and containers.
Enable Container Insights on EKS
# Install CloudWatch Agent as DaemonSet
ClusterName=my-cluster
RegionName=us-east-1
FluentBitHttpPort=2020
FluentBitReadFromHead=Off
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/On/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/;s/{{read_from_tail}}/Off/" | \
kubectl apply -f -
After the DaemonSet is deployed, you get:
- Metrics in the ContainerInsights namespace — cluster, node, pod, and container metrics
- Container logs in log groups such as /aws/containerinsights/<cluster>/application
Key Container Insights metrics
# Pod CPU usage
aws cloudwatch get-metric-statistics \
--namespace ContainerInsights \
--metric-name pod_cpu_utilization \
--dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=production \
--start-time 2026-03-16T00:00:00Z \
--end-time 2026-03-16T23:59:59Z \
--period 300 \
--statistics Average
Part 5: CloudWatch Dashboards
Create a production dashboard as code:
{
"widgets": [
{
"type": "metric",
"properties": {
"title": "API Error Rate",
"metrics": [
["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
"LoadBalancer", "app/my-alb/abc123",
{"stat": "Sum", "period": 60, "label": "5xx Errors"}],
[".", "RequestCount", ".", ".",
{"stat": "Sum", "period": 60, "label": "Total Requests", "yAxis": "right"}]
],
"period": 60,
"view": "timeSeries"
}
},
{
"type": "alarm",
"properties": {
"title": "Active Alarms",
"alarms": [
"arn:aws:cloudwatch:us-east-1:123456789:alarm:High-CPU-Production",
"arn:aws:cloudwatch:us-east-1:123456789:alarm:High-Error-Rate"
]
}
},
{
"type": "log",
"properties": {
"title": "Recent Errors",
"query": "SOURCE '/aws/lambda/my-function' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
"region": "us-east-1",
"view": "table"
}
}
]
}
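Dashboards are just JSON, so you can push them from CI with the PutDashboard API. A minimal boto3 sketch, assuming the JSON above is saved at dashboards/production.json:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

with open('dashboards/production.json') as f:
    body = f.read()

# Creates the dashboard, or replaces it if it already exists
cloudwatch.put_dashboard(DashboardName='production-overview', DashboardBody=body)

Apply via Terraform: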
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "production-overview"
dashboard_body = file("dashboards/production.json")
}
Cost Optimization
CloudWatch costs can add up. Key savings:
| Item | Default | Cost Control |
|---|---|---|
| Log storage | Never expires | Set retention (30–90 days) |
| Custom metrics | $0.30/metric/month | Only publish metrics you actually use |
| Detailed monitoring | Disabled | Enable only for critical EC2 instances |
| Log queries | $0.005/GB scanned | Be specific in Insights queries |
# Find log groups without retention policies (cost risk)
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].[logGroupName,storedBytes]' \
--output table
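To close the gap automatically, here's a small boto3 sketch that applies a 90-day retention policy to every log group that has none (adjust the retention to your compliance needs):

import boto3

logs = boto3.client('logs', region_name='us-east-1')

paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for group in page['logGroups']:
        if 'retentionInDays' not in group:
            logs.put_retention_policy(
                logGroupName=group['logGroupName'],
                retentionInDays=90,
            )
            print(f"Set 90-day retention on {group['logGroupName']}")

Learn More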
Want hands-on AWS monitoring labs with real CloudWatch, Prometheus, and Grafana setups? KodeKloud's AWS and DevOps courses give you real cloud environments with guided exercises.
If you're looking for a cost-effective cloud to run practice environments, DigitalOcean offers $200 in free credits for new accounts — enough to run real infrastructure for months.
Summary
CloudWatch is your primary observability tool if you're on AWS:
- Metrics — built-in from every service + custom metrics via PutMetricData
- Logs Insights — powerful query language for log analysis
- Alarms — threshold and anomaly detection with SNS notifications
- Container Insights — deep ECS and EKS visibility
- Dashboards — real-time operational views as code
Start with the basics (alarms on CPU, error rate, and queue depth), add log retention policies to control costs, and build dashboards that give your team the signal they need during incidents.