
AWS CloudWatch: The Complete Monitoring Guide for DevOps Engineers (2026)

AWS CloudWatch is the central monitoring service for everything running on AWS. This guide covers metrics, logs, alarms, dashboards, Container Insights, and production best practices.

DevOpsBoys · Mar 16, 2026 · 7 min read

If you're running workloads on AWS, CloudWatch is unavoidable. It's where EC2 metrics live, where Lambda logs go, where ECS and EKS surface container health, and where you set the alarms that wake people up at 3am.

But CloudWatch is a sprawling service with dozens of features, and most engineers only scratch the surface. This guide covers the parts that actually matter for production DevOps.


What Is AWS CloudWatch?

CloudWatch is AWS's native observability platform. It covers:

  • Metrics: Numeric time-series data from AWS services (CPU, memory, request counts, error rates)
  • Logs: Log ingestion, storage, and querying (CloudWatch Logs + Logs Insights)
  • Alarms: Threshold-based alerting on any metric
  • Dashboards: Real-time and historical visualization
  • Events/EventBridge: React to AWS events (instance state change, deployment completion, etc.)
  • Synthetics: Canary monitoring for APIs and websites
  • Container Insights: Deep container metrics for ECS and EKS
  • Application Insights: Automated anomaly detection for applications

Part 1: Metrics

Built-in metrics (free)

Every AWS service automatically emits metrics to CloudWatch. Key ones:

EC2:

  • CPUUtilization — CPU % (5-minute granularity by default; 1-minute with paid detailed monitoring)
  • NetworkIn / NetworkOut — bytes transferred
  • StatusCheckFailed — 0 or 1 (hardware/software failure)

RDS:

  • CPUUtilization, DatabaseConnections, FreeStorageSpace
  • ReadLatency, WriteLatency — critical for performance
  • ReplicaLag — for read replicas

ELB/ALB:

  • RequestCount, HTTPCode_Target_5XX_Count
  • TargetResponseTime — latency
  • UnhealthyHostCount — are your targets healthy?

SQS:

  • ApproximateNumberOfMessagesVisible — queue depth
  • NumberOfMessagesSent, NumberOfMessagesDeleted
  • ApproximateAgeOfOldestMessage — how stale is the oldest message?
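
These built-in metrics can also be read back programmatically. A minimal boto3 sketch (queue name and region are placeholders) that pulls SQS queue depth over a recent window:

```python
from datetime import datetime, timedelta, timezone

def queue_depth_request(queue_name: str, minutes: int = 60) -> dict:
    """Parameters for GetMetricStatistics on SQS queue depth."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Maximum"],
    }

def fetch_queue_depth(queue_name: str) -> list:
    import boto3  # lazy import: module loads without the AWS SDK installed
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    resp = cloudwatch.get_metric_statistics(**queue_depth_request(queue_name))
    # Datapoints come back unordered; sort them into a time series
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```

Using Maximum rather than Average here catches short queue-depth spikes that an average over 5 minutes would smooth away.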

Custom metrics

Publish your own metrics from any application or script:

bash
# Publish a custom metric via AWS CLI
aws cloudwatch put-metric-data \
  --namespace "MyApp/API" \
  --metric-name "OrdersProcessed" \
  --value 42 \
  --unit Count \
  --dimensions Environment=production,Service=order-processor

From Python (boto3):

python
import boto3
 
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
 
cloudwatch.put_metric_data(
    Namespace='MyApp/API',
    MetricData=[
        {
            'MetricName': 'OrdersProcessed',
            'Value': 42,
            'Unit': 'Count',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'order-processor'},
            ]
        },
    ]
)

From Go:

go
package main
 
import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)
 
func publishMetric(client *cloudwatch.Client, value float64) error {
    // PutMetricData returns an error worth propagating to the caller
    _, err := client.PutMetricData(context.Background(), &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("MyApp/API"),
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("OrdersProcessed"),
                Value:      aws.Float64(value),
                Unit:       types.StandardUnitCount,
            },
        },
    })
    return err
}

Metric Math

Combine metrics with math expressions:

# Error rate = errors / total requests
# Each graphed metric gets an id: m1 = HTTPCode_Target_5XX_Count (Sum),
#                                 m2 = RequestCount (Sum)
m1 / m2 * 100

In the CloudWatch console: Metrics → Select metrics → Graphed metrics → Add math.
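
The same expression works programmatically through GetMetricData. A boto3 sketch (the load balancer name is a placeholder) that builds the two raw metric queries plus the math expression, returning only the computed error rate:

```python
from datetime import datetime, timedelta, timezone

def error_rate_queries(load_balancer: str) -> list:
    """MetricDataQueries: two raw ALB metrics plus a math expression.
    Only the expression result is returned to the caller (ReturnData)."""
    def raw(metric_id: str, metric_name: str) -> dict:
        return {
            "Id": metric_id,
            "ReturnData": False,  # used by the expression, not returned
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }
    return [
        raw("errors", "HTTPCode_Target_5XX_Count"),
        raw("requests", "RequestCount"),
        {"Id": "errorRate", "Expression": "errors / requests * 100",
         "Label": "Error rate (%)", "ReturnData": True},
    ]

def fetch_error_rate(load_balancer: str) -> list:
    import boto3  # lazy import: module loads without the AWS SDK installed
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=error_rate_queries(load_balancer),
        StartTime=now - timedelta(hours=1),
        EndTime=now,
    )
    return resp["MetricDataResults"]
```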


Part 2: CloudWatch Logs

Log Groups and Log Streams

  • Log Group: Container for logs from one source (e.g., /aws/lambda/my-function)
  • Log Stream: Individual log stream within a group (e.g., one Lambda instance)
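
To make the group/stream model concrete, here is a minimal boto3 sketch that writes an event directly through the API (group, stream, and region are placeholders; recent versions of the API no longer require sequence tokens):

```python
import time

def log_event(message: str) -> dict:
    """One CloudWatch Logs event: millisecond epoch timestamp + message."""
    return {"timestamp": int(time.time() * 1000), "message": message}

def write_log(group: str, stream: str, message: str) -> None:
    import boto3  # lazy import: module loads without the AWS SDK installed
    logs = boto3.client("logs", region_name="us-east-1")
    # Groups and streams must exist before you can write to them
    for create in (
        lambda: logs.create_log_group(logGroupName=group),
        lambda: logs.create_log_stream(logGroupName=group, logStreamName=stream),
    ):
        try:
            create()
        except logs.exceptions.ResourceAlreadyExistsException:
            pass  # already there — fine
    logs.put_log_events(
        logGroupName=group,
        logStreamName=stream,
        logEvents=[log_event(message)],
    )
```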

Send EC2 application logs to CloudWatch

Install the CloudWatch Agent on EC2:

bash
# Download and install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

Create the agent config:

json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/ec2/nginx/error",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/myapp/app.log",
            "log_group_name": "/ec2/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
bash
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json \
  -s

Querying logs with CloudWatch Logs Insights

Logs Insights is CloudWatch's query language for log analysis. Far more powerful than grep.

Find 5xx errors:

fields @timestamp, @message
| filter @message like /HTTP 5\d\d/
| stats count() as errorCount by bin(5m)

Find slow API endpoints:

fields @timestamp, requestPath, duration
| filter duration > 1000    # requests taking more than 1 second
| stats avg(duration) as avgDuration, count() as requestCount by requestPath
| sort avgDuration desc
| limit 20

Count errors by type:

fields @timestamp, errorType, @message
| filter level = "ERROR"
| stats count() as errorCount by errorType
| sort errorCount desc

Lambda cold start analysis:

filter @type = "REPORT" and ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart, count(@initDuration) as coldStartCount

Top IPs by request count:

fields @timestamp, clientIP
| stats count() as requests by clientIP
| sort requests desc
| limit 10
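
These queries can also be run programmatically with StartQuery/GetQueryResults. A boto3 sketch (log group and region are placeholders) that submits a query and polls until it finishes:

```python
import time

# Example query from the section above: count errors by type
ERROR_COUNT_QUERY = """fields @timestamp, errorType
| filter level = "ERROR"
| stats count() as errorCount by errorType
| sort errorCount desc"""

def run_insights_query(log_group: str, query: str, minutes: int = 60) -> list:
    """Submit a Logs Insights query and poll until it completes."""
    import boto3  # lazy import: module loads without the AWS SDK installed
    logs = boto3.client("logs", region_name="us-east-1")
    end = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )
    while True:
        result = logs.get_query_results(queryId=started["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result["results"]
        time.sleep(1)  # queries usually finish within a few seconds
```

Results come back as one list of {field, value} pairs per matching row. Remember that each query is billed by data scanned, so keep the time window tight.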

Set log retention (save money)

By default, CloudWatch Logs never expire. Set retention to control costs:

bash
# Set 30-day retention on a log group
aws logs put-retention-policy \
  --log-group-name "/aws/lambda/my-function" \
  --retention-in-days 30
 
# For all log groups via script
aws logs describe-log-groups --query 'logGroups[*].logGroupName' --output text | \
  tr '\t' '\n' | while read lg; do
    aws logs put-retention-policy \
      --log-group-name "$lg" \
      --retention-in-days 90
done

Part 3: Alarms

Create an alarm for high CPU

bash
aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-Production" \
  --alarm-description "CPU above 80% for 10 minutes (two 5-minute periods)" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts" \
  --ok-actions "arn:aws:sns:us-east-1:123456789:production-alerts"

Alarm on metric math (composite)

bash
# Alarm when error rate > 1%
aws cloudwatch put-metric-alarm \
  --alarm-name "High-Error-Rate" \
  --metrics '[
    {
      "Id": "errors",
      "ReturnData": false,
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "HTTPCode_Target_5XX_Count",
          "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
        },
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "requests",
      "ReturnData": false,
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/ApplicationELB",
          "MetricName": "RequestCount",
          "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
        },
        "Period": 60,
        "Stat": "Sum"
      }
    },
    {
      "Id": "errorRate",
      "Expression": "errors/requests*100",
      "Label": "ErrorRate",
      "ReturnData": true
    }
  ]' \
  --comparison-operator GreaterThanThreshold \
  --threshold 1 \
  --evaluation-periods 3 \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:production-alerts"

SNS integration for Slack alerts

Create a Lambda that forwards SNS alerts to Slack:

python
import json
import urllib.request
 
def handler(event, context):
    sns_message = json.loads(event['Records'][0]['Sns']['Message'])
 
    alarm_name = sns_message['AlarmName']
    new_state = sns_message['NewStateValue']
    reason = sns_message['NewStateReason']
 
    color = "#ff0000" if new_state == "ALARM" else "#36a64f"
    emoji = "🔴" if new_state == "ALARM" else "✅"
 
    payload = {
        "attachments": [{
            "color": color,
            "title": f"{emoji} CloudWatch Alarm: {alarm_name}",
            "text": reason,
            "fields": [
                {"title": "State", "value": new_state, "short": True},
                {"title": "Account", "value": sns_message.get('AWSAccountId', 'N/A'), "short": True}
            ]
        }]
    }
 
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    data = json.dumps(payload).encode('utf-8')
    req = urllib.request.Request(webhook_url, data=data, headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(req)
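
To sanity-check the handler locally before wiring up SNS, you can feed it a hand-built event with the same shape SNS delivers to Lambda (all values here are made up):

```python
import json

def sample_sns_event(alarm_name: str, state: str, reason: str) -> dict:
    """Minimal stand-in for the event Lambda receives from SNS."""
    message = {
        "AlarmName": alarm_name,
        "NewStateValue": state,       # "ALARM" or "OK"
        "NewStateReason": reason,
        "AWSAccountId": "123456789012",  # fake account id
    }
    # SNS delivers the alarm payload as a JSON string inside the record
    return {"Records": [{"Sns": {"Message": json.dumps(message)}}]}
```

Calling `handler(sample_sns_event("High-CPU-Production", "ALARM", "threshold crossed"), None)` exercises the full formatting path (note it will actually post to the webhook).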

Part 4: Container Insights for EKS

Container Insights gives you deep visibility into EKS pods, nodes, and containers.

Enable Container Insights on EKS

bash
# Install CloudWatch Agent as DaemonSet
ClusterName=my-cluster
RegionName=us-east-1
FluentBitHttpPort=2020
FluentBitReadFromHead=Off
 
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
  sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/On/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/;s/{{read_from_tail}}/Off/" | \
  kubectl apply -f -

After deployment, you get metrics in CloudWatch under namespaces:

  • ContainerInsights — cluster, node, pod, container metrics
  • Container logs in log groups: /aws/containerinsights/<cluster>/application

Key Container Insights metrics

bash
# Pod CPU usage
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name pod_cpu_utilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=production \
  --start-time 2026-03-16T00:00:00Z \
  --end-time 2026-03-16T23:59:59Z \
  --period 300 \
  --statistics Average

Part 5: CloudWatch Dashboards

Create a production dashboard as code:

json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "API Error Rate",
        "metrics": [
          ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count",
           "LoadBalancer", "app/my-alb/abc123",
           {"stat": "Sum", "period": 60, "label": "5xx Errors"}],
          [".", "RequestCount", ".", ".",
           {"stat": "Sum", "period": 60, "label": "Total Requests", "yAxis": "right"}]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "alarm",
      "properties": {
        "title": "Active Alarms",
        "alarms": [
          "arn:aws:cloudwatch:us-east-1:123456789:alarm:High-CPU-Production",
          "arn:aws:cloudwatch:us-east-1:123456789:alarm:High-Error-Rate"
        ]
      }
    },
    {
      "type": "log",
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/lambda/my-function' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "view": "table"
      }
    }
  ]
}

Apply via Terraform:

hcl
resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "production-overview"
  dashboard_body = file("dashboards/production.json")
}
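
If you'd rather not use Terraform, the same JSON can be pushed with boto3's put_dashboard, which also validates the body. A minimal sketch (dashboard name and region are placeholders):

```python
import json

def dashboard_body(widgets: list) -> str:
    """Serialize widget dicts into the JSON string put_dashboard expects."""
    return json.dumps({"widgets": widgets})

def publish_dashboard(name: str, widgets: list) -> None:
    import boto3  # lazy import: module loads without the AWS SDK installed
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    resp = cloudwatch.put_dashboard(
        DashboardName=name,
        DashboardBody=dashboard_body(widgets),
    )
    # put_dashboard can succeed with warnings; surface them for review
    for msg in resp.get("DashboardValidationMessages", []):
        print(msg)
```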

Cost Optimization

CloudWatch costs can add up. Key savings:

| Item | Default | Cost control |
| --- | --- | --- |
| Log storage | Never expires | Set retention (30–90 days) |
| Custom metrics | $0.30/metric/month | Only publish metrics you actually use |
| Detailed monitoring | Disabled | Enable only for critical EC2 instances |
| Log queries | $0.005/GB scanned | Be specific in Insights queries |
bash
# Find log groups without retention policies (cost risk)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].[logGroupName,storedBytes]' \
  --output table

Learn More

Want hands-on AWS monitoring labs with real CloudWatch, Prometheus, and Grafana setups? KodeKloud's AWS and DevOps courses give you real cloud environments with guided exercises.

If you're looking for a cost-effective cloud to run practice environments, DigitalOcean offers $200 in free credits for new accounts — enough to run real infrastructure for months.


Summary

CloudWatch is your primary observability tool if you're on AWS:

  1. Metrics — built-in from every service + custom metrics via PutMetricData
  2. Logs Insights — powerful query language for log analysis
  3. Alarms — threshold and anomaly detection with SNS notifications
  4. Container Insights — deep ECS and EKS visibility
  5. Dashboards — real-time operational views as code

Start with the basics (alarms on CPU, error rate, and queue depth), add log retention policies to control costs, and build dashboards that give your team the signal they need during incidents.
