EC2 Monitoring - Essential Guide for AWS Users

Amazon EC2 (Elastic Compute Cloud) is the foundation of many AWS-powered services, offering scalable virtual servers in the cloud. For cloud engineers and DevOps specialists, ensuring these EC2 instances run smoothly is important. EC2 monitoring is critical for ensuring a dependable, high-performance AWS infrastructure.

This article covers the fundamentals of EC2 monitoring, from basic concepts to advanced strategies.

What is EC2 Monitoring and Why is it Crucial?

EC2 monitoring is the constant process of tracking, evaluating, and improving the performance, health, and resource use of your Amazon EC2 instances. It is a critical approach for effectively managing AWS infrastructure. Here's why it is so important:

Ensuring Reliability: Monitoring EC2 instances allows you to proactively discover and handle issues before they escalate, ensuring that your applications function smoothly and without interruption.
Optimizing Performance: By measuring important performance indicators regularly, you can fine-tune your EC2 instances and ensure they function to their full potential.
Cost Management: Monitoring enables you to optimise resource allocation, allowing you to right-size your instances and avoid wasteful over-provisioning while lowering expenses.
Improving Security: Continuous monitoring might identify unexpected patterns of activity that may indicate possible security threats, allowing you to respond swiftly.

AWS provides two levels of EC2 monitoring:

Basic Monitoring: Provides important indicators including CPU utilization at 5-minute intervals at no additional cost.
Detailed Monitoring: For an additional cost, you gain more detailed insights using data at 1-minute intervals.

Relationship Between EC2 Monitoring and Overall AWS Observability

EC2 monitoring is a core element of AWS observability, providing crucial insights into how your EC2 instances interact with other AWS services. Integrating EC2 metrics with broader observability tools like CloudWatch gives you a comprehensive view of your application's behaviour and performance, making it easier to detect and troubleshoot issues across the entire AWS environment.

AWS Shared Responsibility Model and EC2 Monitoring

In the AWS Shared Responsibility Model, monitoring is key to the user's responsibility for securing and managing their EC2 instances. While AWS ensures the security of the cloud infrastructure, the user is responsible for securing and monitoring what runs in the cloud—including EC2 instances. This means regularly monitoring security metrics, detecting suspicious activity, and enforcing compliance are all part of the user's obligation under the shared model, particularly for EC2 instance security.

Essential EC2 Metrics to Monitor

Monitoring important EC2 indicators ensures optimal performance, spots problems early, and minimizes expenses. The following are the key metrics to monitor to ensure your EC2 instances perform efficiently.

CPU Utilization

CPU usage indicates how much of the EC2 instance's allotted compute capability is being used. Consistently high CPU consumption suggests that your instance is under significant strain, requiring you to scale up or improve your application. On the other side, poor usage indicates over-provisioning, resulting in unneeded expenses.

# Example CloudWatch query to retrieve CPU utilization data for a specific EC2 instance
get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Average']
)

Memory Usage

Unlike CPU, Amazon CloudWatch does not provide memory use by default. To monitor this statistic, install and set up the CloudWatch Agent on your EC2 instance. High memory utilization can degrade performance and result in application crashes, thus monitoring it is critical for ensuring your instance has enough resources.

Disk I/O

Disk I/O metrics measure the read and write operations on your instance's storage. High-disk I/O can create a bottleneck for programs that rely on frequent disk access, such as databases or data-intensive services. Monitoring disk performance allows you to uncover inefficiencies and avoid storage-related performance concerns.

Key Disk I/O Metrics:

ReadOps and WriteOps: Number of read and write operations.
ReadBytes and WriteBytes: The total amount of data read and written in bytes.
DiskReadOps and DiskWriteOps are more specific metrics of disk I/O activity.

Network Performance

Network performance metrics enable you to track the data transfer rates and packet activity of your EC2 instances. Monitoring these metrics is critical for ensuring that your instance has enough network capacity to handle both incoming and outgoing traffic properly.

Key network metrics include the volume of incoming and exiting data (in bytes).

NetworkPacketsIn and NetworkPacketsOut: Total amount of packets transmitted to and from the instance.

Latency and Throughput Monitoring

Latency measures the time it takes for a request to travel from the sender to the receiver, while throughput measures the rate at which data is successfully transmitted. Monitoring both latency and throughput is crucial for performance-critical applications like web services or databases.

Latency: High latency might indicate network congestion, suboptimal routing, or overloaded instances.
Throughput: Low throughput could suggest network bottlenecks or issues with instance configuration.

Status Checks and Their Significance

EC2 status checks are a simple method to see if your instance is functioning as intended. They assist in detecting possible issues with either the underlying AWS infrastructure or the EC2 instance itself, ensuring that you are notified of any issues that may have an impact on performance.

There are two types of status checks:

System Status Checks: These tests monitor the AWS infrastructure (hardware, networking) that supports your EC2 instance. Hardware breakdowns and network difficulties are frequent issues requiring AWS assistance.
Instance Status Checks: These checks evaluate the health of the instance, including its software, operating system, and network configurations. Problems such as damaged file systems, failing programs, or wrong network settings are recognized and require human intervention to resolve.

Configure Status Check Alarms

You can set up alerts to warn you when a system or instance status check fails. This proactive alerting helps to guarantee that issues are resolved quickly.

# Example command to create a CloudWatch alarm for failed EC2 system status checks
aws cloudwatch put-metric-alarm \\
    --alarm-name "EC2 System Check Failed" \\
    --metric-name StatusCheckFailed_System \\
    --namespace AWS/EC2 \\
    --statistic Maximum \\
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \\
    --period 300 \\
    --threshold 1 \\
    --comparison-operator GreaterThanOrEqualToThreshold \\
    --evaluation-periods 1 \\
    --alarm-actions arn:aws:sns:us-east-1:111122223333:MyTopic

By setting these alarms, you can stay ahead of potential issues and ensure the continuous health and availability of your EC2 instances.

EC2 Auto Recovery

EC2 Auto Recovery allows AWS to automatically recover your EC2 instance if it fails a system status check due to issues like hardware failure. AWS automatically restarts the instance on different hardware within the same Availability Zone, preserving instance data and minimizing downtime. Auto Recovery is particularly useful for ensuring the high availability of critical instances.

You can enable EC2 Auto Recovery by setting up an Amazon CloudWatch alarm that triggers the recovery action when a system status check fails.

Tools for EC2 Monitoring in AWS

AWS provides a variety of useful tools for monitoring your EC2 instances, allowing you to measure performance, detect anomalies, and automate replies.

Amazon CloudWatch: CloudWatch is AWS' offers monitoring EC2 and other resources. It collects measurements, monitors log data, generates alerts, and provides real-time performance information. CloudWatch enables you to monitor the health of your instance and respond quickly to concerns. An example can be monitoring CPU use, disk I/O, and network traffic.
- An example use case:
  - Setting up alarms to send notifications or perform scheduled tasks.
AWS CloudTrail: CloudTrail captures and tracks all API requests performed through your AWS account, including those involving EC2 instances. This logging helps trace changes to EC2 setups, which improves your security posture and assists with compliance and auditing requirements.
- Some example use cases:
  - Detecting unauthorized changes to instance settings.
  - Auditing activities performed by various AWS users or services.
Amazon GuardDuty: GuardDuty is a threat detection service that detects suspicious activities throughout your AWS environment, including EC2 instances. It analyzes data from CloudTrail, VPC Flow Logs, and DNS logs to identify possible security concerns such as hacked instances or strange API requests.
- Some example use cases:
  - Including detecting anomalous or illegal network activity.
  - Detection of Bitcoin mining on EC2 instances.
AWS Systems Manager: Systems Manager provides a common interface for operational data from all AWS services, allowing you to manage and monitor EC2 fleets at scale. You can automate typical operations like patch management and get precise performance data.
- Some example use cases:
  - Perform automatic patching and configuration upgrades.
  - Monitor many EC2 instances from a single dashboard.

Best Practices for Effective EC2 Monitoring

To achieve effective EC2 monitoring, follow a few best practices. These will assist you in proactively managing your resources, reducing expenditures, and detecting problems before they worsen.

Set up custom CloudWatch dashboards: Organize all key EC2 metrics into CloudWatch dashboards to provide consolidated monitoring. These dashboards give a visual overview of your instances' performance, allowing you to identify patterns and abnormalities immediately. For example, create a real-time dashboard to monitor CPU, RAM, disk I/O, and network traffic.
Implement correct tagging: Tag EC2 instances by environment (e.g., dev, test, prod), application, or business unit. This makes it easy to organize resources, implement procedures, and evaluate metrics. For example, tagging instances with Environment=Production enables restricted monitoring of only production systems.
Use CloudWatch Alarms: Set up alerts for important indicators like CPU use, storage space, and network traffic. Alarms alert you to possible difficulties, allowing you to take action before they harm your application. For example, set a CloudWatch alarm to alert you if CPU use surpasses 80% for more than 5 minutes. This will urge you to examine or scale up the instance.
Integrate EC2 monitoring with other AWS services: Monitoring EC2 in isolation may not provide the complete picture. Correlating EC2 metrics with data from other AWS services, such as RDS (for databases) or ELB (load balancers), allows you to see how different components of your application stack interact. For example, use CloudWatch to concurrently monitor the health of EC2 instances and RDS databases, allowing you to spot cross-service performance bottlenecks.

Advanced EC2 Monitoring Techniques

Beyond basic monitoring, AWS provides sophisticated tools and approaches to help you control your EC2 instances and create more intelligent monitoring strategies.

Utilize CloudWatch Logs for log analysis: You can send system and application logs from your EC2 instances to CloudWatch Logs. This centralizes your log data for simpler analysis, searching, and storage. Setting up log-based alarms allows you to be alerted of certain occurrences in real-time. For example, send Nginx or Apache web server logs to CloudWatch Logs and configure an alert for 500 Internal Server Errors.
Implement custom metrics: While CloudWatch includes built-in metrics, you may have special requirements that necessitate custom measurements. You can use the CloudWatch Agent to deliver custom metrics about your application, such as request latency or database query performance. For example, track custom metrics like “number of active users” or “an average request processing time” to monitor application-specific performance.
Leverage EC2 instance metadata: The EC2 instance metadata includes information on the instance, such as its ID, security group, and availability zone. By accessing this metadata programmatically, you can acquire insights that normal monitoring metrics do not provide. For example, you use instance metadata to dynamically alter instance behaviour based on the environment or area.
Use AWS Lambda for automated remediation: AWS Lambda can be set up to automatically respond to CloudWatch alarms. This enables you to automate tasks like restarting an unresponsive instance or scaling an instance depending on certain thresholds. For example, create a Lambda function that runs when a CloudWatch alarm alerts to limited disk space, automatically clearing away unneeded log files.

How to Monitor EC2 Instances with CloudWatch

Amazon CloudWatch provides comprehensive monitoring for EC2 instances by collecting metrics and logs, which help optimize performance and detect issues. Here's a step-by-step guide to monitor your EC2 instances using CloudWatch:

1. Enable Detailed Monitoring

By default, CloudWatch collects EC2 metrics at 5-minute intervals. You can enable detailed monitoring to get 1-minute granular metrics.

To enable detailed monitoring, follow these steps:

Navigate to the EC2 Dashboard in the AWS Management Console.
Select your EC2 instance.
Click on the Actions dropdown and choose Monitor and Troubleshoot.
Select Enable Detailed Monitoring.

2. View CloudWatch Metrics

CloudWatch collects essential EC2 instance metrics, such as:

CPU Utilization: Tracks the percentage of allocated EC2 compute units in use.
Disk Read/Write Operations: Measures the rate data is being read/written from/to the disk.
Network In/Out: Monitors the volume of data transferred via network interfaces.

To view CloudWatch metric, follow these steps:

Go to CloudWatch in the AWS Management Console.
Select Metrics > EC2.
You will see a list of metrics associated with your EC2 instances.

3. Set Up CloudWatch Alarms

CloudWatch alarms notify you when an EC2 instance metric crosses a specific threshold. This is essential for real-time monitoring and automation.

Example: Set an alarm to trigger if CPU utilization exceeds 80% for 5 minutes.

aws cloudwatch put-metric-alarm \\
  --alarm-name "HighCPUUtilization" \\
  --metric-name "CPUUtilization" \\
  --namespace "AWS/EC2" \\
  --statistic "Average" \\
  --period 300 \\
  --threshold 80 \\
  --comparison-operator "GreaterThanThreshold" \\
  --evaluation-periods 1 \\
  --alarm-actions arn:aws:sns:your-topic-arn \\
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0

4. Install the CloudWatch Agent for Additional Metrics

To collect more detailed information like memory usage, you need to install and configure the CloudWatch Agent.

Steps:
- SSH into your EC2 instance and install the CloudWatch Agent:
```
sudo yum install amazon-cloudwatch-agent
```
- Create a configuration file for the metrics you want to collect, including memory and disk utilization.
- Start the CloudWatch Agent:
```
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \\
  -a start
```

For more detailed steps, refer to the AWS CloudWatch User Guide.

Enhancing EC2 Monitoring with SigNoz

While CloudWatch provides robust native monitoring capabilities, tools like SigNoz can offer advanced features such as real-time performance insights, distributed tracing, and enhanced analytics for EC2 monitoring.

Key Features of SigNoz for EC2 Monitoring:

Advanced Observability: SigNoz offers deep visibility into EC2 metrics, providing insights into CPU, memory, and disk usage, along with application-level metrics.
Distributed Tracing: SigNoz allows you to trace requests across multiple AWS services, helping you identify bottlenecks and performance issues.
Real-Time Alerts: Configure custom alerts that provide real-time notifications when metrics exceed predefined thresholds.
Cost-Effective: SigNoz is an open-source solution that provides a cost-effective alternative to other observability platforms.

Getting Started with SigNoz:

SigNoz cloud is the easiest way to run SigNoz. Sign up for a free account and get 30 days of unlimited access to all features.

You can also install and self-host SigNoz yourself since it is open-source. With 20,000+ GitHub stars, open-source SigNoz is loved by developers. Find the instructions to self-host SigNoz.

Install the SigNoz Agent on your EC2 instance to start collecting metrics.
Configure EC2 to Send Metrics: For EC2 instances (without Kubernetes): You need to install the OpenTelemetry agent directly on your EC2 instance. This will collect and send telemetry data (metrics, logs, and traces) to SigNoz. Here's how you can configure it:
- Install OpenTelemetry Collector (or agent) on your EC2 instance. This can be done by downloading the OpenTelemetry Collector binaries or using package managers, such as Yum or APT.
- Once installed, you can configure the agent to collect system metrics like CPU, memory, disk I/O, and network performance.
An example to configure the OpenTelemetry agent (no Kubernetes involved) using a configuration file stored locally on the EC2 instance:
```
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  batch:
    timeout: 10s

exporters:
  otlp:
    endpoint: <SigNoz-endpoint>
    insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [batch]
      exporters: [otlp]
```
- Save this YAML configuration file on your EC2 instance.
- Start the OpenTelemetry agent using the configuration file. The agent will send metrics to the SigNoz server over the OTLP (OpenTelemetry Protocol).
For Kubernetes-based EC2 Instances: If you're using EC2 instances as part of a Kubernetes cluster, you can apply the YAML configuration to your cluster. The provided YAML file is used to configure the OpenTelemetry Collector in Kubernetes environments to send data to SigNoz.
An example uses the YAML file in Kubernetes:
- Apply the YAML configuration to your Kubernetes cluster:
```
kubectl apply -f opentelemetry-config.yaml
```
This will set up the OpenTelemetry Collector in the cluster to forward metrics from your EC2 instances (via Kubernetes) to SigNoz.
Set Up Dashboards: In the SigNoz UI, configure custom dashboards for EC2 metrics such as CPU usage, disk I/O, and network traffic. Create alerts for any critical thresholds, ensuring you stay ahead of potential issues.

To learn more, visit the SigNoz Documentation.

Real-Time Trace and Metric Correlation

One of the standout features of SigNoz is its ability to correlate traces and metrics in real-time. It allows you to:

Pinpoint performance bottlenecks: By correlating slow traces with high CPU or memory usage, you can quickly determine the root cause of latency.
Improve troubleshooting efficiency: Real-time correlation between metrics and traces ensures you can detect and resolve issues faster, reducing downtime.
Gain end-to-end visibility: SigNoz connects the dots between application-level events and infrastructure metrics, enabling you to see how specific transactions or traces impact EC2 performance.

Key Takeaways

Monitoring your EC2 instances is important for maintaining a stable and high-performance AWS infrastructure. Proactive monitoring helps to prevent problems before they turn into costly downtime.
Monitor crucial metrics including CPU utilization, memory use, disk I/O, and network performance. These key indicators give useful information about the health and performance of your EC2 instances.
AWS technologies such as CloudWatch, CloudTrail, and GuardDuty are critical for monitoring and controlling EC2 instances. They offer native connectors with AWS services to simplify monitoring.
Consider using third-party solutions like SigNoz to improve observability and long-term data preservation. SigNoz provides deeper insights by connecting metrics, logs, and traces while being a cost-effective alternative to typical APM solutions.
Implement best practices like setting up CloudWatch dashboards, using proper tagging, and integrating alarms. Use AWS Lambda for automated remediation actions to ensure your EC2 environment remains healthy.

FAQs

What is the difference between basic and detailed monitoring in EC2?

Basic monitoring: Provides free metrics at 5-minute intervals, giving a high-level perspective of EC2 performance. It is appropriate for general monitoring purposes.
Detailed Monitoring: At an extra expense, 1-minute metrics are available, allowing for more granular performance tracking. Detailed monitoring is excellent for mission-critical situations in which real-time performance data is required.

How frequently are EC2 measurements gathered in CloudWatch?

Under basic monitoring, CloudWatch automatically gathers EC2 metrics every 5 minutes. If thorough monitoring is set, measurements are gathered every 1 minute, providing more frequent data points for more accurate insights.

Can I monitor custom metrics for my EC2 instances?

Yes. You can use the CloudWatch agent to monitor custom metrics in addition to regular EC2 metrics. This is particularly handy for tracking application-specific performance indicators like request latency and error rates.

How do I create alerts for EC2 instance performance issues?

You can create CloudWatch alarms depending on any parameter, such as CPU utilization or disk space consumption. When a certain threshold is reached, these alarms can send out messages via Amazon SNS or start automatic activities like scaling an instance or restarting a service.

Example:

aws cloudwatch put-metric-alarm --alarm-name "HighCPUUtilization" \\
    --metric-name CPUUtilization --namespace AWS/EC2 \\
    --statistic Average --period 60 --threshold 80 \\
    --comparison-operator GreaterThanOrEqualToThreshold \\
    --evaluation-periods 2 --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic

How do I troubleshoot EC2 instance performance issues?

Troubleshooting EC2 performance starts by examining metrics like CPU utilization, memory, disk I/O, and network performance using CloudWatch. Use EC2 status checks to identify if the problem stems from the instance itself or the underlying AWS infrastructure. Logs in CloudWatch Logs or EC2 instance logs can help diagnose the root cause of the issue. Additionally, AWS Elastic Load Balancer (ELB) logs, VPC flow logs, and CloudTrail logs provide insights into networking and security-related performance issues.

What’s the role of EC2 monitoring in hybrid cloud setups?

In hybrid cloud environments, where part of the infrastructure is on-premises and part is in the cloud, EC2 monitoring helps maintain consistency and performance between the two. Monitoring EC2 instances alongside on-premises resources (using tools like AWS Systems Manager or CloudWatch hybrid monitoring) ensures that applications running across both environments have optimal performance, security, and resource utilization. Monitoring also aids in load balancing and the efficient allocation of resources between on-premises and cloud environments.