Alarms
The choice of alarm type often depends on the nature of the metric being monitored and the pattern of its expected behavior.
AWS Service | Metric Example | Alarm Type | Use Case Description |
---|---|---|---|
Lambda | Invocation Count | Static Threshold | Use a static threshold alarm when you want to be alerted if the number of invocations exceeds a certain fixed number, indicating unusual activity or potential errors. |
Lambda | Error Rate | Anomaly Detection | An anomaly detection alarm is suitable for monitoring Lambda error rates, as it can learn the normal error rate pattern and alert when deviations occur. |
RDS | CPU Utilization | Static Threshold | A static threshold alarm works well for monitoring CPU utilization, alerting if usage consistently exceeds a set percentage, which might indicate the need for scaling. |
RDS | Read IOPS | Anomaly Detection | For Read IOPS, an anomaly detection alarm can monitor normal read patterns and alert you to unusual spikes or drops, which could indicate performance issues. |
API Gateway | 4XX and 5XX Error Rates | Anomaly Detection | Anomaly detection is ideal for monitoring error rates as it can alert you to unexpected increases in errors that don’t follow the usual pattern of your API traffic. |
API Gateway | Latency | Static Threshold | Use a static threshold alarm for latency to get alerted if the response time of your API exceeds a certain threshold, indicating performance issues. |
API Gateway | Multiple metrics (e.g., error rates and latency) | Composite Alarms | Composite alarms can be used to combine multiple metrics, such as error rates and latency, to provide a more comprehensive view of API Gateway’s health. |
DynamoDB | Read/Write Throttle Events | Static Threshold | Use a static threshold alarm to monitor if the number of read/write throttle events exceeds a certain limit, indicating potential table scaling issues or hot key problems. |
DynamoDB | Consumed Read/Write Capacity | Anomaly Detection | An anomaly detection alarm is suitable for understanding typical read/write capacity patterns and alerting on unusual deviations, which could indicate inefficiency or overuse. |
Step Functions (StepFn) | Execution Time | Static Threshold | A static threshold alarm can alert you if the execution time of a state machine exceeds a set threshold, which might indicate inefficient workflows or errors. |
Step Functions (StepFn) | Failed Executions | Anomaly Detection | An anomaly detection alarm can be used to monitor the rate of failed executions, alerting when failures deviate from the norm, which could indicate issues in the workflow. |
ECS | CPU and Memory Utilization | Static Threshold | Static threshold alarms are effective for monitoring if CPU or memory utilization goes above or drops below expected levels, indicating potential scaling needs or inefficiencies. |
ECS | Task Count | Anomaly Detection | Use anomaly detection to monitor the number of running tasks and alert on unexpected changes, which might indicate issues with task scheduling or service health. |
EKS | Node CPU/Memory Utilization | Static Threshold | Static threshold alarms are useful for monitoring CPU and memory utilization of nodes, alerting if these metrics exceed expected thresholds, indicating potential resource issues. |
EKS | Pod Count | Anomaly Detection | An anomaly detection alarm can monitor the number of pods and alert on unusual increases or decreases, which could be indicative of scaling issues or cluster health problems. |
EC2 | CPU Utilization | Static Threshold | A common use case for a static threshold alarm is to monitor EC2 CPU utilization, alerting if it consistently exceeds a set percentage, potentially indicating performance issues. |
EC2 | Network In/Out | Anomaly Detection | Anomaly detection can alert you to abnormal network traffic patterns, which might indicate security concerns or misconfigurations. |
EC2 | Status Check Failures | Composite Alarms | Composite alarms can be used to combine multiple EC2 status checks (instance and system) to get a holistic view of instance health. |
CW Metric Alarm
A CloudWatch metric alarm watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. When the metric breaches the alarm's threshold for a specified number of evaluation periods, the alarm can trigger an action, such as sending a notification to an SNS topic.
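The math-expression case can be sketched as follows: instead of a single metric, the alarm evaluates a Lambda error rate derived from two metrics. This is a minimal sketch, not a definitive setup. The resource name `lambda_error_rate`, the alarm name, the 5 percent threshold, and the variables `var.function_name` and `var.alarm_sns_topic` are assumptions made for illustration.

```hcl
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.function_name}-error-rate" # hypothetical name
  alarm_description   = "Lambda error rate is at or above 5% of invocations"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5 # percent; illustrative value
  evaluation_periods  = 2 # must breach in two consecutive periods
  treat_missing_data  = "notBreaching"

  # Only the expression returns data, so it is what the alarm evaluates.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / invocations"
    label       = "Error rate (%)"
    return_data = true
  }

  # Input metric: number of failed invocations per period.
  metric_query {
    id = "errors"

    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Errors"
      period      = 300
      stat        = "Sum"

      dimensions = {
        FunctionName = var.function_name # assumed variable
      }
    }
  }

  # Input metric: total number of invocations per period.
  metric_query {
    id = "invocations"

    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Invocations"
      period      = 300
      stat        = "Sum"

      dimensions = {
        FunctionName = var.function_name # assumed variable
      }
    }
  }

  alarm_actions = [var.alarm_sns_topic] # assumed variable: SNS topic ARN
}
```

A rate-based alarm like this scales with traffic, whereas a fixed count of errors may be too noisy for a busy function and too quiet for an idle one.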
High Invocation Errors Lambda Example
resource "aws_cloudwatch_metric_alarm" "alarm_lambda" {
  alarm_name          = "${var.alarm_prefix} - High Invocation Errors - ${var.function_friendly_name}"
  alarm_description   = "High number of Lambda unhandled invocation exceptions"
  comparison_operator = "GreaterThanOrEqualToThreshold" # The arithmetic operation to use when comparing the specified statistic and threshold.
  namespace           = "AWS/Lambda"                    # The namespace for the alarm's associated metric.
  metric_name         = "Errors"                        # The name of the metric to monitor.
  statistic           = "Sum"
  threshold           = 1   # The value against which the specified statistic is compared.
  evaluation_periods  = 1   # The number of periods over which data is compared to the specified threshold.
  period              = 300 # The period, in seconds, over which the statistic is applied.
  treat_missing_data  = "notBreaching"

  dimensions = {
    FunctionName = var.function_name # The dimension for the alarm's associated metric (Lambda function name).
  }

  alarm_actions = [var.alarm_sns_topic] # The list of actions to execute when this alarm transitions into an ALARM state.
}
- metric_name - The name of the metric to monitor. See the AWS documentation for the specific service for its supported metric names.
- comparison_operator - One of the following is supported: `GreaterThanOrEqualToThreshold`, `GreaterThanThreshold`, `LessThanThreshold`, `LessThanOrEqualToThreshold`. Additionally, the values `LessThanLowerOrGreaterThanUpperThreshold`, `LessThanLowerThreshold`, and `GreaterThanUpperThreshold` are used only for alarms based on anomaly detection models.
- statistic - The statistic to apply to the alarm’s associated metric. One of the following is supported: `SampleCount`, `Average`, `Sum`, `Minimum`, `Maximum`.
- treat_missing_data - Sets how this alarm handles missing data points. The following values are supported: `missing`, `ignore`, `breaching`, and `notBreaching`. Defaults to `missing`.
SQS Size Exceeds Example
resource "aws_cloudwatch_metric_alarm" "sqs_size_exceeds_on_action" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  # Queue depth is a gauge: Sum over a 5-minute period would add up the
  # per-minute samples and overstate the queue size, so use Maximum.
  statistic          = "Maximum"
  threshold          = var.threshold
  evaluation_periods = 1
  period             = 300

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}
SQS Old Message Example
`ApproximateAgeOfOldestMessage` is a metric in Amazon SQS (Simple Queue Service) that represents the approximate age (in seconds) of the oldest message in the queue that has not yet been processed. This metric is useful for monitoring the queue’s health and ensuring that messages are being consumed and processed in a timely manner.
resource "aws_cloudwatch_metric_alarm" "alarm" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  # Message age is a gauge in seconds: summing samples is meaningless,
  # so alarm on the oldest age observed within the period.
  statistic          = "Maximum"
  threshold          = var.threshold # Age in seconds.
  evaluation_periods = 1
  period             = 300
  treat_missing_data = "notBreaching"

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}
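Both SQS examples rely on input variables that are not shown. A minimal sketch of how they might be declared follows. The types and descriptions are assumptions inferred from how each variable is used above, not definitions taken from the original module.

```hcl
variable "alarm_name" {
  type        = string
  description = "Name of the CloudWatch alarm."
}

variable "alarm_description" {
  type        = string
  description = "Human-readable description shown with the alarm."
}

variable "queue_name" {
  type        = string
  description = "Name of the SQS queue (the QueueName dimension, not the URL or ARN)."
}

variable "threshold" {
  type        = number
  description = "Value the statistic is compared against (message count or age in seconds)."
}

variable "alarm_sns_topic" {
  type        = string
  description = "ARN of the SNS topic notified when the alarm enters the ALARM state."
}
```

Declaring `threshold` as a `number` lets Terraform reject non-numeric input at plan time rather than failing when the alarm is created.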
Filter (Pattern Matching) Alarm
A log metric filter scans log data in CloudWatch Logs and publishes a metric data point whenever a log entry matches a given pattern or term. This is useful for monitoring application and system logs, creating alarms, and triggering notifications or other actions based on log data.
For example, you might create a filter to find and count occurrences of a specific error code in your application logs. When a log pattern is matched, you can increment a metric that you’ve defined. This metric will then be visible within the CloudWatch Metrics console and can be used to trigger alarms.
Log Level Errors Pattern Example
The following Terraform resource creates a CloudWatch Log Metric Filter to monitor for error messages (`logLevel=ERROR`) in the specified Lambda function’s log group. When an error log is detected, it increments a CloudWatch metric specific to that Lambda function, helping in monitoring and alerting for application errors.
resource "aws_cloudwatch_log_metric_filter" "filter" {
  name           = "${var.function_name}-log-errors"
  pattern        = "[timestamp,logId,logLevel=ERROR,msg]"
  log_group_name = "/aws/lambda/${var.function_name}"

  metric_transformation {
    name          = "${var.function_name}-log-errors"
    namespace     = var.metric_namespace
    value         = "1"
    default_value = "0"
  }
}
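The metric produced by the filter does nothing on its own until an alarm watches it. The sketch below shows one way to attach a metric alarm to it. The resource name `log_errors`, the threshold of one error per five minutes, and `var.alarm_sns_topic` are assumptions for illustration.

```hcl
resource "aws_cloudwatch_metric_alarm" "log_errors" {
  alarm_name          = "${var.function_name}-log-errors-alarm" # hypothetical name
  alarm_description   = "ERROR-level log lines detected for ${var.function_name}"
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # Read the namespace and metric name back from the filter so the alarm
  # always points at the metric the filter actually publishes.
  namespace   = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].namespace
  metric_name = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].name

  statistic          = "Sum" # each matching log line contributes 1
  threshold          = 1     # illustrative: alarm on the first error in a period
  evaluation_periods = 1
  period             = 300
  treat_missing_data = "notBreaching"

  alarm_actions = [var.alarm_sns_topic] # assumed variable: SNS topic ARN
}
```

Referencing the filter's attributes instead of repeating the strings also gives Terraform an implicit dependency, so the filter is created before the alarm.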
Anomaly Detection Alarm
Anomaly detection examines data points and flags rare occurrences that look suspicious because they differ from the established pattern of behavior.
When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention.
When viewing a graph of metric data, you can overlay the expected values onto the graph as a band. This makes it visually clear which values in the graph are out of the normal range. For more information, see Creating a graph.
You can enable anomaly detection using the AWS Management Console, the AWS CLI, AWS CloudFormation, or the AWS SDK. You can enable anomaly detection on metrics vended by AWS and also on custom metrics.
You can also retrieve the upper and lower values of the model’s band by using the `GetMetricData` API request with the `ANOMALY_DETECTION_BAND` metric math function.
resource "aws_cloudwatch_metric_alarm" "xx_anomaly_detection" {
  alarm_name                = "terraform-test-foobar"
  comparison_operator       = "GreaterThanUpperThreshold"
  evaluation_periods        = 2
  threshold_metric_id       = "e1" # The band expression acts as the (dynamic) threshold.
  alarm_description         = "This metric monitors ec2 cpu utilization"
  insufficient_data_actions = []

  # Expected range for m1, as computed by the anomaly detection model.
  metric_query {
    id          = "e1"
    expression  = "ANOMALY_DETECTION_BAND(m1)"
    label       = "CPUUtilization (Expected)"
    return_data = true
  }

  # The actual metric being compared against the band.
  metric_query {
    id          = "m1"
    return_data = true

    metric {
      metric_name = "CPUUtilization"
      namespace   = "AWS/EC2"
      period      = 120
      stat        = "Average"
      unit        = "Percent" # CPUUtilization is reported in Percent; a mismatched unit matches no data points.

      dimensions = {
        InstanceId = "i-abc123"
      }
    }
  }
}
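The example above only fires when the metric rises above the band. For metrics where both spikes and drops matter, such as the RDS Read IOPS case in the table, a two-sided variant might look like the sketch below. The resource name, the alarm name, and `var.db_instance_identifier` are hypothetical. The second argument to `ANOMALY_DETECTION_BAND` sets the band width in standard deviations, and the value 2 is only a starting point to tune.

```hcl
resource "aws_cloudwatch_metric_alarm" "rds_read_iops_anomaly" {
  alarm_name          = "rds-read-iops-anomaly" # hypothetical name
  alarm_description   = "Read IOPS is outside the expected band"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold" # alarm on spikes and drops
  evaluation_periods  = 3
  threshold_metric_id = "band"

  # Expected range for the metric; a larger second argument widens the band
  # and makes the alarm less sensitive.
  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(iops, 2)"
    label       = "ReadIOPS (Expected)"
    return_data = true
  }

  metric_query {
    id          = "iops"
    return_data = true

    metric {
      namespace   = "AWS/RDS"
      metric_name = "ReadIOPS"
      period      = 300
      stat        = "Average"

      dimensions = {
        DBInstanceIdentifier = var.db_instance_identifier # hypothetical variable
      }
    }
  }
}
```

The other anomaly operators, `LessThanLowerThreshold` and `GreaterThanUpperThreshold`, cover the one-sided cases with the same structure.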
Composite Alarm
Composite alarms allow you to combine multiple alarms into a single alarm. This is useful for more complex conditions where you need to evaluate several metrics together.
resource "aws_cloudwatch_metric_alarm" "cpu_utilization_alarm" {
  # Configuration for the CPU Utilization alarm
  # ...
}

resource "aws_cloudwatch_metric_alarm" "disk_write_ops_alarm" {
  # Configuration for the Disk Write Ops alarm
  # ...
}

resource "aws_cloudwatch_composite_alarm" "system_health_alarm" {
  alarm_name        = "SystemHealthAlarm"
  alarm_description = "Alarm for overall system health based on CPU and Disk Write Ops"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.cpu_utilization_alarm.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.disk_write_ops_alarm.alarm_name})"

  actions_enabled           = true
  ok_actions                = [var.ok_action]
  alarm_actions             = [var.alarm_action]
  insufficient_data_actions = [var.insufficient_data_action]
}
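The EC2 status check case from the table can be sketched the same way, this time with both child alarms filled in and combined with `OR` so that either failing check raises the composite alarm. The resource names, alarm names, and `var.instance_id` are assumptions for illustration. `var.alarm_action` is reused from the example above.

```hcl
# Child alarm: instance-level status check (OS or software problems).
resource "aws_cloudwatch_metric_alarm" "instance_check" {
  alarm_name          = "ec2-instance-status-check" # hypothetical name
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_Instance"
  statistic           = "Maximum" # metric is 0 (passed) or 1 (failed)
  threshold           = 1
  evaluation_periods  = 2
  period              = 60

  dimensions = {
    InstanceId = var.instance_id # hypothetical variable
  }
}

# Child alarm: system-level status check (underlying host problems).
resource "aws_cloudwatch_metric_alarm" "system_check" {
  alarm_name          = "ec2-system-status-check" # hypothetical name
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  statistic           = "Maximum"
  threshold           = 1
  evaluation_periods  = 2
  period              = 60

  dimensions = {
    InstanceId = var.instance_id # hypothetical variable
  }
}

# Composite alarm: notify once if either check is failing.
resource "aws_cloudwatch_composite_alarm" "instance_health" {
  alarm_name        = "ec2-instance-health" # hypothetical name
  alarm_description = "Instance or system status check is failing"
  alarm_rule        = "ALARM(\"${aws_cloudwatch_metric_alarm.instance_check.alarm_name}\") OR ALARM(\"${aws_cloudwatch_metric_alarm.system_check.alarm_name}\")"

  alarm_actions = [var.alarm_action]
}
```

Leaving `alarm_actions` off the child alarms and attaching them only to the composite alarm is a design choice that keeps notifications to one per incident.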