Service Monitoring using Terraform

Table of contents

Alarms

Alarms

The choice of alarm type often depends on the nature of the metric being monitored and the pattern of its expected behavior.

AWS Service	Metric Example	Alarm Type	Use Case Description
Lambda	Invocation Count	Static Threshold	Use a static threshold alarm when you want to be alerted if the number of invocations exceeds a certain fixed number, indicating unusual activity or potential errors.
	Error Rate	Anomaly Detection	An anomaly detection alarm is suitable for monitoring Lambda error rates, as it can learn the normal error rate pattern and alert when deviations occur.
RDS	CPU Utilization	Static Threshold	A static threshold alarm works well for monitoring CPU utilization, alerting if usage consistently exceeds a set percentage, which might indicate the need for scaling.
	Read IOPS	Anomaly Detection	For Read IOPS, an anomaly detection alarm can monitor normal read patterns and alert you to unusual spikes or drops, which could indicate performance issues.
API Gateway	4XX and 5XX Error Rates	Anomaly Detection	Anomaly detection is ideal for monitoring error rates as it can alert you to unexpected increases in errors that don’t follow the usual pattern of your API traffic.
	Latency	Static Threshold	Use a static threshold alarm for latency to get alerted if the response time of your API exceeds a certain threshold, indicating performance issues.
		Composite Alarms	Composite alarms can be used to combine multiple metrics, such as error rates and latency, to provide a more comprehensive view of API Gateway’s health.
DynamoDB	Read/Write Throttle Events	Static Threshold	Use a static threshold alarm to monitor if the number of read/write throttle events exceeds a certain limit, indicating potential table scaling issues or hot key problems.
	Consumed Read/Write Capacity	Anomaly Detection	An anomaly detection alarm is suitable for understanding typical read/write capacity patterns and alerting on unusual deviations, which could indicate inefficiency or overuse.
Step Functions (StepFn)	Execution Time	Static Threshold	A static threshold alarm can alert you if the execution time of a state machine exceeds a set threshold, which might indicate inefficient workflows or errors.
	Failed Executions	Anomaly Detection	An anomaly detection alarm can be used to monitor the rate of failed executions, alerting when failures deviate from the norm, which could indicate issues in the workflow.
ECS	CPU and Memory Utilization	Static Threshold	Static threshold alarms are effective for monitoring if CPU or memory utilization goes above or drops below expected levels, indicating potential scaling needs or inefficiencies.
	Task Count	Anomaly Detection	Use anomaly detection to monitor the number of running tasks and alert on unexpected changes, which might indicate issues with task scheduling or service health.
EKS	Node CPU/Memory Utilization	Static Threshold	Static threshold alarms are useful for monitoring CPU and memory utilization of nodes, alerting if these metrics exceed expected thresholds, indicating potential resource issues.
	Pod Count	Anomaly Detection	An anomaly detection alarm can monitor the number of pods and alert on unusual increases or decreases, which could be indicative of scaling issues or cluster health problems.
EC2	CPU Utilization	Static Threshold	A common use case for a static threshold alarm is to monitor EC2 CPU utilization, alerting if it consistently exceeds a set percentage, potentially indicating performance issues.
	Network In/Out	Anomaly Detection	Anomaly detection can alert you to abnormal network traffic patterns, which might indicate security concerns or misconfigurations.
	Status Check Failures	Composite Alarms	Composite alarms can be used to combine multiple EC2 status checks (instance and system) to get a holistic view of instance health.

CloudWatch Metric Alarm

A CloudWatch metric alarm watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. When the metric breaches the alarm’s threshold for a specified number of evaluation periods, the alarm can trigger an action, such as sending a notification to an SNS topic.

High Invocation Errors Lambda Example

resource "aws_cloudwatch_metric_alarm" "alarm_lambda" {
    alarm_name          = "${var.alarm_prefix} - High Invocation Errors - ${var.function_friendly_name}"
    alarm_description   = "High number of Lambda unhandled invocation exceptions"
    comparison_operator = "GreaterThanOrEqualToThreshold" # The arithmetic operation to use when comparing the specified statistic and threshold.
    namespace           = "AWS/Lambda" # The namespace for the alarm's associated metric.
    metric_name         = "Errors" # The name of the metric to monitor.
    statistic           = "Sum" # The statistic to apply to the metric over each period.
    threshold           = 1 # The value against which the specified statistic is compared.
    evaluation_periods  = 1 # The number of periods over which data is compared to the specified threshold.
    period              = 300 # The period, in seconds, over which the statistic is applied.
    treat_missing_data  = "notBreaching" # Treat missing data points as not breaching the threshold.

    dimensions = {
        FunctionName = var.function_name # The dimension for the alarm's associated metric (Lambda function name).
    }

    alarm_actions = [var.alarm_sns_topic] # The list of actions to execute when this alarm transitions into an ALARM state.
}

metric_name - The name of the metric to monitor. See the AWS documentation for the relevant service for the names of its supported metrics.
comparison_operator - One of the following is supported: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold. The values LessThanLowerOrGreaterThanUpperThreshold, LessThanLowerThreshold, and GreaterThanUpperThreshold are used only for alarms based on anomaly detection models.
statistic - The statistic to apply to the alarm’s associated metric. One of the following is supported: SampleCount, Average, Sum, Minimum, Maximum.
treat_missing_data - How the alarm handles missing data points. The following values are supported: missing, ignore, breaching, and notBreaching. Defaults to missing.
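
Metric Math Alarm Sketch

The alarm description above mentions that an alarm can also watch the result of a math expression. The following is a minimal sketch of that idea, not a tested configuration: it alarms on the Lambda error rate (errors as a percentage of invocations) instead of the raw error count. The 5 percent threshold, the resource name, and the query ids are illustrative assumptions; the variables are the same ones the Lambda example above relies on.

```hcl
# Sketch only: alarm on the Lambda error rate rather than the raw error count.
# The 5 percent threshold and the query ids are illustrative assumptions.
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.alarm_prefix} - High Error Rate - ${var.function_friendly_name}"
  alarm_description   = "Lambda error rate is at or above 5 percent"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 5
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching"

  # Only this query returns data, so its result is what the alarm evaluates.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / invocations"
    label       = "Error rate (%)"
    return_data = true
  }

  # Input metric: number of failed invocations per period.
  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Errors"
      stat        = "Sum"
      period      = 300
      dimensions = {
        FunctionName = var.function_name
      }
    }
  }

  # Input metric: total number of invocations per period.
  metric_query {
    id = "invocations"
    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Invocations"
      stat        = "Sum"
      period      = 300
      dimensions = {
        FunctionName = var.function_name
      }
    }
  }

  alarm_actions = [var.alarm_sns_topic]
}
```

When metric_query blocks are used, the top-level namespace, metric_name, statistic, and period arguments are omitted, because each query carries its own metric definition.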

SQS Size Exceeds Example

resource "aws_cloudwatch_metric_alarm" "sqs_size_exceeds_on_action" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum" # Queue depth is a gauge; Sum would add up every sample in the period.
  threshold           = var.threshold
  evaluation_periods  = "1"
  period              = "300"

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}

SQS Old Message Example

ApproximateAgeOfOldestMessage is an Amazon SQS (Simple Queue Service) metric that reports the approximate age, in seconds, of the oldest message in the queue that has not yet been deleted. A steadily rising value suggests that consumers are falling behind, so the metric is useful for checking that messages are consumed and processed in a timely manner.

resource "aws_cloudwatch_metric_alarm" "alarm" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  statistic           = "Maximum" # Age is a gauge in seconds; Sum would add the samples together.
  threshold           = var.threshold
  evaluation_periods  = "1"
  period              = "300"
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}
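
Input Variables Sketch

The two SQS examples above reference several input variables without declaring them. The declarations below are a hedged sketch of what they might look like. The types and descriptions are assumptions inferred from how each variable is used, in particular that var.alarm_sns_topic holds an SNS topic ARN.

```hcl
# Assumed declarations for the variables used by the SQS alarm examples.
variable "alarm_name" {
  description = "Name of the CloudWatch alarm."
  type        = string
}

variable "alarm_description" {
  description = "Human-readable description of what the alarm means."
  type        = string
}

variable "queue_name" {
  description = "Name of the SQS queue to monitor (used as the QueueName dimension)."
  type        = string
}

variable "threshold" {
  description = "Alarm threshold: a message count for the size alarm, or seconds for the age alarm."
  type        = number
}

variable "alarm_sns_topic" {
  description = "ARN of the SNS topic to notify when the alarm enters the ALARM state (assumed to be an ARN)."
  type        = string
}
```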

Filter (Pattern Matching) Alarm

A log metric filter scans log data in CloudWatch Logs and publishes a metric value whenever a log entry matches a given pattern or term. This is useful for monitoring application and system logs, creating alarms, and triggering notifications or other actions based on log data.

For example, you might create a filter to find and count occurrences of a specific error code in your application logs. When a log pattern is matched, you can increment a metric that you’ve defined. This metric will then be visible within the CloudWatch Metrics console and can be used to trigger alarms.

Log Level Errors Pattern Example

The following Terraform resource creates a CloudWatch Log Metric Filter that watches for error messages (logLevel=ERROR) in the specified Lambda function’s log group. When an error log entry is detected, it increments a CloudWatch metric specific to that Lambda function, which supports monitoring and alerting on application errors.

resource "aws_cloudwatch_log_metric_filter" "filter" {
  name           = "${var.function_name}-log-errors"
  pattern        = "[timestamp,logId,logLevel=ERROR,msg]"
  log_group_name = "/aws/lambda/${var.function_name}"

  metric_transformation {
    name          = "${var.function_name}-log-errors"
    namespace     = var.metric_namespace
    value         = "1"
    default_value = "0"
  }
}
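
Alarm on the Filter Metric Sketch

The metric filter above only publishes a metric; it does not notify anyone on its own. A minimal sketch of pairing it with a metric alarm follows. It assumes the filter resource is named "filter" as in the example above and reuses var.alarm_sns_topic from the earlier examples; the resource name, threshold, and period are illustrative choices.

```hcl
# Sketch only: alarm whenever the log metric filter counts at least one ERROR entry.
resource "aws_cloudwatch_metric_alarm" "log_errors" {
  alarm_name          = "${var.function_name}-log-errors"
  alarm_description   = "ERROR-level entries found in the Lambda function logs"
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # Read the namespace and metric name from the filter so the two stay in sync.
  namespace   = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].namespace
  metric_name = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].name

  statistic          = "Sum" # Each matching log entry adds 1, so Sum is the number of matches.
  threshold          = 1
  evaluation_periods = 1
  period             = 300
  treat_missing_data = "notBreaching"

  alarm_actions = [var.alarm_sns_topic]
}
```

Referencing the filter's attributes instead of repeating the strings also gives Terraform an implicit dependency, so the filter is created before the alarm.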

Anomaly Detection Alarm

Anomaly detection examines data points and flags rare occurrences that look suspicious because they differ from the established pattern of behavior.

When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention.

When viewing a graph of metric data, you can overlay the expected values onto the graph as a band, which makes it visually clear which values fall outside the normal range. For more information, see Creating a graph. You can enable anomaly detection using the AWS Management Console, the AWS CLI, AWS CloudFormation, or the AWS SDK, both on metrics vended by AWS and on custom metrics. You can also retrieve the upper and lower values of the model’s band by using the GetMetricData API request with the ANOMALY_DETECTION_BAND metric math function.

resource "aws_cloudwatch_metric_alarm" "xx_anomaly_detection" {
  alarm_name                = "terraform-test-foobar"
  comparison_operator       = "GreaterThanUpperThreshold"
  evaluation_periods        = 2
  threshold_metric_id       = "e1"
  alarm_description         = "This alarm monitors EC2 CPU utilization"
  insufficient_data_actions = []

  metric_query {
    id          = "e1"
    expression  = "ANOMALY_DETECTION_BAND(m1)" # Expected band around m1, used as the alarm threshold.
    label       = "CPUUtilization (Expected)"
    return_data = true
  }

  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "CPUUtilization"
      namespace   = "AWS/EC2"
      period      = 120
      stat        = "Average"
      unit        = "Percent" # CPUUtilization is reported in Percent; a mismatched unit returns no data.

      dimensions = {
        InstanceId = "i-abc123"
      }
    }
  }
}
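
Two-Sided Anomaly Detection Sketch

The example above alarms only when the metric rises above the band. For metrics where both spikes and drops matter, such as the RDS Read IOPS row in the table, a two-sided comparison may fit better. The following is a sketch under assumptions: var.db_instance_identifier is a hypothetical variable, the alarm name is illustrative, and the band width of 3 standard deviations is an arbitrary choice.

```hcl
# Sketch only: alarm when RDS Read IOPS leaves the expected band in either direction.
resource "aws_cloudwatch_metric_alarm" "rds_read_iops_anomaly" {
  alarm_name          = "rds-read-iops-anomaly" # illustrative name
  alarm_description   = "RDS Read IOPS is outside the expected band"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  threshold_metric_id = "band"

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(iops, 3)" # Second argument: band width in standard deviations.
    label       = "ReadIOPS (Expected)"
    return_data = true
  }

  metric_query {
    id          = "iops"
    return_data = true
    metric {
      metric_name = "ReadIOPS"
      namespace   = "AWS/RDS"
      period      = 300
      stat        = "Average"

      dimensions = {
        DBInstanceIdentifier = var.db_instance_identifier # hypothetical variable
      }
    }
  }

  alarm_actions = [var.alarm_sns_topic]
}
```

A wider band produces fewer alarms at the cost of missing smaller deviations, so the second argument is worth tuning per metric.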

Composite Alarm

Composite alarms allow you to combine multiple alarms into a single alarm. This is useful for more complex conditions where you need to evaluate several metrics together.

resource "aws_cloudwatch_metric_alarm" "cpu_utilization_alarm" {
  # Configuration for the CPU Utilization alarm
  # ...
}

resource "aws_cloudwatch_metric_alarm" "disk_write_ops_alarm" {
  # Configuration for the Disk Write Ops alarm
  # ...
}

resource "aws_cloudwatch_composite_alarm" "system_health_alarm" {
  alarm_name        = "SystemHealthAlarm"
  alarm_description = "Alarm for overall system health based on CPU and Disk Write Ops"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.cpu_utilization_alarm.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.disk_write_ops_alarm.alarm_name})"

  actions_enabled           = true
  ok_actions                = [var.ok_action]
  alarm_actions             = [var.alarm_action]
  insufficient_data_actions = [var.insufficient_data_action]
}
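
EC2 Status Checks Composite Sketch

The table at the top suggests combining the EC2 instance and system status checks into one composite alarm. A minimal, self-contained sketch of that follows. It assumes a hypothetical var.instance_id variable and reuses var.alarm_sns_topic from the earlier examples; the alarm names, periods, and thresholds are illustrative. It uses OR instead of AND, so either failing check raises the composite alarm.

```hcl
# Child alarm: instance-level status check (problems inside the instance).
resource "aws_cloudwatch_metric_alarm" "instance_check" {
  alarm_name          = "ec2-instance-status-check-${var.instance_id}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_Instance"
  statistic           = "Maximum"
  threshold           = 1
  evaluation_periods  = 2
  period              = 60

  dimensions = {
    InstanceId = var.instance_id # hypothetical variable
  }
}

# Child alarm: system-level status check (problems in the underlying AWS host).
resource "aws_cloudwatch_metric_alarm" "system_check" {
  alarm_name          = "ec2-system-status-check-${var.instance_id}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  statistic           = "Maximum"
  threshold           = 1
  evaluation_periods  = 2
  period              = 60

  dimensions = {
    InstanceId = var.instance_id
  }
}

# Composite alarm: fires when either child alarm is in the ALARM state.
resource "aws_cloudwatch_composite_alarm" "instance_health" {
  alarm_name        = "ec2-instance-health-${var.instance_id}"
  alarm_description = "Either the instance or the system status check is failing"
  alarm_rule        = "ALARM(\"${aws_cloudwatch_metric_alarm.instance_check.alarm_name}\") OR ALARM(\"${aws_cloudwatch_metric_alarm.system_check.alarm_name}\")"

  alarm_actions = [var.alarm_sns_topic]
}
```

The child alarms deliberately have no actions of their own, so only the composite alarm sends a notification.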