Service Monitoring using Terraform

Updated: at 05:21 AM

Alarms

The choice of alarm type often depends on the nature of the metric being monitored and the pattern of its expected behavior.

| AWS Service | Metric Example | Alarm Type | Use Case Description |
| --- | --- | --- | --- |
| Lambda | Invocation Count | Static Threshold | Use a static threshold alarm when you want to be alerted if the number of invocations exceeds a certain fixed number, indicating unusual activity or potential errors. |
| Lambda | Error Rate | Anomaly Detection | An anomaly detection alarm is suitable for monitoring Lambda error rates, as it can learn the normal error rate pattern and alert when deviations occur. |
| RDS | CPU Utilization | Static Threshold | A static threshold alarm works well for monitoring CPU utilization, alerting if usage consistently exceeds a set percentage, which might indicate the need for scaling. |
| RDS | Read IOPS | Anomaly Detection | For Read IOPS, an anomaly detection alarm can monitor normal read patterns and alert you to unusual spikes or drops, which could indicate performance issues. |
| API Gateway | 4XX and 5XX Error Rates | Anomaly Detection | Anomaly detection is ideal for monitoring error rates as it can alert you to unexpected increases in errors that don’t follow the usual pattern of your API traffic. |
| API Gateway | Latency | Static Threshold | Use a static threshold alarm for latency to get alerted if the response time of your API exceeds a certain threshold, indicating performance issues. |
| API Gateway | | Composite Alarms | Composite alarms can be used to combine multiple metrics, such as error rates and latency, to provide a more comprehensive view of API Gateway’s health. |
| DynamoDB | Read/Write Throttle Events | Static Threshold | Use a static threshold alarm to monitor if the number of read/write throttle events exceeds a certain limit, indicating potential table scaling issues or hot key problems. |
| DynamoDB | Consumed Read/Write Capacity | Anomaly Detection | An anomaly detection alarm is suitable for understanding typical read/write capacity patterns and alerting on unusual deviations, which could indicate inefficiency or overuse. |
| Step Functions (StepFn) | Execution Time | Static Threshold | A static threshold alarm can alert you if the execution time of a state machine exceeds a set threshold, which might indicate inefficient workflows or errors. |
| Step Functions (StepFn) | Failed Executions | Anomaly Detection | An anomaly detection alarm can be used to monitor the rate of failed executions, alerting when failures deviate from the norm, which could indicate issues in the workflow. |
| ECS | CPU and Memory Utilization | Static Threshold | Static threshold alarms are effective for monitoring if CPU or memory utilization goes above or drops below expected levels, indicating potential scaling needs or inefficiencies. |
| ECS | Task Count | Anomaly Detection | Use anomaly detection to monitor the number of running tasks and alert on unexpected changes, which might indicate issues with task scheduling or service health. |
| EKS | Node CPU/Memory Utilization | Static Threshold | Static threshold alarms are useful for monitoring CPU and memory utilization of nodes, alerting if these metrics exceed expected thresholds, indicating potential resource issues. |
| EKS | Pod Count | Anomaly Detection | An anomaly detection alarm can monitor the number of pods and alert on unusual increases or decreases, which could be indicative of scaling issues or cluster health problems. |
| EC2 | CPU Utilization | Static Threshold | A common use case for a static threshold alarm is to monitor EC2 CPU utilization, alerting if it consistently exceeds a set percentage, potentially indicating performance issues. |
| EC2 | Network In/Out | Anomaly Detection | Anomaly detection can alert you to abnormal network traffic patterns, which might indicate security concerns or misconfigurations. |
| EC2 | Status Check Failures | Composite Alarms | Composite alarms can be used to combine multiple EC2 status checks (instance and system) to get a holistic view of instance health. |

CW Metric Alarm

CloudWatch alarms are used to watch a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. When a metric crosses the threshold set by the alarm for a specified number of evaluation periods, an action can be triggered such as sending a notification to an SNS topic.
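
The "math expression" case mentioned above can be sketched with `metric_query` blocks. The example below is a minimal illustration, not a definitive setup: it alarms on a Lambda error rate computed from the `Errors` and `Invocations` metrics. The 5% threshold, the resource name, and the query IDs are assumptions chosen for illustration; `var.function_name` and `var.alarm_sns_topic` are the same inputs used by the other examples in this post.

```hcl
# Sketch: alarm on a metric math expression instead of a single metric.
# The threshold, resource name, and query IDs are illustrative assumptions.
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate" {
  alarm_name          = "${var.function_name}-error-rate"
  alarm_description   = "Lambda error rate above 5% over 5 minutes"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching" # Periods without data do not trigger the alarm.

  # The expression is the only query that returns data, so the alarm evaluates it.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / invocations"
    label       = "Error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Errors"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = var.function_name }
    }
  }

  metric_query {
    id = "invocations"
    metric {
      namespace   = "AWS/Lambda"
      metric_name = "Invocations"
      period      = 300
      stat        = "Sum"
      dimensions  = { FunctionName = var.function_name }
    }
  }

  alarm_actions = [var.alarm_sns_topic]
}
```

A ratio like this tends to be less noisy than a raw error count on high-traffic functions, while a raw count is often the better signal for functions that run rarely.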

High Invocation Errors Lambda Example

resource "aws_cloudwatch_metric_alarm" "alarm_lambda" {
  alarm_name          = "${var.alarm_prefix} - High Invocation Errors - ${var.function_friendly_name}"
  alarm_description   = "High number of Lambda unhandled invocation exceptions"
  comparison_operator = "GreaterThanOrEqualToThreshold" # The arithmetic operation to use when comparing the specified statistic and threshold.
  namespace           = "AWS/Lambda"                    # The namespace for the alarm's associated metric.
  metric_name         = "Errors"                        # The name of the metric to monitor.
  statistic           = "Sum"                           # Total number of errors within each period.
  threshold           = 1                               # The value against which the specified statistic is compared.
  evaluation_periods  = 1                               # The number of periods over which data is compared to the specified threshold.
  period              = 300                             # The period, in seconds, over which the statistic is applied.
  treat_missing_data  = "notBreaching"                  # Missing data (for example, no invocations) is treated as within the threshold.

  dimensions = {
    FunctionName = var.function_name # The dimension for the alarm's associated metric (Lambda function name).
  }

  alarm_actions = [var.alarm_sns_topic] # The list of actions to execute when this alarm transitions into an ALARM state.
}
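
The alarm above references input variables that this post does not define. A minimal sketch of how they might be declared, assuming the alarm lives in a reusable module; the descriptions are assumptions about what each variable is meant to hold:

```hcl
# Hypothetical declarations for the variables used by the Lambda alarm example.
variable "alarm_prefix" {
  type        = string
  description = "Prefix for alarm names, such as an environment or team name (assumed meaning)."
}

variable "function_friendly_name" {
  type        = string
  description = "Human-readable function name shown in the alarm name (assumed meaning)."
}

variable "function_name" {
  type        = string
  description = "Lambda function name, used as the FunctionName dimension."
}

variable "alarm_sns_topic" {
  type        = string
  description = "ARN of the SNS topic that receives alarm notifications."
}
```

`alarm_actions` expects ARNs, so `alarm_sns_topic` should hold the topic's ARN rather than its name.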

SQS Size Exceeds Example

resource "aws_cloudwatch_metric_alarm" "sqs_size_exceeds_on_action" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum" # Queue depth is a gauge; summing the samples in a period would overstate it.
  threshold           = var.threshold
  evaluation_periods  = 1
  period              = 300

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}

SQS Old Message Example

ApproximateAgeOfOldestMessage is an Amazon SQS (Simple Queue Service) metric that reports the approximate age, in seconds, of the oldest message in the queue that has not yet been deleted. A rising value means consumers are falling behind, which makes it a useful signal for checking that messages are being consumed and processed in a timely manner.

resource "aws_cloudwatch_metric_alarm" "alarm" {
  alarm_name          = var.alarm_name
  alarm_description   = var.alarm_description
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  statistic           = "Maximum" # Age is a gauge in seconds; summing samples would not be a meaningful age.
  threshold           = var.threshold
  evaluation_periods  = 1
  period              = 300
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = var.queue_name
  }

  alarm_actions = [var.alarm_sns_topic]
}

Filter (Pattern Matching) Alarm

A log metric filter scans log data in CloudWatch Logs and publishes a metric data point whenever a log entry matches a given pattern or term. This is useful for monitoring application and system logs, creating alarms, and triggering notifications or other actions based on log data.

For example, you might create a filter to find and count occurrences of a specific error code in your application logs. When a log pattern is matched, you can increment a metric that you’ve defined. This metric will then be visible within the CloudWatch Metrics console and can be used to trigger alarms.

Log Level Errors Pattern Example

The following Terraform resource creates a CloudWatch Log Metric Filter that watches for error messages (logLevel=ERROR) in the specified Lambda function’s log group. Each matching log entry increments a CloudWatch metric specific to that Lambda function, which can then be used for monitoring and alerting on application errors.

resource "aws_cloudwatch_log_metric_filter" "filter" {
  name           = "${var.function_name}-log-errors"
  pattern        = "[timestamp,logId,logLevel=ERROR,msg]"
  log_group_name = "/aws/lambda/${var.function_name}"

  metric_transformation {
    name          = "${var.function_name}-log-errors"
    namespace     = var.metric_namespace
    value         = "1"
    default_value = "0"
  }
}
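
A metric filter only publishes data; it does not notify anyone on its own. The sketch below shows one way to attach an alarm to the metric produced by the filter above. The resource name `log_errors` and the threshold of one error per five minutes are assumptions for illustration; `var.alarm_sns_topic` is the same input used by the earlier alarm examples.

```hcl
# Sketch: alarm on the custom metric emitted by the log metric filter above.
resource "aws_cloudwatch_metric_alarm" "log_errors" {
  alarm_name          = "${var.function_name}-log-errors"
  alarm_description   = "ERROR-level log entries detected for ${var.function_name}"
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # Read the namespace and metric name from the filter so the two cannot drift apart.
  namespace   = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].namespace
  metric_name = aws_cloudwatch_log_metric_filter.filter.metric_transformation[0].name

  statistic          = "Sum" # Each matching log entry contributes 1.
  threshold          = 1     # Illustrative: alarm on any error in the period.
  evaluation_periods = 1
  period             = 300
  treat_missing_data = "notBreaching"

  alarm_actions = [var.alarm_sns_topic]
}
```

Referencing the filter's attributes also gives Terraform an implicit dependency, so the filter is created before the alarm.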

Anomaly Detection Alarm

Anomaly detection examines data points and flags rare occurrences that look suspicious because they differ from the established pattern of behavior.

When you enable anomaly detection for a metric, CloudWatch applies statistical and machine learning algorithms. These algorithms continuously analyze metrics of systems and applications, determine normal baselines, and surface anomalies with minimal user intervention.

When viewing a graph of metric data, you can overlay the expected values onto the graph as a band, which makes it visually clear which values fall outside the normal range. For more information, see Creating a graph.

You can enable anomaly detection using the AWS Management Console, the AWS CLI, AWS CloudFormation, or the AWS SDK, both on metrics vended by AWS and on custom metrics. You can also retrieve the upper and lower values of the model’s band by using the GetMetricData API request with the ANOMALY_DETECTION_BAND metric math function.

resource "aws_cloudwatch_metric_alarm" "xx_anomaly_detection" {
  alarm_name                = "terraform-test-foobar"
  comparison_operator       = "GreaterThanUpperThreshold"
  evaluation_periods        = 2
  threshold_metric_id       = "e1"
  alarm_description         = "Alarms when EC2 CPU utilization rises above the expected band"
  insufficient_data_actions = []

  metric_query {
    id          = "e1"
    expression  = "ANOMALY_DETECTION_BAND(m1)"
    label       = "CPUUtilization (Expected)"
    return_data = "true"
  }

  metric_query {
    id          = "m1"
    return_data = "true"
    metric {
      metric_name = "CPUUtilization"
      namespace   = "AWS/EC2"
      period      = 120
      stat        = "Average"
      unit        = "Percent" # CPUUtilization is reported in Percent; a mismatched unit matches no data points.

      dimensions = {
        InstanceId = "i-abc123"
      }
    }
  }
}
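
The example above only alarms when the metric rises above the band. For metrics where drops matter as much as spikes, such as the RDS Read IOPS case in the table, a two-sided comparison can be used. The sketch below is an assumption-laden illustration: `var.db_instance_id` is a hypothetical input, and the band width of 3 standard deviations and the evaluation settings are arbitrary starting points rather than recommendations.

```hcl
# Sketch: two-sided anomaly detection alarm on RDS Read IOPS.
# var.db_instance_id is a hypothetical variable, not defined elsewhere in this post.
resource "aws_cloudwatch_metric_alarm" "rds_read_iops_anomaly" {
  alarm_name          = "rds-read-iops-anomaly"
  alarm_description   = "Read IOPS outside the expected band"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold" # Alarm on spikes and on drops.
  evaluation_periods  = 3
  threshold_metric_id = "band"

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(m1, 3)" # Second argument: band width in standard deviations.
    label       = "ReadIOPS (Expected)"
    return_data = true
  }

  metric_query {
    id          = "m1"
    return_data = true
    metric {
      metric_name = "ReadIOPS"
      namespace   = "AWS/RDS"
      period      = 300
      stat        = "Average"
      dimensions  = { DBInstanceIdentifier = var.db_instance_id }
    }
  }

  alarm_actions = [var.alarm_sns_topic]
}
```

A wider band (a larger second argument) produces fewer alarms at the cost of missing smaller deviations.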

Composite Alarm

Composite alarms allow you to combine multiple alarms into a single alarm. This is useful for more complex conditions where you need to evaluate several metrics together.

resource "aws_cloudwatch_metric_alarm" "cpu_utilization_alarm" {
  # Configuration for the CPU Utilization alarm
  # ...
}

resource "aws_cloudwatch_metric_alarm" "disk_write_ops_alarm" {
  # Configuration for the Disk Write Ops alarm
  # ...
}

resource "aws_cloudwatch_composite_alarm" "system_health_alarm" {
  alarm_name          = "SystemHealthAlarm"
  alarm_description   = "Alarm for overall system health based on CPU and Disk Write Ops"
  alarm_rule          = "ALARM(${aws_cloudwatch_metric_alarm.cpu_utilization_alarm.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.disk_write_ops_alarm.alarm_name})"

  actions_enabled     = true
  ok_actions          = [var.ok_action]
  alarm_actions       = [var.alarm_action]
  insufficient_data_actions = [var.insufficient_data_action]
}
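
As a fuller illustration of the API Gateway case from the table, the sketch below combines an error alarm and a latency alarm with `OR`, so the composite fires when either condition is breached. All thresholds, the resource names, and `var.api_name` are assumptions for illustration; the `ApiName` dimension applies to REST APIs.

```hcl
# Sketch: composite alarm over two API Gateway child alarms.
# var.api_name is a hypothetical variable; thresholds are illustrative only.
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "${var.api_name}-5xx-errors"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  statistic           = "Sum"
  threshold           = 5
  evaluation_periods  = 1
  period              = 300
  treat_missing_data  = "notBreaching"
  dimensions          = { ApiName = var.api_name }
  # No alarm_actions here: only the composite alarm notifies.
}

resource "aws_cloudwatch_metric_alarm" "api_latency" {
  alarm_name          = "${var.api_name}-latency-p95"
  comparison_operator = "GreaterThanThreshold"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  extended_statistic  = "p95" # Percentile statistics use extended_statistic instead of statistic.
  threshold           = 2000  # Milliseconds.
  evaluation_periods  = 3
  period              = 60
  treat_missing_data  = "notBreaching"
  dimensions          = { ApiName = var.api_name }
}

resource "aws_cloudwatch_composite_alarm" "api_health" {
  alarm_name        = "${var.api_name}-health"
  alarm_description = "API Gateway is returning 5XX errors or responding slowly"
  alarm_rule        = "ALARM(\"${aws_cloudwatch_metric_alarm.api_5xx.alarm_name}\") OR ALARM(\"${aws_cloudwatch_metric_alarm.api_latency.alarm_name}\")"

  alarm_actions = [var.alarm_sns_topic]
}
```

Leaving actions off the child alarms and notifying only from the composite keeps a single incident from producing several notifications.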