Metric-based alerts

A Metric-based alert in SigNoz allows you to define conditions based on metric data and trigger alerts when these conditions are met. Here's a breakdown of the various sections and options available when configuring a Metric-based alert:

Step 1: Define the Metric

In this step, you use the Metrics Query Builder to choose the metric to monitor. Some of the fields available in the Metrics Query Builder include:

  • Metrics: A field to select the specific metric you want to monitor (e.g., CPU usage, memory utilization). You can also choose an aggregation function like "Count," "Sum," or "Average."

  • WHERE: A filter field to define specific conditions for the metric. You can apply operators such as "IN" and "NOT IN".

  • Legend Format: An optional field to customize the legend's format in the visual representation of the alert.

  • Having: A field to apply conditions that further filter the results based on the aggregate value.

Using Query Builder to define the metric to monitor
Using Query Builder to define the metric to monitor

To learn more about the functionality of the Query Builder, check out the documentation.

Step 2: Define Alert Conditions

In this step, you define the specific conditions that trigger the alert and the notification frequency. The following fields are available:

  • Send a notification when [A] is [above/below] the threshold at least once during the last [X] mins: A condition template to set the threshold for the alert, with options to define when and how often the condition should be checked.

  • Alert Threshold: A field to set the threshold for the alert condition.

  • More Options:

    • Run alert every [X] mins: This option determines the frequency at which the alert condition is checked and notifications are sent.

    • Send a notification if data is missing for [X] mins: A field to specify if a notification should be sent when data is missing for a certain period.
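The evaluation described above ("at least once during the last [X] mins", plus the missing-data case) can be sketched as follows. This is an illustrative stand-in, not SigNoz's actual implementation; the function name and data shapes are assumptions.

```python
# Hypothetical sketch of an "at least once in the last X mins" check;
# not SigNoz's actual evaluation code.

def should_alert(datapoints, threshold, op="above"):
    """datapoints: metric values observed during the evaluation window."""
    if not datapoints:
        return None  # no data: the "data is missing" notification applies
    if op == "above":
        return any(v > threshold for v in datapoints)
    return any(v < threshold for v in datapoints)

# Values sampled over the last 5 minutes, checked once per minute
window = [310.0, 395.5, 412.3, 388.0, 401.9]
print(should_alert(window, threshold=400))  # True: 412.3 and 401.9 exceed 400
```

A condition of "at all times" would swap `any` for `all`; the "Run alert every [X] mins" setting controls how often this check runs.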

Define the alert conditions
Define the alert conditions

Step 3: Alert Configuration

This step focuses on setting alert properties like severity, description, and other metadata. The following fields are available:

  • Severity: Set the severity level for the alert (e.g., "Warning" or "Critical").

  • Alert Name: A field to name the alert for easy identification.

  • Alert Description: A field for adding a detailed description of the alert, explaining what it monitors and under what conditions it is triggered.

  • Labels: A field to add labels or tags to the alert for categorization.

  • Notification channels: A field to choose the notification channels from those configured in the Alert Channel settings.

  • Test Notification: A button to test the alert to ensure that it works as expected.

Configure the alert
Setting the alert metadata

Result labels in alert description

You can incorporate result labels in the alert descriptions to make the alerts more informative:

Syntax: Use {{.Labels.<label-name>}} to insert label values.

Example: If a query returns the label service_name, you can reference it in the alert description as {{.Labels.service_name}}, which makes the alert specific to that particular service.
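The substitution behaves like the following sketch. SigNoz resolves these placeholders with Go templating internally; this Python stand-in only mimics the effect, and the description text is made up.

```python
# Illustrative mimic of {{.Labels.<label-name>}} substitution;
# SigNoz's real renderer uses Go templates, not this code.
import re

def render_description(template: str, labels: dict) -> str:
    # Replace each {{.Labels.<name>}} with the label's value;
    # unknown labels are left untouched.
    return re.sub(
        r"\{\{\.Labels\.(\w+)\}\}",
        lambda m: labels.get(m.group(1), m.group(0)),
        template,
    )

desc = "High error rate on {{.Labels.service_name}}"
print(render_description(desc, {"service_name": "checkout"}))
# High error rate on checkout
```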

Examples

1. Alert when memory usage for host goes above 400 MB (or any fixed memory)

Here's a video tutorial for creating this alert:


Step 1: Write Query Builder query to define alert metric

metrics builder query for memory usage
Memory usage metric builder query

The hostmetricsreceiver creates several host system metrics, including system_memory_usage, which contains the memory usage for each state from /proc/meminfo. The states can be free, used, cached, etc. We want to alert when the total memory usage of a host exceeds the threshold, so the WHERE clause excludes the free state. We calculate the average value for each state and then sum them up by host to get the per-host memory usage.

✅ Info

Remember to set the unit of the y-axis to bytes, as that is the unit of the mentioned metric.
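The aggregation described above (average each per-state series, exclude the free state, sum by host) amounts to the following arithmetic. The sample values are made up for illustration.

```python
# Illustrative sketch of the query's aggregation: average each (host, state)
# series, drop the "free" state, then sum the averages per host.
from collections import defaultdict

samples = {  # (host, state) -> recent values in bytes (made-up data)
    ("host-1", "used"):   [3.0e8, 3.2e8],
    ("host-1", "cached"): [1.0e8, 1.0e8],
    ("host-1", "free"):   [4.0e8, 3.8e8],
}

per_host = defaultdict(float)
for (host, state), values in samples.items():
    if state == "free":          # WHERE clause: exclude the free state
        continue
    per_host[host] += sum(values) / len(values)  # avg per state, summed by host

print(per_host["host-1"])  # 410000000.0 bytes -> above a 400 MB threshold
```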


Step 2: Set alert conditions

metrics builder query for memory usage
Memory usage alert condition

The condition is set to trigger a notification if the per-minute memory usage exceeds the threshold of 400 MB at least once in the last five minutes.

2. Alert when memory usage for host goes above 70%

You might want to alert based on a percentage rather than a fixed threshold. There are two ways to get the percentage: in the convenient case, the source reports the usage percentage directly; otherwise, the source sends only the exact usage in bytes and you need to derive the percentage yourself. This example demonstrates how to derive the percentage from the original bytes metric.

metrics builder query for memory usage
Memory usage percentage query

We use a formula to derive the percentage value from the exact memory usage in bytes. In the example, query A calculates the per-host memory usage, while query B, as shown in the image, doesn't have any WHERE clause filter, thus providing the total memory available. The formula for A/B is interpreted as (memory usage in bytes) / (total memory available in bytes). We set the unit of the y-axis to Percent (0.0 - 1.0) to match the result of the formula.
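The A/B formula reduces to the following arithmetic: A sums the non-free states, B sums all states. The byte values here are made up for illustration.

```python
# Illustrative arithmetic for the A/B formula: query A excludes the "free"
# state, query B has no state filter and so yields total memory.
used_states = {"used": 3.1e8, "cached": 1.0e8}   # query A inputs (made up)
all_states = {**used_states, "free": 0.9e8}      # query B inputs (made up)

a = sum(used_states.values())   # memory usage in bytes
b = sum(all_states.values())    # total memory in bytes
usage_ratio = a / b             # y-axis unit: Percent (0.0 - 1.0)
print(round(usage_ratio, 2))    # 0.82 -> above a 0.7 (70%) threshold
```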

metrics builder query for memory usage
Memory usage percentage condition

The condition is set to trigger a notification if the per-minute memory usage exceeds the threshold of 70% at all times in the last five minutes.

3. Alert when the error percentage for an endpoint exceeds 5%

SigNoz creates a metric signoz_calls_total from the trace data. The default attributes of the metric are service_name, operation, span_kind, status_code, and http_status_code. There is no separate metric for counting errors; instead, the status_code attribute is used to determine if a request counts as an error. This example demonstrates how to calculate the error percentage and alert on it.

metrics builder query for error percentage
Error percentage query

We use a formula to derive the error percentage from the total calls metric. In the example, query A calculates the per-endpoint error rate, while query B, as shown in the image, doesn't have any WHERE clause filter for status_code, thus providing the per-endpoint total request rate. The formula for A/B is interpreted as (error request rate) / (total request rate), which gives the error percentage per endpoint. We set the unit of the y-axis to Percent (0.0 - 1.0) to match the result of the formula.
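The A/B formula again reduces to a ratio of counts: A counts only calls whose status_code marks an error, B counts all calls for the endpoint. The status_code values and request counts below are assumptions for illustration.

```python
# Illustrative arithmetic for A/B: query A filters on an error status_code,
# query B has no status_code filter. Counts and code values are made up.
calls = [  # (status_code, count) for one endpoint over a minute
    ("STATUS_CODE_ERROR", 3),
    ("STATUS_CODE_OK", 45),
    ("STATUS_CODE_UNSET", 2),
]

errors = sum(c for code, c in calls if code == "STATUS_CODE_ERROR")  # query A
total = sum(c for _, c in calls)                                     # query B
error_ratio = errors / total    # y-axis unit: Percent (0.0 - 1.0)
print(round(error_ratio, 2))    # 0.06 -> above a 0.05 (5%) threshold
```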

metrics builder query for error percentage
Error percentage condition

The condition is set to trigger a notification if the per-minute error percentage exceeds the threshold of 5% at all times in the last five minutes.

4. Alert when P95 latency for an endpoint is above 1200 ms

SigNoz creates a metric signoz_latency_bucket from the trace data. The default attributes of the metric are service_name, operation, span_kind, status_code, and http_status_code. This example demonstrates how to calculate the P95 latency for an endpoint and alert on it.

metrics builder query for latency
Endpoint latency query

We use the P95 aggregation, which gives the 95th-percentile request latency per endpoint. We set the unit of the y-axis to milliseconds to match the unit of the metric.
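For intuition, the statistic itself can be sketched as a nearest-rank percentile over raw latency samples. SigNoz derives P95 from the histogram buckets server-side; this sketch, with made-up sample values, only shows what the number means.

```python
# Illustrative nearest-rank P95 over raw latency samples in milliseconds.
# SigNoz computes P95 from signoz_latency_bucket histograms, not like this.
import math

def p95(latencies_ms):
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
    return ordered[rank - 1]

samples = [120, 180, 250, 300, 410, 520, 640, 800, 980, 1350]
print(p95(samples))  # 1350 -> above a 1200 ms threshold
```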

metrics builder query for latency
Endpoint latency condition

The condition is set to trigger a notification if the per-minute P95 latency exceeds the threshold of 1200 ms at least once in the last five minutes.