ClickHouse queries for building dashboards and alerts
SigNoz gives you the ability to write powerful ClickHouse queries to plot charts. You can run SQL queries supported by ClickHouse to extract immense value out of your distributed tracing data or logs data.
The distributed tables in ClickHouse have been named by prefixing distributed_
to existing single shard table names. If you want to use ClickHouse queries in dashboard or alerts, you should use the distributed table names. Eg, signoz_index_v2
now corresponds to the table of a single shard. To query all the shards, query against distributed_signoz_index_v2
.
- The tagMap column of type Map(String, String) in signoz_traces.distributed_signoz_index_v2 has been divided into stringTagMap of type Map(String, String), numberTagMap of type Map(String, Float64) and boolTagMap of type Map(String, bool).
- All attributes with string values will be present in stringTagMap, attributes with number values will be present in numberTagMap and attributes with boolean values will be present in boolTagMap.
- The gRPCMethod column has been replaced by rpcMethod column.
- The gRPCCode and httpCode column has been replaced by responseStatusCode column.
The gRPCMethod, gRPCCode, httpCode and tagMap column will be deprecated in future releases so it recommended to replace dashboard queries with new columns.
Sharing a few examples of some queries which might be helpful in building dashboards.
GroupBy a tag/attribute in distributed tracing data
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval,
stringTagMap['peer.service'] AS op_name,
toFloat64(avg(durationNano)) AS value
FROM signoz_traces.distributed_signoz_index_v2
WHERE stringTagMap['peer.service']!=''
AND timestamp > now() - INTERVAL 30 MINUTE
GROUP BY (op_name, interval) order by (op_name, interval) ASC;
customer_id
which is present as attribute of a span event
Show count of each WITH arrayFilter(x -> JSONExtractString(x, 'name')='Getting customer', events) AS filteredEvents
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval, toFloat64(count()) AS count,
arrayJoin(arrayMap(x -> JSONExtractString(JSONExtractString(x, 'attributeMap'), 'customer_id'), filteredEvents)) AS resultArray
FROM signoz_traces.distributed_signoz_index_v2
WHERE not empty(filteredEvents)
AND timestamp > toUnixTimestamp(now() - INTERVAL 30 MINUTE)
GROUP BY (resultArray, interval) order by (resultArray, interval) ASC;
Avg latency between 2 spans of interest (part of the trace tree)
SELECT
interval,
round(avg(time_diff), 2) AS result
FROM
(
SELECT
interval,
traceID,
if(startTime1 != 0, if(startTime2 != 0, (toUnixTimestamp64Nano(startTime2) - toUnixTimestamp64Nano(startTime1)) / 1000000, nan), nan) AS time_diff
FROM
(
SELECT
toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
traceID,
minIf(timestamp, if(serviceName='driver', if(name = '/driver.DriverService/FindNearest', if((stringTagMap['component']) = 'gRPC', true, false), false), false)) AS startTime1,
minIf(timestamp, if(serviceName='route', if(name = 'HTTP GET /route', true, false), false)) AS startTime2
FROM signoz_traces.distributed_signoz_index_v2
WHERE (timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}) AND (serviceName IN ('driver', 'route'))
GROUP BY (interval, traceID)
ORDER BY (interval, traceID) ASC
)
)
WHERE isNaN(time_diff) = 0
GROUP BY interval
ORDER BY interval ASC;
customer_id
which is present as attribute of a span event
Show sum of values of WITH arrayFilter(x -> JSONExtractString(x, 'name')='Getting customer', events) AS filteredEvents
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval, toFloat64(sum(toInt32(resultArray))) AS sum,
arrayJoin(arrayMap(x -> JSONExtractString(JSONExtractString(x, 'attributeMap'), 'customer_id'), filteredEvents)) AS resultArray
FROM signoz_traces.distributed_signoz_index_v2
WHERE not empty(filteredEvents) and timestamp > toUnixTimestamp(now() - INTERVAL 30 MINUTE)
GROUP BY (resultArray, interval) order by (resultArray, interval) ASC;
100ms
interval
Plotting a chart on Plot a chart of 1 minute showing count of spans in 100ms
interval of service frontend
with duration > 50ms
SELECT fromUnixTimestamp64Milli(intDiv( toUnixTimestamp64Milli ( timestamp ), 100) * 100) AS interval,
toFloat64(count()) AS count
FROM (
SELECT timestamp
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName='frontend'
AND durationNano>=50*exp10(6)
AND timestamp > now() - INTERVAL 1 MINUTE) GROUP BY interval
ORDER BY interval ASC;
Show count of loglines per minute
SELECT toStartOfInterval(fromUnixTimestamp64Nano(timestamp), INTERVAL 1 MINUTE) AS interval,
toFloat64(count()) AS value
FROM signoz_logs.distributed_logs_v2
WHERE timestamp > toUnixTimestamp64Nano(now64() - INTERVAL 30 MINUTE)
GROUP BY interval
ORDER BY interval ASC;
While writing queries for logs table, if you want to use an attribute/resource in your query you will have to reference it in the following format <type>_<dataType>_value[indexOf(<type>_<dataType>_key, <keyname>)]
where type
can be attributes/resources
, dataType
can be int64/float64/string
and keyname
is the name of the key.
Eg: If your keyname
is status
of dataType
string
and type
attribute
, it needs to be referenced as attributes_string_value[indexOf(attributes_string_key, 'status')]
In the above example, if status
is an selected field, then it can be referenced as attribute_string_status
Building alert queries with ClickHouse data
The most common use case for alerts is to send notifications when a threshold condition is met. For example, conditions like CPU usage exceeds 70% (CPU Usage > 70) or failures in a service exceed your expectations (service error count > 10).
Every alert is composed of a threshold, date range (e.g. last 5 mins) and match condition (e.g. at least once). The rule engine is responsible for evaluating these conditions and triggering notifications. SigNoz allows you a complete flexibility in choosing all three conditions to meet your business requirements.
The evaluation is also based on the source or metric data, the left-hand side of the threshold condition - consider CPU Usage (LHS) > 70 (RHS).
The query editor (ClickHouse
tab in the alert form
) allows you to define a query to fetch the metric (source) data from ClickHouse and evaluate if the threshold condition occurs in it. The query format requires use of following reserved columns and parameters which simplify the syntax and support dynamic parts.
Reserved Aliases (for use in SELECT, GROUP BY only): value
, res
, result
, interval
Reserved Parameters (for use in WHERE condition only):
{{.start_datetime}}
{{.end_datetime}}
{{.start_timestamp}}
{{.end_timestamp}}
{{.start_timestamp_ms}}
{{.end_timestamp_ms}}
{{.start_timestamp_nano}}
{{.end_timestamp_nano}}
Let's explore sample queries to use ClickHouse data in alert evaluation.
The rule engine is always looking for special column aliases. When the query returns data, the rule engine looks into the result for a column with alias value
or result
and compares the vector of all those values with the threshold (defined in alert form).
For example, the following query retrieves not found
errors of a service. When this query returns data, the rule engine will compare value
from each row with the threshold set in alert form.
SELECT count() AS value
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE (serviceName='api-service’)
AND (exceptionType='Not found’);
If the same query is written without the column alias value
or result
, the rule engine will ignore the metric and fail to arrive at a threshold match.
Second reserved column alias is interval
. It is useful when you want to apply threshold check over n time intervals.
SELECT count() AS value, toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS interval
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE (serviceName = 'api-service')
AND (exceptionType = 'Not found')
GROUP BY interval
The rule engine is also capable of dynamically applying date ranges to your SELECT
query. When you choose last 15 mins
option in the alert conditions section, the rule engine applies timestamp values in the timeframe condition below.
SELECT count() AS value, toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS interval
FROM signoz_traces.distributed_signoz_error_index_v2
WHERE timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
AND (serviceName = 'api-service')
AND (exceptionType = 'Not found')
GROUP BY interval
The reserved bind variables {{.start_datetime}}
and {{.end_datetime}}
would allow rule engine to dynamically include date range query when the alert query runs.