Writing a Metrics ClickHouse Query
Table schema definitions & examples for metrics
There are four main tables in the database: one stores the samples/measurements, and the other three store the time series data.
Schema for samples table:
The schemas are not final. We might change them in the future.
CREATE TABLE signoz_metrics.distributed_samples_v4
(
`env` LowCardinality(String) DEFAULT 'default',
`temporality` LowCardinality(String) DEFAULT 'Unspecified',
`metric_name` LowCardinality(String),
`fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
`unix_milli` Int64 CODEC(DoubleDelta, ZSTD(1)),
`value` Float64 CODEC(Gorilla, ZSTD(1))
)
ENGINE = Distributed('cluster', 'signoz_metrics', 'samples_v4', cityHash64(env, temporality, metric_name, fingerprint))
Explanation of the columns:
- env: The deployment.environment label value. This is used to identify the environment in which the metric was observed and to filter the metrics based on the environment.
- temporality: Temporality of the metric. This is used to identify the type of the metric. It can be one of the following values:
- Unspecified: This is used for gauge metrics.
- Cumulative: This is used for monotonic counters.
- Delta: This is used for counters that send the delta values.
- metric_name: Name of the metric
- fingerprint: Fingerprint of the metric. This is used to identify the metric uniquely. Currently, we are using the hash of the labels to generate the fingerprint.
- unix_milli: Timestamp in milliseconds when the metric was observed.
- value: Value of the metric
Examples of samples:
- Sample data for the counter metric signoz_calls_total:
    ┌─env─────┬─temporality─┬─metric_name────────┬────────fingerprint─┬────unix_milli─┬─value─┐
 1. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354263778 │   560 │
 2. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354323779 │  1140 │
 3. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354383778 │  1669 │
 4. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354443779 │  2180 │
 5. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354503779 │  2745 │
 6. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354563778 │  3300 │
 7. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354623779 │  3870 │
 8. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354683778 │  4460 │
 9. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354743778 │  5030 │
10. │ default │ Cumulative  │ signoz_calls_total │ 826125025100650961 │ 1715354803779 │  5590 │
    └─────────┴─────────────┴────────────────────┴────────────────────┴───────────────┴───────┘
- Sample data for the gauge metric otelcol_exporter_queue_size:
    ┌─env─────┬─temporality─┬─metric_name─────────────────┬─────────fingerprint─┬────unix_milli─┬─value─┐
 1. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558433378 │     0 │
 2. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558493378 │    20 │
 3. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558553378 │     5 │
 4. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558613378 │    13 │
 5. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558673378 │     0 │
 6. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558733378 │    12 │
 7. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558793378 │     0 │
 8. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558853378 │    14 │
 9. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558913378 │     0 │
10. │ default │ Unspecified │ otelcol_exporter_queue_size │ 8219912421067840979 │ 1715558973378 │     6 │
    └─────────┴─────────────┴─────────────────────────────┴─────────────────────┴───────────────┴───────┘
- Sample data for the histogram metric signoz_latency_bucket:
    ┌─env─────┬─temporality─┬─metric_name───────────┬───────fingerprint─┬────unix_milli─┬─value─┐
 1. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764440740 │   145 │
 2. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764500740 │   280 │
 3. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764560740 │   412 │
 4. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764620740 │   567 │
 5. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764680740 │   715 │
 6. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764740740 │   855 │
 7. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764800740 │   992 │
 8. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764860740 │  1135 │
 9. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764920740 │  1265 │
10. │ default │ Cumulative  │ signoz_latency_bucket │ 81477507584972766 │ 1715764980742 │  1397 │
    └─────────┴─────────────┴───────────────────────┴───────────────────┴───────────────┴───────┘
Schema for time series tables:
There are three time series tables, namely time_series_v4, time_series_v4_6hrs, and time_series_v4_1day. The time series tables are used to store the label set that is used to identify the metric. See the description of the unix_milli column for the explanation of the granularity of the time series tables.
CREATE TABLE signoz_metrics.distributed_time_series_v4
(
`env` LowCardinality(String) DEFAULT 'default',
`temporality` LowCardinality(String) DEFAULT 'Unspecified',
`metric_name` LowCardinality(String),
`description` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`unit` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`type` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`is_monotonic` Bool DEFAULT false CODEC(ZSTD(1)),
`fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
`unix_milli` Int64 CODEC(Delta(8), ZSTD(1)),
`labels` String CODEC(ZSTD(5))
)
ENGINE = Distributed('cluster', 'signoz_metrics', 'time_series_v4', cityHash64(env, temporality, metric_name, fingerprint))
CREATE TABLE signoz_metrics.distributed_time_series_v4_6hrs
(
`env` LowCardinality(String) DEFAULT 'default',
`temporality` LowCardinality(String) DEFAULT 'Unspecified',
`metric_name` LowCardinality(String),
`description` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`unit` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`type` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`is_monotonic` Bool DEFAULT false CODEC(ZSTD(1)),
`fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
`unix_milli` Int64 CODEC(Delta(8), ZSTD(1)),
`labels` String CODEC(ZSTD(5))
)
ENGINE = Distributed('cluster', 'signoz_metrics', 'time_series_v4_6hrs', cityHash64(env, temporality, metric_name, fingerprint))
CREATE TABLE signoz_metrics.distributed_time_series_v4_1day
(
`env` LowCardinality(String) DEFAULT 'default',
`temporality` LowCardinality(String) DEFAULT 'Unspecified',
`metric_name` LowCardinality(String),
`description` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`unit` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`type` LowCardinality(String) DEFAULT '' CODEC(ZSTD(1)),
`is_monotonic` Bool DEFAULT false CODEC(ZSTD(1)),
`fingerprint` UInt64 CODEC(Delta(8), ZSTD(1)),
`unix_milli` Int64 CODEC(Delta(8), ZSTD(1)),
`labels` String CODEC(ZSTD(5))
)
ENGINE = Distributed('cluster', 'signoz_metrics', 'time_series_v4_1day', cityHash64(env, temporality, metric_name, fingerprint))
Explanation of the columns:
- env: The deployment.environment label value. This is used to identify the environment in which the metric was observed and to filter the metrics based on the environment.
- temporality: Temporality of the metric. This is used to identify the type of the metric. It can be one of the following values:
- Unspecified: This is the default value and is used for gauge metrics.
- Cumulative: This is used for monotonic counters.
- Delta: This is used for counters that send the delta values.
- metric_name: Name of the metric.
- description: Description of the metric.
- unit: Unit of the metric.
- type: Type of the metric. One of the following values:
- Sum: This is used for monotonic counters.
- Gauge: This is used for gauge metrics.
- Histogram: This is used for histogram metrics.
- ExponentialHistogram: This is used for exponential histogram metrics.
- is_monotonic: This is used to identify whether the metric is a monotonic counter, which helps determine how the metric should be interpreted.
- fingerprint: Fingerprint of the metric. This is used to identify the metric uniquely. Currently, we are using the hash of the labels to generate the fingerprint.
- unix_milli: Dependent on the granularity of the time series table. The time_series_v4 table has a granularity of 1 hour, the time_series_v4_6hrs table has a granularity of 6 hours, and the time_series_v4_1day table has a granularity of 1 day. The value is the start of the time interval in milliseconds. For example, if the granularity is 1 hour, the value is the start of the hour in milliseconds. Why do we do this? If we were to store the exact timestamp of the metric, we would have to store a timestamp for each sample, which would increase the size of the table and slow down the queries. By storing only the start of the time interval, we balance the ability to know the most recent appearance of a metric against query speed. Even then, scans on the time series table can be slow when looking for a week's worth of data; hence, we have additional tables with coarser granularities to speed up the queries. A sketch of how these timestamps are aligned follows after this list.
- labels: Labels of the metric, stored as a JSON string.
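As a minimal sketch (not part of the original schema description), the following query shows how a sample timestamp rounds down to the bucket stored in each time series table, and why the example queries below align the start timestamp with intDiv({{.start_timestamp_ms}}, 86400000) * 86400000 when filtering time_series_v4_1day:
SELECT
    1715354263778 AS sample_unix_milli,                           -- 2024-05-10 15:17:43 UTC (from the samples above)
    intDiv(1715354263778, 3600000) * 3600000 AS v4_bucket,        -- 1715353200000 = 15:00, 1-hour granularity
    intDiv(1715354263778, 21600000) * 21600000 AS v4_6hrs_bucket, -- 1715342400000 = 12:00, 6-hour granularity
    intDiv(1715354263778, 86400000) * 86400000 AS v4_1day_bucket  -- 1715299200000 = 00:00, 1-day granularity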
Example of a time series
env: default
temporality: Cumulative
metric_name: signoz_calls_total
description:
unit:
type: Sum
is_monotonic: true
fingerprint: 1918473123734562099
unix_milli: 1715385600000
labels: {"__name__":"signoz_calls_total","__temporality__":"Cumulative","deployment_environment":"default","operation":"FindDriverIDs","resource_signoz_collector_id":"b2f22d66-0851-44f2-9104-2292189958d1","service_name":"redis","service_namespace":"default","signoz_collector_id":"b2f22d66-0851-44f2-9104-2292189958d1","span_kind":"SPAN_KIND_CLIENT","status_code":"STATUS_CODE_UNSET"}
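Before writing a query, it can be useful to look up a metric's type, temporality, and unit from the time series table. A small sketch (not from the original docs), using the schema above:
SELECT DISTINCT
    metric_name,
    temporality,
    type,
    is_monotonic,
    unit
FROM signoz_metrics.distributed_time_series_v4_1day
WHERE metric_name = 'signoz_calls_total'
LIMIT 10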
Querying the metrics
Querying the metrics is done in two steps. First, we query the time series table to get the fingerprints and labels of the metrics. Then, we use the fingerprints to query the samples table to get the actual samples. This is done to reduce the amount of data that needs to be scanned. Also, the time series table is much smaller than the samples table.
We use a JOIN operation to combine the two tables on the fingerprint column. The join is performed at the shard level, i.e. on the data nodes rather than on the aggregator node, which reduces the amount of data that needs to be transferred between nodes. This works because the data is sharded on the env, temporality, metric_name, and fingerprint columns, so all the data for a particular metric and fingerprint is present on the same node. This is also why the example queries below join distributed_samples_v4 against the local time_series_v4_1day table rather than its distributed counterpart.
The queries should have a result column with the name value and a column with type DateTime for the graphs to work.
Example queries
Retrieving the fingerprints and labels of the metric
SELECT DISTINCT
fingerprint,
labels
FROM signoz_metrics.distributed_time_series_v4
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative')
LIMIT 10
Retrieving the samples of the metric
SELECT
unix_milli,
value
FROM signoz_metrics.distributed_samples_v4
WHERE metric_name = 'signoz_calls_total'
LIMIT 10
Retrieving the label values of a metric
SELECT DISTINCT
fingerprint,
JSONExtractString(labels, 'service_name') AS service_name
FROM signoz_metrics.distributed_time_series_v4_1day
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative')
LIMIT 10
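Combining the two steps, the following sketch (not from the original docs) joins the samples table with the time series table on fingerprint and returns the value and DateTime columns required for graphs. It uses the gauge metric otelcol_exporter_queue_size from the sample data above; since gauges do not need a rate computation, a per-minute average is enough. ({{.start_timestamp_ms}} and {{.end_timestamp_ms}} are the default variables described at the end of this page.)
SELECT
    toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
    avg(value) AS value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
    SELECT DISTINCT fingerprint
    FROM signoz_metrics.time_series_v4_1day
    WHERE (metric_name = 'otelcol_exporter_queue_size') AND (temporality = 'Unspecified') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}})
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'otelcol_exporter_queue_size') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli < {{.end_timestamp_ms}})
GROUP BY ts
ORDER BY ts ASC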
Example queries for the frontend service RED metrics
The following queries can be modified to get the metrics for each service by grouping by the service name; a sketch of that modification is shown after the request rate query below.
The following queries are used to generate the graphs for the frontend service RED metrics. The innermost join fetches the raw samples joined with the time series table. Then, the per-series rate of change is computed (using the lagInFrame window function in the request rate query, and the runningDifference function in the error rate and latency queries). Finally, the outermost query sums the per-series rates of change.
Request rate for a service
SELECT
ts,
sum(per_series_value) AS value
FROM
(
SELECT
ts,
If((per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) < 0, nan, If((ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window) >= 86400, nan, (per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) / (ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window))) AS per_series_value
FROM
(
SELECT
fingerprint,
toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
max(value) AS per_series_value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
SELECT DISTINCT fingerprint
FROM signoz_metrics.time_series_v4_1day
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}}) AND JSONExtractString(labels, 'service_name') = 'frontend'
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'signoz_calls_total') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli < {{.end_timestamp_ms}})
GROUP BY
fingerprint,
ts
ORDER BY
fingerprint ASC,
ts ASC
)
WINDOW rate_window AS (PARTITION BY fingerprint ORDER BY fingerprint ASC, ts ASC)
)
WHERE isNaN(per_series_value) = 0
GROUP BY ts
ORDER BY ts ASC
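As mentioned above, the request rate query can be grouped by service: extract service_name in the time series subquery, carry it through each level, and add it to the outer GROUP BY. A hedged sketch of the modified query (the service_name filter is removed so that all services are returned):
SELECT
    service_name,
    ts,
    sum(per_series_value) AS value
FROM
(
    SELECT
        service_name,
        ts,
        If((per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) < 0, nan, If((ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window) >= 86400, nan, (per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) / (ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window))) AS per_series_value
    FROM
    (
        SELECT
            fingerprint,
            service_name,
            toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
            max(value) AS per_series_value
        FROM signoz_metrics.distributed_samples_v4
        INNER JOIN
        (
            SELECT DISTINCT
                JSONExtractString(labels, 'service_name') AS service_name,
                fingerprint
            FROM signoz_metrics.time_series_v4_1day
            WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}})
        ) AS filtered_time_series USING (fingerprint)
        WHERE (metric_name = 'signoz_calls_total') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli < {{.end_timestamp_ms}})
        GROUP BY
            fingerprint,
            service_name,
            ts
        ORDER BY
            fingerprint ASC,
            ts ASC
    )
    WINDOW rate_window AS (PARTITION BY fingerprint ORDER BY fingerprint ASC, ts ASC)
)
WHERE isNaN(per_series_value) = 0
GROUP BY
    service_name,
    ts
ORDER BY ts ASC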
Error rate for a service
This is the query for the error rate for a service. The query is similar to the request rate query; the main difference is that the numerator additionally filters the samples by status code. We then divide the rate of error calls by the rate of total calls and multiply by 100 to get the error percentage.
SELECT
A.ts AS ts,
(A.value * 100) / B.value AS value
FROM
(
SELECT
ts,
sum(value) AS value
FROM
(
SELECT
ts,
if(runningDifference(ts) <= 0, nan, if(runningDifference(value) < 0, value / runningDifference(ts), runningDifference(value) / runningDifference(ts))) AS value
FROM
(
SELECT
fingerprint,
toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
max(value) AS value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
SELECT DISTINCT fingerprint
FROM signoz_metrics.time_series_v4_1day
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}}) AND JSONExtractString(labels, 'service_name') = 'redis' AND (JSONExtractString(labels, 'status_code') IN ['STATUS_CODE_ERROR'])
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'signoz_calls_total') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli <= {{.end_timestamp_ms}})
GROUP BY
fingerprint,
ts
ORDER BY
fingerprint ASC,
ts ASC
)
WHERE isNaN(value) = 0
)
GROUP BY ts
ORDER BY ts ASC
) AS A
INNER JOIN
(
SELECT
ts,
sum(value) AS value
FROM
(
SELECT
ts,
if(runningDifference(ts) <= 0, nan, if(runningDifference(value) < 0, value / runningDifference(ts), runningDifference(value) / runningDifference(ts))) AS value
FROM
(
SELECT
fingerprint,
toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
max(value) AS value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
SELECT DISTINCT fingerprint
FROM signoz_metrics.time_series_v4_1day
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}}) AND JSONExtractString(labels, 'service_name') = 'redis'
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'signoz_calls_total') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli <= {{.end_timestamp_ms}})
GROUP BY
fingerprint,
ts
ORDER BY
fingerprint ASC,
ts ASC
)
WHERE isNaN(value) = 0
)
GROUP BY ts
ORDER BY ts ASC
) AS B ON A.ts = B.ts
99th percentile latency for a service
This is the query for the 99th percentile latency for a service. The query is similar to the request rate query, except that we use the histogramQuantile function to compute the 99th percentile from the histogram. The histogram is represented as two arrays: one contains the bucket upper bounds (the le label values) and the other contains the corresponding values. The histogramQuantile function takes the two arrays and the target quantile as input and returns the estimated quantile value. The bucket bounds and values are obtained by taking the rate of change for each bucket. A standalone illustration of histogramQuantile follows the query.
SELECT
ts,
histogramQuantile(arrayMap(x -> toFloat64(x), groupArray(le)), groupArray(value), 0.99) AS value
FROM
(
SELECT
le,
ts,
sum(value) AS value
FROM
(
SELECT
le,
ts,
if(runningDifference(ts) <= 0, nan, if(runningDifference(value) < 0, value / runningDifference(ts), runningDifference(value) / runningDifference(ts))) AS value
FROM
(
SELECT
fingerprint,
le,
toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
max(value) AS value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
SELECT DISTINCT
JSONExtractString(labels, 'le') AS le,
fingerprint
FROM signoz_metrics.time_series_v4_1day
WHERE (metric_name = 'signoz_latency_bucket') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}}) AND JSONExtractString(labels, 'service_name') = 'frontend'
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'signoz_latency_bucket') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli <= {{.end_timestamp_ms}})
GROUP BY
fingerprint,
le,
ts
ORDER BY
fingerprint ASC,
le ASC,
ts ASC
)
WHERE isNaN(value) = 0
)
GROUP BY le, ts
HAVING isNaN(value) = 0
ORDER BY
le ASC,
ts ASC
)
GROUP BY ts
ORDER BY ts ASC
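To make the arrays passed to histogramQuantile concrete, here is a standalone illustration with made-up numbers (this is not a query from the original docs; it assumes the histogramQuantile function available in SigNoz's ClickHouse setup). The first array holds the bucket upper bounds taken from the le label, the second holds the corresponding per-bucket rates, and the third argument is the target quantile:
SELECT histogramQuantile(
    [10., 50., 100., 250., 500., 1000., inf],    -- bucket upper bounds (le), hypothetical values
    [12.4, 31.0, 47.5, 58.2, 61.9, 62.0, 62.0],  -- rate of each cumulative bucket counter, hypothetical values
    0.99                                         -- target quantile
) AS p99_latency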
Using variables in queries
SigNoz supports using variables in queries. This allows you to create a single dashboard for multiple services. For example, you can create a dashboard for the request rate for all the services. Then, you can use the service name as a variable in the query to get the request rate for a particular service. Please refer to the Variables section for more information on how to create variables.
Example queries using variables
Request rate for a service
The variable {{.service_name}} is replaced with the selected service name when the query is executed.
SELECT
ts,
sum(per_series_value) AS value
FROM
(
SELECT
ts,
If((per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) < 0, nan, If((ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window) >= 86400, nan, (per_series_value - lagInFrame(per_series_value, 1, 0) OVER rate_window) / (ts - lagInFrame(ts, 1, toDate('1970-01-01')) OVER rate_window))) AS per_series_value
FROM
(
SELECT
fingerprint,
toStartOfInterval(toDateTime(intDiv(unix_milli, 1000)), toIntervalSecond(60)) AS ts,
max(value) AS per_series_value
FROM signoz_metrics.distributed_samples_v4
INNER JOIN
(
SELECT DISTINCT fingerprint
FROM signoz_metrics.time_series_v4_1day
WHERE (metric_name = 'signoz_calls_total') AND (temporality = 'Cumulative') AND (unix_milli >= intDiv({{.start_timestamp_ms}}, 86400000) * 86400000) AND (unix_milli < {{.end_timestamp_ms}}) AND JSONExtractString(labels, 'service_name') = {{.service_name}}
) AS filtered_time_series USING (fingerprint)
WHERE (metric_name = 'signoz_calls_total') AND (unix_milli >= {{.start_timestamp_ms}}) AND (unix_milli < {{.end_timestamp_ms}})
GROUP BY
fingerprint,
ts
ORDER BY
fingerprint ASC,
ts ASC
)
WINDOW rate_window AS (PARTITION BY fingerprint ORDER BY fingerprint ASC, ts ASC)
)
WHERE isNaN(per_series_value) = 0
GROUP BY ts
ORDER BY ts ASC
Using the default variables
The following variables are available by default:
- {{.start_timestamp_ms}} - This is the start time of the query in milliseconds
- {{.end_timestamp_ms}} - This is the end time of the query in milliseconds