In our last blog post (Monitoring OpenMetrics for Gunicorn and Django application in Prometheus), we exposed the 0.5, 0.9 and 0.99 quantiles from each individual statsd-exporter instance. There, the quantiles were calculated on the client side and exposed to Prometheus as metrics. In this blog, we shall instead send the observation counts for a chosen set of bucket intervals and aggregate them at the backend (Prometheus).

Aggregating quantiles by averaging them does not give a meaningful result.

Example:

data1 -> 100, 110, 110, 120, 125, 130, 105, 115, 130, 250, 145, 115 -> median is 117.5

import numpy as np

arr = [100, 110, 110, 120, 125, 130, 105, 115, 130, 250, 145, 115]
print (np.median(arr))

Output: 117.5

data2 -> [405, 115, 390, 550, 250, 330, 440, 180, 525] -> median is 390

arr2 = [405, 115, 390, 550, 250, 330, 440, 180, 525]
print (np.median(arr2))

Output: 390.0

The actual median of the combined datasets is:

arr.extend(arr2)
print (np.median(arr))

Output: 130.0

Now, how do we aggregate the medians of data1 and data2?

Averaging them gives (117.5 + 390)/2 = 253.75, while the actual median of the combined dataset is 130.0.

If these were the response times of an application on 2 different nodes, the averaged value would give a misleading picture.

This is where the Histogram metric type in Prometheus comes into the picture. The idea is to send a histogram, i.e. the number of observations falling into each bucketed time interval, like:

print (np.histogram(arr))

Output: (array([11, 2, 0, 2, 0, 1, 2, 1, 0, 2]), array([100., 145., 190., 235., 280., 325., 370., 415., 460., 505., 550.]))

Linear interpolation on this histogram gives roughly 145 as the approximate median, which is much closer to the real median of 130 than the 253.75 obtained by averaging.

To read more about quantile aggregation using summaries and histograms in Prometheus, go through the link below. It also explains the errors that linear interpolation introduces and why you should have a rough idea of your value distribution when choosing buckets.
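As a rough illustration of the interpolation mentioned above, here is a minimal sketch that estimates a quantile from the output of np.histogram. The function name quantile_from_histogram and the rank convention (rank = q × total count) are assumptions made for this example. A different rank convention, or a different lower bound for the first bin, shifts the estimate by a few units, so treat the result as an approximation.

```python
import numpy as np

# Combined observations from data1 and data2 above.
combined = [100, 110, 110, 120, 125, 130, 105, 115, 130, 250, 145, 115,
            405, 115, 390, 550, 250, 330, 440, 180, 525]


def quantile_from_histogram(counts, edges, q):
    """Estimate the q-quantile by interpolating linearly inside the bin that holds it.

    Hypothetical helper for illustration; it assumes observations are spread
    evenly inside each bin.
    """
    counts = np.asarray(counts, dtype=float)
    cumulative = np.cumsum(counts)
    rank = q * cumulative[-1]
    # First bin whose cumulative count reaches the rank.
    idx = int(np.searchsorted(cumulative, rank))
    below = cumulative[idx - 1] if idx > 0 else 0.0
    lower, upper = float(edges[idx]), float(edges[idx + 1])
    if counts[idx] == 0:
        return lower
    return lower + (upper - lower) * (rank - below) / counts[idx]


counts, edges = np.histogram(combined)
estimate = quantile_from_histogram(counts, edges, 0.5)
print(round(estimate, 2))
```

With these inputs the estimate comes out at about 143, in the same neighbourhood as the 145 quoted above and far from the 253.75 produced by averaging the two medians.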

Histograms and summaries | Prometheus

We shall apply the histogram approach to aggregate quantiles in Prometheus. For that, we have to configure statsd-exporter to expose metrics in buckets (the number of observations falling in each bucket).

Let's implement this in our existing project. Check out the branch quantile_aggregation from our GitHub repo. It contains the file config/statsd-mapping.conf, which includes:

defaults:
  timer_type: histogram
  buckets: [ 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0 ]
  match_type: glob
  glob_disable_ordering: false
  ttl: 0 # metrics do not expire
mappings:
  - match: "gunicorn.request.status.*"
    #match_type: regex
    help: "gunicorn http response code"
    name: "gunicorn_http_response_code"
    labels:
      status: "$1"

We have a defaults section where we defined the timer_type as histogram and also defined the bucket ranges to collect observations.
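To make the bucket idea concrete, the following sketch shows how a handful of latencies would be counted against the bucket bounds from the config. Prometheus histogram buckets are cumulative: each one counts the observations less than or equal to its upper bound (the le label), and a final +Inf bucket counts everything. The function name and the sample latencies are hypothetical and only there for illustration.

```python
import math

# Upper bounds (in seconds) taken from the defaults section above.
BUCKETS = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 2.0]


def cumulative_bucket_counts(observations, bounds=BUCKETS):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    all_bounds = list(bounds) + [math.inf]
    return {le: sum(1 for x in observations if x <= le) for le in all_bounds}


# Hypothetical request latencies in seconds.
latencies = [0.42, 0.65, 0.68, 0.95, 1.8, 2.4]
bucket_counts = cumulative_bucket_counts(latencies)
for le, count in bucket_counts.items():
    print(f'le="{le}" {count}')
```

Because the counts are plain counters per bucket, the same bucket from different instances can simply be added together, which is what makes aggregation possible.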

The mappings section re-formats our statsd metrics into Prometheus metrics with labels. It converts metrics like gunicorn.request.status.200 to gunicorn_http_response_code{status="200"}, and similarly for the other status codes. This helps in filtering gunicorn_http_response_code by status code.

Now, we need to apply this configuration file to statsd-exporter using the argument --statsd.mapping-config= as shown in the code snippet below.
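The effect of a glob mapping can be sketched in a few lines of Python. This is not how statsd-exporter is implemented; apply_glob_mapping is a hypothetical helper that only mimics the behaviour described above, where each * matches one dot-separated component and $1, $2, ... refer to the captured components.

```python
def apply_glob_mapping(metric, match, name, labels):
    """Return (name, labels) if the statsd metric matches the glob, else None."""
    metric_parts = metric.split(".")
    match_parts = match.split(".")
    if len(metric_parts) != len(match_parts):
        return None

    captures = []
    for part, pattern in zip(metric_parts, match_parts):
        if pattern == "*":
            captures.append(part)      # a wildcard captures the whole component
        elif pattern != part:
            return None                # a literal component must match exactly

    resolved = {}
    for key, template in labels.items():
        value = template
        # Substitute higher indices first so "$1" does not clobber "$10".
        for index in range(len(captures), 0, -1):
            value = value.replace(f"${index}", captures[index - 1])
        resolved[key] = value
    return name, resolved


result = apply_glob_mapping(
    "gunicorn.request.status.200",
    "gunicorn.request.status.*",
    "gunicorn_http_response_code",
    {"status": "$1"},
)
print(result)
```

A statsd metric that does not match the pattern, such as gunicorn.requests, makes the helper return None, which in this sketch stands for the metric falling through to any other mapping or the default name translation.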

      containers:
      - name: statsd-exporter
        image: prom/statsd-exporter:latest
        args:
        - '--log.level=info'
        - '--statsd.mapping-config=/etc/config/statsd-mapping.conf'

To get this done, we need to place our local configuration file in the pod at the path specified in the above args, i.e., /etc/config/statsd-mapping.conf. This is achieved by creating a Kubernetes ConfigMap resource and mounting it as a volume.

We can create the ConfigMap by running the command below from the project root directory:

kubectl -n django-with-statsd create configmap statsd-mapping-config --from-file=./config/statsd-mapping.conf

The section that mounts the volume looks like this in the deployment file:

    spec:
      containers:
      - name: statsd-exporter
        image: prom/statsd-exporter:latest
        args:
        - '--log.level=info'
        - '--statsd.mapping-config=/etc/config/statsd-mapping.conf'
        #- --statsd.event-flush-interval=16s

        volumeMounts:
        - name: config-volume
          mountPath: /etc/config/
 
      - name: django-sample-project
        image: ankitnayan/django_sample_project:quantiles_v1


      volumes:
      - name: config-volume
        configMap:
          name: statsd-mapping-config
          items:
          - key: statsd-mapping.conf
            path: statsd-mapping.conf

Next, we apply our deployment file with kubectl -n django-with-statsd apply -f k8s/django-deployment.yml. The pod should now contain the file statsd-mapping.conf at /etc/config/. You can verify this by:

$ POD_NAME=`kubectl -n django-with-statsd get pods -o jsonpath='{.items[0].metadata.name}'`
$ kubectl -n django-with-statsd exec -it pod/$POD_NAME -- /bin/sh

If you can see the file by ls /etc/config/, then you are good to proceed.

Just to cross-check, we shall now look at the metrics that Prometheus scrapes, by SSH-ing into the cluster and fetching the metrics endpoint exposed to Prometheus as a scrape target.

The Prometheus expression browser is accessed via a tunnel at port 9090 on localhost (as set up in the previous blog). Visit http://localhost:9090/targets#job-kubernetes-pods to get the list of pods that are scrape targets. Mine looks like this:

The endpoints contain the metrics that Prometheus scrapes at every scrape interval. These endpoints are internal addresses and can be accessed only from within the cluster. So SSH into any of the EC2 instances that are nodes in your cluster. I chose to do this with EC2 Instance Connect (browser-based SSH connection) provided in the AWS console.

Once you are in the instance, run curl http://192.168.19.6:9102/metrics | grep django, replacing http://192.168.19.6:9102/metrics with your own scrape target. The output will look something like this:

Our application's instrumented metrics no longer contain quantiles; they contain buckets instead, which can easily be aggregated in Prometheus by running the query below in the expression browser.

histogram_quantile(0.5, sum(rate(django_request_latency_seconds_bucket{endpoint="/polls/2xx_success/"}[1m])) by (le))

I usually get the output to be around 0.63-0.7. I hope you get something similar.

This is the aggregated 0.5 quantile (median) from all statsd-exporter instances. Similarly, in the above PromQL we can replace 0.5 with 0.9 or 0.95 to get the 90th or 95th percentile respectively.

Remember to generate load with locustio before running these PromQL queries. Since we are working on the per-second rate of the metrics, you will otherwise get NaN in the output.
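To see why summing buckets by le and then interpolating gives a sensible aggregate, here is a simplified Python sketch of the same two steps. It is an approximation of what the query does, not Prometheus code: the bucket values for the two instances are made up, they stand in for the per-bucket rates, and the function names are hypothetical.

```python
import math


def sum_by_le(*instances):
    """Add up bucket values from several instances, keyed by upper bound (le)."""
    total = {}
    for buckets in instances:
        for le, value in buckets.items():
            total[le] = total.get(le, 0.0) + value
    return total


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets, +Inf included."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]           # the +Inf bucket holds every observation
    if total == 0:
        return math.nan                   # no observations in the window
    rank = q * total
    idx = next(i for i, le in enumerate(bounds) if buckets[le] >= rank)
    upper = bounds[idx]
    if math.isinf(upper):
        return bounds[-2]                 # beyond the last finite bucket
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = buckets[bounds[idx - 1]] if idx > 0 else 0.0
    in_bucket = buckets[upper] - below
    if in_bucket == 0:
        return upper
    return lower + (upper - lower) * (rank - below) / in_bucket


# Made-up cumulative bucket values for two statsd-exporter instances.
instance_a = {0.5: 10, 0.7: 60, 0.9: 80, 1.1: 90, 1.3: 95, 1.5: 98, 2.0: 100, math.inf: 100}
instance_b = {0.5: 5, 0.7: 20, 0.9: 30, 1.1: 45, 1.3: 48, 1.5: 49, 2.0: 50, math.inf: 50}

aggregated_median = histogram_quantile(0.5, sum_by_le(instance_a, instance_b))
print(round(aggregated_median, 3))
```

The empty-window branch returns NaN, which corresponds to the NaN you see when no load has been generated. The estimate can never be more precise than the width of the bucket it falls into, so the choice of bucket bounds matters.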

Plotting percentiles and 5 slowest endpoints in Grafana

You can get the graph below by applying this query in Grafana:

histogram_quantile(0.5, sum(rate(django_request_latency_seconds_bucket{endpoint="/polls/2xx_success/"}[5m])) by (le))

Add 2 more queries with the first argument to histogram_quantile set to 0.9 and 0.99.

To get the 5 slowest endpoints, follow the blog Graph top N time series in Grafana. The one slight change is the regex for the variable. Check mine in the image below.

And the Grafana query becomes histogram_quantile(0.9, sum(rate(django_request_latency_seconds_bucket{endpoint=~"$slow_endpoints"}[5m])) by (le, endpoint))

Finally the graph we plot is:

I have only 3 endpoints, so the graph displays only 3, but this also works with more than 5 endpoints.

I hope this blog helps those looking for ways to aggregate percentiles/quantiles from multiple instances of statsd-exporter in Prometheus.

For any Prometheus-related query, reach out to me on Twitter or mail me at ankit@signoz.io