NVIDIA GPU metrics with DCGM Exporter

This document explains how to monitor NVIDIA GPUs using the DCGM Exporter and SigNoz. The exporter collects GPU metrics and exposes them for Prometheus-style scraping.

Prerequisites

NVIDIA GPU(s) with drivers installed
NVIDIA Container Toolkit (for Docker deployments)

Setup

Step 1: Run NVIDIA DCGM Exporter

NVIDIA's official dcgm-exporter exposes GPU metrics on :9400/metrics.

Docker (single node quickstart):

docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  --rm \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04

Verify it's running:

curl localhost:9400/metrics | head

You should see metrics like DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, etc.
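To check the output programmatically rather than by eye, the exposition text can be parsed for metric names. The following is a minimal sketch, not part of the exporter: the function name `dcgm_metric_names` and the sample text (including its values and the `GPU-example` UUID) are made up for illustration.

```python
import re

# Matches a Prometheus sample line: metric name, optional {labels}, then a value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def dcgm_metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names that start with DCGM_."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1).startswith("DCGM_"):
            names.add(match.group(1))
    return sorted(names)


# Hypothetical sample shaped like dcgm-exporter output; the values are invented.
SAMPLE = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-example"} 1410
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-example"} 1215
"""

print(dcgm_metric_names(SAMPLE))
```

To run it against a real exporter, save the curl output to a file and pass the file contents to `dcgm_metric_names`. An empty result suggests the exporter is up but not reporting GPU fields.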

Kubernetes: For Kubernetes deployments, NVIDIA recommends installing via the Helm chart.

Step 2: Set Up the OTel Collector

Refer to this documentation to set up the collector.

Step 3: Configure the Prometheus Receiver

Add a scrape job for the DCGM exporter in your OTel Collector config:

config.yaml

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "dcgm-exporter"
          scrape_interval: 15s
          scrape_timeout: 10s
          static_configs:
            - targets: ["<gpu-node-host>:9400"]

Configuration parameters:

<gpu-node-host>: Hostname or IP of the node running the DCGM exporter

If the OTel Collector runs directly on the same host as the exporter (not in a container), use localhost:9400 as the target. In containerized environments, use the exporter's container name or service name as the host instead.
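Two mistakes are easy to make in this scrape job: leaving the `<gpu-node-host>` placeholder in place, and setting `scrape_timeout` higher than `scrape_interval`, which Prometheus-style configuration rejects. The sketch below checks for both on a job expressed as a plain Python dict. The helper names `parse_duration` and `check_scrape_job` are hypothetical, and the fallback values (1m interval, 10s timeout) assume the standard Prometheus defaults.

```python
import re

# Seconds per unit for the duration units handled by this sketch.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART_RE = re.compile(r"(\d+)(ms|s|m|h)")


def parse_duration(text):
    """Convert a duration such as '15s' or '1m30s' to seconds."""
    pos, total = 0, 0.0
    for match in _PART_RE.finditer(text):
        if match.start() != pos:  # unexpected characters between parts
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):  # nothing parsed, or trailing characters
        raise ValueError(f"invalid duration: {text!r}")
    return total


def check_scrape_job(job):
    """Return a list of human-readable problems found in one scrape job."""
    problems = []
    # Assumed defaults when a key is absent: 1m interval, 10s timeout.
    interval = parse_duration(job.get("scrape_interval", "1m"))
    timeout = parse_duration(job.get("scrape_timeout", "10s"))
    if timeout > interval:
        problems.append("scrape_timeout exceeds scrape_interval")
    for group in job.get("static_configs", []):
        for target in group.get("targets", []):
            if "<" in target or ">" in target:
                problems.append(f"placeholder left in target: {target}")
    return problems


# The job from the config above, with the placeholder still in place.
JOB = {
    "job_name": "dcgm-exporter",
    "scrape_interval": "15s",
    "scrape_timeout": "10s",
    "static_configs": [{"targets": ["<gpu-node-host>:9400"]}],
}

print(check_scrape_job(JOB))
```

With the placeholder replaced by a real host, `check_scrape_job` returns an empty list for this job.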

Step 4: Enable the Pipeline

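Add the receiver to your metrics pipeline. The batch processor and otlp exporter referenced here must already be defined elsewhere in your config: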

config.yaml

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
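A pipeline can only reference components that are defined in the top-level receivers, processors, and exporters sections; otherwise the Collector fails to start. The sketch below checks this on a config that has already been loaded into a Python dict. The function name `undefined_components` and the sample config are illustrative, and the check ignores connectors, which can also appear in pipelines.

```python
def undefined_components(config):
    """Return (pipeline, kind, name) for every component a pipeline
    references that has no definition in the matching top-level section."""
    missing = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind) or {}
            for name in pipeline.get(kind, []):
                if name not in defined:
                    missing.append((pipeline_name, kind, name))
    return missing


# Hypothetical config: the otlp exporter is used by the pipeline but never defined.
CONFIG = {
    "receivers": {"prometheus": {"config": {"scrape_configs": []}}},
    "processors": {"batch": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["prometheus"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            }
        }
    },
}

print(undefined_components(CONFIG))
```

For the sample above, the function reports the otlp exporter in the metrics pipeline as undefined. Adding an `exporters` section with an `otlp` entry clears the result.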

Visualizing GPU Metrics

Once configured, verify ingestion in the Metrics Explorer. Search for metrics starting with DCGM_.

You can use the pre-configured NVIDIA DCGM dashboard to monitor your GPUs:

Dashboards → + New dashboard → Import JSON

Troubleshooting

Common Issues

No metrics appearing in SigNoz
- Verify the DCGM exporter is running and the /metrics endpoint is accessible
- Ensure NVIDIA drivers are properly installed

Container fails to start
- Verify NVIDIA Container Toolkit is installed
- Check if GPUs are visible with nvidia-smi
- Ensure the --gpus all flag is passed to docker run
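When no metrics arrive, a useful first check is whether the exporter's port accepts connections from the machine where the Collector runs. The sketch below does a plain TCP connect; the function name `port_is_open` is hypothetical, and a successful connect only shows that something is listening, not that it serves valid metrics.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("localhost", 9400)` run on the GPU node should return True once the exporter container is up. If it returns False from the Collector host but True on the GPU node, look at firewalls or container networking between the two.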
