This document explains how to monitor NVIDIA GPUs using the DCGM Exporter and SigNoz. The exporter collects GPU metrics and exposes them for Prometheus-style scraping.
Prerequisites
- NVIDIA GPU(s) with drivers installed
- NVIDIA Container Toolkit (for Docker deployments)
Setup
Step 1: Run NVIDIA DCGM Exporter
NVIDIA's official dcgm-exporter exposes GPU metrics on :9400/metrics.
Docker (single node quickstart):
docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  --rm \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
Verify it's running:
curl localhost:9400/metrics | head
You should see metrics like DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, etc.
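To check which DCGM metrics the exporter exposes, rather than eyeballing the first few lines, the output can be filtered down to distinct metric names. The following is a minimal sketch; the function name list_dcgm_metrics is hypothetical, and it assumes the endpoint returns the standard Prometheus text format.

```shell
# Print the distinct DCGM metric names found in Prometheus text
# format read from stdin. list_dcgm_metrics is an illustrative
# helper name, not part of dcgm-exporter.
list_dcgm_metrics() {
  # Keep sample lines only (HELP/TYPE comment lines start with '#'),
  # strip the label set and value, then de-duplicate.
  grep '^DCGM_' | sed 's/[{ ].*//' | sort -u
}
```

Usage, assuming the quickstart port mapping above: curl -s localhost:9400/metrics | list_dcgm_metrics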
Kubernetes: For Kubernetes deployments, NVIDIA recommends installing via the Helm chart.
Step 2: Set Up the OTel Collector
Refer to this documentation to set up the collector.
Step 3: Configure the Prometheus Receiver
Add a scrape job for the DCGM exporter in your OTel Collector config:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "dcgm-exporter"
          scrape_interval: 15s
          scrape_timeout: 10s
          static_configs:
            - targets: ["<gpu-node-host>:9400"]
Configuration parameters:
- <gpu-node-host>: Hostname or IP address of the node running the DCGM exporter
If the OTel Collector runs on the same host as the exporter (non-containerized), use localhost:9400 as the target. In containerized environments, use the container name or service name instead.
Step 4: Enable the Pipeline
Add the receiver to your metrics pipeline:
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
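The pipeline references a batch processor and an otlp exporter, so both also need to be defined in the same config file. The sketch below writes one possible complete file that combines the receiver from Step 3 with this pipeline. The function name write_collector_config and its arguments are hypothetical. The OTLP endpoint is a placeholder for your SigNoz instance, and any TLS or authentication settings your deployment needs are deliberately left out.

```shell
# Write a minimal collector config to a file.
# Arguments (all illustrative):
#   $1 = output path
#   $2 = scrape target as host:port, e.g. localhost:9400
#   $3 = OTLP endpoint of your SigNoz instance (placeholder; add
#        TLS/auth settings for your deployment separately)
write_collector_config() {
  out="$1"; target="$2"; endpoint="$3"
  cat > "$out" <<EOF
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "dcgm-exporter"
          scrape_interval: 15s
          scrape_timeout: 10s
          static_configs:
            - targets: ["$target"]

processors:
  batch: {}

exporters:
  otlp:
    endpoint: "$endpoint"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
EOF
}
```

Usage: write_collector_config otel-collector-config.yaml "<gpu-node-host>:9400" "<signoz-otlp-endpoint>", then point the collector at the generated file.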
Visualizing GPU Metrics
Once configured, verify ingestion in the Metrics Explorer. Search for metrics starting with DCGM_.
You can use the pre-configured NVIDIA DCGM dashboard to monitor your GPUs. To import it, navigate to:
Dashboards → + New dashboard → Import JSON
Troubleshooting
Common Issues
No metrics appearing in SigNoz
- Verify the DCGM exporter is running and the /metrics endpoint is accessible
- Ensure NVIDIA drivers are properly installed
Container fails to start
- Verify NVIDIA Container Toolkit is installed
- Check if GPUs are visible with nvidia-smi
- Ensure the --gpus all flag is passed to Docker
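The host-side checks above can be run in order with a small script that stops at the first failing step. This is a rough sketch rather than an official tool: the function name diagnose_dcgm, its messages, and its return codes are all hypothetical, and the default URL assumes the quickstart port mapping.

```shell
# Run the basic checks in order and report the first failure.
# $1 = metrics URL (defaults to the quickstart port mapping).
# Return codes (illustrative): 1 = nvidia-smi missing,
# 2 = GPUs not listable, 3 = endpoint unreachable, 0 = all passed.
diagnose_dcgm() {
  url="${1:-http://localhost:9400/metrics}"
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: check the NVIDIA driver installation"
    return 1
  fi
  # 'nvidia-smi -L' lists the GPUs the driver can see.
  if ! nvidia-smi -L >/dev/null 2>&1; then
    echo "nvidia-smi cannot list GPUs: check the NVIDIA driver"
    return 2
  fi
  # -f makes curl fail on HTTP errors; --max-time bounds the wait.
  if ! curl -sf --max-time 5 "$url" >/dev/null; then
    echo "exporter endpoint not reachable at $url"
    return 3
  fi
  echo "ok"
}
```

Usage: run diagnose_dcgm on the GPU node, or pass a different URL such as diagnose_dcgm "http://<gpu-node-host>:9400/metrics" from the collector host (in which case the nvidia-smi checks apply to the machine you run it on).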