This document explains how to monitor a SLURM cluster using SigNoz. You'll run a Prometheus-compatible SLURM exporter and scrape it with the OpenTelemetry Collector.
Prerequisites
- SLURM cluster running and accessible
- Access to SLURM CLI commands (`sinfo`, `squeue`, `sdiag`) on the exporter host
Setup
Step 1: Run a SLURM Prometheus Exporter
A commonly used option is `prometheus-slurm-exporter`, which extracts metrics from SLURM CLI commands and exposes them on a `/metrics` endpoint (default port `:8080`).
Run the exporter on a node that has access to SLURM commands.
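A minimal sketch of starting and verifying the exporter, assuming the `prometheus-slurm-exporter` binary is already installed on a node with SLURM CLI access (exact flags vary by exporter version; the default listen port is `:8080`):

```shell
# Start the exporter with its defaults on a node that can run sinfo/squeue/sdiag
./prometheus-slurm-exporter &

# Verify it is serving metrics before pointing the collector at it
curl -s http://localhost:8080/metrics | head
```

If the `curl` output shows Prometheus-format metric lines, the exporter is working and you can move on to the collector configuration.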
Step 2: Setup OTel Collector
Refer to the SigNoz OpenTelemetry Collector installation documentation to set up the collector.
Step 3: Configure the Prometheus Receiver
Add a scrape job for the SLURM exporter in your OTel Collector config:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "slurm-exporter"
          scrape_interval: 30s
          scrape_timeout: 30s
          static_configs:
            - targets: ["<slurm-exporter-host>:8080"]
```
Configuration parameters:
- `<slurm-exporter-host>`: Hostname or IP of the node running the SLURM exporter
- `scrape_interval` / `scrape_timeout`: 30s is recommended to avoid overloading the SLURM master
Step 4: Enable the Pipeline
Add the receiver to your metrics pipeline:
```yaml
service:
  pipelines:
    metrics:
      receivers: [prometheus] # append prometheus to your existing receivers list
      processors: [batch]
      exporters: [otlp]
```
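After updating the config, restart the collector and check its logs for scrape errors. The commands below assume a systemd-based install with a service named `otelcol-contrib`; adjust the service name to match your deployment:

```shell
# Restart the collector so the new prometheus receiver takes effect
# (service name is an assumption; yours may differ)
sudo systemctl restart otelcol-contrib

# Watch the logs for errors from the slurm-exporter scrape job
sudo journalctl -u otelcol-contrib -f
```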
Visualizing SLURM Metrics
Once configured, verify ingestion in the Metrics Explorer. Search for SLURM-related metrics (exact names depend on the exporter).
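To see which metric names your exporter actually emits before searching in the Metrics Explorer, you can inspect the exporter's endpoint directly (replace `<slurm-exporter-host>` with your host; the `slurm` prefix is typical for `prometheus-slurm-exporter` but names vary by exporter and version):

```shell
# List SLURM-related metric names exposed by the exporter
curl -s http://<slurm-exporter-host>:8080/metrics | grep -E '^slurm' | head
```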
You can use the pre-configured SLURM dashboard to monitor your cluster:
Dashboards → + New dashboard → Import JSON
Troubleshooting
Common Issues
No metrics appearing in SigNoz
- Verify the SLURM exporter is running and the `/metrics` endpoint is accessible
- Ensure the firewall allows access to the exporter port
Metrics showing stale or zero values
- Confirm the exporter host has access to SLURM CLI commands
- Check if SLURM services are running correctly
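The checks above can be run directly on the exporter host. A quick sketch, assuming the SLURM CLI tools are on `PATH`:

```shell
# SLURM CLI is installed and reachable
sinfo --version

# Scheduler responds with diagnostics (stale values often mean this fails)
sdiag | head

# Queue data is readable from this host
squeue | head
```

If any of these commands hang or error, the exporter will report stale or zero values; fix SLURM connectivity from this host first.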