This document explains how to monitor a SLURM cluster using SigNoz. You'll run a Prometheus-compatible SLURM exporter and scrape it with the OpenTelemetry Collector.
Prerequisites
- SLURM cluster running and accessible
- Access to SLURM CLI commands (`sinfo`, `squeue`, `sdiag`) on the exporter host
Setup
Step 1: Run a SLURM Prometheus Exporter
A commonly used option is `prometheus-slurm-exporter`, which extracts metrics from SLURM CLI commands and exposes them on a `/metrics` endpoint (default port `:8080`).
Run the exporter on a node that has access to SLURM commands.
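A minimal sketch of starting and verifying the exporter, assuming the `prometheus-slurm-exporter` binary is already installed on a node with SLURM CLI access (exact flags vary by exporter version; the default listen port is `:8080`):

```shell
# Start the exporter with its defaults on a node that can run sinfo/squeue/sdiag
./prometheus-slurm-exporter &

# Verify it is serving metrics before pointing the collector at it
curl -s http://localhost:8080/metrics | head
```

If the `curl` output shows Prometheus-format metric lines, the exporter is working and you can move on to the collector configuration.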
Step 2: Setup OTel Collector
Refer to the SigNoz OpenTelemetry Collector installation documentation to set up the collector.
Step 3: Configure the Prometheus Receiver
Add a scrape job for the SLURM exporter in your OTel Collector config:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "slurm-exporter"
          scrape_interval: 30s
          scrape_timeout: 30s
          static_configs:
            - targets: ["<slurm-exporter-host>:8080"]
```
Configuration parameters:
- `<slurm-exporter-host>`: Hostname or IP of the node running the SLURM exporter
- `scrape_interval` / `scrape_timeout`: 30s is recommended to avoid overloading the SLURM master
Step 4: Enable the Pipeline
Add the receiver to your metrics pipeline:
```yaml
service:
  pipelines:
    metrics:
      receivers: [prometheus] # append prometheus to your existing receivers list
      processors: [batch]
      exporters: [otlp]
```
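After updating the config, restart the collector and check its logs for scrape errors. The commands below assume a systemd-based install with a service named `otelcol-contrib`; adjust the service name to match your deployment:

```shell
# Restart the collector so the new prometheus receiver takes effect
# (service name is an assumption; yours may differ)
sudo systemctl restart otelcol-contrib

# Watch the logs for errors from the slurm-exporter scrape job
sudo journalctl -u otelcol-contrib -f
```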
Visualizing SLURM Metrics
Once configured, verify ingestion in the Metrics Explorer. Search for SLURM-related metrics (exact names depend on the exporter).
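To see which metric names your exporter actually emits before searching in the Metrics Explorer, you can inspect the exporter's endpoint directly (replace `<slurm-exporter-host>` with your host; the `slurm` prefix is typical for `prometheus-slurm-exporter` but names vary by exporter and version):

```shell
# List SLURM-related metric names exposed by the exporter
curl -s http://<slurm-exporter-host>:8080/metrics | grep -E '^slurm' | head
```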
You can use the pre-configured SLURM dashboard to monitor your cluster:
Dashboards → + New dashboard → Import JSON
Troubleshooting
Common Issues
No metrics appearing in SigNoz
- Verify the SLURM exporter is running and the `/metrics` endpoint is accessible
- Ensure the firewall allows access to the exporter port
Metrics showing stale or zero values
- Confirm the exporter host has access to SLURM CLI commands
- Check if SLURM services are running correctly
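The checks above can be run directly on the exporter host. A quick sketch, assuming the SLURM CLI tools are on `PATH`:

```shell
# SLURM CLI is installed and reachable
sinfo --version

# Scheduler responds with diagnostics (stale values often mean this fails)
sdiag | head

# Queue data is readable from this host
squeue | head
```

If any of these commands hang or error, the exporter will report stale or zero values; fix SLURM connectivity from this host first.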