ClickHouse Monitoring Dashboard

This dashboard monitors ClickHouse database performance: query execution times, resource utilization, connection metrics, replication health, and storage operations. Use it to watch your ClickHouse deployments and troubleshoot performance issues.

To import the dashboard, go to Dashboards → + New dashboard → Import JSON.

What This Dashboard Monitors

This dashboard tracks essential ClickHouse metrics to help you:

  • Query Performance Monitoring: Track query execution times, throughput, and failed queries
  • Resource Utilization: Monitor CPU, memory, and disk usage across ClickHouse nodes
  • Connection Management: Track active connections, connection pool status, and session metrics
  • Replication Health: Monitor replica lag, synchronization status, and cluster health
  • Storage Operations: Track disk usage, merge operations, part statistics, and storage efficiency
  • System Performance: Analyze background processes, cache performance, and system load

Metrics Included

Query Performance Section

Query Execution Metrics

  • Queries Per Second (QPS): Real-time query throughput showing database load

    • Grouped by query type (SELECT, INSERT, etc.) for detailed analysis
    • Essential for capacity planning and performance optimization
  • Query Execution Time: Response time metrics including:

    • Average Duration: Mean query execution time
    • P50/P90/P99 Latency: Percentile-based latency analysis
    • Helps identify slow queries and performance bottlenecks
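
To illustrate why percentiles are shown next to the average, the sketch below computes the same four figures from a hand-made list of query durations. It uses the nearest-rank method in plain Python. The function names and sample values are assumptions made for illustration, and the dashboard's own panels may compute percentiles differently (for example, with approximate quantiles).

```python
import math


def percentile(durations_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is an integer between 1 and 100."""
    if not durations_ms:
        raise ValueError("no samples")
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(durations_ms):
    """Return the average and the P50/P90/P99 latencies for a batch of queries."""
    return {
        "avg": sum(durations_ms) / len(durations_ms),
        "p50": percentile(durations_ms, 50),
        "p90": percentile(durations_ms, 90),
        "p99": percentile(durations_ms, 99),
    }


# Hypothetical sample: nine fast queries and one very slow one (milliseconds).
sample = [12, 15, 11, 14, 13, 16, 18, 17, 19, 900]
summary = latency_summary(sample)
print(summary)
```

In this sample the single slow query pulls the average up to 103.5 ms while the median stays at 15 ms, which is why the P99 panel is usually the better place to look for outliers.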

Query Status Tracking

  • Failed Query Rate: Percentage of failed queries over time

    • Critical for identifying database issues and application problems
    • Grouped by error type for detailed troubleshooting
  • Query Types Distribution: Breakdown of query types (SELECT, INSERT, UPDATE, DELETE)

    • Shows workload patterns and database usage trends
    • Useful for optimizing database schema and indexes
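
The failed query rate and its breakdown by error type can be sketched as a simple aggregation. The example below is a minimal illustration in plain Python, not the dashboard's actual query: the `(query_kind, error)` record shape and the sample rows are assumptions.

```python
from collections import Counter


def failed_query_rate(queries):
    """Return (failure percentage, failures per error type).

    `queries` is an iterable of (query_kind, error) pairs, where `error`
    is None for a successful query. This record shape is hypothetical.
    """
    total = 0
    failures = Counter()
    for _kind, error in queries:
        total += 1
        if error is not None:
            failures[error] += 1
    failed = sum(failures.values())
    rate = 100.0 * failed / total if total else 0.0
    return rate, dict(failures)


# Hypothetical sample: eight queries, two of which failed.
sample = [
    ("SELECT", None),
    ("SELECT", "MEMORY_LIMIT_EXCEEDED"),
    ("INSERT", None),
    ("SELECT", None),
    ("SELECT", "TIMEOUT_EXCEEDED"),
    ("INSERT", None),
    ("SELECT", None),
    ("SELECT", None),
]
rate, by_error = failed_query_rate(sample)
print(f"{rate:.1f}% failed", by_error)
```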

Resource Utilization Section

CPU and Memory Performance

  • CPU Utilization: Host and ClickHouse process CPU consumption

    • Tracks both system-wide and ClickHouse-specific CPU usage
    • Essential for identifying processing bottlenecks
  • Memory Usage: Memory consumption metrics including:

    • Total Memory Usage: Overall memory consumption by ClickHouse
    • Query Memory Usage: Memory used by active queries
    • Cache Memory: Memory used for various ClickHouse caches

Storage Metrics

  • Disk Usage: Storage consumption across ClickHouse data directories

    • Tracks data growth over time to inform capacity planning
    • Grouped by disk and partition for detailed analysis
  • Disk I/O Operations: Read and write operations per second

    • Monitor disk performance and identify I/O bottlenecks
    • Essential for storage optimization and capacity planning
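
For capacity planning, a disk usage trend can be turned into a rough "days until full" estimate. The sketch below assumes one usage sample per day and linear growth, which is a simplification; the function name and the numbers are hypothetical.

```python
def days_until_full(used_by_day, capacity):
    """Estimate the days left before a disk fills up, assuming linear growth.

    `used_by_day` holds one usage sample per day (oldest first), in the
    same unit as `capacity`. Returns None if usage is flat or shrinking.
    """
    if len(used_by_day) < 2:
        raise ValueError("need at least two daily samples")
    growth_per_day = (used_by_day[-1] - used_by_day[0]) / (len(used_by_day) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity - used_by_day[-1]) / growth_per_day


# Hypothetical disk: 500 GiB capacity, growing by 10 GiB per day.
estimate = days_until_full([400, 410, 420, 430], capacity=500)
print(estimate)
```

A linear projection ignores merges, TTL deletes, and bursty ingestion, so treat the result as an early warning rather than a forecast.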

Connection and Session Management

Connection Metrics

  • Active Connections: Current number of client connections

    • Tracks connection pool utilization and capacity
    • Important for connection limit management
  • Connection Rate: New connections per second

    • Monitors connection establishment patterns
    • Helps identify connection leaks or unusual access patterns
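
A per-second connection rate is typically derived from a cumulative counter sampled at intervals. The sketch below shows that calculation under the assumption that the source metric is such a counter and that it restarts from zero when the server restarts; the sample data is made up.

```python
def per_second_rate(samples):
    """Convert cumulative counter samples into per-second rates.

    `samples` is a list of (timestamp_seconds, cumulative_count) pairs,
    oldest first. A drop in the counter is treated as a reset to zero.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0 if c1 >= c0 else c1  # counter reset: count from zero
        rates.append((t1, delta / elapsed))
    return rates


# Hypothetical samples taken every 10 seconds; the counter resets before t=30.
rates = per_second_rate([(0, 100), (10, 150), (20, 230), (30, 20)])
print(rates)
```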

Session Analysis

  • Active Sessions: Currently executing queries and their duration

    • Shows concurrent query execution and resource usage
    • Critical for identifying long-running or stuck queries

Replication and Cluster Health

Replication Monitoring

  • Replica Lag: Maximum lag time between replicas

    • Tracks replication health and data consistency
    • Critical for high-availability deployments
  • Replication Queue: Number of pending replication tasks

    • Shows replication backlog and potential issues
    • Important for maintaining cluster synchronization
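
One way to read the two replication panels together is to classify each replica from its lag and queue size. The sketch below uses field names that match columns of ClickHouse's `system.replicas` table (`absolute_delay`, `queue_size`, `is_readonly`), but whether this dashboard reads that table is an assumption, and the thresholds are hypothetical defaults to adjust for your workload.

```python
def replica_status(row, max_delay_s=300, max_queue=100):
    """Classify one replica as 'ok', 'warning', or 'critical'.

    `row` is a dict with `is_readonly`, `absolute_delay` (seconds), and
    `queue_size`. The thresholds are illustrative, not recommendations.
    """
    if row["is_readonly"]:
        return "critical"  # a read-only replica cannot accept inserts
    if row["absolute_delay"] > max_delay_s or row["queue_size"] > max_queue:
        return "warning"
    return "ok"


# Hypothetical rows, one per replicated table on a node.
replicas = [
    {"table": "events", "is_readonly": 0, "absolute_delay": 2, "queue_size": 3},
    {"table": "logs", "is_readonly": 0, "absolute_delay": 900, "queue_size": 40},
    {"table": "users", "is_readonly": 1, "absolute_delay": 0, "queue_size": 0},
]
statuses = {r["table"]: replica_status(r) for r in replicas}
print(statuses)
```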

Cluster Statistics

  • Cluster Node Status: Health status of all cluster nodes

    • Monitors node availability and cluster integrity
    • Essential for distributed ClickHouse deployments
  • Shard Distribution: Data distribution across cluster shards

    • Shows balance of data and query load
    • Important for cluster optimization

Storage and Merge Operations

Part Management

  • Parts Count: Number of data parts per table and partition

    • Tracks storage fragmentation and merge efficiency
    • A high part count may indicate that merges are not keeping up with inserts
  • Parts Size Distribution: Size analysis of data parts

    • Shows storage efficiency and compression effectiveness
    • Useful for optimizing storage and query performance
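
The parts count panel amounts to counting active parts per table and partition and flagging the outliers. The sketch below shows that grouping in plain Python. The row fields mirror columns of ClickHouse's `system.parts` table (`table`, `partition`, `active`), while the sample rows and the threshold are hypothetical.

```python
from collections import defaultdict


def parts_per_partition(parts):
    """Count active parts for each (table, partition) pair."""
    counts = defaultdict(int)
    for part in parts:
        if part["active"]:  # inactive parts were already merged away
            counts[(part["table"], part["partition"])] += 1
    return dict(counts)


def fragmented(counts, threshold):
    """Return the (table, partition) pairs with more parts than `threshold`."""
    return sorted(key for key, n in counts.items() if n > threshold)


# Hypothetical rows; a real deployment would have far more parts.
parts = [
    {"table": "events", "partition": "202501", "active": 1},
    {"table": "events", "partition": "202501", "active": 1},
    {"table": "events", "partition": "202501", "active": 1},
    {"table": "events", "partition": "202501", "active": 0},
    {"table": "events", "partition": "202502", "active": 1},
    {"table": "logs", "partition": "202501", "active": 1},
    {"table": "logs", "partition": "202501", "active": 1},
]
counts = parts_per_partition(parts)
print(fragmented(counts, threshold=2))
```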

Background Operations

  • Merge Operations: Active and queued merge operations

    • Tracks background maintenance activities
    • Important for understanding storage optimization progress
  • Mutation Operations: ALTER TABLE mutations (such as UPDATE and DELETE) and their progress

    • Monitors schema changes and data modifications
    • Critical for tracking long-running operations

Dashboard Variables

The dashboard provides the following variables for filtering:

  • cluster: Filter by ClickHouse cluster name for multi-cluster environments
  • database: Select specific database for focused monitoring
  • table: Filter metrics by specific table names
  • node: Select individual ClickHouse nodes for node-specific analysis
  • query_type: Filter by query types (SELECT, INSERT, etc.)
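
Conceptually, each variable narrows the set of series shown, and an unset variable matches everything. The sketch below models that behavior on a list of labeled rows. It is an illustration of the filtering semantics only: the row layout, the sample values, and the use of `None` for "no filter" are assumptions, not the dashboard's implementation.

```python
def matches(row, selections):
    """True if `row` satisfies every selected variable.

    `selections` maps a variable name (cluster, database, table, node,
    query_type) to a chosen value, or to None for "no filter".
    """
    return all(
        value is None or row.get(name) == value
        for name, value in selections.items()
    )


# Hypothetical labeled metric rows.
rows = [
    {"cluster": "prod", "database": "analytics", "table": "events",
     "node": "ch-1", "query_type": "SELECT", "qps": 120},
    {"cluster": "prod", "database": "analytics", "table": "events",
     "node": "ch-2", "query_type": "INSERT", "qps": 15},
    {"cluster": "staging", "database": "analytics", "table": "events",
     "node": "ch-9", "query_type": "SELECT", "qps": 4},
]

selections = {"cluster": "prod", "database": None, "table": None,
              "node": None, "query_type": "SELECT"}
visible = [row for row in rows if matches(row, selections)]
print([row["node"] for row in visible])
```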

Last updated: January 3, 2025
