While reading a GitHub issue on the OpenTelemetry Collector about sending two versions of a metric, one with higher and one with lower cardinality, it occurred to me that we’ve never written on this blog about what high-cardinality data actually is and why it matters for your OpenTelemetry observability.

High-cardinality data refers to a dataset or data attribute that contains a large number of distinct values relative to the total number of data points; in other words, many of its entries are unique or near-unique compared to the overall size of the dataset.

What is High Cardinality Data?

High cardinality data is characterized by a large number of distinct values within a dataset. In database terms, cardinality refers to the uniqueness of data values in a column. A high cardinality column contains many unique values, while a low cardinality column has few unique values.

Examples of high cardinality fields include:

  • User IDs in a large application
  • Timestamps in time-series data
  • IP addresses in network logs
  • Product SKUs in an e-commerce database

To illustrate, consider a database of customer orders:

  • Low cardinality: Order status (e.g., "Pending," "Shipped," "Delivered")
  • High cardinality: Order ID (unique for each order)

High cardinality data provides granular information but poses challenges for storage, indexing, and querying.
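
To make the distinction concrete, here is a minimal Python sketch (the orders are synthetic, generated data) that counts the distinct values in each column; the ratio of distinct values to total rows is a quick way to gauge cardinality:

    import random

    # Synthetic sample: 10,000 hypothetical customer orders
    statuses = ["Pending", "Shipped", "Delivered"]
    orders = [
        {"order_id": f"ORD-{i:06d}", "status": random.choice(statuses)}
        for i in range(10_000)
    ]

    def cardinality(rows, column):
        """Number of distinct values and the distinct/total ratio for one column."""
        distinct = len({row[column] for row in rows})
        return distinct, distinct / len(rows)

    print("order_id:", cardinality(orders, "order_id"))  # (10000, 1.0)  -> high cardinality
    print("status:  ", cardinality(orders, "status"))    # (3, 0.0003)   -> low cardinality

A ratio close to 1.0 means nearly every row carries a unique value, which is exactly the situation that strains indexes and metric stores.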

What Makes Data High Cardinality?

  1. Uniqueness of values:
    High-cardinality data exhibits a wide variety of unique values. Each value occurs relatively infrequently compared to the total number of data points. For example, if you have a dataset of user logs, a high-cardinality attribute could be the IP address, where each IP address is unique or occurs with low frequency.
  2. Granularity and dimensionality:
    High-cardinality data often represents attributes or dimensions that provide detailed information and granularity.

Here are some examples of high-cardinality data attributes in JSON format:

  1. User IDs:

    {
      "user_id": "ABC123",
      "name": "John Doe",
      "email": "john.doe@example.com",
      ...
    }

    When tracking user IDs, there’s very little statistical summary that’s actually useful. It might be interesting to see the top 10 user IDs with transactions on our service, but there’s almost certainly a ‘long tail’ where every single user ID occurs only one to three times (the sketch after this list illustrates this).

  2. Device IDs:

    {
      "device_id": "DEVICE987",
      "type": "Mobile",
      "manufacturer": "Apple",
      ...
    }

    Capturing an attribute like device_id shows why it’s critical to understand what a metric means rather than taking the data at face value. Depending on context, device_id might be unique for every physical device, or unique only per model. If it’s the latter, we can produce quite useful summaries of this high-cardinality data by using a lookup table to group devices by year, size, and so on (see the sketch after this list).

  3. Event types:

    {
      "event_type": "click",
      "timestamp": "2023-06-25T10:15:00Z",
      ...
    }

    The timestamp here is a prime example of high-cardinality data, even though event_type itself is low cardinality. You can compute plenty of statistics over event times if your query language supports it, but simple aggregates like averages, sums, or medians of raw timestamps aren’t meaningful.

    These examples demonstrate attributes that can have a large number of distinct values. Each attribute, such as user IDs, SKUs, device IDs, locations, tags, event types, etc., may have a high cardinality due to the variety and uniqueness of values.
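
As a rough illustration of the first two examples above (the event data is synthetic and the device lookup table is hypothetical), the sketch below shows the ‘long tail’ of user IDs and how a lookup table can fold a high-cardinality device_id into a low-cardinality group:

    import random
    from collections import Counter

    # Synthetic events: most user IDs appear only a handful of times (the "long tail")
    user_ids = [f"user-{random.randint(1, 50_000)}" for _ in range(100_000)]
    counts = Counter(user_ids)

    top10_share = sum(n for _, n in counts.most_common(10)) / len(user_ids)
    print(f"distinct user IDs: {len(counts)}")
    print(f"top 10 user IDs cover only {top10_share:.2%} of all events")

    # Hypothetical lookup table mapping a per-unit device_id to coarser attributes.
    # Grouping turns a high-cardinality attribute into something we can summarize.
    device_groups = {
        "DEVICE987": {"model": "iPhone 14", "year": 2022, "size": "6.1in"},
        "DEVICE123": {"model": "Pixel 7", "year": 2022, "size": "6.3in"},
    }

    def device_year(device_id):
        """Fall back to 'unknown' for devices missing from the lookup table."""
        return device_groups.get(device_id, {}).get("year", "unknown")

    print(device_year("DEVICE987"))  # 2022
    print(device_year("DEVICE000"))  # unknown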

The Importance of Dimensionality in High Cardinality Data

Dimensionality refers to the number of attributes or features in a dataset. High cardinality often correlates with high dimensionality, as datasets with many unique values typically have numerous attributes.

The relationship between cardinality and dimensionality impacts data analysis:

  1. Increased complexity: More dimensions mean more possible combinations of values.
  2. Sparse data: High-dimensional spaces often contain sparse data points.
  3. Curse of dimensionality: As dimensions increase, the volume of the space increases so fast that available data becomes sparse.

Despite challenges, high dimensionality in observability data can provide rich insights. It allows for more detailed analysis of system behavior, user patterns, and anomaly detection.
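
One way to see the interaction between dimensionality and cardinality: the number of distinct label combinations (and therefore distinct time series for a metric) is bounded by the product of each attribute’s cardinality. A quick back-of-the-envelope sketch in Python, with made-up per-attribute counts:

    from math import prod

    # Hypothetical per-attribute cardinalities for a single metric
    label_cardinality = {
        "service": 20,
        "endpoint": 50,
        "status_code": 6,
        "region": 8,
        "user_id": 100_000,  # a single high-cardinality label dominates everything
    }

    # Worst case: every combination of label values is observed at least once
    worst_case_series = prod(label_cardinality.values())
    print(f"{worst_case_series:,} potential time series")           # 4,800,000,000

    # Dropping (or bucketing) the user_id label changes the picture entirely
    without_user_id = prod(v for k, v in label_cardinality.items() if k != "user_id")
    print(f"{without_user_id:,} potential series without user_id")  # 48,000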

Benefits of High Cardinality Data

Why include high-cardinality attributes? There are several reasons to make sure you’re capturing unique, or at least low-frequency, values for some of your metrics.

Analytical significance

High-cardinality data is often significant for analytics and data exploration. Analyzing and aggregating high-cardinality attributes can provide insights into user behavior, patterns, preferences, or other dimensions of interest. For example, analyzing user IDs can help understand individual user actions and behavior patterns.

High Cardinality for High Value Transactions

I remember supporting the observability configuration for an enterprise client, in this case a B2B bank. They were asking for alerts any time a single transaction took more than 5 minutes.

While we worked to support their needs, it took me a while to realize that every transaction of this type was a payment of more than a million dollars! When every single transaction has a business impact, it’s critical to capture as much information as possible.

In these cases, low-cardinality data becomes much less useful. It doesn’t matter that our average response time looks fine if our highest-value client isn’t having a good experience.
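
Here’s a toy sketch of that situation (the numbers are invented): the average latency looks perfectly healthy, and only the per-transaction data, keyed by a high-cardinality transaction ID, reveals the one payment that breached the 5-minute threshold:

    # Hypothetical per-transaction latencies in seconds: mostly fast, one very slow
    transactions = [{"transaction_id": f"txn-{i:04d}", "duration_s": 4.0} for i in range(1_000)]
    transactions.append({"transaction_id": "txn-HV-0042", "duration_s": 390.0})  # > 5 minutes

    avg = sum(t["duration_s"] for t in transactions) / len(transactions)
    print(f"average latency: {avg:.2f}s")  # ~4.4s -- the aggregate hides the problem

    # Per-transaction (high-cardinality) data is what lets us alert on the one that matters
    THRESHOLD_S = 5 * 60
    for t in transactions:
        if t["duration_s"] > THRESHOLD_S:
            print(f"ALERT: {t['transaction_id']} took {t['duration_s']:.0f}s")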

Challenges with High Cardinality Data

  • Indexing challenges:
    Storing and indexing high-cardinality data can present challenges. Traditional indexing techniques, such as B-tree indexes, may not be efficient for high-cardinality attributes because the index size would be too large, resulting in slower query performance.
  • Memory and storage considerations:
    Storing high-cardinality data may require additional memory and storage resources compared to low-cardinality data. The increased number of unique values can contribute to larger index sizes, more memory consumption, and potentially increased storage requirements.
  • Lack of high-level metrics:
    This one is counter-intuitive: if you’re storing SKUs, of course there’s no ‘average’ SKU. But when you’re used to presenting high-level metrics like average request time or most common geo area, it stands out that there’s no way to present an average, or even a most-frequent value, meaningfully. In my experience working with APM tools, this came up regularly: high-level dashboards would display a ‘most frequent’ value for a metric, but on closer inspection that value had occurred three times across 2 million rows, meaning the ‘number 1 user ID’ represented roughly 0.00015% of all transactions.

Why High Cardinality Data Matters in Modern Applications

High cardinality data plays a crucial role in various modern applications:

  1. Time-series databases: Essential for storing and analyzing data points with unique timestamps.
  2. Analytics precision: Enables granular analysis of user behavior, system performance, and business metrics.
  3. Industrial IoT: Facilitates tracking of individual sensors and devices in large-scale deployments.
  4. Decision-making: Provides detailed insights for data-driven decision-making in business and technology.

For example, in e-commerce, high cardinality data allows for personalized product recommendations based on individual user behavior and preferences.

The Impact of High Cardinality on Time-Series Databases

Time-series databases face unique challenges with high cardinality data:

  • Increased write amplification: More unique series require more write operations.
  • Query performance degradation: Searching across numerous unique series can slow down queries.
  • Higher memory usage: Indexing high cardinality data consumes more memory.

Different time-series databases handle high cardinality data differently:

  • InfluxDB uses a Time-Structured Merge Tree (TSM) to efficiently store and query high cardinality data.
  • TimescaleDB leverages PostgreSQL's indexing capabilities to manage high cardinality fields.

Optimizing time-series data with high cardinality often involves trade-offs between query performance and data retention. Strategies include:

  1. Downsampling: Aggregating data points over time to reduce cardinality.
  2. Selective indexing: Carefully choosing which fields to index based on query patterns.
  3. Partitioning: Dividing data into smaller, more manageable chunks.
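
Downsampling (the first strategy above) might look like the following sketch, where synthetic per-second samples are rolled up into one-minute averages, cutting the number of stored points by a factor of 60:

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Synthetic per-second samples over ten minutes
    start = datetime(2023, 6, 25, 10, 0, 0)
    points = [(start + timedelta(seconds=i), 100 + (i % 7)) for i in range(600)]

    # Time-based downsampling: aggregate each 1-minute window into a single average
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts.replace(second=0, microsecond=0)].append(value)

    downsampled = {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
    print(f"{len(points)} raw points -> {len(downsampled)} downsampled points")  # 600 -> 10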

Solutions and Best Practices for Managing High Cardinality Data

To effectively manage high cardinality data, consider these approaches:

  1. Efficient indexing techniques:
    • B-trees: Effective for range queries, though index size grows with the number of distinct values.
    • Inverted indexes: Useful for full-text search in high cardinality text fields.
  2. Data modeling strategies:
    • Denormalization: Storing redundant data to improve query performance.
    • Bucketing: Grouping similar values to reduce cardinality.
  3. Aggregation and downsampling:
    • Pre-aggregation: Calculating and storing aggregate values for common queries.
    • Time-based downsampling: Reducing data resolution over time.
  4. Columnar storage:
    • Leveraging column-oriented databases for improved compression and query performance on high cardinality data.
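
As a small illustration of bucketing (strategy 2 above; the thresholds here are arbitrary), raw response times with effectively unbounded cardinality can be folded into a handful of named buckets that are cheap to index and aggregate:

    # Arbitrary example thresholds for bucketing raw response times (milliseconds)
    BUCKETS = [(100, "fast"), (500, "normal"), (2_000, "slow")]

    def bucket(latency_ms):
        """Map a raw, high-cardinality latency value to a low-cardinality label."""
        for upper, label in BUCKETS:
            if latency_ms <= upper:
                return label
        return "very_slow"

    raw_latencies = [23.4, 87.1, 412.9, 1_250.0, 6_033.7]
    print([bucket(ms) for ms in raw_latencies])
    # ['fast', 'fast', 'normal', 'slow', 'very_slow']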

Observability Tools for High Cardinality Data

Observability is crucial for understanding system behavior in distributed systems with high cardinality data. SigNoz, an open-source observability platform, offers powerful capabilities for handling high cardinality data:

  • Efficient data ingestion and storage optimized for high cardinality fields.
  • Advanced querying and visualization tools for exploring high-dimensional data.
  • Anomaly detection and alerting based on high cardinality metrics and traces.

SigNoz Cloud is the easiest way to run SigNoz. Sign up for a free account and get 30 days of unlimited access to all features.

You can also install and self-host SigNoz yourself since it is open source. With 18,000+ GitHub stars, open-source SigNoz is loved by developers. Find the instructions to self-host SigNoz.

Compared to traditional monitoring solutions, SigNoz provides better scalability and performance for high cardinality scenarios, enabling deeper insights into complex systems.

Future Trends in High Cardinality Data Management

As data volumes and complexity continue to grow, several trends are emerging in high cardinality data management:

  1. Advanced database technologies:
    • Vector databases for efficient similarity search in high-dimensional spaces.
    • Graph databases for managing complex relationships in high cardinality data.
  2. Machine learning approaches:
    • Dimensionality reduction techniques like t-SNE and UMAP for visualizing high-dimensional data.
    • Anomaly detection algorithms designed for high cardinality scenarios.
  3. Emerging use cases:
    • Edge computing generating vast amounts of high cardinality data from IoT devices.
    • Blockchain and distributed ledger technologies producing high cardinality transaction data.
  4. Evolution of data storage and analysis:
    • Serverless data platforms for scalable processing of high cardinality data.
    • Quantum computing potential for handling complex, high-dimensional datasets.

Conclusions

Managing high-cardinality data efficiently often involves careful consideration of storage strategies, indexing techniques, and data modeling approaches. It's important to strike a balance between capturing the necessary granularity of data and optimizing performance and resource utilization for analysis and querying purposes.

High-cardinality data also comes with considerations, such as increased storage requirements, data management complexities, and potential privacy concerns. Platform teams should carefully evaluate the benefits and costs associated with storing high-cardinality data and implement appropriate data retention policies to ensure efficient and responsible data usage.

FAQs

What's the difference between high cardinality and high dimensionality?

High cardinality refers to many unique values within a single attribute, while high dimensionality involves many attributes or features in a dataset. They often occur together but are distinct concepts.

How does high cardinality affect database performance?

High cardinality can slow down queries, increase storage requirements, and complicate indexing. It may lead to longer search times and higher resource usage, especially in large datasets.

Can you provide examples of high cardinality data in real-world applications?

Examples include user IDs in social media platforms, GPS coordinates in mapping applications, and unique product identifiers in e-commerce systems. Each instance represents a field with numerous unique values.

What are the best practices for designing schemas with high cardinality fields?

Best practices include:

  1. Using appropriate data types to minimize storage.
  2. Carefully selecting indexes based on query patterns.
  3. Considering denormalization for frequently accessed data.
  4. Implementing partitioning strategies to manage large datasets.
  5. Employing columnar storage for analytical workloads.
