What is High Cardinality? Explained with Interactive Examples
High cardinality is the silent killer of monitoring systems. It often crashes Prometheus servers, inflates cloud bills, and turns simple queries into minute-long waits. This guide explains why it happens and what you can do about it.
First, What is Cardinality?
Cardinality is simply the count of unique values in a set.
Consider a simple example: you have 40 users in your system. Each user can be grouped in different ways: by region, by user ID, or simply counted as part of the whole.
The interactive below shows the same 40 users. Try grouping them in different ways and watch how the number of groups changes.
Count everyone together and you get 1 group, group by region and you get 4, and group by user ID and you get 40 groups, one for each unique user.
That is cardinality: the number of unique values. Region has low cardinality (4). User ID has high cardinality (40, and it grows with every new user).
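In code, that is nothing more than counting distinct values. Here is a minimal Python sketch with made-up users, mirroring the example above:

```python
import random

# 40 hypothetical users, each assigned one of 4 regions (illustrative data only)
regions = ["US-East", "US-West", "EU", "APAC"]
users = [{"user_id": f"user-{i}", "region": random.choice(regions)} for i in range(40)]

# Cardinality is just the number of distinct values of an attribute
region_cardinality = len({u["region"] for u in users})    # at most 4
user_id_cardinality = len({u["user_id"] for u in users})  # 40, and it grows with every new user

print(region_cardinality, user_id_cardinality)
```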
Low Cardinality vs. High Cardinality
Think of a database like a filing cabinet.
If you organize files by region, you have 4 drawers: US-East, US-West, EU, and APAC. Finding a file is easy: you open the right drawer and grab it. The database can do this quickly because there are only 4 places to look.
Now imagine organizing by user ID. With 40 users, you need 40 drawers. With a million users, you need a million drawers. The filing cabinet becomes a warehouse. Finding something is still possible, but the system is working much harder to keep track of where everything is.
This is the core issue with high cardinality: more unique values means more things for the database to track, organize, and search through.
The table below shows different columns of a database table with varying levels of cardinality. Try hovering over the headers to see their cardinality estimates:
Neither is inherently "bad." High cardinality fields like user_id are incredibly useful for debugging: you want to know which specific user had a problem. The challenge is that they require different strategies to handle efficiently. More on that later.
The Cardinality Explosion
To understand cardinality explosion, you first need to understand how metrics work in time-series databases like Prometheus.
A metric is not just "CPU usage." A metric is CPU usage with labels attached. These labels allow you to slice and filter your metrics, but each unique combination of labels creates a separate data stream to track.
Here is how it works:
- One server: A metric like cpu_usage with a single label host=server-1 creates 1 data stream.
- Two servers: Add another host and you have 2 streams: one for server-1, one for server-2.
- Add users: If you also label by user_id, the count multiplies. With 2 servers and 3 users, you jump to 6 streams.
# One server = 1 time series
cpu_usage{host="server-1"}
# Two servers = 2 time series
cpu_usage{host="server-1"}
cpu_usage{host="server-2"}
# Now add user_id... this explodes fast
cpu_usage{host="server-1", user_id="123"}
cpu_usage{host="server-1", user_id="456"}
cpu_usage{host="server-2", user_id="123"}
... and so on for every user
With 2 servers and 1,000 users, you now have 2 × 1,000 = 2,000 time series from a single metric.
Each unique combination of label values creates a new time series, and as we add more labels, the number of time series multiplies. The formula is simple:
Total Series = (distinct values of Label₁) × (distinct values of Label₂) × (distinct values of Label₃) × ...
A single metric with four labels can theoretically explode into billions of time series. Try it below by clicking each label to activate it and watch the count grow.
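To get a feel for how fast this multiplication gets out of hand, here is a small sketch; the label counts are hypothetical, chosen only to illustrate the formula:

```python
from math import prod

# Hypothetical per-label cardinalities for one metric
label_values = {"host": 2, "user_id": 1_000}
print(prod(label_values.values()))  # 2 x 1,000 = 2,000 series

# Four labels with plausible (made-up) value counts
label_values = {"host": 100, "endpoint": 50, "status": 10, "user_id": 100_000}
print(f"{prod(label_values.values()):,}")  # 5,000,000,000 potential series
```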
How High Cardinality Kills Query Performance
Databases use something called an index to find data quickly. Think of it like the alphabetical tabs in a phone book: instead of reading every page to find "Smith," you jump straight to the "S" section.
Indexes work beautifully when there are a small number of categories to organize by. A phone book for a small town is thin and fast to search. But a phone book for the entire world? It becomes impractical.
When you query by a low-cardinality field like status=200, the database looks up "200" in the index and immediately finds all matching rows. Fast.
When you query by a high-cardinality field like user_id=abc123, the index still works but it has millions of entries to manage. This means finding your specific user requires navigating a massive index structure, which takes more time and memory.
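One rough way to picture an index in code is a map from each value to the rows that contain it. The sketch below uses synthetic data; the point is that the status index stays tiny while the user_id index needs one entry per user:

```python
import random
from collections import defaultdict

# Synthetic dataset of 100,000 requests (scaled down; real datasets are far larger)
rows = [
    {"status": random.choice([200, 200, 200, 404, 500]), "user_id": f"user-{i}"}
    for i in range(100_000)
]

def build_index(rows, field):
    """Map each distinct value of `field` to the positions of rows that contain it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field]].append(pos)
    return index

status_index = build_index(rows, "status")  # a handful of keys (~3 here)
user_index = build_index(rows, "user_id")   # one key per user: 100,000 entries

print(len(status_index), len(user_index))

# Both lookups are single dictionary hits, but the high-cardinality index
# costs far more memory and bookkeeping just to make that hit possible.
hits_200 = status_index[200]
hits_user = user_index["user-123"]
```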
Watch the race below - both queries search the same dataset, but one filters by status code (~50 unique values) and the other by user ID (1 million unique values).
Low Cardinality Query (Indexed Lookup): the database jumps directly to the matching bucket. It doesn't scan everything.
High Cardinality Query (Full Scan): the database must check every single index entry because there are too many unique values to group effectively.
The low-cardinality query uses a compact index and returns almost instantly. The high-cardinality query must work through a much larger index structure. In production, this difference cascades: slow queries block other queries, consume CPU, and can spiral into a full outage.
This is why observability vendors often charge by cardinality: high cardinality genuinely costs more in infrastructure (more memory for indexes, more CPU for queries, more storage for data).
When Prometheus Runs Out of Memory
Prometheus keeps its active index entirely in memory. This is what makes it fast, but it is also its weakness when cardinality grows.
Here is what typically happens when high-cardinality labels are introduced:
- You add a label like user_id to a metric
- Each unique user ID creates a new time series
- Assuming each time series requires ~3-4 KB of memory (source)
- With 1 million users, that is 3-4 GB of RAM, just for one metric (a quick back-of-the-envelope calculation follows below)
- The container runs out of memory and gets killed (OOM)
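The arithmetic behind that OOM kill is easy to sanity-check. Here is a back-of-the-envelope sketch using the ~3-4 KB per-series figure above (the real number varies by Prometheus version and series churn):

```python
# Back-of-the-envelope RAM estimate (assumption: ~3.5 KB per active series)
users = 1_000_000      # one series per user for this metric
bytes_per_series = 3_500

ram_gb = users * bytes_per_series / 1e9
print(f"{users:,} series -> ~{ram_gb:.1f} GB of RAM for a single metric")
# 1,000,000 series -> ~3.5 GB of RAM for a single metric
```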
Drag the slider below to different user counts and watch how high-cardinality labels consume RAM.
Drag to increase users. Watch how RAM fills up linearly with the exploding series count.
OOM (out-of-memory) kills are the most common symptom of cardinality explosion. The container gets terminated mid-operation, potentially corrupting the write-ahead log (WAL), and you lose recent data. Teams often discover the problem only after losing visibility during an incident, exactly when they needed monitoring most.
The Reddit threads are full of these stories: "I added a request_id label for debugging and my Prometheus crashed overnight." "Why does my memory usage spike every deploy?" (Answer: new pod IDs creating new series with every restart.)

How Different Databases Handle High Cardinality
We have already seen why time-series databases like Prometheus struggle with high cardinality: their in-memory inverted index grows with every unique label combination until the server runs out of RAM. But Prometheus is not the only option. Let us look at how different database architectures handle this problem.
Time-Series Databases (Prometheus, InfluxDB)
TSDBs use an inverted index, a data structure that maps each unique label value to a list of matching time series. Think of it like a reverse lookup table: instead of "row → labels," it stores "label → rows."
This is extremely fast for low-cardinality queries like status=200. The database looks up "200" in the index and immediately retrieves all matching series.
The problem: this index must fit in memory. With 10 million unique user_id values, the index itself consumes gigabytes of RAM before you even store any actual data. When memory runs out, the database crashes.
Note: InfluxDB introduced TSI specifically to move the time-series index to disk (memory-mapped) and reduce RAM as the limiting factor, though high series cardinality can still hurt performance and memory.
Row-Oriented Databases (PostgreSQL, MySQL)
Traditional relational databases store data row by row. All the columns for a single row are stored together on disk, one after another.
Row stores don't create a new time series per label-set, so they avoid the index explosion problem. However, inserts still maintain secondary indexes (e.g., B-trees), which adds write overhead. The bigger issue for row stores is entropy.
When columns with very different data types and patterns are stored together (timestamps next to random UUIDs next to short status codes), the data stream becomes chaotic. This chaos makes compression nearly impossible. A billion random user_id values cannot be compressed efficiently because there are no patterns to exploit.
Apart from compression, analytical queries in row stores also suffer from high I/O. To answer "What is the average latency for user X?" the database must read entire rows, including columns you do not need, just to extract the latency values. This wastes enormous amounts of I/O.
Columnar Databases (ClickHouse, Apache Parquet)
Columnar stores flip the storage model: instead of storing rows together, they store each column separately.
This layout solves both problems we discussed:
- No postings index in RAM: Columnar systems still use indexes (e.g., primary key/data-skipping), but they usually rely more on compressed column scans than a per-label postings index held fully in RAM.
- Better compression: Values of the same type are stored together. A column of status codes (200, 200, 200, 500, 200...) compresses beautifully because there are patterns. Dictionary encoding helps when values repeat often, though it works best with moderate cardinality rather than millions of unique values.
- Efficient scans: To calculate average latency, the database reads only the latency column. The user_id and other columns stay on disk, untouched.
This is why modern analytics databases like ClickHouse can handle billions of unique values that would crash a Prometheus server. They trade the instant index lookup for fast column scans, and with modern SSDs and vectorized CPU instructions, those scans are surprisingly quick.
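The compression difference is easy to see for yourself. The sketch below compresses a repetitive status column and a random user_id column with plain zlib; the exact ratios will vary, and real columnar engines use their own codecs, but the repetitive column should compress dramatically better:

```python
import random
import uuid
import zlib

n = 100_000

# Column of status codes: few distinct values, lots of repetition
status_col = ",".join(random.choice(["200", "200", "200", "404", "500"]) for _ in range(n)).encode()

# Column of user IDs: essentially random, no patterns to exploit
user_col = ",".join(str(uuid.uuid4()) for _ in range(n)).encode()

for name, col in [("status", status_col), ("user_id", user_col)]:
    ratio = len(col) / len(zlib.compress(col))
    print(f"{name}: {len(col):,} bytes raw, compression ratio ~{ratio:.1f}x")
```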
This architectural difference explains why SigNoz (built on ClickHouse) can handle high-cardinality data that would bring down a traditional TSDB. For queries like "show me p99 latency grouped by service," columnar stores scan a single compressed column and vectorize the computation while an inverted-index TSDB must touch index entries for every matching series.
The tradeoff: TSDBs excel at real-time alerting on low-cardinality metrics because index lookups are instant. Columnar stores excel at analytical queries over high-cardinality data where you need to scan and aggregate. Choosing the right tool depends on your primary use case.
How Cardinality Affects Metrics, Logs, and Traces
High cardinality impacts each observability pillar differently:
Metrics
Metrics suffer the most. Every unique label combination creates a time series that must be tracked continuously. The golden rule: never use unbounded values as metric labels. If you need per-user data, use traces or logs instead.
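As a concrete illustration of that rule, here is what bounded versus unbounded labels look like with the Python prometheus_client library; the metric names and label choices are hypothetical:

```python
from prometheus_client import Counter

# Good: bounded labels; cardinality = services x status classes, a small, predictable number
http_requests = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["service", "status_class"],
)
http_requests.labels(service="checkout", status_class="2xx").inc()

# Bad: user_id is unbounded, so every new user mints a new time series.
# Put per-user context on spans or log lines instead of metric labels.
# requests_by_user = Counter("requests_total", "Requests", ["user_id"])  # avoid this
```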
Logs
Logs handle high cardinality better because they are naturally event-oriented. Each log line is independent and there is no ongoing "series" to maintain. The challenge is query efficiency: searching for a specific user_id in billions of logs requires good indexing or brute-force scanning. Columnar stores shine here.
Traces
Traces are designed for high cardinality. Each trace has a unique trace_id, and spans carry rich context. The tradeoff is volume: you cannot keep every trace at scale. This is where sampling strategies become essential: keep representative samples, prioritize errors and slow requests, and aggregate the rest.
Strategies for Managing High Cardinality
You have two main options for taming high cardinality: aggregation and sampling. They serve different use cases and make different tradeoffs.
Toggle between the modes below to see how each strategy transforms the same raw data.
150 raw events. Perfect detail, maximum storage cost.
Aggregation is best for metrics. Instead of tracking latency for each individual user, group latency values into ranges (like 0-100ms, 100-500ms, 500ms+) using histograms. Instead of counting requests per container_id, count per service. You lose individual data points but keep statistical accuracy.
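Here is a minimal sketch of the aggregation idea: instead of storing a latency value per user, bucket the values into a fixed set of ranges (the bucket boundaries here are arbitrary):

```python
from bisect import bisect_left
from collections import Counter

buckets_ms = [100, 500]                       # boundaries for 0-100ms, 100-500ms, 500ms+
labels = ["0-100ms", "100-500ms", "500ms+"]

def bucket(latency_ms: float) -> str:
    """Return the histogram bucket a latency value falls into."""
    return labels[bisect_left(buckets_ms, latency_ms)]

latencies = [12, 80, 430, 95, 1200, 240, 75]  # raw per-request latencies (illustrative)
histogram = Counter(bucket(l) for l in latencies)
print(histogram)  # Counter({'0-100ms': 4, '100-500ms': 2, '500ms+': 1})
```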
Sampling is best for traces. Keep 1% of successful requests, 100% of errors, and 100% of slow requests. Sampling gives you full detail for the requests you keep while dramatically reducing volume.
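And a sketch of the sampling rule described above: keep all errors and slow requests, and a small fraction of everything else. The threshold and sample rate are placeholders:

```python
import random

SLOW_MS = 1_000             # assumed threshold: anything slower than 1s is always kept
SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of ordinary successful requests

def keep_trace(status_code: int, duration_ms: float) -> bool:
    if status_code >= 500:      # keep 100% of errors
        return True
    if duration_ms >= SLOW_MS:  # keep 100% of slow requests
        return True
    return random.random() < SUCCESS_SAMPLE_RATE  # sample the rest

# A fast 200 is usually dropped; a slow 200 or any 500 is always kept
print(keep_trace(200, 35), keep_trace(200, 2400), keep_trace(500, 35))
```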
These strategies are not mutually exclusive. Use aggregated metrics for dashboards and alerting, sampled traces for debugging, and full logs for compliance.
The Takeaway
High cardinality is a fundamental constraint, not a bug to fix. Every unique value has a cost, and the question is whether that cost is worth it for your use case.
A few practical rules:
- For metrics: Never use unbounded label values. If you are tempted to add user_id as a label, use traces instead.
- For logs: Include high-cardinality fields freely, but choose a storage engine that handles them efficiently (columnar > inverted index).
- For traces: Embrace high cardinality but implement smart sampling. Keep errors and slow requests at 100%, sample successes aggressively.
- For architecture: If you expect high cardinality, choose columnar stores like ClickHouse over traditional TSDB architectures.
Understanding cardinality helps you make better decisions about what data to collect, how to structure it, and which tools to use. The monitoring system that never crashes is the one designed with cardinality in mind from the start.
Handling High Cardinality with SigNoz
If you are looking for an observability platform that handles high-cardinality data without the memory constraints of traditional TSDBs, SigNoz is built for exactly that. Powered by ClickHouse's columnar architecture, SigNoz handles metrics, traces, and logs at scale while keeping costs predictable.
Getting Started with SigNoz
You can choose between various deployment options in SigNoz. The easiest way to get started with SigNoz is SigNoz cloud. We offer a 30-day free trial account with access to all features.
Those who have data privacy concerns and can't send their data outside their infrastructure can sign up for either the enterprise self-hosted or the BYOC offering.
Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our community edition.