What is High Cardinality Data and Why Does It Matter?

High cardinality data refers to datasets containing a large number of unique values, such as user names, email addresses, or product codes. Managing this type of data can be challenging: it grows quickly, is expensive to index, and complicates analysis.

However, high cardinality data is highly valuable because it can reveal significant patterns and insights. In this post, we'll explain what high cardinality data is, discuss why it matters, and outline effective strategies and tools for managing it efficiently.

In this blog post:

  1. Cardinality in Data
  2. What is High Cardinality Data?
  3. Examples of High Cardinality Data
  4. Why is High Cardinality Important?
  5. Managing High Cardinality Data
  6. Handling High Cardinality Data in PostgreSQL

Cardinality in Data

Cardinality in a data context refers to the number of unique values in a column or attribute within a dataset. For example:

  • If you have a column for Gender with values like "Male" and "Female," the cardinality is 2 because there are only two unique values.
  • In contrast, a User ID column in a database that has millions of distinct users would have a high cardinality, as the number of unique identifiers is very large.
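
In SQL, the cardinality of a column can be measured directly with COUNT(DISTINCT ...). A minimal sketch, assuming a hypothetical users table with gender and user_id columns:

-- Cardinality = number of distinct values in each column
SELECT
    COUNT(DISTINCT gender)  AS gender_cardinality,   -- low cardinality (e.g. 2)
    COUNT(DISTINCT user_id) AS user_id_cardinality   -- high cardinality (millions)
FROM users;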

What is High Cardinality Data?

High cardinality data refers to datasets or columns that contain a large number of unique values, often in the range of thousands, millions, or more. Some examples of high cardinality data include:

  • User identifiers (User IDs)
  • Transaction IDs
  • Email addresses
  • Device IDs
  • IP addresses
  • URLs
  • Timestamps (especially in event logging)

These data points can vary greatly and often don’t repeat frequently, which makes them challenging to manage and analyse. In the case of event logs or telemetry data, high cardinality values can appear across millions of entries, making traditional database indexing or filtering methods less efficient.

Examples of High Cardinality Data

Here are real-world examples where high cardinality data is commonly encountered:

E-commerce Applications:

  • Order IDs: Each transaction in an online store has a unique identifier.
  • Customer IDs: Each registered user has a distinct ID for tracking purchases, returns, and preferences.

Web Analytics:

  • IP Addresses: Monitoring website visitors involves tracking millions of unique IPs.
  • URLs: Every webpage visited or requested can be unique, leading to high cardinality in log data.

Event Logging and Monitoring:

  • Timestamps: Logs often contain precise timestamps, each representing a unique point in time.
  • Session IDs: Each user session has a distinct identifier, especially in systems like Google Analytics.

Finance and Banking:

  • Transaction IDs: Every bank transaction generates a unique ID.
  • Account Numbers: Each customer has a unique account number for identification.

IoT (Internet of Things):

  • Device IDs: Sensors and devices in an IoT network generate unique identifiers.
  • Readings: Data such as temperature or humidity readings vary frequently and can lead to high cardinality.

Why is High Cardinality Important?

High cardinality data can present both opportunities and challenges for data engineers, analysts, and organizations that rely on large-scale data systems. Here are some of the key reasons why it's important:

1. Data Storage & Performance

Managing high cardinality data can strain data storage and querying performance. In relational databases, for instance, indexes created on high cardinality columns can become too large and slow down query execution, particularly when searching or filtering through vast amounts of data.

For this reason, systems like NoSQL databases (such as Cassandra or MongoDB) and distributed systems like ClickHouse or Elasticsearch are often preferred for handling high cardinality datasets, as they are optimized for speed and scalability.

2. Aggregation & Summarization

High cardinality can also impact how you summarize or aggregate data. For example, if you need to compute averages or counts on a high cardinality column (e.g., by counting the occurrences of every unique User ID), the sheer number of distinct values can make it difficult to generate timely or meaningful insights without specialized techniques like approximate aggregation.
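
To see why this is hard, here is a sketch assuming a hypothetical events table with a user_id column. The GROUP BY itself is simple to write, but with millions of distinct user IDs it produces millions of groups that must all be hashed or sorted before any result is returned:

-- One output row per distinct user_id: millions of groups to aggregate
SELECT user_id, COUNT(*) AS event_count
FROM events
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 100;   -- even with LIMIT, the full aggregation must run first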

3. Complex Queries

Complex queries involving high cardinality columns can be resource-intensive. For instance, performing joins or aggregations on data with millions of unique values could lead to performance bottlenecks, as the system has to scan and process large amounts of data in each query.

Managing High Cardinality Data

To handle high cardinality data effectively, you need strategies that balance data storage, performance, and analysis:

1. Indexing

Indexing is a common way to speed up data retrieval, but on high cardinality data a plain index over every unique value can become large and costly to maintain. Strategies like bitmap indexes, composite (compound) indexes, or hashing often perform better on large datasets, as sketched below.
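
A sketch in PostgreSQL syntax, assuming a hypothetical events table: a composite index narrows by a low cardinality column first and the high cardinality column second, while a hash index can serve pure equality lookups. (PostgreSQL has no persistent bitmap index type; it builds bitmaps at query time instead.)

-- Composite index: narrow by tenant first, then by the high cardinality user_id
CREATE INDEX idx_events_tenant_user ON events (tenant_id, user_id);

-- Hash index: equality lookups only (no range scans or sorting)
CREATE INDEX idx_events_session_hash ON events USING HASH (session_id);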

2. Data Partitioning

Splitting large datasets into smaller, more manageable chunks through partitioning can help. For example, event logs can be partitioned by date or user ID, reducing the overhead when querying the data.
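
For example, an event log can be range-partitioned by month so that queries over a recent window only touch a few partitions. A minimal sketch, assuming a hypothetical event_logs table:

CREATE TABLE event_logs (
    event_id   BIGINT NOT NULL,
    user_id    BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per month; queries on recent data prune the rest
CREATE TABLE event_logs_2025_01 PARTITION OF event_logs
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE event_logs_2025_02 PARTITION OF event_logs
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');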

3. Data Sampling

Instead of working with the entire dataset, data sampling (selecting a random subset of rows) lets you run analyses much faster with only a modest loss of precision. Sampling is particularly useful when exact answers over the full high cardinality data aren't necessary for every query or analysis.
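
In PostgreSQL, the TABLESAMPLE clause gives a quick way to do this. A sketch, assuming a hypothetical events table: SYSTEM sampling picks roughly 1% of the table's pages, which is fast but slightly less uniform than BERNOULLI (row-level) sampling.

-- Roughly 1% of pages; fast, approximate
SELECT user_id, COUNT(*) AS event_count
FROM events TABLESAMPLE SYSTEM (1)
GROUP BY user_id;

-- Row-level sampling; slower but more uniform
SELECT COUNT(*) FROM events TABLESAMPLE BERNOULLI (1);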

4. Approximate Aggregation

For some types of analysis, approximate algorithms like HyperLogLog (for counting distinct values) or Bloom filters (for set membership checks) are used to approximate counts and statistics efficiently without the need for exact calculations on every single unique value.
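
PostgreSQL has no built-in HyperLogLog type, but the third-party postgresql-hll extension adds one. The sketch below assumes that extension is installed and uses a hypothetical events table; the hll value is a small, fixed-size summary from which an approximate distinct count is read, avoiding the memory cost of an exact COUNT(DISTINCT ...).

-- Assumes the postgresql-hll extension is available
CREATE EXTENSION IF NOT EXISTS hll;

-- Approximate number of distinct users
SELECT hll_cardinality(hll_add_agg(hll_hash_bigint(user_id)))
FROM events;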

5. Data Denormalization

In some cases, denormalizing data (storing redundant data) can help speed up queries that rely heavily on high cardinality values. However, this comes at the cost of data redundancy and the potential for inconsistencies, so it needs to be used judiciously.
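
A small sketch of the trade-off, using hypothetical orders and customers tables: copying the customer's email onto each order avoids a join against a huge customers table at query time, at the cost of keeping the copies consistent.

-- Denormalize: carry the value you usually need on the fact table itself
ALTER TABLE orders ADD COLUMN customer_email TEXT;

UPDATE orders o
SET    customer_email = c.email
FROM   customers c
WHERE  c.customer_id = o.customer_id;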

6. Specialized Data Systems

Tools such as ClickHouse, Elasticsearch, and Apache Druid are designed to handle high cardinality data efficiently. These systems often offer specific optimizations that make them suitable for analytics on large datasets with millions or billions of unique values.

Handling High Cardinality Data in PostgreSQL

We will explore the strategies for effectively managing high cardinality data in PostgreSQL, ensuring your database remains performant even with millions or billions of unique values.

The core issue with high cardinality lies in how database indexes behave at scale. Traditional B-tree indexes remain efficient for point lookups, but on columns with a vast number of unique values they grow large, consume more memory and cache, and add overhead to every write.

Strategies for Optimization:

Here are several techniques to mitigate the performance impact of high cardinality data in PostgreSQL:

1. Proper Indexing:

  • B-tree Indexes (with Caution): B-tree indexes are the default and often a good starting point; at very high cardinality their main costs are index size and write overhead rather than lookup speed. Ensure you are indexing the correct columns and use composite indexes where appropriate.
  • Hash Indexes (Limited use cases): Hash indexes are suitable for equality lookups (WHERE column = value). However, they don't support range scans (WHERE column > value) or sorting, and before PostgreSQL 10 they were not crash-safe (not WAL-logged). Use them for specific use cases.
  • BRIN Indexes (Block Range Indexes): BRIN indexes are exceptionally efficient for large tables where the data is physically correlated with the indexed column (e.g., a timestamp column where data is inserted chronologically). They store summary information about ranges of physical blocks, significantly reducing index size and improving performance for range queries.
  • GIN and GiST Indexes (Specialized use cases): GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) indexes are designed for specialized data types like arrays, full-text search, and geometric data. They can help if your high cardinality data is stored in these formats (see the sketch after this list).
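
B-tree, hash, and BRIN indexes appear in the worked example later in this post. As a quick sketch of the specialized types, assuming the products table defined in that example (which has a description TEXT column), a GIN index over a tsvector expression supports full-text search:

-- GIN index on a full-text representation of the description column
CREATE INDEX idx_products_description_fts
    ON products USING GIN (to_tsvector('english', description));

-- A query that can use the index
SELECT product_id, name
FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'wireless & charger');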

2. Partitioning:

Partitioning divides a large table into smaller, more manageable pieces. This drastically reduces the amount of data that needs to be scanned during queries.

  • Range Partitioning: Partitioning by ranges of values (e.g., date ranges, numerical ranges) is often ideal for high cardinality data that has a natural ordering.
  • List Partitioning: Partitioning by specific values (e.g., country codes, product categories) can be useful in certain scenarios.
  • Hash Partitioning: Distributes data evenly across partitions based on a hash function, which can be beneficial for evenly distributed high cardinality data (see the sketch after this list).
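
A minimal sketch of hash partitioning (available since PostgreSQL 11), assuming a hypothetical sessions table keyed by a high cardinality session_id:

CREATE TABLE sessions (
    session_id TEXT NOT NULL,
    user_id    BIGINT,
    started_at TIMESTAMPTZ
) PARTITION BY HASH (session_id);

-- Four partitions; rows are spread evenly by hash of session_id
CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_p2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_p3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);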

3. Data Modeling:

  • Normalization: Properly normalizing your database schema can help reduce data redundancy and improve query performance.
  • UUIDs (Universally Unique Identifiers): While UUIDs offer global uniqueness, their random ordering scatters inserts across the index, hurting cache locality and causing page splits. Consider time-ordered UUID generators or sequence-based keys to mitigate this (see the sketch below).
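
A small sketch of the contrast, using hypothetical tables: gen_random_uuid() (built in since PostgreSQL 13) produces fully random UUIDs, so new rows land on random index pages; a time-ordered alternative (for example a UUIDv7 generator from an extension, or a plain bigint sequence) keeps inserts appending near the right edge of the index.

-- Random UUIDs: globally unique, but inserts land on random index pages
CREATE TABLE sessions_random (
    session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    started_at TIMESTAMPTZ DEFAULT now()
);

-- Sequence-based alternative: inserts stay clustered at the end of the index
CREATE TABLE sessions_sequential (
    session_id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    started_at TIMESTAMPTZ DEFAULT now()
);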

4. Query Optimization:

  • EXPLAIN ANALYZE: Use EXPLAIN ANALYZE to understand how PostgreSQL is executing your queries and identify performance bottlenecks.
  • Statistics Collection: Ensure PostgreSQL has up-to-date statistics about your data by running ANALYZE regularly; this helps the query planner make optimal decisions (a short sketch follows this list).
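
A short sketch against the products table from the worked example below: ANALYZE refreshes the planner's statistics, and the per-column statistics target can be raised for columns where estimates matter most.

-- Refresh planner statistics for the table
ANALYZE products;

-- Collect a larger sample for a column whose estimates matter (default target is 100)
ALTER TABLE products ALTER COLUMN category_id SET STATISTICS 500;
ANALYZE products;

-- Inspect the actual plan and timings
EXPLAIN ANALYZE
SELECT COUNT(*) FROM products WHERE category_id = 42;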

5. Hardware Considerations:

  • Sufficient RAM: Adequate memory allows PostgreSQL to cache more data, reducing disk I/O.
  • Fast Storage: Using SSDs or NVMe drives significantly improves read and write speeds.

Example:

Let's illustrate these concepts with a concrete example. Imagine an e-commerce platform with millions of products and a products table with a high cardinality product_id column.

Basic Setup:

CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name TEXT,
    description TEXT,
    category_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE
);

-- Insert 1 million rows of dummy data
INSERT INTO products (product_id, name, description, category_id, created_at)
SELECT g,
       'Product ' || g,
       'Description of product ' || g,
       (random() * 100)::int,
       now() - interval '1 day' * g
FROM generate_series(1, 1000000) AS g;

The Problem (Index Size and Lookup Cost):

Because product_id is the primary key, PostgreSQL already maintains an implicit B-tree index on it, so a single lookup is an index scan rather than a full table scan. The real pressure from high cardinality shows up as the table grows: the index becomes large, competes for memory, and every insert or update has to maintain it. The explicit index below duplicates the primary key's index and is created only so the later steps can reference a named index:

CREATE INDEX idx_products_product_id ON products (product_id); -- redundant with the primary key's index; for illustration only

EXPLAIN ANALYZE SELECT * FROM products WHERE product_id = 500000;

Solutions:

a) Partitioning (Range Partitioning):

If product IDs are assigned sequentially, we can partition by ranges:

DROP INDEX idx_products_product_id;

CREATE TABLE products_partitioned (
    product_id BIGINT NOT NULL,
    name TEXT,
    description TEXT,
    category_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE
) PARTITION BY RANGE (product_id);

CREATE TABLE products_partitioned_p1 PARTITION OF products_partitioned FOR VALUES FROM (1) TO (250001);
CREATE TABLE products_partitioned_p2 PARTITION OF products_partitioned FOR VALUES FROM (250001) TO (500001);
CREATE TABLE products_partitioned_p3 PARTITION OF products_partitioned FOR VALUES FROM (500001) TO (750001);
CREATE TABLE products_partitioned_p4 PARTITION OF products_partitioned FOR VALUES FROM (750001) TO (1000001);

INSERT INTO products_partitioned SELECT * FROM products;

CREATE INDEX idx_products_partitioned_p1_product_id ON products_partitioned_p1 (product_id);
CREATE INDEX idx_products_partitioned_p2_product_id ON products_partitioned_p2 (product_id);
CREATE INDEX idx_products_partitioned_p3_product_id ON products_partitioned_p3 (product_id);
CREATE INDEX idx_products_partitioned_p4_product_id ON products_partitioned_p4 (product_id);

EXPLAIN ANALYZE SELECT * FROM products_partitioned WHERE product_id = 500000; -- partition pruning scans only one small partition and its index

b) BRIN Index (If Applicable):

If product_id correlates with created_at:

DROP TABLE products_partitioned;
DROP TABLE products;
CREATE TABLE products ( -- Recreate the original table
    product_id BIGINT PRIMARY KEY,
    name TEXT,
    description TEXT,
    category_id INTEGER,
    created_at TIMESTAMP WITHOUT TIME ZONE
);
INSERT INTO products (product_id, name, description, category_id, created_at)
SELECT g,
       'Product ' || g,
       'Description of product ' || g,
       (random() * 100)::int,
       now() - interval '1 day' * g
FROM generate_series(1, 1000000) AS g;

CREATE INDEX idx_products_created_at_brin ON products USING BRIN (created_at);

-- Query a recent window; the generated data stretches back about a million days,
-- so a relative range is more representative than a fixed date literal
EXPLAIN ANALYZE SELECT * FROM products WHERE created_at BETWEEN now() - interval '30 days' AND now();

c) Hash Index (For Equality Lookups Only):

DROP INDEX IF EXISTS idx_products_product_id; -- IF EXISTS: this index no longer exists after the table was recreated above
CREATE INDEX idx_products_product_id_hash ON products USING HASH (product_id);

EXPLAIN ANALYZE SELECT * FROM products WHERE product_id = 500000;

-- This will NOT work:
-- SELECT * FROM products WHERE product_id > 500000;

Conclusion

High cardinality data is a vital component of modern data analytics, but it requires careful management to avoid performance issues and storage challenges.

By implementing the right strategies for indexing, partitioning, sampling, and using specialized systems, organizations can unlock the full potential of high cardinality datasets while ensuring that their systems remain fast and scalable.

From real-time event tracking to customer behaviour analysis, understanding high cardinality is key to efficient data management and analysis in the age of big data.