Optimizing High Cardinality Data in ClickHouse

ClickHouse is known for its fast performance and ability to handle large amounts of data, making it a popular choice for running analytical queries. However, it can face challenges when dealing with high cardinality data, which refers to columns with a large number of unique values. This can affect query performance and storage efficiency if not managed properly.

In this blog, we will explain what high cardinality means in simple terms and share practical ways to handle it in ClickHouse. We will also discuss how it can impact your database performance and why managing it is important in real-world use cases.

Table of Contents:

  1. What is High Cardinality?
  2. Why is High Cardinality a Challenge in ClickHouse?
  3. Strategies for Managing High Cardinality in ClickHouse
  4. Real-World Implementation
  5. Key Improvements

What is High Cardinality?

In database terms, cardinality refers to the uniqueness of values within a column. High cardinality means a column has a large number of distinct values relative to the total number of rows. Examples include:

  • User IDs
  • Product IDs
  • Session IDs
  • Timestamps with high precision

While ClickHouse excels at handling large volumes of data, extremely high cardinality can impact performance, especially for queries involving filtering, grouping, or joining on these columns.

Why is High Cardinality a Challenge in ClickHouse?

ClickHouse's performance is heavily reliant on efficient data compression and indexing. When dealing with high cardinality, several issues can arise:

  • Increased Index Size: Indexes, particularly primary keys, can become significantly larger, consuming more storage and impacting query performance.
  • Reduced Compression Ratio: High cardinality data is often less compressible, leading to larger storage requirements and increased I/O during queries.
  • Slower Filtering and Grouping: Filtering and grouping operations on high cardinality columns can become less efficient as ClickHouse needs to process a larger number of distinct values.

Strategies for Managing High Cardinality in ClickHouse

Here are several effective strategies to mitigate the impact of high cardinality in ClickHouse:

1. Choosing the right data types

  • LowCardinality Data Type: This is ClickHouse's most powerful tool for handling high cardinality. It stores distinct values separately and uses them as a dictionary, significantly reducing storage space and improving query performance, especially for filtering and grouping. Use this whenever possible for high cardinality columns.
CREATE TABLE user_events (
    event_time DateTime,
    user_id LowCardinality(UInt32),  -- Convert high cardinality column
    event_type String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id);
  • Enum Data Type: If the number of distinct values is relatively small and known in advance, Enum can be a very efficient choice.

2. Optimizing primary keys and sorting keys

  • Minimize Primary Key Cardinality: Avoid including extremely high cardinality columns in the primary key if possible. Choose columns that provide good data distribution and are frequently used in filtering.
  • Use Sorting Keys Effectively: Sorting keys determine the physical order of data on disk. Include frequently filtered columns in the sorting key to improve query performance. Consider the order of columns in the sorting key – place higher cardinality columns later.
    CREATE TABLE user_events (
        event_time DateTime,
        user_id LowCardinality(UInt32),
        event_type String
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (event_time, user_id);  -- user_id is placed after event_time
    ```

3. Using appropriate codecs

  • Delta Compression: For columns with sequential or near-sequential values (like timestamps or auto-incrementing IDs), delta compression can significantly improve compression ratios.
  • Double Delta Compression: Can be even more effective for highly sequential data.
CREATE TABLE user_events (
    event_time DateTime CODEC(Delta, LZ4),  -- Optimal for sequential timestamps
    user_id LowCardinality(UInt32),
    event_type String CODEC(ZSTD(1))  -- Better compression for strings
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id);

4. Data skipping indexes (Min/Max, Set, Bloom Filter, etc.)

These indexes help ClickHouse quickly skip over large portions of data that don't match the query criteria. They can be especially helpful for filtering on high cardinality columns, even when using LowCardinality.

CREATE TABLE user_events (
    event_time DateTime,
    user_id LowCardinality(UInt32),
    event_type String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time)
SETTINGS index_granularity = 8192
SAMPLE BY user_id; -- Sampling by high cardinality column

5. Materialized views

If you frequently perform aggregations or complex calculations on high cardinality data, consider using materialized views to pre-calculate and store the results. This can significantly speed up query performance.

Example: Optimizing user activity tracking

Let's say you are tracking user activity on a website. You have an user_events table with a user_id column.

CREATE MATERIALIZED VIEW daily_user_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
AS SELECT
    toDate(event_time) as event_date,
    user_id,
    count() as event_count,
    uniqExact(event_type) as unique_events
FROM user_events
GROUP BY event_date, user_id;

Real-World Implementation

Let us look at a complete example of an optimized event tracking system:

-- Main events table with optimized structure
CREATE TABLE user_events (
    event_time DateTime CODEC(Delta, LZ4),
    user_id LowCardinality(UInt32),
    event_type LowCardinality(String),
    properties String CODEC(ZSTD(1)),

    INDEX event_type_idx event_type TYPE set(100) GRANULARITY 4,
    INDEX user_id_idx user_id TYPE minmax GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192;

-- Materialized view for quick user activity analysis
CREATE MATERIALIZED VIEW user_activity
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id)
AS SELECT
    toDate(event_time) as date,
    user_id,
    count() as event_count,
    groupUniqArray(event_type) as event_types
FROM user_events
GROUP BY date, user_id;

Key Improvements

  • LowCardinality(UInt32) for user_id dramatically reduces storage and improves query performance.
  • Delta compression for event_time improves compression.
  • Including user_id in the ORDER BY clause and using SAMPLE BY improves filtering and grouping performance.

Conclusion

Managing high cardinality data effectively in ClickHouse is crucial for maintaining optimal performance. By leveraging the LowCardinality data type, optimizing primary and sorting keys, using appropriate codecs, and employing data skipping indexes, you can ensure your ClickHouse deployments remain lightning-fast even with massive datasets and high cardinality columns. Always analyse your specific data and query patterns to choose the most effective optimization strategies.