WHERE vs PREWHERE in ClickHouse
ClickHouse is a high-performance, columnar database designed for analytics and big data applications. One of ClickHouse's standout features is its ability to perform filtering directly on compressed data, enabling faster results compared to traditional relational databases.
While working with ClickHouse, you will often encounter the WHERE and PREWHERE clauses, both of which are used to filter data.
But what is the difference between WHERE and PREWHERE, and when should you use one over the other? In this guide, we will dive deep into the differences between WHERE and PREWHERE clause in ClickHouse and understand how to use them effectively to enhance query performance.
Lets get started!
Table of Contents
- What are WHERE and PREWHERE in ClickHouse?
- Key Differences between WHERE and PREWHERE
- How WHERE and PREWHERE work in ClickHouse?
- When to use PREWHERE?
- Combining WHERE and PREWHERE
- WHERE and PREWHERE - Performance Comparison
- Best Practices for using WHERE and PREWHERE in ClickHouse
- How Atatus Uses WHERE and PREWHERE
What are WHERE and PREWHERE in ClickHouse?
WHERE
clause in ClickHouse filters rows after all necessary columns have been read from storage. When you use a WHERE
clause in your query, ClickHouse first retrieves all the columns needed for the query from disk. After this, it applies the filtering condition specified in the WHERE
clause. This means that the filtering does not reduce the amount of data initially read from the storage.
PREWHERE
clause filters rows before reading all columns, limiting the amount of data read from disk. When you use a PREWHERE
clause, ClickHouse applies the filtering condition at an earlier stage of query execution. Only rows that satisfy the PREWHERE
condition have their remaining columns read from storage.
Key Differences between WHERE and PREWHERE
In this section, we will explore the key differences between the WHERE and PREWHERE clauses in ClickHouse.
Here are the key differences:
Feature | WHERE | PREWHERE |
---|---|---|
Execution Stage | Operates after reading columns | Operates during data reading |
I/O Optimization | Does not reduce data read | Reduces the amount of data read |
Use Case | General filtering and post-processing | Early filtering with high selectivity |
How WHERE and PREWHERE work in ClickHouse?
WHERE clause
The WHERE
clause in ClickHouse works after the necessary data is loaded into memory. It filters rows based on the condition you specify but does not reduce the data read from disk.
Lets look at a practical WHERE clause query example to understand how filtering is applied.
Example:
SELECT user_id, event_type
FROM events
WHERE event_type = 'purchase';
Here:
- ClickHouse reads all columns (
user_id
andevent_type
) from disk for the entire dataset. - The filtering (
event_type = 'purchase'
) is applied after the data is loaded into memory.
PREWHERE clause
The PREWHERE
clause applies filters before all columns are loaded, during the data reading stage. This reduces I/O costs and improves performance when the filtering condition is highly selective.
Let’s now look at some practical PREWHERE clause query examples to understand how filtering is applied.
Example:
SELECT user_id, event_type
FROM events
PREWHERE event_type = 'purchase';
Here:
- ClickHouse reads only the rows where
event_type = 'purchase'
. - Other columns (
user_id
) are loaded only for the filtered rows.
When to use PREWHERE?
In what situations does the PREWHERE
clause work best? Let’s look at the technical cases where its early filtering can significantly improve query performance.
- Filtering on a Selective Column: If the filter condition significantly reduces the number of rows (e.g., filtering by primary key or a highly selective column).
- Large Dataset: When the dataset is massive, and you want to minimize I/O by reading only the necessary rows.
- Indexed or Primary Key Columns: When filtering on a column that is part of the primary key or index. ClickHouse can use the index to locate the required rows more efficiently.
For instance, let us look into a PREWHERE clause query example, where you can significantly improve performance by filtering data early in the query execution, especially when filtering on a highly selective column.
SELECT user_id, event_time
FROM logs
PREWHERE log_level = 'ERROR';
When log_level
is highly selective, using PREWHERE log_level = 'ERROR'
ensures that only the rows meeting this condition are read from storage, skipping irrelevant ones.
This early filtering minimizes disk I/O by avoiding unnecessary column reads and reduces resource usage, making the query execution faster and more efficient.
Combining WHERE and PREWHERE
You can use both PREWHERE
and WHERE
in a single query. The PREWHERE
clause applies first, reducing the data read, and the WHERE
clause applies additional filters on the already reduced dataset.
By using PREWHERE
clause query examples combined with WHERE clause query, you can further refine your queries for better performance, especially in large datasets.
Example:
SELECT user_id, event_time
FROM logs
PREWHERE log_level = 'ERROR'
WHERE event_time > '2024-11-01 00:00:00';
Here, the PREWHERE
clause filters rows where log_level = 'ERROR'
before any data is fully read from disk, minimizing disk I/O by reducing the amount of data retrieved. This is especially effective when the log_level
column is highly selective.
After that, the WHERE
clause further filters the already reduced dataset based on event_time > '2024-11-01 00:00:00'
. This two-step approach optimizes query performance by reducing the data read initially and applying additional filters on a smaller, more manageable dataset.
WHERE vs PREWHERE - Performance Comparison
Explore the performance differences between WHERE vs PREWHERE in ClickHouse and how each clause affects query execution, especially in large-scale datasets.
Consider a table orders
with 1 billion rows:
Query with WHERE:
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 123;
Reads all data for order_id
and total_amount
from disk. Filters rows in memory, leading to higher I/O costs.
Query with PREWHERE:
SELECT order_id, total_amount
FROM orders
PREWHERE customer_id = 123;
Reads only rows where customer_id = 123
from disk. Minimizes I/O, especially if only a few rows match the condition.
Best Practices for using WHERE and PREWHERE in ClickHouse
To get the most out of ClickHouse’s performance, it’s important to understand when and how to use WHERE
and PREWHERE
effectively. Here are some best practices:
Use PREWHERE
for Early Filtering:
- Columns with high selectivity.
- Columns in the primary key or index.
Avoid Using PREWHERE
for Non-Selective Filters:
- If a filter matches a large percentage of rows,
PREWHERE
provides little benefit. - Example:
PREWHERE is_active = 1
on a dataset where 90% of rows are active.
Combine PREWHERE
and WHERE
:
- Use
PREWHERE
for initial filtering andWHERE
for additional conditions.
Profile Your Queries:
- Use
EXPLAIN
to understand how your query is executed:
EXPLAIN SELECT * FROM orders PREWHERE customer_id = 123;
How Atatus uses WHERE and PREWHERE?
In Atatus, we deal with large-scale time-series data that spans across multiple observability domains, including logs, metrics, and distributed traces. Here's how WHERE
and PREWHERE
fit into our workflow:
(i). Efficient Log Retrieval: Logs are indexed by fields like timestamp
, log_level
, and application_name
. We use PREWHERE
to filter logs by highly selective columns (e.g., application_name
or log_level
) and WHERE
to apply additional time-based filters.
Example:
SELECT timestamp, log_message
FROM logs
PREWHERE application_name = 'ecommerce-app'
WHERE timestamp > now() - INTERVAL 1 HOUR;
Here the PREWHERE
clause ensures that only logs from ecommerce-app
are read from disk, while the WHERE
clause narrows it further to the last hour. This drastically reduces the data scanned.
(ii). Real-Time Metric Analysis: For metrics like CPU usage, memory utilization, or custom KPIs, we filter data using PREWHERE
to identify the specific metric IDs and WHERE
to define the time range.
Example:
SELECT metric_id, avg(value)
FROM metrics
PREWHERE metric_id IN (123, 456)
WHERE timestamp BETWEEN now() - INTERVAL 1 DAY AND now();
By applying PREWHERE
to filter specific metric IDs early, we avoid scanning irrelevant metrics, making our dashboards and alerts lightning-fast.
(iii). Distributed Tracing:
In distributed tracing, we store spans with fields like trace_id
, span_id
, service_name
, and duration
. PREWHERE
is used to filter specific trace_id
or service_name
, while WHERE
focuses on duration thresholds.
Example:
SELECT trace_id, service_name, duration
FROM spans
PREWHERE service_name = 'checkout-service'
WHERE duration > 1000;
This ensures that only relevant traces from checkout-service
are scanned, and the results focus on spans with durations greater than 1000ms, improving troubleshooting efficiency.
Benefits of using WHERE and PREWHERE clauses in Atatus
- Faster Query Execution: By leveraging
PREWHERE
, we reduce the volume of data read from disk, allowing our platform to deliver instant insights. - Cost Efficiency: Reduced I/O translates to lower infrastructure costs, enabling us to process massive amounts of data without breaking the bank.
- Real-Time Observability: The combination of
PREWHERE
andWHERE
ensures our customers can query logs, traces, and metrics in real-time, without delays. - Scalability: As our customers' data grows, the efficient use of ClickHouse ensures that Atatus remains responsive and scalable.
Try Atatus observability platform for free with a 14-day trial—sign up now!
Conclusion
Understanding the difference between WHERE
and PREWHERE
in ClickHouse is crucial for optimizing query performance. By using PREWHERE
, you can minimize disk I/O and improve query efficiency, especially for large datasets. However, it’s important to choose the right clause based on your dataset and query requirements.
When in doubt, analyze your data and profiling results to determine the best approach. A well-placed PREWHERE
clause can make a world of difference in ClickHouse performance.