WHERE vs PREWHERE in ClickHouse

ClickHouse is a high-performance, columnar database designed for analytics and big data applications.  One of ClickHouse's standout features is its ability to perform filtering directly on compressed data, enabling faster results compared to traditional relational databases.

While working with ClickHouse, you will often encounter the WHERE and PREWHERE clauses, both of which are used to filter data.

But what is the difference between WHERE and PREWHERE, and when should you use one over the other? In this guide, we will dive deep into the differences between WHERE and PREWHERE clause in ClickHouse and understand how to use them effectively to enhance query performance.

Lets get started!

Table of Contents

  1. What are WHERE and PREWHERE in ClickHouse?
  2. Key Differences between WHERE and PREWHERE
  3. How WHERE and PREWHERE work in ClickHouse?
  4. When to use PREWHERE?
  5. Combining WHERE and PREWHERE
  6. WHERE and PREWHERE - Performance Comparison
  7. Best Practices for using WHERE and PREWHERE in ClickHouse
  8. How Atatus Uses WHERE and PREWHERE

What are WHERE and PREWHERE in ClickHouse?

WHERE clause in ClickHouse filters rows after all necessary columns have been read from storage. When you use a WHERE clause in your query, ClickHouse first retrieves all the columns needed for the query from disk. After this, it applies the filtering condition specified in the WHERE clause. This means that the filtering does not reduce the amount of data initially read from the storage.

PREWHERE clause filters rows before reading all columns, limiting the amount of data read from disk. When you use a PREWHERE clause, ClickHouse applies the filtering condition at an earlier stage of query execution. Only rows that satisfy the PREWHERE condition have their remaining columns read from storage.

Key Differences between WHERE and PREWHERE

In this section, we will explore the key differences between the WHERE and PREWHERE clauses in ClickHouse.  

Here are the key differences:

HTML Table with Equal Column Widths
Feature WHERE PREWHERE
Execution Stage Operates after reading columns Operates during data reading
I/O Optimization Does not reduce data read Reduces the amount of data read
Use Case General filtering and post-processing Early filtering with high selectivity

How WHERE and PREWHERE work in ClickHouse?

WHERE clause

The WHERE clause in ClickHouse works after the necessary data is loaded into memory. It filters rows based on the condition you specify but does not reduce the data read from disk.

Lets look at a practical WHERE clause query example to understand how filtering is applied.

Example:

SELECT user_id, event_type
FROM events
WHERE event_type = 'purchase';

Here:

  • ClickHouse reads all columns (user_id and event_type) from disk for the entire dataset.
  • The filtering (event_type = 'purchase') is applied after the data is loaded into memory.

PREWHERE clause

The PREWHERE clause applies filters before all columns are loaded, during the data reading stage. This reduces I/O costs and improves performance when the filtering condition is highly selective.

Let’s now look at some practical PREWHERE clause query examples to understand how filtering is applied.

Example:

SELECT user_id, event_type
FROM events
PREWHERE event_type = 'purchase';

Here:

  • ClickHouse reads only the rows where event_type = 'purchase'.
  • Other columns (user_id) are loaded only for the filtered rows.

When to use PREWHERE?

In what situations does the PREWHERE clause work best? Let’s look at the technical cases where its early filtering can significantly improve query performance.

  • Filtering on a Selective Column: If the filter condition significantly reduces the number of rows (e.g., filtering by primary key or a highly selective column).
  • Large Dataset: When the dataset is massive, and you want to minimize I/O by reading only the necessary rows.
  • Indexed or Primary Key Columns: When filtering on a column that is part of the primary key or index. ClickHouse can use the index to locate the required rows more efficiently.

For instance, let us look into a PREWHERE clause query example, where you can significantly improve performance by filtering data early in the query execution, especially when filtering on a highly selective column.

SELECT user_id, event_time
FROM logs
PREWHERE log_level = 'ERROR';

When log_level is highly selective, using PREWHERE log_level = 'ERROR' ensures that only the rows meeting this condition are read from storage, skipping irrelevant ones.

This early filtering minimizes disk I/O by avoiding unnecessary column reads and reduces resource usage, making the query execution faster and more efficient.

Combining WHERE and PREWHERE

You can use both PREWHERE and WHERE in a single query. The PREWHERE clause applies first, reducing the data read, and the WHERE clause applies additional filters on the already reduced dataset.

By using PREWHERE clause query examples combined with WHERE clause query, you can further refine your queries for better performance, especially in large datasets.

Example:

SELECT user_id, event_time
FROM logs
PREWHERE log_level = 'ERROR'
WHERE event_time > '2024-11-01 00:00:00';

Here, the PREWHERE clause filters rows where log_level = 'ERROR' before any data is fully read from disk, minimizing disk I/O by reducing the amount of data retrieved. This is especially effective when the log_level column is highly selective.

After that, the WHERE clause further filters the already reduced dataset based on event_time > '2024-11-01 00:00:00'. This two-step approach optimizes query performance by reducing the data read initially and applying additional filters on a smaller, more manageable dataset.

WHERE vs PREWHERE - Performance Comparison

explore the performance differences between WHERE vs PREWHERE in ClickHouse and how each clause affects query execution, especially in large-scale datasets.

Consider a table orders with 1 billion rows:

Query with WHERE:

SELECT order_id, total_amount
FROM orders
WHERE customer_id = 123;

Reads all data for order_id and total_amount from disk. Filters rows in memory, leading to higher I/O costs.

Query with PREWHERE:

SELECT order_id, total_amount
FROM orders
PREWHERE customer_id = 123;

Reads only rows where customer_id = 123 from disk. Minimizes I/O, especially if only a few rows match the condition.

Best Practices for using WHERE and PREWHERE in ClickHouse

To get the most out of ClickHouse’s performance, it’s important to understand when and how to use WHERE and PREWHERE effectively. Here are some best practices:

Use PREWHERE for Early Filtering:

  • Columns with high selectivity.
  • Columns in the primary key or index.

Avoid Using PREWHERE for Non-Selective Filters:

  • If a filter matches a large percentage of rows, PREWHERE provides little benefit.
  • Example: PREWHERE is_active = 1 on a dataset where 90% of rows are active.

Combine PREWHERE and WHERE:

  • Use PREWHERE for initial filtering and WHERE for additional conditions.

Profile Your Queries:

  • Use EXPLAIN to understand how your query is executed:
EXPLAIN SELECT * FROM orders PREWHERE customer_id = 123;

How Atatus uses WHERE and PREWHERE?

In Atatus, we deal with large-scale time-series data that spans across multiple observability domains, including logs, metrics, and distributed traces. Here's how WHERE and PREWHERE fit into our workflow:

(i). Efficient Log Retrieval: Logs are indexed by fields like timestamp, log_level, and application_name. We use PREWHERE to filter logs by highly selective columns (e.g., application_name or log_level) and WHERE to apply additional time-based filters.

Example:

SELECT timestamp, log_message
   FROM logs
   PREWHERE application_name = 'ecommerce-app'
   WHERE timestamp > now() - INTERVAL 1 HOUR;

Here the PREWHERE clause ensures that only logs from ecommerce-app are read from disk, while the WHERE clause narrows it further to the last hour. This drastically reduces the data scanned.

(ii). Real-Time Metric Analysis: For metrics like CPU usage, memory utilization, or custom KPIs, we filter data using PREWHERE to identify the specific metric IDs and WHERE to define the time range.

Example:

SELECT metric_id, avg(value)
   FROM metrics
   PREWHERE metric_id IN (123, 456)
   WHERE timestamp BETWEEN now() - INTERVAL 1 DAY AND now();

By applying PREWHERE to filter specific metric IDs early, we avoid scanning irrelevant metrics, making our dashboards and alerts lightning-fast.

(iii). Distributed Tracing:

In distributed tracing, we store spans with fields like trace_id, span_id, service_name, and duration. PREWHERE is used to filter specific trace_id or service_name, while WHERE focuses on duration thresholds.

Example:

SELECT trace_id, service_name, duration
   FROM spans
   PREWHERE service_name = 'checkout-service'
   WHERE duration > 1000;

This ensures that only relevant traces from checkout-service are scanned, and the results focus on spans with durations greater than 1000ms, improving troubleshooting efficiency.

Benefits of using WHERE and PREWHERE clauses in Atatus

  • Faster Query Execution: By leveraging PREWHERE, we reduce the volume of data read from disk, allowing our platform to deliver instant insights.
  • Cost Efficiency: Reduced I/O translates to lower infrastructure costs, enabling us to process massive amounts of data without breaking the bank.
  • Real-Time Observability: The combination of PREWHERE and WHERE ensures our customers can query logs, traces, and metrics in real-time, without delays.
  • Scalability: As our customers' data grows, the efficient use of ClickHouse ensures that Atatus remains responsive and scalable.

Try Atatus observability platform for free with a 14-day trial—sign up now!

Conclusion

Understanding the difference between WHERE and PREWHERE in ClickHouse is crucial for optimizing query performance. By using PREWHERE, you can minimize disk I/O and improve query efficiency, especially for large datasets. However, it’s important to choose the right clause based on your dataset and query requirements.

When in doubt, analyze your data and profiling results to determine the best approach. A well-placed PREWHERE clause can make a world of difference in ClickHouse performance.

Atatus

#1 Solution for Logs, Traces & Metrics

tick-logo APM

tick-logo Kubernetes

tick-logo Logs

tick-logo Synthetics

tick-logo RUM

tick-logo Serverless

tick-logo Security

tick-logo More

Pavithra Parthiban

Pavithra Parthiban

A technical content writer specializing in monitoring and observability tools, adept at making complex concepts easy to understand.
Chennai