Understanding Buckets in Prometheus: A Comprehensive Guide with Real-Time Examples

Prometheus is an open-source monitoring and alerting toolkit that helps developers and operators track the performance and health of their systems. One of its key features is the ability to use buckets to measure and analyse distributions of data.

Buckets are essential for tracking HTTP request durations, database query times, and memory usage, helping to understand system behaviour. In this blog, we will explore what buckets are, how they work, and how to use them effectively, complete with real-time examples.

What is a bucket in Prometheus?

In Prometheus, a bucket is a predefined range of values used in histogram metrics to group data. It allows you to track how many values fall within specific intervals, making it easy to analyze data distributions over time. For example, you can measure how many HTTP requests took less than 100 ms, between 100 ms and 500 ms, and so on.

Buckets are particularly useful when you need to:

  • Analyze request durations.
  • Track database query execution times.
  • Monitor any metric that involves distributions.

A histogram metric in Prometheus typically includes three related series (illustrated in the sample after this list):

  • *_bucket: Tracks the number of observations in each bucket.
  • *_sum: Tracks the cumulative sum of all observed values.
  • *_count: Tracks the total number of observations.
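
For instance, a hypothetical http_request_duration_seconds histogram (the metric name here is purely illustrative) would be exposed as a set of cumulative series along these lines, where each bucket counts every observation less than or equal to its le bound:

http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1.0"} 330
http_request_duration_seconds_bucket{le="+Inf"} 340
http_request_duration_seconds_sum 95.2
http_request_duration_seconds_count 340

Note that buckets are cumulative: the le="0.5" series already includes everything counted under le="0.1", and the +Inf bucket always equals the total count.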

How do histogram buckets solve the billion-metrics problem?

In large-scale distributed systems, the sheer volume of metrics can become difficult to manage and expensive to store. Traditional monitoring methods often struggle with this metrics explosion.

Metrics explosion in high-traffic systems

In high-traffic systems handling millions of requests per minute, monitoring becomes increasingly complex. These systems often consist of numerous services and components that require detailed performance tracking.

Traditional monitoring methods typically require capturing and storing every single data point, which leads to a high volume of metrics.

Managing this data requires a massive storage infrastructure, which not only increases computational overhead but also brings significant financial costs. This combination of factors makes traditional approaches unsuitable for modern, large-scale systems.

Challenges in monitoring metrics

  1. Lack of granularity in metrics: Raw metrics often don’t provide enough detail to understand the full distribution of data. For example, measuring only the average latency hides outliers and prevents identifying performance bottlenecks.
  2. Difficulty in detecting outliers: Extreme values can significantly affect averages but are invisible in metrics that lack detailed breakdowns. This makes it challenging to detect rare but impactful issues.
  3. Inefficient querying for trends: Without structured data, identifying trends like percentiles, spikes, or patterns over time requires complex calculations and additional storage overhead.
  4. Inability to trigger precise alerts: Without detailed thresholds or ranges, alerts are often too broad or too frequent, leading to alert fatigue or missed critical issues.

How do Prometheus histogram buckets solve these challenges?

Histogram buckets transform metrics collection in several ways:

  1. Granular observations: Buckets group data into predefined ranges, offering granularity that raw averages or sums cannot achieve. Example: instead of knowing only that the average response time is 200 ms, buckets reveal how many requests fall into specific latency ranges such as 0–100 ms, 100–500 ms, and so on.
  2. Outlier detection: By examining buckets for larger ranges (e.g., requests exceeding 1 second), outliers can be isolated and analysed. Histogram metrics, combined with percentile queries, highlight performance anomalies that may impact user experience (see the query sketch after this list).
  3. Efficient querying for trends: Prometheus bucket metrics allow quick querying of specific ranges over time. Functions like histogram_quantile make percentile calculations straightforward, even for complex distributions. Example: histogram_quantile(0.95, sum(rate(api_response_duration_seconds_bucket[5m])) by (le)) calculates the 95th percentile of API response times.
  4. Effective alerting: Buckets enable precise thresholds for alerts. For instance, triggering an alert only when the number of requests exceeding 1 second crosses a defined threshold keeps notifications actionable.
  5. Reducing storage costs: Summarising observations into a handful of buckets cuts storage requirements dramatically, by up to roughly 99% compared to storing every raw data point, as the cost comparison below illustrates.
  6. Flexibility across use cases: Buckets work for various metrics beyond response times, such as memory usage, CPU load, or custom application metrics.
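
As a sketch of the outlier-detection point above, the following query isolates the rate of requests slower than 1 second by subtracting the cumulative 1-second bucket from the +Inf bucket (this assumes the api_response_duration_seconds histogram from the example actually defines a bucket boundary at 1 second):

sum(rate(api_response_duration_seconds_bucket{le="+Inf"}[5m])) - sum(rate(api_response_duration_seconds_bucket{le="1"}[5m]))

A sustained non-zero result here points to a slow tail that an average alone would hide.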

Cost savings with histogram buckets

Histogram buckets offer an efficient way to collect and store metrics by summarizing data into predefined ranges or "buckets" rather than capturing every single data point. This method significantly reduces storage requirements and costs, especially in high-traffic systems handling millions of requests. Below is a comparison of the traditional approach versus the histogram bucket approach:

Cost savings table:

Aspect                | Traditional Approach              | Histogram Bucket Approach
----------------------|-----------------------------------|----------------------------
Requests per minute   | 1 million                         | 1 million
Data per metric point | 10 bytes                          | Condensed into 5–10 buckets
Storage required      | 10 GB/hour                        | ~100 MB/hour
Annual storage cost   | Hundreds of thousands of dollars  | Significantly reduced

Example 1: Monitoring database query execution times with buckets

Let’s consider a scenario where you want to track the execution times of database queries. Here is how you might define buckets for query durations.

Sample metric data:

# Metric: db_query_duration_seconds
# Buckets for database query execution times in seconds

db_query_duration_seconds_bucket{le="0.1"} 15
db_query_duration_seconds_bucket{le="0.5"} 40
db_query_duration_seconds_bucket{le="1.0"} 65
db_query_duration_seconds_bucket{le="2.0"} 80
db_query_duration_seconds_bucket{le="5.0"} 95
db_query_duration_seconds_bucket{le="+Inf"} 100

db_query_duration_seconds_sum 145.0
db_query_duration_seconds_count 100
  • The le (less than or equal to) label defines the upper limit for each bucket.
  • The db_query_duration_seconds_bucket metric tracks how many observations fall into each range.
  • The db_query_duration_seconds_sum provides the total query execution time observed.
  • The db_query_duration_seconds_count indicates the total number of observations; together with the sum it also gives the average duration, as shown below.
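
Because the histogram exposes db_query_duration_seconds_sum and db_query_duration_seconds_count, the average query duration over a window follows directly and complements the percentile query below:

rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m])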

Example query: Database query percentile

To calculate the 95th percentile of query execution times over the last 5 minutes:

histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le))

This query uses the histogram_quantile function to estimate the 95th percentile of query execution times, helping you identify performance bottlenecks.

Example 2: Tracking memory allocation in applications

Another common use case for buckets is monitoring memory allocation. Imagine tracking how much memory various applications consume over time.

Sample metric data:

# Metric: memory_usage_bytes
# Buckets for memory usage in bytes

memory_usage_bytes_bucket{le="536870912"} 5      # 512 MiB
memory_usage_bytes_bucket{le="1073741824"} 12    # 1 GiB
memory_usage_bytes_bucket{le="2147483648"} 20    # 2 GiB
memory_usage_bytes_bucket{le="4294967296"} 30    # 4 GiB
memory_usage_bytes_bucket{le="+Inf"} 35

memory_usage_bytes_sum 60000000000
memory_usage_bytes_count 35

The le label always carries a numeric upper bound, so the byte boundaries are written out in full; the comments show the human-readable sizes.

Example query: Memory allocation analysis

To monitor applications consuming more than 2 GiB of memory, you can use:

sum(rate(memory_usage_bytes_bucket{le="+Inf"}[5m])) - sum(rate(memory_usage_bytes_bucket{le="2147483648"}[5m]))

This query calculates the rate of observations exceeding 2 GiB (the 2147483648-byte bucket boundary) over the last 5 minutes.

Example 3: Monitoring latency with buckets

Consider a case where you monitor response times, such as API latency or database query execution time. Prometheus buckets enable you to evaluate the distribution of these latencies.

Sample metric data:

# Metric: response_duration_seconds
response_duration_seconds_bucket{le="0.1"} 150
response_duration_seconds_bucket{le="0.3"} 500
response_duration_seconds_bucket{le="1.0"} 750
response_duration_seconds_bucket{le="+Inf"} 1000

response_duration_seconds_sum 425.0
response_duration_seconds_count 1000

Example query: Percentile calculation

To calculate the 90th percentile of response times over the last 5 minutes:

histogram_quantile(0.90, sum(rate(response_duration_seconds_bucket[5m])) by (le))

This query estimates the 90th percentile, useful for pinpointing latency trends.
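
Building on the same metric, you can also track the fraction of requests completing within a latency target (0.3 seconds here, assuming that bucket boundary exists), which is a common way to express a latency SLO; multiply by 100 for a percentage:

sum(rate(response_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(response_duration_seconds_count[5m]))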

Example 4: Memory usage

Buckets are also helpful for resource monitoring, such as memory or CPU utilization.

Sample metric data:

memory_usage_bytes_bucket{le="1000000"} 300
memory_usage_bytes_bucket{le="5000000"} 1200
memory_usage_bytes_bucket{le="+Inf"} 1500

memory_usage_bytes_sum 3.2e+09
memory_usage_bytes_count 1500

With these buckets, you can analyse resource consumption patterns and identify anomalies.

Example query: Memory usage distribution analysis

To calculate the percentage of instances consuming less than 5 MB of memory:

(sum(rate(memory_usage_bytes_bucket{le="5000000"}[5m])) / sum(rate(memory_usage_bytes_bucket{le="+Inf"}[5m]))) * 100

This query helps monitor memory usage distribution and identify potential resource allocation issues.

Example 5: File upload size distribution

Track file upload sizes in a web application:

# Metric structure
upload_size_bytes_bucket{endpoint="/upload", le="1048576"} 500    # 1MB
upload_size_bytes_bucket{endpoint="/upload", le="5242880"} 980    # 5MB
upload_size_bytes_bucket{endpoint="/upload", le="10485760"} 999   # 10MB
upload_size_bytes_bucket{endpoint="/upload", le="+Inf"} 1000
# Calculate average upload size
expr: |
  sum(rate(upload_size_bytes_sum[5m]))
  /
  sum(rate(upload_size_bytes_count[5m]))

# Monitor large uploads
expr: |
  sum(
    rate(upload_size_bytes_bucket{le="+Inf"}[5m])
  ) -
  sum(
    rate(upload_size_bytes_bucket{le="5242880"}[5m])
  )
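
If you instrument this yourself with the Go client library, the byte boundaries above can be declared explicitly. Below is a minimal sketch under that assumption; the endpoint label matches the sample data, and payload is a placeholder for the request body:

// assumes: import "github.com/prometheus/client_golang/prometheus"
uploadSize := prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "upload_size_bytes",
    Help:    "Distribution of file upload sizes in bytes",
    Buckets: []float64{1 << 20, 5 << 20, 10 << 20}, // 1 MiB, 5 MiB, 10 MiB
}, []string{"endpoint"})

// Inside the upload handler, record the size of each upload:
uploadSize.WithLabelValues("/upload").Observe(float64(len(payload)))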

Example 6: Cache performance monitoring

Track cache hit durations and effectiveness:

# Metric structure
cache_operation_duration_seconds_bucket{operation="get", le="0.001"} 8000
cache_operation_duration_seconds_bucket{operation="get", le="0.005"} 9500
cache_operation_duration_seconds_bucket{operation="get", le="0.01"} 9900
cache_operation_duration_seconds_bucket{operation="get", le="+Inf"} 10000
# Calculate cache effectiveness (share of operations completing within 1 ms)
expr: |
  sum by (operation) (
    rate(cache_operation_duration_seconds_bucket{le="0.001"}[5m])
  ) /
  sum by (operation) (
    rate(cache_operation_duration_seconds_count[5m])
  ) * 100

# Alert on cache degradation
alert: CacheDegradation
expr: |
  histogram_quantile(0.95,
    sum by (le, operation) (
      rate(cache_operation_duration_seconds_bucket{operation="get"}[5m])
    )
  ) > 0.005
for: 5m
labels:
  severity: warning
annotations:
  summary: "Cache performance degradation detected"
  description: "95th percentile cache operation duration exceeds 5ms"

How to configure buckets in Prometheus?

When defining buckets for a histogram metric, consider the following:

  1. Choose appropriate bucket sizes: Define buckets that cover the expected range of values. Avoid too many buckets to prevent high memory usage.
  2. Ensure uniform coverage: Distribute buckets evenly based on the expected data distribution.
  3. Test and iterate: Monitor your data and refine bucket sizes to better capture system behaviour.

Example configuration in a Prometheus client library (e.g., Go):

histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "db_query_duration_seconds",
    Help:    "Histogram of database query execution times in seconds",
    Buckets: prometheus.ExponentialBuckets(0.1, 2, 5), // Start at 0.1s, multiply by 2, 5 buckets
})
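
ExponentialBuckets(0.1, 2, 5) produces boundaries at 0.1, 0.2, 0.4, 0.8, and 1.6 seconds. Once the histogram is registered, each query records one observation; here is a minimal sketch, where runQuery stands in for your real database call and the time package is imported:

prometheus.MustRegister(histogram)

start := time.Now()
runQuery()                                     // placeholder for the actual database call
histogram.Observe(time.Since(start).Seconds()) // counted in every bucket whose le bound it does not exceed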

Common pitfalls and best practices

Potential challenges:

  • Over-bucketing: Too many buckets increase memory consumption.
  • Under-bucketing: Too few buckets reduce metric granularity.
  • Inappropriate bucket ranges: Misaligned buckets provide misleading insights.

Recommended approaches:

  • Start with default bucket configurations.
  • Customise bucket ranges based on the metric's nature.
  • Use exponential or carefully chosen linear bucket distributions (see the sketch after this list).
  • Regularly review and adjust bucket configurations.
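
In the Go client library, these recommendations map onto built-in bucket helpers. The sketch below compares the options; the metric names and boundary choices are illustrative, not prescriptive:

// Default buckets, tuned for typical request latencies in seconds:
// .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10
defaultHist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "Request durations using the client defaults",
    Buckets: prometheus.DefBuckets,
})

// Linear buckets: 0.05, 0.10, ... 0.50 (10 buckets, 0.05 apart)
linearHist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "queue_wait_seconds",
    Help:    "Queue wait times with evenly spaced buckets",
    Buckets: prometheus.LinearBuckets(0.05, 0.05, 10),
})

// Exponential buckets: 0.01, 0.02, 0.04, ... 5.12 (10 buckets, doubling each step)
exponentialHist := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "payload_size_megabytes",
    Help:    "Payload sizes with exponentially growing buckets",
    Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
})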

Setting alerts based on buckets

Buckets can also be used to define alerts for abnormal behaviour. For instance, if you want to alert when query execution times exceed 2 seconds, you can use the following Prometheus alerting rule:

alert: HighQueryExecutionTime
expr: |
  sum(rate(db_query_duration_seconds_bucket{le="+Inf"}[5m]))
  - sum(rate(db_query_duration_seconds_bucket{le="2.0"}[5m])) > 10
for: 5m
labels:
  severity: critical
annotations:
  summary: "Database query execution time exceeds 2 seconds"

This rule fires when, on average, more than 10 queries per second take longer than 2 seconds (the observations above the 2.0 bucket) and that condition persists for 5 minutes.

Conclusion

Prometheus histogram buckets offer an improved way to monitor performance. They provide detailed insights into system behaviour, helping you manage complex distributed systems more effectively.

The important part is not just using buckets but knowing how to set them up and use them correctly. With time and practice, histogram buckets can turn your monitoring into a powerful tool for improving system performance, not just reporting on it.

