How to Join two metrics in Prometheus?
In Prometheus, metric joining allows you to merge metrics to build more detailed and insightful queries using PromQL (Prometheus Query Language). By joining metrics, you can analyse data from different sources together, providing a more comprehensive view of your system's behaviour.
This metric joining capability enables you to correlate different metrics effectively, leading to better monitoring and troubleshooting. Additionally, understanding how to use labels can significantly enhance your queries, making them more precise and informative.
In this guide, we will explore how metric joining works in Prometheus, the role of labels, PromQL capabilities, and some of the common pitfalls users face during the process.
In this blog post:
- Metric joining in Prometheus - An introduction
- Understanding the Basics
- How to join two metrics in a Prometheus query?
- Common pitfalls and how to avoid them
- Key Takeaways
- Atatus: A Complement to Prometheus
Metric joining in Prometheus - An introduction
Metric joining in Prometheus is a method of combining two or more metrics to build more complex and insightful queries. This is achieved using PromQL (Prometheus Query Language) to create queries that merge related metrics based on their labels, allowing for a more comprehensive view of the data.
The need for joining metrics in Prometheus arises from the limitations of analysing metrics individually. Joining metrics allows you to:
- Uncover correlations and patterns that aren’t visible when looking at metrics separately.
- Gain a clearer understanding of system behaviour by viewing the relationship between components or services.
- Identify root causes more effectively by seeing how different parts of the system interact.
- Create more accurate alerting rules by considering multiple factors simultaneously.
- Correlate application performance with infrastructure metrics for deeper insights.
At this point, you likely have a basic understanding of what metric joining is and why it is necessary. Before diving into the concept, let’s first clarify some fundamental terms and concepts that will enhance your understanding.
Understanding the Basics
(i). Prometheus Metrics
Prometheus metrics are quantitative data points that track the performance and health of applications and infrastructure. They help monitor cloud environments by collecting and storing time-series data. These metrics enable detection of issues by providing insights into where and when problems occur.
(ii). Types of Metrics in Prometheus
Prometheus offers four primary metric types:
- Counter – A cumulative metric that only increases or resets to zero, used for tracking things like requests served or tasks completed.
- Gauge – A metric that can go up or down, often used for values like memory usage or the number of active pods.
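- Histogram – A metric that samples observations, such as request latency, and counts them in predefined buckets.
- Summary – Similar to a histogram, but it calculates configurable quantiles on the client side over a sliding time window instead of counting observations in buckets. Histograms are preferred in most cases because their buckets can be aggregated across instances.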
(iii). PromQL (Prometheus Query Language)
PromQL (Prometheus Query Language) is a query language used in Prometheus to retrieve, filter, and analyze time-series data. PromQL enables users to create complex queries for monitoring, alerting, and visualizing metrics, making it a core feature for Prometheus users.
For example, http_requests_total is a metric that tracks the total number of HTTP requests. Querying it in PromQL returns every time series with that name, one per unique combination of labels.
(iv). Prometheus labels
Prometheus labels are key-value pairs associated with metrics that provide extra context and details. These labels help differentiate the metric data based on specific attributes.
For instance, consider the metric cpu_usage_total. Without labels, this metric would only show the total CPU usage. By adding labels such as core="0" and mode="idle", you can create separate time series for different CPU cores and modes, allowing for more granular analysis.
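To make the idea of one time series per label combination concrete, here is a minimal Python sketch. It is not Prometheus code: the sample values are invented, and the select helper is a hypothetical stand-in for a PromQL selector such as cpu_usage_total{mode="idle"}.

```python
# Illustrative values only; the label names follow the cpu_usage_total example.
series = [
    ({"core": "0", "mode": "idle"}, 1200.0),
    ({"core": "0", "mode": "user"}, 300.0),
    ({"core": "1", "mode": "idle"}, 1100.0),
    ({"core": "1", "mode": "user"}, 450.0),
]


def select(series, **matchers):
    """Keep the series whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(name) == wanted for name, wanted in matchers.items())
    ]


# Roughly what cpu_usage_total{mode="idle"} would select: one series per core.
idle = select(series, mode="idle")
# Adding a second matcher narrows the result to a single series.
core0_idle = select(series, core="0", mode="idle")

print(idle)
print(core0_idle)
```

Each distinct label set is its own series, which is why the labels you choose determine what can later be matched in a join.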
How to join two metrics in a Prometheus query?
Vector Matching
Vector matching is a fundamental technique for combining metrics based on their labels. Operations between vectors aim to find a matching element in the right-hand side vector for each entry in the left-hand side. There are two types of vector matching behaviour: one-to-one and many-to-one/one-to-many.
Vector matching keywords enable the comparison of series with different sets of labels by using:
- on
- ignoring
The label lists provided with these keywords control how the vectors are combined during the matching process.
One-to-one vector matches
One-to-one matching identifies a unique pair of entries from both sides of the operation. By default, this occurs in operations structured as vector1 <operator> vector2. Two entries are matched when they share the exact same set of labels and their values. The ignoring keyword lets you exclude specific labels during matching, while the on keyword restricts the matching to a specified list of labels.
Sample Input:
method_code:http_errors:rate5m{method="get", code="500"} 28
method_code:http_errors:rate5m{method="get", code="404"} 35
method_code:http_errors:rate5m{method="put", code="501"} 5
method_code:http_errors:rate5m{method="post", code="500"} 10
method_code:http_errors:rate5m{method="post", code="404"} 18
method_code:http_errors:rate5m{method="patch", code="403"} 7
method:http_requests:rate5m{method="get"} 650
method:http_requests:rate5m{method="del"} 45
method:http_requests:rate5m{method="post"} 140
method:http_requests:rate5m{method="put"} 50
method:http_requests:rate5m{method="patch"} 80
Example query:
method_code:http_errors:rate5m{code="500"} /
ignoring(code) method:http_requests:rate5m
The ignoring(code) modifier ensures that only the method label is considered when matching the vectors, making the comparison possible.
Output:
{method="get"} 0.0431 // 28 / 650
{method="post"} 0.0714 // 10 / 140
The query returns the fraction of HTTP requests that resulted in a 500 error for the methods get and post. The methods put and patch do not appear in the output because they have no series with code="500" on the left-hand side, and del does not appear because it has no error series at all.
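The matching rule used in this example can be modelled in a few lines of Python. This is a simplified sketch, not how Prometheus is implemented: match_key and divide_one_to_one are hypothetical helper names, series are plain (labels, value) pairs, and the metric name is left out because it takes no part in matching.

```python
def match_key(labels, on=None, ignoring=()):
    """Reduce a label set to the labels that take part in matching."""
    if on is not None:
        return frozenset((k, v) for k, v in labels.items() if k in on)
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)


def divide_one_to_one(left, right, on=None, ignoring=()):
    """Model of left / right with one-to-one matching."""
    right_by_key = {}
    for labels, value in right:
        key = match_key(labels, on, ignoring)
        if key in right_by_key:
            raise ValueError("several right-hand series share one match key")
        right_by_key[key] = value

    result = {}
    for labels, value in left:
        key = match_key(labels, on, ignoring)
        if key not in right_by_key:
            continue  # no partner on the right: the series is dropped
        if key in result:
            raise ValueError("several left-hand series share one match key")
        result[key] = value / right_by_key[key]
    return result


# method_code:http_errors:rate5m{code="500"} from the sample input
errors_500 = [
    ({"method": "get", "code": "500"}, 28),
    ({"method": "post", "code": "500"}, 10),
]
# method:http_requests:rate5m from the sample input
requests = [
    ({"method": "get"}, 650),
    ({"method": "del"}, 45),
    ({"method": "post"}, 140),
    ({"method": "put"}, 50),
    ({"method": "patch"}, 80),
]

ratios = divide_one_to_one(errors_500, requests, ignoring=("code",))
for key, value in sorted(ratios.items(), key=lambda item: sorted(item[0])):
    print(dict(key), round(value, 4))
```

Passing on=("method",) instead of ignoring=("code",) yields the same pairs here, because method is the only label the two sides share.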
Many-to-one and one-to-many vector matches
Many-to-one and one-to-many matching are advanced cases that need to be used cautiously. Usually, using ignoring(<labels>) gives the expected result.
For many-to-one or one-to-many vector matching, group modifiers are used:
- group_left
- group_right
Grouping modifiers can only be applied in comparisons and arithmetic operations. In operations like and, unless, and or, Prometheus automatically matches all possible entries from the right-hand vector by default.
Many-to-one and one-to-many matching refer to situations where one element on one side matches with multiple elements on the other side. To achieve this, you use group_left or group_right modifiers, depending on which side has higher cardinality.
Sample Input:
method_code:http_errors:rate5m{method="get", code="500"} 28
method_code:http_errors:rate5m{method="get", code="404"} 35
method_code:http_errors:rate5m{method="put", code="501"} 5
method_code:http_errors:rate5m{method="post", code="500"} 10
method_code:http_errors:rate5m{method="post", code="404"} 18
method_code:http_errors:rate5m{method="patch", code="403"} 7
method:http_requests:rate5m{method="get"} 650
method:http_requests:rate5m{method="del"} 45
method:http_requests:rate5m{method="post"} 140
method:http_requests:rate5m{method="put"} 50
method:http_requests:rate5m{method="patch"} 80
Example query:
method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m
The ignoring(code) part tells Prometheus to ignore the status codes when matching data. It only focuses on the HTTP method.
The group_left modifier is necessary here because the error rate data contains both the method and code labels, while the total request data only includes the method label.
By using group_left, the query aligns the two datasets based on the shared method label, while retaining the code information from the left-side dataset.
This approach ensures that each combination of method and code is appropriately matched with its corresponding total request count based on the method, allowing for an accurate comparison.
Output:
{method="get", code="500"} 0.043 // 28 / 650
{method="get", code="404"} 0.054 // 35 / 650
{method="put", code="501"} 0.10 // 5 / 50
{method="post", code="500"} 0.071 // 10 / 140
{method="post", code="404"} 0.129 // 18 / 140
{method="patch", code="403"} 0.0875 // 7 / 80
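The many-to-one behaviour above can be sketched in Python as well. Again this is only a model under simplifying assumptions: divide_group_left is a hypothetical helper, the metric name is omitted, and only the ignoring form of the modifier is handled.

```python
def divide_group_left(left, right, ignoring=()):
    """Model of left / ignoring(...) group_left right.

    Several left-hand series may share one right-hand series; each result
    keeps the full label set of its left-hand series.
    """
    def key(labels):
        return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

    right_by_key = {}
    for labels, value in right:
        k = key(labels)
        if k in right_by_key:
            raise ValueError("the 'one' side must be unique per match key")
        right_by_key[k] = value

    result = {}
    for labels, value in left:
        k = key(labels)
        if k in right_by_key:
            result[frozenset(labels.items())] = value / right_by_key[k]
    return result


# Sample input from this section
errors = [
    ({"method": "get", "code": "500"}, 28),
    ({"method": "get", "code": "404"}, 35),
    ({"method": "put", "code": "501"}, 5),
    ({"method": "post", "code": "500"}, 10),
    ({"method": "post", "code": "404"}, 18),
    ({"method": "patch", "code": "403"}, 7),
]
requests = [
    ({"method": "get"}, 650),
    ({"method": "del"}, 45),
    ({"method": "post"}, 140),
    ({"method": "put"}, 50),
    ({"method": "patch"}, 80),
]

ratios = divide_group_left(errors, requests, ignoring=("code",))
for labels, value in ratios.items():
    print(dict(labels), round(value, 4))
```

The del series never appears because nothing on the left-hand side matches it, and the request count for get is reused for both of its error codes.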
Once metrics are joined in Prometheus using label matching techniques (like on, ignoring, group_left, or group_right), you can apply various mathematical operations to analyze the data and derive meaningful insights. Here's a simple explanation of how these operations can help:
Addition (+): This operation helps in combining the values of two metrics. You can sum metrics to get the total count or combined rate of something over multiple series.
service_a_errors_total + service_b_errors_total
Subtraction (-): Subtraction allows you to find the difference between two metrics. This can be useful when comparing the values of related metrics or identifying anomalies.
total_requests - successful_requests
This query gives you the number of failed requests.
Multiplication (*): Multiplication helps to scale a metric by another value or combine two metrics proportionally. It’s useful for converting units or calculating percentages.
response_time_seconds * 1000 # Convert seconds to milliseconds
Division (/): Division is commonly used to calculate rates, ratios, or percentages. You can divide one metric by another to normalize or compare values.
errors_total / total_requests
Rate Calculations: Prometheus provides functions like rate() to calculate the per-second average rate of increase of a counter over a time window. This shows how quickly a metric is changing.
rate(errors_total[5m])
This query gets the per-second rate of errors, averaged over the last 5 minutes.
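As a rough illustration of what a per-second rate means, the following Python sketch computes a simplified rate from raw counter samples. It is an approximation under stated assumptions: the real rate() also extrapolates to the edges of the window, which is skipped here, and simple_rate and the sample values are made up for this example.

```python
def simple_rate(samples, window_seconds):
    """Simplified per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value  # counter restarted: count up from zero
        else:
            increase += value - previous
        previous = value
    return increase / window_seconds


# Hypothetical errors_total samples over a 5-minute window, with one reset.
samples = [(0, 100), (60, 130), (120, 160), (180, 10), (240, 40), (300, 70)]
print(simple_rate(samples, 300))
```

Handling the reset is what lets a rate stay meaningful across process restarts, which is why rate() is meant for counters rather than gauges.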
Aggregations: You can also aggregate metrics across multiple dimensions. Functions like sum(), avg(), min(), max(), etc., help summarize data across label sets, giving you a broader view of your metrics.
sum(cpu_usage_total) by (instance)
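The grouping behind an aggregation like the one above can be sketched in Python. The sum_by helper and the sample values are hypothetical; the sketch only models how series that share the grouping labels are collapsed into one value.

```python
from collections import defaultdict


def sum_by(series, by):
    """Model of sum(...) by (...): add up values per grouping label set."""
    totals = defaultdict(float)
    for labels, value in series:
        # A label missing from a series is treated as an empty value.
        group = tuple((name, labels.get(name, "")) for name in by)
        totals[group] += value
    return dict(totals)


# Hypothetical cpu_usage_total series: two cores on instance a, one on b.
cpu_usage = [
    ({"instance": "a", "core": "0"}, 10.0),
    ({"instance": "a", "core": "1"}, 20.0),
    ({"instance": "b", "core": "0"}, 5.0),
]

per_instance = sum_by(cpu_usage, by=("instance",))
print(per_instance)
```

Aggregating away labels such as core before a join is a common way to bring both sides down to the same label set.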
Common pitfalls and how to avoid them
When joining two metrics in Prometheus, there are several common pitfalls you might encounter. Here’s a breakdown of those pitfalls and how to avoid them:
(i). When two metrics have different sets of labels, joining them without properly handling the labels can result in no matching or incorrect matching.
How to Avoid: Use the on() modifier to match only on the common labels. Use the ignoring() modifier to ignore irrelevant labels.
(ii). Using group_left or group_right incorrectly can lead to unexpected results, such as duplicate time series or incorrect matching. These modifiers should only be used when one side has more unique label combinations than the other.
How to Avoid: Only use these modifiers for one-to-many or many-to-one relationships, and not for simple one-to-one joins. Ensure you use group_left when the left-hand side of the query has more label combinations and group_right when the right-hand side has more.
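(iii). Performing operations like division without checking for zero values can produce NaN (Not a Number) or infinite results, which can skew your metrics.
How to Avoid: Use the clamp_min() function to put a lower bound on the divisor. For example, clamp_min(total_requests, 1) keeps the divisor at or above 1. Keep in mind that clamping also changes the result whenever the real divisor is below the bound, so choose a bound that suits the metric, or filter out zero divisors with a comparison such as total_requests > 0.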
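The Python sketch below shows why a clamped divisor avoids NaN and infinity, and what it costs. It is only a model: promql_divide is a hypothetical helper that mimics floating-point division for non-negative divisors (Python itself raises an error on division by zero), and clamp_min here simply mirrors the lower bound applied by the PromQL function of the same name.

```python
import math


def promql_divide(numerator, denominator):
    """Floating-point division that returns NaN or infinity instead of raising.

    Assumes a non-negative denominator, as with request counts or rates.
    """
    if denominator == 0:
        if numerator == 0 or math.isnan(numerator):
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def clamp_min(value, minimum):
    """Raise value to at least minimum."""
    return max(value, minimum)


print(promql_divide(0.0, 0.0))                  # no errors, no requests
print(promql_divide(5.0, 0.0))                  # errors but no requests
print(promql_divide(5.0, clamp_min(0.0, 1.0)))  # divisor clamped to 1
print(promql_divide(0.2, 0.5))                  # true ratio for a small divisor
print(promql_divide(0.2, clamp_min(0.5, 1.0)))  # clamping distorts this ratio
```

The last two lines show the trade-off: a bound of 1 is harmless for large counts but understates ratios when the divisor is a small per-second rate.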
(iv). Using logical operators like or, and, or unless can result in unexpected matches, as they may match all possible entries from the right vector by default.
How to Avoid: Only use these operators when you fully understand their behaviour, and avoid them in vector matching operations unless that is the desired outcome. For more controlled matching, stick to arithmetic operations like +, -, *, or /.
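To show how the logical operators differ from arithmetic ones, here is a small Python model of and, unless, and or. It is a sketch under assumptions: the helper names are hypothetical, only the ignoring modifier is supported, and the data reuses the request and error series from the earlier sample input.

```python
def key_of(labels, ignoring=()):
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)


def vector_and(left, right, ignoring=()):
    """Keep left series that have at least one match on the right."""
    right_keys = {key_of(labels, ignoring) for labels, _ in right}
    return [(l, v) for l, v in left if key_of(l, ignoring) in right_keys]


def vector_unless(left, right, ignoring=()):
    """Keep left series that have no match on the right."""
    right_keys = {key_of(labels, ignoring) for labels, _ in right}
    return [(l, v) for l, v in left if key_of(l, ignoring) not in right_keys]


def vector_or(left, right, ignoring=()):
    """All left series, plus right series that match nothing on the left."""
    left_keys = {key_of(labels, ignoring) for labels, _ in left}
    extra = [(l, v) for l, v in right if key_of(l, ignoring) not in left_keys]
    return list(left) + extra


requests = [
    ({"method": "get"}, 650),
    ({"method": "del"}, 45),
    ({"method": "post"}, 140),
    ({"method": "put"}, 50),
    ({"method": "patch"}, 80),
]
errors_500 = [
    ({"method": "get", "code": "500"}, 28),
    ({"method": "post", "code": "500"}, 10),
]

with_errors = vector_and(requests, errors_500, ignoring=("code",))
without_errors = vector_unless(requests, errors_500, ignoring=("code",))
combined = vector_or(errors_500, requests, ignoring=("code",))
print(with_errors, without_errors, combined, sep="\n")
```

Note that these operators only filter or merge series and never combine values, and that without a matching modifier the label sets must agree exactly, so the and in this example would return nothing.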
(v). Joining metrics with high cardinality can result in performance issues, as the query may take longer to execute and consume more memory.
How to Avoid: Limit your join operations to only essential labels by using the on() and ignoring() modifiers.
Key Takeaways
- Metric joining enables combining metrics for deeper insights using PromQL.
- Prometheus supports four metric types: Counter, Gauge, Histogram, and Summary.
- Labels add context to metrics, helping differentiate data based on specific attributes.
- PromQL is used for querying and manipulating time-series data in Prometheus.
- Vector matching techniques (on, ignoring) join metrics by labels for accurate comparison.
- Use ignoring() to exclude irrelevant labels and on() to match only common labels in metric joining.
- One-to-one matching ensures unique metric pairs, but can be limiting if metrics don't share labels.
- Many-to-one/one-to-many matching allows flexible joins but requires careful use of group_left or group_right modifiers to avoid errors.
- Math operations like addition, subtraction, and division analyse joined metrics effectively.
Atatus: A Complement to Prometheus
If you are looking for a full-stack observability platform, Atatus complements Prometheus by providing a unified solution for metrics, traces, and logs. While Prometheus excels in metrics collection, Atatus enhances your monitoring capabilities, allowing for faster troubleshooting.
Here’s why Atatus is a valuable addition:
- Unified Telemetry: Atatus combines metrics, traces, and logs into one platform for full visibility.
- Correlation for Root Cause Analysis: Easily correlate metrics with traces to quickly resolve performance issues.
- Advanced Querying and Visualizations: Powerful querying and visualizations simplify data analysis.
- Quick Setup: An intuitive setup process allows for rapid onboarding, enhancing Prometheus's capabilities.
If you are not yet an Atatus customer, you can sign up for a 14-day free trial.