
Add a prometheus label mapping component #2025

Open · wants to merge 1 commit into base: main

Conversation

@vaxvms commented Nov 4, 2024

PR Description

This PR adds a Prometheus component that creates a label from a source label's value using a static mapping table.

Which issue(s) this PR fixes

For a large mapping table, using regexes with prometheus.relabel is inefficient.
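
To make the difference concrete: a mapping that needs one rule block per value in prometheus.relabel becomes a single map entry in the proposed component (syntax taken from the examples later in this thread):

// prometheus.relabel: one rule block per mapped value
rule {
  source_labels = ["service_id"]
  regex         = "^000001$"
  target_label  = "tenant_id"
  replacement   = "tenant_42"
}

// proposed prometheus.mapping: one entry in a static table
mapping = {
  "000001" = "tenant_42",
}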

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated

@vaxvms vaxvms requested review from clayton-cornell and a team as code owners November 4, 2024 14:51
@CLAassistant commented Nov 4, 2024

CLA assistant check
All committers have signed the CLA.

@clayton-cornell (Contributor) left a comment

Some initial doc review comments

@clayton-cornell clayton-cornell requested a review from a team November 5, 2024 18:27
@clayton-cornell added the type/docs label (Docs Squad label across all Grafana Labs repos) on Nov 5, 2024
@thampiotr (Contributor) commented

Thanks for contributing.

> For a large mapping table, using regexes with prometheus.relabel is inefficient.

Could you share the use case and the performance differences? In many cases users can use discovery.relabel to add the team=X labels, and then the overhead is much smaller because it's targets, not samples, that are relabeled.

If the reason to add this is that prometheus.relabel is not efficient, we will need more data on how inefficient it is and whether it could be optimised to work faster. Any new solution we consider will require benchmarks proving that it's more efficient. We also generally want to avoid having many ways of doing the same thing, though there can be exceptions.
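
For illustration, a minimal sketch of the discovery.relabel approach mentioned above; the discovery source and the label values are placeholders, not part of this PR:

discovery.relabel "add_team" {
  targets = discovery.kubernetes.pods.targets

  // Each target is relabeled once, before scraping,
  // instead of relabeling every sample it produces.
  rule {
    source_labels = ["service_id"]
    regex         = "^000001$"
    target_label  = "team"
    replacement   = "team_a"
  }
}

prometheus.scrape "default" {
  targets    = discovery.relabel.add_team.output
  forward_to = [prometheus.remote_write.default.receiver]
}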

@vaxvms (Author) commented Nov 6, 2024

I have to map a label holding the service ID to a label holding the tenant ID.

My test case maps X distinct service IDs to Y tenants, with 100 different service IDs mapped to each tenant. Obviously, Y <= X.

Using prometheus.relabel with X set to 20k (and Y to 200), I'm consuming 20 vCPUs and seeing a 49% failure rate.

Using prometheus.mapping with X set to 100k (and Y to 1k), I'm consuming 2 vCPUs.

Since my worst case can be X == Y, I haven't tried to collapse the regexes, so I have one rule block per service ID.

@thampiotr (Contributor) commented

Thanks for sharing some performance data. Have you tried to use discovery.relabel instead?

The issue with prometheus.relabel is that you are relabelling every single sample. With discovery.relabel you can relabel the targets before they are scraped.

So, for example, in a cluster of 1k targets and each exposing 1k metrics, you would have 1 million relabels with prometheus.relabel as opposed to 1k relabels with discovery.relabel.

If we find that discovery.relabel cannot be used for some reason, I would like to understand better why. If we need to optimise prometheus.relabel, I would like to explore that option too before committing to a new feature that does a very similar job. We may also need to go through the proposal process to let the team and community comment on this.

@vaxvms (Author) commented Nov 20, 2024

Thanks for your answer.

I'm not using Alloy to scrape targets; I'm receiving data through remote_write from multiple data producers.
The data producers only know the service_id, not the tenant_id, so they cannot apply the correct label at scrape time.

I don't see how we could optimize prometheus.relabel to be as performant as a key lookup: every sample has to be tested against up to X compiled regexes, while a map lookup is a single hash operation. We have made an attempt to share the cache between instances (#1692), which aims to lower memory consumption.

Here is an excerpt of my test configuration using prometheus.relabel:

prometheus.receive_http "ingestion_head" {
    http {
        listen_address = "0.0.0.0"
        listen_port    = 9000
    }
    forward_to = [prometheus.relabel.allow_list.receiver]
}

prometheus.relabel "allow_list" {
    forward_to = [prometheus.remote_write.ingestion_tail.receiver]
    rule {
        action        = "keep"
        regex         = "(.+)"
        source_labels = ["service_id"]
    }


    // One rule block per service_id
    rule {
        source_labels = ["service_id"]
        regex         = "^000001$"
        target_label  = "tenant_id"
        replacement   = "tenant_42"
    }
    
    rule {
        source_labels = ["service_id"]
        regex         = "^000002$"
        target_label  = "tenant_id"
        replacement   = "tenant_190"
    }
    
    // ...
    
    rule {
        source_labels = ["service_id"]
        regex         = "^12345678$"
        target_label  = "tenant_id"
        replacement   = "tenant_8876"
    }
}

prometheus.remote_write "ingestion_tail" {
    endpoint {
        url = "http://mimir/"
    }
}

prometheus.relabel can be fairly heavy on resources, especially when we do a 1:1 match, hence the proposal for a mapping component. Here is an example of a working proof of concept based on the code in this PR:

prometheus.receive_http "ingestion_head" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 9000
  }
  forward_to = [prometheus.relabel.allow_list.receiver]
}

prometheus.relabel "allow_list" {
  forward_to = [prometheus.mapping.tenants_mapping.receiver]

  // First, drop everything that doesn't have a service_id
  rule {
    action        = "keep"
    regex         = "(.+)"
    source_labels = ["service_id"]
  }
}

prometheus.mapping "tenants_mapping" {
  forward_to = [prometheus.remote_write.ingestion_tail.receiver]

  src_label_name = "service_id"
  output_label_name = "tenant_id"

  mapping = {
    "000001"   = "tenant_42",
    "000002"   = "tenant_190",
    // ...
    "12345678" = "tenant_8876",
  }
}

prometheus.remote_write "ingestion_tail" {
  endpoint {
    url = "http://mimir/"
  }
}

@wilfriedroset commented

What we forgot to make explicit with @vaxvms is that our initial design would benefit from #521: it would remove the need for cortex-tenant.

@clayton-cornell (Contributor) commented

@vaxvms Did the doc topic at components/prometheus/prometheus.relabel accidentally fall off this PR in one of the force-push updates? Or was it removed intentionally? I don't see it in the file list anymore.

@vaxvms (Author) commented Nov 26, 2024

> @vaxvms Did the doc topic at components/prometheus/prometheus.relabel accidentally fall off this PR in one of the force-push updates? Or was it removed intentionally? I don't see it in the file list anymore.

Are you talking about components/prometheus/prometheus.mapping? I accidentally removed it; it's back again.

I've updated the way the component works: the target_label isn't fixed anymore, and each value of the source label can be mapped to several labels. A sketch of what that could look like follows.
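
A sketch of the updated interface, assuming each mapping value is itself a map of label names to values (the attribute names here are illustrative, not the final syntax):

prometheus.mapping "tenants_mapping" {
  forward_to = [prometheus.remote_write.ingestion_tail.receiver]

  source_label = "service_id"

  // Hypothetical shape: each source value maps to a set of labels.
  mapping = {
    "000001" = {
      "tenant_id" = "tenant_42",
      "team"      = "team_a",
    },
    "000002" = {
      "tenant_id" = "tenant_190",
    },
  }
}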

@vaxvms (Author) commented Nov 28, 2024

Thanks for the review, @clayton-cornell. Most of the phrasing came from prometheus.relabel.md.
