encode_opentelemetry: add cut off for otel payloads for prometheus mimir #223

Open
wants to merge 6 commits into
base: master

Conversation

cosmo0920
Contributor

This issue was reported in fluent/fluent-bit#9400.

It happens because Prometheus Mimir restricts the timestamps of metrics in the same batch to a 5-minute window:
https://github.com/grafana/mimir/blob/main/pkg/distributor/distributor.go#L1010-L1020
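
A minimal sketch of the cutoff idea, assuming it means dropping samples older than a given age relative to the current time before the payload is encoded. The `sample` struct and `drop_stale_samples` function are hypothetical names used only for illustration; they are not cmetrics structures and not the code in this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample type for illustration; not a cmetrics structure. */
struct sample {
    uint64_t timestamp_ns; /* sample time, nanoseconds since the Unix epoch */
    double   value;
};

/*
 * Compact `samples` in place, keeping only entries whose timestamp is not
 * older than `now_ns - max_age_ns`. Returns the number of samples kept.
 */
static size_t drop_stale_samples(struct sample *samples, size_t count,
                                 uint64_t now_ns, uint64_t max_age_ns)
{
    size_t kept = 0;
    size_t i;
    /* Guard against unsigned underflow when now_ns is smaller than the window. */
    uint64_t cutoff_ns = (now_ns > max_age_ns) ? now_ns - max_age_ns : 0;

    assert(samples != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (samples[i].timestamp_ns >= cutoff_ns) {
            samples[kept++] = samples[i];
        }
    }
    return kept;
}
```

Under these assumptions, a caller would pass the current time and a `max_age_ns` of 5 minutes (300 seconds in nanoseconds), then encode only the first `kept` entries.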

@edsiper
Member

edsiper commented Sep 26, 2024

What is the side effect of this for other endpoints/users? Is it OK to remove metrics for everybody?

@ElectricWeasel

As far as I have investigated, fluent-bit keeps resending indefinitely (until restarted) metrics from devices or mounts that no longer exist:

  | Sep 27, 2024 @ 10:37:02.140 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:50:15.946Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra6.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:48.274 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:49:17.062Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra5.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:41.445 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:44:55.162Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra2.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:32.213 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:40:47.164Z and is from series node_filesystem_device_error{device="tmpfs", fstype="tmpfs", host_name="petra1.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:18.366 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:40:47.164Z and is from series node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", host_name="petra1.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:17.153 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:50:15.946Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra6.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:36:03.301 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:48:17.259Z and is from series node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", host_name="petra4.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:35:53.855 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:51:37.82Z and is from series node_filesystem_free_bytes{device="tmpfs", fstype="tmpfs", host_name="petra7.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)
  | Sep 27, 2024 @ 10:35:48.239 | user=anonymous: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-24T11:49:17.062Z and is from series node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", host_name="petra5.vrit.dev", metrics_agent="fluent-bit", metrics_source="host-metrics", mountpoint="/run/user/2137"} (sampled 1/10)

Trying to push metrics from 3 days ago... (tmpfs filesystem after user session)
I don't think anyone can benefit from this.

Regards
Rafał

@cosmo0920
Contributor Author

Trying to push metrics from 3 days ago... (tmpfs filesystem after user session) I don't think anyone can benefit from this.

Regards Rafał

Just to confirm: was this log produced with this patch applied or not?

@ElectricWeasel

Trying to push metrics from 3 days ago... (tmpfs filesystem after user session) I don't think anyone can benefit from this.
Regards Rafał

Just to confirm: was this log produced with this patch applied or not?

Ah sorry, it's a standard 3.1.2 version. I can try to compile from this branch and confirm.

Regards
Rafał

@cosmo0920
Contributor Author

What is the side effect of this for other endpoints/users? Is it OK to remove metrics for everybody?

I added APIs to specify cutoff options. This should avoid breaking changes for users who are already using OTel encoding.

Review comment on src/cmt_encode_opentelemetry.c (outdated, resolved)
cosmo0920 force-pushed the cosmo0920-add-cut-off-for-otel-payloads-for-prometheus-mimir branch from faab663 to 6c74f7e on October 15, 2024 06:29
cosmo0920 force-pushed the cosmo0920-add-cut-off-for-otel-payloads-for-prometheus-mimir branch from 6c74f7e to ed94318 on October 15, 2024 06:30
@Brodiemm

Is this being planned in for a release soon? Any other testing etc. that is needed?

@cosmo0920
Contributor Author

I believe so. But even once it is merged into the fluent-bit tree, more work is needed to implement the cutoff-related parameters in out_opentelemetry.

4 participants