
Check the resource usage of Alloy #3724

Closed · 2 tasks · Tracked by #3520
Rotfuks opened this issue Oct 15, 2024 · 12 comments
Labels: team/atlas Team Atlas

Rotfuks commented Oct 15, 2024

Motivation

We've seen some weird numbers when checking the golem installation, on which Alloy is already rolled out. It seems that Alloy is using significantly more resources than promtail and prometheus-agent combined. This is not good.

Todo

Outcome

  • We are confident that Alloy is not using significantly more resources than prometheus-agent and promtail

hervenicol commented Oct 17, 2024

Actual RAM usage for Alloy on golem

Queries

RAM: sum(container_memory_working_set_bytes{cluster_id="golem", namespace="kube-system", pod=~"alloy-metrics.*", container!="", image!=""}) by (pod)

Series: sum(prometheus_remote_write_wal_storage_active_series{pod=~"alloy-metrics-.*"}) by (pod)
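
The two can also be combined into a rough bytes-per-active-series figure per pod. A minimal sketch, assuming the datasource's query API is port-forwarded to localhost:9090 (the endpoint and port are assumptions, not the actual setup):

  # hypothetical: query API reachable on localhost:9090
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=
        sum(container_memory_working_set_bytes{cluster_id="golem", namespace="kube-system", pod=~"alloy-metrics.*", container!="", image!=""}) by (pod)
      /
        sum(prometheus_remote_write_wal_storage_active_series{pod=~"alloy-metrics-.*"}) by (pod)'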

Numbers

Currently:

  • 2 alloy-metrics pods
  • scraping 350k and 400k metrics
  • RAM usage is 9GB and 3.5GB

=> the pod scraping 400k metrics is the one using less RAM. But it's also the youngest one.

Pprof

Extracting data:

  • port-forward with ks port-forward alloy-metrics-0 12345
  • get heap data with curl localhost:12345/debug/pprof/heap -o heap.pprof

Visualizing data with https://play.grafana.org/a/grafana-pyroscope-app/ad-hoc
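
A roughly equivalent local workflow, as a sketch in case the Grafana ad-hoc viewer is not at hand (it assumes ks above is a kubectl alias scoped to kube-system, and that a Go toolchain is installed):

  # "ks" is assumed to expand to kubectl -n kube-system
  kubectl -n kube-system port-forward alloy-metrics-0 12345:12345 &

  # fetch the heap profile from Alloy's pprof endpoint
  curl -s localhost:12345/debug/pprof/heap -o heap.pprof

  # inspect locally: top consumers in the terminal, or the interactive web UI
  go tool pprof -top heap.pprof
  go tool pprof -http=:8080 heap.pprof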

  • alloy-metrics-0: screenshots of the alloc_objects, alloc_space, inuse_objects and inuse_space profiles
  • alloy-metrics-1: screenshots of the alloc_objects, alloc_space, inuse_objects and inuse_space profiles

TheoBrigitte commented Oct 22, 2024

Our Alloy app is already up-to-date

Rotfuks commented Oct 22, 2024

So what's the outcome here then?

Currently:
2 alloy-metrics pods
scraping 350k and 400k metrics
RAM usage is 9GB and 3.5GB

This sounds closer to what prometheus-agents had - so we improved here and closed the gap?

TheoBrigitte commented Oct 22, 2024

Looking into Alloy memory usage over the last 2 days, I see no anomaly; by this I mean that the memory usage stays relatively flat and changes according to the number of time series observed. Memory usage is ~12 GiB.

Image
source

Putting this into perspective with a Prometheus agent handling a similar number of time series, it seems we are consuming far less memory in that case. Memory usage ~5 GiB.

Image
source

Note that this comparison is made across 2 different installations and might be inaccurate due to the nature of the underlying data, which might differ in terms of label cardinality; that plays a big part in the actual memory usage.

TheoBrigitte commented Oct 24, 2024

I replaced Alloy with Prometheus agent on golem so we can collect data over the next couple of days and then compare. I am aiming for a minimum of 2 days of data and a maximum of 7 days.

Explore prometheus-agent query
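
For the record, the comparison query mirrors the Alloy one above. A sketch only: both the prometheus-.* pod-name regex and the localhost:9090 endpoint are assumptions.

  # hypothetical pod-name regex and endpoint
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(container_memory_working_set_bytes{cluster_id="golem", pod=~"prometheus-.*", container!="", image!=""}) by (pod)'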

TheoBrigitte commented

prometheus-agent does consume the same amount of memory as Alloy after a few (5) days.

Note that there are 3 prometheus-agent pods consuming 5, 4.8, and 1.2 GiB respectively, so ~11 GiB in total with 750k time series.

Image

ramping up over 5 days

Image

hervenicol commented

I struggle with reading those graphs.

  • There are lots of different RAM metrics
  • there are metrics from agents running on the WCs

So I'll refer to your numbers:

3 prometheus agent pods consuming respectively 5, 4.8, and 1.2 GiB, so ~11GiB in total with 750k time series

With alloy I had these numbers (no graph either, sorry):

2 alloy-metrics pods
scraping 350k and 400k metrics
RAM usage is 9GB and 3.5GB

So,

  • 750k series in both cases
  • 12.5GB RAM over 2 pods for alloy / 11GB RAM over 3 pods for prometheus agent

That's a bit more RAM usage for Alloy (roughly 14% more in total), but I think we can accept that.
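
Spelling out the arithmetic behind that conclusion (a quick check only; the quoted numbers mix GB and GiB, which is ignored here):

  # totals and ratio from the numbers quoted above
  awk 'BEGIN {
    alloy = 9 + 3.5          # 2 alloy-metrics pods
    agent = 5 + 4.8 + 1.2    # 3 prometheus-agent pods
    printf "alloy=%.1f  prometheus-agent=%.1f  ratio=%.2f\n", alloy, agent, alloy / agent
  }'
  # -> alloy=12.5  prometheus-agent=11.0  ratio=1.14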

QuentinBisson commented

Yes, I think that's something we can accept. Maybe we should find ways to reduce the difference in resource usage between the 2 pods (I think it's KSM and we should maybe shard it), but those are good results :)

Is there anything else to do here? Maybe write those results down somewhere?

TheoBrigitte commented

There seems to be something wrong in the scaling of the prometheus-agent pods, as the number of observed time series and the current number of shards do not match what the operator is supposed to have configured.

Image
source

I added more unit test cases in the operator to ensure the number of shards computed by the operator is correct, giantswarm/observability-operator#160

We'll leave this for now, and anyone observing a similar difference in the future is welcome to investigate further :)

QuentinBisson commented

We configured the sharding value in the config at 500,000 time series, not 1,000,000 anymore. That could explain the difference you see.
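
For reference, the expected shard count under each threshold, assuming the operator simply divides the observed series count by the per-shard threshold and rounds up (the actual formula in observability-operator may differ, e.g. by adding headroom):

  # expected shards = ceil(series / threshold)
  series=750000
  for threshold in 1000000 500000; do
    echo "threshold=$threshold -> shards=$(( (series + threshold - 1) / threshold ))"
  done
  # -> 1 shard at 1,000,000 and 2 at 500,000, while golem was running 3 prometheus-agent pods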

QuentinBisson commented

But as long as we're okay with the results, we don't have a reason to keep this issue open, right?

TheoBrigitte commented

A lower threshold calls the results we are observing even more into question. But for now we are good here, and we can close this.
