
Check the resource usage of Alloy #3724

Closed · 2 tasks · Tracked by #3520
Rotfuks opened this issue Oct 15, 2024 · 12 comments
Labels: team/atlas Team Atlas

Rotfuks commented Oct 15, 2024

Motivation

We've seen some weird numbers when checking the golem installation, on which Alloy is already rolled out. It seems that Alloy is using significantly more resources than promtail and prometheus-agent combined. This is not good.

Todo

Outcome

  • We are confident that Alloy is not using significantly more resources than prometheus-agent and promtail

hervenicol commented Oct 17, 2024

Actual RAM usage for Alloy on golem

Queries

RAM: sum(container_memory_working_set_bytes{cluster_id="golem", namespace="kube-system", pod=~"alloy-metrics.*", container!="", image!=""}) by (pod)

Series: sum(prometheus_remote_write_wal_storage_active_series{pod=~"alloy-metrics-.*"}) by (pod)
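
The two can also be combined into a rough bytes-per-active-series figure per pod. A minimal sketch, assuming the datasource's query API is port-forwarded to localhost:9090 (the endpoint and port are assumptions, not the actual setup):

  # hypothetical: query API reachable on localhost:9090
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=
        sum(container_memory_working_set_bytes{cluster_id="golem", namespace="kube-system", pod=~"alloy-metrics.*", container!="", image!=""}) by (pod)
      /
        sum(prometheus_remote_write_wal_storage_active_series{pod=~"alloy-metrics-.*"}) by (pod)'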

Numbers

Currently:

  • 2 alloy-metrics pods
  • scraping 350k and 400k metrics
  • RAM usage is 9GB and 3.5GB

=> the pod scraping 400k metrics is the one using less RAM. But it's also the youngest one.

Pprof

Extracting data:

  • port-forward with ks port-forward alloy-metrics-0 12345
  • get heap data with curl localhost:12345/debug/pprof/heap -o heap.pprof

Visualizing data with https://play.grafana.org/a/grafana-pyroscope-app/ad-hoc
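
A roughly equivalent local workflow, as a sketch in case the Grafana ad-hoc viewer is not at hand (it assumes ks above is a kubectl alias scoped to kube-system, and that a Go toolchain is installed):

  # "ks" is assumed to expand to kubectl -n kube-system
  kubectl -n kube-system port-forward alloy-metrics-0 12345:12345 &

  # fetch the heap profile from Alloy's pprof endpoint
  curl -s localhost:12345/debug/pprof/heap -o heap.pprof

  # inspect locally: top consumers in the terminal, or the interactive web UI
  go tool pprof -top heap.pprof
  go tool pprof -http=:8080 heap.pprof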

  • alloy-metrics-0: screenshots of the alloc_objects, alloc_space, inuse_objects and inuse_space profiles
  • alloy-metrics-1: screenshots of the alloc_objects, alloc_space, inuse_objects and inuse_space profiles

TheoBrigitte commented Oct 22, 2024

Our Alloy app is already up-to-date

Rotfuks commented Oct 22, 2024

So what's the outcome here then?

Currently:
2 alloy-metrics pods
scraping 350k and 400k metrics
RAM usage is 9GB and 3.5GB

This sounds closer to what prometheus-agents had - so we improved here and closed the gap?

TheoBrigitte commented Oct 22, 2024

Looking into Alloy memory usage over the last 2 days, I see no anomaly; by this I mean that the memory usage stays relatively flat and changes according to the number of time series observed. Memory usage is ~12 GiB.

Image
source

Putting this into perspective with a Prometheus agent handling a similar number of time series, it seems we are consuming far less memory in that case. Memory usage ~5 GiB.

Image
source

Note that this comparison is made across 2 different installations and might be inaccurate due to the nature of the underlying data, which might differ in terms of label cardinality; that plays a big part in the actual memory usage.

TheoBrigitte commented Oct 24, 2024

I replaced Alloy with Prometheus agent on golem so we can collect data over the next couple of days and then compare. I am aiming for a minimum of 2 days of data and a maximum of 7 days.

Explore prometheus-agent query
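
For the record, the comparison query mirrors the Alloy one above. A sketch only: both the prometheus-.* pod-name regex and the localhost:9090 endpoint are assumptions.

  # hypothetical pod-name regex and endpoint
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(container_memory_working_set_bytes{cluster_id="golem", pod=~"prometheus-.*", container!="", image!=""}) by (pod)'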

TheoBrigitte commented

prometheus-agent does consume the same amount of memory as Alloy after a few (5) days.

Note that there are 3 prometheus-agent pods consuming 5, 4.8, and 1.2 GiB respectively, so ~11 GiB in total with 750k time series.

Image

ramping up over 5 days

Image

hervenicol commented

I struggle with reading those graphs.

  • There are lots of different RAM metrics
  • there are metrics from agents running on the WCs

So I'll refer to your numbers:

3 prometheus agent pods consuming respectively 5, 4.8, and 1.2 GiB, so ~11GiB in total with 750k time series

With alloy I had these numbers (no graph either, sorry):

2 alloy-metrics pods
scraping 350k and 400k metrics
RAM usage is 9GB and 3.5GB

So,

  • 750k series in both cases
  • 12.5GB RAM over 2 pods for alloy / 11GB RAM over 3 pods for prometheus agent

That's a bit more RAM usage for Alloy (roughly 14% more in total), but I think we can accept that.
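
Spelling out the arithmetic behind that conclusion (a quick check only; the quoted numbers mix GB and GiB, which is ignored here):

  # totals and ratio from the numbers quoted above
  awk 'BEGIN {
    alloy = 9 + 3.5          # 2 alloy-metrics pods
    agent = 5 + 4.8 + 1.2    # 3 prometheus-agent pods
    printf "alloy=%.1f  prometheus-agent=%.1f  ratio=%.2f\n", alloy, agent, alloy / agent
  }'
  # -> alloy=12.5  prometheus-agent=11.0  ratio=1.14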

QuentinBisson commented

Yes, I think that's something we can accept. Maybe we should find ways to reduce the difference in resource usage between the 2 pods (I think it's KSM and we should maybe shard it), but those are good results :)

Is there anything else to do here? Maybe write those results down somewhere?

TheoBrigitte commented

There seems to be something wrong in the scaling of the prometheus-agent pods, as the number of observed time series and the current number of shards do not match what the operator is supposed to have configured.

Image
source

I added more unit test cases in the operator to ensure the number of shards computed by the operator is correct, giantswarm/observability-operator#160

We'll leave this for now, and anyone observing a similar difference in the future is welcome to investigate further :)

QuentinBisson commented

We configured the sharding value in the config at 500,000 time series, not 1,000,000 anymore. That could explain the difference you see.
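
For reference, the expected shard count under each threshold, assuming the operator simply divides the observed series count by the per-shard threshold and rounds up (the actual formula in observability-operator may differ, e.g. by adding headroom):

  # expected shards = ceil(series / threshold)
  series=750000
  for threshold in 1000000 500000; do
    echo "threshold=$threshold -> shards=$(( (series + threshold - 1) / threshold ))"
  done
  # -> 1 shard at 1,000,000 and 2 at 500,000, while golem was running 3 prometheus-agent pods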

QuentinBisson commented

But as long as we're okay with the results, we don't have a reason to keep this issue open, right?

TheoBrigitte commented

A lower threshold calls the results we are observing even more into question. But for now we are good here, and we can close this.
