Is your feature request related to a problem? Please describe.
We want to enforce a rate limit to protect our Tempo infrastructure. While testing Tempo (version 2.5) running in a Kubernetes cluster managed via Tanka, in order to determine a supportable global rate limit, our ingesters (10 GiB memory) are OOM killed even at low write volumes (18 MiB/s) under load from the xk6-client-tracing extension, with incoming traces ranging from 5 KiB to 250 KiB.
We observed that despite setting a low global rate limit and discarding many spans, we cannot protect the ingesters from OOMing when the incoming traces are large (see the table below). In production, where our average trace size does not exceed 20 KiB, we are able to support much higher write volumes without any rate limit in place.
We want to understand the logic behind this behavior and determine a global rate limit for our distributors. With OOM kills happening at low write volumes, we are unable to enforce a rate limit that can protect our infrastructure.
Describe the solution you'd like
Being able to enforce a rate limit that protects the Tempo infrastructure. The sketch below shows the per-tenant overrides we are referring to.
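For reference, a minimal sketch of how these limits map onto Tempo's per-tenant overrides, assuming the legacy (flat) overrides format. The burst/rate values mirror the 17 MiB / 14 MiB settings from the table below; the other fields are listed as the knobs we understand to bound ingester memory, with illustrative values (this is our assumption of the relevant configuration, not a verified recommendation):

```yaml
overrides:
  # Rate limiting knobs referenced in this issue
  ingestion_rate_strategy: global       # limit is shared across all distributors
  ingestion_burst_size_bytes: 17825792  # ~17 MiB burst
  ingestion_rate_limit_bytes: 14680064  # ~14 MiB/s sustained
  # Per-tenant knobs that bound memory held per trace / per tenant (illustrative values)
  max_bytes_per_trace: 5000000          # cap on the size of a single trace
  max_traces_per_user: 30000            # 30k matches the live-trace ceiling implied by the table header (our assumption)
```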
Load Test Results at set burst_size_bytes and rate_limit_bytes:
| OOM Kills | burst_size_bytes | rate_limit_bytes | Average Trace Size (Bytes) | Live Traces (30k) | Distributor bytes limit (burst + rate) | Distributor (N) x Ingester (N) | Ingester Memory (Max) | Rate Limit Strategy | Time Under Test | Average Trace Size * Live Traces (MiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 MiB | 14 MiB | 57000 | 15000 | 29 MiB | 3 x 3 | 80% | Global | 25m | 815.3915405 |
| 0 | 17 MiB | 14 MiB | 48000 | 18000 | 29 MiB | 3 x 3 | 70% | Global | 25m | 823.9746094 |
| 0 | 17 MiB | 14 MiB | 38000 | 25000 | 28 MiB | 3 x 3 | 60% | Global | 25m | 905.9906006 |
| 1 | 17 MiB | 14 MiB | 187000 | 2000 | 18 MiB | 3 x 3 | N/A | Global | < 10m | 356.6741943 |
| 1 | 17 MiB | 14 MiB | 219000 | 1200 | 18.9 MiB | 3 x 3 | N/A | Global | < 10m | 250.6256104 |
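The last column is simply the product of the two measured quantities converted to MiB; for the first row, 57,000 B × 15,000 live traces ÷ 1,048,576 ≈ 815.39 MiB.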
To get an idea of the average trace size, we used the following query (replication factor 3, hence the division by 3, since each trace is created on three ingesters):

```promql
( sum(rate(tempo_distributor_bytes_received_total{cluster=""}[$__interval])) by (cluster)
  / ( sum(rate(tempo_ingester_traces_created_total{cluster=""}[$__interval])) by (cluster) / 3 )
) / 1024 / 1024
```
Additional Context
xk6-client-tracing param.js
```javascript
import { sleep } from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
  vus: 120,
  stages: [
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
  ],
};

const endpoint = __ENV.ENDPOINT || "https://<>:443";
const client = new tracing.Client({
  endpoint,
  exporter: tracing.EXPORTER_OTLP,
  tls: {
    insecure: true,
  },
});

export default function () {
  let pushSizeTraces = 50; // traces per push
  let pushSizeSpans = 0;
  let t = [];
  for (let i = 0; i < pushSizeTraces; i++) {
    let c = 100; // spans per trace
    pushSizeSpans += c;
    t.push({
      random_service_name: false,
      spans: {
        count: c,
        size: 400, // changed with each load test run from 100 to 1200 for average trace size.
        random_name: true,
        fixed_attrs: {
          "test": "test",
        },
      },
    });
  }

  let gen = new tracing.ParameterizedGenerator(t);
  let traces = gen.traces();
  sleep(5);
  console.log(traces);
  client.push(traces);
}

export function teardown() {
  client.shutdown();
}
```
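For completeness: the script is run with a k6 binary built to include the xk6-client-tracing extension (a plain k6 build does not provide `k6/x/tracing`), and the target is passed via the `ENDPOINT` environment variable.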
adhinneupane changed the title from "Rightsizing Tempo Ingesters when trace sizes vary" to "Rightsizing Tempo Ingesters when trace sizes vary to prevent OOM kills" on Dec 4, 2024.