Is your feature request related to a problem? Please describe.
We want to enforce a rate limit to protect our Tempo infrastructure. While testing Tempo (version 2.5) running in a Kubernetes cluster managed via Tanka, in order to determine a supportable global rate limit, our ingesters (10 GiB memory) are OOM killed even at low write volumes (18 MiB/s) under load from the xk6-client-tracing extension, with incoming traces ranging from 5 KiB to 250 KiB.
We observed that despite setting a low global rate limit and discarding many spans, we cannot protect the ingesters from OOMing when the incoming traces are large (see the table below). In production, where our average trace size does not exceed 20 KiB, we are able to support much higher write volumes without any rate limit in place.
We want to understand the logic behind this behavior and determine a global rate limit for our distributors. With OOM kills happening at low write volumes, we are unable to enforce a rate limit that can protect our infrastructure.
Describe the solution you'd like
Being able to enforce a rate limit that protects the Tempo infrastructure. The sketch below shows the per-tenant overrides we are referring to.
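For reference, a minimal sketch of how these limits map onto Tempo's per-tenant overrides, assuming the legacy (flat) overrides format. The burst/rate values mirror the 17 MiB / 14 MiB settings from the table below; the other fields are listed as the knobs we understand to bound ingester memory, with illustrative values (this is our assumption of the relevant configuration, not a verified recommendation):

```yaml
overrides:
  # Rate limiting knobs referenced in this issue
  ingestion_rate_strategy: global       # limit is shared across all distributors
  ingestion_burst_size_bytes: 17825792  # ~17 MiB burst
  ingestion_rate_limit_bytes: 14680064  # ~14 MiB/s sustained
  # Per-tenant knobs that bound memory held per trace / per tenant (illustrative values)
  max_bytes_per_trace: 5000000          # cap on the size of a single trace
  max_traces_per_user: 30000            # 30k matches the live-trace ceiling implied by the table header (our assumption)
```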
Load Test Results at set burst_size_bytes and rate_limit_bytes:
| OOM Kills | burst_size_bytes | rate_limit_bytes | Average Trace Size (Bytes) | Live Traces (30k) | Distributor bytes limit (burst + rate) | Distributor (N) x Ingester (N) | Ingester Memory (Max) | Rate Limit Strategy | Time Under Test | Average Trace Size * Live Traces (MiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 MiB | 14 MiB | 57000 | 15000 | 29 MiB | 3 x 3 | 80% | Global | 25m | 815.3915405 |
| 0 | 17 MiB | 14 MiB | 48000 | 18000 | 29 MiB | 3 x 3 | 70% | Global | 25m | 823.9746094 |
| 0 | 17 MiB | 14 MiB | 38000 | 25000 | 28 MiB | 3 x 3 | 60% | Global | 25m | 905.9906006 |
| 1 | 17 MiB | 14 MiB | 187000 | 2000 | 18 MiB | 3 x 3 | N/A | Global | < 10m | 356.6741943 |
| 1 | 17 MiB | 14 MiB | 219000 | 1200 | 18.9 MiB | 3 x 3 | N/A | Global | < 10m | 250.6256104 |
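The last column is simply the product of the two measured quantities converted to MiB; for the first row, 57,000 B × 15,000 live traces ÷ 1,048,576 ≈ 815.39 MiB.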
To get an idea of the average trace size, we used the following query (replication factor 3, hence the division by 3, since each trace is created on three ingesters):

```promql
( sum(rate(tempo_distributor_bytes_received_total{cluster=""}[$__interval])) by (cluster)
  / ( sum(rate(tempo_ingester_traces_created_total{cluster=""}[$__interval])) by (cluster) / 3 )
) / 1024 / 1024
```
Additional Context
xk6-client-tracing param.js
```javascript
import { sleep } from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
  vus: 120,
  stages: [
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
  ],
};

const endpoint = __ENV.ENDPOINT || "https://<>:443";
const client = new tracing.Client({
  endpoint,
  exporter: tracing.EXPORTER_OTLP,
  tls: {
    insecure: true,
  },
});

export default function () {
  let pushSizeTraces = 50; // traces per push
  let pushSizeSpans = 0;
  let t = [];
  for (let i = 0; i < pushSizeTraces; i++) {
    let c = 100; // spans per trace
    pushSizeSpans += c;
    t.push({
      random_service_name: false,
      spans: {
        count: c,
        size: 400, // changed with each load test run from 100 to 1200 for average trace size.
        random_name: true,
        fixed_attrs: {
          "test": "test",
        },
      },
    });
  }

  let gen = new tracing.ParameterizedGenerator(t);
  let traces = gen.traces();
  sleep(5);
  console.log(traces);
  client.push(traces);
}

export function teardown() {
  client.shutdown();
}
```
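For completeness: the script is run with a k6 binary built to include the xk6-client-tracing extension (a plain k6 build does not provide `k6/x/tracing`), and the target is passed via the `ENDPOINT` environment variable.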
adhinneupane changed the title from "Rightsizing Tempo Ingesters when trace sizes vary" to "Rightsizing Tempo Ingesters when trace sizes vary to prevent OOM kills" on Dec 4, 2024.