[release-v2.7] Add doc for max_span_attr_byte and restructure troubleshoot doc (#4561)

Co-authored-by: Jack Baldry <[email protected]>
Co-authored-by: Clayton Cornell <[email protected]>
Co-authored-by: Kim Nylander <[email protected]>
4 people authored Jan 16, 2025
1 parent 20c9363 commit e7db70c
Showing 14 changed files with 216 additions and 57 deletions.
30 changes: 23 additions & 7 deletions docs/sources/tempo/configuration/_index.md
@@ -11,21 +11,22 @@ This document explains the configuration options for Tempo as well as the detail

{{< admonition type="tip" >}}
Instructions for configuring Tempo data sources are available in the [Grafana Cloud](/docs/grafana-cloud/send-data/traces/) and [Grafana](/docs/grafana/latest/datasources/tempo/) documentation.
{{% /admonition %}}
{{< /admonition >}}

The Tempo configuration options include:

- [Configure Tempo](#configure-tempo)
- [Use environment variables in the configuration](#use-environment-variables-in-the-configuration)
- [Server](#server)
- [Distributor](#distributor)
- [Set max attribute size to help control out of memory errors](#set-max-attribute-size-to-help-control-out-of-memory-errors)
- [Ingester](#ingester)
- [Metrics-generator](#metrics-generator)
- [Query-frontend](#query-frontend)
- [Limit query size to improve performance and stability](#limit-query-size-to-improve-performance-and-stability)
- [Limit the spans per spanset](#limit-the-spans-per-spanset)
- [Cap the maximum query length](#cap-the-maximum-query-length)
- [Querier](#querier)
- [Compactor](#compactor)
- [Storage](#storage)
- [Local storage recommendations](#local-storage-recommendations)
@@ -251,6 +252,21 @@ distributor:
[stale_duration: <duration> | default = 15m0s]
```

### Set max attribute size to help control out of memory errors

Tempo queriers can run out of memory when fetching traces that have spans with very large attributes.
This issue has been observed when trying to fetch a single trace using the [`tracebyID` endpoint](https://grafana.com/docs/tempo/<TEMPO_VERSION>/api_docs/#query).
For example, a trace with relatively few spans (roughly 500) can still be fairly large (approximately 250KB) when some of its spans carry attributes with very large values.

To avoid these out-of-memory crashes, use `max_span_attr_byte` to limit the maximum allowable size of any individual attribute.
Any key or value that exceeds the configured limit is truncated before it's stored.
The default value is `2048`.
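
A minimal sketch of the setting follows; the placement directly under the `distributor` block is an assumption based on this section, so confirm it against the full distributor configuration above.

```yaml
distributor:
  # Truncate any individual attribute key or value larger than this many bytes.
  # 2048 is the default; setting it to 0 disables the size check.
  max_span_attr_byte: 2048
```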

Use the `tempo_distributor_attributes_truncated_total` metric to track how many attributes are truncated.
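
As a sketch, you could watch the rate of truncations with a query like the following; the 5-minute range is an arbitrary choice, not a recommended value:

```
sum(rate(tempo_distributor_attributes_truncated_total{}[5m]))
```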

For additional information, refer to [Troubleshoot out-of-memory errors](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/out-of-memory-errors/).

## Ingester

For more information on configuration options, refer to [this file](https://github.com/grafana/tempo/blob/main/modules/ingester/config.go).
Expand Down Expand Up @@ -315,7 +331,7 @@ If you want to enable metrics-generator for your Grafana Cloud account, refer to
Use `metrics_ingestion_time_range_slack` to limit metrics generation to spans whose end times occur within the configured duration.
In Grafana Cloud, this value defaults to 30 seconds, so any span sent to the metrics-generator with an end time more than 30 seconds in the past is discarded or rejected.
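
A hedged sketch of the setting follows; the placement directly under `metrics_generator` is an assumption based on the option name, so verify it against the full configuration block below.

```yaml
metrics_generator:
  # Reject or discard spans whose end times fall outside this window.
  metrics_ingestion_time_range_slack: 30s
```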

For more information about the `local-blocks` configuration option, refer to [TraceQL metrics](https://grafana.com/docs/tempo/latest/operations/traceql-metrics/#configure-the-local-blocks-processor).
For more information about the `local-blocks` configuration option, refer to [TraceQL metrics](https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/traceql-metrics/#activate-and-configure-the-local-blocks-processor).

```yaml
# Metrics-generator configuration block
@@ -724,14 +740,14 @@ In a similar manner, excessive query result sizes can also negatively impact qu
#### Limit the spans per spanset

You can set the maximum spans per spanset by setting `max_spans_per_span_set` for the query-frontend.
The default value is 100.

In Grafana or Grafana Cloud, you can use the **Span Limit** field in the [TraceQL query editor](https://grafana.com/docs/grafana-cloud/connect-externally-hosted/data-sources/tempo/query-editor/) in Grafana Explore.
This field sets the maximum number of spans to return for each span set.
The maximum value that you can set for the **Span Limit** field (or the `spss` query parameter) is controlled by `max_spans_per_span_set`.
To disable the maximum spans per span set limit, set `max_spans_per_span_set` to `0`.
When set to `0`, there is no maximum and users can put any value in **Span Limit**.
However, this can only be set by a Tempo administrator, not by the user.
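
A minimal sketch of the query-frontend setting follows; the nesting under `query_frontend.search` is an assumption, so check the query-frontend configuration reference before relying on it.

```yaml
query_frontend:
  search:
    # Maximum spans returned per span set; 100 is the default, 0 disables the limit.
    max_spans_per_span_set: 100
```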

#### Cap the maximum query length

22 changes: 13 additions & 9 deletions docs/sources/tempo/troubleshooting/_index.md
@@ -4,7 +4,7 @@ menuTitle: Troubleshoot
description: Learn how to troubleshoot operational issues for Grafana Tempo.
weight: 700
aliases:
- ../operations/troubleshooting/
- ../operations/troubleshooting/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/
---

# Troubleshoot Tempo
@@ -16,18 +16,22 @@ In addition, the [Tempo runbook](https://github.com/grafana/tempo/blob/main/oper

## Sending traces

- [Spans are being refused with "pusher failed to consume trace data"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/max-trace-limit-reached/)
- [Is Grafana Alloy sending to the backend?](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/alloy/)
- [Spans are being refused with "pusher failed to consume trace data"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/send-traces/max-trace-limit-reached/)
- [Is Grafana Alloy sending to the backend?](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/send-traces/alloy/)

## Querying

- [Unable to find my traces in Tempo](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/unable-to-see-trace/)
- [Error message "Too many jobs in the queue"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/too-many-jobs-in-queue/)
- [Queries fail with 500 and "error using pageFinder"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/bad-blocks/)
- [I can search traces, but there are no service name or span name values available](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/search-tag)
- [Error message `response larger than the max (<number> vs <limit>)`](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/response-too-large/)
- [Search results don't match trace lookup results with long-running traces](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/long-running-traces/)
- [Unable to find my traces in Tempo](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/unable-to-see-trace/)
- [Error message "Too many jobs in the queue"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/too-many-jobs-in-queue/)
- [Queries fail with 500 and "error using pageFinder"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/bad-blocks/)
- [I can search traces, but there are no service name or span name values available](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/search-tag)
- [Error message `response larger than the max (<number> vs <limit>)`](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/response-too-large/)
- [Search results don't match trace lookup results with long-running traces](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/long-running-traces/)

## Metrics-generator

- [Metrics or service graphs seem incomplete](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/metrics-generator/)

## Out-of-memory errors

- [Set the max attribute size to help control out of memory errors](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/out-of-memory-errors/)
12 changes: 6 additions & 6 deletions docs/sources/tempo/troubleshooting/metrics-generator.md
@@ -4,24 +4,24 @@ menuTitle: Metrics-generator
description: Gain an understanding of how to debug metrics quality issues.
weight: 500
aliases:
- ../operations/troubleshooting/metrics-generator/
- ../operations/troubleshooting/metrics-generator/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/metrics-generator/
---

# Troubleshoot metrics-generator

If you are concerned with data quality issues in the metrics-generator, we'd first recommend:

- Reviewing your telemetry pipeline to determine the number of dropped spans. We are only looking for major issues here.
- Reviewing the [service graph documentation]({{< relref "../metrics-generator/service_graphs" >}}) to understand how they are built.
- Reviewing your telemetry pipeline to determine the number of dropped spans. You are only looking for major issues here.
- Reviewing the [service graph documentation](https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/service_graphs/) to understand how they are built.

If everything seems ok from these two perspectives, consider the following topics to help resolve general issues with all metrics and span metrics specifically.
If everything seems acceptable from these two perspectives, consider the following topics to help resolve general issues with all metrics and span metrics specifically.

## All metrics

### Dropped spans in the distributor

The distributor has a queue of outgoing spans to the metrics-generators. If that queue is full then the distributor
will drop spans before they reach the generator. Use the following metric to determine if that is happening:
The distributor has a queue of outgoing spans to the metrics-generators.
If the queue is full, then the distributor drops spans before they reach the generator. Use the following metric to determine if that's happening:

```
sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m]))
102 changes: 102 additions & 0 deletions docs/sources/tempo/troubleshooting/out-of-memory-errors.md
@@ -0,0 +1,102 @@
---
title: Troubleshoot out-of-memory errors
menuTitle: Out-of-memory errors
description: Gain an understanding of how to debug out-of-memory (OOM) errors.
weight: 600
---

# Troubleshoot out-of-memory errors

Learn about out-of-memory (OOM) issues and how to troubleshoot them.

## Set the max attribute size to help control out of memory errors

Tempo queriers can run out of memory when fetching traces that have spans with very large attributes.
This issue has been observed when trying to fetch a single trace using the [`tracebyID` endpoint](https://grafana.com/docs/tempo/latest/api_docs/#query).

To avoid these out-of-memory crashes, use `max_span_attr_byte` to limit the maximum allowable size of any individual attribute.
Any key or value that exceeds the configured limit is truncated before it's stored.

Use the `tempo_distributor_attributes_truncated_total` metric to track how many attributes are truncated.

```yaml
# Optional
# Configures the max size an attribute can be. Any key or value that exceeds this limit will be truncated before storing
# Setting this parameter to '0' would disable this check against attribute size
[max_span_attr_byte: <int> | default = '2048']
```

Refer to the [configuration for distributors](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#set-max-attribute-size-to-help-control-out-of-memory-errors) documentation for more information.

## Max trace size

Long-running traces (minutes or hours) or large traces (100K - 1M spans) spike the memory usage of each component that encounters them.
This is because Tempo treats a trace as a single unit, and keeps all data for the trace together to enable features like structural queries and analysis.

When reading a large trace, it can spike the memory usage of the read components:

* query-frontend
* querier
* ingester
* metrics-generator

When writing a large trace, it can spike the memory usage of the write components:

* ingester
* compactor
* metrics-generator

Start with a smaller trace size limit of 15MB, and increase it as needed.
With an average span size of 300 bytes, this allows for 50K spans per trace.
Always ensure that the limit is configured, and the largest recommended limit is 60 MB.

Configure the limit in the per-tenant overrides:
```yaml
overrides:
'tenant123':
max_bytes_per_trace: 1.5e+07
```

Refer to the [Overrides](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#standard-overrides) documentation for more information.

## Large attributes

Very large attributes, 10KB or longer, can spike the memory usage of each component when they are encountered.
Tempo's Parquet format uses dictionary-encoded columns, which works well for repeated values.
However, for very large and high cardinality attributes, this can require a large amount of memory.
A common source of large attributes is auto-instrumentation in these areas:

* HTTP
  * Request or response bodies
  * Large headers
    * [http.request.header.&lt;key>](https://opentelemetry.io/docs/specs/semconv/attributes-registry/http/)
  * Large URLs
    * http.url
    * [url.full](https://opentelemetry.io/docs/specs/semconv/attributes-registry/url/)
* Databases
  * Full query statements
    * db.statement
    * [db.query.text](https://opentelemetry.io/docs/specs/semconv/attributes-registry/db/)
* Queues
  * Message bodies

When reading these attributes, they can spike the memory usage of the read components:

* query-frontend
* querier
* ingester
* metrics-generator

When writing these attributes, they can spike the memory usage of the write components:

* ingester
* compactor
* metrics-generator

You can [automatically limit attribute sizes](https://github.com/grafana/tempo/pull/4335) using [`max_span_attr_byte`](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#set-max-attribute-size-to-help-control-out-of-memory-errors).
You can also use these options:

* Manually update application instrumentation to remove or limit these attributes
* Drop the attributes in the tracing pipeline using [attribute processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/attributesprocessor)
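
For the second option, a minimal OpenTelemetry Collector sketch could look like the following; the processor name and attribute keys are illustrative assumptions, not required values.

```yaml
processors:
  attributes/drop-large-attrs:
    actions:
      # Delete attributes that commonly carry very large values (example keys).
      - key: db.statement
        action: delete
      - key: http.request.header.cookie
        action: delete
```

Add the processor to the `processors` list of your traces pipeline so the attributes are dropped before spans reach Tempo.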
12 changes: 12 additions & 0 deletions docs/sources/tempo/troubleshooting/querying/_index.md
@@ -0,0 +1,12 @@
---
title: Issues with querying
menuTitle: Querying
description: Troubleshoot issues related to querying.
weight: 300
---

# Issues with querying

Learn about issues related to querying.

{{< section withDescriptions="true">}}
@@ -3,7 +3,8 @@ title: Bad blocks
description: Troubleshoot queries failing with an error message indicating bad blocks.
weight: 475
aliases:
- ../operations/troubleshooting/bad-blocks/
- ../../operations/troubleshooting/bad-blocks/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/bad-blocks/
- ../bad-blocks/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/bad-blocks/
---

# Bad blocks
@@ -26,7 +27,7 @@ To fix such a block, first download it onto a machine where you can run the `tem

Next run the `tempo-cli`'s `gen index` / `gen bloom` commands depending on which file is corrupt/deleted.
The command will create a fresh index/bloom-filter from the data file at the required location (in the block folder).
To view all of the options for this command, see the [cli docs]({{< relref "../operations/tempo_cli" >}}).
To view all of the options for this command, see the [CLI docs](https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/tempo_cli/).

Finally, upload the generated index or bloom-filter onto the object store backend under the folder for the block.

@@ -3,7 +3,8 @@ title: Long-running traces
description: Troubleshoot search results when using long-running traces
weight: 479
aliases:
- ../operations/troubleshooting/long-running-traces/
- ../../operations/troubleshooting/long-running-traces/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/long-running-traces/
- ../long-running-traces/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/long-running-traces/
---

# Long-running traces
@@ -20,7 +21,7 @@ different blocks, which can lead to inconsistency in a few ways:
matching blocks, which yields greater accuracy when combined.

1. When using [`spanset`
operators](https://grafana.com/docs/tempo/latest/traceql/#combining-spansets),
operators](https://grafana.com/docs/tempo/<TEMPO_VERSION>/traceql/#combine-spansets),
   Tempo only evaluates the contiguous trace of the current block. This means
   that the conditions may evaluate to false for a single block, even though
   evaluating all parts of the trace from all blocks would evaluate to true.
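
   For illustration, a hedged TraceQL sketch that combines two spansets follows; the attribute and its value are placeholders:

   ```
   { span.http.method = "POST" } && { status = error }
   ```

   Each side of the `&&` may be satisfied by spans stored in different blocks of a long-running trace.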
@@ -3,12 +3,13 @@ title: Response larger than the max
description: Troubleshoot response larger than the max error message
weight: 477
aliases:
- ../operations/troubleshooting/response-too-large/
- ../operations/troubleshooting/response-too-large/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/response-too-large/
- ../response-too-large/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/response-too-large/
---

# Response larger than the max

The error message will take a similar form to the following:
The error message is similar to the following:

```
500 Internal Server Error Body: response larger than the max (<size> vs <limit>)
@@ -3,7 +3,8 @@ title: Tag search
description: Troubleshoot No options found in Grafana tag search
weight: 476
aliases:
- ../operations/troubleshooting/search-tag/
- ../../operations/troubleshooting/search-tag/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/troubleshooting/search-tag/
- ../search-tag/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/search-tag/
---

# Tag search
@@ -25,4 +26,4 @@ when a query exceeds the configured value.
There are two main solutions to this issue:

* Reduce the cardinality of tags pushed to Tempo. Reducing the number of unique tag values will reduce the size returned by a tag search query.
* Increase the `max_bytes_per_tag_values_query` parameter in the [overrides]({{< relref "../configuration#overrides" >}}) block of your Tempo configuration to a value as high as 50MB.
* Increase the `max_bytes_per_tag_values_query` parameter in the [overrides](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#overrides) block of your Tempo configuration to a value as high as 50MB.
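
As a sketch of the second option, a per-tenant override might look like this; the tenant name is a placeholder and the value is the 50MB upper bound mentioned above:

```yaml
overrides:
  'tenant123':
    # Allow tag-values query responses up to ~50MB
    max_bytes_per_tag_values_query: 5.0e+07
```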