# Consistent position towards UTF-8 validation in OpenTelemetry

The OpenTelemetry project has not taken a strong position on how to handle
invalid UTF-8 in its SDKs and Collector components.

## Motivation

When handling of invalid UTF-8 is left unspecified, anything can
happen. As a project, OpenTelemetry should give its component authors
guidance on how to treat invalid UTF-8 so that users can confidently
rely on OpenTelemetry software/systems.

## Explanation

OpenTelemetry has existing [error handling
guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
which dictate the project's general posture towards error handling. Users
expect telemetry libraries to be harmless and safe to use, so that they
can confidently instrument their application without unnecessary risk.

> The API and SDK SHOULD provide safe defaults for missing or invalid arguments.

Generally, users should not have to check for errors from potentially
fallible instrumentation APIs, because doing so burdens the user;
instead, OpenTelemetry SDKs should take corrective action themselves.
Quoting the error handling guidelines,

> For instance, a name like empty may be used if the user passes in
> null as the span name argument during Span construction.

Users also make this demand of the OpenTelemetry collector, which is
expected to safely transport telemetry data.

OpenTelemetry's OTLP protocol specifically dictates how [valid and
invalid UTF-8 strings should be mapped into attribute
values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values),
but there is no guidance on how to treat many other `string` fields
found in the OTLP protocol. The Span `Name` field, for example,
SHOULD be a valid UTF-8 string according to the Protocol Buffer
`proto3`-syntax specification. What is expected from the SDK when a
presumptive UTF-8 field is actually invalid UTF-8? What is expected
from the Collector? Without a clear definition, users are likely to
experience harmful loss of data.

This document proposes to update OpenTelemetry error handling
guidelines to require SDKs and Collectors to protect users from loss
of telemetry caused by string-valued fields that are not well-formed.
A number of options are available, all of which are better than doing
nothing about this problem.

## Internal details

Text handling is such a key aspect of programming languages that
practically all of them take a different approach. While most
languages provide guardrails that maintain a well-formed character
encoding in memory at runtime, there is usually a way for a programmer
who mishandles raw bytes to produce an invalid string encoding.
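
As a minimal Go illustration (Go being the Collector's implementation
language), an ordinary byte-handling mistake such as truncating a
multi-byte rune is enough to produce an invalid string:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Go strings are arbitrary byte sequences: nothing stops a program
	// from truncating a multi-byte rune. "é" encodes as 0xC3 0xA9, so
	// keeping only the first byte leaves a dangling lead byte.
	s := "caf" + "\xc3"

	fmt.Println(utf8.ValidString(s)) // false
}
```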

These are honest mistakes. If the telemetry passes through an OTLP
representation and UTF-8 validation is left unchecked, there is an
opportunity for an entire batch of data to be rejected by the protocol
buffer subsystem used in both the gRPC and HTTP code paths.

This is especially harmful to users, as the observability tools they
have in place suddenly can't help them. Users can't emit logs, spans,
or metrics about the invalid UTF-8 data, and terminals may also have
trouble displaying it. For users to learn the source of their
problem, some component has to correct the invalid data before data
loss happens.

### Possible resolutions

There are three options:

1. Automatically correct invalid UTF-8.
2. Drop partial batches of data containing invalid UTF-8.
3. Permissively allow invalid UTF-8 to propagate (further) into the pipeline.

In this proposal, we reject the third option. See alternatives considered, below.

#### Automatic correction

This document proposes that OpenTelemetry's default approach should be
the one taken by the Rust `String::from_utf8_lossy` method, which
converts invalid UTF-8 byte sequences to the Unicode replacement
character, `�` (U+FFFD). This is the preferred outcome, as it is
simple and preserves whatever valid content can be recovered from the
data.
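
For illustration, here is a minimal Go sketch of the same lossy-repair
behavior, using the standard library's `strings.ToValidUTF8`:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// "\xff\xfe" can never appear in well-formed UTF-8. ToValidUTF8
	// replaces each run of invalid bytes with the supplied replacement,
	// preserving the valid content around it.
	dirty := "span-\xff\xfe-name"
	fmt.Println(strings.ToValidUTF8(dirty, "\uFFFD")) // span-�-name
}
```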

#### Dropping data

> **Reviewer comment (Contributor):** What's the cost of performing this check on valid strings in the collector? I see this as a performance tradeoff for where to do the enforcement of UTF-8, and my preference would be to push as much to the generation side as possible. I'll read your alternatives considered, as you probably call this out.

> **Reviewer comment (Member):** "Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").

Among the options available, we view dropping whole batches of
telemetry as an unacceptable outcome. OpenTelemetry SDKs and
Collectors SHOULD prefer to drop individual items of telemetry as
opposed to dropping whole batches of telemetry.

When a Collector drops individual items of telemetry as a result of
invalid UTF-8, it SHOULD return a `rejected_spans`,
`rejected_data_points`, or `rejected_log_records` count along with a
message indicating data loss.
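
As a sketch of what that reporting could look like, the following
builds a partial-success response for the trace signal, assuming the
current Collector `pdata` `ptraceotlp` API:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp"
)

// buildPartialSuccess builds an OTLP trace export response reporting
// that some spans were dropped while the rest of the batch succeeded.
func buildPartialSuccess(dropped int64) ptraceotlp.ExportResponse {
	resp := ptraceotlp.NewExportResponse()
	ps := resp.PartialSuccess()
	ps.SetRejectedSpans(dropped)
	ps.SetErrorMessage("dropped spans containing invalid UTF-8")
	return resp
}

func main() {
	resp := buildPartialSuccess(3)
	fmt.Println(resp.PartialSuccess().RejectedSpans()) // 3
}
```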

### Survey of existing systems

> **Reviewer comment (@jsuereth, Oct 11, 2024):** You should add a few more. E.g. Java only enforces UTF-8 when attempting to read the bytes into Java's String format. See https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gory details.

The gRPC-Go implementation, with its default protocol buffer library,
checks for valid UTF-8 in both the client (`rpc error: code = Internal
desc = grpc: error while marshaling: string field contains invalid
UTF-8`) and the server (`rpc error: code = Internal desc = grpc: error
unmarshalling request: string field contains invalid UTF-8`).
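
The following runnable Go fragment demonstrates the marshaling-side
check using a well-known wrapper type; the exact error text may differ
by library version:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	// StringValue.value is a proto3 string field, so the Go protobuf
	// runtime validates UTF-8 when marshaling it.
	msg := wrapperspb.String("oops\xff")
	if _, err := proto.Marshal(msg); err != nil {
		fmt.Println(err) // the error reports invalid UTF-8 in the field
	}
}
```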

The Rust Hyperium gRPC implementation checks for valid UTF-8 on the
server but not on the client (`Error: Status { code: Internal,
message: "failed to decode Protobuf message: HelloRequest.name:
invalid string value: data is not UTF-8 encoded"...`).

The OpenTelemetry Collector OTLP exporter and receiver do not validate
UTF-8, as a result of choosing the "Gogofaster" protocol buffer
library option.

Given these out-of-the-box configurations, it would be possible for a
Rust SDK to send invalid UTF-8 to an OpenTelemetry Collector's OTLP
receiver. That Collector could forward to another OpenTelemetry
Collector over OTLP successfully, but a non-OTLP exporter using a
different protocol buffer implementation would likely be the first to
observe the invalid UTF-8, and by that time, a large batch of
telemetry could fail as a result.

### Responsibility to the SDK user

The existing specification [dictates a safety mechanism for exporting
invalid string-valued attributes][SAFETY]; however, it only applies to
attribute strings:

[SAFETY]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values

```
## String Values
String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.

String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.
```

To satisfy the existing error-handling requirements, the OpenTelemetry
SDK specifications will be modified for all signals to include an
opt-out validation feature.

#### Proposed SDK behavior change

The SDK SHOULD, in its default configuration, validate all
string-valued telemetry data fields. Each run of invalid UTF-8 bytes
will be replaced by a single Unicode replacement character, `�`
(U+FFFD). The exact details of this correction are not specified.
When possible, SDKs SHOULD use a built-in library for this
repair (for example, [Golang's
`strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or
[Rust's
`String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy)
satisfy this requirement).

SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons.

When enabled, UTF-8 validation MUST be applied before the attribute
transformation rule stated for invalid UTF-8 strings. This means
UTF-8 validation prevents string attributes from being downgraded to
a `bytes_value` encoding.
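
A hypothetical sketch of how this might look in an SDK's export path;
`Config` and `sanitize` are illustrative names rather than an existing
API:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Config is a hypothetical SDK option; real SDKs would surface the
// opt-out through their own configuration mechanisms.
type Config struct {
	SkipUTF8Validation bool
}

// sanitize repairs a string field before it reaches the attribute
// transformation rules, so a repaired string is never downgraded to
// a bytes_value encoding.
func sanitize(cfg Config, s string) string {
	if cfg.SkipUTF8Validation || utf8.ValidString(s) {
		return s // fast path: no allocation for valid input
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	cfg := Config{}
	fmt.Println(sanitize(cfg, "job-\x80-42")) // job-�-42
}
```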

#### No byte-slice valued attribute API

As a caveat, the OpenTelemetry project has previously debated and
rejected the potential to support a byte-slice typed attribute in
OpenTelemetry APIs.

> **Reviewer comment (Member):** Was this ever formally rejected? I'd like this option because we technically already support it in Erlang: the main string type in Erlang/Elixir is binary, and the SDK safety mechanism mentioned elsewhere in this doc stores invalid UTF-8 in the proto's bytes_value. So because attribute values are already type binary, the user can pass any binary data they want as an attribute value and it gets used. I recall it being rejected informally in spec SIG meetings, but maybe different luck with a formal proposal. Do you think there would be any chance of that?

This potential feature was rejected because it would allow API users
to record a possibly uninterpretable sequence of bytes. Users with
invalid UTF-8 are left with a few options, for example:

- Base64-encode the invalid data wrapped in human-readable syntax
(e.g., `base64:WNhbHF1ZWVuY29hY2hod`).
- Transmute the byte slice to an integer slice, which is supported.
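
For example, here is a Go sketch of both workarounds using the
OpenTelemetry attribute API; the `payload`/`payload_bytes` keys are
illustrative:

```go
package main

import (
	"encoding/base64"
	"fmt"

	"go.opentelemetry.io/otel/attribute"
)

func main() {
	raw := []byte{0xde, 0xad, 0xbe, 0xef} // arbitrary non-UTF-8 payload

	// Option 1: wrap the bytes in human-readable base64 syntax.
	b64 := attribute.String("payload", "base64:"+base64.StdEncoding.EncodeToString(raw))

	// Option 2: transmute the byte slice to an integer slice, which the
	// attribute API supports directly.
	ints := make([]int64, len(raw))
	for i, b := range raw {
		ints[i] = int64(b)
	}
	asInts := attribute.Int64Slice("payload_bytes", ints)

	fmt.Println(b64.Value.AsString())
	fmt.Println(asInts.Value.AsInt64Slice())
}
```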

### Responsibility to the Collector user

Users need to trust that the OpenTelemetry Collector will not cause
unintended data loss, and we observe that such loss can easily result
from a mismatch between a permissive receiver and a strict exporter.
However, we are aware that a strict receiver will cause whole batches
of telemetry to be lost; therefore it seems better for Collector
pipelines to handle UTF-8 validation explicitly, rather than leave it
to the protocol buffer library.

The OpenTelemetry Collector should support automatic UTF-8 validation
to protect users; however, there are several places this could be done:

1. Following a receiver, with data coming from an external source,
2. Following a mutating processor, with data modified by potentially
   untrustworthy code,
3. Prior to an exporter, with data coming from an external source
   and/or modified by potentially untrustworthy code.

Each of these approaches will take significant effort and vary in cost
at runtime.

#### Proposed Collector behavior change

To reduce the cost of UTF-8 validation to a minimum, we propose:

- UTF-8 validation SHOULD be enabled by default for all Receiver components.
- Users SHOULD be able to opt out of UTF-8 validation.
- A `receiverhelper` library SHOULD offer a function to correct
  invalid UTF-8 in-place, with two configurable outcomes: (1) reject
  individual items containing invalid UTF-8, meaning to count them as
  rejected spans/points/logs, or (2) repair invalid UTF-8 as specified
  for SDKs above.

When an OpenTelemetry Collector receives telemetry data in any
protocol, in cases where the underlying RPC or protocol buffer
libraries do not guarantee valid UTF-8, the `receiverhelper` library
SHOULD be called to optionally validate UTF-8 according to the
configuration described above.
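
A sketch of the proposed repair outcome follows; `repairSpanNames` is
a hypothetical helper, though the `pdata` traversal it uses is the
real API. A production version would cover every string field, all
signals, and the reject outcome:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// repairSpanNames walks a trace payload and repairs invalid UTF-8 in
// span names in-place, returning how many names were repaired.
func repairSpanNames(td ptrace.Traces) (repaired int) {
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		sss := rss.At(i).ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				span := spans.At(k)
				if !utf8.ValidString(span.Name()) {
					span.SetName(strings.ToValidUTF8(span.Name(), "\uFFFD"))
					repaired++
				}
			}
		}
	}
	return repaired
}

func main() {
	td := ptrace.NewTraces()
	span := td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty().Spans().AppendEmpty()
	span.SetName("op-\xff")
	fmt.Println(repairSpanNames(td), span.Name()) // 1 op-�
}
```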

### Alternatives considered

#### Exhaustive validation

The Collector behavior proposed above only validates data after it is
received, not after it is modified by a processor. We consider the
risk of a malfunctioning processor to be acceptable. If this happens,
it will be considered a bug in the Collector component. In other
words, the Collector SHOULD NOT perform internal data validation and
it SHOULD perform external data validation.

#### Permissive transport

We observe that some protocol buffer implementations are permissive
with respect to invalid UTF-8, while others are strict. It can be
risky to combine dissimilar protocol buffer implementations in a
pipeline, since a receiver might accept data that an exporter cannot
send, simply as a result of invalid UTF-8.

Considering whether components could support permissive data
transport, it appears risky for little reward. If OpenTelemetry
decided to allow this, it would not be a sensible default. On the
other hand, existing OpenTelemetry Collectors have this behavior
today, and changing it will take time. While this situation persists,
we effectively have permissive data transport (and the associated
risks).

> **Reviewer comment (Contributor):** I'd still like to understand the cost implications of validating on receivers. I think the tradeoff in permissive is risky: you require ANYONE who needs to interpret a string as UTF-8 to handle failure, at that moment. However, in a risk/reward trade-off, for well-behaving systems, avoiding UTF-8 validation at every endpoint can add up. I like having validation as opt-in/opt-out; I'm not sure which should be the default though.
>
> 1. How likely do we think UTF-8 issues are in practice?
> 2. What is the cost of performing this check in collector components?
>
> Personally, I think, related to your "consider invalid UTF-8 a bug in a processor", we should push responsible UTF-8 handling as close to generation as possible, so I'd rather see this as an opt-in feature of OTel than opt-out. BUT, I may be missing some context or use cases where this is highly problematic.

> **Reviewer comment (Member):**
> > I like having validation as opt-in/opt-out; I'm not sure which should be the default though.
>
> +1, I don't just "like" it, I think we SHOULD do this. Here are my reasons:
>
> 1. The Collector is normally sending the data to some endpoints. Many backend services already perform such validation/correction, so folks might want to just do it once in the backend rather than duplicating the effort.
> 2. There are cases where data can be consumed directly on the Collector side (e.g. a Collector running in a local data center might decide to trigger a rollback due to a certain metrics KPI drop during a deployment); I think this is a general pattern for things that run on the edge.
> 3. Depending on the ownership of different parts of the system, the parts could be designed to trust each other or not. The Collector needs to provide flexibility.
> 4. Things could break between the Collector and the backend (e.g. bit flips caused by high-energy particles from the universe, hardware failures); certain software needs to handle these as part of its design.
>
> Regarding which one should be the default, based on what I've seen at Microsoft across Windows/Office/Azure/etc., I think it should be off-by-default and allow folks to opt in.