-
Notifications
You must be signed in to change notification settings - Fork 164
Invalid UTF-8 error handling policy #257
Changes from all commits
4f6c3a5
6df233d
c30ff3d
e9a0d56
47b656f
15353f1
f23f8b0
d79c75f
e84d1e9
10721f9
4dbb16f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,245 @@ | ||
# Consistent position towards UTF-8 validation in OpenTelemetry | ||
|
||
The OpenTelemetry project has not taken a strong position on how to handle | ||
invalid UTF-8 in its SDKs and Collector components. | ||
|
||
## Motivation | ||
|
||
When handling of invalid UTF-8 is left unspecified, anything can | ||
happen. As a project, OpenTelemetry should give its component authors | ||
guidance on how to treat invalid UTF-8 so that users can confidently | ||
rely on OpenTelemetry software/systems. | ||
|
||
## Explanation | ||
|
||
OpenTelemetry has existing [error handling | ||
guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md) | ||
which dictate the project's general posture towards handling. Users | ||
expect telemetry libraries to be harmless and safe to use, so that they | ||
can confidently instrument their application without unnecessary risk. | ||
|
||
> The API and SDK SHOULD provide safe defaults for missing or invalid arguments. | ||
|
||
Generally, users should not have to check for errors from potentially | ||
fallible instrumentation APIs, because it burdens the user, and that | ||
instead OpenTelemetry SDKs should resort to corrective action. | ||
Quoting the error handling guidelines, | ||
|
||
> For instance, a name like empty may be used if the user passes in | ||
> null as the span name argument during Span construction. | ||
|
||
Users also make this demand of the OpenTelemetry collector, which is | ||
expected to safely transport telemetry data. | ||
|
||
OpenTelemetry's OTLP protocol, specifically dictates how [valid and | ||
invalid UTF-8 strings should be mapped into attribute | ||
values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values), | ||
but there is no guidance on how to treat many other `string` fields | ||
found in the OTLP protocol. The Span `Name` field, for example, | ||
SHOULD be a valid UTF-8 string according to the Protocol Buffer | ||
`proto3`-syntax specification. What is expected from the SDK when a | ||
presumptive UTF-8 field is actually invalid UTF-8? What is expected | ||
from the Collector? Without a clear definition, users are likely to | ||
experience harmful loss of data. | ||
|
||
This document proposes to update OpenTelemetry error handling | ||
guidelines to require SDKs and Collectors to protect users from loss | ||
of telemetry caused by string-valued fields that are not well-formed. | ||
A number of options are available, all of which are better than doing | ||
nothing about this problem. | ||
|
||
## Internal details | ||
|
||
Text handling is such a key aspect of programming languages that | ||
practically, they all present different a approach. While most | ||
languages provide guardrails that maintain a well-formed character | ||
encoding in memory at runtime, there is usually a way for a programmer | ||
handling bytes incorrectly to yield an invalid string encoding. | ||
|
||
These are honest mistakes. If the telemetry will pass through an OTLP | ||
representation and UTF-8 validation is left unchecked, there is an | ||
opportunity for an entire batch of data to be rejected by the protocol | ||
buffer subsystem used in both gRPC and HTTP code paths. | ||
|
||
This is especially harmful to users, as the observability tools they | ||
have in place suddely can't help them. Users can't log, span, or | ||
metric about the invalid UTF-8 data, and terminals may also have | ||
trouble displaying the data. For users to learn the source of their | ||
problem, some component has to correct the invalid data before data | ||
loss happens. | ||
|
||
### Possible resolutions | ||
|
||
Here are some three options: | ||
|
||
1. Automatically correct invalid UTF-8. | ||
2. Drop partial batches of data containing invalid UTF-8. | ||
3. Permissively allow invalid UTF-8 to propagate (further) into the pipeline. | ||
|
||
In this proposal, we reject the third option. See alternatives considered, below. | ||
|
||
#### Automatic correction | ||
|
||
This document proposes that OpenTelemetry's default approach should be | ||
the one taken by the Rust `String::from_utf8_lossy` method, which | ||
converts invalid UTF-8 byte sequences to the Unicode replacement | ||
character, `�` (U+FFFD). This is the most preferable outcome as it is | ||
simple and preserves what valid content can be recovered from the | ||
data. | ||
|
||
#### Dropping data | ||
|
||
Among the options available, we view dropping whole batches of | ||
telemetry as an unacceptable outcome. OpenTelemetry SDKs and | ||
Collectors SHOULD prefer to drop individual items of telemetry as | ||
opposed to dropping whole batches of telemetry. | ||
|
||
When a Collector drops individual items of telemetry as a result of | ||
invalid UTF-8, it SHOULD return a `rejected_spans`, | ||
`rejected_data_points`, or `rejected_log_records` count along with a | ||
message indicating data loss. | ||
|
||
### Survey of existing systems | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should add a few more. E.g. Java - only enforces UTF-8 when attempting to read the bytes into Java's String format. See: https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gore-y details |
||
|
||
The Go gRPC-Go implementation with its default protocol buffer checks | ||
for valid UTF-8 in both the client (`rpc error: code = Internal desc = | ||
grpc: error while marshaling: string field contains invalid UTF-8`) | ||
and the server (`rpc error: code = Internal desc = grpc: error | ||
unmarshalling request: string field contains invalid UTF-8`). | ||
|
||
The Rust Hyperium gRPC implementation checks for valid UTF-8 on the | ||
server but not on the client (`Error: Status { code: Internal, | ||
message: "failed to decode Protobuf message: HelloRequest.name: | ||
invalid string value: data is not UTF-8 encoded"...`). | ||
|
||
The OpenTelemetry Collector OTLP exporter and receiver do not validate | ||
UTF-8, as a result of choosing the "Gogofaster" protocol buffer | ||
library option. | ||
|
||
Given these out-of-the-box configurations, it would be possible for a | ||
Rust SDK to send invald UTF-8 to an OpenTelemetry Collector's OTLP | ||
receiver. That collector could forward to another OpenTelemtry | ||
Collector over OTLP successfully, but a non-OTLP exporter using a | ||
different protocol buffer implementation would likely be the first to | ||
observe the invalid UTF-8, and by that time, a large batch of | ||
telemetry could fail as a result. | ||
|
||
### Responsibility to the SDK user | ||
|
||
The existing specification [dictates a safety mechanism for exporting | ||
invalid string-valued attributes safely][SAFETY], however it only | ||
applies to attribute strings: | ||
|
||
[SAFETY]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values | ||
|
||
``` | ||
## String Values | ||
String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field. | ||
|
||
String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string. | ||
``` | ||
|
||
To satisfy the existing error-handling requirements, the OpenTelemetry | ||
SDK specifications will be modified for all signals with an opt-out | ||
validation feature. | ||
|
||
#### Proposed Collector behavior change | ||
|
||
The SDK SHOULD in its default configuration validate all string-valued | ||
telemetry data fields. Each run of invalid UTF-8 (i.e., any invalid | ||
UTF-8 sequences) will be replaced by a single Unicode replacement | ||
character, `�` (U+FFFD). The exact behavior of this correction is | ||
undefined. When possible, SDKs SHOULD use a built-in library for this | ||
repair (for example, [Golang's | ||
`strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or | ||
[Rust's | ||
`String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) | ||
satisfy this requirement). | ||
|
||
SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons. | ||
|
||
Optional UTF-8 validation MUST be applied before the attribute | ||
transformation rule stated for invalid UTF-8 strings. This means | ||
UTF-8 validation prevents downgrading to a `bytes_value` encoding | ||
for string attributes. | ||
|
||
#### No byte-slice valued attribute API | ||
|
||
As a caveat, the OpenTelemetry project has previously debated and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was this ever formally rejected? I'd like this option because we technically already support it in Erlang. The fact the main string type in Erlang/Elixir is So because attribute values are already type I recall it being rejected informally in spec sig meetings but maybe different luck with a formal proposal. Do you think there would be any chance of that? |
||
rejected the potential to support a byte-slice typed attribute in | ||
OpenTelemetry APIs. | ||
|
||
This potential feature was rejected because it allows API users a way | ||
to record a possibly uninterpretable sequence of bytes. Users with | ||
invalid UTF-8 are left with a few options, for example: | ||
|
||
- Base64-encode the invalid data wrapped in human-readable syntax | ||
(e.g., `base64:WNhbHF1ZWVuY29hY2hod`). | ||
- Transmute the byte slice to an integer slice, which is supported. | ||
|
||
### Responsibility to the Collector user | ||
|
||
Users need to trust that the OpenTelemetry Collector will not cause | ||
unintended data loss, and we observe this can easily result from a | ||
mismatch of permissive receiver and strict exporter. However, we are | ||
aware that a strict receiver will cause whole batches of telemetry to | ||
be lost, therefore it seems better for Collector pipelines to | ||
explicitly handle UTF-8 validation, rather than leave it to the | ||
protocol buffer library. | ||
|
||
The OpenTelemetry Collector should support automatic UTF-8 validation to | ||
protect users, however there are several places this could be done: | ||
|
||
1. Following a receiver, with data coming from an external source, | ||
2. Following a mutating processor, with data modified by potentially | ||
untrustworthy code, | ||
3. Prior to an exporter, with data coming from either an external | ||
source and/or modified by potentially untrustworhty code. | ||
|
||
Each of these approaches will take significant effort and vary in cost | ||
at runtime. | ||
|
||
#### Proposed Collector behavior change | ||
|
||
To reduce the cost of UTF-8 validation to a minimum, we propose: | ||
|
||
- UTF-8 validation SHOULD be enabled by default for all Receiver components | ||
- Users SHOULD be able to opt-out of UTF-8 validation | ||
- A `receiverhelper` library SHOULD offer a function to correct | ||
invalid UTF-8 in-place, with two configurable outcomes (1) reject | ||
individual items containing invalid UTF-8, meaning to count them as | ||
rejected spans/points/logs, and (2) repair invalid UTF-8 as specified | ||
for SDKs above. | ||
|
||
When an OpenTelemetry collector receives telemetry data in any | ||
protocol, in cases where the underlying RPC or protocol buffer | ||
libraries does not guarantee valid UTF-8, the `receiverhelper` library | ||
SHOULD be called to optionally validate UTF-8 according to the | ||
confiuration described above. | ||
|
||
### Alternatives considered | ||
|
||
#### Exhaustive validation | ||
|
||
The Collector behavior proposed above only validates data after it is | ||
received, not after it is modified by a processor. We consider the | ||
risk of a malfunctioning processor to be acceptable. If this happens, | ||
it will be considered a bug in the Collector component. In other | ||
words, the Collector SHOULD NOT perform internal data validation and | ||
it SHOULD perform external data validation. | ||
|
||
#### Permissive transport | ||
|
||
We observe that some protocol buffer implementations are permissive | ||
with respect to invalid UTF-8, while others are strict. It can be | ||
risky to combine dissimilar protocol buffer implementations in a | ||
pipeline, since a receiver might accept data that an exporter cannot | ||
send, simply resulting from invalid UTF-8. | ||
|
||
Considering whether components could support permissive data | ||
transport, it appears to be risky and for little reward. If | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd still like to understand the cost implications for validating on receivers. I think the tradeoff in permissive is risky - You require ANYONE who needs to interpret a string as UTF-8 to handle failure, at that moment. However, in a risk/reward trade-off, for well-behaving systems, avoiding UTF-8 validation at every endpoint can add up. I like having validaiton as opt-in/opt-out, I'm not sure which should be the default though.
Personally - I think, related to your "consider invalid utf8 a bug in a processor", we should push repsonsible utf-8 as close to generation as possible, so I'd rather see this as an opt-in feature of otel than opt-out. BUT, I may be missing some context or use cases where this is highly problematic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
+1, I don't just "like" it, I think we SHOULD do this. Here are my reasons:
Regarding which one should be the default, based on what I've seen in Microsoft across Windows/Office/Azure/etc. I think it should be off-by-default and allow folks to opt-in. |
||
OpenTelemetry decided to allow this, it would not be a sensible | ||
default. On the other hand, existing OpenTelemetry Collectors have | ||
this behavior, and changing it will require time. In the period of | ||
time while this situation persists, we effectively have permissive | ||
data transport (and the associated risks). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the cost of performing this check on valid strings in the collector?
I see this as a performance tradeoff for where to do the enforcement of utf-8, and my preference would be to push as much to generation side as possible.
I'll read your alternatives considered, as you probably call this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").