This repository has been archived by the owner on Dec 6, 2024. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Invalid UTF-8 error handling policy #257
Invalid UTF-8 error handling policy #257
Changes from 2 commits
4f6c3a5
6df233d
c30ff3d
e9a0d56
47b656f
15353f1
f23f8b0
d79c75f
e84d1e9
10721f9
4dbb16f
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the cost of performing this check on valid strings in the collector?
I see this as a performance tradeoff for where to do the enforcement of utf-8, and my preference would be to push as much to generation side as possible.
I'll read your alternatives considered, as you probably call this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add a few more.
E.g. Java - only enforces UTF-8 when attempting to read the bytes into Java's String format. See: https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gore-y details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to understand the cost implications for validating on receivers.
I think the tradeoff in permissive is risky -
You require ANYONE who needs to interpret a string as UTF-8 to handle failure, at that moment.
However, in a risk/reward trade-off, for well-behaving systems, avoiding UTF-8 validation at every endpoint can add up.
I like having validaiton as opt-in/opt-out, I'm not sure which should be the default though.
Personally - I think, related to your "consider invalid utf8 a bug in a processor", we should push repsonsible utf-8 as close to generation as possible, so I'd rather see this as an opt-in feature of otel than opt-out. BUT, I may be missing some context or use cases where this is highly problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I don't just "like" it, I think we SHOULD do this.
Here are my reasons:
Regarding which one should be the default, based on what I've seen in Microsoft across Windows/Office/Azure/etc. I think it should be off-by-default and allow folks to opt-in.