Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should String Elements and UTF-8 Elements be handled differently? #134

Open
FreezyLemon opened this issue Apr 27, 2023 · 8 comments
Open
Labels
spec-compliance Related to EBML or Matroska specification compliance

Comments

@FreezyLemon
Copy link
Contributor

EBML differentiates between String Elements and UTF-8 Elements.

They are almost exactly the same, except for this:

  • "String" only allows ASCII and terminator bytes.
  • "UTF-8" only allows UTF-8 and terminator bytes.

Since UTF-8 is compatible with ASCII, we can just treat both Elements like UTF-8 without any parsing failures. And it is what the library currently does:

impl<'a> EbmlParsable<'a> for String {
fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> {
String::from_utf8(data.to_vec()).map_err(|_| ErrorKind::StringNotUtf8)
}
}

However, this is technically not spec-compliant as we should reject non-ASCII (or terminator) values if we have a String Element, while they may be allowed for UTF-8 Elements.

A small overview:

String = UTF-8 String != UTF-8
only String String + new type
from_utf8 (std) from_utf8 (std) + from_ascii (custom, maybe faster?)
mostly compliant compliant

Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?

@FreezyLemon FreezyLemon added the spec-compliance Related to EBML or Matroska specification compliance label Apr 27, 2023
@lu-zero
Copy link
Member

lu-zero commented Apr 27, 2023

Accepting is nearly ok, writing on the other hand is to be avoided if we had a strict mode.

@robUx4 do you have opinions? What do you do in the reference implementation?

@Luni-4
Copy link
Member

Luni-4 commented Apr 27, 2023

Just my opinion, but I prefer being compliant with the specification in order to have a clear and unanmbiguous implementation.

Is a big effort converting bytes into UTF-8 or ASCII in terms of memory and time consumption?

Perhaps I'm wrong, but we need these conversions to write a compliant matroska file. Since we have to do that for muxing can't we apply the same ones for parsing?

@lu-zero
Copy link
Member

lu-zero commented Apr 27, 2023

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

@Luni-4
Copy link
Member

Luni-4 commented Apr 28, 2023

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

Is this spec compliant? If we find an ascii string I suppose we should save it somewhere. If it is non-ascii, when there should be an ascii string, doesn't it mean that we have a corrupted input?

Just posing questions to see if I've understood the problem correctly

@lu-zero
Copy link
Member

lu-zero commented Apr 28, 2023

Given that ascii and utf8 have the same representation for what's in ascii, in order to produce always valid files we have to reject non-ascii-for-ascii or put a placeholder instead of an utf-8 character depending on how lenient we want to be.

On consumption, I would only reject non-utf8 no matter the source.

@robUx4
Copy link

robUx4 commented May 2, 2023

You should not be able to write UTF-8 data in an ASCII string. When reading you may also want to check the string is really ASCII, although I'm not aware of any reader doing so. However since this is Rust it may not be nice to pass around UTF-8 data pretending to the ASCII.

@FreezyLemon
Copy link
Contributor Author

As a note to the person implementing this (maybe me soon):

  • ASCII strings could be represented by Vec<u8> or a type like this, probably not worth the extra complexity (at least for now)
    • this would require validation on the parsing side, but (given some prerequisites) not on the serialization side
    • negligible memory efficiency improvement (nothing when String/&str; a bit when iterating over chars)
    • maybe more CPU-efficient?
  • Serialization of EBML ASCII Element must check validity if Rust string types are used (maybe char::is_ascii(self)? or alternatively, iterate over .bytes() and check manually)
  • Rusts UTF-8 types don't need to be validated for EBMLs UTF-8 type. Strings can only be invalid UTF-8 if the caller "messes up" (unsafe code + broken prerequisite). std assumes valid UTF-8 in any string, so we should be able to assume that too.

@Luni-4
Copy link
Member

Luni-4 commented Jun 11, 2023

@FreezyLemon
I agree with your analysis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec-compliance Related to EBML or Matroska specification compliance
Projects
None yet
Development

No branches or pull requests

4 participants