Should String Elements and UTF-8 Elements be handled differently? #134

FreezyLemon · 2023-04-27T18:31:22Z

EBML differentiates between String Elements and UTF-8 Elements.

They are almost exactly the same, except for this:

"String" only allows ASCII and terminator bytes.
"UTF-8" only allows UTF-8 and terminator bytes.

Since UTF-8 is compatible with ASCII, we can just treat both Elements like UTF-8 without any parsing failures. And it is what the library currently does:

matroska/src/ebml/parse.rs

Lines 80 to 84 in 7f15b7c

    
           impl<'a> EbmlParsable<'a> for String { 
        
               fn try_parse(data: &'a [u8]) -> Result<Self, ErrorKind> { 
        
                   String::from_utf8(data.to_vec()).map_err(|_| ErrorKind::StringNotUtf8) 
        
               } 
        
           }

However, this is technically not spec-compliant as we should reject non-ASCII (or terminator) values if we have a String Element, while they may be allowed for UTF-8 Elements.

A small overview:

String = UTF-8	String != UTF-8
only `String`	`String` + new type
`from_utf8` (std)	`from_utf8` (std) + `from_ascii` (custom, maybe faster?)
mostly compliant	compliant

Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?

The text was updated successfully, but these errors were encountered:

lu-zero · 2023-04-27T18:40:41Z

Accepting is nearly ok, writing on the other hand is to be avoided if we had a strict mode.

@robUx4 do you have opinions? What do you do in the reference implementation?

Luni-4 · 2023-04-27T21:36:35Z

Just my opinion, but I prefer being compliant with the specification in order to have a clear and unanmbiguous implementation.

Is a big effort converting bytes into UTF-8 or ASCII in terms of memory and time consumption?

Perhaps I'm wrong, but we need these conversions to write a compliant matroska file. Since we have to do that for muxing can't we apply the same ones for parsing?

lu-zero · 2023-04-27T23:53:50Z

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

Luni-4 · 2023-04-28T06:55:15Z

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

Is this spec compliant? If we find an ascii string I suppose we should save it somewhere. If it is non-ascii, when there should be an ascii string, doesn't it mean that we have a corrupted input?

Just posing questions to see if I've understood the problem correctly

lu-zero · 2023-04-28T08:15:36Z

Given that ascii and utf8 have the same representation for what's in ascii, in order to produce always valid files we have to reject non-ascii-for-ascii or put a placeholder instead of an utf-8 character depending on how lenient we want to be.

On consumption, I would only reject non-utf8 no matter the source.

robUx4 · 2023-05-02T05:39:20Z

You should not be able to write UTF-8 data in an ASCII string. When reading you may also want to check the string is really ASCII, although I'm not aware of any reader doing so. However since this is Rust it may not be nice to pass around UTF-8 data pretending to the ASCII.

FreezyLemon · 2023-06-10T16:40:27Z

As a note to the person implementing this (maybe me soon):

ASCII strings could be represented by Vec<u8> or a type like this, probably not worth the extra complexity (at least for now)
- this would require validation on the parsing side, but (given some prerequisites) not on the serialization side
- negligible memory efficiency improvement (nothing when String/&str; a bit when iterating over chars)
- maybe more CPU-efficient?
Serialization of EBML ASCII Element must check validity if Rust string types are used (maybe char::is_ascii(self)? or alternatively, iterate over .bytes() and check manually)
Rusts UTF-8 types don't need to be validated for EBMLs UTF-8 type. Strings can only be invalid UTF-8 if the caller "messes up" (unsafe code + broken prerequisite). std assumes valid UTF-8 in any string, so we should be able to assume that too.

Luni-4 · 2023-06-11T10:11:28Z

@FreezyLemon
I agree with your analysis

FreezyLemon added the spec-compliance Related to EBML or Matroska specification compliance label Apr 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should String Elements and UTF-8 Elements be handled differently? #134

Should String Elements and UTF-8 Elements be handled differently? #134

FreezyLemon commented Apr 27, 2023

lu-zero commented Apr 27, 2023

Luni-4 commented Apr 27, 2023 •

edited

Loading

lu-zero commented Apr 27, 2023

Luni-4 commented Apr 28, 2023 •

edited

Loading

lu-zero commented Apr 28, 2023

robUx4 commented May 2, 2023

FreezyLemon commented Jun 10, 2023

Luni-4 commented Jun 11, 2023

Should String Elements and UTF-8 Elements be handled differently? #134

Should String Elements and UTF-8 Elements be handled differently? #134

Comments

FreezyLemon commented Apr 27, 2023

lu-zero commented Apr 27, 2023

Luni-4 commented Apr 27, 2023 • edited Loading

lu-zero commented Apr 27, 2023

Luni-4 commented Apr 28, 2023 • edited Loading

lu-zero commented Apr 28, 2023

robUx4 commented May 2, 2023

FreezyLemon commented Jun 10, 2023

Luni-4 commented Jun 11, 2023

Luni-4 commented Apr 27, 2023 •

edited

Loading

Luni-4 commented Apr 28, 2023 •

edited

Loading