Is this UTF8-friendly? #1

DonaldTsang · 2019-07-03T05:57:06Z

Was wondering if this will work with UTF-8 (and other Unicode encodings).

kosarev · 2019-07-03T11:21:05Z

It depends on what you want to achieve. The two most common Unicode-related tasks are probably (1) making a given sequence string-safe by removing the quote characters and (2) encoding binary data as a Unicode string. To be efficient, the latter requires an encoding designed specifically for Unicode, I believe, which is not the case with the escapeless encodings. For the former, the algorithm would need some generalization to make sure the quote characters are not mapped to bytes that would produce invalid UTF-8 sequences.
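To illustrate the last point, here is a minimal sketch using only the Python standard library (it does not call escapeless; `is_valid_utf8` is a helper name made up for this example). It shows that a byte sequence can be free of the quote byte and still not be valid UTF-8, so byte-level removal alone does not make data string-safe:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC3 starts a two-byte UTF-8 sequence and must be followed by a
# continuation byte (0x80..0xBF); 0x28 is not one, so decoding fails.
quote_free = bytes([0x41, 0xC3, 0x28])

has_quote = b'"' in quote_free       # the quote byte (0x22) is absent
valid = is_valid_utf8(quote_free)    # yet the sequence is not valid UTF-8
```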

DonaldTsang · 2019-07-05T09:13:18Z

@kosarev So escapeless is not a Python/JS affair then (if it is so Unicode-unsafe)?
See https://github.com/rinick/base2e15 https://github.com/grandchild/base32k https://github.com/qntm/base32768

kosarev · 2019-07-05T14:44:45Z

Yes, compared to the Unicode-specific encodings you mention, escapeless is a different animal. It is most efficient when you need to strip certain characters/bytes from a stream at the cost of a low fixed-size overhead.
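For a feel of what "stripping a byte at a fixed-size cost" can look like, here is a deliberately simplified sketch. It is not the escapeless algorithm and does not use its API; `FORBIDDEN`, `BLOCK`, `encode` and `decode` are names invented for this illustration. It removes a single byte value by swapping it, per block, with a value that does not occur in that block, and records the substitute in a one-byte header:

```python
FORBIDDEN = 0x22  # the '"' byte; an arbitrary choice for this illustration
BLOCK = 254       # payload bytes per block (assumption of this sketch)

def encode(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        used = set(block) | {FORBIDDEN}
        # Pigeonhole: at most 255 of the 256 byte values are used,
        # so a substitute that is neither in the block nor forbidden exists.
        sub = next(v for v in range(256) if v not in used)
        out.append(sub)  # one-byte header naming the substitute
        out += block.replace(bytes([FORBIDDEN]), bytes([sub]))
    return bytes(out)

def decode(encoded: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(encoded), BLOCK + 1):
        sub = encoded[i]
        body = encoded[i + 1:i + 1 + BLOCK]
        # Every occurrence of the substitute stands for the forbidden byte.
        out += body.replace(bytes([sub]), bytes([FORBIDDEN]))
    return bytes(out)
```

In this toy scheme the overhead is one byte per 254 bytes of input, regardless of how often the forbidden byte occurs, which is the property that escaping-based approaches lack.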

DonaldTsang · 2019-07-05T14:57:24Z

@kosarev So is it possible to create a compatibility format that can be converted from a Unicode-safe "alt-format" to escapeless, without the need for a Python-like bytes format?

kosarev · 2019-07-11T18:43:06Z

If I understand the idea correctly, sure, there should be no problem using escapeless in the middle of a chain of Unicode-specific encodings. As for the representation of binary data, I guess you mean JS, in which case an array of bytes sounds like a good replacement for Python's byte strings, likely with no changes to the algorithms themselves.

DonaldTsang · 2019-07-12T03:44:38Z

@kosarev so basically data => escapeless => Unicode or JSON compatible string => escapeless => data

kosarev · 2019-07-12T08:25:05Z

Yes, provided that by "Unicode or JSON compatible string" you mean Unicode-safe binary-to-text and text-to-binary encodings, and not just emitting raw binary data into strings.
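A rough sketch of that distinction, using only the Python standard library. Base64 is used here purely as a stand-in for whichever binary-to-text encoding ends up in the chain (it is not one of the encodings discussed above), and the escapeless stage is left out entirely:

```python
import base64
import json

data = bytes(range(256))  # every possible byte value

# Emitting raw binary into a JSON string: Python's json module
# refuses bytes objects outright.
try:
    json.dumps({"string": data})
    raw_ok = True
except TypeError:
    raw_ok = False

# Going through a binary-to-text encoding first round-trips cleanly.
text = base64.b64encode(data).decode("ascii")
payload = json.dumps({"string": text})
restored = base64.b64decode(json.loads(payload)["string"])
```

Any byte-level stage would slot in before `b64encode` on the way in and after `b64decode` on the way out.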

DonaldTsang · 2019-07-12T08:50:56Z

@kosarev I mean the JSON spec does allow certain "special characters" to slip through, right?
{"string": "<as many types of characters as possible>"}
The angle brackets only disallow whitespace characters, I think. What else can you think of?

kosarev · 2019-07-13T18:16:14Z

Well, escapeless wouldn't let you exclude those special characters, if that's what you mean. It has to sit in the middle of the encoding chain: it processes purely binary data and so has to be surrounded by Unicode-specific encodings on both ends. Removing certain byte values from that binary data in the middle generally can't affect which characters appear in the encoded JSON string, since that depends on the Unicode-specific encoding used.

DonaldTsang · 2019-07-13T18:20:43Z

@kosarev but escapeless can have down to 225 characters, so surely some of the forbidden code space can be stripped off right?

kosarev · 2019-07-13T18:25:57Z

It can strip off even more characters; it just won't be efficient compared to other approaches. To answer your question: removing certain characters from binary data doesn't mean these or any other characters will disappear from the Unicode-encoded version, because most likely there is no 1-to-1 correspondence.
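A tiny example of the missing 1-to-1 correspondence, again with base64 standing in for an arbitrary binary-to-text encoding (an assumption made for illustration only). The input contains no 0x41 byte, which is ASCII "A", yet the encoded text consists of nothing but "A":

```python
import base64

data = b"\x00\x00\x00"  # three zero bytes; no 0x41 ('A') anywhere
encoded = base64.b64encode(data).decode("ascii")

contains_A_byte = 0x41 in data    # False: the byte is absent from the input
contains_A_char = "A" in encoded  # True: the character appears in the output
```

So stripping a byte value at the binary stage says nothing about whether the corresponding character shows up after the text encoding is applied.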

kosarev added the "question" label · 2019-10-23
