Is this UTF8-friendly? #1

Open
DonaldTsang opened this issue Jul 3, 2019 · 11 comments
Labels
question Further information is requested

Comments

@DonaldTsang

I was wondering whether this will work with UTF-8 (and other Unicode encodings).

@kosarev
Owner

kosarev commented Jul 3, 2019

It depends on what you want to achieve. The two most common Unicode-related tasks are probably (1) making a given sequence string-safe by removing the quote characters and (2) encoding binary data as a Unicode string. To be efficient, the latter requires an encoding designed specifically for Unicode, I believe, which the escapeless encodings are not. For the former, the algorithm would need some generalization to make sure the quote characters are not mapped to bytes that would result in invalid UTF-8 sequences.
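
To illustrate the last point, here is a minimal sketch in plain Python (it does not use escapeless itself). The choice of 0xFF as the substitute byte is an assumption made purely for the example; it shows how remapping the quote byte in UTF-8 text can yield a sequence that is no longer valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# UTF-8 text containing the quote character (byte 0x22).
text = 'say "hi"'.encode("utf-8")

# Hypothetical substitution: map the quote byte to 0xFF.
# 0xFF never occurs in well-formed UTF-8, so the result cannot decode.
remapped = text.replace(b'"', b"\xff")

print(is_valid_utf8(text))      # True
print(is_valid_utf8(remapped))  # False
```

A generalized algorithm would have to restrict the substitute bytes so that the output still decodes.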

@DonaldTsang
Author

@kosarev So escapeless is not a Python/JS affair then (if it is that Unicode-unsafe)?
See https://github.com/rinick/base2e15, https://github.com/grandchild/base32k and https://github.com/qntm/base32768

@kosarev
Owner

kosarev commented Jul 5, 2019

Yes, compared to the Unicode-specific encodings you mention, escapeless is a different animal. It is most efficient when you need to strip certain characters/bytes from a stream at the cost of a low fixed-size overhead.
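
As a rough illustration of what "stripping bytes at a fixed-size overhead" can look like, here is a toy sketch. It is not taken from the escapeless source and is not its API; the function names `encode_block` and `decode_block` and the header layout are assumptions made for this example. For one block, each forbidden byte value is swapped with a byte value that does not occur in the block, and the chosen substitutes are prepended as a header.

```python
def encode_block(block: bytes, forbidden: bytes) -> bytes:
    """Toy encoder: replace each forbidden byte with an unused byte value."""
    present = set(block)
    # Candidate substitutes: values absent from the block and not forbidden.
    free = [v for v in range(256) if v not in present and v not in forbidden]
    if len(free) < len(forbidden):
        raise ValueError("block too diverse; split it into smaller blocks")
    subs = free[:len(forbidden)]
    table = dict(zip(forbidden, subs))
    # Header lists the substitutes in the same order as `forbidden`.
    return bytes(subs) + bytes(table.get(b, b) for b in block)


def decode_block(encoded: bytes, forbidden: bytes) -> bytes:
    """Toy decoder: read the header and map substitutes back."""
    k = len(forbidden)
    subs, body = encoded[:k], encoded[k:]
    table = dict(zip(subs, forbidden))
    return bytes(table.get(b, b) for b in body)


forbidden = b'"\x00'
block = b'say "hi"\x00'
encoded = encode_block(block, forbidden)
```

In this sketch the overhead is one header byte per forbidden value, regardless of how often those values occur in the block.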

@DonaldTsang
Author

DonaldTsang commented Jul 5, 2019

@kosarev So is it possible to create a compatibility format that can be converted from a Unicode-safe "alt-format" to escapeless, without the need for a Python-like bytes format?

@kosarev
Owner

kosarev commented Jul 11, 2019

If I understand the idea correctly, then sure, there should be no problem using escapeless in the middle of a chain of Unicode-specific encodings. As for the representation of binary data, I guess you mean JS, in which case an array of bytes sounds like a good replacement for Python's byte strings, likely with no changes to the algorithms themselves.

@DonaldTsang
Author

@kosarev So basically: data => escapeless => Unicode- or JSON-compatible string => escapeless => data

@kosarev
Owner

kosarev commented Jul 12, 2019

Yes, provided that by "Unicode or JSON compatible string" you mean Unicode-safe binary-to-text and text-to-binary encodings, and not just emitting raw binary data into strings.
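
The distinction can be sketched as follows. Base64 from the Python standard library stands in for a Unicode-safe binary-to-text encoding here; that choice, the `payload` key and the helper names are assumptions for the example, and the byte-level escapeless stage is left out.

```python
import base64
import json


def to_json_string(data: bytes) -> str:
    """Binary-to-text step: bytes -> Base64 text -> JSON document."""
    return json.dumps({"payload": base64.b64encode(data).decode("ascii")})


def from_json_string(doc: str) -> bytes:
    """Text-to-binary step: JSON document -> Base64 text -> bytes."""
    return base64.b64decode(json.loads(doc)["payload"])


data = bytes(range(256))  # every possible byte value once

# With a binary-to-text encoding the round trip is lossless.
doc = to_json_string(data)
restored = from_json_string(doc)

# Emitting the raw bytes as text fails: they are not valid UTF-8.
try:
    data.decode("utf-8")
    raw_ok = True
except UnicodeDecodeError:
    raw_ok = False

print(restored == data)  # True
print(raw_ok)            # False
```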

@DonaldTsang
Author

DonaldTsang commented Jul 12, 2019

@kosarev I mean, the JSON spec does allow certain "special characters" to slip through, right?
{"string": "<as many types of characters as possible>"}
The part in angle brackets only disallows whitespace characters, I think. What else can you think of?
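
For reference, the JSON grammar (RFC 8259) requires escaping the quotation mark, the backslash and the control characters U+0000 through U+001F inside a string; everything else may appear literally. A small check with Python's `json` module, offered as a sketch rather than an authoritative test of the spec, lists which ASCII code points it escapes when `ensure_ascii=False`:

```python
import json


def needs_escape(ch: str) -> bool:
    """True if json.dumps does not emit the character literally."""
    return json.dumps(ch, ensure_ascii=False) != '"' + ch + '"'


# Code points in the ASCII range that get escaped.
escaped = [cp for cp in range(0x80) if needs_escape(chr(cp))]

print(len(escaped))                 # 34
print([hex(cp) for cp in escaped])  # 0x0..0x1f, then 0x22 and 0x5c
```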

@kosarev
Owner

kosarev commented Jul 13, 2019

Well, escapeless wouldn't let you exclude those special characters, if that's what you mean, because it has to sit in the middle of the encoding chain: it processes purely binary data and so has to be surrounded by Unicode-specific encodings at both ends of the chain. By removing certain characters from the binary data in the middle, we generally can't affect which characters appear in the encoded JSON string, as that depends on the Unicode-specific encoding used.

@DonaldTsang
Author

@kosarev But escapeless can go down to 225 characters, so surely some of the forbidden code space can be stripped off, right?

@kosarev
Owner

kosarev commented Jul 13, 2019

It can strip off even more characters; it just won't be efficient compared to other approaches. To answer your question: removing certain characters from binary data doesn't mean that these or any other characters will disappear from the Unicode-encoded version, because most likely there will be no 1-to-1 correspondence.
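
A small example of the missing 1-to-1 correspondence, using standard-library Base64 as a stand-in for whichever text encoding ends the chain (an assumption for illustration): the input contains no byte 0x41 (`A`) and no byte 0x2F (`/`), yet those characters are all that appears in the encoded text.

```python
import base64

zeros = b"\x00\x00\x00"
ones = b"\xff\xff\xff"

encoded_zeros = base64.b64encode(zeros)
encoded_ones = base64.b64encode(ones)

print(0x41 in zeros, encoded_zeros)  # False b'AAAA'
print(0x2F in ones, encoded_ones)    # False b'////'
```

So excluding a byte value before the text encoding says nothing about which characters the encoded string will contain.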

@kosarev kosarev added the question Further information is requested label Oct 23, 2019
2 participants