use a visually more appealing encoding #261
This commit uses Unidecode to translate unicode characters into the ASCII-range before employing any dataverse-specific character quotations. If unidecode returns an empty string, the name "__not_representable_<X>" is used, where "<X>" is the length of the original string.
It should be noted that the results of |
This commit ensures that `mangle_path` is tested with "printable" unicode characters, e.g. `ä`, that will be converted into ASCII characters by `unidecode()`.
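To make the described behaviour easier to picture, here is a minimal sketch of the transliterate-then-fallback step. It is not the code from this PR: the helper name `ascii_name` and its `transliterate` parameter are hypothetical, and the standard-library stand-in used when Unidecode is not installed is only a rough approximation of what `unidecode()` does.

```python
from typing import Callable

try:
    # Unidecode is the library named in this PR.
    from unidecode import unidecode as _transliterate
except ImportError:
    import unicodedata

    def _transliterate(text: str) -> str:
        # Stand-in only, NOT equivalent to Unidecode: decompose characters
        # and drop everything that is not plain ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def ascii_name(name: str,
               transliterate: Callable[[str], str] = _transliterate) -> str:
    """Hypothetical helper: ASCII form of `name`, or a placeholder.

    If transliteration yields an empty string, return
    "__not_representable_<X>", where <X> is the length of `name`.
    """
    converted = transliterate(name)
    if converted == "":
        return f"__not_representable_{len(name)}"
    return converted


print(ascii_name("\u00e4"))  # "ä" is transliterated to plain "a"
```

In this sketch the dataverse-specific quoting would be applied to the return value afterwards; that step is left out here.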
Why is
done, rather than only the fallback on the hexcodes? I cannot see from the test diff alone how it would look. Need to handcraft a test dataset and try. |
Good question. The answer is that we aimed at a human-readable representation, and using the hexcodes would probably be confusing. I think it is a good idea though. If we did that, we would have to decide whether we want to distinguish hex-code file names that are generated because We could also leave the interpretation to the user, who might know which file names are "genuine" dataset file names and which file names are just a hex-code representation of names that are mapped to empty strings by All in all, the simplest approach might be to use hex-codes if the I will change the code.
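For illustration, the hex-code fallback discussed here could look roughly like the sketch below. The function name `hex_fallback_name` and the choice of hex-encoding the UTF-8 bytes of the original name are assumptions made for this example, not the format the PR ends up using.

```python
def hex_fallback_name(name: str, converted: str) -> str:
    """Hypothetical helper: prefer the transliterated name, else hex codes.

    `converted` is the transliteration result for `name`, for example the
    return value of `unidecode(name)`.
    """
    if converted:
        return converted
    # Assumption for this sketch: hex-encode the UTF-8 bytes of the
    # original name so that no information is lost.
    return name.encode("utf-8").hex()


# A name whose transliteration came back empty keeps a reversible form.
print(hex_fallback_name("\u00e4", ""))  # c3a4
```

Unlike a length-based placeholder, this keeps the original name recoverable, but such a name cannot be told apart from a genuine file name that happens to consist of hex digits.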
Fixes #232
This PR uses Unidecode to translate unicode characters into the ASCII-range before employing any dataverse-specific character quotations. If `unidecode()` returns an empty string, the name `"__not_representable_<X>"` is used, where `<X>` is the length of the original string.