By design, charamel returns an encoding that is likely to decode a sequence of bytes into a string correctly. It does not have to be the same encoding that was used to encode the string, as long as the result of `.decode(encoding)` is the same.
This holds true for the first test with `abc`, because the most probable returned encoding is UTF-7, which decodes ASCII correctly. However, the second test is indeed not working as expected, because it returns shift_jis_2004, an encoding used for Japanese text. Thank you for notifying me about this. I am currently working on a new release and will take it into account.
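To illustrate the point above with plain Python (no charamel required): for pure-ASCII bytes, many encodings produce the identical decoded string, so a detector that only promises "decodes correctly" may legitimately return any of them.

```python
# For ASCII-only input, these encodings all decode to the same string,
# so UTF-7 (or even shift_jis_2004) is a "correct" answer for b"abc"
# under charamel's contract, even if it looks surprising.
data = b"abc"

for encoding in ("ascii", "utf-7", "utf-8", "latin-1", "shift_jis_2004"):
    assert data.decode(encoding) == "abc"

print("all candidate encodings agree on", data)
```

For non-ASCII input the candidates diverge, which is why returning shift_jis_2004 for non-Japanese text becomes an actual bug rather than a harmless equivalence.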
Indeed your answer makes sense.
From the user's perspective, I can think of the following two enhancements:
When given a Python unicode string, it might make sense either to raise an error (as chardet.detect does) or to return a specific reference to the internal Python unicode encoding (instead of CP1006, which is not really meaningful in that case).
When processing an arbitrary string, there is value in having Detector() tell you that the ASCII encoding is sufficient to decode it (as chardet does), allowing charamel to be a drop-in replacement.
I do not know whether this should also apply to other encodings that are strict subsets of others, i.e. tweaking charamel to return the smallest subset encoding that can decode the string.
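The subset-preference idea could be sketched as a post-processing step on top of any detector. The helper `normalize_detection` and the candidate chain below are hypothetical, not part of charamel's API: given raw bytes and a detector's guess, it prefers the most restrictive encoding that still decodes the bytes to the same string.

```python
# Hypothetical post-processing sketch: prefer the smallest-subset
# encoding whose decode result matches the detected encoding's result.
# Tried from most to least restrictive.
SUBSET_CHAIN = ("ascii", "utf-8")

def normalize_detection(data: bytes, detected: str) -> str:
    try:
        reference = data.decode(detected)
    except (UnicodeDecodeError, LookupError):
        return detected  # detector's guess does not decode; keep it as-is
    for candidate in SUBSET_CHAIN:
        try:
            if data.decode(candidate) == reference:
                return candidate
        except UnicodeDecodeError:
            continue  # candidate cannot decode these bytes; try the next one
    return detected

print(normalize_detection(b"abc", "utf-7"))             # -> ascii
print(normalize_detection("héllo".encode("utf-8"), "utf-8"))  # -> utf-8
```

This would make charamel report `ascii` for pure-ASCII input the way chardet does, at the cost of one extra decode pass per candidate.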
Hi,
I think the test set for this package is too small; the values returned for very simple strings are wrong:
```
$ echo $LANG
en_US.UTF-8
```
The first one should return ascii and the second one UTF-8.
Thanks in advance for looking into that,