Add UTF-16 and UTF-32 support #28

Open
cgmb opened this issue Mar 19, 2018 · 0 comments

cgmb commented Mar 19, 2018

Currently, UTF-8 is the only supported encoding for guardonce. UTF-16 and UTF-32 files cannot be processed. I began to improve the handling of those files in #22 by ensuring they were flagged as a problem or ignored. However, it's possible to do better.

There are a few heuristics that can be used to guess whether a file is UTF-16 or UTF-32. The simplest is to check whether the first few bytes of the file match a BOM. It's extremely unlikely that any header file in another encoding would start with the same bytes as a UTF BOM, so this seems sufficient for our case.
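A minimal sketch of the BOM check described above, assuming Python 3. The name `detect_bom` and the fallback to UTF-8 are illustrative assumptions, not existing guardonce code. The UTF-32 BOMs are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE),
# so the UTF-32 marks must be checked before the UTF-16 ones.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

def detect_bom(data, default='utf-8'):
    """Guess an encoding from the leading bytes of a file.

    Returns (encoding_name, bom_bytes). When no UTF-16/UTF-32 BOM is
    found, returns (default, b''). Hypothetical helper for illustration.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, bom
    return default, b''
```

Only the first four bytes of the file are needed for the check. A UTF-16 or UTF-32 file that lacks a BOM falls through to the default in this sketch, so it would still need to be flagged or ignored by other means.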

The encoding used for reading should also be used for writing when processing the file in place. However, I'm less sure about the correct behaviour for printing the new file to stdout. I suspect that I should just use the same encoding there too, though perhaps I should use the output stream's desired encoding if it is known.
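One way the in-place case might look, as a sketch under assumptions: Python 3, files opened in binary mode so the original BOM and line endings are preserved byte for byte, and a caller-supplied `transform` function standing in for the actual guard rewriting. The names `sniff`, `rewrite_bytes`, and `rewrite_in_place` are hypothetical and not part of guardonce.

```python
import codecs

def sniff(raw):
    # Hypothetical BOM check; UTF-32 first because its LE BOM
    # starts with the UTF-16 LE BOM bytes.
    for bom, name in ((codecs.BOM_UTF32_LE, 'utf-32-le'),
                      (codecs.BOM_UTF32_BE, 'utf-32-be'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if raw.startswith(bom):
            return name, bom
    return 'utf-8', b''  # assumed default when no BOM is present

def rewrite_bytes(raw, transform):
    """Decode raw bytes, apply transform to the text, and re-encode
    with the same encoding and the same BOM that were read."""
    encoding, bom = sniff(raw)
    text = raw[len(bom):].decode(encoding)
    return bom + transform(text).encode(encoding)

def rewrite_in_place(path, transform):
    """Rewrite a file so that the output uses the input's encoding."""
    with open(path, 'rb') as f:
        raw = f.read()
    with open(path, 'wb') as f:
        f.write(rewrite_bytes(raw, transform))
```

For stdout, the two options in the paragraph above correspond to two different calls in Python 3: writing the result of `rewrite_bytes` to `sys.stdout.buffer` keeps the file's own encoding, while passing the decoded text to `sys.stdout.write` lets the stream apply whatever `sys.stdout.encoding` reports.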
