Add UTF-16 and UTF-32 support #28

cgmb · 2018-03-19T05:25:37Z

Currently, UTF-8 is the only supported encoding for guardonce. UTF-16 and UTF-32 files cannot be processed. I began to improve the handling of those files in #22 by ensuring they were flagged as a problem or ignored. However, it's possible to do better.

There are a few heuristics that can be used to guess that a file is UTF-16 or UTF-32. The simplest is to check whether the first few bytes of the file match a BOM. It's extremely unlikely that any header file in another encoding would start with the same bytes as a UTF-16 or UTF-32 BOM, so this seems sufficient for our case.
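A minimal sketch of the BOM check, for illustration only. The function name `guess_encoding` and the UTF-8 fallback are assumptions, not existing guardonce code. The BOM constants come from Python's standard `codecs` module. The UTF-32 BOMs are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Hypothetical lookup table. Order matters: BOM_UTF32_LE (FF FE 00 00)
# starts with BOM_UTF16_LE (FF FE), so the longer BOMs must come first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_encoding(head, default="utf-8"):
    """Guess a codec name from the first bytes of a file.

    `head` is a bytes object holding at least the first four bytes of
    the file (fewer if the file is shorter). Falls back to `default`
    when no UTF-16 or UTF-32 BOM is present.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

A caller would read the head in binary mode, for example `open(path, "rb").read(4)`, and pass it in. Returning the endian-specific codec names leaves the BOM in the decoded text as a leading U+FEFF, which is one way to keep it intact when the file is written back. A UTF-16 little-endian file whose first character after the BOM is U+0000 would be misread as UTF-32 by this check, but that should not occur in a header file.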

The encoding used for reading should also be used for writing when processing the file in place. However, I'm less sure about the correct behaviour for printing the new file to stdout. I suspect that I should just use the same encoding there too, though perhaps I should use the output stream's desired encoding if it is known.
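The two output paths could be sketched as follows, assuming Python 3 and an encoding name that has already been detected. `rewrite_in_place`, `choose_output_encoding`, `write_to_stdout`, and the `prefer_stream` switch are hypothetical names used only to illustrate the open question about stdout; none of them are existing guardonce functions.

```python
import io
import sys


def rewrite_in_place(path, encoding, transform):
    """Read `path`, apply `transform` to its text, write it back.

    The same encoding is used for reading and writing. newline=""
    disables newline translation so existing line endings survive.
    """
    with io.open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    with io.open(path, "w", encoding=encoding, newline="") as f:
        f.write(transform(text))


def choose_output_encoding(file_encoding, stream_encoding, prefer_stream=False):
    """Pick the encoding for stdout output.

    By default the file's own encoding is kept. With prefer_stream=True
    the stream's encoding wins, but only if it is actually known.
    """
    if prefer_stream and stream_encoding:
        return stream_encoding
    return file_encoding


def write_to_stdout(text, file_encoding, prefer_stream=False):
    """Encode `text` explicitly and write the raw bytes to stdout."""
    encoding = choose_output_encoding(
        file_encoding, sys.stdout.encoding, prefer_stream)
    sys.stdout.buffer.write(text.encode(encoding))
```

Writing bytes through `sys.stdout.buffer` keeps the choice of encoding explicit instead of leaving it to the text layer of the stream. With an endian-specific codec such as `utf-16-be`, Python neither strips nor adds a BOM, so a leading U+FEFF in the text round-trips unchanged as long as `transform` leaves it alone.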
