Clarify support (or not) for character encodings other than UTF-8 #42

sacundim · 2016-09-15T09:27:01Z

The documentation in README.md doesn't explain xsv's support or policy for character encodings. I think it really ought to.

Looking through the code for xsv and the csv crate, it looks like there isn't a consistent policy:

- Most of the code reads rows with the byte_records() function.
- xsv search, however, uses the records() function, which interprets the data as UTF-8.
- There are a few places where the code calls str::from_utf8() on byte data.
- The select module uses String to represent field names, which is UTF-8. What happens when you try to xsv select from a file that has Latin-1 field names?
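To illustrate why the byte-oriented and UTF-8-oriented paths behave differently, here is a minimal sketch using only the standard library. It does not use the csv crate; `split_fields` and `all_fields_utf8` are hypothetical helpers, and the splitter deliberately ignores quoting. A Latin-1 header splits cleanly as bytes but fails UTF-8 validation.

```rust
/// Splits one unquoted CSV line into raw byte fields.
/// Assumption for illustration: no quoting or escaping, unlike a real CSV parser.
fn split_fields(line: &[u8]) -> Vec<&[u8]> {
    // The comma is a single ASCII byte, so this works for any
    // ASCII-compatible encoding, including Latin-1 and UTF-8.
    line.split(|&b| b == b',').collect()
}

/// Returns true only if every field in the line is valid UTF-8.
fn all_fields_utf8(line: &[u8]) -> bool {
    split_fields(line)
        .iter()
        .all(|field| std::str::from_utf8(field).is_ok())
}
```

With the Latin-1 bytes `b"nom,pr\xE9nom"`, `split_fields` yields two fields, while `all_fields_utf8` returns `false` because the lone byte 0xE9 is not valid UTF-8.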


BurntSushi · 2016-09-15T15:10:42Z

I think my intention was to support text encodings that are "ASCII compatible," which should include Latin-1. For example, in almost all cases where str::from_utf8 is used, there is an actual fallback that runs with just the raw bytes. So there shouldn't be any places where, say, a true Latin-1 encoding would be a problem.
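The fallback pattern described here can be sketched as follows. This is not code from xsv; `fields_eq_ignore_case` is a hypothetical helper chosen only to show the shape: try str::from_utf8 first, and if either input is not valid UTF-8, fall back to an operation that only needs ASCII compatibility.

```rust
/// Case-insensitive comparison of two raw fields (hypothetical helper).
/// Uses Unicode-aware lowercasing when both fields are valid UTF-8,
/// and otherwise falls back to ASCII-only case folding on the raw bytes.
fn fields_eq_ignore_case(a: &[u8], b: &[u8]) -> bool {
    match (std::str::from_utf8(a), std::str::from_utf8(b)) {
        (Ok(a), Ok(b)) => a.to_lowercase() == b.to_lowercase(),
        // Fallback: bytes outside the ASCII range must match exactly.
        _ => a.eq_ignore_ascii_case(b),
    }
}
```

The fallback never fails on Latin-1 input, but it is also less capable: the Latin-1 bytes 0xC9 (É) and 0xE9 (é) compare as different, because only ASCII letters are case-folded.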

Of course, you did pick out a few! In particular:

- Field name selection does appear to be limited to UTF-8. Fixing that probably means moving the parser to &[u8] instead of &str.
- Searching via regex required &str at the time I wrote the code, but we can switch to byte-based regexes. (The search pattern must still be UTF-8, but one can search for arbitrary bytes with hex escapes. That isn't particularly ideal, but does make Latin-1 support possible...)
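As a rough sketch of what field name selection over &[u8] could look like, the lookup below compares header names byte for byte, so it never needs to decode them. `find_column` is a hypothetical name, not xsv's actual selection parser, which also handles indexes and ranges.

```rust
/// Returns the index of the first header equal to `name`, compared byte for byte.
/// Hypothetical helper: no decoding happens, so Latin-1 names work as long as
/// the caller supplies the name in the same encoding as the file.
fn find_column(headers: &[&[u8]], name: &[u8]) -> Option<usize> {
    headers.iter().position(|header| *header == name)
}
```

Given the headers `nom`, `pr\xE9nom`, `ville` as byte slices, looking up `b"pr\xE9nom"` returns `Some(1)`, whereas looking up the UTF-8 bytes of "prénom" returns `None`, since the encodings differ.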

eddy-geek · 2016-12-12T15:38:31Z

IMO xsv should be UTF-8 first:

- supporting other charsets is not really a requirement, as you can always convert from anything else to Unicode very quickly... but the reverse is not true
- the whole Rust ecosystem is very UTF-8-centric for good reasons, and the performance of UTF-8 regexes is stellar, as you very well know ;-)
- Latin-1 is dying, at least on the web

but I guess you had specific motivations for latin1 support?
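For the Latin-1 case specifically, the conversion to UTF-8 mentioned above is short, because every ISO-8859-1 byte maps to the Unicode code point with the same numeric value. A minimal sketch, assuming the input really is ISO-8859-1 (`latin1_to_utf8` is a hypothetical name):

```rust
/// Converts ISO-8859-1 (Latin-1) bytes to a UTF-8 `String`.
/// Each byte 0x00..=0xFF becomes the code point U+0000..=U+00FF.
/// Assumption: the input is true Latin-1, not Windows-1252.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

The reverse direction is lossy in general, since most Unicode code points have no Latin-1 representation.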

BurntSushi · 2016-12-12T16:56:24Z

@eddy-geek I don't really understand what's motivating your comment. CSV itself doesn't have a specified character encoding, and most CSV parsers are written to be ASCII compatible. ASCII compatibility is the goal, and as a result, encodings like latin-1 wind up being supported. This is important because CSV data is often quite messy, and there's nothing worse than failing to read CSV data because of a character encoding issue.

This issue is basically "fix a few places in xsv where UTF-8 is assumed." That's it. Nothing more.

eddy-geek · 2016-12-12T17:12:25Z

Ok I see, sorry for the noise



velocityzen · 2024-03-26T21:06:40Z

What about UTF-16, UTF-16BE, UTF-16LE ?

BurntSushi · 2024-03-26T23:11:07Z

Not supported.

sacundim changed the title ~~Clarify (non)support for character encodings other than UTF-8~~ Clarify support (or not) for character encodings other than UTF-8 Sep 15, 2016

BurntSushi added the bug label Sep 15, 2016

silasb added a commit to silasb/csv-to-json that referenced this issue May 15, 2019

Should only support UTF-8

18c37ca

For the same reasons as BurntSushi/xsv#42 we should only support UTF-8 and other encodings should be converted to UTF-8 before processing.
