Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify support (or not) for character encodings other than UTF-8 #42

Open
sacundim opened this issue Sep 15, 2016 · 6 comments
Open

Clarify support (or not) for character encodings other than UTF-8 #42

sacundim opened this issue Sep 15, 2016 · 6 comments
Labels

Comments

@sacundim
Copy link

The documentation in the README.md doesn't explain what is xsv's support or policy for character encodings. I think it really ought to.

Looking through the code for xsv and the csv crate, it looks like there isn't a consistent policy:

  1. Most of the code reads rows with the byte_records() function.
  2. xsv search, however, uses the records() function, which interprets the data as UTF-8.
  3. There are a few places where the code calls str::from_utf8() on byte data.
  4. The select module uses String to represent field names, which is UTF-8. What happens when you try to xsv select from a file that has Latin-1 field names?
@sacundim sacundim changed the title Clarify (non)support for character encodings other than UTF-8 Clarify support (or not) for character encodings other than UTF-8 Sep 15, 2016
@BurntSushi
Copy link
Owner

I think my intention was to support text encodings that are "ASCII compatible," which should include Latin-1. For example, in almost all cases from str::from_utf8 is used, there is an actual fallback that runs with just the raw bytes. So there shouldn't be any places where, say, a true latin-1 encoding would be a problem.

Of course, you did pick out a few! In particular:

  1. Field name selection does appear to be limited to utf-8. Fixing that probably means moving the parser to &[u8] instead of &str.
  2. Searching via regex required &str at the time I wrote the code, but we can switch to byte based regexes. (The search pattern must still be UTF-8, but, one can search for arbitrary bytes with hex escapes. That isn't particularly ideal, but does make latin-1 support possible...)

@BurntSushi BurntSushi added the bug label Sep 15, 2016
@eddy-geek
Copy link

IMO xsv should be UTF-8 first:

  • supporting other charsets is not really a requirement as you can always convert from anything else to unicode very quickly... but the reverse is not true
  • all the rust ecosystem is very UTF8-centric for good reasons, and the performance of UTF8 regexes is stellar as you very well know ;-)
  • latin1 is dying at least on the web

but I guess you had specific motivations for latin1 support?

@BurntSushi
Copy link
Owner

@eddy-geek I don't really understand what's motivating your comment. CSV itself doesn't have a specified character encoding, and most CSV parsers are written to be ASCII compatible. ASCII compatibility is the goal, and as a result, encodings like latin-1 wind up being supported. This is important because CSV data is often quite messy, and there's nothing worse than failing to read CSV data because of a character encoding issue.

This issue is basically "fix a few places in xsv where UTF-8 is assumed." That's it. Nothing more.

@eddy-geek
Copy link

eddy-geek commented Dec 12, 2016 via email

silasb added a commit to silasb/csv-to-json that referenced this issue May 15, 2019
For the same reasons as BurntSushi/xsv#42 we
should only support UTF-8 and other encodings should be converted to
UTF-8 before processing.
@velocityzen
Copy link

What about UTF-16, UTF-16BE, UTF-16LE ?

@BurntSushi
Copy link
Owner

Not supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants