Sniffing fails on non-UTF-8 files #13
Comments
It looks like it might be an issue with qsv, since it works for me locally:
I'll try testing it out from qsv as well. |
Aha, I didn't realise that the build ships with a CLI. From the README I thought it was library-only. Perhaps it's worth including the CLI in the README? |
Definitely! I'll leave this open as a documentation issue then. |
Hmmm... weird, it's not working for me using the csv-sniffer CLI:
Happens on both Windows 11 and Ubuntu Linux LTS 20.04... |
Interesting! And of course I run it on macOS. I'll check it out in containers. |
Hmmm... it also happens on my MacBook Air 2018 running Monterey... I wonder if it's because of some locale settings, @jblondin. |
Interesting, I'm on an M1 running Monterey. I'll investigate. |
The encoding of this test file seems to be 'Western (Windows 1252)', while the library currently only supports UTF-8. At some point during my original testing I apparently re-saved the file as UTF-8 before running my test, which is why it initially worked for me. I just re-downloaded it, re-ran the test, and am now encountering the same error. |
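To illustrate why a Windows-1252 file trips a UTF-8-only reader, here is a minimal standard-library sketch. It is not csv-sniffer's code; the helper name `first_invalid_utf8_offset` and the sample bytes are made up for illustration.

```rust
use std::str;

/// Hypothetical helper: returns the byte offset of the first invalid
/// UTF-8 sequence in `bytes`, or `None` if the whole slice is valid.
fn first_invalid_utf8_offset(bytes: &[u8]) -> Option<usize> {
    match str::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => Some(e.valid_up_to()),
    }
}

fn main() {
    // "café,1" as Windows-1252: 'é' is the single byte 0xE9.
    let cp1252 = b"caf\xE9,1\n";
    // The same text as UTF-8: 'é' is the two bytes 0xC3 0xA9.
    let utf8 = "café,1\n".as_bytes();

    println!("{:?}", first_invalid_utf8_offset(cp1252)); // Some(3)
    println!("{:?}", first_invalid_utf8_offset(utf8)); // None
}
```

The lone 0xE9 byte looks like the start of a three-byte UTF-8 sequence, but the comma that follows is not a continuation byte, so validation stops at offset 3.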
@jblondin I encountered the same issue with qsv. To minimize encoding errors, I used https://github.com/BurntSushi/encoding_rs_io to automatically transcode to UTF-8. |
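For readers who want to see what transcoding to UTF-8 involves, below is a rough standard-library sketch that decodes Windows-1252 bytes into a Rust `String`. It is a stand-in for illustration only and does not use or reproduce the encoding_rs_io API; the function name `decode_windows_1252` is an assumption, and the five bytes that the code page leaves undefined are mapped to the matching C1 control characters, following the WHATWG convention.

```rust
/// Code points for Windows-1252 bytes 0x80..=0x9F. Outside this range the
/// byte value equals the Unicode code point (as in ISO-8859-1).
const CP1252_HIGH: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Hypothetical helper: decodes Windows-1252 bytes into a UTF-8 `String`.
fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80..=0x9F => {
                let code = CP1252_HIGH[(b - 0x80) as usize];
                // Every table entry is a valid scalar value; the fallback
                // is only defensive.
                char::from_u32(u32::from(code)).unwrap_or('\u{FFFD}')
            }
            // ASCII and 0xA0..=0xFF map directly to the same code point.
            _ => b as char,
        })
        .collect()
}

fn main() {
    // 0xEF is 'ï' and 0x80 is the euro sign in Windows-1252.
    let raw = b"na\xEFve,\x80 5\n";
    print!("{}", decode_windows_1252(raw));
}
```

A single-byte code page like this one can never fail to decode, which is also why the wrong choice of encoding silently produces mojibake rather than an error.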
I found the same 😄 I started looking into it yesterday, should have a fix soon |
So it doesn't look like it's as simple a fix as just using
I'll dig into this a little more to see if there's a solution. |
Hi @jblondin, thanks to your research, I ended up just saying that qsv requires UTF-8 and leaving it at that. With that assumption, I then fully embraced using from_utf8_unchecked for the extra performance. 😉 That said, it'd be awesome if csv-sniffer could also sniff a CSV's encoding! |
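As a sketch of the trade-off behind from_utf8_unchecked: the unchecked conversion skips validation, so it is only sound when the bytes are already known to be UTF-8. The example below validates a record once and then converts each field without re-checking. This is an illustrative pattern, not qsv's actual code (which is not shown in this thread), and `split_fields` is a hypothetical name.

```rust
use std::str;

/// Hypothetical helper: validates `record` once, then converts each
/// comma-separated field without validating it again.
fn split_fields(record: &[u8]) -> Result<Vec<&str>, str::Utf8Error> {
    // One up-front check over the whole record.
    str::from_utf8(record)?;
    Ok(record
        .split(|&b| b == b',')
        // SAFETY: `record` was validated above, and splitting on the ASCII
        // byte b',' cannot cut through a multi-byte UTF-8 sequence, so
        // every field is itself valid UTF-8.
        .map(|field| unsafe { str::from_utf8_unchecked(field) })
        .collect())
}

fn main() {
    println!("{:?}", split_fields("a,é,c".as_bytes()));
    // A Windows-1252 'é' (0xE9) is rejected by the up-front check.
    println!("{:?}", split_fields(b"a,\xE9,c").is_err());
}
```

Calling from_utf8_unchecked on bytes that are not valid UTF-8 is undefined behavior, which is why a hard UTF-8 requirement (or a transcoding step upstream) has to come with it.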
For now, I'm just going to have csv-sniffer accept the encoding as a parameter. Sniffing the encoding is a whole other (fascinating) subject; I may add it or spin off another crate with that functionality. My understanding is that it's very heuristic-based and prone to error, but there are definitely tools that do it (Sublime Text, for instance, detects and handles different encodings just fine). Having an encoding sniffer + automatic transcoder might be really useful. For qsv, an expectation of UTF-8 makes sense 😄 I did notice that you already use |
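A caller-supplied encoding could look roughly like the sketch below. Everything here is hypothetical: the `Encoding` enum, its variants, and `decode_with` are invented for illustration and are not csv-sniffer's API. ISO-8859-1 is used as the second encoding only because it can be decoded in one line with the standard library.

```rust
/// Hypothetical set of encodings a caller could choose from.
#[derive(Debug, Clone, Copy)]
enum Encoding {
    Utf8,
    Latin1, // ISO-8859-1: each byte equals its Unicode code point
}

/// Hypothetical entry point: decodes `bytes` using the encoding the caller
/// names, instead of assuming UTF-8.
fn decode_with(bytes: &[u8], encoding: Encoding) -> Result<String, std::string::FromUtf8Error> {
    match encoding {
        Encoding::Utf8 => String::from_utf8(bytes.to_vec()),
        Encoding::Latin1 => Ok(bytes.iter().map(|&b| b as char).collect()),
    }
}

fn main() {
    let raw = b"caf\xE9,1\n";
    // Assuming UTF-8 fails on this input...
    println!("{:?}", decode_with(raw, Encoding::Utf8).is_err());
    // ...while naming the right encoding succeeds.
    println!("{:?}", decode_with(raw, Encoding::Latin1));
}
```

Making the encoding explicit keeps the sniffer deterministic and leaves any guessing to the caller or to a separate detection step.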
Yes. I was really trying to squeeze as much performance as possible from And now that I've decided to embrace the utf8 requirement, I'm doing more As to detecting encoding, I found https://github.com/thuleqaid/rust-chardet, which is inspired by https://pypi.org/project/chardet/. It seems promising (even reqwest at one time was considering using it, were it not for the incompatible license), though it looks unmaintained. I also found chardetng, but it's targeted at web use, as the character encoding detector of Firefox (I found this writeup fascinating!). Perhaps you can leverage it for csv-sniffer? |
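To give a feel for why detection is heuristic, here is a deliberately crude standard-library sketch: check for a byte-order mark, then fall back to testing whether the bytes are valid UTF-8. It does not reflect how chardet, rust-chardet, or chardetng work (those use statistical models), and the `Guess` enum and `guess_encoding` function are hypothetical names.

```rust
/// Hypothetical result of a very rough encoding guess.
#[derive(Debug, PartialEq)]
enum Guess {
    Utf8WithBom,
    Utf16Le,
    Utf16Be,
    Utf8,
    /// Not valid UTF-8 and no BOM: some legacy encoding, unknown which.
    UnknownLegacy,
}

/// Hypothetical helper: BOM check first, then UTF-8 validation.
fn guess_encoding(bytes: &[u8]) -> Guess {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Guess::Utf8WithBom
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Guess::Utf16Le
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Guess::Utf16Be
    } else if std::str::from_utf8(bytes).is_ok() {
        // Pure ASCII also lands here, since ASCII is a subset of UTF-8.
        Guess::Utf8
    } else {
        Guess::UnknownLegacy
    }
}

fn main() {
    println!("{:?}", guess_encoding(b"\xEF\xBB\xBFa,b\n"));
    println!("{:?}", guess_encoding(b"caf\xE9,1\n"));
}
```

Note what this cannot do: it cannot tell Windows-1252 from any other single-byte code page, and a sample truncated in the middle of a multi-byte character would be misreported as a legacy encoding. Closing those gaps is where the statistical detectors come in.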
Hey all,
I tried running the csv sniffer (as part of qsv, see dathere/qsv#199) on the following file, but it doesn't seem to work.
test.csv
Not sure if the problem is in qsv or here, but since qsv works on other files, I thought it's most likely an issue here...