Sniffing fails on non-UTF-8 files #13

Open
harrybiddle opened this issue Mar 21, 2022 · 14 comments · May be fixed by #16
Labels
enhancement New feature or request

Comments

@harrybiddle

Hey all,

I tried running the csv sniffer (as part of qsv, see dathere/qsv#199) on the following file, but it doesn't seem to work.

test.csv

Not sure if the problem is in qsv or here, but since qsv works on other files I thought it's most likely an issue here...

@jblondin
Owner

It looks like it might be an issue with qsv, since it works for me locally:

 $ ./target/release/sniff test.csv
Metadata
========
Dialect:
        Delimiter: ;
        Has header row?: true
        Number of preamble rows: 0
        Quote character: none
        Flexible: false

Number of fields: 11
Types:
        0: Text
        1: Unsigned
        2: Float
        3: Float
        4: Text
        5: Text
        6: Text
        7: Text
        8: Text
        9: Text
        10: Text

I'll try testing it out from qsv as well.

@harrybiddle
Author

Aha, I didn't realise that the build ships with a CLI. From the README I thought it was library-only. Perhaps it's worth including the CLI in the README?

@jblondin jblondin added the "documentation" label (Extra documentation is requested) Mar 21, 2022
@jblondin
Owner

Definitely! I'll leave this open as a documentation issue then.

@jblondin jblondin changed the title from "Sniffing seems to fail on a test file" to "Clarify existence of CLI tool in README" Mar 21, 2022
@jqnatividad
Contributor

Hhmmm... weird, it's not working for me using the csv-sniffer CLI:

./target/release/sniff ~/Downloads/test.csv 
ERROR: IO error: stream did not contain valid UTF-8

Happens on both Windows 11 and Ubuntu Linux LTS 20.04...

@jblondin
Owner

Interesting! And of course I ran it on macOS. I'll check it out in containers.

@jqnatividad
Contributor

Hmmm... it also happens on my MacBook Air 2018 running Monterey...

I wonder if it's because of some locale settings, @jblondin.

@jblondin
Owner

Interesting, I'm on an M1 running Monterey. I'll investigate.

@jblondin jblondin changed the title from "Clarify existence of CLI tool in README" to "Sniffing fails on non-UTF-8 files" Mar 27, 2022
@jblondin
Owner

jblondin commented Mar 27, 2022

The encoding on this test file seems to be 'Western (Windows 1252)', while the library currently only supports UTF-8.

At some point during my original testing I apparently re-saved the file as UTF-8 before running my test, which is why it initially worked for me. I just re-downloaded it, re-ran the test, and am now encountering the same error.
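
For anyone wanting to reproduce the failure without the attached file, here is a minimal std-only sketch. The sample bytes and the helper names `read_as_utf8` and `windows_1252_sample` are made up for illustration and are not taken from csv-sniffer; the sketch only assumes that the input is read into a `String`, which requires valid UTF-8.

```rust
use std::io::{self, Read};

/// Reads an entire stream into a `String`. `read_to_string` validates the
/// bytes as UTF-8 and returns an `InvalidData` error if validation fails.
fn read_as_utf8<R: Read>(mut reader: R) -> io::Result<String> {
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok(text)
}

/// Hypothetical sample row: "São Paulo;42" encoded as Windows-1252,
/// where 'ã' is the single byte 0xE3 (in UTF-8 it would be 0xC3 0xA3).
fn windows_1252_sample() -> Vec<u8> {
    b"S\xE3o Paulo;42\n".to_vec()
}
```

Passing `&windows_1252_sample()[..]` to `read_as_utf8` returns an error, while the same text encoded as UTF-8 reads without trouble.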

@jblondin jblondin added the "enhancement" label (New feature or request) and removed the "documentation" label (Extra documentation is requested) Mar 27, 2022
@jblondin jblondin self-assigned this Mar 27, 2022
@jqnatividad
Contributor

@jblondin I encountered the same issue with qsv. To minimize encoding errors, I used https://github.com/BurntSushi/encoding_rs_io to automatically transcode to UTF-8.

@jblondin
Owner

I found the same 😄 I started looking into it yesterday, should have a fix soon

@jblondin
Owner

jblondin commented Apr 4, 2022

So it doesn't look like it's as simple a fix as just using encoding_rs_io as a transcoder, since that only handles transcoding between UTF-16 and UTF-8.

qsv is mostly resilient to encoding woes due to its predominant usage of ByteRecords from the csv package, but does fail on some commands, e.g.

> qsv pseudo -d ';' COD.IMPDR.EXPDR tests/data/semicolon.nonUTF.csv
DIA.DESEMB,COD.SUBITEM.NCM,VMLE.DOLAR.BAL.EXP,PESO.LIQ.MERC.BAL.EXP,COD.IMPDR.EXPDR,NOME.IMPDR.EXPDR,PAIS.ORIGEM.DESTINO,UA.LOCAL.DESBQ.EMBQ,NOME.IMPORTADOR.ESTRANGEIRO,NUM.DDE,NUM.RE
CSV parse error: record 1 (line 1, field: 6, byte: 184): invalid utf-8: invalid UTF-8 in field 6 near byte index 4

I'll dig into this a little more to see if there's a solution.
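
To illustrate why byte-oriented parsing tolerates such files while string conversion does not, here is a rough std-only sketch. It is not the csv crate's `ByteRecord` API: `split_fields` and `first_invalid_field` are hypothetical helpers, and the splitter ignores quoting and escaping entirely.

```rust
/// Splits one line into fields on a single-byte delimiter.
/// No UTF-8 validation happens here, so any byte sequence is accepted.
/// Simplified on purpose: quoting and escaping are not handled.
fn split_fields(line: &[u8], delimiter: u8) -> Vec<&[u8]> {
    line.split(|&b| b == delimiter).collect()
}

/// Returns the index of the first field that is not valid UTF-8, if any.
/// This is the step where a command that needs `&str` fields would fail.
fn first_invalid_field(fields: &[&[u8]]) -> Option<usize> {
    fields
        .iter()
        .position(|field| std::str::from_utf8(field).is_err())
}
```

Splitting succeeds on any input; the failure only surfaces once a field has to be interpreted as text, which fits the observation that only some commands fail.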

@jqnatividad
Contributor

Hi @jblondin, thanks to your research, I ended up just saying that qsv requires UTF-8 and leaving it at that.

With that assumption, I then fully embraced using from_utf8_unchecked for the extra performance. 😉

That said, it'd be awesome if csv-sniffer can also sniff a csv's encoding!

@jblondin
Owner

jblondin commented Apr 5, 2022

For now, I'm just going to have csv-sniffer accept the encoding as a parameter.
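
A rough sketch of what "encoding as a parameter" could look like, using only the standard library. The `InputEncoding` enum and `decode` function are hypothetical and are not the csv-sniffer API. Latin-1 stands in for Windows-1252 here because it needs no lookup table; the two agree everywhere except bytes 0x80 through 0x9F.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical set of encodings a caller could declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputEncoding {
    Utf8,
    /// ISO-8859-1; matches Windows-1252 except for bytes 0x80..=0x9F.
    Latin1,
}

/// Decodes raw bytes to text using the encoding the caller declares.
/// UTF-8 input is borrowed after validation; Latin-1 input is transcoded
/// into a new `String`.
fn decode(bytes: &[u8], encoding: InputEncoding) -> Result<Cow<'_, str>, Utf8Error> {
    match encoding {
        InputEncoding::Utf8 => std::str::from_utf8(bytes).map(Cow::Borrowed),
        // Every Latin-1 byte maps to the Unicode code point of the same value,
        // so `u8 as char` performs the conversion.
        InputEncoding::Latin1 => {
            Ok(Cow::Owned(bytes.iter().map(|&b| b as char).collect::<String>()))
        }
    }
}
```

With this shape, the caller declares the encoding up front and the rest of the pipeline only ever sees UTF-8 text.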

Sniffing the encoding is a whole other (fascinating) subject; I may add it or spin off another crate with that functionality. My understanding is that it's very heuristic-based and prone to error, but there are definitely tools that do it (Sublime Text, for instance, detects and handles different encodings just fine). Having an encoding sniffer plus an automatic transcoder might be really useful.

For qsv, an expectation of UTF-8 makes sense 😄 I did notice that you already use from_utf8_unchecked at times (in the stats module, for instance), which I believe is undefined behavior on non-UTF-8 files, but it seems to work fine on this file at least!
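
To make the trade-off concrete, here is a small sketch contrasting the checked and unchecked conversions from the standard library. The wrapper names are made up for illustration and are not taken from qsv.

```rust
use std::str::Utf8Error;

/// Checked conversion: scans the bytes and returns an error on invalid UTF-8.
fn field_as_str(field: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(field)
}

/// Unchecked conversion: skips the validation scan.
///
/// # Safety
/// `field` must be valid UTF-8. Calling this with any other bytes is
/// undefined behavior, even if it appears to work on a given file.
unsafe fn field_as_str_unchecked(field: &[u8]) -> &str {
    // SAFETY: the caller guarantees `field` is valid UTF-8.
    unsafe { std::str::from_utf8_unchecked(field) }
}
```

The unchecked variant is only sound when something upstream has already guaranteed the input is UTF-8, for example by validating the file once before the hot loop runs.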

@jqnatividad
Contributor

jqnatividad commented Apr 5, 2022

Yes. I was really trying to squeeze as much performance as possible from qsv stats, as it's central to the project I'm working on (scanning a CSV file for stats and data types, and then prepopulating metadata about it - data dictionary, frequency table, descriptive stats, jsonschema) in a CKAN data catalog while they're entering the metadata.

And now that I've decided to embrace the UTF-8 requirement, I'm using from_utf8_unchecked more throughout qsv, especially in the hot loops.

As to detecting encoding, I found https://github.com/thuleqaid/rust-chardet, which is inspired by https://pypi.org/project/chardet/.

It seems promising (reqwest at one time considered using it, were it not for the incompatible license), though it looks unmaintained.

I also found chardetng, but it's targeted for web use, as the character encoding detector of Firefox (I found this writeup fascinating!)

Perhaps you can leverage it for csv-sniffer?

@jqnatividad jqnatividad linked a pull request May 22, 2022 that will close this issue
3 participants