Sniffing fails on non-UTF-8 files #13

harrybiddle · 2022-03-21T13:32:12Z

Hey all,

I tried running the csv sniffer (as part of qsv, see dathere/qsv#199) on the following file, but it doesn't seem to work.

test.csv

Not sure if the problem is in qsv or here, but since qsv works on other files I thought it's most likely an issue here...

jblondin · 2022-03-21T15:03:46Z

It looks like it might be an issue with qsv, since it works for me locally:

 $ ./target/release/sniff test.csv
Metadata
========
Dialect:
        Delimiter: ;
        Has header row?: true
        Number of preamble rows: 0
        Quote character: none
        Flexible: false

Number of fields: 11
Types:
        0: Text
        1: Unsigned
        2: Float
        3: Float
        4: Text
        5: Text
        6: Text
        7: Text
        8: Text
        9: Text
        10: Text

I'll try testing it out from qsv as well.

harrybiddle · 2022-03-21T16:35:23Z

Aha, I didn't realise that the build ships with a CLI. From the README I thought it was library-only. Perhaps it's worth including the CLI in the README?

jblondin · 2022-03-21T17:02:23Z

Definitely! I'll leave this open as a documentation issue then.

jqnatividad · 2022-03-22T02:35:38Z

Hhmmm... weird, it's not working for me using the csv-sniffer CLI:

./target/release/sniff ~/Downloads/test.csv 
ERROR: IO error: stream did not contain valid UTF-8

Happens on both Windows 11 and Ubuntu Linux LTS 20.04...

jblondin · 2022-03-22T02:49:13Z

Interesting! And of course I ran it on macOS. I'll check it out in containers.

jqnatividad · 2022-03-23T11:39:07Z

Hmmm... it also happens on my MacBook Air 2018 running Monterey...

I wonder if it's because of some locale settings, @jblondin.

jblondin · 2022-03-23T14:34:40Z

Interesting, I'm on an M1 running Monterey. I'll investigate.

jblondin · 2022-03-27T19:43:27Z

The encoding on this test file seems to be 'Western (Windows 1252)', while the library currently only supports UTF-8.

At some point when I was originally testing this, I apparently re-saved it as UTF-8 prior to running my test, which is why it initially worked for me. I just re-downloaded and re-ran it and am encountering the same error.
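The error text reported earlier in the thread matches what Rust's standard library returns when `read_to_string` meets bytes that are not valid UTF-8. A minimal, std-only sketch of that failure mode follows; the helper `read_as_utf8` and the sample bytes are illustrative assumptions, not code from csv-sniffer or the contents of test.csv.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read an entire byte stream into a `String`,
/// the way a UTF-8-only reader would.
fn read_as_utf8(mut stream: &[u8]) -> io::Result<String> {
    let mut text = String::new();
    // `read_to_string` validates UTF-8 and fails with
    // `ErrorKind::InvalidData` when the bytes are not valid UTF-8.
    stream.read_to_string(&mut text)?;
    Ok(text)
}
```

For instance, `b"caf\xE9;1\n"` is "café;1" encoded as Windows-1252: the lone byte 0xE9 is not a valid UTF-8 sequence, so `read_as_utf8` returns an error, whereas `"café;1\n".as_bytes()` (UTF-8) reads back unchanged.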

jqnatividad · 2022-03-28T13:45:50Z

@jblondin I encountered the same issue with qsv. To minimize encoding errors, I used https://github.com/BurntSushi/encoding_rs_io to automatically transcode to UTF-8.

jblondin · 2022-03-28T13:59:26Z

I found the same 😄 I started looking into it yesterday, should have a fix soon

jblondin · 2022-04-04T00:43:35Z

So it doesn't look like it's as simple a fix as just using encoding_rs_io as a transcoder, since that only handles transcoding between UTF-16 and UTF-8.

qsv is mostly resilient to encoding woes due to its predominant usage of ByteRecords from the csv package, but does fail on some commands, e.g.

> qsv pseudo -d ';' COD.IMPDR.EXPDR tests/data/semicolon.nonUTF.csv
DIA.DESEMB,COD.SUBITEM.NCM,VMLE.DOLAR.BAL.EXP,PESO.LIQ.MERC.BAL.EXP,COD.IMPDR.EXPDR,NOME.IMPDR.EXPDR,PAIS.ORIGEM.DESTINO,UA.LOCAL.DESBQ.EMBQ,NOME.IMPORTADOR.ESTRANGEIRO,NUM.DDE,NUM.RE
CSV parse error: record 1 (line 1, field: 6, byte: 184): invalid utf-8: invalid UTF-8 in field 6 near byte index 4

I'll dig into this a little more to see if there's a solution.
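The resilience described above comes from handling fields as raw bytes and only validating UTF-8 when a command actually needs text. A rough std-only illustration, assuming a single unquoted line; `split_fields` and `field_as_text` are made-up helpers, not the csv crate's `ByteRecord` API:

```rust
use std::str::Utf8Error;

/// Split one line on a single-byte delimiter without decoding it.
/// Ignores quoting, so this is far simpler than a real CSV parser.
fn split_fields(line: &[u8], delimiter: u8) -> Vec<&[u8]> {
    line.split(|&byte| byte == delimiter).collect()
}

/// Decode a single field only when text is actually required.
/// This is the step that fails on non-UTF-8 input.
fn field_as_text(field: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(field)
}
```

Splitting `b"1;S\xC3O PAULO;ok"` on `b';'` succeeds and yields three fields; only calling `field_as_text` on the middle field (Windows-1252 bytes) produces an error, which is the same shape as the per-field failure in the `qsv pseudo` output.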

jqnatividad · 2022-04-05T13:59:00Z

Hi @jblondin, thanks to your research, I ended up just saying that qsv requires UTF-8 and leaving it at that.

With that assumption, I then fully embraced using from_utf8_unchecked for the extra performance. 😉

That said, it'd be awesome if csv-sniffer can also sniff a csv's encoding!

jblondin · 2022-04-05T14:48:02Z

For now, I'm just going to have csv-sniffer accept the encoding as a parameter.
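One way to picture "accept the encoding as a parameter" is sketched below. The `Encoding` enum and `decode` function are hypothetical and are not csv-sniffer's actual API. To stay std-only, the sketch handles ISO-8859-1 (Latin-1) rather than Windows-1252, which differs from Latin-1 in the 0x80 to 0x9F range.

```rust
/// Hypothetical set of encodings a caller could pass in.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Utf8,
    Latin1,
}

/// Decode `bytes` to a `String` according to the caller-supplied encoding.
fn decode(bytes: &[u8], encoding: Encoding) -> Result<String, std::str::Utf8Error> {
    match encoding {
        // Strict validation: invalid sequences are reported, not replaced.
        Encoding::Utf8 => std::str::from_utf8(bytes).map(str::to_owned),
        // Every Latin-1 byte maps to the Unicode code point with the same value.
        Encoding::Latin1 => Ok(bytes.iter().map(|&b| b as char).collect()),
    }
}
```

With this shape, `decode(b"caf\xE9", Encoding::Latin1)` yields "café", while the same bytes with `Encoding::Utf8` return an error instead of silently producing garbage.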

Sniffing the encoding is a whole other (fascinating) subject; I may add it or spin off another crate with that functionality. My understanding is that it's very heuristic-based and prone to error, but there are definitely tools that do it (Sublime Text, for instance, detects and handles different encodings just fine). Having an encoding sniffer + automatic transcoder might be really useful.
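To give a feel for how heuristic this is, here is a deliberately naive std-only sketch: look for a byte-order mark, then try strict UTF-8 validation, and otherwise give up. `EncodingGuess` and `guess_encoding` are invented names for illustration; real detectors go well beyond checks like these.

```rust
#[derive(Debug, PartialEq)]
enum EncodingGuess {
    Utf8,
    Utf16Le,
    Utf16Be,
    /// No BOM and not valid UTF-8: some legacy 8-bit encoding,
    /// but this sketch cannot tell which one.
    UnknownLegacy,
}

/// Naive guess based on a byte-order mark, then UTF-8 validity.
fn guess_encoding(bytes: &[u8]) -> EncodingGuess {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return EncodingGuess::Utf8; // UTF-8 BOM
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return EncodingGuess::Utf16Le; // UTF-16 little-endian BOM
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return EncodingGuess::Utf16Be; // UTF-16 big-endian BOM
    }
    match std::str::from_utf8(bytes) {
        Ok(_) => EncodingGuess::Utf8, // includes pure ASCII
        Err(_) => EncodingGuess::UnknownLegacy,
    }
}
```

Note the blind spot: a Windows-1252 file that happens to contain only ASCII bytes is reported as UTF-8, and telling one legacy 8-bit encoding from another needs statistical or language-aware checks that this sketch does not attempt.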

For qsv, an expectation of UTF-8 makes sense 😄 I did notice that you already use from_utf8_unchecked at times (in the stats module, for instance), which I believe is undefined behavior on non-UTF-8 files, but it seems to work fine on this file at least!
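For reference, the standard library offers checked counterparts to `from_utf8_unchecked`: `std::str::from_utf8` reports invalid input as an error, and `String::from_utf8_lossy` substitutes U+FFFD for invalid sequences. A small sketch contrasting the two; the wrapper names are illustrative only and say nothing about how qsv is written.

```rust
use std::borrow::Cow;

/// Strict: returns `None` when the field is not valid UTF-8.
fn field_strict(field: &[u8]) -> Option<&str> {
    std::str::from_utf8(field).ok()
}

/// Lossy: replaces each invalid sequence with U+FFFD,
/// allocating only when a replacement is needed.
fn field_lossy(field: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(field)
}
```

Both cost a validation pass that `from_utf8_unchecked` skips, which is the performance trade-off under discussion here.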

jqnatividad · 2022-04-05T20:44:07Z

Yes. I was really trying to squeeze as much performance as possible from qsv stats, as it's central to the project I'm working on (scanning a CSV file for stats and data types, and then prepopulating metadata about it - data dictionary, frequency table, descriptive stats, jsonschema) in a CKAN data catalog while they're entering the metadata.

And now that I've decided to embrace the UTF-8 requirement, I'm using from_utf8_unchecked more throughout qsv, especially in the hot loops.

As to detecting encoding, I found https://github.com/thuleqaid/rust-chardet, which is inspired by https://pypi.org/project/chardet/.

It seems promising (reqwest at one time even considered using it, were it not for the incompatible license), though it looks unmaintained.

I also found chardetng, but it's targeted for web use, as the character encoding detector of Firefox (I found this writeup fascinating!)

Perhaps you can leverage it for csv-sniffer?

harrybiddle mentioned this issue Mar 21, 2022

Auto-detect delimiter dathere/qsv#199

Closed

jblondin added the documentation Extra documentation is requested label Mar 21, 2022

jblondin changed the title ~~Sniffing seems to fail on a test file~~ Clarify existence of CLI tool in README Mar 21, 2022

jblondin mentioned this issue Mar 27, 2022

Improve description of packaged sniffer tool (sniff) in documentation #15

Open

jblondin changed the title ~~Clarify existence of CLI tool in README~~ Sniffing fails on non-UTF-8 files Mar 27, 2022

jblondin added enhancement New feature or request and removed documentation Extra documentation is requested labels Mar 27, 2022

jblondin self-assigned this Mar 27, 2022

jqnatividad linked a pull request May 22, 2022 that will close this issue

Add non-utf8-support #16

Open
