Auto-detect delimiter #199

Closed
harrybiddle opened this issue Mar 18, 2022 · 16 comments · Fixed by #203

Comments

@harrybiddle
Contributor

Is your feature request related to a problem? Please describe.

In my daily work, a .csv file is comma-separated 50% of the time, and semi-colon-separated the other 50%. I use this command-line tool hundreds of times a day. It's incredibly frustrating to have to first figure out which separator character is being used, and then adjust my command appropriately.

Describe the solution you'd like

Ideally qsv would auto-detect the delimiter by default. However, this would break backwards compatibility, so I suggest having an environment variable to turn it on:

$ export QSV_AUTO_DETECT_DELIMITER=1
$ qsv table my_file.csv

When this environment variable is set, any value of QSV_DELIMITER would be ignored.

Describe alternatives you've considered

I tried to achieve this using a bash wrapper, but it was a bit fiddly because I need to do different things depending on whether qsv is being passed a file (in which case I sniff the file and then pass the delimiter to the qsv command) or a stream (in which case I sniff the stream, and then pass the amount I've already sniffed plus the rest of the stream to qsv).
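
For the file case, the wrapper can be pretty small. Something like this sketch (sniff_delimiter is a hypothetical helper that prints the detected delimiter character; it could be any sniffing tool):

#!/usr/bin/env bash
# "File" branch of the wrapper: sniff the delimiter of the file on disk,
# then pass it to qsv explicitly via -d/--delimiter.
file="$1"
delim=$(sniff_delimiter "$file")    # hypothetical helper: prints "," or ";" etc.
qsv table -d "$delim" "$file"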

Additional context

Duplicate of BurntSushi/xsv#294

@jqnatividad
Collaborator

jqnatividad commented Mar 18, 2022

@harrybiddle I agree that this would be a useful option.

I would limit it, though, to single-character ASCII delimiters, and sniff candidate delimiters from a provided list.

Perhaps the env var could be something like QSV_AUTODETECT_DELIMITERS=",\t;|^"?
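
One wrinkle if that list is taken literally: to put a real tab into an environment variable, the shell has to expand it, e.g. with ANSI-C quoting (the variable name here is just the proposal above, not an implemented option):

# $'...' makes the shell turn \t into an actual tab character
export QSV_AUTODETECT_DELIMITERS=$',\t;|^'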

WDYT? A PR would be most welcome!

@harrybiddle
Contributor Author

I would limit it, though, to single-character ASCII delimiters, and sniff candidate delimiters from a provided list.

That would make sense to me. I guess there are already sniffers out there (I see this one?), so we could depend on that, or re-implement the idea to minimise dependencies.

WDYT? A PR would be most welcome!

I would love to give this a go, but unfortunately my Rust knowledge is less than ϵ, where ϵ is an arbitrarily small number 😜.

@jqnatividad
Collaborator

jqnatividad commented Mar 18, 2022

The csv_sniffer crate is a good find, but it seems unmaintained, though we can certainly leverage techniques used there.

As for Rust, I wouldn't consider myself a Rust veteran. I just picked it up in 2020 while working on a project of my own that used xsv, scratched a few itches, and ended up doing the qsv fork.

It does have a relatively steep initial learning curve, but once you get over the hump, it's most enjoyable and makes you realize how inefficient interpreted, weakly typed, garbage-collected languages are.

Anyways, I'll take a look at it over the weekend...

@jqnatividad
Collaborator

Hi @harrybiddle,
I ended up creating a PR for csv-sniffer.

When it gets merged and csv-sniffer is published, I'll leverage it for this request using a QSV_SNIFF environment variable.

Since it has auto-preamble detection, I'll also use it to automatically skip lines.

@jblondin

Hi, csv_sniffer owner here. It's definitely been unmaintained (or rather, I completely forgot I wrote it), but I'm happy to maintain it if it's being used :). I've merged @jqnatividad's PR and republished.

@jqnatividad
Collaborator

Awesome @jblondin, I'll be sure to integrate it and let you know!

@harrybiddle
Contributor Author

harrybiddle commented Mar 21, 2022

Thank you for the quick feature! I compiled master and tried this out. Unfortunately it failed on the first file I tried, but I think I was just unlucky and happened to find a file which csv-sniffer couldn't handle. @jblondin I've opened an issue over there: jblondin/csv-sniffer#13. It seems to work great on other files :).

@jqnatividad any reason why we aren't able to do sniffing when the file is piped in? The way I had this working in my noddy bash script was to read a sample into a buffer in memory, sniff it, and then pass the buffer on, followed by the rest of the input pipe.
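
For reference, the stream branch of that wrapper looks roughly like this (a sketch only, not how qsv does it; sniff_delimiter is a hypothetical helper, and 8 KiB is an arbitrary sample size):

#!/usr/bin/env bash
# Sniff a piped stream without losing any input: buffer a sample to a temp
# file, sniff it, then replay the sample followed by the rest of stdin.
set -euo pipefail
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
head -c 8192 > "$sample"                 # take the first 8 KiB of the pipe
delim=$(sniff_delimiter < "$sample")     # hypothetical helper: prints "," or ";" etc.
cat "$sample" - | qsv table -d "$delim"  # replay the sample, then the rest of stdin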

I wouldn't mind the challenge of trying to do this myself, but it might take me a few months to ramp up on Rust and also find the time.

@jqnatividad
Collaborator

@harrybiddle it's an encoding issue. I handled it in qsv by transcoding to UTF-8 using encoding_rs_io.

With QSV_SNIFF_DELIMITER and the qsv sniff command, however, I use sniff_path and let csv-sniffer handle the file I/O, which also explains why qsv doesn't currently do CSV sniffing on stdin.

I'll take another go at it for the next release and see if I can use sniff_reader instead, so qsv can handle pipes and UTF-8 transcoding as well.

@jqnatividad jqnatividad reopened this Mar 21, 2022
@jqnatividad
Collaborator

jqnatividad commented Mar 21, 2022

BTW @harrybiddle, is it OK to include the sample file in jblondin/csv-sniffer#13 and use it with qsv's integration tests?

@jqnatividad
Collaborator

Also @harrybiddle, I'm very interested in your use cases, given that you use qsv "hundreds of times a day."

And you don't have to learn Rust to contribute... it'd be awesome if you could share some recipes showing how you've put qsv to work in the Cookbook.

@jblondin

@harrybiddle's test file works with csv-sniffer directly (jblondin/csv-sniffer#13 (comment)), so it may be an issue with how it's called? I'll investigate a bit here as well.

@harrybiddle
Contributor Author

Yes, please feel free to use it as a test :). It sounds like we might have to guess the encoding too?

I use the tool hundreds of times a day, but purely for my own development cycle. I work a lot with CSVs due to the work environment I'm in, and I very frequently need to:

  • Sum up totals over columns
  • Preview data
  • Print column header names
  • Select columns, get the unique values
  • Try to diff two files when aggregated over a subset of columns
  • etc.

I found xsv a year or two ago and it worked great for me (aside from the pain of having to write my bash wrapper!). Then I discovered visidata and tried to use it, but it was a whole new world of keyboard shortcuts to learn; I find the Linux piping workflow to be much more flexible. I often pipe the data into terminal plotting libraries.

Happy to throw a few of my more common commands in the cookbook 👍. I don't do anything advanced; it's 70% head my_file.csv | qsv table | less -S.
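
For the Cookbook, those tasks map onto one-liners roughly like these (column names are made up, and the subcommands assumed here are the ones qsv inherits from xsv: headers, stats, select, frequency, table):

qsv headers my_file.csv                   # print column header names
qsv stats my_file.csv | qsv table         # per-column summary stats, including sums
qsv select price,qty my_file.csv          # pull out a couple of columns
qsv frequency -s country my_file.csv      # unique values of a column, with counts
head my_file.csv | qsv table | less -S    # quick preview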

@jqnatividad
Collaborator

That's awesome @harrybiddle! I look forward to your recipes, and do feel free to include third-party CLI tools with 'em.

Speaking of which, you may want to check out csvlens - it's like less for CSVs. 😄

@harrybiddle
Contributor Author

Ooh, thanks, I didn't know about that. Combined with csv-sniffer I've got myself auto-delimiter detection ;)

#!/usr/bin/env bash

file="$1"
delimiter=$(sniff "$file" | grep Delimiter | awk '{print $2}')
csvlens --delimiter "$delimiter" "$file"

@jqnatividad jqnatividad pinned this issue Mar 22, 2022
@jqnatividad
Collaborator

Still encountering an error on Windows 11, Ubuntu Linux 20.04 LTS, and macOS Monterey with the csv-sniffer binary parsing the sample file in jblondin/csv-sniffer#13...

@jqnatividad jqnatividad unpinned this issue Apr 4, 2022
@jqnatividad
Collaborator

Now that qsv REQUIRES UTF-8 encoding and even scans for it on startup, closing this issue.

https://github.com/jqnatividad/qsv#encoding
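
If a file isn't valid UTF-8, one workaround is to transcode it before handing it to qsv, e.g. with iconv (the source encoding below is just a guess and would need adjusting per file):

iconv -f windows-1252 -t utf-8 my_file.csv | qsv table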
