Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve numeric type inference in xsv stats #22

Open
jungle-boogie opened this issue Aug 18, 2015 · 8 comments
Open

improve numeric type inference in xsv stats #22

jungle-boogie opened this issue Aug 18, 2015 · 8 comments

Comments

@jungle-boogie
Copy link

Hello,

Sample data here:
http://www.sharecsv.com/s/9096d32f98aa0ac671a1cca16fa43be8/SalesJan2009.csv

Running xsv stats SalesJan2009.csv --everything | xsv table

Why does it say the price max is 1200 and the min is 800? The max is actually 7500 and I see a record for 250.

Thanks!

@BurntSushi
Copy link
Owner

Here is the output I see:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min   max  min_length  max_length  mean  stddev
Price  Unicode  1200  800  3           6

Notice that the type inferred for the Price field is Unicode. Therefore, xsv thinks it is a string instead of a number. I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min  max    min_length  max_length  mean         stddev
Price  Integer  250  13000  3           5           1633.767535  1155.455403

I expect there is room for xsv to improve on its type inference.

@BurntSushi BurntSushi changed the title xsv min/max understandings improve type inference in xsv stats Aug 18, 2015
@jungle-boogie
Copy link
Author

Hi Burnt,

Yes, I had a feeling that how it interpreted the data may affect the outcome of it's results.

How did you change it from unicode to integer?

Thanks!

@BurntSushi
Copy link
Owner

I fixed your data:

I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

13,000 doesn't parse as an integer because of the comma.

@jungle-boogie
Copy link
Author

Hi,

So correcting that one field made it read all records as integer.

Got it, thanks!

@BurntSushi
Copy link
Owner

Yes. xsv looks at the type of each value. If any of them can't be inferred as numeric, then it falls back to interpreting them as plain text.

There's some middle ground here, but we have to be careful with the heuristics we choose.

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

@BurntSushi BurntSushi reopened this Aug 18, 2015
@jungle-boogie
Copy link
Author

Hi Brunt,

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

I apologize about that and I look forward to any improvements you may bring forth.

Thanks!

@BurntSushi BurntSushi changed the title improve type inference in xsv stats improve numeric type inference in xsv stats Jul 31, 2016
@br3no
Copy link

br3no commented Nov 2, 2017

What about being able to force the type -- no need to implement dangerous heuristics, leave the responsibility to the user. The only thing to implement would be to be able to interpret the plain text as number, however it is formatted.

@n8henrie
Copy link

A few years later -- I often get 90% of the way with xsv slice / search and then want to end with a stats but get tripped up by a column with currency ($1,300) in which the comma and $ are problematic, and then I'm back to sed / awk. I'd be happy to work on a PR to address this, but I think it would need some architectural direction. Should this be a xsv transform subcommand? A strip or remove option to an existing subcommand?

Thanks for a cool tool :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants