improve numeric type inference in xsv stats #22

jungle-boogie · 2015-08-18T05:58:24Z

Hello,

Sample data here:
http://www.sharecsv.com/s/9096d32f98aa0ac671a1cca16fa43be8/SalesJan2009.csv

Running xsv stats SalesJan2009.csv --everything | xsv table

Why does it say the price max is 1200 and the min is 800? The max is actually 7500 and I see a record for 250.

Thanks!

The text was updated successfully, but these errors were encountered:

BurntSushi · 2015-08-18T11:54:04Z

Here is the output I see:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min   max  min_length  max_length  mean  stddev
Price  Unicode  1200  800  3           6

Notice that the type inferred for the Price field is Unicode. Therefore, xsv thinks it is a string instead of a number. I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min  max    min_length  max_length  mean         stddev
Price  Integer  250  13000  3           5           1633.767535  1155.455403

I expect there is room for xsv to improve on its type inference.

jungle-boogie · 2015-08-18T13:05:31Z

Hi Burnt,

Yes, I had a feeling that how it interpreted the data may affect the outcome of it's results.

How did you change it from unicode to integer?

Thanks!

BurntSushi · 2015-08-18T13:13:20Z

I fixed your data:

I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

13,000 doesn't parse as an integer because of the comma.

jungle-boogie · 2015-08-18T13:21:10Z

Hi,

So correcting that one field made it read all records as integer.

Got it, thanks!

BurntSushi · 2015-08-18T13:26:16Z

Yes. xsv looks at the type of each value. If any of them can't be inferred as numeric, then it falls back to interpreting them as plain text.

There's some middle ground here, but we have to be careful with the heuristics we choose.

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

jungle-boogie · 2015-08-18T13:29:19Z

Hi Brunt,

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

I apologize about that and I look forward to any improvements you may bring forth.

Thanks!

br3no · 2017-11-02T12:55:57Z

What about being able to force the type -- no need to implement dangerous heuristics, leave the responsibility to the user. The only thing to implement would be to be able to interpret the plain text as number, however it is formatted.

n8henrie · 2023-03-14T13:54:56Z

A few years later -- I often get 90% of the way with xsv slice / search and then want to end with a stats but get tripped up by a column with currency ($1,300) in which the comma and $ are problematic, and then I'm back to sed / awk. I'd be happy to work on a PR to address this, but I think it would need some architectural direction. Should this be a xsv transform subcommand? A strip or remove option to an existing subcommand?

Thanks for a cool tool :)

BurntSushi changed the title ~~xsv min/max understandings~~ improve type inference in xsv stats Aug 18, 2015

jungle-boogie closed this as completed Aug 18, 2015

BurntSushi reopened this Aug 18, 2015

BurntSushi changed the title ~~improve type inference in xsv stats~~ improve numeric type inference in xsv stats Jul 31, 2016

BurntSushi added the enhancement label Jul 31, 2016

danfowler mentioned this issue Sep 22, 2016

Support for NULL values #38

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve numeric type inference in xsv stats #22

improve numeric type inference in xsv stats #22

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

br3no commented Nov 2, 2017

n8henrie commented Mar 14, 2023

improve numeric type inference in xsv stats #22

improve numeric type inference in xsv stats #22

Comments

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

BurntSushi commented Aug 18, 2015

jungle-boogie commented Aug 18, 2015

br3no commented Nov 2, 2017

n8henrie commented Mar 14, 2023