math precision problems in xsv stats #100

Open
vielmetti opened this issue Sep 26, 2017 · 8 comments

@vielmetti

Input data set: "iris.csv" from csvkit, a copy of which is at https://gist.github.com/447b7d1ef6ef79add6084272e91dcf83

What I got:

$ xsv stats iris.csv
field,type,sum,min,max,min_length,max_length,mean,stddev
sepal_length,Float,876.5000000000002,4.3,7.9,3,3,5.843333333333333,0.8253012917851416
sepal_width,Float,458.60000000000014,2,4.4,3,3,3.0573333333333315,0.4344109677354942
petal_length,Float,563.7000000000004,1,6.9,3,3,3.758,1.759404065775303
petal_width,Float,179.90000000000012,0.1,2.5,3,3,1.1993333333333336,0.7596926279021594
species,Unicode,,Iris-setosa,Iris-virginica,11,15,,

What I expected:

The values in the "sum" column to be more exact (e.g., 876.5 for sepal_length rather than 876.5000000000002).

@hmenke

hmenke commented Oct 9, 2017

That is floating point math. There is nothing to do about it.

See also this Stack Overflow thread: Is floating point math broken?
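
A minimal Rust sketch of the point, outside of xsv itself: summing a decimal fraction such as 0.1 as an f64 accumulates a small error, because 0.1 has no exact binary representation:

fn main() {
    let mut sum = 0.0_f64;
    for _ in 0..10 {
        sum += 0.1; // 0.1 is stored as the nearest binary double, not exactly 0.1
    }
    println!("{}", sum);       // prints 0.9999999999999999 rather than 1
    println!("{}", 0.1 + 0.2); // prints 0.30000000000000004
}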

@BurntSushi
Owner

BurntSushi commented Oct 9, 2017

Sorry I didn't see this earlier. I agree that @hmenke is probably right here. Regardless, this issue doesn't have enough detail to be actionable. It's not clear what "I expected 'sum' to be more exact" actually means. What problem are you trying to solve?

@vielmetti
Author

I expected that xsv stats, which says it's a statistics command, would display precisely 876.5 in the sum field for sepal_length given this data set. If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

@BurntSushi
Owner

BurntSushi commented Oct 9, 2017

Oh I see. In that case, it seems like @hmenke is correct. This has nothing to do with "Rust math libraries" and everything to do with floating point arithmetic.

One possible fix I can imagine to this is that xsv records the number of significant digits in the input, and makes sure the output has the same number of significant digits. Patches are welcome, although this does seem rather unpleasant to implement!
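
A rough sketch of that idea, using a hypothetical helper (this is not xsv's code): record the largest number of decimal places seen while parsing a column, then round the displayed sum to that precision:

fn decimal_places(raw: &str) -> usize {
    raw.split('.').nth(1).map_or(0, |frac| frac.len())
}

fn main() {
    let column = ["5.1", "4.9", "4.7", "4.6", "5.0"];
    let mut sum = 0.0_f64;
    let mut places = 0;
    for value in &column {
        sum += value.parse::<f64>().unwrap();
        places = places.max(decimal_places(value));
    }
    // Format the sum with the precision observed in the input.
    println!("{:.*}", places, sum); // prints 24.3, matching the input precision
}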

> If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

This isn't actually helpful. Example: "If xsv is not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some other tool that does a better job."

@vielmetti
Author

I recognize that floating point is not an easy problem. (Ask me some time about doing financial math in Tcl.)

I am aware of tools that do a better job specifically in the realm of producing summary statistics for CSV files; csvstat from csvkit in particular prints the correct answers on the same data set. So I know it's possible.

Looking at csvkit, it appears that it uses the Python agate library, which stores numbers in a Decimal format rather than as floats.

http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html?highlight=decimal

and indeed looking at Rust libraries I see

https://github.com/alkis/decimal

as a possible compatible library for representing numbers in a format that doesn't generate the evident and completely avoidable errors that floating point arithmetic does.
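
To illustrate why a decimal representation sidesteps the drift (this is only a sketch, not the API of the crate linked above): the iris.csv values carry at most one digit after the decimal point, so they can be summed exactly as integer tenths:

fn to_tenths(raw: &str) -> i64 {
    // Simplified: assumes non-negative values with at most one decimal digit,
    // as in iris.csv; a real decimal type handles the general case.
    let mut parts = raw.splitn(2, '.');
    let whole: i64 = parts.next().unwrap().parse().unwrap();
    let frac: i64 = parts.next().unwrap_or("0").parse().unwrap();
    whole * 10 + frac
}

fn main() {
    let column = ["4.3", "7.9", "5.8", "6.4"]; // stand-in for a sepal_length column
    let tenths: i64 = column.iter().map(|v| to_tenths(v)).sum();
    println!("{}.{}", tenths / 10, tenths % 10); // exactly 24.4, no binary rounding
}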

@BurntSushi
Owner

@vielmetti Thanks for those tips! It sounds like this might be something worth experimenting with, although I probably won't do it any time soon myself.

BurntSushi reopened this Oct 9, 2017
@Yomguithereal
Contributor

Sorry if I'm missing the point here, but couldn't we improve the exactness of the float sum a little by using Kahan summation here (or Kahan-Babuska, even faster)? I could open a PR to add it if you want. It's not a silver bullet and clearly less exact than using decimal types, but it could be a compromise.
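
For reference, a sketch of the compensated (Kahan) summation being proposed; this is not xsv's code, just the textbook algorithm:

fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &v in values {
        let y = v - c;
        let t = sum + y;
        c = (t - sum) - y; // recovers what was lost when y was added to sum
        sum = t;
    }
    sum
}

fn main() {
    let values = vec![0.1_f64; 150];
    let naive: f64 = values.iter().sum();
    println!("naive: {}", naive);
    println!("kahan: {}", kahan_sum(&values));
    // The compensated sum typically stays much closer to the exact 15.0,
    // though it cannot make 0.1 itself exactly representable.
}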

@BurntSushi
Owner

@Yomguithereal It's not clear to me that the algorithm for summation is the problem, but rather, the representation of decimal numbers as floats. You are more than welcome to experiment, but please keep in mind that I'll likely reject anything that increases the maintenance burden.
