math precision problems in xsv stats #100

Open
vielmetti opened this issue Sep 26, 2017 · 8 comments

@vielmetti

Input data set: "iris.csv" from csvkit, a copy of which is at https://gist.github.com/447b7d1ef6ef79add6084272e91dcf83

What I got:

$ xsv stats iris.csv
field,type,sum,min,max,min_length,max_length,mean,stddev
sepal_length,Float,876.5000000000002,4.3,7.9,3,3,5.843333333333333,0.8253012917851416
sepal_width,Float,458.60000000000014,2,4.4,3,3,3.0573333333333315,0.4344109677354942
petal_length,Float,563.7000000000004,1,6.9,3,3,3.758,1.759404065775303
petal_width,Float,179.90000000000012,0.1,2.5,3,3,1.1993333333333336,0.7596926279021594
species,Unicode,,Iris-setosa,Iris-virginica,11,15,,

What I expected:

The values in the "sum" column to be more exact (e.g., 876.5 for sepal_length rather than 876.5000000000002).

@hmenke

hmenke commented Oct 9, 2017

That is floating point math. There is nothing to do about it.

See also this Stack Overflow thread: Is floating point math broken?
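
A minimal Rust sketch of the point, outside of xsv itself: summing a decimal fraction such as 0.1 as an f64 accumulates a small error, because 0.1 has no exact binary representation:

fn main() {
    let mut sum = 0.0_f64;
    for _ in 0..10 {
        sum += 0.1; // 0.1 is stored as the nearest binary double, not exactly 0.1
    }
    println!("{}", sum);       // prints 0.9999999999999999 rather than 1
    println!("{}", 0.1 + 0.2); // prints 0.30000000000000004
}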

@BurntSushi
Owner

BurntSushi commented Oct 9, 2017

Sorry I didn't see this earlier. I agree that @hmenke is probably right here. Regardless, this issue doesn't have enough detail to be actionable. It's not clear what "I expected 'sum' to be more exact" actually means. What problem are you trying to solve?

@vielmetti
Author

I expected that xsv stats, which says it's a statistics command, would display precisely 876.5 in the sum field for sepal_length given this data set. If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

@BurntSushi
Owner

BurntSushi commented Oct 9, 2017

Oh I see. In that case, it seems like @hmenke is correct. This has nothing to do with "Rust math libraries" and everything to do with floating point arithmetic.

One possible fix I can imagine to this is that xsv records the number of significant digits in the input, and makes sure the output has the same number of significant digits. Patches are welcome, although this does seem rather unpleasant to implement!
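
A rough sketch of that idea, using a hypothetical helper (this is not xsv's code): record the largest number of decimal places seen while parsing a column, then round the displayed sum to that precision:

fn decimal_places(raw: &str) -> usize {
    raw.split('.').nth(1).map_or(0, |frac| frac.len())
}

fn main() {
    let column = ["5.1", "4.9", "4.7", "4.6", "5.0"];
    let mut sum = 0.0_f64;
    let mut places = 0;
    for value in &column {
        sum += value.parse::<f64>().unwrap();
        places = places.max(decimal_places(value));
    }
    // Format the sum with the precision observed in the input.
    println!("{:.*}", places, sum); // prints 24.3, matching the input precision
}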

> If the underlying native Rust math libraries are not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some that do a better job.

This isn't actually helpful. Example: "If xsv is not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some other tool that does a better job."

@vielmetti
Author

I recognize that floating point is not an easy problem. (Ask me some time about doing financial math in Tcl.)

I am aware of tools that do a better job specifically in the realm of producing summary statistics for CSV files; csvstat from csvkit in particular prints the correct answers on the same data set. So I know it's possible.

Looking at csvkit, it appears that it uses the Python agate library, which stores numbers in a Decimal format rather than as floats.

http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html?highlight=decimal

and indeed looking at Rust libraries I see

https://github.com/alkis/decimal

as a possible compatible library for representing numbers in a format that doesn't generate the evident and completely avoidable errors that floating point arithmetic does.
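
To illustrate why a decimal representation sidesteps the drift (this is only a sketch, not the API of the crate linked above): the iris.csv values carry at most one digit after the decimal point, so they can be summed exactly as integer tenths:

fn to_tenths(raw: &str) -> i64 {
    // Simplified: assumes non-negative values with at most one decimal digit,
    // as in iris.csv; a real decimal type handles the general case.
    let mut parts = raw.splitn(2, '.');
    let whole: i64 = parts.next().unwrap().parse().unwrap();
    let frac: i64 = parts.next().unwrap_or("0").parse().unwrap();
    whole * 10 + frac
}

fn main() {
    let column = ["4.3", "7.9", "5.8", "6.4"]; // stand-in for a sepal_length column
    let tenths: i64 = column.iter().map(|v| to_tenths(v)).sum();
    println!("{}.{}", tenths / 10, tenths % 10); // exactly 24.4, no binary rounding
}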

@BurntSushi
Owner

@vielmetti Thanks for those tips! It sounds like this might be something worth experimenting with, although I probably won't do it any time soon myself.

BurntSushi reopened this Oct 9, 2017
@Yomguithereal
Contributor

Sorry if I'm missing the point here, but couldn't we improve the exactness of the float sum a little by using Kahan summation here (or Kahan-Babuska, even faster)? I could open a PR to add it if you want. It's not a silver bullet and clearly less exact than using decimal types, but it could be a compromise.
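
For reference, a sketch of the compensated (Kahan) summation being proposed; this is not xsv's code, just the textbook algorithm:

fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &v in values {
        let y = v - c;
        let t = sum + y;
        c = (t - sum) - y; // recovers what was lost when y was added to sum
        sum = t;
    }
    sum
}

fn main() {
    let values = vec![0.1_f64; 150];
    let naive: f64 = values.iter().sum();
    println!("naive: {}", naive);
    println!("kahan: {}", kahan_sum(&values));
    // The compensated sum typically stays much closer to the exact 15.0,
    // though it cannot make 0.1 itself exactly representable.
}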

@BurntSushi
Owner

@Yomguithereal It's not clear to me that the algorithm for summation is the problem, but rather, the representation of decimal numbers as floats. You are more than welcome to experiment, but please keep in mind that I'll likely reject anything that increases the maintenance burden.
