math precision problems in xsv stats
#100
Comments
That is floating point math. Nothing to do there. See also this Stack Overflow thread: Is floating point math broken?
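For readers unfamiliar with the effect, here is a minimal Rust sketch of what "floating point math" means in this context. It is not xsv's code; `naive_sum` is a hypothetical helper that simply adds `f64` values left to right.

```rust
/// Hypothetical helper (not part of xsv): add f64 values left to right.
fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}

/// Neither 0.1 nor 0.2 has an exact binary representation, so their
/// f64 sum differs from the f64 closest to 0.3.
fn tenths_mismatch() -> bool {
    naive_sum(&[0.1, 0.2]) != 0.3
}
```

Printing `naive_sum(&[0.1, 0.2])` with `{}` shows `0.30000000000000004`, the kind of trailing digits that typically show up in a sum over a column of decimal values.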
Sorry I didn't see this earlier. I agree that @hmenke is probably right here. Regardless, this issue doesn't have enough detail to be actionable. It's not clear what "I expected 'sum' to be more exact" actually means. What problem are you trying to solve?
I expected that |
Oh I see. In that case, it seems like @hmenke is correct. This has nothing to do with "Rust math libraries" and everything to do with floating point arithmetic. One possible fix I can imagine to this is that
This isn't actually helpful. Example: "If xsv is not sufficiently conditioned to avoid reporting spurious precision, then I'd suggest looking for some other tool that does a better job."
I recognize that floating point is not an easy problem. (Ask me some time about doing financial math in Tcl.) I am aware of tools that do a better job, specifically in the realm of producing summary statistics for CSV files; see http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html?highlight=decimal. Looking at Rust libraries, I see https://github.com/alkis/decimal as a possible compatible library for representing numbers in a format that doesn't generate the evident and completely avoidable errors that floating point arithmetic does.
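To illustrate why a decimal representation avoids these errors, here is a rough sketch that uses scaled integers. It does not use the API of the decimal crate linked above or of agate; `parse_scaled` and `format_scaled` are hypothetical helpers, and the sketch assumes every field has at most `places` fractional digits, with `places >= 1`.

```rust
/// Hypothetical helper: parse a decimal string such as "5.1" into an
/// integer count of 10^-places units. Returns None for malformed input
/// or for more fractional digits than `places`.
fn parse_scaled(s: &str, places: u32) -> Option<i64> {
    let (negative, body) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_part, frac_part) = match body.split_once('.') {
        Some((i, f)) => (i, f),
        None => (body, ""),
    };
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    if frac_part.len() > places as usize {
        return None;
    }
    let all_digits = |t: &str| t.chars().all(|c| c.is_ascii_digit());
    if !all_digits(int_part) || !all_digits(frac_part) {
        return None;
    }
    let int_val: i64 = if int_part.is_empty() { 0 } else { int_part.parse().ok()? };
    let frac_val: i64 = if frac_part.is_empty() { 0 } else { frac_part.parse().ok()? };
    // Pad the fraction so that "5.1" with places = 2 becomes 510.
    let pad = 10_i64.pow(places - frac_part.len() as u32);
    let scale = 10_i64.pow(places);
    let magnitude = int_val
        .checked_mul(scale)?
        .checked_add(frac_val.checked_mul(pad)?)?;
    Some(if negative { -magnitude } else { magnitude })
}

/// Hypothetical helper: render a scaled integer back as a decimal string.
fn format_scaled(value: i64, places: u32) -> String {
    let scale = 10_i64.pow(places);
    let sign = if value < 0 { "-" } else { "" };
    let abs = value.abs();
    format!("{}{}.{:0width$}", sign, abs / scale, abs % scale, width = places as usize)
}
```

Summing the integers returned by `parse_scaled` is exact integer arithmetic, so no spurious digits appear when the total is rendered with `format_scaled`. The trade-off is a fixed number of fractional digits and possible overflow on very large inputs, which a real decimal type would handle more gracefully.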
@vielmetti Thanks for those tips! It sounds like this might be something worth experimenting with, although I probably won't do it any time soon myself.
Sorry if I'm missing the point here, but couldn't we improve the exactness of the float sum a little by using Kahan summation (or Kahan-Babuska, even faster)? I could open a PR to add it if you want. It's not a silver bullet and is clearly less exact than using decimal types, but it could be a compromise.
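For reference, a minimal sketch of plain Kahan (compensated) summation in Rust. This is an illustration, not a proposed patch to xsv; the function names are hypothetical, and the Kahan-Babuska variant is not shown.

```rust
/// Hypothetical helper: Kahan (compensated) summation over f64 values.
/// A running compensation term `c` captures the low-order bits that are
/// lost each time a small value is added to a large running total.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut c = 0.0_f64; // accumulated rounding error
    for &v in values {
        let y = v - c; // apply the correction carried from earlier steps
        let t = sum + y; // low-order bits of y may be lost here
        c = (t - sum) - y; // recover what was lost (negated)
        sum = t;
    }
    sum
}

/// Plain left-to-right summation, for comparison.
fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}
```

With the input `1.0` followed by ten copies of `1e-16`, `naive_sum` returns exactly `1.0` because each small addend is rounded away, while `kahan_sum` returns a value close to `1.000000000000001`. Compensation reduces accumulated rounding error, but it does not change the fact that inputs such as `0.1` are already inexact as `f64`.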
@Yomguithereal It's not clear to me that the algorithm for summation is the problem, but rather, the representation of decimal numbers as floats. You are more than welcome to experiment, but please keep in mind that I'll likely reject anything that increases the maintenance burden.
Input data set, "iris.csv" from csvkit, a copy of which is at https://gist.github.com/447b7d1ef6ef79add6084272e91dcf83
What I got:
What I expected:
Field type "Sum" more exact.