The initial question is how to calculate the statistics as efficiently as possible. Some approaches off the top of my head include:
calculating statistics using UDAFs. For an exact histogram this would appear to require at least two passes over the data: one to fix the bin edges from min/max, a second to count per bin (standard deviation, by contrast, can be folded into the first pass with a streaming formula). See the first sketch after this list.
leveraging existing Hive/SQL aggregate functions (second sketch below)
exploring/using approximate methods (quantile sketches, HyperLogLog, etc.) for histograms, std dev, and other stats on large data (third sketch below)
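For the UDAF route, here is a rough sketch of the two passes using built-in DataFrame aggregations (a custom Aggregator would have the same shape). The DataFrame `df`, the column name, and the bucket count are all hypothetical placeholders:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

// Exact equal-width histogram in two passes. Assumes hi > lo.
def exactHistogram(df: DataFrame, colName: String, numBuckets: Int): DataFrame = {
  val values = col(colName).cast("double")

  // Pass 1: fix the bin edges. Note that stddev could be folded into
  // this same pass, since Spark computes it with a one-pass formula.
  val Row(lo: Double, hi: Double) = df.agg(min(values), max(values)).head()
  val width = (hi - lo) / numBuckets

  // Pass 2: assign each row to a bucket and count per bucket;
  // `least` clamps the max value into the last bucket.
  df.withColumn("bucket", least(floor((values - lit(lo)) / lit(width)), lit(numBuckets - 1)))
    .groupBy("bucket")
    .count()
    .orderBy("bucket")
}
```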
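For the second approach, the built-in SQL aggregates already cover most of the summary stats in one query. The table and column names here are made up; `percentile_approx` is standard in Spark SQL, and Hive additionally has `histogram_numeric(value, b)`, which returns approximate bins in a single pass (availability depends on the Spark/Hive version):

```scala
// Assumes a SparkSession `spark` is in scope and a DataFrame `df`
// with a numeric column "value" (both hypothetical).
df.createOrReplaceTempView("events")

val summary = spark.sql("""
  SELECT count(value)  AS n,
         avg(value)    AS mean,
         stddev(value) AS std_dev,
         min(value)    AS min_val,
         max(value)    AS max_val,
         percentile_approx(value, array(0.25, 0.5, 0.75)) AS quartiles
  FROM events
""")
summary.show()
```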
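And for the approximate route, Spark already ships one-pass sketches that should scale to large data; again `df` and the "value" column are placeholders:

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Approximate deciles via the Greenwald-Khanna sketch; the last argument
// is the allowed relative error (0.0 would force an exact, costlier result).
val deciles: Array[Double] =
  df.stat.approxQuantile("value", (1 to 9).map(_ / 10.0).toArray, 0.01)

// Approximate distinct count via HyperLogLog++ with ~2% relative error.
val distinctCount: Long =
  df.agg(approx_count_distinct("value", 0.02)).head().getLong(0)
```

The quantile edges could also double as approximately equal-height histogram bins, which is arguably more informative than equal-width bins for skewed data.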
As discussed in our original Spark Summit presentation (see the 22-minute mark).
Listening to myself is awful btw.
This is inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large, distributed data sets.