Fix for histogram merging #815
Conversation
@@ -14,6 +14,218 @@
)


def rework_ptp(maximum, minimum):
Once we fix naming, we should make these private as well.
-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
We can instead use profile.sampling_size.
AttributeError: 'IntColumn' object has no attribute 'sampling_size'
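A minimal sketch of how those two points could be reconciled, assuming the sampling_size attribute from the suggestion above; the fallback is exactly what the diff already computes:

```python
# Hypothetical fallback: prefer sampling_size when the profiler exposes it,
# otherwise derive the dataset size from the stored histogram as before.
dataset_size = getattr(profile, "sampling_size", None)
if dataset_size is None:
    dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
```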
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
Should be able to use profile.min etc.
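For illustration, the suggested replacement would presumably read as follows (assuming the profile's extrema are populated, as the TestColumn fixture further down sets them):

```python
# Use the profile's tracked extrema instead of reading the bin edges
minimum = profile.min
maximum = profile.max
```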
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
return (rework_ptp(maximum, minimum) /
as above
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
as above
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
as above
h : An estimate of the optimal bin width for the given data.
"""
iqr = np.subtract(profile._get_percentile([75]), profile._get_percentile([25]))
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
as above
-------
h : An estimate of the optimal bin width for the given data.
"""
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
as above
dataset_size = sum(profile._stored_histogram["histogram"]["bin_counts"])
minimum = profile._stored_histogram["histogram"]["bin_edges"][0]
maximum = profile._stored_histogram["histogram"]["bin_edges"][-1]
as above
    n_equal_bins = 1
else:
    # Do not call selectors on empty arrays
    width = rework_hist_bin_selectors[bin_name](profile)
I'm wondering if it is better to take in the profile or a dict of properties. This does work and is future-flexible.
I'm curious if there are ways to reduce overall calcs by generating props before this loop and passing that in instead of the profile.
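A rough sketch of that idea, with names taken from the surrounding diff: generate the shared properties once, then pass the dict to each selector instead of the profile:

```python
# Hypothetical: precompute shared properties once, before the selector loop
props = {
    "dataset_size": sum(profile._stored_histogram["histogram"]["bin_counts"]),
    "minimum": profile._stored_histogram["histogram"]["bin_edges"][0],
    "maximum": profile._stored_histogram["histogram"]["bin_edges"][-1],
}
# each selector would then accept the props dict rather than the profile
width = rework_hist_bin_selectors[bin_name](props)
```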
@@ -76,7 +76,6 @@ def __init__(self, options: NumericalOptions = None) -> None:
        self.histogram_selection: str | None = None
        self.user_set_histogram_bin: int | None = None
        self.bias_correction: bool = True  # By default, we correct for bias
-       self._mode_is_enabled: bool = True
Why is this removed?
Weird, I think this may solve some of my issue above. This should not be deleted.
""" | ||
# parse the overloaded bins argument | ||
|
||
rework_hist_bin_selectors = { |
Instead of re-initializing this internally every time, maybe make it a global in this file. We should probably privatize the variable too, and update the name to suggest it comes from a profile, as opposed to "rework".
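Something like the following sketch is presumably what's meant (the name is hypothetical; the keys come from the diff below):

```python
# Defined once at module level instead of being rebuilt on every call,
# privatized, and renamed to reflect that selectors work from a profile
_hist_bin_width_selectors_from_profile = {
    "sturges": _calc_sturges_bin_width_from_profile,
    # ... remaining selectors as in the diff below
}
```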
class TestColumn(NumericStatsMixin):
    def __init__(self):
        NumericStatsMixin.__init__(self)
        self.times = defaultdict(float)
        self.match_count = 5
        self.min = 1
        self.max = 5
        self._biased_skewness = 1.0
        self._stored_histogram["histogram"]["bin_counts"] = [1, 1, 1, 1, 1, 1]
        self._stored_histogram["histogram"]["bin_edges"] = [0, 1, 2, 3, 4, 5, 6]

    def update(self, df_series):
        pass

    def _filter_properties_w_options(self, calculations, options):
        pass
❤️
def _assimilate_histogram(
Yeah, it doesn't look like self is even utilized outside of L1359 in the function, so maybe we add this as an issue (@ksneab7) for follow-up work; for now we focus on getting the working update into the feature branch.
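As a sketch of that follow-up (the first parameter name is hypothetical; the keyword names are from the call site below):

```python
class NumericStatsMixin:
    @staticmethod
    def _assimilate_histogram(from_hist, dest_hist_bin_edges, dest_hist_num_bin):
        # body unchanged; it never touches self, per the note above
        ...
```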
try:
    calculated_loss = (
        (histogram_loss_1 * other1.sample_size)
        + (histogram_loss_2 * other2.sample_size)
    ) / (other1.sample_size + other2.sample_size)
except AttributeError:
    sample_size_1 = sum(other1._stored_histogram["histogram"]["bin_counts"])
    sample_size_2 = sum(other2._stored_histogram["histogram"]["bin_counts"])
    calculated_loss = (
        (histogram_loss_1 * sample_size_1) + (histogram_loss_2 * sample_size_2)
    ) / (sample_size_1 + sample_size_2)
Is sample_size the correct value, or should it be match_count? Do we have tests evaluating this?
Also, I think loss should be the sum, right? It's the aggregate loss of both being inserted into the _stored_histogram.
Sure, match_count is probably right.
In this calculation we are taking the loss of each histogram and using the size of the dataset associated with that loss as a weight factor. So yes, it is a sum; in this case I am doing a weighted sum.
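A tiny worked example of that weighted sum, with made-up numbers:

```python
# Each histogram's loss is weighted by the size of its source dataset
loss_1, size_1 = 0.2, 100
loss_2, size_2 = 0.5, 300
calculated_loss = (loss_1 * size_1 + loss_2 * size_2) / (size_1 + size_2)
print(calculated_loss)  # 0.425, pulled toward the larger dataset's loss
```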
    dest_hist_bin_edges=ideal_bin_edges,
    dest_hist_num_bin=ideal_count_of_bins,
)
try:
We could separate out getting the sizes from the actual calculation so the calculation is not within the try itself, that is, if sizes are needed.
see below
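As an illustration of the suggested separation (a sketch reusing the names from the diff above, assuming the sizes are needed either way):

```python
try:
    sample_size_1 = other1.sample_size
    sample_size_2 = other2.sample_size
except AttributeError:
    # fall back to the stored histograms when sample_size is absent
    sample_size_1 = sum(other1._stored_histogram["histogram"]["bin_counts"])
    sample_size_2 = sum(other2._stored_histogram["histogram"]["bin_counts"])

# the weighted-loss calculation itself no longer sits inside the try
calculated_loss = (
    (histogram_loss_1 * sample_size_1) + (histogram_loss_2 * sample_size_2)
) / (sample_size_1 + sample_size_2)
```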
…d helper function
hist_loss += (
-    ((new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0]))
+    ((new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0]))
This was the last bug to be fixed
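For intuition, with hypothetical numbers: the old subtraction compared bin widths, which is zero whenever the bins happen to have equal width, while the fixed addition compares edge sums, i.e. twice the bin midpoints, so a shifted bin now contributes to the loss:

```python
new_bin_edge = [0.0, 1.0, 2.0]  # edges around the re-binned bin
bin_edge = [4.0, 5.0]           # edges of the original bin

# old (buggy) term: difference of widths -> 0.0 for equal-width bins
width_term = (new_bin_edge[2] - new_bin_edge[1]) - (bin_edge[1] - bin_edge[0])

# fixed term: difference of edge sums, i.e. 2 * midpoint shift -> -6.0 here
midpoint_term = (new_bin_edge[2] + new_bin_edge[1]) - (bin_edge[1] + bin_edge[0])
```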
"sturges": _calc_sturges_bin_width_from_profile, | ||
} | ||
|
||
|
||
def _get_bin_edges( | ||
a: np.ndarray, |
Think we might like a more descriptive variable name here?
It's not in code we changed, and it is actually how numpy implements it as well. My suggestion is to table this for now.
good call
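For reference, numpy's public histogram API does use a for the input array, which is the precedent being cited:

```python
import numpy as np

# numpy's own signature: histogram_bin_edges(a, bins=10, range=None, weights=None)
edges = np.histogram_bin_edges(np.array([1.0, 2.0, 2.5, 4.0]), bins="sturges")
```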
* rough draft of merge fix for histograms
* final fixes for passing of existing tests
* [WIP] Part 1 fix for categorical mem opt issue (#795)
* part_1 of fix for mem optimization for categoical dict creation issue
* precommit fix
* Separated the update from the check in stop conditions for categoical columns
* added tests and accounted for different varaibles affected by the change made to categories attribute
* Modifications to code based on test findings
* Fixes for logic and tests to match requirements from PR
* Fix for rebase carry over issue
* fixes for tests because of changes to variable names in categorical column object
* precommit fixes and improvement of code based on testing
* added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808)
* implementation of setting stop conds via options for cat column profiler (#810)
* Space time analysis improvement (#809)
* Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate)
* Modified md file to account for new variable in space time analysis code
* fix: cat bug (#816)
* hotfix for more conservatitive stop condition in categorical columns (#817)
* [WIP] Fix for histogram merging (#815)
* rough draft of merge fix for histograms
* final fixes for passing of existing tests
* Added option to remove calculations for updating row statistics (#827)
* Fix to doc strings (#829)
* Preset Option Fix: presets docsstring added (#830)
* presets docsstring added
* Update dataprofiler/profilers/profiler_options.py
* Update dataprofiler/profilers/profiler_options.py

Co-authored-by: Taylor Turner <[email protected]>

* Update dataprofiler/profilers/profiler_options.py
* Update dataprofiler/profilers/profiler_options.py

---------

Co-authored-by: Taylor Turner <[email protected]>

---------

Co-authored-by: ksneab7 <[email protected]>
Co-authored-by: Michael Davis <[email protected]>
Co-authored-by: JGSweets <[email protected]>
No description provided.