-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Part 1 fix for categorical mem opt issue #795
Part 1 fix for categorical mem opt issue #795
Conversation
409f16c
to
a50a7e0
Compare
self.max_sample_size_to_check_stop_condition is not None | ||
and len(data) >= self.max_sample_size_to_check_stop_condition | ||
and self.stop_condition_unique_value_ratio is not None | ||
and len(self._categories) / len(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merged_unique_count = len(self._categories)
merged_sample_size = (self.sample_size + len(data))
merged_unique_ratio = merged_unique_count / merged_sample_size
if (
self.max_sample_size_to_check_stop_condition is not None
and self. stop_condition_unique_value_ratio is not None
and merged_sample_size >= self.max_sample_size_to_check_stop_condition
and merged_unique_ratio >= stop_condition_unique_value_ratio
):
self._stop_condition_is_met = True
self._stopped_at_unique_ratio = merged_unique_ratio
self._stopped_at_unique_count = merged_unique_count
I think this keeps it clear and then sets the values.
An option would be if we want this function to just be a check, we can calculate in_update_categories
:
merged_unique_count = len(self._categories)
merged_sample_size = (self.sample_size + len(data))
merged_unique_ratio = merged_unique_count / merged_sample_size
and pass it into the check function:
```python
def _check_stop_condition_is_met(self, sample_count, sample_size, unqiue_ratio):
"""Return value stop_condition_is_met given stop conditions.
:return: boolean for stop conditions
"""
if (
self.max_sample_size_to_check_stop_condition is not None
and self. stop_condition_unique_value_ratio is not None
and sample_size >= self.max_sample_size_to_check_stop_condition
and unique_ratio >= stop_condition_unique_value_ratio
):
return True
return False
then in _update_categories
:
if self._check_stop_condition_is_met(merged.....):
self._stop_condition_is_met = True
self._categories = {}
self._stopped_at_unique_ratio = merged_unique_ratio
self._stopped_at_unique_count = merged_unique_count
^^ similar logic can then be used in the __add__
where we calc the:
if not profile1._stop_condition_is_met and not profile2._stop_condition_is_met:
# merge dicts and then check the values for the merging with the stop condition, etc.
else:
# merged categories is empty
# unique ratio and unique count can be take from the profile with the largest sample size
# if we avg, will be problem for streams.
# also transfer stop conditions: enforce the most expensive /lenient case for now (open to either case)
between the two profies
dataprofiler/tests/profilers/test_categorical_column_profile.py
Outdated
Show resolved
Hide resolved
dataprofiler/tests/profilers/test_categorical_column_profile.py
Outdated
Show resolved
Hide resolved
dataprofiler/tests/profilers/test_categorical_column_profile.py
Outdated
Show resolved
Hide resolved
dataprofiler/tests/profilers/test_categorical_column_profile.py
Outdated
Show resolved
Hide resolved
dataprofiler/tests/profilers/test_categorical_column_profile.py
Outdated
Show resolved
Hide resolved
merged_profile._stopped_at_unique_ratio = self._stopped_at_unique_ratio | ||
merged_profile._stopped_at_unique_count = self._stopped_at_unique_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, default should suffice in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
if self._stop_condition_is_met: | ||
self._categories = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we already set:
self._stopped_at_unique_ratio = merged_unique_ratio
self._stopped_at_unique_count = merged_unique_count
below, we can join it with:
self._categories = {}
However, let's consider the function description and the action of the function.
When we say update_stop_condition
, to me it would only update _stop_condition_is_met
boolean. Do we wnat to rename it to be more explicit about it doing more? or do we need update_stop_condition
?
Could we just have _update_categories
and _check_stop_condition_is_met
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to note these are code readability / quality comments, not functional as I think the code is in a functional state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a feature branch, we can refactor in subsequent PR.
return True | ||
return False | ||
|
||
def update_stop_condition(self, data: DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want this to continue to exist, I suggest making it private as we wouldn't want an external user to use this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wouldnt we want that?
merged_profile._stopped_at_unique_ratio = max( | ||
self.unique_ratio, other.unique_ratio | ||
) | ||
merged_profile._stopped_at_unique_count = max( | ||
self.unique_count, other.unique_count | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be purely based on max?
Consider:
Profile 1
has 10 samples and a unique_ratio of 1.00Profile 2
has 10000 samples and a unique ratio of 0.50.
Do these values make sense?
Also, do we want to potentially choose different values from each, could that be problematic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option of not choosing one profiles values vs the other's complicates the sample_size below I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have tests which ensure that if self has the values that get assigned vs swapping and ensuring that other's values can be chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to talk about this more thoroughly, I feel that not matter what we choose here, in the current state, there is going to be a problem and there isnt an alternative suggested here, so not sure how to move forward.
merged_profile.times = ( | ||
self.times | ||
if self.times["categories"] >= other.times["categories"] | ||
else other.times | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't later merged_profile times. This should already be handled via the _add_helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should make sure our tests validates this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we should not add the times together because that is not consistent with a stop condition being hit
if self.times["categories"] >= other.times["categories"] | ||
else other.times | ||
) | ||
merged_profile.sample_size = self.sample_size + other.sample_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this is true if we are taking one value over the other above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont agree with this statement (mentioned above)
@@ -578,6 +614,44 @@ def test_categorical_merge(self): | |||
} | |||
self.assertCountEqual(report_count, expected_dict) | |||
|
|||
# Setting up of profile with stop condition not yet met | |||
profile_w_stop_cond_1 = CategoricalColumn("merge_stop_condition_test") | |||
profile_w_stop_cond_1.max_sample_size_to_check_stop_condition = 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should ensure times are correct, but looks like we don't properly do that in this file rn.
The way we've done previously is mocking timeit as a generator.
Example:
with test_utils.mock_timeit(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add issue for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added:
#806
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concerns about merge and suggestions for the update portion.
86e4681
to
d4d07e2
Compare
self._merge_calculations( | ||
merged_profile.__calculations, self.__calculations, other.__calculations | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs to be conducted above. right now we don't have it populated, but it should be done in either case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming refactors after this in subsequent PRs
need these refactors to be notated as follow-up issues @ksneab7 |
ebb3995
into
capitalone:feature/memory-optimization
* part_1 of fix for mem optimization for categoical dict creation issue * precommit fix * Separated the update from the check in stop conditions for categoical columns * added tests and accounted for different varaibles affected by the change made to categories attribute * Modifications to code based on test findings * Fixes for logic and tests to match requirements from PR * Fix for rebase carry over issue * fixes for tests because of changes to variable names in categorical column object * precommit fixes and improvement of code based on testing
* [WIP] Part 1 fix for categorical mem opt issue (#795) * part_1 of fix for mem optimization for categoical dict creation issue * precommit fix * Separated the update from the check in stop conditions for categoical columns * added tests and accounted for different varaibles affected by the change made to categories attribute * Modifications to code based on test findings * Fixes for logic and tests to match requirements from PR * Fix for rebase carry over issue * fixes for tests because of changes to variable names in categorical column object * precommit fixes and improvement of code based on testing * added stop_condition_unique_value_ratio and max_sample_size_to_check_stop_condition to CategoricalOptions (#808) * implementation of setting stop conds via options for cat column profiler (#810) * Space time analysis improvement (#809) * Made space time analysis code improvements (detect if dataset is already generated, specify cats to generate) * Modified md file to account for new variable in space time analysis code * fix: cat bug (#816) * hotfix for more conservatitive stop condition in categorical columns (#817) * [WIP] Fix for histogram merging (#815) * rough draft of merge fix for histograms * final fixes for passing of existing tests * Added option to remove calculations for updating row statistics (#827) * Fix to doc strings (#829) * Preset Option Fix: presets docsstring added (#830) * presets docsstring added * Update dataprofiler/profilers/profiler_options.py * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <[email protected]> * Update dataprofiler/profilers/profiler_options.py * Update dataprofiler/profilers/profiler_options.py --------- Co-authored-by: Taylor Turner <[email protected]> --------- Co-authored-by: ksneab7 <[email protected]> Co-authored-by: Michael Davis <[email protected]> Co-authored-by: JGSweets <[email protected]>
Dataset used is a single column ordered dataset
Conditions set:
self.max_sample_size_to_check_stop_condition = 100
self.stop_condition_unique_value_ratio = .90
Separate measure for two different profiles to indicate merge differences: