Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make _assimilate_histogram() not use self #1071

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 19 additions & 7 deletions dataprofiler/profilers/numerical_column_stats.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
import pandas as pd
import scipy.stats

from . import histogram_utils, profiler_utils
from . import float_column_profile, histogram_utils, profiler_utils
from .base_column_profilers import BaseColumnProfiler
from .profiler_options import NumericalOptions

Expand Down Expand Up @@ -710,6 +710,9 @@ def _preprocess_for_calculate_psi(
entity_count_per_bin=self_histogram["bin_counts"],
bin_edges=self_histogram["bin_edges"],
suggested_bin_count=num_psi_bins,
is_float_histogram=isinstance(
self, float_column_profile.FloatColumn
),
options={
"min_edge": min_min_edge,
"max_edge": max_max_edge,
Expand All @@ -731,6 +734,9 @@ def _preprocess_for_calculate_psi(
entity_count_per_bin=other_histogram["bin_counts"],
bin_edges=other_histogram["bin_edges"],
suggested_bin_count=num_psi_bins,
is_float_histogram=isinstance(
self, float_column_profile.FloatColumn
),
options={
"min_edge": min_min_edge,
"max_edge": max_max_edge,
Expand Down Expand Up @@ -1360,7 +1366,12 @@ def _update_histogram(self, df_series: pd.Series) -> None:
self._stored_histogram["total_loss"] += histogram_loss

def _regenerate_histogram(
self, entity_count_per_bin, bin_edges, suggested_bin_count, options=None
self,
entity_count_per_bin,
bin_edges,
suggested_bin_count,
is_float_histogram,
options=None,
) -> tuple[dict[str, np.ndarray], float]:

# create proper binning
Expand All @@ -1372,6 +1383,11 @@ def _regenerate_histogram(
new_bin_edges = np.linspace(
options["min_edge"], options["max_edge"], suggested_bin_count + 1
)

# if it's not a float histogram, then assume it only contains integer values
if not is_float_histogram:
bin_edges = np.round(bin_edges)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a big assumption ... can you comment on the rationale here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same assumption was being made in _assimilate_histogram (diff). I moved it from _assimilate_histogram to _regenerate_histogram.

We could also remove that rounding code entirely, but it causes a single test failure here. I haven't been able to figure out what purpose the rounding fulfills here, as it existed since the first release. I kept it so that the behavior of _regenerate_histogram wouldn't change from before.

But you're right that it's a big assumption, and it's worth considering its removal. I'll open an alternative PR for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #1073 is an alternative to this PR where we remove the rounding code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, after extensive discussion with @taylorfturner @micdavis we have determined that this assumption is valid and all values passed into this function at the current placement in the workflow will all be numeric (either ints and floats). I think there should be an additional test where this line 1389 is called (or if a test already exists) to validate that bin edges are able to be rounded at that point in the workflow (i.e. is it possible to get a non-roundable value at this point in the code, which I believe is not possible).

Assuming this test is created I am good with this version of code and we can delete PR 1073

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I closed #1073

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksneab7 Thanks for the feedback! Line 1389 is called in the existing test TestTextColumnProfiler.test_profile (I've attached a screenshot of my debugger stopping at line 1389 while running the test, to demonstrate this). Also, if this rounding code is removed entirely, the output changes and this test fails, as I explained in #1073.

Screenshot 2024-01-09 at 2 31 52 PM

Does this fulfill what you meant, or did you have something else in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope this is exactly what I was looking for thank you!


return self._assimilate_histogram(
from_hist_entity_count_per_bin=entity_count_per_bin,
from_hist_bin_edges=bin_edges,
Expand Down Expand Up @@ -1417,11 +1433,6 @@ def _assimilate_histogram(

bin_edge = from_hist_bin_edges[bin_id : bin_id + 3]

# if we know not float, we can assume values in bins are integers.
is_float_profile = self.__class__.__name__ == "FloatColumn"
if not is_float_profile:
bin_edge = np.round(bin_edge)

# loop until we have a new bin which contains the current bin.
while (
bin_edge[0] >= dest_hist_bin_edges[new_bin_id + 1]
Expand Down Expand Up @@ -1513,6 +1524,7 @@ def _histogram_for_profile(
entity_count_per_bin=bin_counts,
bin_edges=bin_edges,
suggested_bin_count=suggested_bin_count,
is_float_histogram=isinstance(self, float_column_profile.FloatColumn),
)

def _get_best_histogram_for_profile(self) -> dict:
Expand Down