Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numerical column stats update #1089

Merged

Conversation

atl1502
Copy link
Contributor

@atl1502 atl1502 commented Jan 29, 2024

No description provided.

@atl1502 atl1502 requested a review from a team as a code owner January 29, 2024 22:22
@atl1502 atl1502 changed the title [WIP] Numerical column stats update Numerical column stats update Feb 2, 2024
@atl1502 atl1502 force-pushed the numerical_column_stats_update branch from e4be13f to 5aaf2b6 Compare February 2, 2024 20:01
not df_series_clean.empty
and df_series_clean.apply(pd.to_numeric, errors="coerce").dtype == "O"
)
if self._greater_than_64_bit:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good design choice here I like it

ksneab7
ksneab7 previously approved these changes Feb 2, 2024
@taylorfturner taylorfturner added the New Feature A feature addition not currently in the library label Feb 5, 2024
Comment on lines 1288 to 1291
if isinstance(values, (np.ndarray, list)):
unique_value = values[0]
else:
unique_value = values.iloc[0]
unique_value = values[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there is no reason for the if/else statement because they are doing the same thing no matter what the condition is.

micdavis
micdavis previously approved these changes Feb 5, 2024
micdavis
micdavis previously approved these changes Feb 7, 2024
@atl1502 atl1502 mentioned this pull request Feb 7, 2024
df_series = pl.from_pandas(df_series)
min_value = df_series.min()
if self.min is not None:
min_value = type(self.min)(min_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not intuitive what is happening here.... type(self.min)(min_value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is casting min_value to the current type of self.min

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed actually updating it to remove this line

batch_count,
batch_biased_variance,
batch_mean,
self._biased_variance = np.float64(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this add np.float64(....?

)

@BaseColumnProfiler._timeit(name="skewness")
def _get_skewness(
self,
df_series: pd.Series,
df_series: pd.Series | np.ndarray,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this (and following) add to account for larger than 64 bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup

@taylorfturner taylorfturner enabled auto-merge (squash) February 12, 2024 14:11
@taylorfturner taylorfturner merged commit 6fadd7f into capitalone:feature/polars Feb 12, 2024
5 checks passed
@atl1502 atl1502 deleted the numerical_column_stats_update branch February 12, 2024 16:24
taylorfturner pushed a commit to taylorfturner/DataProfiler that referenced this pull request Mar 7, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
taylorfturner pushed a commit to taylorfturner/DataProfiler that referenced this pull request Mar 7, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
micdavis pushed a commit that referenced this pull request Mar 7, 2024
* Staging into `main` from `dev`  (#1106)

* add downloads tile (#1085)

* Hot fix json bug (#1105)

* update

* update

* update version (#1107)

* add polars to requirements (#1087)

* add polars to requirements

* Update requirements.txt

Co-authored-by: Taylor Turner <[email protected]>

---------

Co-authored-by: Taylor Turner <[email protected]>

* update precommit env (#1088)

* Numerical column stats update (#1089)

* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting

* Profiler utils update (#1092)

* update profiler utils

* finish updates

---------

Co-authored-by: Andrew <[email protected]>
taylorfturner pushed a commit to taylorfturner/DataProfiler that referenced this pull request Mar 22, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
taylorfturner pushed a commit that referenced this pull request Mar 22, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
abajpai15 pushed a commit to abajpai15/DataProfiler that referenced this pull request Apr 11, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
taylorfturner pushed a commit that referenced this pull request Apr 15, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
atl1502 added a commit that referenced this pull request Apr 16, 2024
* partial update to numerical_column_stats

* update with full polars replacement

* reduce redundant if statement

* fix histogram warning

* remove unneeded casting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants