
Added DropNullColumn transformer to remove columns that contain only nulls #1115

Merged (64 commits) on Nov 18, 2024

Conversation

@rcap107 (Contributor) commented Oct 17, 2024

fixes #1110

DropNullColumn (provisional name) takes a column as input and drops it if all of its values are null or NaN. TableVectorizer was also updated with a drop_null_columns flag, set to False by default; when the flag is set to True, DropNullColumn is added as a preprocessing step for all columns.

I've also added drop and is_all_null to _common.py, though I'm not sure they belong there. Maybe is_all_null can stay in the DropNullColumn file.

The test I wrote passes, but I'm not sure if it's good enough.

The documentation is still missing.
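A minimal sketch of how the flag is meant to be used, assuming the behavior described above (the toy dataframe is made up for the example):

import numpy as np
import pandas as pd
from skrub import TableVectorizer

# Toy dataframe: one informative column and one all-NaN column.
df = pd.DataFrame({
    "idx": [1, 2, 3],
    "value_nan": [np.nan, np.nan, np.nan],
})

# With drop_null_columns=True, the all-null column is removed during
# preprocessing; with the default (False), it is kept and vectorized.
tv = TableVectorizer(drop_null_columns=True)
out = tv.fit_transform(df)
print(out.columns)  # "value_nan" should no longer appear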

@TheooJ (Contributor) left a comment

Hi @rcap107! I made a first pass and have a few comments:

  • Personally, I like the name DropNullColumn; I think it's clear what it does!
  • I would rename the file _drop_null.py.
  • Make sure you run pre-commit run --all-files before pushing; that seems to be what's breaking the CI for you here.
  • I think is_all_null could be placed in the DropNullColumn file if it's only used there for now, but I could also see it living in _common.py.

In skrub/_dataframe/_common.py:
@@ -1187,3 +1208,15 @@ def with_columns(df, **new_cols):
cols = {col_name: col(df, col_name) for col_name in column_names(df)}
cols.update({n: make_column_like(df, c, n) for n, c in new_cols.items()})
return make_dataframe_like(df, cols)

@dispatch
def drop(obj, col):
Contributor:

I don't know if drop is necessary; you could use skrub selectors directly:
df = s.select(df, ~s.cols(col))
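For context, a self-contained version of that suggestion might look like this (a sketch assuming the usual selectors import; the toy dataframe is made up):

import pandas as pd
from skrub import selectors as s

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Keep every column except "b", i.e. drop it without a dedicated helper.
df = s.select(df, ~s.cols("b"))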

@@ -191,6 +192,9 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
similar functionality to what is offered by scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

drop_null_columns : bool, default=False
Contributor:
Do we want it to be True by default?

Contributor (Author):
That should be discussed with the others, I think.

Member:

I vote for True by default; there's nothing we can learn from a completely empty column.

If it is False by default, I think it should be set to True in the tabular_learner.

In skrub/tests/test_dropnulls.py:
main_table_dropped = ns.drop(main_table_dropped, "value_nan")

# Don't drop null columns
tv = TableVectorizer(drop_null_columns=False)
Contributor:
This test needs to go in the TableVectorizer test file, IMO.

Contributor (Author):
I can move it 👍

@rcap107 (Contributor, Author) commented Oct 21, 2024

Hi @TheooJ, thanks a lot for the comments! I'll address them and update the PR 👍


# assert_array_equal(
# sbd.to_numpy(sbd.col(drop_null_table, "value_almost_null")),
# np.array(["almost", None, None]),
Contributor (Author):
Not sure how to write this check so that it works with both pandas and polars.

Contributor:
You could use df_module as a fixture in the test by adding it to the arguments, then comparing series instead of numpy arrays:

df_module.assert_column_equal(
    sbd.col(drop_null_table, "value_almost_null"),
    df_module.make_column("value_almost_null", ["almost", None, None]),
)

Contributor:
The test would look like this:

def test_single_column(drop_null_table, df_module):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropNullColumn()
    assert dn.fit_transform(drop_null_table["value_nan"]) == []
    assert dn.fit_transform(drop_null_table["value_null"]) == []

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "idx"), df_module.make_column("idx", [1, 2, 3])
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_nan"),
        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
    )

    df_module.assert_column_equal(
        sbd.col(drop_null_table, "value_almost_null"),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )

Contributor:
This also sidesteps the fact that null values are not treated the same way depending on the pandas version.

@jeromedockes (Member):
The failure in the min-deps environment is not related to this PR; the fix is in #1122.

@@ -536,6 +542,9 @@ def add_step(steps, transformer, cols, allow_reject=False):
cols = s.all() - self._specific_columns

self._preprocessors = [CheckInputDataFrame()]
if self.drop_null_columns:
add_step(self._preprocessors, DropNullColumn(), cols, allow_reject=True)
Member:
  • We may want to insert it after CleanNullStrings, so that a column that becomes full of nulls after converting "N/A" to null will also be dropped (see the sketch below). Also, it's not important, but your transformer never raises a RejectColumn exception, so allow_reject has no effect; you don't need it here and can leave the default.
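A small illustration of why this ordering matters, using plain pandas rather than skrub internals (the replace call stands in for what CleanNullStrings does):

import pandas as pd

# A column that is not literally null, but becomes all-null once
# "N/A"-style strings are converted to real missing values.
col = pd.Series(["N/A", "N/A", "N/A"], name="status")

cleaned = col.replace("N/A", pd.NA)  # roughly what CleanNullStrings does
print(cleaned.isna().all())  # True: DropNullColumn should now drop it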

Contributor (Author):
I added it after CleanNullStrings, but I think I did it in an ugly way; maybe it can be cleaned up.

@rcap107 (Contributor, Author) commented Nov 8, 2024

> [...] when in debug mode?

I'm not sure I understand what debug mode is.

Never mind, I'll just raise the warning by default. I was thinking it might be possible to only raise a warning in verbose mode (if that's even a thing), but in the end I went with a different solution.

  • DropNull then takes the same argument and default

Also, "warn" looks like a good default, but that's up for discussion.

We also need to decide whether it's "warn and drop" or "warn and keep", and explain the behavior in the documentation.
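For clarity, a hypothetical sketch of the behaviors under discussion (the function and the on_null values are illustrative assumptions, not the PR's API):

import warnings

import pandas as pd

def handle_all_null(column: pd.Series, on_null: str = "warn"):
    """Illustrative handling of an all-null column: 'warn and keep', 'raise', or 'drop'."""
    if not column.isna().all():
        return column
    if on_null == "warn":
        # "warn and keep": emit a warning but leave the column in place.
        warnings.warn(f"Column {column.name!r} contains only null values.")
        return column
    if on_null == "raise":
        raise ValueError(f"Column {column.name!r} contains only null values.")
    return None  # "drop": signal that the column should be removed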

@@ -778,11 +777,7 @@ def test_drop_null_column():

# Raise exception if a null column is found
with pytest.raises(
Contributor (Author):
This test is still failing because the TableVectorizer is not raising the correct exception, and I don't know how to make it do that.

Member:
Here, raise a ValueError instead of RejectColumn.

Member:
RejectColumn is a way to tell the TableVectorizer: "I'm not the right transformer for this column, don't apply me here."
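For illustration, a hedged sketch of that pattern (the import path and base class are assumptions about skrub internals, not documented public API; returning an empty list to drop the column follows the test shown earlier):

from skrub import _dataframe as sbd
# Assumed location of the internal helpers; this is not a public import path.
from skrub._on_each_column import RejectColumn, SingleColumnTransformer

class DropIfAllNull(SingleColumnTransformer):
    """Hypothetical single-column transformer: drop a column only if it is all null."""

    def fit_transform(self, column, y=None):
        if not sbd.is_all_null(column):
            # "I'm not the right transformer for this column": the
            # TableVectorizer catches this and passes the column on untouched.
            raise RejectColumn(f"Column {sbd.name(column)!r} is not all null.")
        return []  # an empty list of output columns drops the column

    def transform(self, column):
        return []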

Contributor (Author):
Gotcha, fixed

@rcap107 (Contributor, Author) commented Nov 8, 2024

I have updated the code so that "warn and keep" is the default behavior; I think that's the version that makes the most sense.

At the moment, the only problem I have is that I don't know how to raise the proper exception from TableVectorizer. In DropNull I am raising RejectColumn, but I don't know how to propagate it correctly.

@jeromedockes (Member):
I find it a bit weird to have a DropColumnIfNull that does not drop the column and just raises an exception. Maybe it should be named something like CheckNulls?

@jeromedockes (Member):
> Consider a situation where you have a TableVectorizer in a scikit-learn pipeline, trained in a production environment every week. If you built this pipeline assuming you would have access to some column "col_1", but for some reason, the data production system now only produces NaN values for this column

I'm not sure I understand. In any case, the TableVectorizer chooses the estimators during fit, so the schema of the output can change every week in this scenario: e.g., if a column has one more unique value than the previous week, it can switch from a one-hot encoding to a gap encoding. Or, for example, if you had high_cardinality="drop", those columns would also be dropped or kept depending on their content.

@jeromedockes (Member):
So if you want a consistent output schema (number of columns, names, and types) across trainings, the TableVectorizer is not what you want anyway.

@Vincent-Maladiere (Member):
> I find it a bit weird to have a DropColumnIfNull that does not drop the column and just raises an exception. Maybe it should be named something like CheckNulls?

Yes, I agree this sounds weird. I recognize that dropping NaN columns (which are usually useless) looks good. What about having the transformer drop by default, but allowing the user to pass other options as arguments to the TableVectorizer?

> I'm not sure I understand. In any case, the TableVectorizer chooses the estimators during fit [...]

Right, I was pointing out that even outside of the TableVectorizer (e.g., in the ColumnTransformer), the DropColumnIfNull could produce surprising mistakes.

It's true, though, that this issue broadly applies to the TableVectorizer, hence we should encourage people to perform checks in production (with dedicated check functions? There might be some potential here for production usage).

@rcap107 (Contributor, Author) commented Nov 18, 2024

Following today's discussion with @Vincent-Maladiere and @jeromedockes, I reverted the changes back to the old version (with the simple flag).

@jeromedockes (Member) left a comment

LGTM! Thanks again, @rcap107.

(The codecov report is bogus.)

I'll let @Vincent-Maladiere do a final review and merge.

@Vincent-Maladiere (Member) left a comment

In our IRL discussion, didn't we agree on a threshold for the null ratio in the column (1.0 by default)?
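For reference, a sketch of what such a ratio-based check might look like (hypothetical: the threshold was deferred, and all names here are illustrative):

import pandas as pd

def null_ratio(column: pd.Series) -> float:
    """Fraction of missing values in the column."""
    return float(column.isna().mean())

def should_drop(column: pd.Series, threshold: float = 1.0) -> bool:
    # With the default of 1.0, only fully-null columns are dropped;
    # a lower threshold would also drop mostly-null columns.
    return null_ratio(column) >= threshold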

Apart from that, and to keep in mind for later, the items we discussed were:

  1. The ability to freeze the TableVectorizer's column-to-transformer mapping. It would help to obtain consistent results when retraining in an automated environment, and to get more sensible errors to debug.
  2. Decoupling the checking/cleaning part of the TableVectorizer (which comes before the vectorizing part) so that it can be used as a standalone object anywhere.

@jeromedockes (Member):
> In our IRL discussion, didn't we agree on a threshold for the null ratio in the column (1.0 by default)?

We definitely did; I had understood it would be tackled in a separate PR.

+1 for points 1 and 2: let's open a separate issue for 1, and there is #925 for 2.

@rcap107 (Contributor, Author) commented Nov 18, 2024

I also understood that the threshold would be added in a separate issue.

@rcap107 closed this Nov 18, 2024
@rcap107 reopened this Nov 18, 2024
@Vincent-Maladiere (Member) left a comment

LGTM then, thanks @rcap107 :)

@jeromedockes merged commit 2cdf8ad into skrub-data:main Nov 18, 2024
38 checks passed
@jeromedockes (Member):
yay 🎉 !! thanks @rcap107 !

@rcap107 (Contributor, Author) commented Nov 18, 2024

🎉

Successfully merging this pull request may close these issues:

  • Add a DropEmpty flag to tabular_learner