Dataframe output: Column types depend on the value of default #138

Open
datajanko opened this issue Feb 5, 2018 · 11 comments

@datajanko

datajanko commented Feb 5, 2018

The dtype of every output column is the most general type that can hold the values of all output columns (typically object).

import pandas as pd
import sklearn_pandas as skp
from sklearn import preprocessing

check_df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1, 2], 'C': ['A', 'B']})
mapper_check = skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=False, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A    int64
dtype: object

Now use default=None:

mapper_check= skp.DataFrameMapper([('A', preprocessing.LabelBinarizer())], default=None, df_out=True)
mapper_check.fit_transform(check_df).dtypes
A    object
B    object
C    object
dtype: object

As we can see, setting default=None changes the dtype of column A. This happens because the stacked array can only have a single dtype.
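The effect can be reproduced with NumPy alone. This is a minimal sketch, assuming the mapper horizontally stacks the per-column results into one array; the `np.hstack` call is an illustration and not taken from the library's code.

```python
import numpy as np

# Column A after LabelBinarizer (integers) and the passthrough columns B and C.
a = np.array([[0], [1]], dtype=np.int64)
b = np.array([[1], [2]], dtype=np.int64)
c = np.array([["A"], ["B"]], dtype=object)

# Stacking only the integer column keeps the integer dtype.
only_a = np.hstack([a])
# Adding an object column forces one common dtype for the whole array.
stacked = np.hstack([a, b, c])

print(only_a.dtype)   # int64
print(stacked.dtype)  # object
```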

A fix would be to check first whether df_out is True and, if so, defer the construction of the stacked array.
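A rough sketch of that idea, written with plain pandas rather than the DataFrameMapper internals: keep each result as its own Series and concatenate column-wise, so that no shared array is ever built. The helper name `build_output` and the dict of per-column arrays are hypothetical.

```python
import numpy as np
import pandas as pd


def build_output(blocks, index=None):
    """Concatenate per-column results without stacking them into one array.

    `blocks` maps an output column name to a 1-D array; it is a
    hypothetical stand-in for the mapper's per-transformer results.
    """
    series = [
        pd.Series(values, index=index, name=name)
        for name, values in blocks.items()
    ]
    # Concatenating along axis=1 keeps the dtype of each Series.
    return pd.concat(series, axis=1)


blocks = {
    "A": np.array([0, 1], dtype=np.int64),
    "B": np.array([1, 2], dtype=np.int64),
    "C": np.array(["A", "B"], dtype=object),
}
out = build_output(blocks)
print(out.dtypes)
```

With this approach the integer columns A and B stay integer even though C holds strings.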

Edit: the issue as stated is not completely correct. I just built a dtype transformer to check: the mapper always chooses the dtype that can contain the dtypes of all the other columns.

@dukebody
Collaborator

Sorry for the big delay. Does this issue cause you trouble somewhere else? If you know how to fix this, can you submit a PR with the fix? Thanks!

@datajanko
Author

No problem. In pipelines, I had the problem that an estimator did not work with a column of object dtype containing floats/ints. This can happen if, for example, you keep an object column in the first step of a pipeline while the remaining columns should be floats/ints. All columns then end up with object dtype, which makes it necessary to append a 'to float/int' transformer at the end of the pipeline.
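Such a 'to float/int' step could look roughly like the following. It is a minimal sketch, assuming the mapper output is a DataFrame whose numeric columns were upcast to object; `restore_dtypes` is a hypothetical helper name, and the actual work is done by pandas' `DataFrame.infer_objects`.

```python
import pandas as pd


def restore_dtypes(df):
    """Re-infer the dtypes of object columns (hypothetical helper)."""
    return df.infer_objects()


# Simulated mapper output: every column has object dtype.
mapped = pd.DataFrame(
    {"A": [0, 1], "B": [1, 2], "C": ["A", "B"]}, dtype=object
)
restored = restore_dtypes(mapped)
print(restored.dtypes)
```

In a pipeline this could presumably be wrapped in scikit-learn's `FunctionTransformer` and appended as the last preprocessing step.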

I am not sure what a good implementation would look like. Structured NumPy arrays could be used, or everything could simply be converted to the most common dtype, but the latter does not seem optimal.
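To make the structured-array idea concrete, here is a small sketch using pandas' `DataFrame.to_records`, which returns a NumPy record array with one dtype per field. It only illustrates the data structure; whether downstream estimators would accept such an array is a separate question.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [1, 2], "C": ["A", "B"]})

# One field per column, each with its own dtype, instead of one shared dtype.
records = df.to_records(index=False)

print(records.dtype.names)
print(records["A"].dtype, records["B"].dtype)
```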

@hacktuarial
Contributor

I am working on a PR for this issue and will submit it this week.

@gennaro-tedesco

gennaro-tedesco commented Feb 13, 2020

I am seeing the same problem. Has this been solved? I see that the merge request was accepted, but I still have the issue described in the original post.

@hacktuarial
Contributor

What version of sklearn-pandas are you using?

@gennaro-tedesco

I am using sklearn-pandas==1.8.0.

@hacktuarial
Contributor

The following works for me, using sklearn-pandas==1.8.0 installed from PyPI. Can you please provide a code snippet to reproduce your issue?

import sklearn
import sklearn.preprocessing  # import the submodule explicitly
import pandas as pd
import numpy as np
import sklearn_pandas as skp

if __name__ == "__main__":
    sklearn.show_versions()
    check_df = pd.DataFrame({"A": [1.0, 2.0], "B": [1, 2], "C": ["A", "B"]})
    mapper_check = skp.DataFrameMapper(
        [("A", sklearn.preprocessing.LabelBinarizer())], default=None, df_out=True
    )
    actual = mapper_check.fit_transform(check_df).dtypes
    expected = pd.Series({"A": np.int64, "B": object, "C": object})
    assert (actual == expected).all()

Output:

System:
    python: 3.6.6 (default, Sep 20 2018, 23:47:57)  [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)]
executable: /Users/timothysweetser/.pyenv/versions/test/bin/python
   machine: Darwin-19.3.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 39.0.1
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: None
    joblib: 0.14.1

Built with OpenMP: True

@hacktuarial
Contributor

Ah, I see the problem now: column B should be integer type

@gennaro-tedesco

> Ah, I see the problem now: column B should be integer type

Yes, apparently it keeps the "old" type only when numeric and changes everything else to object. There is another GitHub issue describing exactly this behaviour (I don't have the link at hand). Another one that may be relevant is #171.

@kirel

kirel commented Oct 16, 2020

This is still an issue.

@ragrawal
Collaborator

ragrawal commented May 8, 2021

Sorry for the delay. Let me look into this.

@ragrawal ragrawal self-assigned this May 8, 2021
6 participants