This repository has been archived by the owner on Jul 27, 2024. It is now read-only.

ProtoFromDataFrames fails for dataframes with categorical columns #237

Open
ysayeed opened this issue Mar 10, 2021 · 6 comments

Comments

@ysayeed

ysayeed commented Mar 10, 2021

When creating the proto for facets-overview, the operation fails with an AttributeError if any of the columns are categorical. I would expect it to parse the dataframe properly, treating the category dtype as a string and displaying the column in the "Categorical Features" section in the same way as a string column.

Below is example code to produce this error and the traceback:

from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import pandas as pd
df = pd.DataFrame({'col1': pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])})
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames([{'name': 'test', 'table': df}])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 54, in ProtoFromDataFrames
    table_entries[col] = self.NdarrayToEntry(table[col])
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 119, in NdarrayToEntry
    data_type = self.DtypeToType(x.dtype)
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 66, in DtypeToType
    if dtype.char in np.typecodes['AllFloat']:
AttributeError: 'CategoricalDtype' object has no attribute 'char'

This is using facets-overview 1.0.0 and pandas 1.1.4.

@jameswex
Contributor

Yes, it looks like the facets overview code doesn't support the Categorical dtype. If you convert the column to a series of standard strings, the proto creation should work.
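A minimal sketch of that workaround, assuming the categories are already plain strings. The helper name `categoricals_to_object` is hypothetical and not part of facets-overview; it only rewrites categorical columns and leaves the rest of the dataframe alone.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def categoricals_to_object(frame):
    """Return a copy of `frame` with categorical columns stored as object dtype.

    Hypothetical helper for illustration; not part of facets-overview.
    """
    out = frame.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, CategoricalDtype):
            # astype(object) yields the underlying category values
            # (plain strings here) in a standard NumPy object column.
            out[col] = out[col].astype(object)
    return out


df = pd.DataFrame({"col1": pd.Categorical(["a", "b", "c", "a", "b", "c"])})
converted = categoricals_to_object(df)
```

The converted frame can then be passed to `ProtoFromDataFrames` in place of the original, for example as `[{'name': 'test', 'table': converted}]`.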

For this code to work on Categorical series out of the box, https://github.com/PAIR-code/facets/blob/master/facets_overview/python/base_generic_feature_statistics_generator.py#L69 would need to be updated to check for the Categorical dtype and return self.fs_proto.STRING in that case. That check has to come before the current checks that use dtype.char, since the Categorical dtype doesn't have a char attribute.
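The ordering described above can be sketched as a standalone function. This is not the actual `DtypeToType` source: the function name and the string labels are stand-ins for the real method and the `self.fs_proto` enum values, and only the `AllFloat` check is taken from the traceback. The integer branch and the string fallback are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def dtype_to_type_label(dtype):
    """Hypothetical stand-in for DtypeToType that returns labels, not proto enums."""
    # Pandas extension dtypes such as CategoricalDtype have no `.char`,
    # so they must be handled before any check that reads `dtype.char`.
    if isinstance(dtype, CategoricalDtype):
        return "STRING"
    if dtype.char in np.typecodes["AllFloat"]:
        return "FLOAT"
    # Assumed branch: the real method may treat integers differently.
    if dtype.char in np.typecodes["AllInteger"]:
        return "INT"
    # Assumed fallback for object and other dtypes.
    return "STRING"


categorical = pd.Series(pd.Categorical(["a", "b", "a"]))
label = dtype_to_type_label(categorical.dtype)
```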

@ysayeed
Author

ysayeed commented Mar 11, 2021

Thanks, that workaround solves things for me.

@hermanashley

hermanashley commented Oct 21, 2022

I am running into a similar error, but in my case it fails on string data.

File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 54, in ProtoFromDataFrames
    table_entries[col] = self.NdarrayToEntry(table[col])
  File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 119, in NdarrayToEntry
    data_type = self.DtypeToType(x.dtype)
  File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 66, in DtypeToType
    if dtype.char in np.typecodes['AllFloat']:
AttributeError: 'StringDtype' object has no attribute 'char'

This is using Python 3.8, pandas 1.4, and facets-overview 1.0.0.

Would appreciate some help!

@jameswex
Contributor

The facets code is quite old and doesn't support the newer StringDtype for string values. If you instead store the strings with the standard "object" dtype, the code should work.
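A minimal sketch of that conversion, assuming the affected columns use pandas' `StringDtype`. The helper name `string_columns_to_object` is hypothetical and not part of facets-overview.

```python
import pandas as pd


def string_columns_to_object(frame):
    """Return a copy of `frame` with StringDtype columns stored as object dtype.

    Hypothetical helper for illustration; not part of facets-overview.
    """
    out = frame.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.StringDtype):
            # object is a plain NumPy dtype, so `dtype.char` is available.
            out[col] = out[col].astype(object)
    return out


df = pd.DataFrame({"name": pd.array(["x", "y", "z"], dtype="string")})
converted = string_columns_to_object(df)
```

Missing values in such columns may still be pandas NA markers after the conversion, so they might need separate handling before the frame is passed to facets.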

@hermanashley

@jameswex Thank you! It turned out I had to convert the Int64Dtype columns as well. Possibly this belongs in another thread, but I am seeing a new error after doing the type conversion:

    proto_str = GenericFeatureStatisticsGenerator().ProtoFromDataFrames(dfs).SerializeToString()
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 60, in ProtoFromDataFrames
    return self.GetDatasetsProto(
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 284, in GetDatasetsProto
    sample_count=np.asscalar(val[0]),
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib64/python3.8/site-packages/numpy/__init__.py", line 311, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'asscalar'

Any insight?

@jameswex
Contributor

I believe it has to do with your numpy version. See https://numpy.org/doc/1.21/reference/generated/numpy.asscalar.html

You can either downgrade numpy to a version that still provides asscalar, or update the facets code to use the replacement recommended on that page (a.item()).
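For reference, the linked page marks `np.asscalar` as deprecated since NumPy 1.16 in favour of `a.item()`, and as far as I can tell the function was removed in NumPy 1.23. A minimal sketch of both options follows. The array `val` is made-up data standing in for the value used at the failing line, and the shim at the end is a stopgap assumption for cases where the installed facets code cannot be edited.

```python
import numpy as np

# Made-up stand-in for the `val` used at the failing line.
val = np.array([6, 3])

# Option 1: in the facets source, replace np.asscalar(val[0]) with .item(),
# which returns the equivalent built-in Python scalar.
sample_count = val[0].item()

# Option 2 (stopgap): restore the missing name before calling facets.
# This patches the numpy module at runtime, so prefer option 1 where possible.
if not hasattr(np, "asscalar"):
    np.asscalar = lambda a: a.item()

shimmed_count = np.asscalar(val[0])
```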
