Fitting with numerical column names fails #328

Open
pvk-developer opened this issue Nov 15, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@pvk-developer
Member

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • RDT version: 0.6.1
  • Python version: 3.8
  • Operating System: Ubuntu

Error Description

Fitting any transformer fails when the pd.DataFrame has numerical column names, for example a default RangeIndex or an integer used as a column name.

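This bug produces one of two errors, depending on how the columns are passed:

  1. Multiple columns (passing data.columns): TypeError: unhashable type: 'RangeIndex'
  2. Single column (passing 0): TypeError: sequence item 0: expected str instance, int found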

Steps to reproduce

Multiple columns

import pandas as pd
from rdt.transformers import OneHotEncoder

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

ohe = OneHotEncoder()
ohe.fit(data, data.columns)

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-9be2b41b4858> in <module>
----> 1 ohe.fit(data, data.columns)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
    163                 Column names. Must be present in the data.
    164         """
--> 165         self._store_columns(columns, data)
    166 
    167         columns_data = self._get_columns_data(data, self.columns)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _store_columns(self, columns, data)
    112             columns = [columns]
    113 
--> 114         missing = set(columns) - set(data.columns)
    115         if missing:
    116             raise KeyError(f'Columns {missing} were not present in the data.')

~/.virtualenvs/RDT/lib/python3.9/site-packages/pandas/core/indexes/base.py in __hash__(self)
   4076 
   4077     def __hash__(self):
-> 4078         raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
   4079 
   4080     def __setitem__(self, key, value):

TypeError: unhashable type: 'RangeIndex'
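The traceback above suggests that the RangeIndex is wrapped in a list as a single element before set() is called, so Python tries to hash the Index object itself. The sketch below reproduces that with pandas alone. The isinstance check is an assumption about what precedes line 112 of base.py (the traceback only shows `columns = [columns]`), and the names `wrapped`, `labels`, and `error_message` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])
columns = data.columns  # a RangeIndex covering 0, 1, 2

# Assumed shape of the check before base.py line 112: anything that is
# not a list gets wrapped, so the whole Index becomes one list element.
wrapped = columns if isinstance(columns, list) else [columns]

try:
    set(wrapped)  # tries to hash the Index object itself
    error_message = None
except TypeError as error:
    error_message = str(error)

# Expanding the Index into its individual labels first avoids the error.
labels = list(columns)
missing = set(labels) - set(data.columns)
```

Under that assumption, accepting any non-string iterable (not only list) and expanding it with list() might be enough for the multiple-column case. This has not been checked against the RDT code base.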

Using a single column

ohe.fit(data, 0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-5d4d0160e7be> in <module>
----> 1 ohe.fit(data, 0)

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in fit(self, data, columns)
    168         self._fit(columns_data)
    169 
--> 170         self._build_output_columns(data)
    171 
    172     def _transform(self, columns_data):

~/Projects/sdv-dev/RDT/rdt/transformers/base.py in _build_output_columns(self, data)
    136 
    137     def _build_output_columns(self, data):
--> 138         self.column_prefix = '#'.join(self.columns)
    139         self.output_columns = list(self.get_output_types().keys())
    140 

TypeError: sequence item 0: expected str instance, int found
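The failing call can be reproduced without RDT, since str.join only accepts strings. A minimal sketch follows; casting each label with str() is one possible approach, not necessarily the fix the maintainers would choose.

```python
columns = [0]  # a single numeric column name, as in ohe.fit(data, 0)

try:
    '#'.join(columns)  # str.join rejects non-string items
    error_message = None
except TypeError as error:
    error_message = str(error)

# Casting each label to str first lets the join succeed.
column_prefix = '#'.join(str(column) for column in columns)
```

One caveat with this approach: the integer 0 and the string '0' would produce the same prefix, so a frame containing both labels would need extra handling.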

Notes

These errors are raised in _store_columns when multiple columns are passed and in _build_output_columns when a single column is passed.
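Until this is fixed, a possible user-side workaround is to give the DataFrame string column names before fitting. The sketch below only shows the renaming step with pandas. Whether every transformer then fits correctly is an untested assumption, and the names `renamed` and `original_labels` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']])

# Work on a copy so the caller's frame keeps its original labels.
renamed = data.copy()
renamed.columns = [str(column) for column in data.columns]

# Mapping from the string names back to the original labels, useful
# for restoring the numeric names after a reverse transform.
original_labels = dict(zip(renamed.columns, data.columns))
```

The renamed frame can then be passed to fit together with list(renamed.columns) instead of the original data and data.columns.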

pvk-developer added the bug and pending review labels on Nov 15, 2021
@npatki
Contributor

npatki commented Jun 10, 2022

I can confirm that this issue still persists in RDT 1.0:

import pandas as pd
from rdt import HyperTransformer

data = pd.DataFrame([
    ['a', 'b', 'c'],
    ['d', 'e', 'f']
])

ht = HyperTransformer()
ht.detect_initial_config(data)
ht.fit_transform(data)

Output:

TypeError: sequence item 0: expected str instance, int found
