Error using TestSuites for numerical data #1122

Open
jeric250 opened this issue May 23, 2024 · 2 comments

Comments

@jeric250

Hi there, first time opening an issue so bear with me (and let me know if more info is needed).

Basic information:
Package version used: 0.4.20
Operating system and version: macOS VSCode
Programming language and version used: Python 3.12.2

Code snippet:

from evidently.calculations.stattests import StatTest
from evidently.test_suite import TestSuite
from evidently.tests import *

data_drift_dataset_tests = TestSuite(tests=[
    TestShareOfDriftedColumns(stattest='psi'),
])

# ref_df: represents reference pandas DataFrame data (only numerical features)
# curr_df: represents current pandas DataFrame data (only numerical features)
data_drift_dataset_tests.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_tests

The above code is based on Evidently documentation: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb

Error message:
(screenshot of the error message; image not available in this text)

The snippet above is run on a pandas DataFrame that contains only numerical columns (dtypes 'float64' and 'int64'). When I run the exact same code on only categorical columns (dtypes 'object' and 'category'), it works and a report is generated.

I checked whether the numerical data contains any unexpected values, and that does not seem to be the case. For example, to find records with non-numeric values:
ref_df[~ref_df.applymap(np.isreal).all(1)]
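A complementary check, sketched here under assumptions, is to look at the stored dtype of each column and list the raw values that `pd.to_numeric` cannot parse. The helper name `non_numeric_report` and the example frame are hypothetical and are not part of the original report; the sketch only shows one way to surface string entries hiding in a column that looks numeric.

```python
import pandas as pd


def non_numeric_report(df: pd.DataFrame) -> dict:
    """Map each column to the raw values that pd.to_numeric cannot parse."""
    report = {}
    for col in df.columns:
        coerced = pd.to_numeric(df[col], errors="coerce")
        # A value that was present but became NaN after coercion is not numeric.
        bad = df.loc[coerced.isna() & df[col].notna(), col]
        if not bad.empty:
            report[col] = bad.tolist()
    return report


# Hypothetical frame: AGE looks numeric, but one entry is a string.
example = pd.DataFrame(
    {"AGE": [32, 40, "unknown"], "INCOME": [1000.0, 2500.5, 1800.0]}
)

print(example.dtypes)               # the mixed AGE column is stored as object
print(non_numeric_report(example))  # lists the offending values per column
```

Running the same helper on both `ref_df` and `curr_df` would show whether either frame carries string values in a column expected to be numeric.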

What am I missing? Any advice?

@elenasamuylova
Collaborator

Hi @jeric250, could you try to run pd.to_numeric on your input columns?

@jeric250
Author

jeric250 commented May 23, 2024

Thanks @elenasamuylova for responding so quickly. I forgot to mention that I did try pd.to_numeric as well, with something like:
ref_df = ref_df.apply(pd.to_numeric, errors='coerce')
However, the same error still occurred. There are also no null values in the dataset.
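The line above converts only `ref_df`; the thread does not say whether `curr_df` was converted too. As a minimal sketch, assuming both frames share the same columns, the conversion could be applied to both and the dtypes compared before and after. The helper `coerce_both` and the example frames are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def coerce_both(ref_df: pd.DataFrame, curr_df: pd.DataFrame):
    """Apply pd.to_numeric column by column to both frames."""
    ref_num = ref_df.apply(pd.to_numeric, errors="coerce")
    curr_num = curr_df.apply(pd.to_numeric, errors="coerce")
    return ref_num, curr_num


# Hypothetical data: the reference AGE is integer, the current AGE holds strings.
ref_example = pd.DataFrame({"AGE": [32, 40, 51]})
curr_example = pd.DataFrame({"AGE": ["35", "41", "29"]})

# Side-by-side view of the dtypes before conversion.
dtypes_before = pd.DataFrame(
    {
        "reference": ref_example.dtypes.astype(str),
        "current": curr_example.dtypes.astype(str),
    }
)
print(dtypes_before)

ref_num, curr_num = coerce_both(ref_example, curr_example)

# After conversion every column in both frames should have a numeric dtype.
all_numeric = all(
    is_numeric_dtype(frame[col])
    for frame in (ref_num, curr_num)
    for col in frame.columns
)
print(all_numeric)
```

Note that `errors="coerce"` turns unparseable values into NaN, so a null check is more informative after the conversion than before it.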

When I tested a single numerical column, I got the same error.

from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, ColumnValuePlot

# test on the AGE column, which holds the age of people (e.g. 32, 40)
data_drift_column_report = Report(metrics=[
    ColumnDriftMetric('AGE'),
    ColumnValuePlot('AGE'),
])

data_drift_column_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_column_report

Error:
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U14'), dtype('float64')) -> None
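For readers unfamiliar with this message: in NumPy, `<U14` denotes a Unicode string dtype holding up to 14 characters, so the error says that a string value was multiplied by a float64. The sketch below reproduces that class of error with NumPy alone. It is an illustration of what the message means, not a claim about where inside Evidently the multiplication happens; the variable names are hypothetical.

```python
import numpy as np

# A one-element array of 14-character strings; NumPy stores it with a
# Unicode string dtype (kind 'U').
text_values = np.array(["x" * 14])
weight = np.float64(0.5)

try:
    # NumPy has no multiply loop for (string, float64), so this raises.
    np.multiply(text_values, weight)
    raised = None
except TypeError as exc:  # UFuncTypeError is a subclass of TypeError
    raised = exc

print(text_values.dtype)
print(type(raised).__name__)
print(raised)
```

If this is what is happening, some value reaching the calculation is a string of that length rather than a number, which may be worth checking in the data, the column names, or the parameters passed in.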

Same error when I tried DataDriftTable:

from evidently.report import Report
from evidently.metrics import DataDriftTable

data_drift_dataset_report = Report(metrics=[
    DataDriftTable(num_stattest='wasserstein', cat_stattest='psi'),
])

data_drift_dataset_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_report

When I limit DataDriftTable to just categorical columns, it works fine with a report generated.
