contamination / information leakage between training and testing data #2

@makesourcenotcode

Description

I noticed that README.md and train.php both make the same mistake:

Namely, ZScaleStandardizer is applied BEFORE the train/test split rather than AFTER. This leaks information from the test set into training right from the start and calls into question all the metrics reported at the end.

The correct approach would be to fit ZScaleStandardizer on the training set only, capture its parameters, and apply those same parameters to the testing set before using the trained model to make predictions.
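Since the project itself is PHP, here is just a language-agnostic sketch (in Python, with hypothetical helper names `zscale_fit` / `zscale_apply`) of the pattern being described: split first, compute the mean and standard deviation from the training rows only, then reuse those captured parameters on the test rows.

```python
import random

def zscale_fit(rows):
    # Compute per-feature mean and std from the TRAINING data only.
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [
        (sum((r[d] - means[d]) ** 2 for r in rows) / n) ** 0.5 or 1.0
        for d in range(dims)
    ]
    return means, stds

def zscale_apply(rows, means, stds):
    # Reuse the captured training parameters on any other split.
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

# Split FIRST, then fit the standardizer on the training portion only.
data = [[float(i), float(2 * i)] for i in range(10)]
random.seed(0)
random.shuffle(data)
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

means, stds = zscale_fit(train)
train_std = zscale_apply(train, means, stds)
test_std = zscale_apply(test, means, stds)  # no test statistics leak in
```

The point is that `zscale_fit` never sees the test rows, so no statistic derived from the test set can influence training or the final evaluation.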

This can mislead newer users studying machine learning into bad habits that will later need to be unlearned.
