Add catch to prevent data leakage when resampling data [feature/124/augmentation] #179

Closed
tallamjr opened this issue Jun 21, 2019 · 1 comment
Assignees
Labels
enhancement Improvement to existing functionality or implementation, including adding a new functions/methods. pre-v2.0.0 Issues that should be completed prior to public release of v2.0.0

Comments

@tallamjr
Collaborator

We need to ensure that when resampling is done via SMOTE or other techniques, there is no risk of data leaking into the test set, so that when the models are evaluated they are not tested on examples that also exist in the training set.

This can be done by working on a copy of the original data, and perhaps by checking whether augmentation has already occurred elsewhere in the pipeline.
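A minimal sketch of the idea, not snmachine code: hold out the test set first, then resample only a copy of the training portion, and check for overlap before evaluating. Plain random oversampling stands in for SMOTE here to keep the example dependency-free, and the names `split_then_oversample` and `assert_no_leakage` are hypothetical. The check assumes the original samples are unique and hashable.

```python
import random


def split_then_oversample(samples, labels, test_fraction=0.25, seed=0):
    """Hold out a test set first, then oversample only the training part."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Build new lists so the original data is never modified in place.
    train = [(tuple(samples[i]), labels[i]) for i in train_idx]
    test = [(tuple(samples[i]), labels[i]) for i in test_idx]

    # Random oversampling (a stand-in for SMOTE): duplicate training
    # examples of the smaller classes until every class present in the
    # training set matches the size of the largest one.
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(group) for group in by_class.values())
    augmented = list(train)
    for group in by_class.values():
        augmented.extend(rng.choice(group) for _ in range(target - len(group)))
    return augmented, test


def assert_no_leakage(train, test):
    """Raise if any test example also appears in the training set."""
    overlap = {x for x, _ in train} & {x for x, _ in test}
    if overlap:
        raise ValueError(
            f"{len(overlap)} test example(s) also appear in the training data"
        )


# Toy imbalanced data set: 4 minority-class examples out of 20.
samples = [(i, 2 * i) for i in range(20)]
labels = [1 if i < 4 else 0 for i in range(20)]

train, test = split_then_oversample(samples, labels)
assert_no_leakage(train, test)  # passes: resampling only touched the training set
```

Because the split happens before any resampling, duplicated (or, with SMOTE, synthesised) examples can only ever be derived from training data, and the explicit check catches the case where augmentation was applied earlier in the pipeline to the full data set.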

@tallamjr tallamjr added enhancement Improvement to existing functionality or implementation, including adding a new functions/methods. pre-v2.0.0 Issues that should be completed prior to public release of v2.0.0 labels Jun 21, 2019
@tallamjr tallamjr self-assigned this Jun 21, 2019
@Catarina-Alves Catarina-Alves self-assigned this May 13, 2020
@Catarina-Alves
Collaborator

This issue is no longer relevant for the current version of snmachine. However, it must be taken into account when addressing issue #246.
