
New methods #105

Description

glemaitre (Member)

This is a non-exhaustive list of methods that could be added for the next release.

Oversampling:

Prototype Generation/Selection:

  • Steady State Memetic Algorithm (SSMA)
  • Adaptive Self-Generating Prototypes (ASGP)

Ensemble:

Regression:

  • SMOTE for regression

P. Branco, L. Torgo and R. Ribeiro (2016). "A Survey of Predictive Modeling on Imbalanced Domains." ACM Comput. Surv. 49, 2, 31. DOI: http://dx.doi.org/10.1145/2907070

P. Branco, L. Torgo and R.P. Ribeiro (2017). "Pre-processing Approaches for Imbalanced Distributions in Regression." Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal (submitted).

Activity

glemaitre (Member, Author) commented on Jul 21, 2016

@dvro @chkoar you can add anything there. We can make a PR to add this stuff to the todo list.

We should also discuss where these methods will be added (under-/over-sampling or a new module).

chkoar (Member) commented on Jul 21, 2016

SGP should be placed in a new module/package, as in scikit-protopy. "generation" is a reasonable name for this kind of algorithm.

glemaitre (Member, Author) commented on Jul 21, 2016

@chkoar What would be the reason to disassociate over-sampling and generation?

chkoar (Member) commented on Jul 21, 2016

Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.

dvro (Member) commented on Jul 21, 2016

@glemaitre actually, oversampling is different from prototype generation:

Prototype Selection:
given a set of samples S, a PS method selects a subset S' ⊆ S, where |S'| < |S|
Prototype Generation:
given a set of samples S, a PG method generates a new set S' (not necessarily a subset of S), where |S'| < |S|
Oversampling:
given a set of samples S, an OS method generates a new set S', where |S'| > |S| and S ⊆ S'
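
For concreteness, here is a minimal sketch (mine, not from the thread) that illustrates these cardinality relations with estimators already in imbalanced-learn, using RandomOverSampler as an OS method and CondensedNearestNeighbour as a PS method:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler            # OS: |S'| > |S|, S ⊆ S'
from imblearn.under_sampling import CondensedNearestNeighbour   # PS: S' ⊆ S, |S'| < |S|

# toy imbalanced dataset
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X, y)
X_ps, y_ps = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

print(Counter(y))     # original class counts
print(Counter(y_os))  # oversampling: the set grows and keeps every original sample
print(Counter(y_ps))  # prototype selection: only a subset of the original samples is kept
```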

chkoar (Member) commented on Jul 21, 2016

Thanks for the clarification @dvro. That could be placed in the wiki!

added this to the 0.2.alpha milestone on Jul 27, 2016
changed the title from "New methods for Release 0.2" to "New methods" on Aug 31, 2016
modified the milestones: 0.2.alpha, 0.3.alpha on Aug 31, 2016

45 remaining items

beeb commented on Jul 31, 2020

> Can you share a link to the code?

Here is the code from the original paper, which is also what I took as inspiration for my modified implementation: https://rdrr.io/cran/UBL/man/smoteRegress.html

beeb commented on Jul 31, 2020

> @beeb actually they call it imbalanced regression, but in my view it is not. They call the whole thing utility-based learning, and the key point is the utility function that is used, right? In any case, you can draft an implementation and we can talk about it.

I'm not sure what you are saying. It's SMOTE, but they use a function to determine whether a data point is common or "rare" depending on how far it falls from the mean of the distribution (roughly: I used the extremes of the box-plot whiskers as the inflection points of a CubicHermiteSpline that defines "rarity"; I think the original code does this as well). The rare points are then oversampled by selecting a random nearest neighbour and computing the new sample in between, just like SMOTE. The difference is that the label value of the new point is a weighted average of the labels of the two parents.
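
For reference, here is a rough, simplified sketch of the interpolation step described above. It is my reading of the idea, not the actual implementation under discussion: the function name is made up, the relevance spline is replaced by a plain box-plot-whisker cutoff, and scikit-learn's NearestNeighbors is used for the neighbour search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_regression(X, y, k=5, n_new=100, random_state=0):
    """Oversample 'rare' targets by SMOTE-style interpolation between rare samples."""
    rng = np.random.default_rng(random_state)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    # "rare" = targets outside the box-plot whiskers (a simplification of the
    # relevance function described above)
    rare = np.where((y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr))[0]
    if len(rare) < 2:
        return X, y  # nothing rare enough to oversample

    nn = NearestNeighbors(n_neighbors=min(k + 1, len(rare))).fit(X[rare])
    _, idx = nn.kneighbors(X[rare])  # neighbour indices within the rare subset

    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(rare))             # pick a rare seed sample
        j = rng.choice(idx[i][1:])              # one of its rare neighbours (skip itself)
        a, b = rare[i], rare[j]
        lam = rng.random()
        x_new = X[a] + lam * (X[b] - X[a])      # SMOTE-style interpolation
        # new target = average of the parents' targets, weighted by how close
        # the synthetic point lies to each parent
        d_a, d_b = np.linalg.norm(x_new - X[a]), np.linalg.norm(x_new - X[b])
        w = 0.5 if d_a + d_b == 0 else d_b / (d_a + d_b)
        new_X.append(x_new)
        new_y.append(w * y[a] + (1 - w) * y[b])

    return np.vstack([X, np.array(new_X)]), np.concatenate([y, new_y])
```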

chkoar (Member) commented on Jul 31, 2020

@beeb yeap, I have read all their related work. Since they involve that utility function, to me it is not imbalanced regression but something like cost-sensitive/interested regression. My personal opinion aside, I think this method still falls within the scope of the package, so I would love to see it implemented in imbalanced-learn. Please open a PR when you have time; it will be much appreciated.

zoj613 commented on Jan 31, 2021

Is there any interest from the maintainers in adding Localized Random Affine Shadowsampling (LoRAS)?

To quote from the paper's abstract:

We observed that LoRAS, on average generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested, improve the F1-Score with respect to SMOTE on an average, they compromise on the Balanced accuracy of a classification model. LoRAS on the contrary, improves both F1 Score and the Balanced accuracy thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.

If there is interest in including it in the library, then I can prepare a PR.

Reference:
Bej, S., Davtyan, N., Wolfien, M. et al. LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 110, 279–301 (2021). https://doi.org/10.1007/s10994-020-05913-4

Sandy4321 commented on Feb 1, 2021

zoj613 commented on Feb 1, 2021

Sandy4321 commented on Feb 1, 2021

zoj613 commented on Feb 2, 2021
hayesall (Member) commented on Feb 3, 2021

Hey @zoj613 and @Sandy4321, please keep the discussion focused; it creates a lot of noise otherwise.

@zoj613 I'm -1 on including it right now.

We loosely follow scikit-learn's rule of thumb to keep the maintenance burden down. Methods should be roughly 3 years old and have 200+ citations.

zoj613 commented on Feb 4, 2021

> Hey @zoj613 and @Sandy4321, please keep the discussion focused; it creates a lot of noise otherwise.
>
> @zoj613 I'm -1 on including it right now.
>
> We loosely follow scikit-learn's rule of thumb to keep the maintenance burden down. Methods should be roughly 3 years old and have 200+ citations.

Fair enough. Keeping to the topic at hand, I submitted a PR at #789 implementing SMOTE-RSB from the checklist in the OP.

glemaitre (Member, Author) commented on Feb 18, 2021

I think that we should prioritize the SMOTE variants that we want to include.
We could reuse the benchmark proposed there: analyticalmindsltd/smote_variants#14 (comment)

Basically, we could propose to implement the following:

  • polynom-fit-SMOTE
  • ProWSyn
  • SMOTE-IPF
  • Lee
  • SMOBD
  • G-SMOTE

Currently, we have SVM-, KMeans-, and KNN-based SMOTE for historical rather than performance reasons.

I think that we should probably make an effort regarding the documentation. Currently, we show the differences in how the methods sample (this is already a good point). However, I think we should have a clearer guideline on which SMOTE variant works best for which applications. What I mean is that SMOTE, SMOTENC, and SMOTEN might already cover a good basis.

BradKML

BradKML commented on Sep 7, 2022

@BradKML

@glemaitre are there any standard APIs to follow for the SMOTE variants?

glemaitre (Member, Author) commented on Sep 7, 2022

Whenever possible it should inherit from SMOTE.
You can check the current code hierarchy that we have for SMOTE.
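
As a rough sketch of what that could look like (my assumption of the usual imbalanced-learn sampler API, where the public fit_resample delegates to an internal _fit_resample hook; not an official template):

```python
from imblearn.over_sampling import SMOTE


class MySMOTEVariant(SMOTE):
    """Hypothetical SMOTE variant; the name and behaviour are illustrative only."""

    def _fit_resample(self, X, y):
        # A real variant would change how the synthetic samples are generated;
        # delegating to plain SMOTE keeps this sketch runnable.
        return super()._fit_resample(X, y)
```

It would then be used like any other sampler, e.g. `MySMOTEVariant(random_state=0).fit_resample(X, y)`.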


Labels: Status: Help Wanted, Type: Enhancement


Participants: @dvro, @beeb, @lrq3000, @souravsingh, @massich
