Labels: Status: Help Wanted · Type: Enhancement
Description
This is a non-exhaustive list of the methods that could be added for the next release.
Oversampling:
- SPIDER
- MWMOTE
- SMOTE-SL
- SMOTE-RSB
- SMOTE-NC
- Random-SMOTE (#105 comment)
- Cluster Based Oversampling (#105 comment)
- Supervised Over-Sampling (#105 comment)
Prototype Generation/Selection:
- Steady State Memetic Algorithm (SSMA)
- Adaptive Self-Generating Prototypes (ASGP)
Ensemble
- Under-Over-Bagging (FEA: allow any resampler in the BalancedBaggingClassifier, #808)
- RUS-Boost
- SMOTE-Boost
- RAMO-Boost
- EUS-Boost
Regression
- SMOTE for regression
P. Branco, L. Torgo and R. Ribeiro (2016). A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 49, 2, 31. DOI: http://dx.doi.org/10.1145/2907070
P. Branco, L. Torgo and R.P. Ribeiro (2017). "Pre-processing Approaches for Imbalanced Distributions in Regression." Special Issue on Learning in the Presence of Class Imbalance and Concept Drift, Neurocomputing Journal (submitted).
glemaitre commented on Jul 21, 2016
@dvro @chkoar you can add anything there. We can make a PR to add these items to the todo list.
We should also discuss where these methods will be added (under-/over-sampling or a new module).
chkoar commented on Jul 21, 2016
SGP should be placed in a new module/package, as in scikit-protopy. "generation" is a reasonable name for this kind of algorithm.
glemaitre commented on Jul 21, 2016
@chkoar What would be the reason to dissociate "over-sampling" and "generation"?
chkoar commented on Jul 21, 2016
Actually none. Just for semantic reasons. Obviously, prototype generation methods could be considered as over-sampling methods.
dvro commented on Jul 21, 2016
@glemaitre actually, oversampling is different from prototype generation:
Prototype Selection: given a set of samples S, a PS method selects a subset S', where S' ⊆ S and |S'| < |S|.
Prototype Generation: given a set of samples S, a PG method generates a new set S', where |S'| < |S|.
Oversampling: given a set of samples S, an OS method generates a new set S', where |S'| > |S| and S ⊆ S'.
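The three definitions above can be illustrated with a small numpy sketch. The arrays, the centroid-based "generation" step, and the single interpolated point are illustrative choices only, not any particular published method:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(10, 2))  # original sample set S, |S| = 10

# Prototype selection: S' is a strict subset of S (rows drawn from S itself).
sel_idx = rng.choice(len(S), size=5, replace=False)
S_sel = S[sel_idx]

# Prototype generation: S' is a smaller set of NEW points (here, two
# centroids), not necessarily contained in S.
S_gen = np.stack([S[:5].mean(axis=0), S[5:].mean(axis=0)])

# Oversampling: S' contains all of S plus synthetic points (SMOTE-style
# interpolation between a point and a neighbour).
a, b = S[0], S[1]
synthetic = a + rng.uniform() * (b - a)
S_over = np.vstack([S, synthetic])

assert len(S_sel) < len(S)   # |S'| < |S| and S' ⊆ S
assert len(S_gen) < len(S)   # |S'| < |S|
assert len(S_over) > len(S)  # |S'| > |S| and S ⊆ S'
```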
chkoar commented on Jul 21, 2016
Thanks for the clarification @dvro. That could be placed in the wiki!
[Issue renamed from "New methods for Release 0.2" to "New methods"; 45 items collapsed]
beeb commented on Jul 31, 2020
Here is the code from the original paper, which also served as inspiration for my modified implementation: https://rdrr.io/cran/UBL/man/smoteRegress.html
beeb commented on Jul 31, 2020
I'm not sure what you are saying. It's SMOTE, but they use a function to determine whether a data point is common or "rare" depending on how far it falls from the mean of the distribution (roughly: I used the extremes of the box-plot whiskers as the inflection points for a CubicHermiteSpline that defines "rarity"; I think they also do this in the original code). They then oversample the rare points by selecting a random nearest neighbour and computing the new sample in between (just like SMOTE); the difference is that the label value for the new point is a weighted average of the labels of the two parents.
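The synthetic-sample step described above can be sketched as follows. This is a hypothetical helper illustrating the distance-weighted target averaging from the comment, not the UBL implementation:

```python
import numpy as np

def smote_regress_sample(x_rare, y_rare, x_nn, y_nn, rng):
    """Create one synthetic sample between a rare point and a neighbour.

    Feature interpolation is standard SMOTE; the target is a
    distance-weighted average of the two parents' targets.
    """
    frac = rng.uniform()                     # position along the segment
    x_new = x_rare + frac * (x_nn - x_rare)  # SMOTE-style interpolation
    d_rare = np.linalg.norm(x_new - x_rare)
    d_nn = np.linalg.norm(x_new - x_nn)
    if d_rare + d_nn == 0:                   # identical parents
        return x_new, (y_rare + y_nn) / 2
    w_rare = d_nn / (d_rare + d_nn)          # closer parent weighs more
    return x_new, w_rare * y_rare + (1 - w_rare) * y_nn

rng = np.random.default_rng(42)
x_new, y_new = smote_regress_sample(
    np.array([0.0, 0.0]), 10.0, np.array([1.0, 1.0]), 20.0, rng
)
# y_new lies between the two parent targets, weighted by distance
assert 10.0 <= y_new <= 20.0
```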
chkoar commented on Jul 31, 2020
@beeb yeap, I have read all their related work. Since they involve that utility function, to me it is not imbalanced regression but something like cost-sensitive/interest-based regression. Apart from my personal opinion, I think this method still remains within the scope of the package, so I would love to see it implemented in imbalanced-learn. Please open a PR when you have time; it will be much appreciated.
zoj613 commented on Jan 31, 2021
Is there any interest from the maintainers in adding Localized Random Affine Shadowsampling (LoRAS)?
To quote from the paper's abstract:
If there is interest in inclusion to the library, then I can prepare a PR.
Reference:
Bej, S., Davtyan, N., Wolfien, M. et al. LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 110, 279–301 (2021). https://doi.org/10.1007/s10994-020-05913-4
Sandy4321 commented on Feb 1, 2021
zoj613 commented on Feb 1, 2021
Sandy4321 commented on Feb 1, 2021
zoj613 commented on Feb 2, 2021
hayesall commented on Feb 3, 2021
Hey @zoj613 and @Sandy4321, please keep the discussion focused; it creates a lot of noise otherwise.
@zoj613 I'm -1 on including it right now.
We loosely follow scikit-learn's rule of thumb to keep the maintenance burden down: methods should be roughly 3 years old and have 200+ citations.
zoj613 commented on Feb 4, 2021
Fair enough. Keeping to the topic at hand, I submitted a PR at #789 implementing SMOTE-RSB from the checklist in the OP.
glemaitre commented on Feb 18, 2021
I think that we should prioritize the SMOTE variants that we want to include.
We could reuse the benchmark proposed there: analyticalmindsltd/smote_variants#14 (comment)
Basically, we could propose to implement the following:
- polynom-fit-SMOTE
- ProWSyn
- SMOTE-IPF
- Lee
- SMOBD
- G-SMOTE
Currently, we have SVM-, KMeans-, and KNN-based SMOTE for historical reasons rather than performance reasons.
I think we should also make an effort regarding the documentation. Currently, we show the differences in how the methods sample (which is already a good point). However, I think we should have a clearer guideline on which SMOTE variant works best for which applications. What I mean is that SMOTE, SMOTENC, and SMOTEN might already cover a good basis.
BradKML commented on Sep 7, 2022
@glemaitre are there any standard APIs to follow for the SMOTE variants?
glemaitre commented on Sep 7, 2022
Whenever possible it should inherit from SMOTE.
You can check the current code hierarchy that we have for SMOTE.