From 9d50a137149856142a3079c38a9e819504d707b4 Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 14:44:19 +0200 Subject: [PATCH 1/3] reword nearmiss --- doc/under_sampling.rst | 89 ++++++++++++++++++++++++++---------------- 1 file changed, 55 insertions(+), 34 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index 9f2795430..efdb288b3 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -125,9 +125,20 @@ It would also work with pandas dataframe:: >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult) >>> df_resampled.head() # doctest: +SKIP -:class:`NearMiss` adds some heuristic rules to select samples -:cite:`mani2003knn`. :class:`NearMiss` implements 3 different types of -heuristic which can be selected with the parameter ``version``:: +NearMiss +^^^^^^^^ + +:class:`NearMiss` is another controlled under-sampling technique. It aims to balance +the class distribution by eliminating samples from the targeted classes. But these +samples are not removed at random. Instead, :class:`NearMiss` removes instances of the +target class(es) that increase the "space" or separation between the target class and +the minority class. In other words, :class:`NearMiss` removes observations from the +target class that are closer to the boundary they form with the minority class samples. + +To find out which samples are closer to the boundary with the minority class, +:class:`NearMiss` uses the K-Nearest Neighbour algorithm. :class:`NearMiss` implements +3 different heuristics, which we can be selected with the parameter ``version`` and we +will explain in the coming paragraphs. 
We can perform this undersampling as follows:: >>> from imblearn.under_sampling import NearMiss >>> nm1 = NearMiss(version=1) @@ -135,65 +146,75 @@ heuristic which can be selected with the parameter ``version``:: >>> print(sorted(Counter(y_resampled).items())) [(0, 64), (1, 64), (2, 64)] -As later stated in the next section, :class:`NearMiss` heuristic rules are -based on nearest neighbors algorithm. Therefore, the parameters ``n_neighbors`` -and ``n_neighbors_ver3`` accept classifier derived from ``KNeighborsMixin`` -from scikit-learn. The former parameter is used to compute the average distance -to the neighbors while the latter is used for the pre-selection of the samples -of interest. Mathematical formulation -^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~ + +:class:`NearMiss` uses the K-Nearest Neighbour algorithm to identify the samples of the +target class(es) that are closer to the minority class, as well as the distance that +separates them. -Let *positive samples* be the samples belonging to the targeted class to be -under-sampled. *Negative sample* refers to the samples from the minority class -(i.e., the most under-represented class). +Let *positive samples* be the samples belonging to the class to be under-sampled, and +*negative sample* the samples from the minority class (i.e., the most +under-represented class). -NearMiss-1 selects the positive samples for which the average distance -to the :math:`N` closest samples of the negative class is the smallest. +**NearMiss-1** selects the positive samples whose average distance to the :math:`K` +closest samples of the negative class is the smallest (:math:`K` is the number of +neighbours in the K-Nearest Neighbour algorithm). The following image illustrates the +logic: .. 
image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_001.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -NearMiss-2 selects the positive samples for which the average distance to the -:math:`N` farthest samples of the negative class is the smallest. +**NearMiss-2** selects the positive samples whose average distance to the +:math:`K` farthest samples of the negative class is the smallest. The following image +illustrates the logic: .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_002.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -NearMiss-3 is a 2-steps algorithm. First, for each negative sample, their -:math:`M` nearest-neighbors will be kept. Then, the positive samples selected -are the one for which the average distance to the :math:`N` nearest-neighbors -is the largest. +**NearMiss-3** is a 2-step algorithm: + +First, for each negative sample, that is, for each observation of the minority class, +it selects :math:`M` nearest-neighbors from the positive class (target class). This +ensures that all observations from the minority class have at least some neighbours +from the target class. + +Next, it selects positive samples whose average distance to the :math:`K` +nearest-neighbors of the minority class is the largest. + +The following image illustrates the logic: .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_003.png :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html :scale: 60 :align: center -In the next example, the different :class:`NearMiss` variant are applied on the -previous toy example. It can be seen that the decision functions obtained in -each case are different. - -When under-sampling a specific class, NearMiss-1 can be altered by the presence -of noise. 
In fact, it will implied that samples of the targeted class will be -selected around these samples as it is the case in the illustration below for -the yellow class. However, in the normal case, samples next to the boundaries -will be selected. NearMiss-2 will not have this effect since it does not focus -on the nearest samples but rather on the farthest samples. We can imagine that -the presence of noise can also altered the sampling mainly in the presence of -marginal outliers. NearMiss-3 is probably the version which will be less -affected by noise due to the first step sample selection. +In the following example, we apply the different :class:`NearMiss` variants to a toy +dataset. Hote how the decision functions obtained in each case are different (left +plots): .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html :scale: 60 :align: center +NearMiss-1 is sensitive to noise. Indeed, the observations from the target class +that are closest to the minority class samples may well be noise. NearMiss-1 will +nevertheless select those observations, as shown in the first row of the +previous illustration (check the yellow class). + +NearMiss-2 will be less sensitive to noise since it does not select the nearest, but +rather the farthest samples of the target class. + +NearMiss-3 is probably the version least sensitive to noise, thanks to its first +sample selection step. 
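As an aside (not part of the patch), the NearMiss-1 rule described in the hunks above, keeping the positive samples with the smallest average distance to their :math:`K` closest negative samples, can be sketched in a few lines of NumPy. The array sizes and the number of retained samples below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
positive = rng.normal(0.0, 1.0, size=(20, 2))  # class to under-sample
negative = rng.normal(3.0, 1.0, size=(5, 2))   # minority class
K = 3        # neighbours used for the average distance
n_keep = 5   # how many positive samples survive the under-sampling

# Pairwise Euclidean distances, shape (n_positive, n_negative).
dists = np.linalg.norm(positive[:, None, :] - negative[None, :, :], axis=2)

# Average distance from each positive sample to its K closest negatives.
avg_to_k_closest = np.sort(dists, axis=1)[:, :K].mean(axis=1)

# NearMiss-1 keeps the positives with the *smallest* such average distance.
kept_idx = np.argsort(avg_to_k_closest)[:n_keep]
print(kept_idx)
```

Swapping `np.sort(dists, axis=1)[:, :K]` for `[:, -K:]` would give the NearMiss-2 rule (average distance to the :math:`K` *farthest* negatives) with no other change.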
+ + Cleaning under-sampling techniques ---------------------------------- From c1cb00f7079a3e978f0cff2672c250d13c11fe78 Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 15:03:18 +0200 Subject: [PATCH 2/3] fix typos --- doc/under_sampling.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst index efdb288b3..38b87540d 100644 --- a/doc/under_sampling.rst +++ b/doc/under_sampling.rst @@ -136,8 +136,8 @@ the minority class. In other words, :class:`NearMiss` removes observations from target class that are closer to the boundary they form with the minority class samples. To find out which samples are closer to the boundary with the minority class, -:class:`NearMiss` uses the K-Nearest Neighbour algorithm. :class:`NearMiss` implements -3 different heuristics, which we can be selected with the parameter ``version`` and we +:class:`NearMiss` uses the K-Nearest Neighbours algorithm. :class:`NearMiss` implements +3 different heuristics, which we can select with the parameter ``version`` and we will explain in the coming paragraphs. We can perform this undersampling as follows:: >>> from imblearn.under_sampling import NearMiss @@ -150,11 +150,11 @@ will explain in the coming paragraphs. We can perform this undersampling as foll Mathematical formulation ~~~~~~~~~~~~~~~~~~~~~~~~ -:class:`NearMiss` uses the K-Nearest Neighbour algorithm to identify the samples of the +:class:`NearMiss` uses the K-Nearest Neighbours algorithm to identify the samples of the target class(es) that are closer to the minority class, as well as the distance that separates them. -Let *positive samples* be the samples belonging to the class to be under-sampled, and +Let *positive samples* be the samples from the class to be under-sampled, and *negative sample* the samples from the minority class (i.e., the most under-represented class). 
@@ -195,7 +195,7 @@ The following image illustrates the logic: :align: center In the following example, we apply the different :class:`NearMiss` variants to a toy -dataset. Hote how the decision functions obtained in each case are different (left +dataset. Note how the decision functions obtained in each case are different (left plots): .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png From 121a75df56a5fe307d1dd3f5ff9945faf7d61d50 Mon Sep 17 00:00:00 2001 From: Soledad Galli Date: Tue, 11 Jul 2023 15:10:52 +0200 Subject: [PATCH 3/3] reword docstrings --- .../_prototype_selection/_nearmiss.py | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/imblearn/under_sampling/_prototype_selection/_nearmiss.py b/imblearn/under_sampling/_prototype_selection/_nearmiss.py index 70f647fa5..7073a8cf2 100644 --- a/imblearn/under_sampling/_prototype_selection/_nearmiss.py +++ b/imblearn/under_sampling/_prototype_selection/_nearmiss.py @@ -35,20 +35,17 @@ class NearMiss(BaseUnderSampler): n_neighbors : int or estimator object, default=3 If ``int``, size of the neighbourhood to consider to compute the - average distance to the minority point samples. If object, an + average distance to the minority samples. If object, an estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to find the k_neighbors. - By default, it will be a 3-NN. n_neighbors_ver3 : int or estimator object, default=3 - If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This - parameter correspond to the number of neighbours selected create the - subset in which the selection will be performed. If object, an - estimator that inherits from + Only used if `version=3`. If ``int``, the number of target class samples + closest to each minority sample that are retained in the first subsampling + step. 
If object, an estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to find the k_neighbors. - By default, it will be a 3-NN. {n_jobs} @@ -56,7 +53,7 @@ class NearMiss(BaseUnderSampler): ---------- sampling_strategy_ : dict Dictionary containing the information to sample the dataset. The keys - corresponds to the class labels from which to sample and the values + correspond to the class labels from which to sample and the values are the number of samples to sample. nn_ : estimator object @@ -144,7 +141,7 @@ def __init__( def _selection_dist_based( self, X, y, dist_vec, num_samples, key, sel_strategy="nearest" ): - """Select the appropriate samples depending of the strategy selected. + """Select the appropriate samples depending on the selected strategy. Parameters ----------