Acquiring high-quality labels is one of the most critical bottlenecks in machine learning: obtaining accurate, precise labels in sufficient quantity is time-consuming and costly. Often we must make do with a large number of noisy labels and only a handful of ground truths. Semi-supervised learning addresses this class of problems by using unlabeled data during training, under the premise that a data manifold underlying the large pool of unlabeled data can be leveraged. We propose two algorithms that combine random sampling and self-training to strengthen our models. The algorithms are robust on most semi-supervised learning tasks, even when the ground truths number as few as ten.
The algorithm is built around three-stage learning: supervision, supervised generalization, and transductive generalization.
Supervision: randomly sample the training set and train a weak classifier.
Supervised generalization: validate the weak model on the remaining labels (this step may not be necessary, since it works against diversification).
Transductive generalization: the weakly supervised model uses its decision boundary or probability estimates to assign confidence values to the unlabeled data. Given sufficient training epochs, there exists a subset of training samples that generalizes best to the unlabeled data. One way to reach it is to exclude low-confidence samples from the pseudo-labels; the other is to include only high-confidence samples. The former is called subtractive labeling, the latter additive labeling (both are sketched below).
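The two strategies fit in a few lines each. This is a minimal sketch, assuming a fitted scikit-learn-style classifier with `predict_proba`; the function names and threshold values are illustrative, not taken from the scripts.

```python
import numpy as np

def subtractive_labeling(clf, X_unlabeled, drop_below=0.6):
    """Start from all pseudo-labels, then exclude low-confidence ones."""
    proba = clf.predict_proba(X_unlabeled)
    pseudo = proba.argmax(axis=1)            # pseudo-label everything first
    confidence = proba.max(axis=1)
    keep = confidence >= drop_below          # subtract the uncertain samples
    return X_unlabeled[keep], pseudo[keep]

def additive_labeling(clf, X_unlabeled, add_above=0.9):
    """Start from nothing, then include only high-confidence samples."""
    proba = clf.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)
    add = confidence >= add_above            # add only the confident samples
    return X_unlabeled[add], proba[add].argmax(axis=1)
```

With a single confidence cutoff the two strategies select the same samples; they differ in where the threshold is typically set (a permissive cutoff for subtractive labeling, a strict one for additive labeling).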
Step 1: RANSAC2 combines bagging and active learning. Initially, N base classifiers are trained separately on random samples drawn from the training data without replacement. The rest of the training data are treated as unlabeled.
Step 2: The base classifiers then use bagging to generate a consensus, which serves as the pseudo-labels.
Step 3: The pseudo-labels are added to the sample sets of all base classifiers for the next iteration.
Step 4: The process is repeated until all unlabeled data are labeled (the full loop is sketched below).
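Here is a minimal sketch of Steps 1 through 4, assuming the 0/1/-1 label convention described later in this README. The base learner, subset sizes, and unanimity rule are illustrative choices, not taken verbatim from the scripts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def ransac2_sketch(X, y, n_classifiers=5, n_init=10, seed=0):
    """Steps 1-4 as a bare loop; y uses -1 to mark unlabeled samples."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    # Step 1: disjoint random subsets of the labeled data, one per classifier
    labeled_idx = rng.permutation(np.flatnonzero(y != -1))
    subsets = [list(chunk) for chunk in
               np.array_split(labeled_idx[:n_classifiers * n_init], n_classifiers)]
    while True:
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:                  # Step 4: everything is labeled
            break
        # Step 2: bagging-style consensus over the base classifiers
        votes = np.stack([
            KNeighborsClassifier(n_neighbors=3)
            .fit(X[s], y[s]).predict(X[unlabeled])
            for s in subsets
        ])
        consensus = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote on {0, 1}
        unanimous = (votes == consensus).all(axis=0)        # adopt full agreement only
        if not unanimous.any():
            break                                # no consensus left; stop early
        # Step 3: adopted pseudo-labels join every classifier's sample set
        newly = unlabeled[unanimous]
        y[newly] = consensus[unanimous]
        for s in subsets:
            s.extend(newly.tolist())
    return y
```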
First Version (RANSAC1)
The model takes a fully labeled data set whose labels are noisy.
The SSL model then denoises the data set in the fashion of bagging and active learning, as sketched below.
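A minimal sketch of the denoising idea, assuming noisy 0/1 labels. The out-of-sample voting scheme here is one illustrative reading of the bagging step, not the exact procedure in ransac_simple.py.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def denoise_labels(X, y_noisy, n_rounds=10, frac=0.5, seed=0):
    """Relabel a noisy data set by majority vote over subsampled learners."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_rounds, len(y_noisy)))
    for r in range(n_rounds):
        # train a weak classifier on a random subsample of the noisy labels
        idx = rng.choice(len(y_noisy), size=int(frac * len(y_noisy)), replace=False)
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y_noisy[idx])
        votes[r] = clf.predict(X)
    return (votes.mean(axis=0) > 0.5).astype(int)   # consensus relabeling
```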
Version 2 is developed on top of version 1.
Version 3 is developed on top of version 2.
Version 4 is developed on top of version 3 (fork from ransac_simple_v3.py).
Ransac Simple Algorithm 2 (fork from ransac_simple.py)
RANSAC semi-supervised learning utilizing unlabeled data.
Make sure you have Python 3, NumPy, and scikit-learn installed.
Clone the repository and use either of the .py scripts above to port to your binary classification tasks.
Labeled data are annotated as 0 or 1.
Unlabeled data are annotated as -1.
You can swap the base classifier among SVM, KNN, and multi-layer perceptron, as in the example below.
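An illustrative setup for the label convention and classifier choices, using a scikit-learn toy data set; the variable names are ours, not from the scripts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Build a toy binary task and apply the label convention:
# 0/1 for labeled samples, -1 for unlabeled ones.
X, y_true = make_classification(n_samples=200, random_state=0)
y = y_true.copy()
hide = np.random.default_rng(0).random(len(y)) > 0.05
y[hide] = -1                     # ~190 unlabeled samples, ~10 ground truths remain

# Swap the base classifier among SVM, KNN, or a multi-layer perceptron:
base = SVC(probability=True)
# from sklearn.neighbors import KNeighborsClassifier;  base = KNeighborsClassifier()
# from sklearn.neural_network import MLPClassifier;    base = MLPClassifier()
```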
We welcome contributions from the community! If you'd like to contribute to this project, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them with clear and concise commit messages.
- Push your changes to your fork.
- Open a pull request against the main repository with a description of your changes.
- Wait for a maintainer to review your pull request and provide feedback.
For more details on how to contribute, please see our CONTRIBUTING.md file.
This project is licensed under the MIT License. This means you are free to use, modify, and distribute the project, provided you include the original license and copyright notice in any copies or substantial portions of the software.
For more details, see the LICENSE file.
If you have any questions or feedback, feel free to reach out to us: