Benchmarking methods to select examples to relabel in active learning for data labeled by multiple annotators
Code to reproduce results from the paper:
ActiveLab: Active Learning with Re-Labeling by Multiple Annotators
This repository benchmarks algorithms that compute an active learning score quantifying how valuable it is to collect additional labels for specific examples in a classification dataset. We consider settings with multiple data annotators, where each example can be labeled more than once if needed to ensure high-quality consensus labels.
This repository is intended for scientific benchmarking purposes only. To apply the ActiveLab algorithm in your own active learning loops with multiannotator data, use the implementation from the official cleanlab library instead.
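For orientation, here is a minimal sketch of how ActiveLab-style scores can be obtained through cleanlab. This assumes a recent cleanlab release that exposes `get_active_learning_scores` in `cleanlab.multiannotator`; check the cleanlab documentation for the exact API and input requirements.

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_active_learning_scores

# One column per annotator; NaN means that annotator did not label that example.
labels_multiannotator = pd.DataFrame({
    "annotator_1": [0, 1, np.nan, 2],
    "annotator_2": [0, np.nan, 1, 2],
})

# Out-of-sample predicted class probabilities from your model for the
# currently labeled examples (num_labeled_examples x num_classes) ...
pred_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
])
# ... and for the not-yet-labeled pool (num_unlabeled_examples x num_classes).
pred_probs_unlabeled = np.array([
    [0.4, 0.3, 0.3],
    [0.9, 0.05, 0.05],
])

scores_labeled, scores_unlabeled = get_active_learning_scores(
    labels_multiannotator, pred_probs, pred_probs_unlabeled
)

# Lower scores = higher priority for collecting another label, so spend the
# next labeling budget on the lowest-scoring examples across both pools.
print(np.argsort(scores_labeled)[:2], np.argsort(scores_unlabeled)[:1])
```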
To run the model training and benchmarks, you need to install the following dependencies:
```
pip install -r requirements.txt
pip install cleanlab
```
Three sets of benchmarks are conducted, one for each of the following datasets (each benchmark follows the same round-based labeling loop, sketched after the table):
| # | Dataset | Description |
|---|---|---|
| 1 | CIFAR-10H | Image classification with 5000 examples in total; 1000 examples have annotator labels at round 0, and 500 new labels are collected each round. |
| 2 | Wall Robot | Tabular classification with 2000 examples in total; 500 examples have annotator labels at round 0, and 100 new labels are collected each round. |
| 3 | Wall Robot Complete | Tabular classification with 2000 examples in total; all 2000 examples have annotator labels at round 0, and 100 new labels are collected each round. |
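All three benchmarks share the same round-based structure: train a model on the current consensus labels, score every example with the method under test, collect a new batch of annotator labels for the lowest-scoring examples, and repeat. The sketch below illustrates that selection step only; it is not code from this repository, and it uses a plain scikit-learn model plus a simple uncertainty score as a stand-in for the scoring methods being benchmarked.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict


def select_batch_to_label(features, consensus_labels, score_fn, batch_size):
    """Pick which examples should receive a new annotator label this round.

    score_fn maps out-of-sample predicted probabilities to one score per
    example, where lower scores mean "collect another label for this example".
    """
    # Out-of-sample predicted probabilities via cross-validation.
    pred_probs = cross_val_predict(
        RandomForestClassifier(random_state=0),
        features,
        consensus_labels,
        cv=5,
        method="predict_proba",
    )
    scores = score_fn(pred_probs)
    # Select the batch_size lowest-scoring examples for (re)labeling.
    return np.argsort(scores)[:batch_size]


# Toy usage with a simple uncertainty score (low max-probability = uncertain),
# collecting a batch of 100 new labels per round as in the Wall Robot benchmarks.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 10))
consensus_labels = rng.integers(0, 4, size=500)
batch = select_batch_to_label(
    features, consensus_labels, score_fn=lambda p: p.max(axis=1), batch_size=100
)
print(batch[:10])
```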
The datasets used in the benchmark are downloaded from:
Two supplementary benchmarks were conducted on the Wall Robot dataset:
| # | Benchmark | Description |
|---|---|---|
| 1 | Single Annotator vs Multiannotator | Compares labeling new data points vs. relabeling existing data points. |
| 2 | Methods for Single Label | Benchmarks the performance of various methods in the scenario where each example has only one label (standard single-label baselines are sketched below). |
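In the single-label setting, typical baseline methods score each example from the model's predicted probabilities alone. The snippet below sketches three standard uncertainty-sampling scores (least confidence, margin, negative entropy); whether these correspond exactly to the methods benchmarked here is an assumption.

```python
import numpy as np


def least_confidence_score(pred_probs):
    # Maximum predicted probability: low values mean the model is unsure.
    return pred_probs.max(axis=1)


def margin_score(pred_probs):
    # Gap between the top two predicted probabilities: a small gap = ambiguous.
    top_two = np.sort(pred_probs, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]


def negative_entropy_score(pred_probs):
    # Negative entropy, so that lower scores again mean higher uncertainty.
    return (pred_probs * np.log(pred_probs + 1e-12)).sum(axis=1)


# For all three, lower scores indicate examples to prioritize for labeling.
pred_probs = np.array([[0.5, 0.5], [0.9, 0.1]])
print(np.argsort(least_confidence_score(pred_probs)))  # -> [0 1]
```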
The `results/` folder for each dataset contains `.npy` files with the saved results (model accuracy and consensus label accuracy) from each run of the benchmark. These files are used to visualize the results in the `plot_results.ipynb` notebooks.
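To inspect a saved run outside the notebooks, the arrays can be loaded directly with NumPy. The filename below is a hypothetical placeholder, since the actual file names depend on the benchmark configuration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical filename -- check results/ for the actual .npy names saved by each run.
model_accuracy = np.load("results/model_accuracy_activelab.npy")

# One accuracy value per round of label collection.
plt.plot(range(len(model_accuracy)), model_accuracy, marker="o", label="ActiveLab")
plt.xlabel("Active learning round")
plt.ylabel("Model accuracy")
plt.legend()
plt.show()
```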