Toolbox to train automatic classification models for UVP6 images and/or to evaluate their performances.
Minimal knowledge of Python, git and machine learning is needed.
This toolbox has been tested on macOS and Linux (e.g. Ubuntu 20.04/22.04 and Mint 21). We do not guarantee that it will work on Windows.
To install the package, you can type the following command in your terminal:
python -m pip install git+https://github.com/ecotaxa/uvpec
or
python -m pip install git+ssh://git@github.com/ecotaxa/uvpec.git
or using pip
pip install uvpec
`uvpec` should now appear if you type `pip list | grep uvpec`.
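The same check can also be done from Python. This is a minimal sketch, assuming the distribution is registered under the name `uvpec` (as in the `pip install uvpec` command above); `installed_version` is a hypothetical helper and not part of the package.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "uvpec" is assumed to be the distribution name, as used by pip above.
    version = installed_version("uvpec")
    if version is None:
        print("uvpec is not installed in this environment")
    else:
        print(f"uvpec {version} is installed")
```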
For development purposes, you can also clone the repository locally. For this, you can either run (for HTTPS)
git clone https://github.com/ecotaxa/uvpec.git
or (for SSH)
git clone git@github.com:ecotaxa/uvpec.git
In order to use the package, you have to create a `config.yaml` file. Don't panic, an example of such a file is provided in your cloned repository at `uvpec/uvpec/config.yaml`. In this file, you need to specify 3 things: (1) what you want to do with the package, (2) some input/output information and (3) parameters for the gradient boosted trees algorithm (XGBoost) that will train and create a classification model.
For the process information, you need to specify two boolean variables:
- `evaluate_only`: `true` if you only want to evaluate an already created model. In that case, the package will not train any model and will only evaluate the model indicated by the `model` path with the `test_features_file` data. `false` if you want to train a model.
- `train_only`: `true` if you want to only train a model and skip the evaluation part, `false` if not. Not taken into account if `evaluate_only` is `true`.
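The way these two flags combine can be summarized with a short sketch. It only restates the rules described above; `planned_steps` is a hypothetical helper and does not reflect how the package is actually implemented.

```python
def planned_steps(evaluate_only, train_only):
    """Return the steps implied by the two process flags, per the rules above."""
    if evaluate_only:
        # train_only is not taken into account when evaluate_only is true.
        return ["evaluate"]
    if train_only:
        return ["train"]
    return ["train", "evaluate"]


if __name__ == "__main__":
    for evaluate_only in (True, False):
        for train_only in (True, False):
            steps = planned_steps(evaluate_only, train_only)
            print(f"evaluate_only={evaluate_only}, train_only={train_only}: {steps}")
```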
For the input/output (io), you need to specify:
- `output_dir`: an output directory, where the model and related information will be exported.
- `train_images_dir`: an image directory for the training set images. The plankton and/or particle images must be sorted by taxonomic class into subfolders, following the layout used with Ecotaxa. Each subfolder contains images from a single taxonomic class and is named with the class's display name and its Ecotaxa ID, separated by two underscores: 'DisplayName__EcotaxaID'. The typical way to export data from Ecotaxa in this folder organization is to make a D.O.I. export, exporting all images, and to keep only the 'white on black' images, i.e. `*_100.png` (see here). The maximum number of accepted classes is 40.
- `test_images_dir`: an image directory for the test set images. It will only be used if you evaluate a model (training + evaluation or evaluation only).
- `training_features_file`: the name of your training features file. If it does not already exist, it will be created automatically, so give it a great name!
- `test_features_file`: the name of your test features file. If it does not already exist, it will be created automatically, so give it a great name as well! Unused if `train_only` is `true`.
- `model`: the path to a model (the file name should follow the format `Muvpec_KEY.model`, a model created using XGBoost). Only used if `evaluate_only` is `true`.
- `objid_threshold_file`: the path to a tsv file containing the objid and the UVP6 acquisition threshold of each image for which features will be extracted. Only used if `use_objid_threshold_file` is set to `true`.
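Before launching a training, it can help to check that the subfolder names follow the 'DisplayName__EcotaxaID' convention and that there are no more than 40 classes. The sketch below is only an illustration: `parse_class_folder` and `check_training_folders` are hypothetical helpers, the Ecotaxa ID is assumed to be numeric, and the example folder names are made up.

```python
MAX_CLASSES = 40  # maximum number of accepted classes, as stated above


def parse_class_folder(folder_name):
    """Split 'DisplayName__EcotaxaID' into (display_name, ecotaxa_id)."""
    # Split on the last double underscore so that display names that
    # themselves contain underscores are still handled.
    display_name, separator, ecotaxa_id = folder_name.rpartition("__")
    # Assumption: the Ecotaxa ID is made of digits only.
    if not separator or not display_name or not ecotaxa_id.isdigit():
        raise ValueError(f"not a 'DisplayName__EcotaxaID' folder: {folder_name!r}")
    return display_name, int(ecotaxa_id)


def check_training_folders(folder_names):
    """Validate a list of subfolder names and return the parsed classes."""
    classes = [parse_class_folder(name) for name in folder_names]
    if len(classes) > MAX_CLASSES:
        raise ValueError(f"{len(classes)} classes found, maximum is {MAX_CLASSES}")
    return classes


if __name__ == "__main__":
    # Folder names below are invented for illustration.
    print(check_training_folders(["Copepoda__12345", "detritus__67890"]))
```

In practice, the list of names could come from `os.listdir` on your `train_images_dir`.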
For the instrument parameter, you need to specify:
- The pixel threshold of your UVP6 `uvp_pixel_threshold`, that is the threshold value used to split image pixels into foreground (> threshold) and background (<= threshold) pixels. It usually lies between 20 and 22.
- If you wish to use a variable threshold value (e.g. if you are working with images acquired with different UVP6 instruments), set `use_objid_threshold_file` to `true`.
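To make the role of the threshold concrete, here is a small sketch of the foreground/background split described above, written in plain Python on a nested list of grey levels. It is not the package's feature extraction code: `foreground_mask` and `pick_threshold` are hypothetical helpers, and the dictionary of per-object thresholds only stands in for what a threshold file would provide.

```python
def foreground_mask(image, threshold):
    """Mark each pixel True if it is foreground (> threshold), else False."""
    return [[pixel > threshold for pixel in row] for row in image]


def pick_threshold(objid, per_object_thresholds, default_threshold):
    """Use the per-object threshold when one is known, else the default one."""
    return per_object_thresholds.get(objid, default_threshold)


if __name__ == "__main__":
    # Tiny made-up image: two rows of grey levels.
    image = [[0, 21, 22], [23, 20, 255]]
    # Made-up object identifier and threshold, for illustration only.
    thresholds = {"object_a": 22}
    threshold = pick_threshold("object_a", thresholds, default_threshold=21)
    for row in foreground_mask(image, threshold):
        print(row)
```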
Then, for the XGBoost parameters of the training, you need to specify:
- An initialization seed `random_state`. It matters if you build multiple models with different XGBoost configurations. The value itself is not important, you can keep 42.
- A number of CPU cores `n_jobs` that will depend on the computational power of your machine or server.
- The learning rate. It controls the magnitude of the adjustments made to the model's parameters during each iteration of training (i.e. in our model, at each boosting round). A high learning rate may cause the optimization to miss the optimal parameter values (e.g. it leads to oscillations or divergence), while a low learning rate might make training slow because of a slow convergence to the minimum of the loss function, or it can get stuck in local minima.
- The maximum depth of a tree `max_depth`. For technical reasons, it is forbidden to go above 7.
- `weight_sensitivity` represents the weight ($w$) you want to put on biological classes during training. The minimum value is 0 (i.e. no weight) and the maximum value is 1. It is useful to add a weight to smaller classes because a large share (often $\ge$ 80%) of the images from the training set are detritus; setting $w$ to 0.25 puts more weight on small (biological) classes during training and forces the algorithm to pay more attention to those classes.
- `detritus_subsampling` can be used if you want to undersample the detritus class in your training. If you think that your detritus class (therefore, you must have one specifically named 'detritus') is too populated (e.g. extreme dataset imbalance) and that removing a part of it is not an issue for your application, then you can fix a given percentage of subsampling for that class. For example, a `subsampling_percentage` of 20 means that you only keep 20% of your entire detritus class. Keep `detritus_subsampling` set to `false` if you don't want to use it.
- `subsampling_percentage` is the percentage of images of 'detritus' from your training set you want to keep for training.
- `num_trees_CV` stands for the number of boosting rounds you want to use for the cross-validation (CV). This is equivalent to the parameter `num_round` in XGBoost.
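As an illustration of what undersampling the detritus class means, the sketch below keeps every image of the other classes and a random share of the 'detritus' images. How the package actually selects the images is not described here, so this is an assumption: `subsample_detritus` is a hypothetical helper, and the use of a fixed seed simply mirrors the idea of `random_state`.

```python
import random


def subsample_detritus(labels, subsampling_percentage, seed=42):
    """Return the sorted indices to keep: all non-detritus, plus a share of detritus."""
    detritus = [i for i, label in enumerate(labels) if label == "detritus"]
    others = [i for i, label in enumerate(labels) if label != "detritus"]
    n_keep = round(len(detritus) * subsampling_percentage / 100)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = rng.sample(detritus, n_keep)
    return sorted(others + kept)


if __name__ == "__main__":
    # Made-up training set: 80 detritus images and 20 images of another class.
    labels = ["detritus"] * 80 + ["Copepoda"] * 20
    kept = subsample_detritus(labels, subsampling_percentage=20)
    n_detritus = sum(labels[i] == "detritus" for i in kept)
    print(f"{n_detritus} detritus images kept out of 80, {len(kept)} images in total")
```

With a `subsampling_percentage` of 20, the detritus class goes from 80 to 16 images in this made-up example, while the other class is left untouched.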
You will also notice that there is one last thing: `use_C` gives the possibility to extract the features from images using a C++ extension. We advise keeping it set to `true` because it is much faster than the Python version.
Once you are done, run `uvpec config.yaml` in your terminal and wait for the magic to happen! You should get everything you need in the output folder you specified.
We have prepared a `test` folder in our package. This allows you to check that the pipeline works without launching a full process that would take a significant amount of time. It is always a good idea to check that everything works well before using it on a full training set, and also after package updates. To use it, navigate to the test folder using `cd test`, then run `uvpec config.yaml`. You should see something going on in your terminal. Don't forget to check your output folder now!
In addition, there is another test that you can run to check that the pipeline is not broken somewhere. For that, run `pytest` (which actually looks for test_uvpec.py) in your terminal. Everything should then be taken care of, and if you only see green lights it means that all tests went smoothly! If not, something went wrong and the error messages can help you find where the leak is.
Just a reminder, if you see some errors during the test, check that you did not forget to run `uvpec config.yaml`.
`pytest` is not automatically present on your laptop. To install it, type `pip install --user pytest` in your terminal.
You can refer to the documentation on Ecotaxa to download all the vignettes you need to use for your training and/or test set. See the "export project" part of your project on https://ecotaxa.obs-vlfr.fr/.
Ecotaxa is built with a REST API that has been designed to facilitate the work of users. Two packages have been developed to interact more easily with the API in Python and in R. Be careful to download the vignettes with the black background, because every object is stored in two versions: one with a white background and one with a black background. You will also need to remove the size legend at the bottom of each vignette. To do so, crop 31 pixels at the bottom of the vignette.
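The cropping step can be expressed as a simple box computation. The sketch below is only an illustration: `crop_box` is a hypothetical helper that returns a `(left, upper, right, lower)` tuple, which is the box format accepted by `Image.crop` in the Pillow library, should you choose to use it (nothing here implies that the package itself relies on Pillow).

```python
LEGEND_HEIGHT = 31  # pixels to remove at the bottom, as stated above


def crop_box(width, height, legend_height=LEGEND_HEIGHT):
    """Return (left, upper, right, lower) for the vignette without its legend."""
    if height <= legend_height:
        raise ValueError("image is not taller than the legend")
    return (0, 0, width, height - legend_height)


if __name__ == "__main__":
    # Made-up vignette size: 120 pixels wide, 151 pixels high.
    print(crop_box(120, 151))
```

With Pillow, the box could then be applied with something like `img.crop(crop_box(*img.size))`, since `img.size` is a `(width, height)` tuple.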
Finally, just rename the vignettes with the `uvpec` standard (i.e. DisplayName__EcotaxaID), and you are good to go!
To uninstall our (awesome-why-are-you-removing-it) package, type `pip uninstall uvpec` in your terminal.
For updates, either uninstall it and reinstall it with the HTTPS or SSH version, or more simply using `pip`.