
Random Forest classifier


The Random Forest classifier is an ensemble classifier obtained by aggregating multiple decision tree classifiers.

Before starting the classification process, it may be useful to introduce the Decision Tree and the Random Forest concepts (if you already know them, you can jump directly to the procedure paragraph).





Decision Tree classifier

A Decision Tree classifier is a supervised machine learning model in which the data is repeatedly split according to the value of a certain feature. It consists of a set of nodes, each corresponding to a test on the value of a feature, where the data is split depending on the result of that test, and of branches, which connect one node to another.

The final nodes (those not connected to any following node) are called leaf nodes and provide the outcome of the classification, assigning a class label to each tested sample.

The first node, which represents the first test, is called the root node.

However, before the classification step, the Decision Tree classifier has to be built.

A training set is used to fit a Decision Tree classifier. The algorithm simply consists in identifying, at each node, the best feature and split value (the ones which best discriminate between the classes), splitting the data according to the result of the test, and repeating the procedure on each resulting node until a node contains samples of a single class, until the path is pruned in order to prevent overfitting (in this case, the majority rule is applied on the node), or until a maximum depth value is reached.

The classification step then simply consists in applying these tests to each test sample, which will reach a leaf node and be classified accordingly.
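
As an illustration, here is a minimal sketch of fitting and applying a decision tree with scikit-learn; the dataset and the max_depth value are assumptions chosen for the example, not Athena's implementation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fitting repeatedly picks the feature and split value that best separate
# the classes; max_depth limits the tree's growth to reduce overfitting.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

# Classification: each sample follows the tests from the root node down
# to a leaf node, whose label is returned.
print(tree.predict(X[:5]))
```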





Random Forest classifier

Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.

These decision trees are fitted on different sets, which may mean different samples (all belonging to the same overall training set) or different features.

In the classification step, each tree classifies every test sample, and the final class chosen by the Random Forest classifier can be decided in different ways.

Athena simply uses the majority rule, which assigns as the final class label the one most frequently predicted by the trees.

This classifier tends to reduce the probability of overfitting with respect to the Decision Tree classifier, and generally provides better performance.
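
The sketch below illustrates the ensemble idea: each tree is fitted on a different bootstrap resample of the training set, and the forest assigns the most frequently predicted label (the majority rule). The function names and default values are illustrative assumptions, not Athena's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=7, resample_fraction=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = int(resample_fraction * len(X))
    trees = []
    for _ in range(n_trees):
        # Each tree sees a different bootstrap resample of the training set
        idx = rng.choice(len(X), size=n, replace=True)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # one row of labels per tree
    # Majority rule: for each sample, the most frequently predicted label wins
    return np.array([np.bincount(col).argmax() for col in votes.T])
```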





The Random Forest classification step

Athena allows you to set some parameters of the classifier, such as the number of repetitions of the training-test cycle, the number of trees composing the forest, and the fraction of training samples to resample in order to train each tree. Alternatively, you can select the default parameters, which will be set automatically by the toolbox (1000 repetitions, 7 trees and a resample fraction equal to 0.5, evaluated with a training-test split having a training fraction equal to 0.8).
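
As a sketch of this procedure, the loop below repeats a training-test cycle with the stated defaults (1000 repetitions, 7 trees, a resample fraction of 0.5 and a training fraction of 0.8) and averages the accuracy; the scikit-learn calls and the dataset are assumptions used only to illustrate the cycle, not the toolbox's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
accuracies = []
for rep in range(1000):                          # number of repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, random_state=rep)  # training fraction
    forest = RandomForestClassifier(
        n_estimators=7,                          # number of trees
        max_samples=0.5,                         # fraction of resample per tree
        bootstrap=True)
    forest.fit(X_tr, y_tr)
    accuracies.append(forest.score(X_te, y_te))
print("average accuracy:", np.mean(accuracies))
```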

Here, you can also decide to try the Decision Tree classifier instead.

Furthermore, you have to select one of the evaluation methods.

After all the repetitions are finished, a file containing all the used parameters and the resulting performance will be created in a Classification folder inside your main data directory.

The confusion matrix, with the resulting average accuracy value, will be shown in a figure.

The ROC curve will also be shown, together with the corresponding AUC value.
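
For reference, these outputs can be computed on a binary problem as in the sketch below; the dataset and the metric calls are assumptions for illustration, not the toolbox's own figure code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

forest = RandomForestClassifier(n_estimators=7, max_samples=0.5).fit(X_tr, y_tr)
y_pred = forest.predict(X_te)

# Confusion matrix and the accuracy derived from it
print(confusion_matrix(y_te, y_pred))
print("accuracy:", accuracy_score(y_te, y_pred))

# The ROC curve needs continuous scores: use the predicted probability of
# the positive class, then integrate the curve to obtain the AUC.
fpr, tpr, _ = roc_curve(y_te, forest.predict_proba(X_te)[:, 1])
print("AUC:", auc(fpr, tpr))
```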

Finally, you can repeat the classification by changing the parameters or the evaluation method, or you can return to the classifiers list.
