
Comparison of Supervised Machine Learning Algorithms

Course: COGS 118A - Supervised Machine Learning Algorithms (Winter 2020)

Data Sets

The data sets used in this project were taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml).

Final Project Report

Write a report of more than 1,000 words (excluding references) with the main sections: a) Abstract, b) Introduction, c) Methods, d) Experiments, e) Conclusions, and f) References. The basic requirement for the final project is the two-class (binary) classification problem.

Train your classifiers using the setting described in the empirical study by Caruana and Niculescu-Mizil (not all metrics are needed). You are expected to reproduce results consistent with the paper, though some small variation is normal. When evaluating the algorithms, you do not need to use all the metrics reported in the paper; a single metric, e.g. classification accuracy, is sufficient. Please report the cross-validated classification results along with the corresponding learned hyper-parameters.

If you compute accuracy and follow the basic requirement of picking 3 classifiers and 3 datasets, you are looking at 3 trials/repeats X 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20), i.e. 81 training runs in total. Each time, report the best accuracy under the chosen hyper-parameter. Since the accuracy is averaged over the three trials/repeats to rank-order the classifiers, you will report 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20) X 3 accuracies (train, validation, test). When debugging, check the training accuracy first: if you can push it high enough to overfit the data, that is a sanity check that your implementation is correct (see the sketch below). The heat maps for your hyper-parameters are details that do not need to be compared too carefully against the paper; the hyper-parameter search is internal, and the final conclusion about the classifiers is based on the best hyper-parameter obtained each time.
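A minimal sketch of that overfitting sanity check (the data set here is a stand-in, and an unconstrained decision tree is just one convenient overfitter):

# Sanity check: a sufficiently flexible model should be able to (over)fit
# the training data almost perfectly if the pipeline is wired correctly.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)    # stand-in for one of the UCI data sets
clf = DecisionTreeClassifier(max_depth=None)  # no depth limit, so it can memorize
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))  # expect ~1.0; much lower hints at a bug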

Pseudo code

For i in three different datasets
    For j in three types of partitions (20/80, 50/50, 80/20)
        For t in three different trials/repeats (a new shuffle/random split of type j each time)
            For c in three different classifiers
                cross-validate to find the optimal hyper-parameter
                train using the hyper-parameter above
                obtain the training and validation accuracy/error
                test
                obtain the testing accuracy
        compute the averaged accuracy (training, validation, and testing) for each classifier c over the three trials/repeats
        rank-order the classifiers
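A minimal runnable version of this loop with scikit-learn might look like the following. The data set is a stand-in for the three UCI data sets, and the hyper-parameter grids and fold count are illustrative assumptions; swap in the project's actual loaders and grids.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in: replace with loaders for the three UCI data sets.
datasets = {"breast_cancer": load_breast_cancer(return_X_y=True)}
partitions = [0.2, 0.5, 0.8]  # training fraction: 20/80, 50/50, 80/20
classifiers = {
    "linear_svm": (LinearSVC(max_iter=10000), {"C": [0.1, 1, 10, 100, 1000]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 2, 3]}),
    "tree": (DecisionTreeClassifier(), {"max_depth": [1, 2, 3, 4, 5]}),
}

for name, (X, y) in datasets.items():
    for train_frac in partitions:
        scores = {c: [] for c in classifiers}
        for trial in range(3):  # three random splits per partition
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=train_frac, random_state=trial)
            for c, (clf, grid) in classifiers.items():
                # Internal cross-validation picks the hyper-parameter.
                search = GridSearchCV(clf, grid, cv=5).fit(X_tr, y_tr)
                best = search.best_estimator_
                scores[c].append((best.score(X_tr, y_tr),   # train accuracy
                                  search.best_score_,       # validation (CV) accuracy
                                  best.score(X_te, y_te)))  # test accuracy
        # Average over the three trials, then rank the classifiers by test accuracy.
        means = {c: np.mean(a, axis=0) for c, a in scores.items()}
        for rank, (c, (tr, va, te)) in enumerate(
                sorted(means.items(), key=lambda kv: -kv[1][2]), start=1):
            print(f"{name} {train_frac:.0%} train  #{rank} {c}: "
                  f"train={tr:.3f} val={va:.3f} test={te:.3f}")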

Chosen Classifiers

  • Linear SVM with scikit-learn
  • KNN with scikit-learn
  • Decision Tree with scikit-learn

Methodology

This section summarizes the parameters used for each learning algorithm; a scikit-learn sketch of these settings follows the list.

  • SVM: A linear kernel via scikit-learn, with the regularization parameter C searched over 0.1, 1, 10, 100, and 1000.

  • K-NN: Uses KNeighborsClassifier from sklearn inside a grid search with 5-fold cross-validation over up to 3 neighbors, weighting neighbors by (Euclidean) distance.

  • Decision Tree: Uses DecisionTreeClassifier from sklearn with GridSearchCV, 10-fold cross-validation, and max-depth values up to 5.
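A sketch of these three searches with scikit-learn; parameter names follow sklearn's API, and the SVM's fold count is an assumption since the text above only gives its C grid:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Linear-kernel SVM over the listed regularization values.
svm_search = GridSearchCV(SVC(kernel="linear"),
                          {"C": [0.1, 1, 10, 100, 1000]},
                          cv=5)  # assumed fold count

# K-NN with up to 3 neighbors, distance-weighted
# (the Euclidean metric is sklearn's default).
knn_search = GridSearchCV(KNeighborsClassifier(weights="distance"),
                          {"n_neighbors": [1, 2, 3]},
                          cv=5)

# Decision tree with max depth up to 5.
tree_search = GridSearchCV(DecisionTreeClassifier(),
                           {"max_depth": [1, 2, 3, 4, 5]},
                           cv=10)

Each searcher is fit with search.fit(X_train, y_train) and then exposes best_params_, best_score_, and best_estimator_.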

Useful Resources

Acknowledgements
