Uses custom implementations of ensemble methods, decision trees, k-nearest neighbors, and neural nets to predict Titanic passenger survival.
Currently, random forests and decision trees are the only fully implemented models. Running main.py without modification uses the default randomForests class parameters (num_trees = 1000, gain = 'entropy', num_levels = no limit, number of features considered at each split = sqrt(len(features))).
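As a rough illustration of what two of these defaults mean, here is a minimal sketch of the entropy and gini impurity measures and of sampling sqrt(len(features)) features for a split. The function names are hypothetical and not taken from this repository, and rounding the square root down (with a minimum of 1) is an assumption.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def sample_split_features(features, rng=random):
    """Pick sqrt(len(features)) features without replacement.

    Assumption: the square root is rounded down, with a minimum of 1.
    """
    k = max(1, math.isqrt(len(features)))
    return rng.sample(features, k)


# Illustrative column names only; the repository's feature set may differ.
columns = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Cabin", "Title"]
candidates = sample_split_features(columns)  # 3 of the 9 columns, chosen at random
```

Both impurity measures are 0 for a pure node and largest when the classes are evenly mixed, so a split is scored by how much it lowers the chosen measure.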
To test the random forest and decision tree models, I have created 3 unittest files which use the Titanic dataset:
- test_decision_tree.py: Tests the decision tree over all possible maximum depths for each split criterion (gini impurity or information gain) and plots the resulting train and test accuracy.
- test_random_forests_single.py: Tests a single random forests instance with whatever parameters you choose, and prints the test and train accuracy to the console.
- test_random_forests_parameters.py: Runs a parameter search over the parameter num_trees and the parameter lists max_feature_counts and max_num_levels. Uses multiprocessing to complete faster and plots the resulting test and train accuracy for every parameter combination.
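The parameter search described above can be sketched roughly as follows. This is not the code from test_random_forests_parameters.py: the function names are hypothetical, the evaluation function is a placeholder that returns dummy values, and a thread pool from the standard library stands in for the multiprocessing used in the actual test. Treating num_trees as a single value while the other two are lists is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def parameter_grid(num_trees, max_feature_counts, max_num_levels):
    """Every (num_trees, max_features, max_levels) combination to try."""
    return list(product([num_trees], max_feature_counts, max_num_levels))


def run_search(grid, evaluate, max_workers=4):
    """Evaluate each combination concurrently; results follow the grid order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, grid))
    return dict(zip(grid, scores))


def placeholder_evaluate(params):
    """Stand-in for 'train a forest and return (train_accuracy, test_accuracy)'."""
    num_trees, max_features, max_levels = params
    return (0.0, 0.0)  # dummy values; a real run would fit and score a model


grid = parameter_grid(100, [2, 3], [4, 8, None])  # None meaning "no depth limit"
results = run_search(grid, placeholder_evaluate)
```

Each entry in results maps one parameter combination to its scores, which is the shape needed to plot train and test accuracy per combination.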
If you have the dependencies installed, you can run any of these from the parent folder, replacing * with the name of the test you want:
python test/test_*.py
Every test file also has easy-access parameter variables if you would like to tweak them yourself.
[Figure: example output from test_random_forests_parameters.py]
The only dependencies required are sklearn (installed as the scikit-learn package), pandas, matplotlib, numpy, joblib, and Python >= 3.12. Install them with pip or conda.
- The decision tree allows fine-tuning of parameters, including the choice of entropy or gini and the maximum depth of the tree.
- Handles new data with unseen patterns during prediction by probabilistically choosing a child node to continue down, instead of throwing errors or dumping into an "other" branch.
- The random forest allows fine-tuning of the parameters "num_trees", "max levels/depth per tree", "max number of features to sample & consider at each split", and "gain type (gini or information)".
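The handling of unseen values during prediction can be illustrated with a minimal sketch. The Node class and predict function below are hypothetical and not the repository's implementation. Weighting the random choice by how many training rows went down each branch is an assumption; the description above says only that the choice is probabilistic.

```python
import random


class Node:
    """Hypothetical categorical tree node: one child per value seen in training."""

    def __init__(self, feature=None, children=None, counts=None, prediction=None):
        self.feature = feature          # feature this node splits on (None for a leaf)
        self.children = children or {}  # feature value -> child Node
        self.counts = counts or {}      # feature value -> training rows sent to that child
        self.prediction = prediction    # class label stored at a leaf


def predict(node, row, rng=random):
    """Walk the tree; on an unseen value, pick a child at random, weighted by counts."""
    while node.children:
        value = row.get(node.feature)
        if value not in node.children:
            # Assumption: branches that saw more training rows are more likely.
            values = list(node.children)
            weights = [node.counts.get(v, 1) for v in values]
            value = rng.choices(values, weights=weights, k=1)[0]
        node = node.children[value]
    return node.prediction


tree = Node(
    feature="Embarked",
    children={"S": Node(prediction=0), "C": Node(prediction=1)},
    counts={"S": 3, "C": 1},
)
seen = predict(tree, {"Embarked": "C"})    # follows the "C" branch
unseen = predict(tree, {"Embarked": "Q"})  # "Q" was never seen: follows "S" or "C" at random
```

Compared with a dedicated "other" branch, this keeps every prediction on a path that was actually built from training data, at the cost of making predictions for unseen values non-deterministic.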