Pam50 Classification of Breast Cancer.

An overview of different preprocessing techniques, imbalance management, and learning algorithms.

This repository provides the Python code used to run the experiments. We used Python 2.7.14.

Repository structure

The main branch contains all the files needed to run the algorithm. The folder Dataset_creation contains two scripts, one in MATLAB and one in Python, that create the dataset from the raw files of the GDC site.

Prerequisites

A working installation of Python 2.7.
Packages to be installed: sklearn (scikit-learn), scipy, numpy, pandas, imblearn (imbalanced-learn). You can install each of them with, e.g.,

pip install pandas

or

python -m pip install pandas

or

conda install pandas

Dataset

The dataset used can be downloaded from here

Files:

The entry point is Main.py; all the other files are the classes it uses:

Main.py: manages the k-fold training/testing loop and calls the methods of all the classes.
GA.py: Genetic Algorithm class for feature selection.
Imbalance_manager.py: runs the selected method for handling class imbalance.
DR.py: runs the selected method for dimensionality reduction.
Classification.py: runs the selected classification method and computes the scores.
Opening_design.py: manages the user interface and defines the pipeline.

Running the tests

To reproduce the results, you can run the script in two different ways:

By calling the script normally with

python .\Main.py

and following the interactive prompts to select the preferred methods.

N.B. If the unsupervised pipeline is chosen, the feature reduction and class imbalance management steps are fixed, because we force methods that do not need labels.

By using the command line with the parameter --fast True. The default pipeline in this case is: PCA, no class balancing, SVC. With the command

python .\Main.py -h

you can see the available parameters and how to set them. Each part of the pipeline can be modified by passing the corresponding arguments (see --help). Example:

python .\Main.py --fast True --reduction LDA --imbalance SMOTE --supervised Random_Forest
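As a rough illustration of how the flags in the example above might be parsed, here is a minimal argparse sketch. It is not the code in Main.py or Opening_design.py: the helper names str2bool and build_parser, the defaults for the individual flags, and the lack of validation of the accepted values are assumptions. Only the flag names and the default pipeline (PCA, no class balancing, SVC) come from this README.

```python
import argparse


def str2bool(value):
    """Turn 'True'/'False'-style strings into booleans (the README passes `--fast True`)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % value)


def build_parser():
    """Hypothetical parser mirroring the flags shown in the README example."""
    parser = argparse.ArgumentParser(description="PAM50 classification pipeline (sketch)")
    parser.add_argument("--fast", type=str2bool, default=False,
                        help="skip the interactive prompts and use command-line arguments")
    parser.add_argument("--reduction", default="PCA",
                        help="dimensionality reduction method, e.g. PCA, LDA, GA")
    parser.add_argument("--imbalance", default=None,
                        help="imbalance method, e.g. SMOTE; None means no class balancing")
    parser.add_argument("--supervised", default="SVC",
                        help="classifier, e.g. SVC, KNN, Random_Forest")
    return parser


# Parse the README example from an explicit list instead of sys.argv.
args = build_parser().parse_args(
    ["--fast", "True", "--reduction", "LDA", "--imbalance", "SMOTE",
     "--supervised", "Random_Forest"]
)
# Parsing an empty list yields the default pipeline: PCA, no balancing, SVC.
defaults = build_parser().parse_args([])
```

Parsing --fast through a small converter rather than type=bool matters because bool("False") is True in Python, so a plain bool type would treat every non-empty value as enabled.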

The implemented methods are:

Supervised: SVC, KNN, RandomForest
Unsupervised: Kmeans, Hierarchical clustering
Imbalance methods: SMOTE, SMOTE + ENN, RandomOverSampling
Dimensionality reduction: PCA, LDA, GA
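To make the composition of these methods concrete, the sketch below chains PCA and SVC (the default fast pipeline, with no class balancing) and scores it with 10-fold cross-validation in scikit-learn. It is an illustration under assumptions, not the repository's Main.py: the data is synthetic, the fold splitter is stratified and shuffled, and PCA keeps 10 components only because the toy data is small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

NFOLDS = 10  # number of folds, matching the README's 10-fold setup

# Synthetic stand-in for the expression matrix (assumption: 5 classes, 50 features).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# PCA sits inside the pipeline, so it is re-fitted on the training part of each fold.
pipeline = make_pipeline(PCA(n_components=10), SVC())

cv = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)  # one accuracy value per fold

print("mean accuracy over %d folds: %.3f" % (NFOLDS, np.mean(scores)))
```

Keeping the reduction step inside the pipeline avoids fitting it on test samples, which would otherwise leak information into the evaluation.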

Notes on the algorithms

The dataset should be in the same folder as all the algorithm files.
The supervised pipeline uses 10-fold cross-validation for testing.
The unsupervised pipeline is fixed: you can change only the unsupervised method. In addition, the unsupervised version uses the full dataset.
The dimensionality reduction methods produce 180 features for PCA, 5 for LDA, and 1800 for the Genetic Algorithm.
You can change the number of folds and the number of features by changing NFOLDS and Nfeatures in the code.
Performance is evaluated through accuracy, F1 score, and Balanced Error Rate (lower is better for this last metric, which takes class imbalance into account).
The GA has a number of workers that can be set as a parameter. Using 1 is the safe mode, which uses a single thread; you can increase this number to use more threads and speed up the algorithm.
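The Balanced Error Rate mentioned in the notes above is not defined in this README. A common definition, assumed here, is one minus the mean of the per-class recalls. The function name balanced_error_rate and the toy labels are illustrative and do not come from Classification.py.

```python
import numpy as np


def balanced_error_rate(y_true, y_pred):
    """Return 1 - mean per-class recall (assumed definition of BER)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for label in np.unique(y_true):
        mask = y_true == label  # samples that truly belong to this class
        recalls.append(np.mean(y_pred[mask] == label))
    return 1.0 - float(np.mean(recalls))


# Toy imbalanced case: 8 samples of class 0, 2 of class 1, everything predicted as 0.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)

accuracy = float(np.mean(y_true == y_pred))  # 0.8, looks acceptable
ber = balanced_error_rate(y_true, y_pred)    # 0.5, reveals the ignored minority class
```

In this toy case accuracy rewards the majority-only prediction, while the balanced error rate weights both classes equally, which is why it is reported alongside accuracy and F1 on imbalanced data.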

ABurrello/Pam50Classification_algorithms
