tsudalab/SHIMR

SHIMR (Sparse High-order Interaction Model with Rejection option)

SHIMR (https://peerj.com/articles/6543/) is a forward feature selection method with simultaneous sample reduction. It iteratively searches the power set of complex features for higher-order feature interactions that maximize the classification gain.
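To make "higher-order feature interaction" concrete, the sketch below builds every conjunction (logical AND) of up to max_order binary features by brute force. It is an illustration only: the function name interaction_features and the toy matrix are assumptions, and SHIMR itself lists the LCM itemset miner as a dependency rather than enumerating combinations exhaustively.

```python
from itertools import combinations

import numpy as np


def interaction_features(X_bin, max_order):
    """Map each feature-index tuple to the AND of those binary columns.

    Hypothetical helper for illustration; not part of the SHIMR code base.
    """
    n_features = X_bin.shape[1]
    features = {}
    for order in range(1, max_order + 1):
        for combo in combinations(range(n_features), order):
            # A sample "has" the interaction only if every member feature is 1.
            features[combo] = X_bin[:, list(combo)].all(axis=1).astype(int)
    return features


# Toy binarized data: 3 samples x 3 binary features (made-up values).
X_bin = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1]])

feats = interaction_features(X_bin, max_order=2)
print(len(feats))              # 3 single features + 3 pairwise interactions
print(feats[(0, 1)].tolist())  # samples where features 0 AND 1 are both 1
```

The number of candidates grows combinatorially with the interaction order, which is why a sparse selection step is needed on top of any such enumeration.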

Sample reduction is achieved by incorporating the notion of "classification with rejection option", which minimizes classification uncertainty, particularly on noisy data. One potential application is clinical diagnosis (or prognosis), where the method can serve as a highly reliable computer-assisted diagnosis (CAD) model. SHIMR identifies the ambiguous, low-confidence zones close to the decision boundary and refrains from making a decision (R: reject) for the data points that fall there (encircled in the accompanying figure). A higher rejection rate (rr) corresponds to a higher prediction probability for the samples that are classified, and hence to more reliable predictions.
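The general idea of a rejection option can be sketched with a plain confidence threshold (Chow's rule): abstain whenever the more likely class has probability below 1 - d, where d is the rejection cost. This is a textbook rule used here as an assumption, not SHIMR's training procedure; the function name and the probabilities are made up. Labels follow the -1 / +1 / 0 (rejected) convention of class_labels_dict.

```python
import numpy as np


def predict_with_rejection(p_pos, d):
    """Return +1 / -1 predictions, or 0 (reject) where confidence < 1 - d.

    Hypothetical helper illustrating Chow's rule for binary classification.
    """
    p_pos = np.asarray(p_pos, dtype=float)
    confidence = np.maximum(p_pos, 1.0 - p_pos)   # probability of the likelier class
    labels = np.where(p_pos >= 0.5, 1, -1)
    labels[confidence < 1.0 - d] = 0              # abstain near the decision boundary
    return labels


probs = [0.05, 0.48, 0.52, 0.70, 0.97]  # made-up P(class = +1) values

# d = 0.5: the threshold is 0.5, so nothing is ever rejected.
print(predict_with_rejection(probs, d=0.5).tolist())   # [-1, -1, 1, 1, 1]

# d = 0.25: samples with confidence below 0.75 are rejected.
print(predict_with_rejection(probs, d=0.25).tolist())  # [-1, 0, 0, 0, 1]
```

Under this rule a smaller d rejects more samples, and the samples that remain are the ones classified with higher confidence.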

Our visualization module complements SHIMR by generating a simple, easily comprehensible visual representation of the model that SHIMR produces. For more details, please refer to our paper published in PeerJ (https://peerj.com/articles/6543/).

Below is a visualization of SHIMR applied to the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). The visualization module represents the classification model generated by SHIMR as a weighted combination of simple rules.

Comparing SHIMR with CORELS using ProPublica datasets.

SHIMR results on ProPublica data (Without rejection)

======= Training Results =======
d=0.5
No of rules selected = 8
correctly_classified:4375, misclassified: 2114, rejected: 0
TP:1618, TN: 2757, FP:826, FN: 1288, SN/RC:0.56, PR:0.66, SP: 0.77
roc_auc: 0.68
area_pr: 0.59
accuracy: 0.67
rejection rate: 0.0

======= Testing Results =======
d=0.5
No of rules selected = 8
correctly_classified:499, misclassified: 222, rejected: 0
TP:210, TN: 289, FP:88, FN: 134, SN/RC:0.61, PR:0.7, SP: 0.77
roc_auc: 0.71
area_pr: 0.63
accuracy: 0.69
rejection rate: 0.0

SHIMR results on ProPublica data (With rejection)

======= Training Results =======
d=0.45
No of rules selected = 17
correctly_classified:3973, misclassified: 1742, rejected: 774
TP:1448, TN: 2525, FP:667, FN: 1075, SN/RC:0.57, PR:0.68, SP: 0.79
roc_auc: 0.68
area_pr: 0.58
accuracy: 0.7
rejection rate: 0.12

======= Testing Results =======
d=0.45
No of rules selected = 17
correctly_classified:458, misclassified: 172, rejected: 91
TP:187, TN: 271, FP:68, FN: 104, SN/RC:0.64, PR:0.73, SP: 0.8
roc_auc: 0.72
area_pr: 0.64
accuracy: 0.73
rejection rate: 0.13
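
The summary figures in the logs above can be re-derived from the raw counts. The sketch below does so for the test set with rejection; the function name summarize is an assumption, and the choice to compute accuracy over classified (non-rejected) samples only is inferred from the logged numbers rather than taken from the SHIMR source.

```python
def summarize(tp, tn, fp, fn, rejected=0):
    """Recompute the logged metrics from confusion-matrix counts.

    Hypothetical helper; metric definitions are inferred from the logs.
    """
    classified = tp + tn + fp + fn
    total = classified + rejected
    return {
        "sensitivity": round(tp / (tp + fn), 2),         # SN/RC
        "precision": round(tp / (tp + fp), 2),           # PR
        "specificity": round(tn / (tn + fp), 2),         # SP
        "accuracy": round((tp + tn) / classified, 2),    # rejected samples excluded
        "rejection_rate": round(rejected / total, 2),
    }


# Counts copied from the "Testing Results" log with rejection (d=0.45).
with_rejection = summarize(tp=187, tn=271, fp=68, fn=104, rejected=91)
print(with_rejection)

# Counts copied from the "Testing Results" log without rejection (d=0.5).
without_rejection = summarize(tp=210, tn=289, fp=88, fn=134)
print(without_rejection)
```

With these definitions the counts reproduce the logged values: rejecting 13% of the test samples raises accuracy on the remaining samples from 0.69 to 0.73.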

CORELS result on ProPublica data

SN_tr = 0.49, SP_tr= 0.78, accuracy_tr = 0.65
SN_te = 0.55, SP_te= 0.8, accuracy_te = 0.68

To reproduce the SHIMR results on the ProPublica datasets, run the "main_ProPublica.py" script:

$ python main_ProPublica.py

To reproduce the CORELS results on the ProPublica datasets, first install CORELS locally by following the instructions at https://github.com/nlarusstone/corels. Then copy the "get_Test_Scores_CORELS.py" script into the "src" folder of corels and run it from inside that folder:

$ python get_Test_Scores_CORELS.py

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

"SHIMR" has the following two dependencies:

  1. CPLEX Optimizer
  2. Linear Time Closed Itemset Miner (LCM v5.3)
        Coded by Takeaki Uno, e-mail:[email protected], homepage: http://research.nii.ac.jp/~uno/code/lcm.html

Apart from these, the current implementation is written in Python and has been tested with the following setup:

  1. Python 3.4.5
  2. scikit-learn==0.19.1
  3. scipy==1.0.0
  4. numpy==1.14.1
  5. pandas==0.22.0
  6. matplotlib==2.0.0
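
A quick way to compare an existing environment against the tested versions is sketched below. The helper check_versions and its report format are assumptions made for illustration; the package names and version numbers are the ones listed above (scikit-learn is imported as sklearn). Other versions may well work; this only flags differences.

```python
import importlib

# Versions listed in this README as the tested setup.
TESTED = {
    "sklearn": "0.19.1",
    "scipy": "1.0.0",
    "numpy": "1.14.1",
    "pandas": "0.22.0",
    "matplotlib": "2.0.0",
}


def check_versions(tested, importer=importlib.import_module):
    """Return {package: (tested, installed, matches)} for each package.

    Hypothetical helper; 'installed' is None if the package cannot be imported.
    """
    report = {}
    for name, wanted in tested.items():
        try:
            found = getattr(importer(name), "__version__", "unknown")
        except ImportError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_versions(TESTED).items():
    status = "OK" if ok else "differs"
    print("{:<12} tested={:<8} installed={} ({})".format(name, wanted, found, status))
```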

Download

"IBM ILOG CPLEX Optimization Studio" from https://www-01.ibm.com/software/websphere/products/optimization/cplex-studio-community-edition/
"LCM ver. 5.3" from http://research.nii.ac.jp/~uno/codes.htm

Installing

Step-by-step instructions for getting a working copy of "SHIMR" in your own development environment.

A. Create a virtual environment

Download "anaconda" from https://www.continuum.io/downloads

  1. Install Anaconda
$ bash Anaconda-latest-Linux-x86_64.sh       (Linux)  or
$ bash Anaconda-latest-MacOSX-x86_64.sh      (Mac)
  2. Activate anaconda environment
$ source anaconda/bin/activate anaconda/
  3. Create a new environment, activate it, and install the Python dependencies
$ conda create -n r_boost python=3.4.5
$ source activate r_boost
$ pip install -r requirements.txt

B. Install "IBM ILOG CPLEX Optimization Studio"

  1. Download the "cplex_studioXXX.linux-x86.bin" (Linux) or "cplex_studioXXX.osx.bin" (Mac) file

Make sure the .bin file is executable. If necessary, change its permissions with the chmod command from the directory where the .bin file is located:

$ chmod +x cplex_studioXXX.linux-x86.bin
  2. Enter the following command to start the installation process:
$ ./cplex_studioXXX.linux-x86.bin
  3. When prompted, provide the following installation path:
    /home/username/ibm/ILOG/CPLEX_StudioXXX
  4. Change directory to the Python folder inside the CPLEX installation path
$ cd /home/username/ibm/ILOG/CPLEX_StudioXXX/cplex/python/3.4/x86-64_linux                 (Linux)  or
$ cd /Users/username/Applications/IBM/ILOG/CPLEX_StudioXXX/cplex/python/3.4/x86-64_osx/     (Mac)
  5. Install the Python version of CPLEX
$ python setup.py install

C. Install "LCM ver. 5.3"

1) Unzip the 'lcm53.zip' archive
2) cd lcm53
3) make

Running the tests

To test SHIMR, we included the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository under the Data folder. Run 'code/main_WDBC.ipynb' interactively to see the sparse high-order feature interactions generated by our visualization module. SHIMR can also be run from the command line via 'main.py'. Run it with the help flag [-h] to see the arguments it expects.

$ python main.py -h
usage: main.py [-h] [-d D] [-n_bins N_BINS] [-c_pos C_POS] [-c_neg C_NEG]
               [-size_u SIZE_U] [-r] [-v] [-pd] [-pa]
               f_data

Usage of SHIMR

positional arguments:
  f_data          File path of input data to SHIMR. File format should be
                  ".npy". The file should contain data in the format of
                  "[data_train, data_test, Feature_dict, class_labels_dict]".
                  Feature_dict is an ordered dictionary (collections.OrderedDict()) 
                  to provide a short name of feature (Key) if it has long name (Value).
                  A typical example can be wdbc_dict["Rad_M"]= "Radius Mean".
                  class_labels_dict is a class labels dictionary. A typical example can
                  be "class_labels_dict={-1:"Benign", +1:"Malignant", 0:"Rejected"}".

optional arguments:
  -h, --help      show this help message and exit
  -d D            Set rejection cost
  -n_bins N_BINS  Set number of bins
  -c_pos C_POS    Set regularization parameter value for positive class
  -c_neg C_NEG    Set regularization parameter value for negative class
  -size_u SIZE_U  Set the order of feature interaction
  -r              To apply rejection option
  -v              To generate visualization
  -pd             To display the plot (default: File saved)
  -pa             To generate visualization for all subjects
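
The input format described in the help text can be sketched as follows. Only the four-element layout and the two dictionary examples come from the help text; the array contents, the array shapes, the file name, and the feature short names other than "Rad_M" are placeholders, and the exact layout SHIMR expects for data_train and data_test (for example, where the class labels live) is not specified here, so check 'main.py' before relying on it.

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np

# Placeholder data: the real layout of these arrays is an assumption.
data_train = np.zeros((8, 3))
data_test = np.zeros((4, 3))

# Short feature name (key) -> long feature name (value), as in the help text.
feature_dict = OrderedDict()
feature_dict["Rad_M"] = "Radius Mean"
feature_dict["Tex_M"] = "Texture Mean"      # hypothetical short name
feature_dict["Per_M"] = "Perimeter Mean"    # hypothetical short name

class_labels_dict = {-1: "Benign", +1: "Malignant", 0: "Rejected"}

# Pack the four items into a 1-D object array so np.save keeps them as-is.
bundle = np.empty(4, dtype=object)
bundle[0] = data_train
bundle[1] = data_test
bundle[2] = feature_dict
bundle[3] = class_labels_dict

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_input.npy")  # hypothetical file name
    np.save(path, bundle, allow_pickle=True)

    # Object arrays are pickled, so loading needs allow_pickle=True as well.
    loaded = np.load(path, allow_pickle=True)

print(loaded.shape)           # four entries: train, test, features, labels
print(list(loaded[2].keys()))
```

A file built this way would be passed to 'main.py' as the positional f_data argument.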
  

Visualization module

Our visualization module was inspired by "UpSet: Visualizing Intersecting Sets" (http://caleydo.org/tools/upset/) and its Python implementation (https://github.com/ImSoErgodic/py-upset).

Published Article

"An interpretable machine learning model for diagnosis of Alzheimer's disease" (https://peerj.com/articles/6543/).
