Commit 4c007aa
committed Jun 20, 2020
enh: add group variable, regression test, update README
1 parent f283288 commit 4c007aa

6 files changed: +109 −43 lines changed
‎README.md

Lines changed: 50 additions & 21 deletions

@@ -29,22 +29,22 @@ pip install pydra-ml
 
 This repo installs `pydraml` a CLI to allow usage without any programming.
 
 To test the CLI for a classification example, copy the `pydra_ml/tests/data/breast_cancer.csv` and
 `short-spec.json.sample` to a folder and run.
 
 ```
 $ pydraml -s short-spec.json.sample
 ```
-To check a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and `diabetes_spec.json`
-to a folder and run.
+To check a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and
+`diabetes_spec.json` to a folder and run.
 
 ```
 $ pydraml -s diabetes_spec.json
 ```
 
-For each case pydra-ml will generate a result folder with the spec file name that includes
-`test-{metric}-{timestamp}.png` file for each metric together with a pickled results file
-containing all the scores from the model evaluations.
+For each case pydra-ml will generate a result folder with the spec file name that
+includes a `test-{metric}-{timestamp}.png` file for each metric together with a
+pickled results file containing all the scores from the model evaluations.
 
 ```
 $ pydraml --help
@@ -82,14 +82,17 @@ will want to generate `x_indices` programmatically.
 group.
 - *x_indices*: Numeric (0-based) or string list of columns to use as input features
 - *target_vars*: String list of target variable (at present only one is supported)
+- *group_var*: String to indicate the column to use for grouping
 - *n_splits*: Number of shuffle split iterations to use
 - *test_size*: Fraction of data to use for test set in each iteration
 - *clf_info*: List of scikit-learn classifiers to use.
 - *permute*: List of booleans to indicate whether to generate a null model or not
 - *gen_shap*: Boolean indicating whether shap values are generated
 - *nsamples*: Number of samples to use for shap estimation
 - *l1_reg*: Type of regularizer to use for shap estimation
-- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16 or 0.1 for top 10%). Set to 1.0 (float) to plot all features or 1 (int) to plot top first feature.
+- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16
+  or 0.1 for the top 10%). Set to 1.0 (float) to plot all features or 1 (int) to
+  plot only the top feature.
 - *metrics*: scikit-learn metric to use
 
 ## `clf_info` specification
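The hunk header above mentions that users "will want to generate `x_indices` programmatically". A minimal sketch of one way to do that from a CSV header, using only the standard library; the column names here are made up for illustration:

```python
import csv
import io

# Hypothetical CSV header; in practice, read the first row of your data file.
header = next(csv.reader(io.StringIO("mean_radius,mean_texture,target\n")))

# 0-based indices of every column except the target column.
x_indices = [i for i, col in enumerate(header) if col != "target"]
print(x_indices)  # [0, 1]
```

The resulting list can be dropped into the `x_indices` field of a spec file instead of typing the indices by hand.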
@@ -113,6 +116,7 @@ then an empty dictionary **MUST** be provided as parameter 3.
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 100,
 "test_size": 0.2,
 "clf_info": [
@@ -140,25 +144,46 @@ then an empty dictionary **MUST** be provided as parameter 3.
 
 ## Output:
 The workflow will output:
-- `results-{timestamp}.pkl` containing 1 list per model used. For example, if assigned to variable `results`, it is accessed through `results[0]` to `results[N]`
-(if `permute: [false,true]` then it will output the model trained on the labels first `results[0]` and the model trained on permuted labels second `results[1]`.
+- `results-{timestamp}.pkl` containing 1 list per model used. For example, if
+  assigned to variable `results`, it is accessed through `results[0]` to `results[N]`
+  (if `permute: [false, true]`, it will output the model trained on the labels
+  first, `results[0]`, and the model trained on permuted labels second, `results[1]`).
 Each model contains:
-  - `dict` accesed through `results[0][0]` with model information: `{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier', {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
-  - `pydra Result obj` accesed through `results[0][1]` with attribute `output` which itself has attributes:
+  - `dict` accessed through `results[0][0]` with model information:
+    `{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier',
+    {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
+  - `pydra Result obj` accessed through `results[0][1]` with attribute `output`,
+    which itself has attributes:
     - `feature_names`: from the columns of the data csv.
 And the following attributes organized in N lists for N bootstrapping samples:
     - `output`: N lists, each one with two lists for true and predicted labels.
     - `score`: N lists each one containing M different metric scores.
-    - `shaps`: N lists each one with a list of shape (P,F) where P is the amount of predictions and F the different SHAP values for each feature. `shaps` is empty if `gen_shap` is set to `false` or if `permute` is set to true.
-- One figure per metric with performance distribution across splits (with or without null distribution trained on permuted labels)
+    - `shaps`: N lists, each one with a list of shape (P, F) where P is the
+      number of predictions and F the SHAP values for each feature.
+      `shaps` is empty if `gen_shap` is set to `false` or if `permute` is set
+      to `true`.
+- One figure per metric with the performance distribution across splits (with or
+  without a null distribution trained on permuted labels)
 - `shap-{timestamp}` dir
   - SHAP values are computed for each prediction in each split's test set
-    (e.g., 30 bootstrapping splits with 100 prediction will create (30,100) array). The mean is taken across predictions for each split (e.g., resulting in a (64,30) array for 64 features and 30 bootstrapping samples).
-  - For binary classification, a more accurate display of feature importance obtained by splitting predictions into TP, TN, FP, and FN,
-    which in turn can allow for error auditing (i.e., what a model pays attention to when making incorrect/false predictions)
-  - `quadrant_indexes.pkl`: The TP, TN, FP, FN indexes are saved in as a `dict` with one `key` per model (permuted models without SHAP values will be skipped automatically), and each key `values` being a bootstrapping split.
-  - `summary_values_shap_{model_name}_{prediction_type}.csv` contains all SHAP values and summary statistics ranked by the mean SHAP value across bootstrapping splits. A sample_n column can be empty or NaN if this split did not have the type of prediction in the filename (e.g., you may not have FNs or FPs in a given split with high performance).
-  - `summary_shap_{model_name}_{plot_top_n_shap}.png` contains SHAP value summary statistics for all features (set to 1.0) or only the top N most important features for better visualization.
+    (e.g., 30 bootstrapping splits with 100 predictions will create a (30, 100)
+    array). The mean is taken across predictions for each split (e.g., resulting
+    in a (64, 30) array for 64 features and 30 bootstrapping samples).
+  - For binary classification, a more accurate display of feature importance is
+    obtained by splitting predictions into TP, TN, FP, and FN, which in turn can
+    allow for error auditing (i.e., what a model pays attention to when making
+    incorrect/false predictions)
+  - `quadrant_indexes.pkl`: the TP, TN, FP, and FN indexes are saved as a
+    `dict` with one `key` per model (permuted models without SHAP values will
+    be skipped automatically), and each key's `values` being a bootstrapping split.
+  - `summary_values_shap_{model_name}_{prediction_type}.csv` contains all
+    SHAP values and summary statistics ranked by the mean SHAP value across
+    bootstrapping splits. A `sample_n` column can be empty or NaN if a split
+    did not have the type of prediction in the filename (e.g., you may not
+    have FNs or FPs in a given split with high performance).
+  - `summary_shap_{model_name}_{plot_top_n_shap}.png` contains SHAP value
+    summary statistics for all features (set to 1.0) or only the top N most
+    important features for better visualization.
 
 
 ## Developer installation
@@ -171,10 +196,14 @@ cd pydra-ml
 pip install -e .[dev]
 ```
 
-It is also useful to install pre-commit:
+It is also useful to install pre-commit, which takes care of styling when
+committing code. When pre-commit is used you may have to run git commit twice,
+since pre-commit may make additional changes to your code for styling and will
+not commit these changes by default:
+
 ```
 pip install pre-commit
-pre-commit
+pre-commit install
 ```
 
 ### Project structure
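The nesting described in the Output section of the README diff can be previewed with a plain-Python mock. The structure below mirrors that description; the real object at index `[i][1]` is a pydra `Result`, replaced here by a placeholder:

```python
import pickle

# Mock of the documented structure: one entry per model;
# results[i][0] is the spec dict, results[i][1] the pydra Result (placeholder).
results = [
    (
        {
            "ml_wf.clf_info": [
                "sklearn.neural_network",
                "MLPClassifier",
                {"alpha": 1, "max_iter": 1000},
            ],
            "ml_wf.permute": False,
        },
        None,
    ),
]

# In real use the list comes from the pickle instead:
# results = pickle.load(open("results-<timestamp>.pkl", "rb"))
clf_name = results[0][0]["ml_wf.clf_info"][1]
print(clf_name)  # MLPClassifier
```

This is only a sketch of the indexing pattern; attribute names under `output` (`feature_names`, `score`, `shaps`) should be checked against your installed pydra-ml version.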

‎diabetes_spec.json

Lines changed: 3 additions & 2 deletions

@@ -1,6 +1,7 @@
-{"filename": "./diabetes_table.csv",
+{"filename": "diabetes_table.csv",
 "x_indices": [0,1,2,3,4,5,6,7,8,9],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 4,
 "test_size": 0.2,
 "clf_info": [
@@ -14,4 +15,4 @@
 "l1_reg": "aic",
 "plot_top_n_shap": 10,
 "metrics":["explained_variance_score","mean_squared_error","mean_absolute_error"]
-}
+}

‎long-spec.json.sample

Lines changed: 1 addition & 0 deletions

@@ -2,6 +2,7 @@
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 100,
 "test_size": 0.2,
 "clf_info": [

‎pydra_ml/classifier.py

Lines changed: 1 addition & 0 deletions

@@ -55,6 +55,7 @@ def gen_workflow(inputs, cache_dir=None, cache_locations=None):
             filename=wf.lzin.filename,
             x_indices=wf.lzin.x_indices,
             target_vars=wf.lzin.target_vars,
+            group=wf.lzin.group_var,
         )
     )
     wf.add(

‎pydra_ml/tests/test_classifier.py

Lines changed: 53 additions & 20 deletions

@@ -1,30 +1,63 @@
 import os
 from ..classifier import gen_workflow, run_workflow
 
-clfs = [
-    ("sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}),
-    ("sklearn.naive_bayes", "GaussianNB", {}),
-]
-csv_file = os.path.join(os.path.dirname(__file__), "data", "breast_cancer.csv")
-inputs = {
-    "filename": csv_file,
-    "x_indices": range(30),
-    "target_vars": ("target",),
-    "n_splits": 2,
-    "test_size": 0.2,
-    "clf_info": clfs,
-    "permute": [True, False],
-    "gen_shap": True,
-    "nsamples": 5,
-    "l1_reg": "aic",
-    "plot_top_n_shap": 16,
-    "metrics": ["roc_auc_score", "accuracy_score"],
-}
-
 
 def test_classifier(tmpdir):
+    clfs = [
+        ("sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}),
+        ("sklearn.naive_bayes", "GaussianNB", {}),
+    ]
+    csv_file = os.path.join(os.path.dirname(__file__), "data", "breast_cancer.csv")
+    inputs = {
+        "filename": csv_file,
+        "x_indices": range(30),
+        "target_vars": ("target",),
+        "group_var": None,
+        "n_splits": 2,
+        "test_size": 0.2,
+        "clf_info": clfs,
+        "permute": [True, False],
+        "gen_shap": True,
+        "nsamples": 5,
+        "l1_reg": "aic",
+        "plot_top_n_shap": 16,
+        "metrics": ["roc_auc_score", "accuracy_score"],
+    }
     wf = gen_workflow(inputs, cache_dir=tmpdir)
     results = run_workflow(wf, "cf", {"n_procs": 1})
     assert results[0][0]["ml_wf.clf_info"][1] == "MLPClassifier"
     assert results[0][0]["ml_wf.permute"]
     assert results[0][1].output.score[0][0] < results[1][1].output.score[0][0]
+
+
+def test_regressor(tmpdir):
+    clfs = [
+        ("sklearn.neural_network", "MLPRegressor", {"alpha": 1, "max_iter": 1000}),
+        (
+            "sklearn.linear_model",
+            "LinearRegression",
+            {"fit_intercept": True, "normalize": True},
+        ),
+    ]
+    csv_file = os.path.join(os.path.dirname(__file__), "data", "diabetes_table.csv")
+    inputs = {
+        "filename": csv_file,
+        "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+        "target_vars": ["target"],
+        "group_var": None,
+        "n_splits": 2,
+        "test_size": 0.2,
+        "clf_info": clfs,
+        "permute": [True, False],
+        "gen_shap": True,
+        "nsamples": 5,
+        "l1_reg": "aic",
+        "plot_top_n_shap": 10,
+        "metrics": ["explained_variance_score"],
+    }
+
+    wf = gen_workflow(inputs, cache_dir=tmpdir)
+    results = run_workflow(wf, "cf", {"n_procs": 1})
+    assert results[0][0]["ml_wf.clf_info"][1] == "MLPRegressor"
+    assert results[0][0]["ml_wf.permute"]
+    assert results[0][1].output.score[0][0] < results[1][1].output.score[0][0]

‎short-spec.json.sample

Lines changed: 1 addition & 0 deletions

@@ -2,6 +2,7 @@
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
+"group_var": null,
 "n_splits": 2,
 "test_size": 0.2,
 "clf_info": [

0 commit comments