```
pip install pydra-ml
```
This repo installs `pydraml`, a CLI that allows usage without any programming.

To test the CLI on a classification example, copy `pydra_ml/tests/data/breast_cancer.csv` and
`short-spec.json.sample` to a folder and run:

```
$ pydraml -s short-spec.json.sample
```
To check a regression example, copy `pydra_ml/tests/data/diabetes_table.csv` and
`diabetes_spec.json` to a folder and run:

```
$ pydraml -s diabetes_spec.json
```
For each case, pydra-ml will generate a results folder with the spec file name that
includes a `test-{metric}-{timestamp}.png` file for each metric, together with a
pickled results file containing all the scores from the model evaluations.

```
$ pydraml --help
```
will want to generate `x_indices` programmatically.
group.

- *x_indices*: Numeric (0-based) or string list of columns to use as input features
- *target_vars*: String list of target variables (at present only one is supported)
- *group_var*: String indicating the column to use for grouping
- *n_splits*: Number of shuffle-split iterations to use
- *test_size*: Fraction of data to use for the test set in each iteration
- *clf_info*: List of scikit-learn classifiers to use
- *permute*: List of booleans indicating whether or not to generate a null model
- *gen_shap*: Boolean indicating whether SHAP values are generated
- *nsamples*: Number of samples to use for SHAP estimation
- *l1_reg*: Type of regularizer to use for SHAP estimation
- *plot_top_n_shap*: Number or proportion of top SHAP values to plot (e.g., 16, or 0.1 for the top 10%). Set to 1.0 (float) to plot all features, or 1 (int) to plot only the top feature.
- *metrics*: scikit-learn metrics to use
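For orientation, the fields above combine into a plain JSON spec file. The sketch below is illustrative only — the values and the exact set of required keys are assumptions; the shipped `short-spec.json.sample` and `diabetes_spec.json` are the authoritative examples:

```json
{
  "x_indices": [0, 1, 2, 3],
  "target_vars": ["target"],
  "n_splits": 30,
  "test_size": 0.2,
  "clf_info": [
    ["sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}],
    ["sklearn.ensemble", "RandomForestClassifier", {}]
  ],
  "permute": [false, true],
  "gen_shap": true,
  "nsamples": 100,
  "l1_reg": "aic",
  "plot_top_n_shap": 16,
  "metrics": ["roc_auc_score"]
}
```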
## `clf_info` specification
then an empty dictionary **MUST** be provided as parameter 3.
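As a hedged sketch of that rule (parameter values are illustrative, not defaults): the second entry below uses only scikit-learn defaults, so it carries the mandatory empty dictionary as parameter 3.

```json
[
  ["sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}],
  ["sklearn.ensemble", "RandomForestClassifier", {}]
]
```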
## Output:

The workflow will output:

- `results-{timestamp}.pkl` containing one list per model used. For example, if
assigned to the variable `results`, the models are accessed through `results[0]` to
`results[N]` (if `permute: [false, true]`, it will output the model trained on the
true labels first, `results[0]`, and the model trained on permuted labels second,
`results[1]`).
Each model contains:
- `dict`, accessed through `results[0][0]`, with model information:
`{'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier', {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}`
- pydra `Result` object, accessed through `results[0][1]`, with an attribute `output`,
which itself has the attributes:
- `feature_names`: from the columns of the data csv.

And the following attributes, organized in N lists for N bootstrapping samples:

- `output`: N lists, each one with two lists for true and predicted labels.
- `score`: N lists, each one containing M different metric scores.
- `shaps`: N lists, each one with a list of shape (P, F), where P is the number of
predictions and F the different SHAP values for each feature. `shaps` is empty if
`gen_shap` is set to `false` or if `permute` is set to `true`.

- One figure per metric with the performance distribution across splits (with or
without a null distribution trained on permuted labels)
- `shap-{timestamp}` dir
  - SHAP values are computed for each prediction in each split's test set (e.g.,
    30 bootstrapping splits with 100 predictions will create a (30, 100) array).
    The mean is taken across predictions for each split (e.g., resulting in a
    (64, 30) array for 64 features and 30 bootstrapping samples).
  - For binary classification, a more accurate display of feature importance is
    obtained by splitting predictions into TP, TN, FP, and FN, which in turn
    allows for error auditing (i.e., what a model pays attention to when making
    incorrect/false predictions).
  - `quadrant_indexes.pkl`: the TP, TN, FP, and FN indexes are saved as a `dict`
    with one `key` per model (permuted models without SHAP values are skipped
    automatically), and each key's `values` organized by bootstrapping split.
  - `summary_values_shap_{model_name}_{prediction_type}.csv` contains all SHAP
    values and summary statistics, ranked by the mean SHAP value across
    bootstrapping splits. A `sample_n` column can be empty or NaN if that split
    did not have the type of prediction in the filename (e.g., you may not have
    FNs or FPs in a given split with high performance).
  - `summary_shap_{model_name}_{plot_top_n_shap}.png` contains SHAP value summary
    statistics for all features (set to 1.0) or only the top N most important
    features, for better visualization.
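The pickle layout and the SHAP aggregation described above can be sketched with a synthetic stand-in. Shapes and key names follow this README; pydra-ml itself stores pydra `Result` objects here, and the random data below is purely illustrative:

```python
# Sketch of the results-{timestamp}.pkl layout and the per-split SHAP
# aggregation described above, using synthetic stand-ins for the real objects.
import pickle
from types import SimpleNamespace

import numpy as np

rng = np.random.default_rng(0)
n_splits, n_preds, n_features = 30, 100, 64

# Stand-in for one (model-info dict, pydra Result) pair.
model_info = {
    "ml_wf.clf_info": ["sklearn.neural_network", "MLPClassifier",
                       {"alpha": 1, "max_iter": 1000}],
    "ml_wf.permute": False,
}
output = SimpleNamespace(
    feature_names=[f"feat_{i}" for i in range(n_features)],
    output=[[[0, 1], [0, 1]] for _ in range(n_splits)],  # [true, predicted] per split
    score=[[0.9, 0.85] for _ in range(n_splits)],        # M metric scores per split
    shaps=[rng.normal(size=(n_preds, n_features))        # (P, F) SHAPs per split
           for _ in range(n_splits)],
)
results = [(model_info, SimpleNamespace(output=output))]

# Round-trip mimics loading the pickled results file.
info, res = pickle.loads(pickle.dumps(results))[0]
print(info["ml_wf.clf_info"][1])  # MLPClassifier

# Aggregation as described: mean across predictions within each split,
# stacked into a features-by-splits array.
per_split_means = np.stack([s.mean(axis=0) for s in res.output.shaps], axis=1)
print(per_split_means.shape)  # (64, 30)
```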
## Developer installation
```
cd pydra-ml
pip install -e .[dev]
```
It is also useful to install pre-commit, which takes care of styling when
committing code. When pre-commit is used, you may have to run `git commit` twice,
since pre-commit may make additional changes to your code for styling and will