bc auroc sampling error calculation
nikml committed Apr 12, 2024
1 parent ab7e79d commit dd12310
Showing 2 changed files with 41 additions and 4 deletions.
32 changes: 32 additions & 0 deletions nannyml/base.py
@@ -614,3 +614,35 @@ def _raise_exception_for_negative_values(column: pd.Series):
"\tLog-based metrics are not supported for negative target values.\n"
f"\tCheck '{column.name}' at rows {str(negative_item_indices)}."
)

def common_nan_removal(data: pd.DataFrame, selected_columns: List[str]) -> Tuple[List[pd.Series], bool]:

@michael-nml (Collaborator) commented on Apr 12, 2024:

Functionality seems to largely overlap with _remove_nans. Maybe we could just use that function?
As far as I can tell, the extra functionality this one provides (checking for column presence and for an empty result) isn't even used.

@nnansters (Contributor) commented on Apr 12, 2024:

This is indeed the n-th time we've added a "remove NaN" function; I'm in favor of reusing the existing ones. Also, this one is not exactly "common", since it is only used once :-)

@nikml (Author, Contributor) commented on Apr 19, 2024:

Just saw the comment. I am removing all instances of _remove_nans from the code; this function should supersede it. I have also updated it after our discussion with Niels so that it can fill that role.

@michael-nml (Collaborator) commented on Apr 19, 2024:

Two notes to make sure you're aware (see the sketch after this list):

  • This common_nan_removal function may behave differently from _remove_nans depending on the arguments. I specifically added the ability for _remove_nans to drop a row only if an entire combination of columns is NaN, e.g. for multiclass problems, only drop the row if all probability columns are NaN. I don't think the new common_nan_removal function covers that case as it currently stands.
  • I think this function currently doesn't call infer_objects after NaNs have been dropped. That is important to ensure the dtypes are set correctly, so you may want to add it to the new function.
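
A minimal pandas sketch of the two behaviours described above, using made-up multiclass probability columns (illustrative only, not the actual _remove_nans implementation):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'y_pred_proba_a': [0.1, np.nan, np.nan],
    'y_pred_proba_b': [0.9, np.nan, 0.4],
})
prob_cols = ['y_pred_proba_a', 'y_pred_proba_b']

# Behaviour of common_nan_removal as committed: drop a row if ANY selected column is NaN.
drop_any = df.dropna(axis=0, how='any', subset=prob_cols)  # keeps only row 0

# The _remove_nans behaviour described above: drop a row only if ALL selected
# columns are NaN (e.g. every multiclass probability column is missing).
drop_all = df.dropna(axis=0, how='all', subset=prob_cols)  # keeps rows 0 and 2

# Re-inferring dtypes after the drop keeps column dtypes sensible, which is the
# point of the infer_objects note above.
drop_all = drop_all.reset_index(drop=True).infer_objects()
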
"""Remove NaN values from rows of selected columns.
Parameters
----------
data: pd.DataFrame
Pandas dataframe containing data.
selected_columns: List[str]
List containing the strings of column names
Returns
-------
col_list:
List containing the clean columns specified. Order of columns from selected_columns is
preserved.
empty:
Boolean whether the resulting data are contain any rows (false) or not (true)
"""
    # All selected columns must be present in the provided data.
    if not set(selected_columns) <= set(data.columns):
        raise InvalidArgumentsException(
            f"Selected columns: {selected_columns} not all present in provided data columns {list(data.columns)}"
        )
    df = data[selected_columns].dropna(axis=0, how='any', inplace=False).reset_index(drop=True)
    # The result is empty when every row had a NaN in at least one of the selected columns.
    empty: bool = df.shape[0] == 0
    results = [df[col] for col in selected_columns]
    return results, empty
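
For illustration, a quick usage sketch of the new helper (the values and column names are made up):

import numpy as np
import pandas as pd

from nannyml.base import common_nan_removal

data = pd.DataFrame({
    'y_true': [0, 1, np.nan, 1],
    'y_pred_proba': [0.2, np.nan, 0.7, 0.9],
})

# The two rows that have a NaN in either selected column are dropped.
(y_true, y_pred_proba), empty = common_nan_removal(data, ['y_true', 'y_pred_proba'])
assert not empty
assert len(y_true) == len(y_pred_proba) == 2
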
13 changes: 9 additions & 4 deletions nannyml/sampling_error/binary_classification.py
@@ -22,6 +22,7 @@
from sklearn.metrics import average_precision_score

from nannyml.exceptions import InvalidArgumentsException
from nannyml.base import common_nan_removal

# How many experiments to perform when doing resampling to approximate sampling error.
N_EXPERIMENTS = 50
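
For context, a generic sketch of what "resampling to approximate sampling error" means here; this is illustrative only, not NannyML's actual implementation, and the helper name is hypothetical:

import numpy as np
from sklearn.metrics import roc_auc_score

def _approximate_auroc_sampling_error(y_true, y_pred_proba, sample_size, n_experiments=N_EXPERIMENTS):
    """Draw n_experiments resamples of sample_size rows and take the std of AUROC across them."""
    rng = np.random.default_rng(42)
    y_true, y_pred_proba = np.asarray(y_true), np.asarray(y_pred_proba)
    scores = []
    for _ in range(n_experiments):
        idx = rng.integers(0, len(y_true), size=sample_size)
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUROC is undefined when a resample contains a single class
        scores.append(roc_auc_score(y_true[idx], y_pred_proba[idx]))
    return np.std(scores)
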
@@ -49,14 +50,18 @@ def auroc_sampling_error_components(y_true_reference: pd.Series, y_pred_proba_re
    -------
    (std, fraction): Tuple[np.ndarray, float]
    """

    y_true = y_true_reference.copy().reset_index(drop=True)
    y_pred_proba = y_pred_proba_reference.copy().reset_index(drop=True)
    # Remove rows where either series contains a NaN, conforming to the common_nan_removal API.
    df = pd.DataFrame({
        'y_true': y_true_reference,
        'y_pred_proba': y_pred_proba_reference,
    })
    [y_true, y_pred_proba], empty = common_nan_removal(df, ['y_true', 'y_pred_proba'])
    y_true = y_true.to_numpy()
    y_pred_proba = y_pred_proba.to_numpy()

    if np.mean(y_true) > 0.5:
        y_true = abs(np.asarray(y_true) - 1)
        y_pred_proba = 1 - y_pred_proba

    sorted_idx = np.argsort(y_pred_proba)
    y_pred_proba = y_pred_proba[sorted_idx]
    y_true = y_true[sorted_idx]
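A rough usage sketch of the changed function, with made-up reference data, just to show that pairs containing a NaN are now filtered out before the components are computed:

import numpy as np
import pandas as pd

from nannyml.sampling_error.binary_classification import auroc_sampling_error_components

y_true_reference = pd.Series([0, 1, 1, 0, np.nan, 1, 0, 1])
y_pred_proba_reference = pd.Series([0.1, 0.8, np.nan, 0.3, 0.5, 0.9, 0.2, 0.7])

# Rows 2 and 4 each contain a NaN, so the components are computed on the six complete pairs.
components = auroc_sampling_error_components(y_true_reference, y_pred_proba_reference)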
