Support simultaneous stratification by y and stratify_cols in train_val_test_split method #28

lshpaner · 2024-08-26T21:51:39Z

Description:

Currently, the train_val_test_split method stratifies either by y (stratify_y) or by specified columns (stratify_cols), but not both at the same time. When both are set, stratify_cols takes precedence and stratify_y is silently ignored. Some use cases need stratification by both the target variable (y) and specific columns to get a balanced, representative split across data segments.

Proposed Enhancement:

Look for the following #TODO in the code:

        ## TODO: need to either consolidate stratification into one input or
        ## allow for simultaneous usage of stratify_cols and stratify_y inputs.

Modify the method to support simultaneous stratification by both y and stratify_cols. This can be achieved by combining the stratification keys or implementing logic that ensures both y and the specified columns are considered during the stratification process.
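
As a rough illustration of "combining the stratification keys", here is a minimal sketch. The helper name build_stratify_key and the toy data are hypothetical and not part of the library; it only assumes that X is a pandas DataFrame and y is a pandas Series sharing the same index.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_cols=None, stratify_y=False):
    """Hypothetical helper: return a combined stratification key, or None."""
    parts = []
    if stratify_cols:
        # Selected feature columns contribute to the key.
        parts.append(X[stratify_cols])
    if stratify_y:
        # The target contributes to the key as one more column.
        parts.append(y)
    if not parts:
        return None
    # pd.concat aligns on the index, so X and y must share the same index.
    return pd.concat(parts, axis=1)


# Toy data, invented for illustration only.
X = pd.DataFrame({"sex": ["F", "M", "F", "M"], "age": [31, 45, 27, 52]})
y = pd.Series([0, 1, 1, 0], name="outcome")

key = build_stratify_key(X, y, stratify_cols=["sex"], stratify_y=True)
print(key)
```

Each distinct row of the resulting key is one stratum, so every combination of column values and target value has to occur often enough to be divided between the splits.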

Current Method Implementation:

def train_val_test_split(
    self,
    X,
    y,
    stratify_y,
    train_size,
    validation_size,
    test_size,
    random_state,
    stratify_cols,
    calibrate,
):

    # if calibrate:
    #     X = X.join(self.dropped_strat_cols)
    # Determine the stratify parameter based on stratify_y and stratify_cols

    ## TODO: need to either consolidate stratification into one input or
    ## allow for simultaneous usage of stratify_cols and stratify_y inputs.
    if stratify_cols:
        # Creating stratification columns out of stratify_cols list
        stratify_key = X[stratify_cols]
    elif stratify_y:
        stratify_key = y
    else:
        stratify_key = None

    if self.drop_strat_feat:
        self.dropped_strat_cols = X[self.drop_strat_feat]
        X = X.drop(columns=self.drop_strat_feat)

    X_train, X_valid_test, y_train, y_valid_test = train_test_split(
        X,
        y,
        test_size=1 - train_size,
        stratify=stratify_key,  # Use stratify_key here
        random_state=random_state,
    )

    # Determine the proportion of validation to test size in the remaining dataset
    proportion = test_size / (validation_size + test_size)

    if stratify_cols:
        strat_key_val_test = X_valid_test[stratify_cols]
    elif stratify_y:
        strat_key_val_test = y_valid_test
    else:
        strat_key_val_test = None

    # Further split (validation + test) set into validation and test sets
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_valid_test,
        y_valid_test,
        test_size=proportion,
        stratify=strat_key_val_test,
        random_state=random_state,
    )

    return X_train, X_valid, X_test, y_train, y_valid, y_test
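
The size arithmetic in the two-stage split above can be checked in isolation. This is only a sketch: effective_fractions is a hypothetical name, and it ignores the rounding to whole rows that train_test_split performs.

```python
def effective_fractions(train_size, validation_size, test_size):
    """Return the (train, validation, test) fractions the two-stage split yields."""
    # Stage 1 holds out everything that is not training data.
    holdout = 1 - train_size
    # Stage 2 gives this share of the holdout to the test set.
    proportion = test_size / (validation_size + test_size)
    return train_size, holdout * (1 - proportion), holdout * proportion


# Sizes that sum to 1 come back as requested (up to floating-point error).
print(effective_fractions(0.6, 0.2, 0.2))

# Sizes that do not sum to 1 are rescaled to fit the holdout.
print(effective_fractions(0.8, 0.1, 0.2))
```

In the second call the requested validation and test sizes add up to 0.3, but only 0.2 of the data is held out, so the effective fractions come out at about 0.067 and 0.133. The method therefore relies on train_size, validation_size and test_size summing to 1.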

elemets · 2024-08-27T16:47:50Z

This issue should now be resolved with the latest push to the stratify_fix branch.

The code now concatenates the stratify_cols columns of X with y when both are specified, e.g. stratify_y=True and stratify_cols=['col_example']. We discussed merging the two options into a single input, but because the data is already split into X and y by the time it reaches the model tuner, that would add unnecessary complexity.

    if stratify_cols and stratify_y:
        strat_key_val_test = pd.concat(
            [X_valid_test[stratify_cols], y_valid_test], axis=1
        )
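
For context, a standalone sketch of how a concatenated key might be used in both stages of the split. This is not the code on the stratify_fix branch: the function two_stage_split and the toy data are hypothetical, and the key is rebuilt from the held-out rows before the second split, mirroring the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def two_stage_split(X, y, stratify_cols, train_size, validation_size,
                    test_size, random_state=0):
    """Hypothetical sketch: stratify both splits on stratify_cols plus y."""
    # Stage 1 key: the chosen columns and the target, side by side.
    key = pd.concat([X[stratify_cols], y], axis=1)
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=1 - train_size, stratify=key,
        random_state=random_state,
    )
    # Stage 2 key: rebuilt from the held-out rows only.
    key_hold = pd.concat([X_hold[stratify_cols], y_hold], axis=1)
    proportion = test_size / (validation_size + test_size)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_hold, y_hold, test_size=proportion, stratify=key_hold,
        random_state=random_state,
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# Toy data: 2 sites x 2 labels = 4 strata with 20 rows each.
X = pd.DataFrame({"site": [0, 1] * 40, "x": range(80)})
y = pd.Series([0, 0, 1, 1] * 20, name="label")

X_train, X_valid, X_test, y_train, y_valid, y_test = two_stage_split(
    X, y, stratify_cols=["site"],
    train_size=0.6, validation_size=0.2, test_size=0.2,
)

# Count the rows per (site, label) stratum in the test set.
print(pd.concat([X_test["site"], y_test], axis=1).value_counts())
```

Every combination of stratify_cols values and target value needs at least two rows at each stage; otherwise train_test_split raises a ValueError because the least populated class is too small to divide.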

elemets closed this as completed Aug 27, 2024
