Support simultaneous stratification by y and stratify_cols in train_val_test_split method #28

Closed
lshpaner opened this issue Aug 26, 2024 · 1 comment

Comments

@lshpaner
Collaborator

lshpaner commented Aug 26, 2024

Description:

Currently, the train_val_test_split method stratifies either by y (stratify_y) or by specified columns (stratify_cols), but not both at the same time: when both are supplied, stratify_cols takes precedence and stratify_y is ignored. Some use cases need stratification by both the target variable (y) and specific columns to ensure a balanced and representative split across different data segments.

Proposed Enhancement:

Look for the following #TODO in the code:

        ## TODO: need to either consolidate stratification into one input or 
        ## alow for simultaneous usage of stratify_cols and stratify_y inputs.

Modify the method to support simultaneous stratification by both y and stratify_cols. This can be achieved by combining the stratification keys or implementing logic that ensures both y and the specified columns are considered during the stratification process.
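
One way to read "combining the stratification keys" is sketched below. This is only an illustration, not the project's code: the helper name build_stratify_key is hypothetical, and it assumes X is a pandas DataFrame, y is one-dimensional, and stratify_cols is a list of column names.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_y=False, stratify_cols=None):
    """Hypothetical helper: build the object passed as `stratify=`.

    Returns None when no stratification is requested.
    """
    parts = []
    if stratify_cols:
        # Assumes stratify_cols is a list of column names present in X.
        parts.append(X[stratify_cols])
    if stratify_y:
        # Assumes y is 1-D; re-index onto X so concat aligns row by row.
        parts.append(pd.Series(list(y), index=X.index, name="y"))
    if not parts:
        return None
    # One row per sample; each distinct row acts as one stratum.
    return pd.concat(parts, axis=1)


X = pd.DataFrame({"sex": ["F", "M", "F", "M"], "age": [30, 41, 25, 52]})
y = pd.Series([0, 0, 1, 1])

key = build_stratify_key(X, y, stratify_y=True, stratify_cols=["sex"])
print(key)
```

With a key like this, every distinct (column values, y) combination becomes its own stratum, so each combination needs enough rows to appear on both sides of a split.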

Current Method Implementation:

def train_val_test_split(
    self,
    X,
    y,
    stratify_y,
    train_size,
    validation_size,
    test_size,
    random_state,
    stratify_cols,
    calibrate,
):

    # if calibrate:
    #     X = X.join(self.dropped_strat_cols)
    # Determine the stratify parameter based on stratify and stratify_cols

    ## TODO: need to either consolidate stratification into one input or 
    ## alow for simultaneous usage of stratify_cols and stratify_y inputs.
    if stratify_cols:
        # Creating stratification columns out of stratify_cols list
        stratify_key = X[stratify_cols]
    elif stratify_y:
        stratify_key = y
    else:
        stratify_key = None

    if self.drop_strat_feat:
        self.dropped_strat_cols = X[self.drop_strat_feat]
        X = X.drop(columns=self.drop_strat_feat)

    X_train, X_valid_test, y_train, y_valid_test = train_test_split(
        X,
        y,
        test_size=1 - train_size,
        stratify=stratify_key,  # Use stratify_key here
        random_state=random_state,
    )

    # Determine the proportion of validation to test size in the remaining dataset
    proportion = test_size / (validation_size + test_size)

    if stratify_cols:
        strat_key_val_test = X_valid_test[stratify_cols]
    elif stratify_y:
        strat_key_val_test = y_valid_test
    else:
        strat_key_val_test = None

    # Further split (validation + test) set into validation and test sets
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_valid_test,
        y_valid_test,
        test_size=proportion,
        stratify=strat_key_val_test,
        random_state=random_state,
    )

    return X_train, X_valid, X_test, y_train, y_valid, y_test
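
To make the proportion arithmetic in the method above concrete, here is a small worked example. The sizes (0.7 / 0.2 / 0.1) and the row count are illustrative values chosen for this sketch, not defaults from the library, and the row counts are approximate because scikit-learn applies its own rounding.

```python
# Illustrative split sizes (assumed for this example only).
train_size, validation_size, test_size = 0.7, 0.2, 0.1
n_rows = 1000

# First split: everything that is not training data is held out.
first_holdout = 1 - train_size

# Second split: share of the held-out rows that goes to the test set.
proportion = test_size / (validation_size + test_size)

# Approximate row counts after each stage.
n_holdout = round(n_rows * first_holdout)
n_test = round(n_holdout * proportion)
n_valid = n_holdout - n_test

print(f"held out: {n_holdout}, validation: {n_valid}, test: {n_test}")
```

The second split works on the held-out rows only, which is why test_size has to be rescaled by (validation_size + test_size) rather than passed through unchanged.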
@elemets
Collaborator

elemets commented Aug 27, 2024

This issue should now be resolved with the latest push to the stratify_fix branch.

When both options are specified (e.g. stratify_y=True and stratify_cols=['col_example']), the code now concatenates the stratify_cols columns of X with y to form the stratification key. We discussed merging the two options into a single argument, but because the data is already split into X and y when it is passed to the model tuner, that would add unnecessary complexity.

    if stratify_cols and stratify_y:
        strat_key_val_test = pd.concat(
            [X_valid_test[stratify_cols], y_valid_test], axis=1
        )
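
For readers who want to try the idea outside the library, the following is a minimal standalone sketch of a two-stage split using a concatenated key. It is not the code on the stratify_fix branch: the synthetic data, the column names, and the split sizes are assumptions, and building the first-stage key the same way as the snippet above is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "group": rng.choice(["a", "b"], size=n),
        "feature": rng.normal(size=n),
    }
)
y = pd.Series(rng.integers(0, 2, size=n), name="y")
stratify_cols = ["group"]

# Stage 1: stratify on the (group, y) combination.
key = pd.concat([X[stratify_cols], y], axis=1)
X_train, X_valid_test, y_train, y_valid_test = train_test_split(
    X, y, test_size=0.4, stratify=key, random_state=0
)

# Stage 2: rebuild the key from the held-out rows, as in the snippet above.
strat_key_val_test = pd.concat(
    [X_valid_test[stratify_cols], y_valid_test], axis=1
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_valid_test,
    y_valid_test,
    test_size=0.5,
    stratify=strat_key_val_test,
    random_state=0,
)

print(len(X_train), len(X_valid), len(X_test))
```

The key has to be rebuilt for the second stage because the first-stage key still covers the training rows, and train_test_split expects the stratify argument to have one entry per row being split.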

@elemets elemets closed this as completed Aug 27, 2024