Currently, the train_val_test_split method allows for stratification either by y (stratify_y) or by specified columns (stratify_cols), but not both at the same time. There are use cases where stratification by both the target variable (y) and specific columns is necessary to ensure a balanced and representative split across different data segments.
Proposed Enhancement:
Look for the following #TODO in the code:
## TODO: need to either consolidate stratification into one input or
## alow for simultaneous usage of stratify_cols and stratify_y inputs.
Modify the method to support simultaneous stratification by both y and stratify_cols. This can be achieved by combining the stratification keys or implementing logic that ensures both y and the specified columns are considered during the stratification process.
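One way to combine the stratification keys is to collapse the selected columns and y into a single label per row, so that each distinct (columns, target) combination becomes its own stratum. The helper below is only a sketch of that idea: `build_stratify_key` and the `__target__` column name are hypothetical and not part of the existing code, and it assumes X is a pandas DataFrame and y has the same length as X.

```python
import pandas as pd


def build_stratify_key(X, y, stratify_y=False, stratify_cols=None):
    """Return one string label per row, or None if no stratification is requested.

    Hypothetical helper for illustration; not part of the library.
    """
    parts = []
    if stratify_cols:
        # Positional index so the columns line up with y regardless of X's index
        parts.append(X[stratify_cols].reset_index(drop=True))
    if stratify_y:
        parts.append(pd.Series(list(y), name="__target__"))
    if not parts:
        return None
    combined = pd.concat(parts, axis=1)
    # Join the values of each row into a single label, e.g. "a|0"
    key = combined.astype(str).agg("|".join, axis=1)
    key.index = X.index  # restore the original row labels
    return key


X = pd.DataFrame({"group": ["a", "a", "b", "b"], "feature": [1, 2, 3, 4]})
y = pd.Series([0, 1, 0, 1])
print(build_stratify_key(X, y, stratify_y=True, stratify_cols=["group"]).tolist())
```

The resulting key could be passed as `stratify=` to `train_test_split`. Note that every combined stratum then needs enough rows to be represented in each split, which becomes harder to satisfy as more columns are added.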
Current Method Implementation:
def train_val_test_split(
    self,
    X,
    y,
    stratify_y,
    train_size,
    validation_size,
    test_size,
    random_state,
    stratify_cols,
    calibrate,
):
    # if calibrate:
    #     X = X.join(self.dropped_strat_cols)
    # Determine the stratify parameter based on stratify and stratify_cols

    ## TODO: need to either consolidate stratification into one input or
    ## alow for simultaneous usage of stratify_cols and stratify_y inputs.
    if stratify_cols:
        # Creating stratification columns out of stratify_cols list
        stratify_key = X[stratify_cols]
    elif stratify_y:
        stratify_key = y
    else:
        stratify_key = None

    if self.drop_strat_feat:
        self.dropped_strat_cols = X[self.drop_strat_feat]
        X = X.drop(columns=self.drop_strat_feat)

    X_train, X_valid_test, y_train, y_valid_test = train_test_split(
        X,
        y,
        test_size=1 - train_size,
        stratify=stratify_key,  # Use stratify_key here
        random_state=random_state,
    )

    # Determine the proportion of validation to test size in the remaining dataset
    proportion = test_size / (validation_size + test_size)

    if stratify_cols:
        strat_key_val_test = X_valid_test[stratify_cols]
    elif stratify_y:
        strat_key_val_test = y_valid_test
    else:
        strat_key_val_test = None

    # Further split (validation + test) set into validation and test sets
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_valid_test,
        y_valid_test,
        test_size=proportion,
        stratify=strat_key_val_test,
        random_state=random_state,
    )

    return X_train, X_valid, X_test, y_train, y_valid, y_test
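For reference, the `proportion` used in the second split is a fraction of the held-out (validation + test) rows, not of the full dataset. A small worked example, with sizes chosen purely for illustration and a hypothetical helper name:

```python
def second_split_fraction(validation_size, test_size):
    """Fraction of the held-out rows that goes to the test set."""
    return test_size / (validation_size + test_size)


# Illustrative sizes (assumed, not taken from the issue)
train_size, validation_size, test_size = 0.7, 0.2, 0.1

holdout = 1 - train_size  # share of rows left after the first split
proportion = second_split_fraction(validation_size, test_size)

print(round(proportion, 3))            # 0.333
print(round(holdout * proportion, 3))  # 0.1
```

Taking one third of the 30% hold-out gives back the requested 10% test share, so the two-stage split reproduces the requested sizes as long as the three sizes sum to 1.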
This issue should now be resolved with the latest push to the stratify_fix branch.
The code now concatenates the selected X columns and y when both are specified, e.g. stratify_y=True and stratify_cols=['col_example']. We have discussed merging the two inputs into one variable, but since the data is already split into X and y when it is passed to the model tuner, this would add unnecessary complexity.
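The stratify_fix branch itself is not shown in this thread, so the following is only a rough sketch of the concatenation approach described above, not the actual change. The names `combined_key` and `train_val_test_split_sketch` are hypothetical, the `drop_strat_feat` handling is left out, and y is assumed to be a pandas Series sharing its index with X.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def combined_key(X, y, stratify_y, stratify_cols):
    """Concatenate the requested X columns with y; None if neither is requested."""
    parts = []
    if stratify_cols:
        parts.append(X[stratify_cols])
    if stratify_y:
        parts.append(y)
    return pd.concat(parts, axis=1) if parts else None


def train_val_test_split_sketch(X, y, train_size, validation_size, test_size,
                                random_state=None, stratify_y=False,
                                stratify_cols=None):
    # First split: train vs. (validation + test), stratified on the combined key
    key = combined_key(X, y, stratify_y, stratify_cols)
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=1 - train_size, stratify=key, random_state=random_state
    )
    # Second split: rebuild the key from the held-out rows only
    proportion = test_size / (validation_size + test_size)
    key_rest = combined_key(X_rest, y_rest, stratify_y, stratify_cols)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, test_size=proportion, stratify=key_rest,
        random_state=random_state,
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# Toy data: four (group, target) combinations with 25 rows each
X = pd.DataFrame({"group": ["a", "b"] * 50, "feature": range(100)})
y = pd.Series([0, 0, 1, 1] * 25, name="target")

X_train, X_valid, X_test, y_train, y_valid, y_test = train_val_test_split_sketch(
    X, y, train_size=0.6, validation_size=0.2, test_size=0.2,
    random_state=0, stratify_y=True, stratify_cols=["group"],
)
print(len(X_train), len(X_valid), len(X_test))
```

Keeping stratify_y and stratify_cols as separate inputs, as in the reply above, means callers who already hold X and y separately do not need to assemble a key themselves.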