-
Notifications
You must be signed in to change notification settings - Fork 64
feat: split process_data out from run_training #438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
process_data=True was allowing users to implicitly process their data via run_training since `ilab` isn't the main consumption point of the training library going forward, it makes sense to separate these concerns. remove `data_output_dir` and assume `data_path` is processed data. also remove the `process_data` argument. Adjust documentation, notebooks, etc Signed-off-by: Charlie Doern <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -335,13 +334,14 @@ from instructlab.training import ( | |||
DataProcessArgs, | |||
data_process as dp | |||
) | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, Opinion, Not required: in some codebases I worked with, it's common to import os.path
instead of os
to give a hint that we care about the path functions specifically and not just any os
stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@booxter or better yet, from os import path
😉
|
||
# this field defines where we should be saving the processed version of the training dataset | ||
# after we have tokenized it | ||
data_output_dir: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU instructlab CLI currently uses this field to pass --data-output-dir from the user. It also assumes the default behavior - process_data
being True
.
Would it be possible / desirable to make the transition smoother, e.g. by
- marking
data_output_dir
field as deprecated; - if it's passed, direct users with a warning to use the data_path name instead (plus not assume pre-processing?).
Or is the strategy here is to rewrite CLI + bump the minimal version for training library? Even if the CLI training rewrite for the new interface is the plan, it would be easier on the remaining CLI maintainers if the change happens through gradual deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if the CLI training rewrite for the new interface is the plan, it would be easier on the remaining CLI maintainers if the change happens through gradual deprecation
I'm not sure how this is handled in the current refactor @cdoern is working on. Assuming it doesn't handle this case, this is how I recommend we handle it based on how we made the initial refactor to move to the newer version of data processing:
- Create a new API for what is currently
run_training
which only runs training, we can either call itrun_training_v2
or something else. - Keep
run_training
with a deprecation notice and have it still perform this orchestration under the hood - Refactor
ilab
to use the new APIs - Remove
run_training
after a few releases as you have described here
You mentioned #437 in the PR description as resolved. I am not sure I understand the relation between them. Would you mind clarifying how they are related for those new to the repo? Thanks. |
process_data=True was allowing users to implicitly process their data via run_training. since
ilab
isn't the main consumption point of the training library going forward, it makes sense to separate these concerns.remove
data_output_dir
and assumedata_path
is processed data. also remove theprocess_data
argument.Adjust documentation, notebooks, etc
resolves #434, #437