Skip to content

feat: split process_data out from run_training #438

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cdoern
Copy link
Contributor

@cdoern cdoern commented Apr 1, 2025

process_data=True was allowing users to implicitly process their data via run_training. since ilab isn't the main consumption point of the training library going forward, it makes sense to separate these concerns.

remove data_output_dir and assume data_path is processed data. also remove the process_data argument.

Adjust documentation, notebooks, etc

resolves #434, #437

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure labels Apr 1, 2025
process_data=True was allowing users to implicitly process their data via run_training
since `ilab` isn't the main consumption point of the training library going forward, it makes sense to separate these concerns.

remove `data_output_dir` and assume `data_path` is processed data.
also remove the `process_data` argument.

Adjust documentation, notebooks, etc

Signed-off-by: Charlie Doern <[email protected]>
Copy link
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify mergify bot added the one-approval label Apr 2, 2025
@@ -335,13 +334,14 @@ from instructlab.training import (
DataProcessArgs,
data_process as dp
)
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, Opinion, Not required: in some codebases I worked with, it's common to import os.path instead of os to give a hint that we care about the path functions specifically and not just any os stuff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@booxter or better yet, from os import path 😉


# this field defines where we should be saving the processed version of the training dataset
# after we have tokenized it
data_output_dir: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU instructlab CLI currently uses this field to pass --data-output-dir from the user. It also assumes the default behavior - process_data being True.

Would it be possible / desirable to make the transition smoother, e.g. by

  1. marking data_output_dir field as deprecated;
  2. if it's passed, direct users with a warning to use the data_path name instead (plus not assume pre-processing?).

Or is the strategy here is to rewrite CLI + bump the minimal version for training library? Even if the CLI training rewrite for the new interface is the plan, it would be easier on the remaining CLI maintainers if the change happens through gradual deprecation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the CLI training rewrite for the new interface is the plan, it would be easier on the remaining CLI maintainers if the change happens through gradual deprecation

I'm not sure how this is handled in the current refactor @cdoern is working on. Assuming it doesn't handle this case, this is how I recommend we handle it based on how we made the initial refactor to move to the newer version of data processing:

  1. Create a new API for what is currently run_training which only runs training, we can either call it run_training_v2 or something else.
  2. Keep run_training with a deprecation notice and have it still perform this orchestration under the hood
  3. Refactor ilab to use the new APIs
  4. Remove run_training after a few releases as you have described here

@booxter
Copy link
Contributor

booxter commented Apr 15, 2025

You mentioned #437 in the PR description as resolved. I am not sure I understand the relation between them. Would you mind clarifying how they are related for those new to the repo? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-failure documentation Improvements or additions to documentation one-approval
Projects
None yet
Development

Successfully merging this pull request may close these issues.

refactor data processing to enable passing output file path
3 participants