-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Draft] Unify inference interface - request for comments #505
base: main
Are you sure you want to change the base?
Conversation
if isinstance(results_path, str): | ||
results_path = pathlib.Path(results_path) | ||
|
||
if os.path.isdir(results_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change as we will require results_path
to be a directory for storing results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @skothenhill-nv. This looks solid to me and makes a lot of sense overall. I like the approach and I don’t see any major gaps.
I don’t think this impacts the proposal, but I want to point out a difference between the ESM2 and Geneformer datamodule/dataset. For ESM2, I started with an in-memory CSV dataset and a datamodule that’s separate from the pre-training datamodule (which uses pairs of .db and parquet files) to simplify the inference process.
Thanks Farhad, just waiting on feedback from @jstjohn before moving forward with this. Glad to hear we are aligned and thanks for the extra context. |
@classmethod | ||
def model_class(cls) -> Type[ESM2Config]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say either make it a class method and return cls
or static and return the one class? Right now there's no change in terms of correct usage of class method vs instance method. Static would be the lowest featured method type that still works here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh good callout!
# TODO will want to bring in Farhad's work, eventually we should rebase on his PR | ||
# Does this have to be a DataModule or can it be a DataLoader? | ||
predictions = trainer.predict(module, datamodule=datamodule, return_predictions=True) | ||
# TODO evaluate this section. I worry about gathering predictions in memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the future we should be able to say return_predictions=False
and use a prediction writer. I think you were getting at that with mention of Farhad's PR but maybe specifically say that we should get return_predictions=False working alongside a prediction writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I have a request for:
- a few more comments in a couple places
- switch to static method for the thing that returns the type or, if appropriate, return
cls
?
Overview
Unifies the inference interface between esm2 and geneformer. Unlike training, user configuration is limited to parallelism arguments, which has no impact on the output, and inference specific arguments (checkpoint, data, etc). Due to this, I think we should stick with argparse style parsing and instead focus on validating the inputs we can and unifying the inference workflows for our models. The primary obstacles are conditional parameterization for
DataConfig
. For training, this is solved by parsing the config directly from json and constructing the datamodule from this config. Users must know which fields belong in the json file.With argparse, this becomes more tricky and potentially impossible.
Proposed workflow:
Pseudocode (in python) then looks like:
Issue with Dataset example
If there are additional arguments required to parse certain data configs, we have no way to tell argparse what arguments should be expected as the DataConfig class is selected at runtime (as an argument to argparse itself~).
One thought is to use an alternative to DataConfigT with a unified set of parameters, meaning, we need some minimal set of params for all DataConfigs used during inference. something like:
In this case we assume implementors can figure out additional required files while parsing
path_to_data
. For example, this could be a directory with a set of files, the implementation then makes hard assumptions about how the data is stored in the directory.In anycase, I would hope most inference workflows are operating on single files or with relatively simple configuration. That is how I would use them!