Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft] Unify inference interface - request for comments #505

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

skothenhill-nv
Copy link
Collaborator

@skothenhill-nv skothenhill-nv commented Dec 6, 2024

Overview

Unifies the inference interface between esm2 and geneformer. Unlike training, user configuration is limited to parallelism arguments, which has no impact on the output, and inference specific arguments (checkpoint, data, etc). Due to this, I think we should stick with argparse style parsing and instead focus on validating the inputs we can and unifying the inference workflows for our models. The primary obstacles are conditional parameterization for DataConfig. For training, this is solved by parsing the config directly from json and constructing the datamodule from this config. Users must know which fields belong in the json file.

With argparse, this becomes more tricky and potentially impossible.

Proposed workflow:

bionemo-infer --checkpoint path/to/ckpt
    --model-config-t  ESM2Config  # (Exposed or not is fine here)
    --data-config-t InMemoryCSVDatasetConfig
    --data-args arg1=val arg2=val arg3=val 
    --checkpoint-path /path/to/ckpt
    --precision
    --pipeline-model-parallel
    ... 
    (other runtime args)

Pseudocode (in python) then looks like:

args = parse_args()
data_config[args.DataConfigT](args.data1, args.data2... etc)
parallel_config = ParallelConfig(args.parallel_arg1, parallel_arg2... etc)

infer(args.ckpt, data_config, parallel_config, result_path, include_embeddings/logits/etc)

Issue with Dataset example

If there are additional arguments required to parse certain data configs, we have no way to tell argparse what arguments should be expected as the DataConfig class is selected at runtime (as an argument to argparse itself~).

class FakeESM2Config
     path
     special_argument1
class FakeGeneformerConfig
     path

One thought is to use an alternative to DataConfigT with a unified set of parameters, meaning, we need some minimal set of params for all DataConfigs used during inference. something like:

class InferenceDataConfig:
        path_to_data:  Path
        num_workers: int
        construct_data_module(self) -> DataModule:

In this case we assume implementors can figure out additional required files while parsing path_to_data. For example, this could be a directory with a set of files, the implementation then makes hard assumptions about how the data is stored in the directory.

In anycase, I would hope most inference workflows are operating on single files or with relatively simple configuration. That is how I would use them!

if isinstance(results_path, str):
results_path = pathlib.Path(results_path)

if os.path.isdir(results_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change as we will require results_path to be a directory for storing results.

Copy link
Collaborator

@farhadrgh farhadrgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @skothenhill-nv. This looks solid to me and makes a lot of sense overall. I like the approach and I don’t see any major gaps.

I don’t think this impacts the proposal, but I want to point out a difference between the ESM2 and Geneformer datamodule/dataset. For ESM2, I started with an in-memory CSV dataset and a datamodule that’s separate from the pre-training datamodule (which uses pairs of .db and parquet files) to simplify the inference process.

@skothenhill-nv
Copy link
Collaborator Author

Thanks @skothenhill-nv. This looks solid to me and makes a lot of sense overall. I like the approach and I don’t see any major gaps.

I don’t think this impacts the proposal, but I want to point out a difference between the ESM2 and Geneformer datamodule/dataset. For ESM2, I started with an in-memory CSV dataset and a datamodule that’s separate from the pre-training datamodule (which uses pairs of .db and parquet files) to simplify the inference process.

Thanks Farhad, just waiting on feedback from @jstjohn before moving forward with this. Glad to hear we are aligned and thanks for the extra context.

Comment on lines +226 to +227
@classmethod
def model_class(cls) -> Type[ESM2Config]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say either make it a class method and return cls or static and return the one class? Right now there's no change in terms of correct usage of class method vs instance method. Static would be the lowest featured method type that still works here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh good callout!

# TODO will want to bring in Farhad's work, eventually we should rebase on his PR
# Does this have to be a DataModule or can it be a DataLoader?
predictions = trainer.predict(module, datamodule=datamodule, return_predictions=True)
# TODO evaluate this section. I worry about gathering predictions in memory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the future we should be able to say return_predictions=False and use a prediction writer. I think you were getting at that with mention of Farhad's PR but maybe specifically say that we should get return_predictions=False working alongside a prediction writer.

Copy link
Collaborator

@jstjohn jstjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have a request for:

  • a few more comments in a couple places
  • switch to static method for the thing that returns the type or, if appropriate, return cls?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants