[Draft] Unify inference interface - request for comments #505

skothenhill-nv · 2024-12-06T22:35:27Z

Overview

Unifies the inference interface between esm2 and geneformer. Unlike training, user configuration is limited to parallelism arguments, which has no impact on the output, and inference specific arguments (checkpoint, data, etc). Due to this, I think we should stick with argparse style parsing and instead focus on validating the inputs we can and unifying the inference workflows for our models. The primary obstacles are conditional parameterization for DataConfig. For training, this is solved by parsing the config directly from json and constructing the datamodule from this config. Users must know which fields belong in the json file.

With argparse, this becomes more tricky and potentially impossible.

Proposed workflow:

bionemo-infer --checkpoint path/to/ckpt
    --model-config-t  ESM2Config  # (Exposed or not is fine here)
    --data-config-t InMemoryCSVDatasetConfig
    --data-args arg1=val arg2=val arg3=val 
    --checkpoint-path /path/to/ckpt
    --precision
    --pipeline-model-parallel
    ... 
    (other runtime args)

Pseudocode (in python) then looks like:

args = parse_args()
data_config[args.DataConfigT](args.data1, args.data2... etc)
parallel_config = ParallelConfig(args.parallel_arg1, parallel_arg2... etc)

infer(args.ckpt, data_config, parallel_config, result_path, include_embeddings/logits/etc)

Issue with Dataset example

If there are additional arguments required to parse certain data configs, we have no way to tell argparse what arguments should be expected as the DataConfig class is selected at runtime (as an argument to argparse itself~).

class FakeESM2Config
     path
     special_argument1
class FakeGeneformerConfig
     path

One thought is to use an alternative to DataConfigT with a unified set of parameters, meaning, we need some minimal set of params for all DataConfigs used during inference. something like:

class InferenceDataConfig:
        path_to_data:  Path
        num_workers: int
        construct_data_module(self) -> DataModule:

In this case we assume implementors can figure out additional required files while parsing path_to_data. For example, this could be a directory with a set of files, the implementation then makes hard assumptions about how the data is stored in the directory.

In anycase, I would hope most inference workflows are operating on single files or with relatively simple configuration. That is how I would use them!

farhadrgh · 2024-12-09T16:39:53Z

sub-packages/bionemo-llm/src/bionemo/llm/infer.py

+    if isinstance(results_path, str):
+        results_path = pathlib.Path(results_path)
+
+    if os.path.isdir(results_path):


This will change as we will require results_path to be a directory for storing results.

farhadrgh

Thanks @skothenhill-nv. This looks solid to me and makes a lot of sense overall. I like the approach and I don’t see any major gaps.

I don’t think this impacts the proposal, but I want to point out a difference between the ESM2 and Geneformer datamodule/dataset. For ESM2, I started with an in-memory CSV dataset and a datamodule that’s separate from the pre-training datamodule (which uses pairs of .db and parquet files) to simplify the inference process.

skothenhill-nv · 2024-12-12T15:15:49Z

Thanks @skothenhill-nv. This looks solid to me and makes a lot of sense overall. I like the approach and I don’t see any major gaps.

I don’t think this impacts the proposal, but I want to point out a difference between the ESM2 and Geneformer datamodule/dataset. For ESM2, I started with an in-memory CSV dataset and a datamodule that’s separate from the pre-training datamodule (which uses pairs of .db and parquet files) to simplify the inference process.

Thanks Farhad, just waiting on feedback from @jstjohn before moving forward with this. Glad to hear we are aligned and thanks for the extra context.

jstjohn · 2024-12-12T16:05:40Z

sub-packages/bionemo-esm2/src/bionemo/esm2/run/config_models.py

+    @classmethod
+    def model_class(cls) -> Type[ESM2Config]:


I'd say either make it a class method and return cls or static and return the one class? Right now there's no change in terms of correct usage of class method vs instance method. Static would be the lowest featured method type that still works here.

oh good callout!

jstjohn · 2024-12-12T16:08:42Z

sub-packages/bionemo-llm/src/bionemo/llm/infer.py

+    # TODO will want to bring in Farhad's work, eventually we should rebase on his PR
+    # Does this have to be a DataModule or can it be a DataLoader?
+    predictions = trainer.predict(module, datamodule=datamodule, return_predictions=True)
+    # TODO evaluate this section. I worry about gathering predictions in memory.


So in the future we should be able to say return_predictions=False and use a prediction writer. I think you were getting at that with mention of Farhad's PR but maybe specifically say that we should get return_predictions=False working alongside a prediction writer.

jstjohn

Looks good! I have a request for:

a few more comments in a couple places
switch to static method for the thing that returns the type or, if appropriate, return cls?

initial commit

54d8cfa

skothenhill-nv requested review from jstjohn, malcolmgreaves, farhadrgh, dorotat-nv and pstjohn as code owners December 6, 2024 22:35

farhadrgh reviewed Dec 9, 2024

View reviewed changes

jstjohn reviewed Dec 12, 2024

View reviewed changes

jstjohn approved these changes Dec 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Draft] Unify inference interface - request for comments #505

[Draft] Unify inference interface - request for comments #505

skothenhill-nv commented Dec 6, 2024 •

edited

Loading

farhadrgh Dec 9, 2024

farhadrgh left a comment

skothenhill-nv commented Dec 12, 2024

jstjohn Dec 12, 2024

skothenhill-nv Dec 12, 2024

jstjohn Dec 12, 2024

jstjohn left a comment

[Draft] Unify inference interface - request for comments #505

Are you sure you want to change the base?

[Draft] Unify inference interface - request for comments #505

Conversation

skothenhill-nv commented Dec 6, 2024 • edited Loading

Overview

Proposed workflow:

Pseudocode (in python) then looks like:

Issue with Dataset example

farhadrgh Dec 9, 2024

Choose a reason for hiding this comment

farhadrgh left a comment

Choose a reason for hiding this comment

skothenhill-nv commented Dec 12, 2024

jstjohn Dec 12, 2024

Choose a reason for hiding this comment

skothenhill-nv Dec 12, 2024

Choose a reason for hiding this comment

jstjohn Dec 12, 2024

Choose a reason for hiding this comment

jstjohn left a comment

Choose a reason for hiding this comment

skothenhill-nv commented Dec 6, 2024 •

edited

Loading