A method for managing upstream training task confs #1529

Open
hnyu opened this issue Aug 27, 2023 · 0 comments

Background

Our Hobot pipeline is long and has four stages:

low-level -> expert -> agent -> deployment

Each stage can be called a 'task'. For any task, the tasks to its left are 'upstream' tasks. In general, a task needs its upstream tasks' config files and model weights for both training and deployment (deployment additionally requires the task's own configs and model weights).
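As an illustrative data model of the pipeline above (the list and helper name are hypothetical, just to make the 'upstream' relation concrete), the upstream tasks can be read off the pipeline order:

```python
# Illustrative only: stage names come from the pipeline above; the helper
# name `upstream_tasks` is hypothetical.
PIPELINE = ["low-level", "expert", "agent", "deployment"]

def upstream_tasks(task: str) -> list:
    """Return every task to the left of `task` in the pipeline."""
    return PIPELINE[:PIPELINE.index(task)]

print(upstream_tasks("agent"))  # ['low-level', 'expert']
```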

A desirable scenario

In a brief discussion with @Haichao-Zhang on Friday, we agreed that in order to easily manage and use training results of upstream tasks, two important properties are desired:

  • The model weights are always stored as one ckpt file, regardless of where the task sits in the pipeline. For example, an agent's ckpt contains the model weights for low-level, expert, and itself.
  • We only need to look at one stage to get all needed configurations. For example, agent training or deployment needs only the expert's job dir, not the low-level's. Similarly, deployment needs only the agent's job dir.

The above two properties simplify ckpt and conf management, because we don't want multiple training dirs passed to a downstream task.
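For the first property, one concrete possibility (a sketch only, not ALF's actual checkpoint format) is a single file whose state is a nested mapping keyed by stage:

```python
# Hypothetical single-ckpt layout: the agent's checkpoint bundles the weights
# of every upstream stage alongside its own. Tensor values are placeholders.
low_level_state = {"encoder.weight": [0.1, 0.2]}
expert_state = {"policy.weight": [0.3]}
agent_state = {"head.weight": [0.4]}

agent_ckpt = {
    "low_level": low_level_state,
    "expert": expert_state,
    "agent": agent_state,
}

# A downstream task loads this one file and picks out the stages it needs:
print(sorted(agent_ckpt))  # ['agent', 'expert', 'low_level']
```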

Solution

For model weights, it's straightforward to store all as one ckpt. However, it's a little tricky when it comes to conf management. Below is a simple hack for that purpose.

import glob
import os
import pathlib
import shutil


def save_upstream_confs(upstream_task_root_dir: str):
    """When training the current task B, we copy all upstream task (C, D, ...) confs
    to './.upstream_confs', and then add them to ``_CONF_FILES``.

    This will make them further copied to 'config_files' under the TensorBoard
    directory of B when ALF later writes the config.

    So when one later wants to use the ckpt of B for a new downstream task A, they
    don't need the trained dirs of C, D, ..., because those conf files have been
    included in B.

    To use any cached upstream conf ``x_conf.py``, one only needs to do

    .. code-block:: python

        alf.import_config('./.upstream_confs/x_conf.py')

    This also works when ``x_conf.py`` itself imports some upstream conf
    ``y_conf.py``, provided that inside ``x_conf.py`` the import is written as

    .. code-block:: python

        alf.import_config('./.upstream_confs/y_conf.py')

    A general template for saving/using upstream confs:

    .. code-block:: python

        if is_training:
            save_upstream_confs(upstream_task_root_dir)
        # import conf files of the current task
        alf.import_config('x_conf.py')
        alf.import_config('y_conf.py')
        # import conf files of upstream tasks
        alf.import_config('./.upstream_confs/z_conf.py')

    Args:
        upstream_task_root_dir: the root dir of the upstream task
    """
    root_dir = upstream_task_root_dir
    dst = pathlib.Path(__file__).parent / ".upstream_confs"
    dst.mkdir(parents=True, exist_ok=True)
    # Copy the upstream task's config files, along with its own cached
    # upstream conf files if they exist.
    nested = os.path.join(root_dir, "config_files", ".upstream_confs")
    if os.path.isdir(nested):
        shutil.copytree(nested, dst / ".upstream_confs", dirs_exist_ok=True)
    for f in glob.glob(os.path.join(root_dir, "config_files", "*.py")):
        shutil.copy(f, dst)
    # Register every cached conf so that ALF copies it into 'config_files'
    # of the current training root dir. ``_add_conf_file`` is ALF-internal.
    for f in glob.glob(f"{dst}/**/*.py", recursive=True):
        _add_conf_file(f)

In short, we recursively copy all files under config_files of the upstream root_dir into a special dir called .upstream_confs next to the current conf file. We then add every file in this special dir (recursively) to ALF's _CONF_FILES, which ALF copies to config_files of the training root dir after one training iteration of the current task.
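To see the recursive caching concretely, here is a self-contained simulation (standard library only; the task names and the helper cache_upstream_confs are illustrative stand-ins for save_upstream_confs) showing how C's confs end up nested one level deeper each time they are re-cached:

```python
import pathlib
import shutil
import tempfile

def cache_upstream_confs(upstream_root: pathlib.Path, dst: pathlib.Path):
    """Simplified stand-in for save_upstream_confs: copy the upstream task's
    config_files (and its own cached .upstream_confs, if any) into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    nested = upstream_root / "config_files" / ".upstream_confs"
    if nested.is_dir():
        shutil.copytree(nested, dst / ".upstream_confs", dirs_exist_ok=True)
    for f in (upstream_root / "config_files").glob("*.py"):
        shutil.copy(f, dst)

tmp = pathlib.Path(tempfile.mkdtemp())
# Task C (low-level) trains first: its job dir holds only its own conf.
(tmp / "C/config_files").mkdir(parents=True)
(tmp / "C/config_files/c_conf.py").write_text("# low-level conf")
# Task B (expert) caches C's confs before training ...
cache_upstream_confs(tmp / "C", tmp / "B/config_files/.upstream_confs")
(tmp / "B/config_files/b_conf.py").write_text("# expert conf")
# ... so task A (agent) needs only B's job dir; C's conf rides along, nested.
cache_upstream_confs(tmp / "B", tmp / "A/.upstream_confs")
cached = sorted(p.relative_to(tmp / "A").as_posix()
                for p in (tmp / "A").rglob("*.py"))
print(cached)
# ['.upstream_confs/.upstream_confs/c_conf.py', '.upstream_confs/b_conf.py']
```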

This satisfies the second property, provided that any conf file x_conf.py of the current task imports a conf file y_conf.py of the immediate upstream task via

alf.import_config('./.upstream_confs/y_conf.py')

This works for both the training and deployment modes of the task. We call the above function only in training mode:

if is_training:
    save_upstream_confs(upstream_task_root_dir)