
Conversation

farhadrgh (Collaborator) commented Oct 29, 2025

Description

Bump NeMo to get the changes in NVIDIA-NeMo/NeMo#14914

Usage

TODO: Add code snippet
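A minimal sketch of how the bumped dependency could be confirmed in a running environment (the distribution name nemo_toolkit and the use of importlib.metadata are assumptions about the install, not something this PR specifies):

    # Hedged sanity check: print which NeMo version the environment actually resolved after the bump.
    # Assumes NeMo is installed under its usual distribution name, "nemo_toolkit".
    from importlib.metadata import version

    print("NeMo version:", version("nemo_toolkit"))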

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebook execution tests for bionemo2
  • ciflow:slow - Run slow single-GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running all tests for bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING
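For reference, a short pytest sketch of how tests opt into the gated suites described above (the marker names are the ones listed in this section; the test names and bodies are placeholders, not real bionemo2 tests):

    import pytest


    @pytest.mark.slow  # collected by the slow single-GPU suite (ciflow:slow or ciflow:all)
    def test_long_pretraining_smoke():
        ...  # placeholder body


    @pytest.mark.multi_gpu  # not run in the PR pipeline, per the note above
    def test_allreduce_across_ranks():
        ...  # placeholder body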

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Signed-off-by: Farhad Ramezanghorbani <[email protected]>
copy-pr-bot (bot) commented Oct 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

farhadrgh added the ciflow:all label ("Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests") on Oct 29, 2025
farhadrgh enabled auto-merge on October 29, 2025 at 18:22
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
jwilber (Collaborator) commented Oct 30, 2025

/ok to test 10c14cf

codecov-commenter commented Oct 30, 2025

❌ 27 Tests Failed:

Tests completed: 1339 | Failed: 27 | Passed: 1312 | Skipped: 53
Top failed tests by shortest run time:
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[ConsumedSamplesCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGo'>

    @classmethod
    def run_stop_and_go(cls):
        """Executes training both continuously and with a checkpoint interruption."""
        # Interrupted model training
        cls.stop()
>       cls.resume()

.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ec85a6720>
model_param = Parameter containing:
tensor([-1.1761e-05, -1.1844e-05, -1.1866e-05, -1.1839e-05,  1.1568e-05,
         8.4812e-06,  1...-1.1959e-05, -1.1846e-05, -1.1872e-05, -1.1833e-05,
        -1.1839e-05, -1.1828e-05, -1.1790e-05], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.7037e-04,  1.4331e-04,  1.2414e-04,  1.2611e-04, -2.4063e-03,
        -8.2074e-04, -1.1227e-03,..., -1.1959e-05, -1.1846e-05, -1.1872e-05, -1.1833e-05,
        -1.1839e-05, -1.1828e-05, -1.1790e-05], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
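For context, the RuntimeError at the bottom of these traces is PyTorch's guard against in-place writes to views of autograd leaves. A minimal standalone repro of the same error class (not the Megatron code path itself), together with the usual no_grad workaround:

    import torch

    param = torch.nn.Parameter(torch.zeros(4))  # a leaf tensor that requires grad

    try:
        # An in-place copy through a view of the leaf raises the same error seen above.
        param[:2].copy_(torch.ones(2))
    except RuntimeError as err:
        print(err)  # a view of a leaf Variable that requires grad is being used in an in-place operation.

    # The same write succeeds when autograd tracking is suspended.
    with torch.no_grad():
        param[:2].copy_(torch.ones(2))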
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[GlobalStepStateCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[OptimizerStateCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[TrainInputCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[TrainLossCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[TrainOutputCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[ValidInputCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[ValidLossCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[ValidOutputCallback]
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_train_val_init_consumed_samples
Stack Traces | 0.001s run time (identical to the first trace above; same RuntimeError at distrib_optimizer.py:928)
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_all_valid_batch_inputs_are_identical
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
(The remainder of the trace follows the same resume path as the failures above, through llm.train, trainer.fit, restore_optimizers, and the Megatron distrib_optimizer, and ends in the same RuntimeError at distrib_optimizer.py:928.)
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[ConsumedSamplesCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[GlobalStepStateCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[OptimizerStateCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[TrainInputCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[TrainLossCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[TrainOutputCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[ValidInputCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[ValidLossCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[ValidOutputCallback]
Stack Traces | 0.001s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
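
All of the ESM2 stop-and-go failures above share the same root cause: on resume, `DistributedOptimizer._set_main_param_and_optimizer_states` copies the checkpointed tensors into `main_param` with an in-place `copy_()`, and that destination is a view of a leaf tensor that still requires grad, which autograd rejects. The snippet below is only a minimal, standalone reproduction of that PyTorch behavior and the usual `torch.no_grad()` workaround; it is an illustration of the failure mode, not the Megatron-Core code or the intended fix.

```python
import torch

# Minimal illustration (assumption: mirrors the failure mode, not Megatron-Core itself).
# A leaf parameter that requires grad stands in for a flat main-param buffer.
buffer = torch.nn.Parameter(torch.zeros(8))
src = torch.ones(4)  # stands in for a tensor restored from the checkpoint

try:
    # A slice of the leaf is a view; an in-place copy into it trips autograd's check.
    buffer[:4].copy_(src)
except RuntimeError as err:
    # "a view of a leaf Variable that requires grad is being used in an in-place operation."
    print(err)

# Performing the copy with grad mode disabled (the usual pattern for loading state) succeeds.
with torch.no_grad():
    buffer[:4].copy_(src)
print(buffer)  # first four entries are now 1.0
```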
sub-packages/bionemo-example_model/tests/bionemo/example_model/lightning/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[32]
Stack Traces | 0.898s run time
tmpdir = local('.../pytest-of-root/pytest-4/test_train_mnist_litautoencode0')
precision = 32

    @pytest.mark.needs_gpu
    @pytest.mark.parametrize("precision", [32, "bf16-mixed"])
    def test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu(tmpdir: LEGACY_PATH, precision: PrecisionTypes):
        with megatron_parallel_state_utils.distributed_model_parallel_state():
>           ckpt_path, initial_metrics = _train_model_get_ckpt(
                name="test_experiment",
                root_dir=tmpdir / "pretrain",
                model_cfg_cls=lb.PretrainConfig,
                ckpt_path=None,
                skip_weight_prefixes=set(),
                precision=precision,
            )

.../example_model/lightning/test_lightning_basic.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../example_model/lightning/test_lightning_basic.py:110: in _train_model_get_ckpt
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:118: in train
    app_state = _setup(
.../local/lib/python3.12.../collections/llm/api.py:953: in _setup
    CallbackGroup.get_instance().update_config(nemo_version='v2', trainer=trainer, data=data)
.../local/lib/python3.12.../nemo/lightning/callback_group.py:75: in update_config
    method(nemo_version=nemo_version, trainer=trainer, **kwargs)
.../local/lib/python3.12.../nemo/lightning/one_logger_callback.py:298: in update_config
    config = get_nemo_v2_callback_config(trainer=trainer, data=data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

trainer = <nemo.lightning.pytorch.trainer.Trainer object at 0x76367ff56e10>
data = <bionemo.example_model.lightning.lightning_basic.MNISTDataModule object at 0x76367ff70290>

    def get_nemo_v2_callback_config(
        trainer: Any,
        data: Any,
    ) -> Dict[str, Any]:
        """Generate NeMo v2 specific configuration for the OneLogger training callback.
    
        This function extracts the global batch size and sequence length from the provided NeMo v2 data module,
        and uses them to construct the configuration dictionary for the OneLogger training callback.
    
        Args:
            trainer: PyTorch Lightning trainer instance.
            data: NeMo v2 data module (required).
    
        Returns:
            Dictionary containing the NeMo v2 training callback configuration.
        """
        # NeMo v2: Extract batch size and sequence length from data module (most reliable source)
        global_batch_size = 1  # Default fallback
        seq_length = 1  # Default fallback
    
        if data is not None:
>           seq_length = data.seq_length
E           AttributeError: 'MNISTDataModule' object has no attribute 'seq_length'

.../local/lib/python3.12.../nemo/lightning/one_logger_callback.py:226: AttributeError
sub-packages/bionemo-example_model/tests/bionemo/example_model/lightning/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[bf16-mixed]
Stack Traces | 0.903s run time
tmpdir = local('.../pytest-of-root/pytest-4/test_train_mnist_litautoencode1')
precision = 'bf16-mixed'

    @pytest.mark.needs_gpu
    @pytest.mark.parametrize("precision", [32, "bf16-mixed"])
    def test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu(tmpdir: LEGACY_PATH, precision: PrecisionTypes):
        with megatron_parallel_state_utils.distributed_model_parallel_state():
>           ckpt_path, initial_metrics = _train_model_get_ckpt(
                name="test_experiment",
                root_dir=tmpdir / "pretrain",
                model_cfg_cls=lb.PretrainConfig,
                ckpt_path=None,
                skip_weight_prefixes=set(),
                precision=precision,
            )

.../example_model/lightning/test_lightning_basic.py:128: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../example_model/lightning/test_lightning_basic.py:110: in _train_model_get_ckpt
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:118: in train
    app_state = _setup(
.../local/lib/python3.12.../collections/llm/api.py:953: in _setup
    CallbackGroup.get_instance().update_config(nemo_version='v2', trainer=trainer, data=data)
.../local/lib/python3.12.../nemo/lightning/callback_group.py:75: in update_config
    method(nemo_version=nemo_version, trainer=trainer, **kwargs)
.../local/lib/python3.12.../nemo/lightning/one_logger_callback.py:298: in update_config
    config = get_nemo_v2_callback_config(trainer=trainer, data=data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

trainer = <nemo.lightning.pytorch.trainer.Trainer object at 0x7636800b3170>
data = <bionemo.example_model.lightning.lightning_basic.MNISTDataModule object at 0x7636800beea0>

    def get_nemo_v2_callback_config(
        trainer: Any,
        data: Any,
    ) -> Dict[str, Any]:
        """Generate NeMo v2 specific configuration for the OneLogger training callback.
    
        This function extracts the global batch size and sequence length from the provided NeMo v2 data module,
        and uses them to construct the configuration dictionary for the OneLogger training callback.
    
        Args:
            trainer: PyTorch Lightning trainer instance.
            data: NeMo v2 data module (required).
    
        Returns:
            Dictionary containing the NeMo v2 training callback configuration.
        """
        # NeMo v2: Extract batch size and sequence length from data module (most reliable source)
        global_batch_size = 1  # Default fallback
        seq_length = 1  # Default fallback
    
        if data is not None:
>           seq_length = data.seq_length
E           AttributeError: 'MNISTDataModule' object has no attribute 'seq_length'

.../local/lib/python3.12.../nemo/lightning/one_logger_callback.py:226: AttributeError
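
Both MNIST failures come from the bumped NeMo's OneLogger callback reading `data.seq_length` unconditionally, and the later conflict-resolution commit in this PR mentions "seq_length property additions" to the data modules. A minimal sketch of what such a property could look like on the example data module is below; the class name mirrors the traceback, but the property bodies and constructor are assumptions, not the actual change in this PR.

```python
# Hedged sketch, assuming the fix is to expose the attributes the OneLogger
# callback reads (seq_length, global_batch_size) on the data module.
from lightning.pytorch import LightningDataModule


class MNISTDataModule(LightningDataModule):
    def __init__(self, micro_batch_size: int = 32) -> None:
        super().__init__()
        self.micro_batch_size = micro_batch_size

    @property
    def seq_length(self) -> int:
        # MNIST images are 28x28 pixels flattened to one vector, so report
        # that as the "sequence length" the callback expects.
        return 28 * 28

    @property
    def global_batch_size(self) -> int:
        # Single-process example: the global batch equals the micro batch.
        return self.micro_batch_size
```

A more defensive alternative would be for the callback itself to fall back with `getattr(data, "seq_length", 1)`, but that change would live in NeMo rather than in this repository.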
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_forward[evo2/1b-8k:1.0-expected_matchpercents1]
Stack Traces | 9.23s run time
sequences = ['GAATAGGAACAGCTCCGGTCTACAGCTCCCAGCGTGAGCGACGCAGAAGACGGTGATTTCTGCATTTCCATCTGAGGTACCGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGC...CTCCATGACTTTTTCAAAAAGGTATTAGAAAAACCATTTCATAACTTTGTCAAAGTTAAATTATAGGCTAAATCCTATATATCTTAATGGCACATGCAGCGCAAGTAGGTCTACAAG']
ckpt_name = 'evo2/1b-8k:1.0'
expected_matchpercents = [96.27, 67.93, 77.5, 80.3]

    @pytest.mark.parametrize(
        "ckpt_name,expected_matchpercents",
        [
            ("evo2/1b-8k-bf16:1.0", [96.27, 67.93, 77.50, 80.30]),
            ("evo2/1b-8k:1.0", [96.27, 67.93, 77.50, 80.30]),
            ("evo2/7b-8k:1.0", [97.60, 89.63, 80.03, 84.57]),
            ("evo2/7b-1m:1.0", [97.60, 89.63, 80.03, 84.57]),
        ],
    )
    def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float]):
        assert len(sequences) > 0
        seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
            ckpt_name, test_name=inspect.currentframe().f_code.co_name
        )
    
        is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
        skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
        if skip:
            # This checkpoint is sensitive to FP8, so we skip it if it is not supported on the current device.
            pytest.skip(f"Skipping {ckpt_name} because it is not supported on {device_info} ({compute_capability})")
        vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
        inference_wrapped_model, mcore_tokenizer = get_model_and_tokenizer(
            ckpt_name, vortex_style_fp8=vortex_style_fp8, flash_decode=True, enable_flash_decode=True
        )
        matchrates = []
        for seq in sequences:
            seq = seq[:seq_len_cap]  # TODO: artificial limit, megatron uses more memory. Vortex can process full sequences
            with torch.no_grad():
                device = torch.cuda.current_device()
                tokens = torch.tensor([mcore_tokenizer.tokenize(seq)], device=device)
                forward_args = {
                    "tokens": tokens,
                    "position_ids": None,
                    "attention_mask": None,
                }
    
                inference_wrapped_model.prep_model_for_inference(prompts_tokens=None)
                logits = inference_wrapped_model.run_one_forward_step(forward_args)
                inference_wrapped_model.inference_context.reset()
    
                from megatron.core.inference.communication_utils import broadcast_from_last_pipeline_stage
    
                batch_size, context_length, vocab_size = 1, len(seq), 512
>               logits = broadcast_from_last_pipeline_stage(
                    [batch_size, context_length, vocab_size],
                    dtype=inference_wrapped_model.inference_wrapper_config.params_dtype,
                    tensor=logits,
                )

.../bionemo/evo2/test_evo2.py:514: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

size = [1, 6000, 512], dtype = torch.bfloat16
tensor = tensor([[[ -4.2188, -46.0000, -46.0000, -46.0000, -46.0000, -46.0000, -46.0000,
          -46.0000, -46.0000, -46.0000...6.0000, -46.0000, -46.0000, -46.0000, -46.2500, -46.0000,
          -46.0000]]], device='cuda:0', dtype=torch.bfloat16)
pp_group = <torch.distributed.distributed_c10d.ProcessGroup object at 0x7e6e141800f0>

    def broadcast_from_last_pipeline_stage(
        size: List[int],
        dtype: torch.dtype,
        tensor: Optional[torch.Tensor] = None,
        pp_group: Optional[ProcessGroup] = None,
    ):
        """Broadcast a tensor from last pipeline stage to all ranks.
    
        Args:
            size: Expected tensor size
            dtype: Expected tensor dtype
            tensor: Tensor to broadcast (only on last stage)
            pp_group: Custom process group (if None, uses global state)
        """
        # Use custom process group or fall back to global state
        if pp_group is None:
            pp_group = parallel_state.get_pipeline_model_parallel_group()
            last_rank = parallel_state.get_pipeline_model_parallel_last_rank()
    
            # add ignore_virtual=True since vpp is not used in inference
            is_last_stage = parallel_state.is_pipeline_last_stage(ignore_virtual=True)
        else:
            # Lists of ProcessGroups are used for multimodal inference but not supported here
            assert isinstance(
                pp_group, ProcessGroup
            ), "pp_group must be a single ProcessGroup, not a list of ProcessGroups"
            last_rank = torch.distributed.get_process_group_ranks(pp_group)[pp_group.size() - 1]
            is_last_stage = pp_group.rank() == pp_group.size() - 1
    
        if is_last_stage:
>           assert size == list(
                tensor.shape
            ), f"Expected tensor of shape {size} but got {list(tensor.shape)}"
E           AssertionError: Expected tensor of shape [1, 6000, 512] but got [1, 1, 512]

.../local/lib/python3.12.../core/inference/communication_utils.py:64: AssertionError
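
The assertion shows the forward step returned logits of shape [1, 1, 512] while the test asked `broadcast_from_last_pipeline_stage` for [1, 6000, 512]: with flash decode enabled, the inference path apparently emits only the last position's logits rather than one per input token. Below is a hedged sketch of how the call site could size the broadcast from the tensor it actually has; it illustrates the mismatch and is not necessarily the fix this PR ends up taking.

```python
# Hypothetical adjustment (assumption): derive the broadcast size from the
# logits the forward step produced, instead of assuming one position per
# input token. On a single pipeline stage every rank holds the tensor, so
# this is safe; with real pipeline parallelism the size would have to be
# agreed on up front, so treat this purely as an illustration.
from megatron.core.inference.communication_utils import broadcast_from_last_pipeline_stage

logits = broadcast_from_last_pipeline_stage(
    list(logits.shape),  # e.g. [1, 1, 512] under flash decode, not [1, len(seq), 512]
    dtype=inference_wrapped_model.inference_wrapper_config.params_dtype,
    tensor=logits,
)
```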
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_forward_manual[evo2/1b-8k-bf16:1.0-expected_matchpercents0-True]
Stack Traces | 11.3s run time
sequences = ['GAATAGGAACAGCTCCGGTCTACAGCTCCCAGCGTGAGCGACGCAGAAGACGGTGATTTCTGCATTTCCATCTGAGGTACCGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGC...CTCCATGACTTTTTCAAAAAGGTATTAGAAAAACCATTTCATAACTTTGTCAAAGTTAAATTATAGGCTAAATCCTATATATCTTAATGGCACATGCAGCGCAAGTAGGTCTACAAG']
ckpt_name = 'evo2/1b-8k-bf16:1.0'
expected_matchpercents = [96.27, 67.93, 77.5, 80.3], flash_decode = True

    @pytest.mark.parametrize(
        "ckpt_name,expected_matchpercents,flash_decode",
        [
            # Try flash decode with one and not the other to verify that both paths work.
            ("evo2/1b-8k-bf16:1.0", [96.27, 67.93, 77.50, 80.30], True),
            ("evo2/1b-8k:1.0", [96.27, 67.93, 77.50, 80.30], False),
            ("evo2/7b-8k:1.0", [97.60, 89.63, 80.03, 84.57], False),
            ("evo2/7b-1m:1.0", [97.60, 89.63, 80.03, 84.57], False),
        ],
    )
    def test_forward_manual(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float], flash_decode: bool):
        assert len(sequences) > 0
        seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
            ckpt_name, test_name=inspect.currentframe().f_code.co_name
        )
    
        is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
        skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
    
        vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
        if skip:
            # This checkpoint is sensitive to FP8, so we skip it if it is not supported on the current device.
            pytest.skip(f"Skipping {ckpt_name} because it is not supported on {device_info} ({compute_capability})")
        with distributed_model_parallel_state(), torch.no_grad():
            tokenizer = get_nmt_tokenizer(
                "byte-level",
            )
            flash_decode_kwargs: dict[str, Any] = {"flash_decode": flash_decode}
            if flash_decode:
                flash_decode_kwargs["attention_backend"] = AttnBackend.flash
            if "1b-8k" in ckpt_name:
                model_config = llm.Hyena1bConfig(
                    use_te=True,
                    seq_length=8192,
                    vortex_style_fp8=vortex_style_fp8,
                    **flash_decode_kwargs,
                )
            elif "7b-8k" in ckpt_name:
                model_config = llm.Hyena7bConfig(
                    use_te=True,
                    seq_length=8192,
                    vortex_style_fp8=vortex_style_fp8,
                    **flash_decode_kwargs,
                )
            elif "7b-1m" in ckpt_name:
                model_config = llm.Hyena7bARCLongContextConfig(
                    use_te=True,
                    seq_length=8192,
                    vortex_style_fp8=vortex_style_fp8,
                    **flash_decode_kwargs,
                )
            else:
                raise NotImplementedError
            ckpt_weights: Path = load(ckpt_name) / "weights"
            raw_megatron_model = model_config.configure_model(tokenizer).eval().cuda()
            device = raw_megatron_model.parameters().__next__().device
            load_weights_sharded_inplace_nemo2_to_mcore(raw_megatron_model, ckpt_weights, {}, "torch_dist")
            model = Float16Module(model_config, raw_megatron_model)
            if flash_decode:
                inference_context = HyenaInferenceContext(max_batch_size=1, max_sequence_length=8192)
                forward_kwargs = {"runtime_gather_output": True, "inference_context": inference_context}
            else:
                forward_kwargs = {}
            matchrates = []
            for seq in sequences:
                seq = seq[
                    :seq_len_cap
                ]  # TODO: artificial limit, megatron uses more memory. Vortex can process full sequences
                with torch.no_grad():
                    device = torch.cuda.current_device()
                    # tokens = torch.tensor([tokenizer.tokenize(seq)], device=device)
                    input_ids = torch.tensor(tokenizer.text_to_ids(seq)).int().unsqueeze(0).to(device)
                    attention_mask = None
                    # when labels is None, the model returns logits
                    logits = model(
                        input_ids=input_ids,
                        position_ids=None,
                        attention_mask=attention_mask,
                        labels=None,
                        **forward_kwargs,
                    )
                    if flash_decode:
                        forward_kwargs["inference_context"].reset()
>                   matchrate = calc_matchrate(tokenizer=tokenizer, in_seq=seq, logits=logits)

.../bionemo/evo2/test_evo2.py:614: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def calc_matchrate(*, tokenizer, in_seq, logits):
        softmax_logprobs = torch.log_softmax(logits, dim=-1)
        softmax_logprobs = softmax_logprobs[:, :-1]
        o = softmax_logprobs.argmax(dim=-1)[0]
        if hasattr(tokenizer, "tokenize"):
            i = torch.tensor(tokenizer.tokenize(in_seq[1:]), device=o.device)
        else:
            i = torch.tensor(tokenizer.text_to_ids(in_seq[1:]), device=o.device)
>       return (i == o).sum().item() / (i.size()[0] - 1)
E       RuntimeError: The size of tensor a (5999) must match the size of tensor b (0) at non-singleton dimension 0

.../bionemo/evo2/test_evo2.py:452: RuntimeError
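
This failure is a downstream symptom of the same shape problem: with only one logit position, `softmax_logprobs[:, :-1]` is empty, so `o` has zero elements while `i` has 5999, and the elementwise comparison fails. A hedged sketch of a shape guard that would surface the root cause with a readable message, assuming the helper keeps its current signature:

```python
import torch


def calc_matchrate(*, tokenizer, in_seq, logits):
    # Hedged sketch: same computation as the test helper, plus an explicit
    # check that the model produced one logit position per input token, so a
    # flash-decode output of shape [1, 1, vocab] fails with a clear message.
    softmax_logprobs = torch.log_softmax(logits, dim=-1)[:, :-1]
    o = softmax_logprobs.argmax(dim=-1)[0]
    ids = tokenizer.tokenize(in_seq[1:]) if hasattr(tokenizer, "tokenize") else tokenizer.text_to_ids(in_seq[1:])
    i = torch.tensor(ids, device=o.device)
    assert o.numel() == i.numel(), f"got {o.numel()} predicted positions for {i.numel()} target tokens"
    return (i == o).sum().item() / (i.size()[0] - 1)
```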
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_forward[evo2/1b-8k-bf16:1.0-expected_matchpercents0]
Stack Traces | 11.5s run time
sequences = ['GAATAGGAACAGCTCCGGTCTACAGCTCCCAGCGTGAGCGACGCAGAAGACGGTGATTTCTGCATTTCCATCTGAGGTACCGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGC...CTCCATGACTTTTTCAAAAAGGTATTAGAAAAACCATTTCATAACTTTGTCAAAGTTAAATTATAGGCTAAATCCTATATATCTTAATGGCACATGCAGCGCAAGTAGGTCTACAAG']
ckpt_name = 'evo2/1b-8k-bf16:1.0'
expected_matchpercents = [96.27, 67.93, 77.5, 80.3]

    @pytest.mark.parametrize(
        "ckpt_name,expected_matchpercents",
        [
            ("evo2/1b-8k-bf16:1.0", [96.27, 67.93, 77.50, 80.30]),
            ("evo2/1b-8k:1.0", [96.27, 67.93, 77.50, 80.30]),
            ("evo2/7b-8k:1.0", [97.60, 89.63, 80.03, 84.57]),
            ("evo2/7b-1m:1.0", [97.60, 89.63, 80.03, 84.57]),
        ],
    )
    def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float]):
        assert len(sequences) > 0
        seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
            ckpt_name, test_name=inspect.currentframe().f_code.co_name
        )
    
        is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
        skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
        if skip:
            # This checkpoint is sensitive to FP8, so we skip it if it is not supported on the current device.
            pytest.skip(f"Skipping {ckpt_name} because it is not supported on {device_info} ({compute_capability})")
        vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
        inference_wrapped_model, mcore_tokenizer = get_model_and_tokenizer(
            ckpt_name, vortex_style_fp8=vortex_style_fp8, flash_decode=True, enable_flash_decode=True
        )
        matchrates = []
        for seq in sequences:
            seq = seq[:seq_len_cap]  # TODO: artificial limit, megatron uses more memory. Vortex can process full sequences
            with torch.no_grad():
                device = torch.cuda.current_device()
                tokens = torch.tensor([mcore_tokenizer.tokenize(seq)], device=device)
                forward_args = {
                    "tokens": tokens,
                    "position_ids": None,
                    "attention_mask": None,
                }
    
                inference_wrapped_model.prep_model_for_inference(prompts_tokens=None)
                logits = inference_wrapped_model.run_one_forward_step(forward_args)
                inference_wrapped_model.inference_context.reset()
    
                from megatron.core.inference.communication_utils import broadcast_from_last_pipeline_stage
    
                batch_size, context_length, vocab_size = 1, len(seq), 512
>               logits = broadcast_from_last_pipeline_stage(
                    [batch_size, context_length, vocab_size],
                    dtype=inference_wrapped_model.inference_wrapper_config.params_dtype,
                    tensor=logits,
                )

.../bionemo/evo2/test_evo2.py:514: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

size = [1, 6000, 512], dtype = torch.bfloat16
tensor = tensor([[[ -5.1250, -41.7500, -41.7500, -41.7500, -41.7500, -41.7500, -41.7500,
          -41.7500, -41.7500, -41.7500...1.7500, -41.7500, -41.7500, -41.7500, -41.7500, -41.7500,
          -41.7500]]], device='cuda:0', dtype=torch.bfloat16)
pp_group = <torch.distributed.distributed_c10d.ProcessGroup object at 0x7e6e141800f0>

    def broadcast_from_last_pipeline_stage(
        size: List[int],
        dtype: torch.dtype,
        tensor: Optional[torch.Tensor] = None,
        pp_group: Optional[ProcessGroup] = None,
    ):
        """Broadcast a tensor from last pipeline stage to all ranks.
    
        Args:
            size: Expected tensor size
            dtype: Expected tensor dtype
            tensor: Tensor to broadcast (only on last stage)
            pp_group: Custom process group (if None, uses global state)
        """
        # Use custom process group or fall back to global state
        if pp_group is None:
            pp_group = parallel_state.get_pipeline_model_parallel_group()
            last_rank = parallel_state.get_pipeline_model_parallel_last_rank()
    
            # add ignore_virtual=True since vpp is not used in inference
            is_last_stage = parallel_state.is_pipeline_last_stage(ignore_virtual=True)
        else:
            # Lists of ProcessGroups are used for multimodal inference but not supported here
            assert isinstance(
                pp_group, ProcessGroup
            ), "pp_group must be a single ProcessGroup, not a list of ProcessGroups"
            last_rank = torch.distributed.get_process_group_ranks(pp_group)[pp_group.size() - 1]
            is_last_stage = pp_group.rank() == pp_group.size() - 1
    
        if is_last_stage:
>           assert size == list(
                tensor.shape
            ), f"Expected tensor of shape {size} but got {list(tensor.shape)}"
E           AssertionError: Expected tensor of shape [1, 6000, 512] but got [1, 1, 512]

.../local/lib/python3.12.../core/inference/communication_utils.py:64: AssertionError
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGoCheckpointNotAtValidation::test_stop_and_go_consistency[LearningRateCallback]
Stack Traces | 18.4s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGoCheckpointNotAtValidation'>

    @override
    @classmethod
    def setup_class(cls):
        super().setup_class()
        cls.data_dir = Path(cls.tempdir.name) / "data"
        cls.data_dir.mkdir(parents=True, exist_ok=True)
    
        # setup data
        data_dir = load("esm2/testdata_esm2_pretrain:2.0") / "2024_03_sanity"
    
        cls.train_cluster_path = data_dir / "train_clusters_sanity.parquet"
        cls.train_database_path = data_dir / "train_sanity.db"
        cls.valid_cluster_path = data_dir / "valid_clusters.parquet"
        cls.valid_database_path = data_dir / "validation.db"
        cls.tokenizer: BioNeMoESMTokenizer = get_tokenizer()
    
        # run stop and go
>       cls.run_stop_and_go()

.../esm2/model/test_stop_and_go.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: in run_stop_and_go
    cls.resume()
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ef43bd130>
model_param = Parameter containing:
tensor([-5.8999e-06, -5.9239e-06, -5.9267e-06, -5.9098e-06,  5.7665e-06,
         3.8576e-06,  5...-5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.4231e-04,  1.1678e-04,  9.9285e-05,  1.0032e-04, -1.5391e-03,
        -6.3849e-04, -8.9896e-04,..., -5.9722e-06, -5.9126e-06, -5.9227e-06, -5.9192e-06,
        -5.9195e-06, -5.9177e-06, -5.8719e-06], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
../../../usr/local/lib/python3.12/dist-packages/bionemo/testing/harnesses/stop_and_go.py::test_stop_and_go_consistency[LearningRateCallback]
Stack Traces | 29s run time
cls = <class 'esm2.model.test_stop_and_go.TestESM2StopAndGo'>

    @classmethod
    def run_stop_and_go(cls):
        """Executes training both continuously and with a checkpoint interruption."""
        # Interrupted model training
        cls.stop()
>       cls.resume()

.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:315: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../local/lib/python3.12.../testing/harnesses/stop_and_go.py:288: in resume
    llm.train(
.../local/lib/python3.12.../collections/llm/api.py:129: in train
    trainer.fit(model, data)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:538: in fit
    call._call_and_handle_interrupt(
.../local/lib/python3.12.../pytorch/trainer/call.py:46: in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
.../local/lib/python3.12.../strategies/launchers/subprocess_script.py:105: in launch
    return function(*args, **kwargs)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:574: in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
.../local/lib/python3.12.../pytorch/trainer/trainer.py:972: in _run
    self._checkpoint_connector.restore_training_state()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:298: in restore_training_state
    self.restore_optimizers_and_schedulers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:368: in restore_optimizers_and_schedulers
    self.restore_optimizers()
.../local/lib/python3.12.../trainer/connectors/checkpoint_connector.py:383: in restore_optimizers
    self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
.../local/lib/python3.12.../pytorch/strategies/megatron_strategy.py:1308: in load_optimizer_state_dict
    optimizer.load_state_dict(opt_state)
.../local/lib/python3.12.../core/optim/mcore_optim.py:95: in load_state_dict
    self.mcore_optimizer.load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/optimizer.py:1215: in load_state_dict
    self.chained_optimizers[0].load_state_dict(state_dict)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:865: in load_state_dict
    self.load_parameter_state_from_dp_reshardable(param_state)
.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:1781: in load_parameter_state_from_dp_reshardable
    self._set_main_param_and_optimizer_states(model_param, src_tensors)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <megatron.core.optimizer.distrib_optimizer.DistributedOptimizer object at 0x721ec85a6720>
model_param = Parameter containing:
tensor([-1.1761e-05, -1.1844e-05, -1.1866e-05, -1.1839e-05,  1.1568e-05,
         8.4812e-06,  1...-1.1959e-05, -1.1846e-05, -1.1872e-05, -1.1833e-05,
        -1.1839e-05, -1.1828e-05, -1.1790e-05], requires_grad=True)
tensors = {'exp_avg': tensor([ 1.7037e-04,  1.4331e-04,  1.2414e-04,  1.2611e-04, -2.4063e-03,
        -8.2074e-04, -1.1227e-03,..., -1.1959e-05, -1.1846e-05, -1.1872e-05, -1.1833e-05,
        -1.1839e-05, -1.1828e-05, -1.1790e-05], device='cuda:0')}

    def _set_main_param_and_optimizer_states(self, model_param, tensors):
        """Set the main param and optimizer states corresponding to the input model_param.
    
        The structure of the input `tensors`:
        tensors = {
            "param": torch.Tensor
            "exp_avg": torch.Tensor
            "exp_avg_sq": torch.Tensor
        }
        """
        group_index, group_order = self.model_param_group_index_map[model_param]
        if self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8:
            sharded_model_param = self.optimizer.param_groups[group_index]["params"][group_order]
            for k, v in tensors.items():
                if isinstance(self.optimizer, HybridDeviceOptimizer):
                    if k == "param":
                        k = "master_param"
                    self.optimizer.state[sharded_model_param][k] = v
                    continue
    
                if k == "param":
                    self.optimizer.set_scaled_state(sharded_model_param, "master_param", v)
                else:
                    self.optimizer.set_scaled_state(sharded_model_param, k, v)
        else:
            main_param = self.optimizer.param_groups[group_index]["params"][group_order]
            optim_state = self.optimizer.state[main_param]
            dst_tensors = {"param": main_param, **optim_state}
            for key in dst_tensors:
>               dst_tensors[key].copy_(tensors[key])
E               RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

.../local/lib/python3.12.../core/optimizer/distrib_optimizer.py:928: RuntimeError
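
Both stop-and-go failures hit the same autograd restriction inside Megatron-Core's distributed optimizer: `dst_tensors["param"]` is a view of a parameter that still requires grad, and with grad mode enabled an in-place `copy_` into such a view is rejected. A minimal standalone reproduction of the error and the usual workaround (doing the copy under `torch.no_grad()`) is below; it only illustrates the failure mode and is not the fix that eventually lands in Megatron-LM or NeMo.

```python
import torch

# Minimal reproduction (assumption: this mirrors what the optimizer does when
# restoring parameter state, not its actual code).
param = torch.nn.Parameter(torch.zeros(4))
shard = param[:2]  # a view of a leaf tensor that requires grad
src = torch.ones(2)

try:
    shard.copy_(src)  # in-place op on a view of a leaf requiring grad
except RuntimeError as err:
    print(f"raises: {err}")

# The conventional workaround is to perform the restore with autograd disabled.
with torch.no_grad():
    shard.copy_(src)
print(param)  # tensor([1., 1., 0., 0.], requires_grad=True)
```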


Signed-off-by: Farhad Ramezanghorbani <[email protected]>
@farhadrgh
Copy link
Collaborator Author

/ok to test fbfbcdb

@farhadrgh
Copy link
Collaborator Author

/ok to test dda492d

@farhadrgh
Copy link
Collaborator Author

/ok to test b4d2365

@farhadrgh farhadrgh closed this Nov 3, 2025
auto-merge was automatically disabled November 3, 2025 19:26

Pull request was closed

@farhadrgh farhadrgh reopened this Nov 3, 2025
Resolved submodule conflicts:
- Updated 3rdparty/Megatron-LM to main's version (b615e7310)
- Kept 3rdparty/NeMo at current version (231e75835a)
- Auto-merged datamodule.py files with seq_length property additions

Signed-off-by: Farhad Ramezanghorbani <[email protected]>
@farhadrgh
Copy link
Collaborator Author

/ok to test 07d88fa
