Adds reset and step methods to the BaseEnv class (#239)
# Description

The current `omni.isaac.orbit.envs.BaseEnv` does not provide the `reset` and `step` methods; those
are only added by `RLTaskEnv`. This PR unifies the structure of an `Env` by adding these core
methods to the `BaseEnv` as well.
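
A minimal usage sketch of the unified interface (illustrative only, not part of this diff; `env_cfg` and the action-dimension lookup via the action manager are assumed placeholders for a task-specific setup):

```python
import torch

from omni.isaac.orbit.envs import BaseEnv

# `env_cfg` is assumed to be a task-specific BaseEnvCfg instance.
env = BaseEnv(cfg=env_cfg)

# reset once, then drive the simulation with zero actions
obs, extras = env.reset(seed=42)
for _ in range(100):
    actions = torch.zeros(env.num_envs, env.action_manager.total_action_dim, device=env.device)
    obs, extras = env.step(actions)  # BaseEnv.step returns only (observations, extras)
env.close()
```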


## Type of change

- New feature (non-breaking change which adds functionality)
- This change requires a documentation update

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./orbit.sh --format`
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

---------

Co-authored-by: Mayank Mittal <[email protected]>
pascal-roth and Mayankm96 authored Nov 16, 2023
1 parent 849d9b4 commit 39f4e96
Showing 6 changed files with 681 additions and 70 deletions.
2 changes: 1 addition & 1 deletion source/extensions/omni.isaac.orbit/config/extension.toml
@@ -1,7 +1,7 @@
[package]

# Note: Semantic Versioning is used: https://semver.org/
version = "0.9.43"
version = "0.9.44"

# Description
title = "ORBIT framework for Robot Learning"
10 changes: 10 additions & 0 deletions source/extensions/omni.isaac.orbit/docs/CHANGELOG.rst
@@ -1,6 +1,16 @@
Changelog
---------

0.9.44 (2023-11-16)
~~~~~~~~~~~~~~~~~~~

Added
^^^^^

* Added methods :meth:`reset` and :meth:`step` to the :class:`omni.isaac.orbit.envs.BaseEnv`. This unifies
  the environment interface so that simple standalone applications can use the class directly.


0.9.43 (2023-11-16)
~~~~~~~~~~~~~~~~~~~

118 changes: 118 additions & 0 deletions source/extensions/omni.isaac.orbit/omni/isaac/orbit/envs/base_env.py
@@ -6,6 +6,8 @@
from __future__ import annotations

import builtins
import torch
from typing import Any, Dict, Sequence, Union

import omni.isaac.core.utils.torch as torch_utils

@@ -16,6 +18,29 @@

from .base_env_cfg import BaseEnvCfg

VecEnvObs = Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
"""Observation returned by the environment.
The observations are stored in a dictionary. The keys are the group to which the observations belong.
This is useful for various setups such as reinforcement learning with asymmetric actor-critic or
multi-agent learning. For non-learning paradigms, this may include observations for different components
of a system.
Within each group, the observations can be stored either as a dictionary with keys as the names of each
observation term in the group, or a single tensor obtained from concatenating all the observation terms.
For example, for asymmetric actor-critic, the observation for the actor and the critic can be accessed
using the keys ``"policy"`` and ``"critic"`` respectively.
Note:
By default, most learning frameworks deal with default and privileged observations in different ways.
This handling must be taken care of by the wrapper around the :class:`RLTaskEnv` instance.
For included frameworks (RSL-RL, RL-Games, skrl), the observations must have the key "policy". In case,
the key "critic" is also present, then the critic observations are taken from the "critic" group.
Otherwise, they are the same as the "policy" group.
"""


class BaseEnv:
"""The base environment encapsulates the simulation scene and the environment managers.
@@ -112,6 +137,9 @@ def __init__(self, cfg: BaseEnvCfg):
# if no window, then we don't need to store the window
self._window = None

# allocate dictionary to store metrics
self.extras = {}

def __del__(self):
"""Cleanup for the environment."""
self.close()
@@ -171,6 +199,66 @@ def load_managers(self):
Operations - MDP.
"""

def reset(self, seed: int | None = None, options: dict[str, Any] | None = None) -> tuple[VecEnvObs, dict]:
"""Resets all the environments and returns observations.
Args:
seed: The seed to use for randomization. Defaults to None, in which case the seed is not set.
options: Additional information to specify how the environment is reset. Defaults to None.
Note:
This argument is used for compatibility with Gymnasium environment definition.
Returns:
A tuple containing the observations and extras.
"""
# set the seed
if seed is not None:
self.seed(seed)
# reset state of scene
indices = torch.arange(self.num_envs, dtype=torch.int64, device=self.device)
self._reset_idx(indices)
# return observations
return self.observation_manager.compute(), self.extras

def step(self, action: torch.Tensor) -> tuple[VecEnvObs, dict]:
"""Execute one time-step of the environment's dynamics.
The environment steps forward at a fixed time-step, while the physics simulation runs at a finer
time-step and is decimated, i.e. several simulation steps are taken for every environment step.
This is to ensure that the simulation is stable. These two time-steps can be configured
independently using the :attr:`BaseEnvCfg.decimation` (number of simulation steps per environment
step) and the :attr:`BaseEnvCfg.sim.dt` (physics time-step). Based on these parameters, the
environment time-step is computed as the product of the two.
Args:
action: The actions to apply on the environment. Shape is ``(num_envs, action_dim)``.
Returns:
A tuple containing the observations and extras.
"""
# process actions
self.action_manager.process_action(action)
# perform physics stepping
for _ in range(self.cfg.decimation):
# set actions into buffers
self.action_manager.apply_action()
# set actions into simulator
self.scene.write_data_to_sim()
# simulate
self.sim.step(render=False)
# update buffers at sim dt
self.scene.update(dt=self.physics_dt)
# perform rendering if gui is enabled
if self.sim.has_gui():
self.sim.render()

# post-step: step interval randomization
if "interval" in self.randomization_manager.available_modes:
self.randomization_manager.randomize(mode="interval", dt=self.step_dt)

# return observations and extras
return self.observation_manager.compute(), self.extras

@staticmethod
def seed(seed: int = -1) -> int:
"""Set the seed for the environment.
@@ -202,3 +290,33 @@ def close(self):
self._window = None
# update closing status
self._is_closed = True

"""
Helper functions.
"""

def _reset_idx(self, env_ids: Sequence[int]):
"""Reset environments based on specified indices.
Args:
env_ids: List of environment ids which must be reset.
"""
# reset the internal buffers of the scene elements
self.scene.reset(env_ids)
# randomize the MDP for environments that need a reset
if "reset" in self.randomization_manager.available_modes:
self.randomization_manager.randomize(env_ids=env_ids, mode="reset")

# iterate over all managers and reset them
# this returns a dictionary of information which is stored in the extras
# note: This is order-sensitive! Certain things need to be reset before others.
self.extras["log"] = dict()
# -- observation manager
info = self.observation_manager.reset(env_ids)
self.extras["log"].update(info)
# -- action manager
info = self.action_manager.reset(env_ids)
self.extras["log"].update(info)
# -- randomization manager
info = self.randomization_manager.reset(env_ids)
self.extras["log"].update(info)
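# Equivalent view of the aggregation above (illustrative, not part of this commit):
# the per-manager info dictionaries are merged key-by-key into one flat log, i.e.
#   self.extras["log"] = {**obs_info, **action_info, **randomization_info}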
source/extensions/omni.isaac.orbit/omni/isaac/orbit/envs/rl_task_env.py
@@ -9,40 +9,16 @@
import math
import numpy as np
import torch
from typing import Any, ClassVar, Dict, Sequence, Tuple, Union
from typing import Any, ClassVar, Dict, Sequence, Tuple

from omni.isaac.version import get_version

from omni.isaac.orbit.command_generators import CommandGeneratorBase
from omni.isaac.orbit.managers import CurriculumManager, RewardManager, TerminationManager

from .base_env import BaseEnv
from .base_env import BaseEnv, VecEnvObs
from .rl_task_env_cfg import RLTaskEnvCfg

VecEnvObs = Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
"""Observation returned by the environment.
The observations are stored in a dictionary. The keys are the group to which the observations belong.
This is useful for various learning setups beyond vanilla reinforcement learning, such as asymmetric
actor-critic, multi-agent, or hierarchical reinforcement learning.
For example, for asymmetric actor-critic, the observation for the actor and the critic can be accessed
using the keys ``"policy"`` and ``"critic"`` respectively.
Within each group, the observations can be stored either as a dictionary with keys as the names of each
observation term in the group, or a single tensor obtained from concatenating all the observation terms.
Note:
By default, most learning frameworks deal with default and privileged observations in different ways.
This handling must be taken care of by the wrapper around the :class:`RLTaskEnv` instance.
For included frameworks (RSL-RL, RL-Games, skrl), the observations must have the key "policy". In case,
the key "critic" is also present, then the critic observations are taken from the "critic" group.
Otherwise, they are the same as the "policy" group.
"""


VecEnvStepReturn = Tuple[VecEnvObs, torch.Tensor, torch.Tensor, torch.Tensor, Dict]
"""The environment signals processed at the end of each step.
@@ -76,6 +52,14 @@ class RLTaskEnv(BaseEnv, gym.Env):
environment. Thus, to reduce complexity, we directly use the :class:`gym.Env` over
here and leave it up to library-defined wrappers to take care of wrapping this
environment for their agents.
Note:
For vectorized environments, it is recommended to **only** call the :meth:`reset`
method once before the first call to :meth:`step`, i.e. after the environment is created.
After that, the :meth:`step` function handles the reset of terminated sub-environments.
This is because the simulator does not support resetting individual sub-environments
in a vectorized environment.
"""

is_vector_env: ClassVar[bool] = True
@@ -107,8 +91,6 @@ def __init__(self, cfg: RLTaskEnvCfg, render_mode: str | None = None, **kwargs):
self.common_step_counter = 0
# -- init buffers
self.episode_length_buf = torch.zeros(self.num_envs, device=self.device, dtype=torch.long)
# -- allocate dictionary to store metrics
self.extras = {}

# setup the action and observation spaces for Gym
self._configure_gym_env_spaces()
@@ -158,48 +140,18 @@ def load_managers(self):
Operations - MDP
"""

def reset(self, seed: int | None = None, options: dict[str, Any] | None = None) -> tuple[VecEnvObs, dict]:
"""Resets all the environments and returns observations and extras.
Note:
This function (if called) must **only** be called before the first call to :meth:`step`, i.e.
after the environment is created. After that, the :meth:`step` function handles the reset
of terminated sub-environments.
Args:
seed: The seed to use for randomization. Defaults to None, in which case the seed is not set.
options: Additional information to specify how the environment is reset. Defaults to None.
Note:
This is not used in the current implementation. It is mostly there for compatibility with
Gymnasium environment definition.
Returns:
A tuple containing the observations and extras.
"""
# set the seed
if seed is not None:
gym.Env.reset(self, seed=seed)
self.seed(seed)
# reset state of scene
indices = torch.arange(self.num_envs, dtype=torch.int64, device=self.device)
self._reset_idx(indices)
# return observations
return self.observation_manager.compute(), self.extras

def step(self, action: torch.Tensor) -> VecEnvStepReturn:
"""Run one timestep of the environment's dynamics and reset terminated environments.
"""Execute one time-step of the environment's dynamics and reset terminated environments.
The environment dynamics may comprise of many steps of the physics engine. The number of steps
is controlled by the :attr:`RLTaskEnvCfg.decimation` parameter in the configuration. This means
that the agent control can happen at a slower rate than the physics simulation. This is useful
for real-time control of the robot, where the control loop may be slower than the frequency of
the actual dynamics.
Unlike :meth:`BaseEnv.step`, this function performs the following operations:
The function also handles resetting of the terminated environments, at the end of the physics
stepping and computation of the reward and terminated signals. This is because it is not
possible to reset the sub-environments individually due to the vectorized implementation
of sub-environments in the simulator.
1. Process the actions.
2. Perform physics stepping.
3. Perform rendering if gui is enabled.
4. Update the environment counters and compute the rewards and terminations.
5. Reset the environments that terminated.
6. Compute the observations.
7. Return the observations, rewards, resets and extras.
Args:
action: The actions to apply on the environment. Shape is ``(num_envs, action_dim)``.
@@ -255,12 +207,12 @@ def render(self) -> np.ndarray | None:
By convention, if mode is:
- **human**: render to the current display and return nothing. Usually for human consumption.
- **human**: Render to the current display and return nothing. Usually for human consumption.
- **rgb_array**: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an
x-by-y pixel image, suitable for turning into a video.
Returns:
The rendered image as a numpy array if mode is "rgb_array".
The rendered image as a numpy array if mode is "rgb_array". Otherwise, returns None.
Raises:
RuntimeError: If mode is set to "rgb_data" and simulation render mode does not support it.