
Lbtq main #1311

Closed
wants to merge 7 commits into from

Conversation

le-horizon
Contributor

Main change for lower bound target Q experiments.

Still quite large... let me know if you'd prefer even smaller pieces.

@le-horizon le-horizon requested review from emailweixu and hnyu May 4, 2022 23:44
Collaborator

@hnyu hnyu left a comment

I only roughly skimmed over this PR. In general, I think it's not clear that most of these changes need to be included in ALF. From my understanding, the best scenario would be to separate the LBTQ algorithm from the other miscellaneous stuff in a clean way with a general API, and only integrate that into ALF. All the other stuff can be put in your personal repo, which would contain all the experiment details, helper functions, etc. So far I think the generality of some ALF APIs has been compromised by this PR.


result = result._replace(discount=discount)
result = result._replace(step_type=step_type)
relabeled_rewards = suite_socialbot.transform_reward_tensor(
Collaborator

My understanding is that HindsightExperienceTransformer is a general data transformer independent of the environment. So it feels strange to include suite_socialbot here.

Contributor Author

This transform_reward_tensor function is used to translate the -1/0 reward into a 0/1 reward. We also depend on the -1/0 reward assumption to determine the relabeled LAST (episode end) steps (and to relabel discount to 0).

Removed the dependency here and added it in the task gin or alf configs. This runs the risk of people forgetting to set it in a new task config.

For episodic tasks (sparse_reward==True), the Hindsight transformer does depend on the original reward being -1/0. In this case, maybe I should move ``transform_reward_tensor`` out of suite_socialbot into a more general place like reward_utils.py? Would you prefer that over the current solution of relying on the task's alf config to set up the transform_reward_tensor function?
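For reference, a minimal sketch of what such a shared transform could look like under the -1/0 convention discussed here (the function name and the reward_weight argument follow this thread; this is not the actual ALF implementation):

```python
import torch


def transform_reward_tensor(reward: torch.Tensor,
                            reward_weight: float = 1.,
                            positive_reward: bool = True) -> torch.Tensor:
    """Map a -1/0 goal reward to 0/1 (or keep -1/0), then scale it.

    Assumes the sparse -1 (goal not achieved) / 0 (goal achieved) convention
    described in this thread.
    """
    if positive_reward:
        # -1/0 -> 0/1: a step that achieves the goal gets reward 1.
        reward = reward + 1.
    return reward * reward_weight
```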

@@ -922,4 +1022,7 @@ def create_data_transformer(data_transformer_ctor, observation_spec):
if len(data_transformer_ctor) == 1:
return data_transformer_ctor[0](observation_spec)

if HindsightExperienceTransformer in data_transformer_ctor:
assert HindsightExperienceTransformer == data_transformer_ctor[0], \
"Hindsight relabeling should happen before all other transforms."
Collaborator

This is no longer true after Break's recent PR on SequentialDataTransformer. Basically UntransformedTimeStep could appear before any other data transformer. I suggest either justifying why HindsightExperienceTransformer has to appear before UntransformedTimeStep or removing this assertion here.

Contributor Author

Good point. Removed.


final_relabeled_rewards = relabeled_rewards
if result.reward.ndim > 2:
# multi dimensional env reward, first dim is goal related reward.
Collaborator

Is this reward assumption too strict and easily violated by a user?

Contributor Author

True. Made a note about the assumption up front in the class description.
The proper fix may be to split the current multi-dim reward into a nested reward field. Added a TODO.

@@ -759,6 +750,9 @@ def __init__(self,
exp nest.
desired_goal_field (str): path to the desired_goal field in the
exp nest.
sparse_reward (bool): Whether to transform reward from -1/0 to 0/1.
Collaborator

This transform from -1/0 to 0/1 seems specific to certain environments and is not a general thing.

Contributor Author

Environment rewards can be 0/1 or -1/0 or any form. But because the Hindsight relabeler needs to know whether the goal is achieved in order to relabel discount and step_type correctly, we assume the reward_fn always returns a -1/0 reward, which is then transformed into the 0/1 form when needed.

Since it relabels discount to 0, this also turns infinite horizon tasks into episodic ones. Updated the comment.

Collaborator

Environment rewards can be 0/1 or -1/0 or any form. But because the Hindsight relabeler needs to know whether the goal is achieved in order to relabel discount and step_type correctly, we assume the reward_fn always returns a -1/0 reward, which is then transformed into the 0/1 form when needed.

Since it relabels discount to 0, this also turns infinite horizon tasks into episodic ones. Updated the comment.

From the docstring, the observation is assumed to have a field "achieved_goal" indicating what goal has been achieved. If this is true, then there is no need to infer goal achievement from the reward, which is not a general setup. (On the other hand, -1/0 is as sparse as 0/1, so the name 'sparse_reward' could be improved.)

Contributor Author

Good point about goal achievement. It's still hard to find a general way, though: the reward function is general and may not return anything about goal achievement, e.g. when it is a dense reward based on distance to the goal.

Renaming the variable to relabel_with_episodic_rewards, since -1/0 is for continuing tasks and 0/1 is for episodic ones.
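To illustrate the episodic relabeling being discussed, here is a hypothetical helper (illustrative names and constants only, not ALF's actual HindsightExperienceTransformer code): once the relabeled reward indicates the goal was reached, the step is marked LAST and its discount set to 0.

```python
import torch

# Hypothetical stand-ins for ALF's StepType values, for illustration only.
MID, LAST = 1, 2


def relabel_episode_end(relabeled_rewards, step_type, discount):
    """Mark steps whose relabeled goal is achieved as episode ends.

    Assumes the -1/0 convention from this thread: reward == 0 means the
    (relabeled) goal was achieved at that step.
    """
    achieved = relabeled_rewards == 0
    step_type = torch.where(achieved,
                            torch.full_like(step_type, LAST), step_type)
    discount = torch.where(achieved,
                           torch.zeros_like(discount), discount)
    return step_type, discount
```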

@@ -180,6 +182,7 @@ def __init__(self,
observation_spec,
stack_size=4,
stack_axis=0,
convert_only_minibatch_to_device=False,
Collaborator

What does this argument mean? The docstring should explain its actual meaning.

Contributor Author

Added comment

lb_loss_scale: bool = False,
reward_multiplier: float = 1.,
positive_reward: bool = True,
use_retrace: bool = False,
Collaborator

To me, it's not clear which of these arguments are really essential to TDLoss in a general sense. Some of them seem only applicable to specific scenarios.

Contributor Author

Good point. Created a subclass to do lower bounding. Kept a few general functions like clip and retrace in td_loss.
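A rough sketch of that subclassing split, with illustrative names (the real ALF TDLoss interface has more arguments):

```python
import torch.nn as nn


class TDLoss(nn.Module):
    """Simplified stand-in for ALF's TDLoss; only generic options live here."""

    def __init__(self, gamma: float = 0.99, td_lambda: float = 0.95,
                 use_retrace: bool = False):
        super().__init__()
        self._gamma = gamma
        self._td_lambda = td_lambda
        self._use_retrace = use_retrace


class LowerBoundedTDLoss(TDLoss):
    """Lower-bound-target-Q specific options go into the subclass."""

    def __init__(self, lb_loss_scale: bool = False,
                 reward_multiplier: float = 1.,
                 positive_reward: bool = True, **kwargs):
        super().__init__(**kwargs)
        self._lb_loss_scale = lb_loss_scale
        self._reward_multiplier = reward_multiplier
        self._positive_reward = positive_reward
```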

'TrainerConfig.replay_buffer_length=64',
# If replay_buffer_length is above 6 (3 iters * 2 unroll length), whole replay buffer training
# algorithms (e.g. mbrl_pendulum) will not get enough training data, and will not start training.
# This fails during playing, when the config file isn't generated in root_dir.
Collaborator

Why would a longer replay buffer not give enough training data?

Contributor Author

Because the replay buffer is cleared before it's fully filled, training never starts.

Collaborator

Because the replay buffer is cleared before it's fully filled, training never starts.

Whole replay buffer training doesn't require the buffer to be full; it just gets all of the data in the buffer even when it's not full.

Just curious: this number of 64 had no issue before? Did you run into this problem with new test cases?

Contributor Author

Ah, you are right. It turns out this was due to my earlier change in rl_algorithm, which required the buffer to be full even when clear_replay_buffer is False. That change was already reverted; this change is also reverted.

env,
reward_cap=1.,
positive_reward=True,
append_reward_dim=False):
Collaborator

Docstrings are needed for the new arguments. The name reward_cap doesn't correctly reflect its usage, which is actually a scaling factor. Why do you need append_reward_dim (it should be put in another wrapper if needed)?

Contributor Author

Added docstrings. Renamed reward_cap to reward_weight. Removed append_reward_dim.

assert (
environment_name.startswith("Fetch")
or environment_name.startswith("HandManipulate")
or environment_name.startswith("Ant")
Collaborator

@hnyu hnyu May 5, 2022

Not sure if it's appropriate to include multiworld Ant in this suite? Note that "Fetch" and "HandManipulate" were introduced at the same time by OpenAI with very similar implementations.

Contributor Author

It's probably OK? The code here is quite simple, and the if conditions clearly mark where Ant is the env.

Collaborator

It's probably OK? The code here is quite simple, and the if conditions clearly mark where Ant is the env.

I mean it's hard for someone else to imagine that metaworld Ant is included in this suite_robotics.py file. The reason for having "robotics" in the name is that OpenAI Gym originally had a "robotics" category, which contains Fetch and HandManipulate.

Contributor Author

I see. How about moving this code to a separate suite_mujoco.py, since all these Ant envs are under https://github.com/openai/gym/tree/master/gym/envs/mujoco?

@@ -457,3 +458,36 @@ def transform(self, x):

def inverse_transform(self, y):
return y.sign() * ((y / self._alpha).abs().exp() - 1)


@alf.configurable
Collaborator

These hindsight related functions shouldn't be put in a general file like math_ops.py.

Contributor Author

Both data_transformer and suite_robotics use these functions, so a util module is perhaps a reasonable place? Should I create a utils/reward_functions.py?

Collaborator

Both data_transformer and suite_robotics use these functions, so a util module is perhaps a reasonable place? Should I create a utils/reward_functions.py?

These calculations are very simple (basically one line) and don't have to be factored into functions.

Contributor Author

The hindsight data_transformer requires an input function, and the reward function in these two places needs to be exactly the same; otherwise, bugs can happen. Plus, it is already not that simple due to the multi-dim reward.
That's why I keep them as alf.configurable functions.
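For context, a minimal sketch of the kind of shared reward function under discussion, following the -1/0 convention (the name l2_dist_close_reward_fn appears in this PR's diffs; the exact ALF implementation may differ):

```python
import torch


def l2_dist_close_reward_fn(achieved_goal: torch.Tensor,
                            desired_goal: torch.Tensor,
                            threshold: float = 0.05) -> torch.Tensor:
    """Return 0 when the achieved goal is within ``threshold`` (L2 distance)
    of the desired goal, and -1 otherwise, per the -1/0 convention above."""
    dist = torch.norm(achieved_goal - desired_goal, dim=-1)
    return torch.where(dist < threshold,
                       torch.zeros_like(dist),
                       -torch.ones_like(dist))
```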

Contributor Author

@le-horizon le-horizon left a comment

Thanks for the insightful comments, @hnyu.
Tried to address all of them. Replies or questions are inlined. PTAL.

Thanks,
Le

return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor Author

This is to fail early and clearly: action is a tensor, not a distribution. Putting action directly there could cause confusion when debugging. See the comment 3 lines above.

Contributor

@emailweixu emailweixu left a comment

This PR still contains a lot of separable parts. It's better to separate them and describe the motivation for each of them.

return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor

It is intended to return action_distribution here so that some other algorithm can use it (e.g. TracAlgorithm). Did you find it causing problems?

@@ -152,6 +154,9 @@ def __init__(self,
q_network_cls=QNetwork,
reward_weights=None,
epsilon_greedy=None,
rollout_epsilon_greedy=1.0,
use_epsilon_schedule=0,
Contributor

It's better to change epsilon_greedy to accept both a float and a Scheduler. For a float, you can use schedulers.as_scheduler to convert it to a scheduler, similar to TrainerConfig.priority_replay_alpha.

Contributor Author

Nice!
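A rough sketch of the suggested pattern, with a stand-in as_scheduler helper (ALF's actual schedulers module may differ in details):

```python
from typing import Callable, Union

# A scheduler here is just "call it, get the current value".
Scheduler = Callable[[], float]


def as_scheduler(value: Union[float, Scheduler]) -> Scheduler:
    """Wrap a plain float into a constant scheduler; pass schedulers through."""
    if callable(value):
        return value
    return lambda: float(value)


class DqnLikeAlgorithm:
    def __init__(self, rollout_epsilon_greedy: Union[float, Scheduler] = 0.1):
        # Accept either a constant or a schedule, as suggested in the review.
        self._rollout_eps = as_scheduler(rollout_epsilon_greedy)

    def rollout_step(self):
        eps = self._rollout_eps()  # current epsilon for this rollout step
        ...
```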

@@ -274,7 +290,10 @@ def __init__(self,
)

def _init_log_alpha():
return nn.Parameter(torch.tensor(float(initial_log_alpha)))
alpha = torch.tensor(float(initial_log_alpha))
if alpha_optimizer is None:
Contributor

I think if you want to fix alpha, you can simply set the learning rate for it to zero. ???_optimizer == None usually means using a default optimizer.

Contributor Author

Oh, I misunderstood the param. lr=0 sounds good. Removed.
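For illustration, one generic PyTorch way to keep alpha as a Parameter yet effectively freeze it is a per-parameter optimizer group with lr=0 (not ALF-specific):

```python
import torch
import torch.nn as nn

log_alpha = nn.Parameter(torch.tensor(0.0))
other_params = [nn.Parameter(torch.randn(4))]  # placeholder for the rest of the model

optimizer = torch.optim.Adam([
    {"params": other_params, "lr": 1e-3},
    {"params": [log_alpha], "lr": 0.0},  # lr=0 keeps log_alpha fixed during training
])
```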

inputs.observation,
state=state.action,
epsilon_greedy=self._epsilon_greedy,
eps_greedy_sampling=True)
info = SacInfo(action_distribution=action_dist)
if (alf.summary.render.is_rendering_enabled()
Contributor

Agree with @hnyu. The current situation is that different people want to render different things, so it's hard to have an agreed-upon set of figures for rendering. This is different from TensorBoard summaries, where space is not that precious.

"_improve_w_nstep_bootstrap") and \
self._critic_losses[0]._improve_w_nstep_bootstrap:
# Ignore 2nd - nth step actor losses.
actor_loss.loss[1:] = 0
Contributor

Why do you need to ignore the actor loss for those steps? It seems to me it doesn't hurt to keep them.

Contributor Author

Good point. Since we compare with n-step return methods anyway, it may be fairer to keep the other steps' actor and alpha losses.

td_lambda: Lambda parameter for TD-lambda computation.
normalize_target (bool): whether to normalize target.
Note that the effect of this is to change the loss. The critic
value itself is not normalized.
use_retrace: turn on retrace loss
:math:`\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]`
copied from PR #695.
Contributor

"copied from PR #695" can be removed.

Contributor Author

Done.

@@ -118,6 +118,67 @@ def action_importance_ratio(action_distribution,
return importance_ratio, importance_ratio_clipped


def generalized_advantage_estimation(rewards,
Contributor

This should be put side-by-side with the original generalized_advantage_estimation for comparison.

Contributor Author

Done

importance_ratio,
use_retrace=False,
td_lambda=1.0,
time_major=True):
Contributor

It should also cite the Retrace paper.

Contributor Author

Done
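For readers, a rough time-major sketch of the Retrace(λ) target from the docstring above (Munos et al., 2016), with c_s = λ·min(1, ρ_s); it ignores the discount and episode-boundary handling that a real implementation needs, so treat it as illustration only:

```python
import torch


def retrace_targets(rewards, values, target_values, importance_ratio,
                    gamma: float = 0.99, td_lambda: float = 1.0):
    """Compute Retrace(λ)-corrected targets for Q(x_t, a_t).

    Shapes are time-major [T, B]. ``values`` are Q(x_t, a_t); ``target_values``
    approximate E_π Q(x_t, ·); ``importance_ratio`` is π/μ per step.
    Returns targets of shape [T-1, B].
    """
    T = rewards.shape[0]
    c = td_lambda * torch.clamp(importance_ratio, max=1.0)
    # One-step TD errors: r_{t+1} + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t).
    deltas = rewards[1:] + gamma * target_values[1:] - values[:-1]
    acc = torch.zeros_like(values[-1])
    targets = torch.zeros_like(values[:-1])
    for t in reversed(range(T - 1)):
        acc = deltas[t] + gamma * c[t + 1] * acc
        targets[t] = values[t] + acc
    return targets
```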

def __init__(self, env, reward_weight=1., positive_reward=True):
"""
Args:
reward_weight (float): weight of output reward.
Contributor

Use type hints in the function signature instead:

def __init__(self, env, reward_weight: float = 1., positive_reward: bool = True)

Contributor Author

Done

fields=None):
"""Create a FrameStacker object.

Args:
observation_spec (nested TensorSpec): describing the observation in timestep
stack_size (int): stack so many frames
stack_axis (int): the dimension to stack the observation.
convert_only_minibatch_to_device (bool): whether to convert only the
minibatch or the whole batch of data to the default device.
Collaborator

The variable name and docstring seem inconsistent with the code? FrameStacker will always affect the entire batch?

reward_fn=l2_dist_close_reward_fn):
sparse_reward=False,
add_noise_to_goals=False,
threshold=.05,
Collaborator

You can merge add_noise_to_goals and threshold into just one relabeled_goal_noise, where 0 means no noise is added.

Contributor Author

Good point. Done.
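A small sketch of the merged-argument scheme suggested above (illustrative names only):

```python
import torch


def maybe_add_goal_noise(goals: torch.Tensor,
                         relabeled_goal_noise: float = 0.0) -> torch.Tensor:
    """Add uniform noise of magnitude ``relabeled_goal_noise`` to relabeled
    goals; 0 means no noise, replacing the separate add_noise_to_goals flag."""
    if relabeled_goal_noise == 0.0:
        return goals
    noise = (torch.rand_like(goals) * 2.0 - 1.0) * relabeled_goal_noise
    return goals + noise
```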

dist = buffer.steps_to_episode_end(last_step_pos, last_env_ids)
if alf.summary.should_record_summaries():
alf.summary.scalar(
"replayer/" + buffer._name + ".mean_steps_to_episode_end",
torch.mean(dist.type(torch.float32)))

def _add_noise(t):
Collaborator

This function transform_experience() is already over 200 lines, and it's really difficult to read. I suggest splitting it into several member functions to make it more modular.

Contributor Author

Good point. Cleaned up a bit.

@@ -42,7 +42,7 @@
DdpgInfo = namedtuple(
"DdpgInfo", [
"reward", "step_type", "discount", "action", "action_distribution",
"actor_loss", "critic", "discounted_return"
"actor_loss", "critic", "discounted_return", "future_distance", "her"
Collaborator

Better not to put DDPG-irrelevant concepts into this Info structure. HER and DDPG are two orthogonal concepts and should be disentangled in the code. Otherwise, when we have a third algorithm in the future, more fields will be added to this Info structure again.
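One way to read this suggestion (purely illustrative, not what the PR ended up doing): nest the HER-specific fields in their own structure so each algorithm's Info tuple stays algorithm-specific.

```python
from collections import namedtuple

# HER-specific fields live in their own structure instead of widening DdpgInfo,
# so a third algorithm could reuse HerInfo without growing its own Info tuple.
HerInfo = namedtuple("HerInfo", ["future_distance", "her"], defaults=((), ()))

DdpgInfo = namedtuple(
    "DdpgInfo",
    ["reward", "step_type", "discount", "action", "action_distribution",
     "actor_loss", "critic", "discounted_return", "her_info"],
    defaults=((),) * 9)
```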

return reward


class SparseReward(gym.Wrapper):
Collaborator

You can reuse the one in suite_robotics.py?

Contributor Author

This one also assumes the first reward dimension is the goal reward when the reward is multi-dimensional.
So perhaps a bit of code duplication is actually better, to keep the env-specific reward logic separate.



@alf.configurable
def transform_reward_tensor(reward, reward_cap=1., positive_reward=True):
Collaborator

Where is this function used? If it's only in your conf file, then it should be put there instead.

Contributor Author

Both suite_robotics and suite_socialbot, as well as their conf files, could use this.

@@ -38,11 +40,61 @@ def is_available():
return social_bot is not None


@alf.configurable
def transform_reward(reward, reward_cap=1., positive_reward=True):
Collaborator

Should be a class member of SparseReward below, as it is not used by other files / code.

Contributor Author

Cannot do that: positive_reward needs to be consistent with transform_reward_tensor, which is typically ensured in a config file or on the command line. It needs to be alf.configurable.


Contributor Author

@le-horizon le-horizon left a comment

Addressed all comments. Please hold off from reviewing, and let me further split this PR.

Thanks,
Le


return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor Author

Yes, when used with e.g. Retrace, a distribution is needed, but action is not a distribution.

"_improve_w_nstep_bootstrap") and \
self._critic_losses[0]._improve_w_nstep_bootstrap:
# Ignore 2nd - nth step actor losses.
actor_loss.loss[1:] = 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Since we compare with n-step return methods anyways, it may be more fair to keep the other steps' action and alpha losses.

@@ -56,7 +57,8 @@
SacInfo = namedtuple(
"SacInfo", [
"reward", "step_type", "discount", "action", "action_distribution",
"actor", "critic", "alpha", "log_pi", "discounted_return"
"actor", "critic", "alpha", "log_pi", "discounted_return",
"future_distance", "her"
Contributor Author

Good point. For now, I added a comment that these three fields are optional, and a TODO for future work to add an HER algorithm wrapper.

rollout_epsilon_greedy.
max_target_action (bool): whether to use the action with the highest
target value as the target action for computing bootstrapped value.
To mimic the DQN algorithm, set this to True.
Contributor Author

Gave it a try; how about this new dqn_algorithm.py?
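A rough sketch of what the max_target_action option describes (illustrative only; the actual dqn_algorithm.py in the follow-up may differ): bootstrap from the argmax of the target critic's Q values, as in DQN.

```python
import torch


def bootstrap_target_value(target_q_values: torch.Tensor,
                           sampled_next_action: torch.Tensor,
                           max_target_action: bool = True) -> torch.Tensor:
    """target_q_values: [B, num_actions] from the target critic.

    With max_target_action=True, mimic DQN and bootstrap from the best
    target-Q action; otherwise use the action sampled from the policy.
    """
    if max_target_action:
        return target_q_values.max(dim=-1).values
    return target_q_values.gather(
        -1, sampled_next_action.long().unsqueeze(-1)).squeeze(-1)
```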

alpha = torch.tensor(float(initial_log_alpha))
if alpha_optimizer is None:
return alpha
return nn.Parameter(alpha)
Contributor Author

Got it. Removed.


def step(self, action):
obs, reward, done, info = self.env.step(action)
return obs, reward, done, {}
Contributor Author

Done.

@le-horizon
Contributor Author

Split into #1330 and a few others

@le-horizon le-horizon closed this May 17, 2022