
Lbtq main #1311

Closed
wants to merge 7 commits into from

Conversation

le-horizon
Contributor

Main change for lower bound target Q experiments.

Still quite large... let me know if you'd prefer even smaller pieces.

@le-horizon le-horizon requested review from emailweixu and hnyu May 4, 2022 23:44
Collaborator

@hnyu hnyu left a comment

I only roughly skimmed over this PR. In general, I think it's not clear that most of these changes need to be included in ALF. From my understanding, the best scenario would be to separate the LBTQ algorithm from the other miscellaneous stuff in a clean way with a general API, and only integrate that into ALF. All the other stuff can be put in your personal repo, which would contain all the experiment details, helper functions, etc. So far I think the generality of some ALF APIs has been compromised by this PR.


result = result._replace(discount=discount)
result = result._replace(step_type=step_type)
relabeled_rewards = suite_socialbot.transform_reward_tensor(
Collaborator

My understanding is that HindsightExperienceTransformer is a general data transformer independent of the environment. So it feels strange to include suite_socialbot here.

Contributor Author

This transform_reward_tensor function is used to translate the -1/0 reward into a 0/1 reward. We also depend on the -1/0 reward assumption to determine the relabeled LAST (episode end) steps (and to relabel discount to 0).

Removed the dependency here and added it in the task gin or alf configs. This runs the risk of people forgetting to set it in a new task config.

For episodic tasks (sparse_reward==True), the Hindsight transformer does depend on the original reward being -1/0. In this case, maybe I should move ``transform_reward_tensor`` out of suite_socialbot into a more general place like reward_utils.py? Would you prefer that over the current solution of relying on the task's alf config to set up the transform_reward_tensor function?
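For reference, a minimal sketch of what such a shared transform could look like under the -1/0 convention discussed here (the function name and the reward_weight argument follow this thread; this is not the actual ALF implementation):

```python
import torch


def transform_reward_tensor(reward: torch.Tensor,
                            reward_weight: float = 1.,
                            positive_reward: bool = True) -> torch.Tensor:
    """Map a -1/0 goal reward to 0/1 (or keep -1/0), then scale it.

    Assumes the sparse -1 (goal not achieved) / 0 (goal achieved) convention
    described in this thread.
    """
    if positive_reward:
        # -1/0 -> 0/1: a step that achieves the goal gets reward 1.
        reward = reward + 1.
    return reward * reward_weight
```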

@@ -922,4 +1022,7 @@ def create_data_transformer(data_transformer_ctor, observation_spec):
if len(data_transformer_ctor) == 1:
return data_transformer_ctor[0](observation_spec)

if HindsightExperienceTransformer in data_transformer_ctor:
assert HindsightExperienceTransformer == data_transformer_ctor[0], \
"Hindsight relabeling should happen before all other transforms."
Collaborator

This is no longer true after Break's recent PR on SequentialDataTransformer. Basically UntransformedTimeStep could appear before any other data transformer. I suggest either justifying why HindsightExperienceTransformer has to appear before UntransformedTimeStep or removing this assertion here.

Contributor Author

Good point. Removed.


final_relabeled_rewards = relabeled_rewards
if result.reward.ndim > 2:
# multi dimensional env reward, first dim is goal related reward.
Collaborator

Is this reward assumption too strict and easily violated by a user?

Contributor Author

True. Made a note about the assumption up front in the class description.
The proper fix may be to split the current multi-dim reward into a nested reward field. Added a TODO.

@@ -759,6 +750,9 @@ def __init__(self,
exp nest.
desired_goal_field (str): path to the desired_goal field in the
exp nest.
sparse_reward (bool): Whether to transform reward from -1/0 to 0/1.
Collaborator

This transform from -1/0 to 0/1 seems specific to certain environments and is not a general thing.

Contributor Author

Environment rewards can be 0/1 or -1/0 or any form. But because the Hindsight relabeler needs to know whether the goal is achieved in order to relabel discount and step_type correctly, we assume the reward_fn always returns a -1/0 reward, which is then transformed into the 0/1 form when needed.

Since it relabels discount to 0, this also turns infinite horizon tasks into episodic ones. Updated the comment.

Collaborator

Environment rewards can be 0/1 or -1/0 or any form. But because the Hindsight relabeler needs to know whether the goal is achieved in order to relabel discount and step_type correctly, we assume the reward_fn always returns a -1/0 reward, which is then transformed into the 0/1 form when needed.

Since it relabels discount to 0, this also turns infinite horizon tasks into episodic ones. Updated the comment.

From the docstring, the observation is assumed to have a field "achieved_goal" indicating what goal has been achieved. If this is true, then there is no need to infer goal achievement from the reward, which is not a general setup. (On the other hand, -1/0 is as sparse as 0/1, so the name 'sparse_reward' could be improved.)

Contributor Author

Good point about goal achievement. It's still hard to find a general way, though: the reward function is general and may not return anything about goal achievement, e.g. when it is a dense reward based on distance to the goal.

Renaming the variable to relabel_with_episodic_rewards, since -1/0 is for continuing tasks and 0/1 is for episodic ones.
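To illustrate the episodic relabeling being discussed, here is a hypothetical helper (illustrative names and constants only, not ALF's actual HindsightExperienceTransformer code): once the relabeled reward indicates the goal was reached, the step is marked LAST and its discount set to 0.

```python
import torch

# Hypothetical stand-ins for ALF's StepType values, for illustration only.
MID, LAST = 1, 2


def relabel_episode_end(relabeled_rewards, step_type, discount):
    """Mark steps whose relabeled goal is achieved as episode ends.

    Assumes the -1/0 convention from this thread: reward == 0 means the
    (relabeled) goal was achieved at that step.
    """
    achieved = relabeled_rewards == 0
    step_type = torch.where(achieved,
                            torch.full_like(step_type, LAST), step_type)
    discount = torch.where(achieved,
                           torch.zeros_like(discount), discount)
    return step_type, discount
```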

@@ -180,6 +182,7 @@ def __init__(self,
observation_spec,
stack_size=4,
stack_axis=0,
convert_only_minibatch_to_device=False,
Collaborator

What does this argument mean? The docstring should explain its actual meaning.

Contributor Author

Added comment

lb_loss_scale: bool = False,
reward_multiplier: float = 1.,
positive_reward: bool = True,
use_retrace: bool = False,
Collaborator

To me, it's not clear which of these arguments are really essential to TDLoss in a general sense. Some of them seem only applicable to specific scenarios.

Contributor Author

Good point. Created a subclass to do lower bounding. Kept a few general functions like clip and retrace in td_loss.
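A rough sketch of that subclassing split, with illustrative names (the real ALF TDLoss interface has more arguments):

```python
import torch.nn as nn


class TDLoss(nn.Module):
    """Simplified stand-in for ALF's TDLoss; only generic options live here."""

    def __init__(self, gamma: float = 0.99, td_lambda: float = 0.95,
                 use_retrace: bool = False):
        super().__init__()
        self._gamma = gamma
        self._td_lambda = td_lambda
        self._use_retrace = use_retrace


class LowerBoundedTDLoss(TDLoss):
    """Lower-bound-target-Q specific options go into the subclass."""

    def __init__(self, lb_loss_scale: bool = False,
                 reward_multiplier: float = 1.,
                 positive_reward: bool = True, **kwargs):
        super().__init__(**kwargs)
        self._lb_loss_scale = lb_loss_scale
        self._reward_multiplier = reward_multiplier
        self._positive_reward = positive_reward
```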

'TrainerConfig.replay_buffer_length=64',
# If replay_buffer_length is above 6 (3 iters * 2 unroll length), whole replay buffer training
# algorithms (e.g. mbrl_pendulum) will not get enough training data, and will not start training.
# This fails during playing, when the config file isn't generated in root_dir.
Collaborator

Why would a longer replay buffer not give enough training data?

Contributor Author

Because the replay buffer is cleared before it's fully filled, training never starts.

Collaborator

Because the replay buffer is cleared before it's fully filled, training never starts.

Whole replay buffer training doesn't require the buffer to be full; it just gets all of the data in the buffer even when it's not full.

Just curious: this number of 64 had no issue before? Did you run into this problem with new test cases?

Contributor Author

Ah, you are right. It turns out this was due to my earlier change in rl_algorithm, which required the buffer to be full even when clear_replay_buffer is False. That change was already reverted; this change is also reverted.

env,
reward_cap=1.,
positive_reward=True,
append_reward_dim=False):
Collaborator

Docstrings are needed for the new arguments. The name reward_cap doesn't correctly reflect its usage, which is actually a scaling factor. Why do you need append_reward_dim (it should be put in another wrapper if needed)?

Contributor Author

Added docstrings. Renamed reward_cap to reward_weight. Removed append_reward_dim.

assert (
environment_name.startswith("Fetch")
or environment_name.startswith("HandManipulate")
or environment_name.startswith("Ant")
Collaborator

@hnyu hnyu May 5, 2022

Not sure if it's appropriate to include multiworld Ant in this suite? Note that "Fetch" and "HandManipulate" were introduced at the same time by OpenAI with very similar implementations.

Contributor Author

It's probably OK? The code here is quite simple, and the if conditions clearly mark where Ant is the env.

Collaborator

It's probably OK? The code here is quite simple, and the if conditions clearly mark where Ant is the env.

I mean it's hard for someone else to imagine that metaworld Ant is included in this suite_robotics.py file. The reason for having "robotics" in the name is that OpenAI Gym originally had a "robotics" category, which contains Fetch and HandManipulate.

Contributor Author

I see. How about moving this code to a separate suite_mujoco.py, since all these Ant envs are under https://github.com/openai/gym/tree/master/gym/envs/mujoco?

@@ -457,3 +458,36 @@ def transform(self, x):

def inverse_transform(self, y):
return y.sign() * ((y / self._alpha).abs().exp() - 1)


@alf.configurable
Collaborator

These hindsight related functions shouldn't be put in a general file like math_ops.py.

Contributor Author

Both data_transformer and suite_robotics use these functions, so a util module is perhaps a reasonable place? Should I create a utils/reward_functions.py?

Collaborator

Both data_transformer and suite_robotics use these functions, so a util module is perhaps a reasonable place? Should I create a utils/reward_functions.py?

These calculations are very simple (basically one line) and don't have to be factored into functions.

Contributor Author

The hindsight data_transformer requires an input function, and the reward function in these two places needs to be exactly the same; otherwise, bugs can happen. Plus, it is already not that simple due to the multi-dim reward.
That's why I keep them as alf.configurable functions.
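For context, a minimal sketch of the kind of shared reward function under discussion, following the -1/0 convention (the name l2_dist_close_reward_fn appears in this PR's diffs; the exact ALF implementation may differ):

```python
import torch


def l2_dist_close_reward_fn(achieved_goal: torch.Tensor,
                            desired_goal: torch.Tensor,
                            threshold: float = 0.05) -> torch.Tensor:
    """Return 0 when the achieved goal is within ``threshold`` (L2 distance)
    of the desired goal, and -1 otherwise, per the -1/0 convention above."""
    dist = torch.norm(achieved_goal - desired_goal, dim=-1)
    return torch.where(dist < threshold,
                       torch.zeros_like(dist),
                       -torch.ones_like(dist))
```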

Contributor Author

@le-horizon le-horizon left a comment

Thanks for the insightful comments, @hnyu.
Tried to address all of them. Replies or questions are inlined. PTAL.

Thanks,
Le

return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor Author

This is to fail early and clearly: action is a tensor, not a distribution. Putting action directly there could cause confusion when debugging. See the comment 3 lines above.

Contributor

@emailweixu emailweixu left a comment

This PR still contains a lot of separable parts. It's better to separate them and describe the motivation for each of them.

return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor

It is intended to return action_distribution here so that some other algorithm can use it (e.g. TracAlgorithm). Did you find it causing problems?

@@ -152,6 +154,9 @@ def __init__(self,
q_network_cls=QNetwork,
reward_weights=None,
epsilon_greedy=None,
rollout_epsilon_greedy=1.0,
use_epsilon_schedule=0,
Contributor

It's better to change epsilon_greedy to accept both a float and a Scheduler. For a float, you can use schedulers.as_scheduler to convert it to a scheduler, similar to TrainerConfig.priority_replay_alpha.

Contributor Author

Nice!
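A rough sketch of the suggested pattern, with a stand-in as_scheduler helper (ALF's actual schedulers module may differ in details):

```python
from typing import Callable, Union

# A scheduler here is just "call it, get the current value".
Scheduler = Callable[[], float]


def as_scheduler(value: Union[float, Scheduler]) -> Scheduler:
    """Wrap a plain float into a constant scheduler; pass schedulers through."""
    if callable(value):
        return value
    return lambda: float(value)


class DqnLikeAlgorithm:
    def __init__(self, rollout_epsilon_greedy: Union[float, Scheduler] = 0.1):
        # Accept either a constant or a schedule, as suggested in the review.
        self._rollout_eps = as_scheduler(rollout_epsilon_greedy)

    def rollout_step(self):
        eps = self._rollout_eps()  # current epsilon for this rollout step
        ...
```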

@@ -274,7 +290,10 @@ def __init__(self,
)

def _init_log_alpha():
return nn.Parameter(torch.tensor(float(initial_log_alpha)))
alpha = torch.tensor(float(initial_log_alpha))
if alpha_optimizer is None:
Contributor

I think if you want to fix alpha, you can simply set the learning rate for it to zero. ???_optimizer == None usually means using a default optimizer.

Contributor Author

Oh, I misunderstood the param. lr=0 sounds good. Removed.
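For illustration, one generic PyTorch way to keep alpha as a Parameter yet effectively freeze it is a per-parameter optimizer group with lr=0 (not ALF-specific):

```python
import torch
import torch.nn as nn

log_alpha = nn.Parameter(torch.tensor(0.0))
other_params = [nn.Parameter(torch.randn(4))]  # placeholder for the rest of the model

optimizer = torch.optim.Adam([
    {"params": other_params, "lr": 1e-3},
    {"params": [log_alpha], "lr": 0.0},  # lr=0 keeps log_alpha fixed during training
])
```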

inputs.observation,
state=state.action,
epsilon_greedy=self._epsilon_greedy,
eps_greedy_sampling=True)
info = SacInfo(action_distribution=action_dist)
if (alf.summary.render.is_rendering_enabled()
Contributor

Agree with @hnyu. The current situation is that different people want to render different things, so it's hard to have an agreed-upon set of figures for rendering. This is different from TensorBoard summaries, where space is not that precious.

"_improve_w_nstep_bootstrap") and \
self._critic_losses[0]._improve_w_nstep_bootstrap:
# Ignore 2nd - nth step actor losses.
actor_loss.loss[1:] = 0
Contributor

Why do you need to ignore the actor loss for those steps? It seems to me it doesn't hurt to keep them.

Contributor Author

Good point. Since we compare with n-step return methods anyway, it may be fairer to keep the other steps' actor and alpha losses.

td_lambda: Lambda parameter for TD-lambda computation.
normalize_target (bool): whether to normalize target.
Note that the effect of this is to change the loss. The critic
value itself is not normalized.
use_retrace: turn on retrace loss
:math:`\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]`
copied from PR #695.
Contributor

"copied from PR #695" can be removed.

Contributor Author

Done.

@@ -118,6 +118,67 @@ def action_importance_ratio(action_distribution,
return importance_ratio, importance_ratio_clipped


def generalized_advantage_estimation(rewards,
Contributor

This should be put side-by-side with the original generalized_advantage_estimation for comparison.

Contributor Author

Done

importance_ratio,
use_retrace=False,
td_lambda=1.0,
time_major=True):
Contributor

It should also cite the Retrace paper.

Contributor Author

Done
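For readers, a rough time-major sketch of the Retrace(λ) target from the docstring above (Munos et al., 2016), with c_s = λ·min(1, ρ_s); it ignores the discount and episode-boundary handling that a real implementation needs, so treat it as illustration only:

```python
import torch


def retrace_targets(rewards, values, target_values, importance_ratio,
                    gamma: float = 0.99, td_lambda: float = 1.0):
    """Compute Retrace(λ)-corrected targets for Q(x_t, a_t).

    Shapes are time-major [T, B]. ``values`` are Q(x_t, a_t); ``target_values``
    approximate E_π Q(x_t, ·); ``importance_ratio`` is π/μ per step.
    Returns targets of shape [T-1, B].
    """
    T = rewards.shape[0]
    c = td_lambda * torch.clamp(importance_ratio, max=1.0)
    # One-step TD errors: r_{t+1} + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t).
    deltas = rewards[1:] + gamma * target_values[1:] - values[:-1]
    acc = torch.zeros_like(values[-1])
    targets = torch.zeros_like(values[:-1])
    for t in reversed(range(T - 1)):
        acc = deltas[t] + gamma * c[t + 1] * acc
        targets[t] = values[t] + acc
    return targets
```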

def __init__(self, env, reward_weight=1., positive_reward=True):
"""
Args:
reward_weight (float): weight of output reward.
Contributor

Use type hints in the function signature instead:

def __init__(self, env, reward_weight: float = 1., positive_reward: bool = True)

Contributor Author

Done

fields=None):
"""Create a FrameStacker object.

Args:
observation_spec (nested TensorSpec): describing the observation in timestep
stack_size (int): stack so many frames
stack_axis (int): the dimension to stack the observation.
convert_only_minibatch_to_device (bool): whether to convert only the
minibatch or the whole batch of data to the default device.
Collaborator

The variable name and docstring seem inconsistent with the code? FrameStacker will always affect the entire batch?

reward_fn=l2_dist_close_reward_fn):
sparse_reward=False,
add_noise_to_goals=False,
threshold=.05,
Collaborator

You can merge add_noise_to_goals and threshold into just one relabeled_goal_noise, where 0 means no noise is added.

Contributor Author

Good point. Done.
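A small sketch of the merged-argument scheme suggested above (illustrative names only):

```python
import torch


def maybe_add_goal_noise(goals: torch.Tensor,
                         relabeled_goal_noise: float = 0.0) -> torch.Tensor:
    """Add uniform noise of magnitude ``relabeled_goal_noise`` to relabeled
    goals; 0 means no noise, replacing the separate add_noise_to_goals flag."""
    if relabeled_goal_noise == 0.0:
        return goals
    noise = (torch.rand_like(goals) * 2.0 - 1.0) * relabeled_goal_noise
    return goals + noise
```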

dist = buffer.steps_to_episode_end(last_step_pos, last_env_ids)
if alf.summary.should_record_summaries():
alf.summary.scalar(
"replayer/" + buffer._name + ".mean_steps_to_episode_end",
torch.mean(dist.type(torch.float32)))

def _add_noise(t):
Collaborator

This function transform_experience() is already over 200 lines, and it's really difficult to read. I suggest splitting it into several member functions to make it more modular.

Contributor Author

Good point. Cleaned up a bit.

@@ -42,7 +42,7 @@
DdpgInfo = namedtuple(
"DdpgInfo", [
"reward", "step_type", "discount", "action", "action_distribution",
"actor_loss", "critic", "discounted_return"
"actor_loss", "critic", "discounted_return", "future_distance", "her"
Collaborator

Better not to put DDPG-irrelevant concepts into this Info structure. HER and DDPG are two orthogonal concepts and should be disentangled in the code. Otherwise, when we have a third algorithm in the future, more fields will be added to this Info structure again.
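One way to read this suggestion (purely illustrative, not what the PR ended up doing): nest the HER-specific fields in their own structure so each algorithm's Info tuple stays algorithm-specific.

```python
from collections import namedtuple

# HER-specific fields live in their own structure instead of widening DdpgInfo,
# so a third algorithm could reuse HerInfo without growing its own Info tuple.
HerInfo = namedtuple("HerInfo", ["future_distance", "her"], defaults=((), ()))

DdpgInfo = namedtuple(
    "DdpgInfo",
    ["reward", "step_type", "discount", "action", "action_distribution",
     "actor_loss", "critic", "discounted_return", "her_info"],
    defaults=((),) * 9)
```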

return reward


class SparseReward(gym.Wrapper):
Collaborator

You can reuse the one in suite_robotics.py?

Contributor Author

This one also assumes the first reward dimension is the goal reward when the reward is multi-dimensional.
So perhaps a bit of code duplication is actually better, to keep the env-specific reward logic separate.



@alf.configurable
def transform_reward_tensor(reward, reward_cap=1., positive_reward=True):
Collaborator

Where is this function used? If it's only in your conf file, then it should be put there instead.

Contributor Author

Both suite_robotics and suite_socialbot, as well as their conf files, could use this.

@@ -38,11 +40,61 @@ def is_available():
return social_bot is not None


@alf.configurable
def transform_reward(reward, reward_cap=1., positive_reward=True):
Collaborator

Should be a class member of SparseReward below, as it is not used by other files / code.

Contributor Author

Cannot do that: positive_reward needs to be consistent with transform_reward_tensor, which is typically ensured in a config file or on the command line. It needs to be alf.configurable.


Contributor Author

@le-horizon le-horizon left a comment

Addressed all comments. Please hold off from reviewing, and let me further split this PR.

Thanks,
Le


return AlgStep(
output=noisy_action,
state=state,
info=DdpgInfo(action=noisy_action, action_distribution=action))
info=DdpgInfo(action=noisy_action, action_distribution=()))
Contributor Author

Yes, when used with e.g. Retrace, a distribution is needed, but action is not a distribution.

"_improve_w_nstep_bootstrap") and \
self._critic_losses[0]._improve_w_nstep_bootstrap:
# Ignore 2nd - nth step actor losses.
actor_loss.loss[1:] = 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Since we compare with n-step return methods anyways, it may be more fair to keep the other steps' action and alpha losses.

@@ -56,7 +57,8 @@
SacInfo = namedtuple(
"SacInfo", [
"reward", "step_type", "discount", "action", "action_distribution",
"actor", "critic", "alpha", "log_pi", "discounted_return"
"actor", "critic", "alpha", "log_pi", "discounted_return",
"future_distance", "her"
Contributor Author

Good point. For now, I added a comment that these three fields are optional, and a TODO for future work to add an HER algorithm wrapper.

rollout_epsilon_greedy.
max_target_action (bool): whether to use the action with the highest
target value as the target action for computing bootstrapped value.
To mimic the DQN algorithm, set this to True.
Contributor Author

Gave it a try; how about this new dqn_algorithm.py?
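A rough sketch of what the max_target_action option describes (illustrative only; the actual dqn_algorithm.py in the follow-up may differ): bootstrap from the argmax of the target critic's Q values, as in DQN.

```python
import torch


def bootstrap_target_value(target_q_values: torch.Tensor,
                           sampled_next_action: torch.Tensor,
                           max_target_action: bool = True) -> torch.Tensor:
    """target_q_values: [B, num_actions] from the target critic.

    With max_target_action=True, mimic DQN and bootstrap from the best
    target-Q action; otherwise use the action sampled from the policy.
    """
    if max_target_action:
        return target_q_values.max(dim=-1).values
    return target_q_values.gather(
        -1, sampled_next_action.long().unsqueeze(-1)).squeeze(-1)
```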

alpha = torch.tensor(float(initial_log_alpha))
if alpha_optimizer is None:
return alpha
return nn.Parameter(alpha)
Contributor Author

Got it. Removed.


def step(self, action):
obs, reward, done, info = self.env.step(action)
return obs, reward, done, {}
Contributor Author

Done.

@le-horizon
Contributor Author

Split into #1330 and a few others

@le-horizon le-horizon closed this May 17, 2022