Lbtq main #1311
Conversation
I only roughly skimmed over this PR. In general, it's not clear to me that most of these changes need to be included in ALF. From my understanding, the best scenario would be to cleanly separate the LBTQ algorithm from the other miscellaneous stuff behind a general API, and only integrate that into ALF. Everything else can go in your personal repo, which would contain all the experiment details, helper functions, etc. As it stands, I think the generality of some ALF APIs has been compromised by this PR.
alf/algorithms/data_transformer.py
Outdated
result = result._replace(discount=discount)
result = result._replace(step_type=step_type)
relabeled_rewards = suite_socialbot.transform_reward_tensor(
My understanding is that HindsightExperienceTransformer is a general data transformer independent of the environment. So it feels strange to include suite_socialbot here.
This transform_reward_tensor function is used to translate the -1/0 reward into a 0/1 reward. We also depend on the -1/0 reward assumption to determine the relabeled LAST (episode-end) steps (and to relabel discount to 0).
Removed the dependency here and added it in the task gin or alf configs instead. This runs the risk of people forgetting to set it in a new task config.
For episodic tasks (sparse_reward==True), the Hindsight transformer does depend on the original reward being -1/0. In that case, maybe I should move transform_reward_tensor out of suite_socialbot into a more general place like reward_utils.py? Would you prefer that over the current solution of relying on the task's alf config to set up the transform_reward_tensor function?
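For reference, a minimal sketch of what such a tensor-level transform could look like under the -1/0 assumption (illustrative only; the signature mirrors the snippet above but is not the actual suite_socialbot code):

import torch


def transform_reward_tensor(reward: torch.Tensor,
                            reward_cap: float = 1.,
                            positive_reward: bool = True) -> torch.Tensor:
    # Sketch: assuming the env emits -1 until the goal is achieved and 0
    # afterwards, optionally shift the reward to 0/1 and then scale it.
    if positive_reward:
        reward = reward + 1.  # -1/0 -> 0/1
    return reward * reward_cap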
alf/algorithms/data_transformer.py
Outdated
@@ -922,4 +1022,7 @@ def create_data_transformer(data_transformer_ctor, observation_spec):
    if len(data_transformer_ctor) == 1:
        return data_transformer_ctor[0](observation_spec)

    if HindsightExperienceTransformer in data_transformer_ctor:
        assert HindsightExperienceTransformer == data_transformer_ctor[0], \
            "Hindsight relabeling should happen before all other transforms."
This is no longer true after Break's recent PR on SequentialDataTransformer. Basically, UntransformedTimeStep could appear before any other data transformer. I suggest either justifying why HindsightExperienceTransformer has to appear before UntransformedTimeStep or removing this assertion here.
Good point. Removed.
alf/algorithms/data_transformer.py
Outdated
final_relabeled_rewards = relabeled_rewards
if result.reward.ndim > 2:
    # multi dimensional env reward, first dim is goal related reward.
Is this reward assumption too strict and easily violated by a user?
True. Made a note about the assumption upfront in the class description.
The proper way might be to split the current multi-dim reward into a nested reward field. Added a TODO.
alf/algorithms/data_transformer.py
Outdated
@@ -759,6 +750,9 @@ def __init__(self,
                exp nest.
            desired_goal_field (str): path to the desired_goal field in the
                exp nest.
            sparse_reward (bool): Whether to transform reward from -1/0 to 0/1.
This transform from -1/0 to 0/1 seems specific to certain environments and is not a general thing.
Environment rewards can be 0/1, -1/0, or any form. But because the Hindsight relabeler needs to know whether the goal is achieved in order to relabel discount and step_type correctly, we assume here that reward_fn always returns a -1/0 reward, which is then transformed into the 0/1 form when needed.
Since it relabels discount to 0, this also relabels infinite-horizon tasks as episodic. Updated the comment.
From the docstring, the observation is assumed to have a field "achieved_goal" indicating what goal has been achieved. If that is true, then there is no need to infer goal achievement from the reward, which is not a general setup. (On the other hand, -1/0 is as sparse as 0/1, so the name 'sparse_reward' could be improved.)
Good point about goal achievement. It's still hard to find a general way, though. The reward function is general and may not return anything about goal achievement, e.g. when it is a dense reward based on distance to the goal.
Calling the variable relabel_with_episodic_rewards, since -1/0 is for continuing tasks and 0/1 is for episodic ones.
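To make the convention concrete, here is a rough sketch of the relabeling being described (the helper name and the hard-coded step-type constant are illustrative, not the transformer's actual code):

import torch

LAST = 2  # assumed integer value of StepType.LAST


def relabel_episode_end(relabeled_rewards, step_type, discount,
                        relabel_with_episodic_rewards=True):
    # Under the -1/0 convention, reward == 0 means the relabeled goal was
    # achieved at that step, so the step becomes LAST and its discount
    # becomes 0 (turning an infinite-horizon task episodic).
    goal_achieved = relabeled_rewards == 0
    step_type = torch.where(goal_achieved,
                            torch.full_like(step_type, LAST), step_type)
    discount = torch.where(goal_achieved, torch.zeros_like(discount),
                           discount)
    if relabel_with_episodic_rewards:
        relabeled_rewards = relabeled_rewards + 1.  # -1/0 -> 0/1
    return relabeled_rewards, step_type, discount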
@@ -180,6 +182,7 @@ def __init__(self,
                 observation_spec,
                 stack_size=4,
                 stack_axis=0,
                 convert_only_minibatch_to_device=False,
What does this argument mean? The docstring should explain its actual meaning.
Added comment
            lb_loss_scale: bool = False,
            reward_multiplier: float = 1.,
            positive_reward: bool = True,
            use_retrace: bool = False,
To me, it's not clear which of these arguments are really essential to TDLoss in a general sense. Some of them seem only applicable to specific scenarios.
Good point. Created a subclass to do lower bounding. Kept a few general functions like clip and retrace in td_loss.
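The exact lower-bounding rule isn't spelled out in this thread; one illustrative possibility (a hypothetical helper, not the PR's TDLoss subclass) is to clamp the bootstrapped target from below by the empirical discounted return carried in the experience:

import torch


def lower_bound_td_target(td_target: torch.Tensor,
                          discounted_return: torch.Tensor) -> torch.Tensor:
    # Sketch: never let the bootstrapped TD target fall below the observed
    # discounted return of the sampled trajectory.
    return torch.max(td_target, discounted_return)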
alf/bin/train_play_test.py
Outdated
'TrainerConfig.replay_buffer_length=64',
# If replay_buffer_length is above 6 (3 iters * 2 unroll length), whole replay buffer training
# algorithms (e.g. mbrl_pendulum) will not get enough training data, and will not start training.
# This fails during playing, when the config file isn't generated in root_dir.
Why would a longer replay buffer not give enough training data?
Because the replay buffer is cleared before it's fully filled, so training never starts.
Whole-replay-buffer training doesn't require the buffer to be full? It just gets all the data in the buffer even when it's not full.
Just curious: did this value of 64 cause no issue before? Did you hit this problem with the new test cases?
Ah, you are right. It turns out it was due to my earlier change in rl_algorithm, which required the buffer to be full even when clear_replay_buffer is False. That change was already reverted. This change is also reverted.
alf/environments/suite_robotics.py
Outdated
                 env,
                 reward_cap=1.,
                 positive_reward=True,
                 append_reward_dim=False):
Docstrings are needed for the new arguments. The name reward_cap doesn't correctly reflect its usage, which is actually a scaling factor. Why do you need append_reward_dim (it should be put in another wrapper if needed)?
Added docstrings. Renamed reward_cap to reward_weight. Removed append_reward_dim.
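For illustration, a wrapper along the renamed lines might look roughly like this (a sketch using the names from the thread, not the actual suite_robotics code):

import gym


class SparseRewardWrapper(gym.Wrapper):
    # Sketch: optionally shift a -1/0 sparse reward to 0/1, then scale it
    # by reward_weight.

    def __init__(self, env, reward_weight: float = 1.,
                 positive_reward: bool = True):
        super().__init__(env)
        self._reward_weight = reward_weight
        self._positive_reward = positive_reward

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self._positive_reward:
            reward = reward + 1.  # -1/0 -> 0/1
        return obs, reward * self._reward_weight, done, info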
assert (
    environment_name.startswith("Fetch")
    or environment_name.startswith("HandManipulate")
    or environment_name.startswith("Ant")
Not sure if it's appropriate to include multiworld Ant in this suite? Note that "Fetch" and "HandManipulate" were introduced at the same time by OpenAI with very similar implementations.
It's probably Ok? The code here is quite simple. The if conditions mark clearly where Ant is the env.
I mean it's hard for someone else to imagine that metaworld Ant is included in this suite_robotics.py file. The reason for having "robotics" in the name is that OpenAI Gym originally had a "robotics" category containing Fetch and HandManipulate.
I see. How about moving this code to a separate suite_mujoco.py, since all these Ant envs are under: https://github.com/openai/gym/tree/master/gym/envs/mujoco
@@ -457,3 +458,36 @@ def transform(self, x):

    def inverse_transform(self, y):
        return y.sign() * ((y / self._alpha).abs().exp() - 1)


@alf.configurable
These hindsight related functions shouldn't be put in a general file like math_ops.py.
Both data_transformer and suite_robotics use these functions, so util is perhaps a reasonable place? Should I create a utils/reward_functions.py?
These calculations are very simple (basically one line) and don't have to be factored into functions.
The hindsight data_transformer requires an input function, and the reward function in these two places needs to be exactly the same; otherwise, bugs can happen. Plus, it is already not that simple due to the multi-dim reward.
That's why I keep them as alf configurable functions.
Thanks for the insightful comments @hnyu.
Tried to address all. Replies or questions inlined. PTAL.
Thanks,
Le
return AlgStep(
    output=noisy_action,
    state=state,
    info=DdpgInfo(action=noisy_action, action_distribution=action))
    info=DdpgInfo(action=noisy_action, action_distribution=()))
This is to fail early and clearly. action is a tensor, not a distribution. Putting action directly there could cause confusion when debugging. See the comment three lines above.
…ield "improve_w_nstep_bootstrap".
This PR still contains a lot of separable parts. It's better to separate them and describe the motivation for each of them.
It is intended to return action_distribution here so that some other algorithm can use it (e.g. TracAlgorithm). Do you find it causing problems?
alf/algorithms/sac_algorithm.py
Outdated
@@ -152,6 +154,9 @@ def __init__(self,
                 q_network_cls=QNetwork,
                 reward_weights=None,
                 epsilon_greedy=None,
                 rollout_epsilon_greedy=1.0,
                 use_epsilon_schedule=0,
It's better to change epsilon_greedy to accept both a float and a Scheduler. For a float, you can use schedulers.as_scheduler to convert it to a scheduler, similar to TrainerConfig.priority_replay_alpha.
Nice!
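A sketch of the suggested pattern (the import path and the behavior of as_scheduler wrapping a plain float into a constant scheduler are assumed from the comment above):

from alf.utils import schedulers


def resolve_epsilon_greedy(epsilon_greedy):
    # Sketch: accept either a float or a Scheduler; a float is wrapped into
    # a constant scheduler, so callers can always query the current value by
    # calling the returned object.
    return schedulers.as_scheduler(epsilon_greedy)


# Hypothetical usage inside the algorithm's __init__:
#   self._epsilon_greedy = resolve_epsilon_greedy(epsilon_greedy)
# and at rollout time: eps = self._epsilon_greedy()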
alf/algorithms/sac_algorithm.py
Outdated
@@ -274,7 +290,10 @@ def __init__(self,
        )

    def _init_log_alpha():
        return nn.Parameter(torch.tensor(float(initial_log_alpha)))
        alpha = torch.tensor(float(initial_log_alpha))
        if alpha_optimizer is None:
I think if you want to fix alpha, you can simply set the learning rate for it to be zero. ???_optimizer==None usually means using a default optimizer.
Oh, I misunderstood the param. lr=0 sounds good. Removed.
alf/algorithms/sac_algorithm.py
Outdated
        inputs.observation,
        state=state.action,
        epsilon_greedy=self._epsilon_greedy,
        eps_greedy_sampling=True)
    info = SacInfo(action_distribution=action_dist)
    if (alf.summary.render.is_rendering_enabled()
Agree with @hnyu. The current situation is that different people want to render different things, so it's hard to have an agreed-upon set of figures for rendering. This is different from the tensorboard summary, where space is not that precious.
alf/algorithms/ddpg_algorithm.py
Outdated
                "_improve_w_nstep_bootstrap") and \
                self._critic_losses[0]._improve_w_nstep_bootstrap:
            # Ignore 2nd - nth step actor losses.
            actor_loss.loss[1:] = 0
Why do you need to ignore the actor loss for those steps? It seems to me it doesn't hurt to keep them.
Good point. Since we compare with n-step return methods anyway, it may be fairer to keep the other steps' action and alpha losses.
alf/algorithms/td_loss.py
Outdated
            td_lambda: Lambda parameter for TD-lambda computation.
            normalize_target (bool): whether to normalize target.
                Note that the effect of this is to change the loss. The critic
                value itself is not normalized.
            use_retrace: turn on retrace loss
                :math:`\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]`
                copied from PR #695.
"copied from PR #695" can be removed.
Done.
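For reference, the Retrace correction in the docstring above can be computed with a backward recursion over TD errors, using c_s = lambda * min(1, importance ratio). The sketch below (time-major tensors assumed) is only an illustration, not the PR's value_ops implementation:

import torch


def retrace_targets(rewards, values, next_values, discounts,
                    importance_ratio, td_lambda=1.0):
    # All tensors are time-major with shape [T, B]:
    #   rewards: r_t
    #   values: Q(x_t, a_t)
    #   next_values: E_pi Q(x_{t+1}, .)
    #   discounts: gamma (0 at episode boundaries)
    #   importance_ratio: pi(a_t|x_t) / mu(a_t|x_t)
    c = td_lambda * torch.clamp(importance_ratio, max=1.0)
    deltas = rewards + discounts * next_values - values  # TD errors
    corrections = torch.zeros_like(deltas)
    acc = torch.zeros_like(deltas[0])
    # Backward recursion: G_t = delta_t + gamma_t * c_{t+1} * G_{t+1}
    for t in reversed(range(deltas.shape[0])):
        c_next = c[t + 1] if t + 1 < deltas.shape[0] else torch.zeros_like(acc)
        acc = deltas[t] + discounts[t] * c_next * acc
        corrections[t] = acc
    return values + corrections  # the Retrace target R Q(x_t, a_t)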
@@ -118,6 +118,67 @@ def action_importance_ratio(action_distribution,
    return importance_ratio, importance_ratio_clipped


def generalized_advantage_estimation(rewards,
This should be put side-by-side with the original generalized_advantage_estimation for comparison.
Done
alf/utils/value_ops.py
Outdated
                                     importance_ratio,
                                     use_retrace=False,
                                     td_lambda=1.0,
                                     time_major=True):
also should cite retrace paper.
Done
alf/environments/suite_robotics.py
Outdated
def __init__(self, env, reward_weight=1., positive_reward=True):
    """
    Args:
        reward_weight (float): weight of output reward.
Use type hints at the function signature instead:
def __init__(self, env, reward_weight: float = 1., positive_reward: bool = True)
Done
                 fields=None):
        """Create a FrameStacker object.

        Args:
            observation_spec (nested TensorSpec): describing the observation in timestep
            stack_size (int): stack so many frames
            stack_axis (int): the dimension to stack the observation.
            convert_only_minibatch_to_device (bool): whether to convert only the
                minibatch or the whole batch of data to the default device.
Variable name and docstring inconsistent with the code? FrameStacker will always affect the entire batch?
alf/algorithms/data_transformer.py
Outdated
reward_fn=l2_dist_close_reward_fn):
sparse_reward=False,
add_noise_to_goals=False,
threshold=.05,
Can merge add_noise_to_goals and threshold into just one relabeled_goal_noise, and when it's 0 there is no noise added.
Good point. Done.
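A possible shape for the merged argument (only the zero-means-no-noise behavior comes from the thread; the Gaussian form of the noise below is an assumption):

import torch


def maybe_add_goal_noise(relabeled_goals: torch.Tensor,
                         relabeled_goal_noise: float = 0.) -> torch.Tensor:
    # Sketch: add noise to relabeled goals only when the scale is positive,
    # so relabeled_goal_noise == 0 replaces add_noise_to_goals=False.
    if relabeled_goal_noise > 0:
        relabeled_goals = relabeled_goals + relabeled_goal_noise * \
            torch.randn_like(relabeled_goals)
    return relabeled_goals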
alf/algorithms/data_transformer.py
Outdated
dist = buffer.steps_to_episode_end(last_step_pos, last_env_ids)
if alf.summary.should_record_summaries():
    alf.summary.scalar(
        "replayer/" + buffer._name + ".mean_steps_to_episode_end",
        torch.mean(dist.type(torch.float32)))

def _add_noise(t):
This function transform_experience() is already over 200 lines, and it's really difficult to read. I suggest splitting it into several member functions to make it more modular.
Good point. Cleaned up a bit.
alf/algorithms/ddpg_algorithm.py
Outdated
@@ -42,7 +42,7 @@
DdpgInfo = namedtuple(
    "DdpgInfo", [
        "reward", "step_type", "discount", "action", "action_distribution",
        "actor_loss", "critic", "discounted_return"
        "actor_loss", "critic", "discounted_return", "future_distance", "her"
Better not to put DDPG-irrelevant concepts into this Info structure. HER and DDPG are two orthogonal concepts and should be disentangled in the code. Otherwise, when we have a third algorithm in the future, this Info structure will have more fields added again.
return reward


class SparseReward(gym.Wrapper):
You can reuse the one in suite_robotics.py?
This one also assumes the first reward dimension is the goal reward when the reward is multi-dimensional.
So perhaps a bit of code duplication is actually better, to keep the env-specific reward logic separate.
@alf.configurable
def transform_reward_tensor(reward, reward_cap=1., positive_reward=True):
Where is this function used? If it's only in your conf file, then it should be put there instead.
Both suite_robotics and suite_socialbot, and their conf files, could use this.
@@ -38,11 +40,61 @@ def is_available():
    return social_bot is not None


@alf.configurable
def transform_reward(reward, reward_cap=1., positive_reward=True):
Should be a class member of SparseReward below, as it is not used by other files / code.
Cannot do that. positive_reward needs to be consistent with transform_reward_tensor, which is typically done in the config file or via command-line config. It needs to be alf.configurable.
Addressed all comments. Please hold off from reviewing, and let me further split this PR.
Thanks,
Le
return AlgStep(
    output=noisy_action,
    state=state,
    info=DdpgInfo(action=noisy_action, action_distribution=action))
    info=DdpgInfo(action=noisy_action, action_distribution=()))
Yes, when used with e.g. Retrace, a distribution is needed, but action is not a distribution.
alf/algorithms/sac_algorithm.py
Outdated
@@ -56,7 +57,8 @@
SacInfo = namedtuple(
    "SacInfo", [
        "reward", "step_type", "discount", "action", "action_distribution",
        "actor", "critic", "alpha", "log_pi", "discounted_return"
        "actor", "critic", "alpha", "log_pi", "discounted_return",
        "future_distance", "her"
Good point. For now, added a comment that these three fields are optional, and added a TODO for future work to add the HER algorithm wrapper.
alf/algorithms/sac_algorithm.py
Outdated
                rollout_epsilon_greedy.
            max_target_action (bool): whether to use the action with the highest
                target value as the target action for computing bootstrapped value.
                To mimic the DQN algorithm, set this to True.
Gave it a try; how about this new dqn_algorithm.py?
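To illustrate the docstring above, the DQN-like max over the target critic might be sketched as follows (the argument names and the sampling scheme are hypothetical, not the actual algorithm's API):

import torch


def max_target_action_value(target_critic, target_action_dist, observation,
                            num_candidates: int = 10):
    # Sketch: sample several candidate actions from the target policy,
    # score them with the target critic, and bootstrap from the best one.
    actions = target_action_dist.sample((num_candidates, ))        # [N, B, A]
    obs = observation.unsqueeze(0).expand(num_candidates,
                                          *observation.shape)      # [N, B, O]
    values = target_critic(obs, actions)                           # [N, B]
    return values.max(dim=0).values  # [B], used as the bootstrap value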
alf/algorithms/sac_algorithm.py
Outdated
alpha = torch.tensor(float(initial_log_alpha))
if alpha_optimizer is None:
    return alpha
return nn.Parameter(alpha)
Got it. Removed.
alf/environments/suite_robotics.py
Outdated

def step(self, action):
    obs, reward, done, info = self.env.step(action)
    return obs, reward, done, {}
Done.
Split into #1330 and a few others.
Main change is for the lower-bound target Q experiments.
Still quite large... Let me know if you'd prefer even smaller pieces.