
draft_retrace #695 (Open)

zhuboli wants to merge 10 commits into base: pytorch

Conversation
@zhuboli zhuboli commented Sep 29, 2020

This change modifies value_ops and td_loss. The default value for train_info is None. If the train_info parameter is given and lambda is not equal to 1 or 0, the retrace method is used. This way, other people who do not want the retrace method do not need to change sac_algorithm or sarsa_algorithm.
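The gating described in the PR description can be sketched as follows. This is a minimal illustration, not the actual ALF code: the class name, the string placeholders, and the exact condition are assumptions based on the description above.

```python
# Hypothetical sketch of the gating described in the PR description:
# retrace is used only when train_info is provided AND lambda is
# strictly between 0 and 1; otherwise the existing loss path runs.


class TDLossSketch:
    def __init__(self, td_lambda=0.95):
        self._lambda = td_lambda

    def forward(self, experience, value, target_value, train_info=None):
        if train_info is not None and 0.0 < self._lambda < 1.0:
            return "retrace"            # placeholder for the retrace target
        return "one_or_multi_step"      # placeholder for the existing path
```

With this gating, callers that never pass train_info keep the old behavior unchanged, which is the compatibility argument made in the description.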

@emailweixu emailweixu (Contributor) left a comment

@@ -99,15 +102,37 @@ def forward(self, experience, value, target_value):
values=target_value,
step_types=experience.step_type,
discounts=experience.discount * self._gamma)
else:
elif train_info == None:

Instead of checking whether train_info is None, you should add an argument in __init__ to indicate whether to use retrace.
You should also change SarsaAlgorithm and SacAlgorithm to pass in train_info.
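The reviewer's suggestion amounts to deciding at construction time rather than inferring intent from train_info being None. A minimal sketch of that constructor flag (the class name, flag name, and return placeholders are illustrative, not from the PR):

```python
# Hypothetical sketch of the suggested design: an explicit use_retrace
# flag set in __init__, instead of branching on train_info is None.


class TDLossRetraceSketch:
    def __init__(self, td_lambda=0.95, use_retrace=False):
        self._lambda = td_lambda
        self._use_retrace = use_retrace

    def forward(self, experience, value, target_value, train_info=None):
        if self._use_retrace:
            # train_info is still needed to compute importance ratios,
            # but it no longer doubles as the on/off switch.
            assert train_info is not None, (
                "train_info is required when use_retrace=True")
            return "retrace"            # placeholder for the retrace target
        return "one_or_multi_step"      # placeholder for the existing path
```

This makes the configuration explicit and lets SarsaAlgorithm and SacAlgorithm always pass train_info without silently changing the loss.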

@@ -255,3 +255,36 @@ def generalized_advantage_estimation(rewards,
advs = advs.transpose(0, 1)

return advs.detach()
####### add for the retrace method
def generalized_advantage_estimation_retrace(importance_ratio, discounts, rewards, td_lambda, time_major, values, target_value,step_types):

Please add comments following the style of the other functions.


Also need a unit test for this function.


  1. line too long
  2. add space after ,
  3. comments for the function need to be added

@Haichao-Zhang Haichao-Zhang mentioned this pull request Oct 19, 2020
@@ -435,7 +435,7 @@ def calc_loss(self, experience, info: SarsaInfo):
target_critic = tensor_utils.tensor_prepend_zero(
info.target_critics)
loss_info = self._critic_losses[i](shifted_experience, critic,
target_critic)
target_critic,info)

add space after ,

@@ -31,6 +31,10 @@ def __init__(self,
td_error_loss_fn=element_wise_squared_loss,
td_lambda=0.95,
normalize_target=False,
some-feature-retrace

need to be removed

some-feature-retrace
use_retrace=0,

pytorch

need to be removed

@@ -76,8 +80,13 @@ def __init__(self,
self._debug_summaries = debug_summaries
self._normalize_target = normalize_target
self._target_normalizer = None
some-feature-retrace

Remove; these seem to be leftover tags from a merge.


def forward(self, experience, value, target_value):
pytorch

remove

else:
scope = alf.summary.scope(self.__class__.__name__)
importance_ratio,importance_ratio_clipped = value_ops.action_importance_ratio(
action_distribution=train_info.action_distribution,

format, line is too long


Not fixed?

@Haichao-Zhang Haichao-Zhang (Contributor) left a comment

There seem to be many format issues. You may need to follow the workflow here to set up the formatting tools and also get a reference for the coding standard:
https://alf.readthedocs.io/en/latest/contributing.html#workflow

@@ -46,7 +50,7 @@ def __init__(self,
:math:`G_t^\lambda = \hat{A}^{GAE}_t + V(s_t)`
where the generalized advantage estimation is defined as:
:math:`\hat{A}^{GAE}_t = \sum_{i=t}^{T-1}(\gamma\lambda)^{i-t}(R_{i+1} + \gamma V(s_{i+1}) - V(s_i))`

use_retrace = 0 means one step or multi_step loss, use_retrace = 1 means retrace loss

Can change use_retrace to use a bool value.


Need to update comment


else:
scope = alf.summary.scope(self.__class__.__name__)
importance_ratio,importance_ratio_clipped = value_ops.action_importance_ratio(

add space after ,

@@ -255,3 +255,36 @@ def generalized_advantage_estimation(rewards,
advs = advs.transpose(0, 1)

return advs.detach()
####### add for the retrace method
def generalized_advantage_estimation_retrace(importance_ratio, discounts, rewards, td_lambda, time_major, values, target_value,step_types):

  1. line too long
  2. add space after ,
  3. comments for the function need to be added

@@ -170,7 +170,32 @@ def test_generalized_advantage_estimation(self):
discounts=discounts,
td_lambda=td_lambda,
expected=expected)

class GeneralizedAdvantage_retrace_Test(unittest.TestCase):
"""Tests for alf.utils.value_ops

The comments are not correct.

@@ -46,7 +50,7 @@ def __init__(self,
:math:`G_t^\lambda = \hat{A}^{GAE}_t + V(s_t)`
where the generalized advantage estimation is defined as:
:math:`\hat{A}^{GAE}_t = \sum_{i=t}^{T-1}(\gamma\lambda)^{i-t}(R_{i+1} + \gamma V(s_{i+1}) - V(s_i))`

use_retrace = 0 means one step or multi_step loss, use_retrace = 1 means retrace loss

Need to update comment

log_prob_clipping=0.0,
scope=scope,
check_numerics=False,
debug_summaries=True)

debug_summaries= debug_summaries



####### add for the retrace method
def generalized_advantage_estimation_retrace(importance_ratio, discounts,

This function can be merged with generalized_advantage_estimation function
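One way such a merge could look: add an optional importance_ratio argument so that plain GAE is the special case where every ratio is 1. This is only a sketch on plain Python lists; the real ALF function operates on time-major tensors and has more arguments, and the exact recursion here is an assumption based on the standard Retrace(lambda)-style correction.

```python
def generalized_advantage_estimation(rewards, values, discounts,
                                     td_lambda, importance_ratio=None):
    """Backward-recursive GAE; with importance_ratio it becomes a
    retrace-style estimate.

    advs[t] = delta[t] + c[t+1] * discounts[t+1] * td_lambda * advs[t+1]
    where delta[t] = rewards[t+1] + discounts[t+1] * values[t+1] - values[t]
    and c is the (clipped) importance ratio, 1 everywhere for plain GAE.
    """
    T = len(values)
    if importance_ratio is None:
        importance_ratio = [1.0] * T  # plain GAE as the special case
    advs = [0.0] * T
    for t in reversed(range(T - 1)):
        delta = rewards[t + 1] + discounts[t + 1] * values[t + 1] - values[t]
        advs[t] = delta + (importance_ratio[t + 1] * discounts[t + 1]
                           * td_lambda * advs[t + 1])
    return advs[:-1]
```

Folding the two functions together this way keeps one code path to test and makes the retrace correction a single extra multiplicative factor in the recursion.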

else:
scope = alf.summary.scope(self.__class__.__name__)
importance_ratio,importance_ratio_clipped = value_ops.action_importance_ratio(
action_distribution=train_info.action_distribution,

Not fixed?

@@ -91,6 +97,8 @@ def forward(self, experience, value, target_value):
target_value (torch.Tensor): the time-major tensor for the value at
each time step. This is used to calculate return. ``target_value``
can be same as ``value``.
train_info (sarsa info, sac info): information used to calcuate importance_ratio

What is sarsa info, sac info here? Can this function be used with other algorithms beyond sac and sarsa?

@emailweixu emailweixu mentioned this pull request May 11, 2022