Hi @Toni-SM, I am working on a project that aims to leverage skrl for robotics manipulation (Franka) in Isaac Gym environments. However, when I moved on to actor-critic agents (DDPG, TD3, SAC), these agents did not seem to learn anything from the environments with the same reward configuration. In TensorBoard, the reward and episode curves behaved like flat lines with small oscillations. I have tried the ideas from other posts and tuned the structure of the NN and the hyperparameters, without success. Are there any additional adjustments I need to make for these agents (DDPG, TD3, SAC) on robotics tasks? Thanks for your time.

import os

import torch
import torch.nn as nn

from skrl.models.torch import Model, DeterministicMixin
from skrl.memories.torch import RandomMemory
from skrl.agents.torch.ddpg import DDPG, DDPG_DEFAULT_CONFIG
from skrl.resources.noises.torch import OrnsteinUhlenbeckNoise
from skrl.trainers.torch import SequentialTrainer
from skrl.envs.torch import wrap_env, load_omniverse_isaacgym_env

# Define the models (deterministic models) for the DDPG agent using mixins
# and the torch.nn.Sequential class.
# - Actor (policy): takes as input the environment's observation/state and returns an action
# - Critic: takes the state and action as input and provides a value to guide the policy
class DeterministicActor(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=True):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, self.num_actions),
                                 nn.Tanh())

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}
class DeterministicCritic(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations + self.num_actions, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, 1))

    def compute(self, inputs, role):
        return self.net(torch.cat([inputs["states"], inputs["taken_actions"]], dim=1)), {}
# Load and wrap the Omniverse Isaac Gym environment
omniisaacgymenvs_path = os.path.realpath(os.path.join(os.path.realpath(__file__), ".."))
env = load_omniverse_isaacgym_env(task_name="FrankaCatching", omniisaacgymenvs_path=omniisaacgymenvs_path)
env = wrap_env(env)
device = env.device
# Instantiate a RandomMemory as rollout buffer (any memory can be used for this)
memory = RandomMemory(memory_size=8000, num_envs=env.num_envs, device=device, replacement=False)
# Instantiate the agent's models (function approximators).
# DDPG requires 4 models, visit its documentation for more details
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.ddpg.html#spaces-and-models
models_ddpg = {}
models_ddpg["policy"] = DeterministicActor(env.observation_space, env.action_space, device)
models_ddpg["target_policy"] = DeterministicActor(env.observation_space, env.action_space, device)
models_ddpg["critic"] = DeterministicCritic(env.observation_space, env.action_space, device)
models_ddpg["target_critic"] = DeterministicCritic(env.observation_space, env.action_space, device)
# Initialize the models' parameters (weights and biases) using a Gaussian distribution
for model in models_ddpg.values():
    model.init_parameters(method_name="normal_", mean=0.0, std=0.5)
# Configure and instantiate the agent.
# Only modify some of the default configuration, visit its documentation to see all the options
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.ddpg.html#configuration-and-hyperparameters
cfg_ddpg = DDPG_DEFAULT_CONFIG.copy()
cfg_ddpg["exploration"]["noise"] = OrnsteinUhlenbeckNoise(theta=0.15, sigma=0.1, base_scale=1.0, device=device)
cfg_ddpg["gradient_steps"] = 1 # gradient steps
cfg_ddpg["batch_size"] = 128 # training batch size
cfg_ddpg["polyak"] = 0.005 # soft update hyperparameter (tau)
cfg_ddpg["discount_factor"] = 0.99 # discount factor (gamma)
cfg_ddpg["random_timesteps"] = 0 # random exploration steps
cfg_ddpg["learning_starts"] = 0 # learning starts after this many steps
cfg_ddpg["actor_learning_rate"] = 1e-4
cfg_ddpg["critic_learning_rate"] = 5e-4
# cfg_ddpg["rewards_shaper"] = lambda rewards, timestep, timesteps: rewards * 0.01 # rewards shaping function: Callable(reward, timestep, timesteps) -> reward
# log to TensorBoard every 100 timesteps and write checkpoints every 1000 timesteps
cfg_ddpg["experiment"]["write_interval"] = 100
cfg_ddpg["experiment"]["checkpoint_interval"] = 1000
# cfg_ddpg["experiment"]["experiment_name"] = ""
agent = DDPG(models=models_ddpg,
             memory=memory,
             cfg=cfg_ddpg,
             observation_space=env.observation_space,
             action_space=env.action_space,
             device=device)
# Configure and instantiate the RL trainer
cfg_trainer = {"timesteps": 320000, "headless": True}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent)
# start training
trainer.train()
Hi @sjywdxs

This is an interesting topic for discussion and research!

Off-policy algorithms are generally more sample-efficient than on-policy algorithms because they can learn from a broader set of experiences (off-policy algorithms can learn from experiences generated by any policy, not just the policy currently being executed). The problem with running multiple environments in parallel is that the samples collected are strongly correlated in time, and for each time step a lot of information (equal to the number of environments) accumulates. Off-policy algorithms are more suitable for problems with a small number of environments, at least with standard sampling memories.

However, for RL problems with a large number of parallel environments, on-policy algorithms outperform off-policy algorithms. On-policy algorithms learn from the experience generated by the current policy, which is specifically tailored to the current environment (so they can adapt more quickly and efficiently to the unique characteristics of each environment).
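To put rough numbers on "a lot of information accumulates", here is a back-of-the-envelope sketch (the 512 parallel environments figure is an assumption taken from another reply in this thread; the 320000 timesteps value comes from the script above):

num_envs = 512        # assumed number of parallel environments
timesteps = 320000    # cfg_trainer["timesteps"] from the script above

# each trainer timestep stores one transition per environment, all taken at the
# same moment in time, so they are strongly correlated with each other
transitions_per_timestep = num_envs
total_transitions = num_envs * timesteps

print(f"{transitions_per_timestep} correlated transitions stored per timestep")
print(f"{total_transitions:,} transitions collected over the whole run")  # 163,840,000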
I think another problem with this script is that you are using parallel environments (I was using Isaac Gym, which initializes 512 environments in parallel), but your gradient step is 1. If I understand the code correctly, that means you are updating on one batch for every 512 collected transitions. You can try to modify the gradient steps so that your update frequency is once per env step (see the sketch below). For comparison, I think PPO updates are counted in epochs, so every sample collected in the rollout is used #epochs times.
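A minimal sketch of that suggestion, assuming skrl's "gradient_steps" key controls the number of gradient updates performed per trainer timestep and that env.num_envs is 512 here (the exact value that works best for this task is untested):

# perform env.num_envs gradient updates per trainer timestep,
# i.e. one update per collected transition instead of one per timestep
cfg_ddpg["gradient_steps"] = env.num_envs  # e.g. 512
# if that is too slow or unstable, a smaller ratio is a reasonable middle ground:
# cfg_ddpg["gradient_steps"] = 16

Raising the gradient steps brings the update-to-data ratio closer to the single-environment setting these off-policy algorithms were originally tuned for.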
Did you solve this problem? If so, could you let me know the parameters?