-
Hi @famora2

Typically, curriculum learning is accomplished by progressively increasing the difficulty of the task, for example:

```python
# NOTE: using API skrl-v0.8.0
from skrl.trainers.torch import ManualTrainer

env = ...
agents = ...

# create a manual trainer
cfg = {"timesteps": 50000, "headless": False}
trainer = ManualTrainer(env=env, agents=agents, cfg=cfg)

# train the agent(s) one timestep at a time
for timestep in range(cfg["timesteps"]):
    trainer.train(timestep=timestep)

    # adjust environment difficulty
    if SOME_METRIC is REACHED:
        env.increase_difficulty()
```
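To make the `SOME_METRIC is REACHED` placeholder above a bit more concrete, here is a minimal sketch (these helpers are not part of skrl; the reward threshold, the window size and `env.increase_difficulty()` are task-specific placeholders you would implement yourself) that uses the mean of the most recent episode returns as the metric:

```python
# NOTE: hypothetical sketch -- not part of skrl.
# The threshold, window and `env.increase_difficulty()` are task-specific placeholders.
from collections import deque

import numpy as np

class DifficultyScheduler:
    """Raise the task difficulty once the recent mean episode return is high enough."""
    def __init__(self, reward_threshold=200.0, max_level=5, window=100):
        self.reward_threshold = reward_threshold
        self.max_level = max_level
        self.returns = deque(maxlen=window)   # most recent episode returns
        self.level = 0

    def update(self, env, episode_return):
        self.returns.append(episode_return)
        if (len(self.returns) == self.returns.maxlen
                and np.mean(self.returns) >= self.reward_threshold
                and self.level < self.max_level):
            self.level += 1
            self.returns.clear()              # measure the new difficulty level from scratch
            env.increase_difficulty()         # placeholder method from the snippet above
```

Inside the manual-training loop you would call `scheduler.update(env, episode_return)` whenever an episode finishes; this plays the role of the `if SOME_METRIC is REACHED` check.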
In case you want to overwrite the agent's actions, you can override the policy's `.act(...)` method to return whatever actions you want:

```python
# NOTE: using API skrl-v0.8.0
import torch
import torch.nn as nn

from skrl.models.torch import Model, GaussianMixin

# define the model
class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device,
                 clip_actions=False, clip_log_std=True, min_log_std=-20, max_log_std=2, reduction="sum"):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions, clip_log_std, min_log_std, max_log_std, reduction)

        self.net = nn.Sequential(...)
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def act(self, states, taken_actions, role):
        # use custom recorded actions...
        if SOME_METRIC is USED:
            return CUSTOM_action, CUSTOM_log_prob, None
        # use the policy
        else:
            return super().act(states, taken_actions, role)

    def compute(self, states, taken_actions, role):
        return self.net(states), self.log_std_parameter
```

How about these ideas?
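Regarding your concern that simply overwriting the actions messes up the training: one possible direction, sketched here under assumptions that are not part of skrl (a `recorded_actions` NumPy array indexed by timestep, a linear decay of the mixing probability, and the `Policy` class from the snippet above with its `...` filled in), is to return the recorded action together with its log-probability under the current policy distribution, and to anneal the probability of using the recorded data towards zero:

```python
# NOTE: hypothetical sketch (skrl-v0.8.0 style), not an official skrl feature.
# `recorded_actions`, `total_timesteps` and the linear decay are placeholders.
import torch

class GuidedPolicy(Policy):  # the Policy sketched above (with its `...` filled in)
    def __init__(self, *args, recorded_actions, total_timesteps=50000, **kwargs):
        super().__init__(*args, **kwargs)
        # pre-recorded joint trajectory, shape: (timesteps, num_actions)
        self.recorded_actions = torch.as_tensor(recorded_actions, dtype=torch.float32,
                                                device=self.device)
        self.total_timesteps = total_timesteps
        self.current_timestep = 0  # updated externally from the training loop

    def act(self, states, taken_actions, role):
        # probability of using the recorded action, linearly annealed from 1 to 0
        p = max(0.0, 1.0 - self.current_timestep / self.total_timesteps)
        if torch.rand(1).item() < p and self.current_timestep < len(self.recorded_actions):
            # recorded action for this timestep, broadcast to the batch
            recorded = self.recorded_actions[self.current_timestep].expand(states.shape[0], -1)
            # evaluate the recorded action under the CURRENT policy distribution,
            # so the stored log-probability matches the action that is executed
            mean_actions, log_std = self.compute(states, taken_actions, role)
            dist = torch.distributions.Normal(mean_actions, log_std.exp())
            log_prob = dist.log_prob(recorded).sum(dim=-1, keepdim=True)
            return recorded, log_prob, mean_actions
        # otherwise, act with the policy as usual
        return super().act(states, taken_actions, role)
```

In the manual-training loop you would set `policy.current_timestep = timestep` before each `trainer.train(timestep=timestep)` call. Whether this actually stabilizes learning is algorithm-dependent; for on-policy algorithms such as PPO the stored log-probabilities should at least correspond to the actions that are really executed, which is what this sketch tries to ensure.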
-
Hi,
I would like to implement so-called curriculum learning using skrl, where I initialize the training with pre-recorded data and gradually decrease the usage of this data.
The part that I do not understand is the way the code is structured. Taking the "FrankaCabinet" as an example:
The above code is used to initialize the agent and start the training. Assuming I have a pre-recorded joint trajectory of the Franka arm as a NumPy array, I would like to overwrite the action (i.e. the output of the agent) with this array to guide the robot arm towards the desired behavior. However, this way the whole training would be messed up, since the provided actions are effectively useless for learning. So, by simply overwriting the action values, the pre-recorded NumPy array cannot be used appropriately.
Do you have advice/tips for this case?