Hi @Toni-SM, I am working on a project that aims to leverage skrl for robotics manipulation (Franka) in Isaac Gym environments. However, when I moved on to actor-critic agents (DDPG, TD3, SAC), these agents did not seem to learn anything from the environments with the same reward configuration. In TensorBoard, the reward and episode curves behaved like flat lines with small oscillations. I have tried the ideas from other posts and tuned the structure of the NN and the hyperparameters, without success. Are there any additional adjustments I need to make for these agents (DDPG, TD3, SAC) on robotics tasks? Thanks for your time.

import os

import torch
import torch.nn as nn

from skrl.models.torch import Model, DeterministicMixin
from skrl.memories.torch import RandomMemory
from skrl.agents.torch.ddpg import DDPG, DDPG_DEFAULT_CONFIG
from skrl.resources.noises.torch import OrnsteinUhlenbeckNoise
from skrl.trainers.torch import SequentialTrainer
from skrl.envs.torch import wrap_env, load_omniverse_isaacgym_env

# Define the models (deterministic models) for the DDPG agent using mixins
# and the torch.nn.Sequential class.
# - Actor (policy): takes as input the environment's observation/state and returns an action
# - Critic: takes the state and action as input and provides a value to guide the policy
class DeterministicActor(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=True):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, self.num_actions),
                                 nn.Tanh())

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}
class DeterministicCritic(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device, clip_actions=False):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self, clip_actions)

        self.net = nn.Sequential(nn.Linear(self.num_observations + self.num_actions, 512),
                                 nn.ELU(),
                                 nn.Linear(512, 256),
                                 nn.ELU(),
                                 nn.Linear(256, 1))

    def compute(self, inputs, role):
        return self.net(torch.cat([inputs["states"], inputs["taken_actions"]], dim=1)), {}
# Load and wrap the Omniverse Isaac Gym environment
omniisaacgymenvs_path = os.path.realpath(os.path.join(os.path.realpath(__file__), ".."))
env = load_omniverse_isaacgym_env(task_name="FrankaCatching", omniisaacgymenvs_path=omniisaacgymenvs_path)
env = wrap_env(env)
device = env.device
# Instantiate a RandomMemory as rollout buffer (any memory can be used for this)
memory = RandomMemory(memory_size=8000, num_envs=env.num_envs, device=device, replacement=False)
# Instantiate the agent's models (function approximators).
# DDPG requires 4 models, visit its documentation for more details
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.ddpg.html#spaces-and-models
models_ddpg = {}
models_ddpg["policy"] = DeterministicActor(env.observation_space, env.action_space, device)
models_ddpg["target_policy"] = DeterministicActor(env.observation_space, env.action_space, device)
models_ddpg["critic"] = DeterministicCritic(env.observation_space, env.action_space, device)
models_ddpg["target_critic"] = DeterministicCritic(env.observation_space, env.action_space, device)
# Initialize the models' parameters (weights and biases) using a Gaussian distribution
for model in models_ddpg.values():
    model.init_parameters(method_name="normal_", mean=0.0, std=0.5)
# Configure and instantiate the agent.
# Only modify some of the default configuration, visit its documentation to see all the options
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.ddpg.html#configuration-and-hyperparameters
cfg_ddpg = DDPG_DEFAULT_CONFIG.copy()
cfg_ddpg["exploration"]["noise"] = OrnsteinUhlenbeckNoise(theta=0.15, sigma=0.1, base_scale=1.0, device=device)
cfg_ddpg["gradient_steps"] = 1 # gradient steps
cfg_ddpg["batch_size"] = 128 # training batch size
cfg_ddpg["polyak"] = 0.005 # soft update hyperparameter (tau)
cfg_ddpg["discount_factor"] = 0.99 # discount factor (gamma)
cfg_ddpg["random_timesteps"] = 0 # random exploration steps
cfg_ddpg["learning_starts"] = 0 # learning starts after this many steps
cfg_ddpg["actor_learning_rate"] = 1e-4
cfg_ddpg["critic_learning_rate"] = 5e-4
# cfg_ddpg["rewards_shaper"] = lambda rewards, timestep, timesteps: rewards * 0.01 # rewards shaping function: Callable(reward, timestep, timesteps) -> reward
# log to TensorBoard every 100 timesteps and write checkpoints every 1000 timesteps
cfg_ddpg["experiment"]["write_interval"] = 100
cfg_ddpg["experiment"]["checkpoint_interval"] = 1000
# cfg_ddpg["experiment"]["experiment_name"] = ""
agent = DDPG(models=models_ddpg,
             memory=memory,
             cfg=cfg_ddpg,
             observation_space=env.observation_space,
             action_space=env.action_space,
             device=device)
# Configure and instantiate the RL trainer
cfg_trainer = {"timesteps": 320000, "headless": True}
trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=agent)
# start training
trainer.train()
Hi @sjywdxs

This is an interesting topic for discussion and research!

Off-policy algorithms are generally more sample-efficient than on-policy algorithms because they can learn from a broader set of experiences (off-policy algorithms can learn from experiences generated by any policy, not just the policy currently being executed). The problem with running multiple environments in parallel is that the samples collected are strongly correlated in time, and for each time step a lot of information (equal to the number of environments) accumulates. Off-policy algorithms are more suitable for problems with a small number of environments, at least with standard sampling memories.

However, for RL problems with a large number of parallel environments, on-policy algorithms outperform off-policy algorithms. On-policy algorithms learn from the experience generated by the current policy, which is specifically tailored to the current environment (so they can adapt more quickly and efficiently to the unique characteristics of each environment).
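To put rough numbers on "a lot of information accumulates", here is a back-of-the-envelope sketch (the 512 parallel environments figure is an assumption taken from another reply in this thread; the 320000 timesteps value comes from the script above):

num_envs = 512        # assumed number of parallel environments
timesteps = 320000    # cfg_trainer["timesteps"] from the script above

# each trainer timestep stores one transition per environment, all taken at the
# same moment in time, so they are strongly correlated with each other
transitions_per_timestep = num_envs
total_transitions = num_envs * timesteps

print(f"{transitions_per_timestep} correlated transitions stored per timestep")
print(f"{total_transitions:,} transitions collected over the whole run")  # 163,840,000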
I think another problem with this script is that you are using parallel environments (I was using Isaac Gym, which initializes 512 environments in parallel), but your gradient step is 1. If I understand the code correctly, that means you are updating on one batch for every 512 collected transitions. You can try to modify the gradient steps so that your update frequency is once per env step (see the sketch below). For comparison, I think PPO updates are counted in epochs, so every sample collected in the rollout is used #epochs times.
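A minimal sketch of that suggestion, assuming skrl's "gradient_steps" key controls the number of gradient updates performed per trainer timestep and that env.num_envs is 512 here (the exact value that works best for this task is untested):

# perform env.num_envs gradient updates per trainer timestep,
# i.e. one update per collected transition instead of one per timestep
cfg_ddpg["gradient_steps"] = env.num_envs  # e.g. 512
# if that is too slow or unstable, a smaller ratio is a reasonable middle ground:
# cfg_ddpg["gradient_steps"] = 16

Raising the gradient steps brings the update-to-data ratio closer to the single-environment setting these off-policy algorithms were originally tuned for.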
Did you solve this problem? If so, could you let me know the parameters?