Code to reproduce experiments from https://arxiv.org/abs/1903.01567
See the Makefile: run make with the appropriate target (e.g. make <experiment-name>) to launch a particular experiment. The configuration options used by the experiments are described below.
ckpt_path: the checkpoint to load if restore_model is True.
math: the coupling option, i.e. whether Pi_k is included in the posterior target
stable_old: when math is True, whether Pi_k is computed using the current policy or an old policy
l2: whether the individual posterior probabilities are reweighted using an L2 norm or an L1 sum
obstacle: whether there are two walls in the 8-P&P taskset
obstacle_height: when obstacle is True, the height of the walls
repeat: the number of times the one-hot vector is repeated; repeating it a few times helps the baseline PPO learn more quickly
redundant: whether the observation contains stage information of both boxes in 8-P&P,
or just the stage information of the current box
bring_close: the maximum allowable distance for successful reach above actions
drop_close: the maximum allowable distance for successful dropping actions
drop_width: the maximum allowable distance for successful carry actions
split: whether to normalize one-hot vectors in 8-P&P
bounded: whether to use a bounded distance
dist_bound: when bounded is True, the value of the distance bound
dist_obv: whether to include relative distance between objects in 8-P&P observation
above_target: the multiplier of the box size in 8-P&P that determines the distance above the box during reach above actions
stage_obv: whether stage is observable in 8-P&P
manhattan: whether distance is calculated using the Manhattan (L1) or the L2 metric
soft: whether the gating controller outputs softmax or hardmax selection
weighted: whether MPHRL reweights the cross entropy based on ground truth label
restore_model: whether to restore a checkpoint
always_restore: whether to restore the checkpoint for every new task in lifelong learning
oracle_master: whether to use oracle gating controller
old_policy: whether to use old policy in posterior calculation
enforced: whether to use minibatch optimization of the gating controller in target tasks
reset: whether to reset gating controller for new tasks
transfer: whether to transfer subpolicies for new tasks
paths: the lifelong learning taskset configuration
survival_reward: reward for the ant to be alive per timestep
record: record videos during training
prior: has no effect on MPHRL
num_cpus: number of actors
num_cores: same as num_cpus
bs_per_cpu: train batch size
max_num_timesteps: horizon
bs_per_core: batch size per core during parallel operations
total_ts: total timesteps per lifelong learning task
prev_timestep: the number of timesteps skipped when restoring a checkpoint, for faster convergence
num_batches: number of training batches per task
mov_avg: moving average calculation of accuracies, etc.
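How these flags are defined and consumed depends on the repo's own config files and argument parsing; purely as an illustrative sketch (the flag names mirror the list above, but the values and the dictionary structure are assumptions, not the repo's defaults), a configuration might look like:

```python
# Hypothetical sketch only -- the real experiments set these flags through the
# repo's own config/CLI machinery; the values below are illustrative, not defaults.
config = {
    "restore_model": False,     # restore a checkpoint before training
    "ckpt_path": "",            # checkpoint to load when restore_model is True
    "math": True,               # coupling option: include Pi_k in the posterior target
    "stable_old": True,         # compute Pi_k with an old policy rather than the current one
    "l2": False,                # reweight posteriors with an L1 sum instead of an L2 norm
    "soft": True,               # gating controller outputs a softmax (rather than hardmax) selection
    "oracle_master": False,     # use the learned gating controller, not the oracle
    "manhattan": False,         # use L2 rather than Manhattan distance
    "num_cpus": 8,              # number of actors
    "bs_per_cpu": 2048,         # train batch size
    "max_num_timesteps": 1000,  # horizon
    "total_ts": 2_000_000,      # total timesteps per lifelong learning task
}
```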
If you found this useful, consider citing:
@inproceedings{wu2019model,
  title={Model Primitive Hierarchical Lifelong Reinforcement Learning},
  author={Wu, Bohan and Gupta, Jayesh K and Kochenderfer, Mykel J},
  booktitle={Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems},
  pages={34--42},
  year={2019},
  organization={International Foundation for Autonomous Agents and Multiagent Systems}
}