This repository contains re-implementations of Deep RL algorithms for continuous action spaces. Some highlights:
- Code is readable, and written to be easy to modify for future research.
- Train and test on different environments (for generalization research).
- Built-in Tensorboard logging, parameter saving.
- Support for offline (batch) RL.
- Quick setup for benchmarks like Gym Mujoco, Atari, Pybullet, and DeepMind Control Suite.
- Separate training and learning routines, which make it easy to mix and match techniques that improve the training process with techniques that improve the learning update.
Paper: Continuous control with deep reinforcement learning, Lillicrap et al., 2015.
Description: A baseline model-free, off-policy, actor-critic method that forms the template for many of the other algorithms here.
Code: deep_control.ddpg (with extra comments for an intro to deep actor-critics)
Examples: examples/basic_control/ddpg_gym.py
Paper: Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al., 2018.
Description: Builds off of DDPG and makes several changes to improve the critic's learning and performance (Clipped Double Q Learning, Target Smoothing, Actor Delay). Also includes the TD regularization term from "TD-Regularized Actor-Critic Methods."
Code: deep_control.td3
Examples: examples/basic_control/td3_gym.py
Other References: author's implementation
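To make the TD3 changes concrete, here is a minimal sketch of the critic target with Clipped Double Q Learning and Target Smoothing. It is not the code in deep_control.td3: the function name, the scalar action, and the callables `actor_target`, `q1_target`, and `q2_target` are hypothetical stand-ins for the target networks, and the default hyperparameters are only illustrative.

```python
import random


def td3_target(reward, done, next_state, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Compute a TD3-style critic target for one transition (scalar action)."""
    # Target smoothing: add clipped Gaussian noise to the target action.
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    next_action = actor_target(next_state) + noise
    next_action = max(-max_action, min(max_action, next_action))

    # Clipped double Q: use the smaller of the two target critic estimates.
    next_q = min(q1_target(next_state, next_action),
                 q2_target(next_state, next_action))

    # Standard one-step bootstrap; no bootstrapping on terminal transitions.
    return reward + gamma * (1.0 - done) * next_q
```

Actor Delay is not shown here; it only means the actor and target networks are updated once every few critic updates.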
Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018.
Description: Samples actions from a stochastic actor rather than relying on added exploration noise during training. Uses a TD3-like double critic system. We also implement the learnable entropy coefficient approach described in the follow-up paper. This version also supports discrete action spaces and can avoid using target networks by applying the self-regularized critic updates from GRAC (see below).
Code: deep_control.sac
Examples: examples/dmc/sac.py, examples/sacd_demo.py
Other References: Yarats and Kostrikov's implementation, author's implementation.
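The learnable entropy coefficient can be sketched as a single gradient step on `log_alpha`. This is an illustration under assumptions, not the repository's code: `alpha_step` is a hypothetical name, the loss form `-alpha * mean(log_prob + target_entropy)` is one common variant, and plain gradient descent stands in for whatever optimizer is actually used.

```python
import math


def alpha_step(log_alpha, log_probs, target_entropy, lr=0.1):
    """One gradient-descent step on the entropy coefficient (in log space).

    log_probs: log pi(a|s) for a batch of actions sampled from the actor.
    target_entropy: desired policy entropy (often set to -action_dim).
    """
    alpha = math.exp(log_alpha)
    # Positive gap means the policy's entropy is below the target.
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # Loss: -alpha * mean_gap, so d(loss)/d(log_alpha) = -alpha * mean_gap.
    grad = -alpha * mean_gap
    return log_alpha - lr * grad
```

When the policy is less random than the target entropy asks for, the step raises alpha, which increases the weight of the entropy bonus; when it is more random, alpha shrinks.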
Paper: Measuring Visual Generalization in Continuous Control from Pixels
Description: This is a pixel-specific version of SAC with a few tricks/hyperparameter settings to improve performance. We include many different data augmentation techniques, including those used in RAD, DrQ and Network Randomization. The DrQ augmentation is turned on by default, and has a huge impact on performance.
Code: deep_control.sac_aug
Examples: examples/dmcr/sac_aug.py
Other References: SAC+AE code, RAD Procgen code.
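For readers unfamiliar with the DrQ augmentation, it is a small random image shift: pad the frame by a few pixels (replicating the border) and crop back to the original size at a random offset. The sketch below shows the idea on a single-channel image stored as nested lists. `random_shift` is a hypothetical helper, not a function from deep_control.sac_aug, and a real implementation would operate on batched tensors.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift a 2D image by up to `pad` pixels, replicating edge pixels."""
    height, width = len(image), len(image[0])
    # One random offset per image, each in [-pad, pad].
    dy = rng.randint(-pad, pad)
    dx = rng.randint(-pad, pad)

    def clamp(value, upper):
        return max(0, min(upper, value))

    # Reading with clamped indices is equivalent to replicate-padding
    # by `pad` pixels and then cropping a window of the original size.
    return [
        [image[clamp(i + dy, height - 1)][clamp(j + dx, width - 1)]
         for j in range(width)]
        for i in range(height)
    ]
```

In DrQ-style training the same kind of shift is applied independently to the observations used in the critic and actor updates.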
Paper: GRAC: Self-Regularized Actor-Critic, Shao et al., 2020.
Description: GRAC is a combination of a stochastic policy with TD3-like stability improvements and CEM-based action selection like you'd see in QT-Opt or CAQL.
Code: deep_control.grac
Examples: examples/dmc/grac.py
Other References: author's implementation
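CEM-based action selection can be sketched as follows: sample candidate actions around the policy's proposal, keep the ones the critic scores highest, refit the sampling distribution to those elites, and repeat. This is a generic cross-entropy method on a scalar action in [-1, 1], not the GRAC implementation; `cem_action`, `q_fn`, and all default values are assumptions made for illustration.

```python
import random
import statistics


def cem_action(q_fn, state, mean, std, iters=3, pop=64, n_elite=8, rng=random):
    """Refine a scalar action with the cross-entropy method.

    q_fn(state, action) -> float is the critic used to score candidates.
    mean, std: initial Gaussian proposal, e.g. taken from the actor's output.
    """
    for _ in range(iters):
        # Sample candidates and clip them to the valid action range.
        samples = [max(-1.0, min(1.0, rng.gauss(mean, std))) for _ in range(pop)]
        # Keep the candidates with the highest critic value.
        elites = sorted(samples, key=lambda a: q_fn(state, a), reverse=True)[:n_elite]
        # Refit the proposal to the elites; the epsilon keeps std positive.
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-6
    return mean
```

For example, with a toy critic that peaks at 0.5, `cem_action(lambda s, a: -(a - 0.5) ** 2, None, mean=0.0, std=0.5)` moves the proposal from 0.0 toward 0.5.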
Paper: Randomized Ensemble Double Q-Learning: Learning Fast Without a Model
Description: Extends the double Q trick to random subsets of a larger critic ensemble. Reduced Q function bias allows for a much higher replay ratio. REDQ is sample efficient but slow (compared to other model-free methods). We implement the SAC version.
Code: deep_control.redq
Examples: examples/dmc/redq.py
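The random-subset trick can be illustrated with a sketch of the SAC-style REDQ target for one transition. The names here are hypothetical and this is not the code in deep_control.redq: `next_qs` stands for the target critics' estimates of Q(s', a') for an action a' sampled from the current policy, and the default subset size and other hyperparameters are only illustrative.

```python
import random


def redq_target(reward, done, next_qs, next_log_prob,
                alpha=0.2, gamma=0.99, subset_size=2, rng=random):
    """Soft TD target using the min over a random subset of the ensemble."""
    # Pick a random subset of the ensemble's target estimates.
    subset = rng.sample(next_qs, subset_size)
    # Min over the subset, plus the SAC entropy bonus.
    soft_value = min(subset) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value
```

Every critic in the ensemble is regressed toward this same target. Setting `subset_size` to the ensemble size recovers a min over all critics, which is the most pessimistic choice.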
Paper: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction, Kumar et al., 2020.
Description: Reduces the effect of inaccurate target values propagating through the Q-function by learning to estimate the target networks' inaccuracies and adjusting the TD error accordingly. Implemented on top of standard SAC.
Code: deep_control.discor
Examples: examples/dmc/discor.py
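One way to picture the adjustment is as a per-sample weight on the TD loss that shrinks when the estimated error of the target at the next state-action pair is large. The sketch below follows the exponential form from the DisCor paper as we understand it, normalized over the batch with a softmax; `discor_weights`, the fixed `temperature`, and the normalization are assumptions and may not match deep_control.discor.

```python
import math


def discor_weights(next_errors, gamma=0.99, temperature=10.0):
    """Turn estimated target errors into normalized TD-loss weights.

    next_errors: estimated error of the target Q-value at (s', a'),
    one non-negative number per transition in the batch.
    """
    logits = [-gamma * err / temperature for err in next_errors]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]
```

Transitions whose targets are estimated to be accurate receive the largest weights, so they dominate the critic update.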
Paper: SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning, Lee et al., 2020.
Description: Extends SAC using an ensemble of actors and critics. Adds UCB-based exploration, ensembled inference, and a simpler weighted Bellman backup. This version does not use the replay buffer masks from the original.
Code: deep_control.sunrise
Examples: examples/dmc/sunrise.py
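UCB-based exploration can be sketched as scoring each candidate action by the ensemble's mean Q-value plus a bonus for ensemble disagreement, then acting on the best-scoring one. The helper below is a hypothetical illustration: `ucb_select`, the `critics` callables, and the coefficient `lam` are not names from deep_control.sunrise.

```python
import statistics


def ucb_select(candidates, critics, state, lam=1.0):
    """Pick the candidate action with the highest upper confidence bound.

    candidates: proposed actions, e.g. one sample from each actor.
    critics: list of callables q(state, action) -> float.
    """
    def ucb_score(action):
        q_values = [q(state, action) for q in critics]
        # Mean value plus a bonus proportional to ensemble disagreement.
        return statistics.mean(q_values) + lam * statistics.pstdev(q_values)

    return max(candidates, key=ucb_score)
```

Larger `lam` pushes the agent toward actions the critics disagree about, which are the ones it knows least about.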
Description: A simple approach to offline RL that trains the actor network to emulate the action choices of the demonstration dataset. Uses the stochastic actor from SAC and some basic ensembling to make this a reasonable baseline.
Code: deep_control.sbc
Examples: examples/d4rl/sbc.py
Paper: Accelerating Online Reinforcement Learning with Offline Datasets, Nair et al., 2020, and Critic Regularized Regression, Wang et al., 2020.
Description: TD3 with a stochastic policy and a modified actor update that makes better use of offline experience before finetuning in the online environment. The current implementation is a mix between AWAC and CRR. We allow for online finetuning and use standard critic networks as in AWAC, but add the binary advantage function, and max/mean advantage estimates from CRR.
Code: deep_control.awac
Examples: examples/d4rl/awac.py
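The modified actor update weights the log-likelihood of each dataset action by a function of its advantage. Here is a minimal sketch of the max/mean advantage estimates and the exponential and binary weightings mentioned above; the function names, `beta`, and the clipping value are assumptions for illustration and are not taken from deep_control.awac.

```python
import math


def advantage(q_dataset_action, q_policy_samples, mode="mean"):
    """Estimate A(s, a) = Q(s, a) - V(s) from Q-values of sampled policy actions."""
    if mode == "max":
        baseline = max(q_policy_samples)
    else:
        baseline = sum(q_policy_samples) / len(q_policy_samples)
    return q_dataset_action - baseline


def actor_weight(adv, kind="exp", beta=1.0, max_weight=20.0):
    """Weight applied to log pi(a|s) of a dataset action in the actor loss."""
    if kind == "binary":
        # Binary filter: only imitate actions that beat the baseline.
        return 1.0 if adv > 0 else 0.0
    # Exponential weighting, clipped to keep the loss bounded.
    return min(math.exp(adv / beta), max_weight)
```

The actor loss is then the batch mean of `-actor_weight(adv) * log_prob`, where `log_prob` is the policy's log-likelihood of the dataset action.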
Paper: When to Trust Your Model: Model-Based Policy Optimization, Janner et al., 2019.
Warning: in alpha
Description: Improves SAC's sample efficiency by training the policy on transitions generated by a learned world model.
Code: deep_control.mbpo
Other References: author's implementation.
git clone https://github.com/jakegrigsby/deep_control.git
cd deep_control
pip install -e .
See the examples folder for a look at how to train agents in environments like the DeepMind Control Suite and OpenAI Gym.
Things that will hopefully be included by the end of 2020: