TME RLADL

Answers to the Reinforcement Learning course exercises of Sorbonne Université M2 DAC (Master of Data Science), by Victor Duthoit and Pierre Wan-Fat.

Useful resources

Learn as soon as possible how to use TensorBoard and the logging module, and implement a robust checkpointing module. This will save you a lot of time.
Start simple: you don't need to code advanced techniques (such as Target Network or Prioritized Replay Buffer) at first, especially on simple environments (CartPole).
As soon as you are reasonably confident in your algorithm, use a grid search to tune the hyperparameters and launch it on the PPTI. Don't try to tune the hyperparameters by hand: it will likely not work, and you will lose a lot of time as well as your mind!
Your teachers will sometimes give you a LOT of (confusing) boilerplate code. You don't have to use it to succeed; starting fresh is sometimes the best choice.
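The grid search recommended above can be sketched in a few lines of standard-library Python. This is only an illustration: the grid values, `expand_grid`, `grid_search` and `dummy_train` are hypothetical names, not code from this repository, and `dummy_train` stands in for a real training run that returns a score to maximize.

```python
import itertools


def expand_grid(grid):
    """Yield one dict per combination of the hyperparameter values in `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(train_fn, grid):
    """Run `train_fn` on every configuration and return (best_score, best_config)."""
    results = [(train_fn(**config), config) for config in expand_grid(grid)]
    return max(results, key=lambda pair: pair[0])


# Hypothetical grid: names and values are placeholders, not tuned settings.
GRID = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "gamma": [0.98, 0.99],
    "batch_size": [32, 64],
}


def dummy_train(learning_rate, gamma, batch_size):
    """Stand-in for a real training run; returns a fake score to maximize."""
    return -abs(learning_rate - 1e-3) + gamma - batch_size * 1e-4


if __name__ == "__main__":
    best_score, best_config = grid_search(dummy_train, GRID)
    print(best_config, best_score)
```

In practice each configuration would probably be launched as its own process, so that a crash in one run does not take down the whole search.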

Before doing anything, check that nobody is already using the machine you are connected to (who and nvidia-smi).
There is always a risk that your script randomly crashes without you noticing, and in any case the PPTI reboots every day at 8 AM. So be sure to log and checkpoint everything you do.
Learn how to use tmux.
The magic command to install a Python package is pip3 install --user --proxy=proxy:3128 xxx.
Git also works on the PPTI; you just need to configure the HTTP proxy to proxy:3128.
Use /tempory if you don't have enough space in your home folder (which is limited to around 3 GB).
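Since a run can be killed at any time (crash or daily reboot), it helps to make the training loop resumable. The sketch below is one possible way to do it with the standard library only. `save_checkpoint`, `load_checkpoint`, `run` and the contents of `state` are hypothetical, and the "training" step is a placeholder. With PyTorch, you would likely store the model and optimizer `state_dict()` with `torch.save` instead of `pickle`.

```python
import logging
import os
import pickle
import tempfile

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("experiment")


def save_checkpoint(state, path):
    """Write `state` atomically so a crash mid-write never corrupts the previous file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_episodes, checkpoint_every=10):
    """Toy training loop that resumes from `path` if a checkpoint is present."""
    state = load_checkpoint(path, {"episode": 0, "returns": []})
    for episode in range(state["episode"], n_episodes):
        state["returns"].append(float(episode))  # placeholder for a real episode return
        state["episode"] = episode + 1
        if state["episode"] % checkpoint_every == 0:
            save_checkpoint(state, path)
            logger.info("checkpoint saved at episode %d", state["episode"])
    return state
```

Calling `run("ckpt.pkl", 1000)` again after an interruption picks up from the last saved episode instead of starting from scratch. At most `checkpoint_every - 1` episodes are lost.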

TME 1:
- Epsilon greedy, UCB and Lin-UCB for Bandits.
TME 2:
- Policy Iteration on GridWorld.
- Value Iteration on GridWorld.
TME 3:
- Q-Learning on GridWorld.
- SARSA on GridWorld.
TME 4:
- DQN on CartPole.
- Dueling DQN on CartPole.
- Prioritized DQN on CartPole.
- DQN on LunarLander.
TME 5: Actor-Critic.
- TD(0) on CartPole.
- TD(0) on LunarLander.
TME 6: PPO.
- Adaptive PPO on CartPole.
- Adaptive PPO on LunarLander.
- Clipped PPO on CartPole.
- Clipped PPO on LunarLander.
TME 7: DDPG.
- DDPG on Pendulum.
- DDPG on LunarLander.
- DDPG on MountainCar.
TME 8: GAN.
TME 9: VAE
- VAE (Linear).
- VAE (Convolutional).
TME 10: MADDPG
- Simple spread.
- Simple adversary.
- Simple tag.
TME 11: Imitation learning
- Behavior cloning on LunarLander.
- GAIL on LunarLander.
TME 12: Curriculum Learning
- Goal sampling.
- HER (Hindsight Experience Replay).
- IGS (Iterative Goal Sampling).
