
Temporal Difference Methods for Continuous Action Space

This repository investigates the suitability of temporal difference (TD) methods for continuous control action spaces. We implement and evaluate various TD algorithms across multiple environments, focusing on their performance under different configurations and hyperparameters. The results of these experiments, including comparisons across algorithms and environments, are available in notebook.ipynb. Due to the complexity and potential instability of some algorithms, hyperparameters and configurations are not fully validated, and caution is advised when interpreting the results.

Overview

This work explores the application of fundamental TD algorithms, which are efficient in both computation and memory and highly adaptive, to dynamic environments. TD methods balance computational simplicity with strong adaptability, making them well suited to reinforcement learning in dynamic systems. However, much of the existing research on TD methods has been confined to small, local systems, which limits their scalability and applicability to broader contexts. A recurring observation in the literature is the need to extend these methods to control systems operating in dynamic environments. This repository aims to address that gap by focusing on continuous control action spaces and by incorporating dynamic environmental factors into the evaluation of these algorithms.

Algorithms

This repository includes a variety of TD-based algorithms, drawing from foundational reinforcement learning research:

  • TD(0): The standard temporal difference learning algorithm introduced in Reinforcement Learning: An Introduction by Sutton and Barto (Sutton & Barto).
  • TD(0)-Replay: An extension of TD(0) that incorporates a replay mechanism, inspired by the idea of experience replay from Lin (1992) and following the TD(0)-Replay formulation (10.1109/IJCNN.2018.8489300).
  • TrueOnlineTD(λ): A version of TD(λ) with online updates for improved convergence, introduced by van Seijen and Sutton (arXiv:1512.04087).
  • TrueOnlineTD(λ)-Replay: Combines eligibility traces with replay mechanisms, based on True Online TD(λ)-Replay: An Efficient Model-Free Planning with Full Replay by Altahhan (10.1109/IJCNN48605.2020.9206608).
  • Sarsa and Expected Sarsa: On-policy TD algorithms with epsilon-greedy policies and expected value updates, derived from Sutton and Barto's textbook.
  • Doya’s Continuous TD Model: A TD model for reinforcement learning in continuous state and action spaces, based on the work of Kenji Doya (10.1162/089976600300015961) and implemented in Experimental/Doya.py.

These algorithms are implemented in separate modules for clarity and modularity, allowing for easy experimentation and extension.
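
For reference, the core update behind the simplest of these methods is the tabular TD(0) rule V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]. The sketch below illustrates that rule against a Gymnasium-style discrete environment under a random behaviour policy; the function and environment names are illustrative and do not mirror the repository's modules.

import numpy as np
import gymnasium as gym

def td0_episode(env, V, alpha=0.05, gamma=0.99):
    # One episode of tabular TD(0) prediction; updates the value table V in place.
    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = env.action_space.sample()  # placeholder random policy for illustration
        next_state, reward, terminated, truncated, _ = env.step(action)
        # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        target = reward if terminated else reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])
        state = next_state

# Example: evaluate the random policy on the standard Gymnasium FrozenLake.
env = gym.make("FrozenLake-v1")
V = np.zeros(env.observation_space.n)
for _ in range(1000):
    td0_episode(env, V)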

Environments

The repository evaluates the algorithms on several custom and prebuilt environments, all implemented with the Gymnasium library:

  • FrozenLake: A grid-based environment with discrete actions, modified for experimentation.
  • RandomWalk: A simple 1D random walk environment, adapted from the Sutton and Barto textbook examples.
  • LunarLander: A continuous control environment modified with custom reward strategies, originally from the Gymnasium Box2D suite.
  • Pendulum: A classic control problem testing continuous action spaces, implemented via Gymnasium.

Gymnasium, the actively maintained successor to OpenAI Gym (arXiv:1606.01540), provides flexible and customizable environments for algorithm evaluation.
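
Because the environments are built on Gymnasium, they follow its standard reset/step interface. As a point of reference, a bare interaction loop with the prebuilt Pendulum task looks roughly like the following; the random action is a stand-in for an agent's policy.

import gymnasium as gym

env = gym.make("Pendulum-v1")
observation, info = env.reset(seed=42)

for _ in range(200):
    action = env.action_space.sample()  # stand-in for the agent's continuous action
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()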

Results and Evaluations

The repository includes extensive evaluations of the algorithms across environments. Key findings from the experiments include:

  • TD(0)-Replay on Random Walk: With alpha=0.005 and gamma=1.0, this configuration achieved a success rate of 29.0%, highlighting the benefits of replay in improving sample efficiency.
  • TD(0)-Replay on Frozen Lake: Using alpha=0.002 and gamma=0.6, this method achieved a 100% success rate, demonstrating its robustness in grid-world environments.
  • Expected Sarsa: This algorithm showed promising results, achieving 27.71% success on Random Walk and 100% success on Frozen Lake with the best hyperparameters found in the search.
  • TrueOnlineTD(λ): Performance varied across environments. For instance, it achieved a success rate of 25.0% on Random Walk but struggled in Frozen Lake.
  • TrueOnlineTD(λ)-Replay: Replay mechanisms slightly improved performance in Random Walk, with a success rate of 26.5%, though it remained unstable in Frozen Lake.
  • Doya's Continuous TD Model: Demonstrated potential in adapting TD methods for dynamic continuous spaces but requires further work for stability in complex environments.

Detailed summaries, metrics, and plots for each experiment can be found in notebook.ipynb.

Installation

Set up the repository using the provided Makefile:

make install

This command installs the necessary dependencies, sets up a Python virtual environment, and prepares the system for experimentation.

Running Experiments

To evaluate a specific algorithm on a given environment, import the corresponding module and run a cross-validation search. For example:

from TD.TDZero import TDZeroCV
from Environments.RandomWalk import make_random_walk

# Build the environment and define the hyperparameter grid to search.
env = make_random_walk()
cv = TDZeroCV(env, {"alpha": [0.003], "gamma": [0.7]})

# Run the hyperparameter search, then summarize and plot the collected metrics.
cv.search(episodes=5000)
cv.summary()
cv.plot_metrics()

Hyperparameter search results and metrics can be visualized using the built-in plotting tools.
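
The same pattern extends to a broader sweep by listing several values per hyperparameter. The grid below is purely illustrative, and the CV wrappers for the other algorithms are assumed to follow the same constructor and search interface:

from TD.TDZero import TDZeroCV
from Environments.RandomWalk import make_random_walk

# Illustrative grid; the values are examples, not tuned recommendations.
param_grid = {
    "alpha": [0.001, 0.002, 0.005, 0.01],
    "gamma": [0.6, 0.8, 1.0],
}

env = make_random_walk()
cv = TDZeroCV(env, param_grid)
cv.search(episodes=5000)   # evaluate every alpha/gamma combination
cv.summary()               # summarize results per configuration
cv.plot_metrics()          # plot the recorded metrics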

Future Work

The primary focus of future work is on stabilizing the models and exploring their applicability to continuous action spaces. Specific directions include:

  • Stabilization Strategies: Addressing instability in algorithms like TrueOnlineTD(λ)-Replay and improving convergence in environments with sparse or noisy rewards.
  • Dynamic Action Spaces: Extending the methods to support dynamic action spaces, where the agent must adapt to changing action availability or constraints during training.
  • Exploring Doya’s Methods: Further investigating the continuous TD model introduced by Doya, refining its implementation in Experimental/Doya.py to improve performance in dynamic and high-dimensional environments.
  • Complex Continuous Environments: Expanding evaluations to more challenging tasks, such as those found in Mujoco, to assess scalability and robustness in real-world-inspired settings.
  • Hyperparameter Optimization: Implementing automated tuning pipelines to systematically explore the parameter space for all algorithms across diverse environments.

This repository provides a foundation for studying TD methods in dynamic systems, emphasizing their potential for scalability and real-world applications. Researchers and practitioners are encouraged to build upon this work to refine algorithms and extend their applicability to broader domains.
