diff --git a/README.md b/README.md
index 8bc9419..73c9a66 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@
 [![Organization](https://img.shields.io/badge/Organization-PKU--Alignment-blue)](https://github.com/PKU-Alignment)
 [![License](https://img.shields.io/github/license/PKU-Alignment/OmniSafe?label=license)](#license)
+[![codecov](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization/graph/badge.svg?token=KF0UM0UNXW)](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization)
+[![Documentation Status](https://readthedocs.org/projects/safe-policy-optimization/badge/?version=latest)](https://safe-policy-optimization.readthedocs.io/en/latest/?badge=latest)
diff --git a/docs/source/algorithms/comparision.rst b/docs/source/algorithms/comparision.rst
index 5641170..c76c755 100644
--- a/docs/source/algorithms/comparision.rst
+++ b/docs/source/algorithms/comparision.rst
@@ -1,27 +1,25 @@
 Trustworthy Implementation
 ==========================
 
-To ensure that the implementation is trustworthy, we have compared our
-implementation with open source implementations of the same algorithms.
+To ensure that SafePO's implementation is trustworthy, we have compared
+our algorithms' performance with open-source implementations of the same algorithms.
 As some of the algorithms can not be found in open source, we selected
-``PPOLag``, ``TRPOLag``, ``CPO`` and ``FOCOPS`` for comparison.
+``PPO-Lag``, ``TRPOLag``, ``CPO`` and ``FOCOPS`` for comparison.
 
 We have compared the following algorithms:
 
-- ``PPOLag``: `OpenAI Baselines: Safety Starter Agents `_
+- ``PPO-Lag``: `OpenAI Baselines: Safety Starter Agents `_
 - ``TRPOLag``: `OpenAI Baselines: Safety Starter Agents `_, `RL Safety Algorithms `_
 - ``CPO``: `OpenAI Baselines: Safety Starter Agents `_, `RL Safety Algorithms `_
 - ``FOCOPS``: `Original Implementation `_
 
-We compared those alforithms in 14 tasks from `Safety-Gymnasium `_,
+We compared these algorithms in 12 tasks from `Safety-Gymnasium `_,
 they are:
 
 - ``SafetyPointButton1-v0``
 - ``SafetyPointCircle1-v0``
 - ``SafetyPointGoal1-v0``
-- ``SafetyPointPush1-v0``
 - ``SafetyCarButton1-v0``
-- ``SafetyCarCircle1-v0``
 - ``SafetyCarGoal1-v0``
 - ``SafetyCarPush1-v0``
 - ``SafetyAntVelocity-v1``
@@ -35,11 +33,11 @@ The results are shown as follows.
 
 .. tab-set::
 
-    .. tab-item:: PPOLag
+    .. tab-item:: PPO-Lag
 
         .. raw:: html
 
-
-    .. tab-item:: PPOLag
+    .. tab-item:: PPO-Lag
 
         .. raw:: html
diff --git a/docs/source/algorithms/first_order.rst b/docs/source/algorithms/first_order.rst
index 5cbb828..49f058a 100644
--- a/docs/source/algorithms/first_order.rst
+++ b/docs/source/algorithms/first_order.rst
@@ -1,5 +1,5 @@
-First Order Projection
-======================
+First Order Projection Methods
+==============================
 
 Experiment Results
 ------------------
diff --git a/docs/source/algorithms/lag.rst b/docs/source/algorithms/lag.rst
index 34067c3..bbd93b5 100644
--- a/docs/source/algorithms/lag.rst
+++ b/docs/source/algorithms/lag.rst
@@ -1,12 +1,12 @@
-Lagrangian Method
-=================
+Lagrangian Methods
+==================
 
 Experiment Results
 ------------------
 
 .. tab-set::
 
-    .. tab-item:: PPOLag
+    .. tab-item:: PPO-Lag
 
         .. raw:: html
diff --git a/docs/source/index.rst b/docs/source/index.rst
index e1ce6a4..59c6511 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -66,7 +66,7 @@ results (eavluation outcomes, training curves) in ``safepo/results``.
 
 .. toctree::
    :hidden:
-   :caption: ALGORITHM
+   :caption: ALGORITHMS
 
    algorithms/curve
    algorithms/lag
diff --git a/docs/source/usage/eval.rst b/docs/source/usage/eval.rst
index 4fc2fac..28f7163 100644
--- a/docs/source/usage/eval.rst
+++ b/docs/source/usage/eval.rst
@@ -1,5 +1,5 @@
-Evaluating Trained Model
-========================
+Evaluating Trained Models
+=========================
 
 Model Evaluation
 ----------------
diff --git a/docs/source/usage/implement.rst b/docs/source/usage/implement.rst
index 70f9fe9..af79ef8 100644
--- a/docs/source/usage/implement.rst
+++ b/docs/source/usage/implement.rst
@@ -33,9 +33,9 @@ Breifly, the ``PPO`` in SafePO has the following characteristics, which are also
 Beyond the above characteristics, the ``PPO`` in SafePO also provides a training pipeline for data collection and training.
 You can customize new alforithms based on it.
 
-Next we will provide a detailed example to show how to customize the ``PPO`` algorithm to ``PPOLag`` algorithm.
+Next, we will provide a detailed example to show how to customize the ``PPO`` algorithm into the ``PPO-Lag`` algorithm.
 
-Example: PPOLag
----------------
+Example: PPO-Lag
+----------------
 
 The Lagrangian multiplier is a useful tool to control the constraint violation in the Safe RL algorithms.
diff --git a/docs/source/usage/make.rst b/docs/source/usage/make.rst
index c805b1c..e0f19a6 100644
--- a/docs/source/usage/make.rst
+++ b/docs/source/usage/make.rst
@@ -1,5 +1,5 @@
-Efficient Command
-=================
+Efficient Commands
+==================
 
 To help users quickly reporduce our results, we provide a command
 line tool for easy installation, benchmarking, and evaluation.
@@ -9,6 +9,11 @@ One line benchmark running
 
 First, create a conda environment with Python 3.8.
 
+.. code-block:: bash
+
+    conda create -n safepo python=3.8
+    conda activate safepo
+
 Then, run the following command to install SafePO and run the full benchmark:
 
 .. code-block:: bash
@@ -42,19 +47,19 @@ The terminal output would be like:
 
 .. code-block:: bash
 
     ======= commands to run:
-    running python macpo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
-    running python mappo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
-    running python mappolag.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
-    running python happo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
+    running python macpo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
+    running python mappo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
+    running python mappolag.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
+    running python happo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
     ...
-    running python pcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python ppo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python cup.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python focops.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python rcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python trpo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python cpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
-    running python cppo_pid.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
+    running python pcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python ppo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python cup.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python focops.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python rcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python trpo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python cpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
+    running python cppo_pid.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
     ...
     Plotting from...
     ==================================================
@@ -81,3 +86,4 @@ The terminal output would be like:
     After 1 episodes evaluation, the focops in SafetyPointGoal1-v0 evaluation reward: 12.21±2.18, cost: 26.0±19.51, the reuslt is saved in ./results/benchmark/eval_result.txt
     Start evaluating cppo_pid in SafetyPointGoal1-v0
     After 1 episodes evaluation, the cppo_pid in SafetyPointGoal1-v0 evaluation reward: 13.42±0.44, cost: 18.79±2.1, the reuslt is saved in ./results/benchmark/eval_result.txt
+    ...
\ No newline at end of file
diff --git a/safepo/common/env.py b/safepo/common/env.py
index f001924..0e645fe 100644
--- a/safepo/common/env.py
+++ b/safepo/common/env.py
@@ -81,7 +81,7 @@ def create_env() -> Callable:
 
 def make_sa_isaac_env(args, cfg, sim_params):
     """
-    Creates and returns a VecTaskPython environment for the single agent Shadow Hand task.
+    Creates and returns a VecTaskPython environment for the single agent Isaac Gym task.
 
     Args:
         args: Command-line arguments.
@@ -90,10 +90,10 @@ def make_sa_isaac_env(args, cfg, sim_params):
         sim_params: Parameters for the simulation.
 
     Returns:
-        env: VecTaskPython environment for the single agent Shadow Hand task.
+        env: VecTaskPython environment for the single agent Isaac Gym task.
 
     Warning:
-        SafePO's single agent Shadow Hand task is not ready for use yet.
+        SafePO's single agent Isaac Gym task is not ready for use yet.
     """
     # create native task and pass custom config
     device_id = args.device_id
@@ -119,7 +119,7 @@ def make_sa_isaac_env(args, cfg, sim_params):
 
 def make_ma_mujoco_env(scenario, agent_conf, seed, cfg_train):
     """
-    Creates and returns a multi-agent environment using Mujoco scenarios.
+    Creates and returns a multi-agent environment using MuJoCo scenarios.
 
     Args:
         args: Command-line arguments.
@@ -152,7 +152,7 @@ def init_env():
 
 def make_ma_isaac_env(args, cfg, cfg_train, sim_params, agent_index):
    """
-    Creates and returns a multi-agent environment for the Shadow Hand task.
+    Creates and returns a multi-agent environment for the Isaac Gym task.
 
     Args:
         args: Command-line arguments.
@@ -162,7 +162,7 @@ def make_ma_isaac_env(args, cfg, cfg_train, sim_params, agent_index):
         agent_index: Index of the agent within the multi-agent environment.
 
     Returns:
-        env: A multi-agent environment for the Shadow Hand task.
+        env: A multi-agent environment for the Isaac Gym task.
     """
     # create native task and pass custom config
     device_id = args.device_id
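The ``docs/source/usage/implement.rst`` hunk above refers to the Lagrangian multiplier that turns ``PPO`` into ``PPO-Lag``, but the mechanism itself is outside the patched lines. The sketch below is illustrative only: it is not part of this patch and not SafePO's actual code, and the ``Lagrange`` class, its method names, and the default coefficients are hypothetical.

.. code-block:: python

    # Illustrative sketch only -- not SafePO's implementation; names and defaults are hypothetical.
    import torch


    class Lagrange:
        """Keep a non-negative multiplier that penalizes constraint violation."""

        def __init__(self, cost_limit: float, init_value: float = 0.001, lr: float = 0.035):
            self.cost_limit = cost_limit
            self.multiplier = torch.nn.Parameter(torch.tensor(init_value))
            self.optimizer = torch.optim.Adam([self.multiplier], lr=lr)

        def update(self, mean_episode_cost: float) -> None:
            # Gradient ascent on the violation: the multiplier grows while the observed
            # episode cost exceeds the limit and decays toward zero otherwise.
            loss = -self.multiplier * (mean_episode_cost - self.cost_limit)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            self.multiplier.data.clamp_(min=0.0)

    # A PPO-style update would then penalize its policy loss with the multiplier, e.g.:
    #     loss = policy_loss + lagrange.multiplier.item() * cost_policy_loss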