Commit

docs: update README

Gaiejj authored Aug 19, 2023
2 parents 427fd66 + 7a1b1f9 commit d679a91
Showing 10 changed files with 47 additions and 41 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -6,6 +6,8 @@

[![Organization](https://img.shields.io/badge/Organization-PKU--Alignment-blue)](https://github.com/PKU-Alignment)
[![License](https://img.shields.io/github/license/PKU-Alignment/OmniSafe?label=license)](#license)
[![codecov](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization/graph/badge.svg?token=KF0UM0UNXW)](https://codecov.io/gh/PKU-Alignment/Safe-Policy-Optimization)
[![Documentation Status](https://readthedocs.org/projects/safe-policy-optimization/badge/?version=latest)](https://safe-policy-optimization.readthedocs.io/en/latest/?badge=latest)

</div>

18 changes: 8 additions & 10 deletions docs/source/algorithms/comparision.rst
@@ -1,27 +1,25 @@
Trustworthy Implementation
==========================

To ensure that the implementation is trustworthy, we have compared our
implementation with open source implementations of the same algorithms.
To ensure that SafePO's implementation is trustworthy, we have compared
our algorithms' performance with open source implementations of the same algorithms.
As some of the algorithms cannot be found in open source, we selected
``PPOLag``, ``TRPOLag``, ``CPO`` and ``FOCOPS`` for comparison.
``PPO-Lag``, ``TRPOLag``, ``CPO`` and ``FOCOPS`` for comparison.

We have compared the following algorithms:

- ``PPOLag``: `OpenAI Baselines: Safety Starter Agents <https://github.com/openai/safety-starter-agents>`_
- ``PPO-Lag``: `OpenAI Baselines: Safety Starter Agents <https://github.com/openai/safety-starter-agents>`_
- ``TRPOLag``: `OpenAI Baselines: Safety Starter Agents <https://github.com/openai/safety-starter-agents>`_, `RL Safety Algorithms <https://github.com/SvenGronauer/RL-Safety-Algorithms>`_
- ``CPO``: `OpenAI Baselines: Safety Starter Agents <https://github.com/openai/safety-starter-agents>`_, `RL Safety Algorithms <https://github.com/SvenGronauer/RL-Safety-Algorithms>`_
- ``FOCOPS``: `Original Implementation <https://github.com/ymzhang01/focops>`_

We compared those algorithms in 14 tasks from `Safety-Gymnasium <https://github.com/PKU-Alignment/safety-gymnasium>`_,
We compared those algorithms in 12 tasks from `Safety-Gymnasium <https://github.com/PKU-Alignment/safety-gymnasium>`_,
they are:

- ``SafetyPointButton1-v0``
- ``SafetyPointCircle1-v0``
- ``SafetyPointGoal1-v0``
- ``SafetyPointPush1-v0``
- ``SafetyCarButton1-v0``
- ``SafetyCarCircle1-v0``
- ``SafetyCarGoal1-v0``
- ``SafetyCarPush1-v0``
- ``SafetyAntVelocity-v1``
@@ -35,11 +33,11 @@ The results are shown as follows.

.. tab-set::

.. tab-item:: PPOLag
.. tab-item:: PPO-Lag

.. raw:: html

<iframe src="https://wandb.ai/pku_rl/SafePO/reports/Comparison-of-PPOLag-s-Implementation--Vmlldzo1MTgxOTkx" style="border:none;width:90%; height:1000px" >
<iframe src="https://wandb.ai/pku_rl/SafePO/reports/Comparison-of-PPO-Lag-s-Implementation--Vmlldzo1MTgxOTkx" style="border:none;width:90%; height:1000px" >

.. raw:: html

@@ -49,7 +47,7 @@

.. raw:: html

<iframe src="https://wandb.ai/pku_rl/SafePO/reports/Comparison-of-TRPOLag-s-Implementation--Vmlldzo1MTgyMDAz" style="border:none;width:90%; height:1000px" >
<iframe src="https://wandb.ai/pku_rl/SafePO/reports/Comparison-of-TRPO-Lag-s-Implementation--Vmlldzo1MTgyMDAz" style="border:none;width:90%; height:1000px" >

.. raw:: html

2 changes: 1 addition & 1 deletion docs/source/algorithms/curve.rst
@@ -42,7 +42,7 @@ First order

</iframe>

.. tab-item:: PPOLag
.. tab-item:: PPO-Lag

.. raw:: html

4 changes: 2 additions & 2 deletions docs/source/algorithms/first_order.rst
@@ -1,5 +1,5 @@
First Order Projection
======================
First Order Projection Methods
==============================

Experiment Results
------------------
6 changes: 3 additions & 3 deletions docs/source/algorithms/lag.rst
@@ -1,12 +1,12 @@
Lagrangian Method
=================
Lagrangian Methods
==================

Experiment Results
------------------

.. tab-set::

.. tab-item:: PPOLag
.. tab-item:: PPO-Lag

.. raw:: html

2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -66,7 +66,7 @@ results (evaluation outcomes, training curves) in ``safepo/results``.

.. toctree::
:hidden:
:caption: ALGORITHM
:caption: ALGORITHMS

algorithms/curve
algorithms/lag
4 changes: 2 additions & 2 deletions docs/source/usage/eval.rst
@@ -1,5 +1,5 @@
Evaluating Trained Model
========================
Evaluating Trained Models
=========================

Model Evaluation
----------------
4 changes: 2 additions & 2 deletions docs/source/usage/implement.rst
@@ -33,9 +33,9 @@ Briefly, the ``PPO`` in SafePO has the following characteristics, which are also
Beyond the above characteristics, the ``PPO`` in SafePO also provides a pipeline for data collection and training.
You can customize new algorithms based on it.

Next, we provide a detailed example showing how to customize the ``PPO`` algorithm into the ``PPOLag`` algorithm.
Next, we provide a detailed example showing how to customize the ``PPO`` algorithm into the ``PPO-Lag`` algorithm.

Example: PPOLag
Example: PPO-Lag
----------------

The Lagrangian multiplier is a useful tool for controlling constraint violation in Safe RL algorithms.
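
As a concrete illustration of this idea, below is a minimal sketch of a Lagrangian multiplier update. It is an assumption-laden example for exposition only, not SafePO's actual implementation; the class name, hyperparameters, and simple gradient-based rule are placeholders.

.. code-block:: python

    # Minimal sketch of a Lagrangian multiplier update (illustrative only;
    # the names and hyperparameters are assumptions, not SafePO's real API).
    import torch


    class SimpleLagrange:
        def __init__(self, cost_limit: float, init_value: float = 0.0, lr: float = 0.035):
            self.cost_limit = cost_limit
            # The multiplier is a learnable scalar adjusted in the direction of
            # the observed constraint violation.
            self.multiplier_param = torch.nn.Parameter(
                torch.tensor(init_value), requires_grad=True
            )
            self.optimizer = torch.optim.Adam([self.multiplier_param], lr=lr)

        @property
        def multiplier(self) -> float:
            # Project onto the non-negative reals when the value is read.
            return max(self.multiplier_param.item(), 0.0)

        def update(self, mean_episode_cost: float) -> None:
            # Minimizing -lambda * (J_c - d) raises lambda when the observed cost
            # exceeds the limit d and lowers it otherwise.
            loss = -self.multiplier_param * (mean_episode_cost - self.cost_limit)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

In PPO-Lag-style algorithms the resulting multiplier is typically folded into the policy objective, for example by replacing the reward advantage with ``(adv_r - lam * adv_c) / (1.0 + lam)`` before computing the clipped surrogate loss.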
34 changes: 20 additions & 14 deletions docs/source/usage/make.rst
@@ -1,5 +1,5 @@
Efficient Command
=================
Efficient Commands
==================

To help users quickly reproduce our results,
we provide a command line tool for easy installation, benchmarking, and evaluation.
@@ -9,6 +9,11 @@ One line benchmark running

First, create a conda environment with Python 3.8.

.. code-block:: bash

    conda create -n safepo python=3.8
    conda activate safepo

Then, run the following command to install SafePO and run the full benchmark:

.. code-block:: bash
@@ -42,19 +47,19 @@ The terminal output would be like:
.. code-block:: bash
======= commands to run:
running python macpo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
running python mappo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
running python mappolag.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
running python happo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 2000 --num-envs 1
running python macpo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
running python mappo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
running python mappolag.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
running python happo.py --agent-conf 2x4 --scenario Ant --seed 0 --write-terminal False --experiment benchmark --headless True --total-steps 10000000
...
running python pcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python ppo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python cup.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python focops.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python rcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python trpo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python cpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python cppo_pid.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 2000 --num-envs 1 --steps-per-epoch 1000
running python pcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python ppo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python cup.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python focops.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python rcpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python trpo_lag.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python cpo.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
running python cppo_pid.py --task SafetyAntVelocity-v1 --seed 0 --write-terminal False --experiment benchmark --total-steps 10000000
...
Plotting from...
==================================================
@@ -81,3 +86,4 @@
After 1 episodes evaluation, the focops in SafetyPointGoal1-v0 evaluation reward: 12.21±2.18, cost: 26.0±19.51, the result is saved in ./results/benchmark/eval_result.txt
Start evaluating cppo_pid in SafetyPointGoal1-v0
After 1 episodes evaluation, the cppo_pid in SafetyPointGoal1-v0 evaluation reward: 13.42±0.44, cost: 18.79±2.1, the result is saved in ./results/benchmark/eval_result.txt
...
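
For reference, each script listed in the benchmark output above can also be launched on its own with the same flags. A minimal sketch (the working directory, task, seed, and experiment name below are assumptions for illustration):

.. code-block:: bash

    # Run a single PPO-Lag experiment directly, reusing flags that appear in the
    # benchmark output above (task, seed, and step budget are illustrative).
    python ppo_lag.py --task SafetyPointGoal1-v0 --seed 0 --total-steps 10000000 --experiment single_run

Judging from the evaluation output above, results for an experiment named ``benchmark`` are written under ``./results/benchmark``; a different ``--experiment`` value would presumably be saved under the corresponding directory.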
12 changes: 6 additions & 6 deletions safepo/common/env.py
@@ -81,7 +81,7 @@ def create_env() -> Callable:

def make_sa_isaac_env(args, cfg, sim_params):
"""
Creates and returns a VecTaskPython environment for the single agent Shadow Hand task.
Creates and returns a VecTaskPython environment for the single agent Isaac Gym task.
Args:
args: Command-line arguments.
@@ -90,10 +90,10 @@ def make_sa_isaac_env(args, cfg, sim_params):
sim_params: Parameters for the simulation.
Returns:
env: VecTaskPython environment for the single agent Shadow Hand task.
env: VecTaskPython environment for the single agent Isaac Gym task.
Warning:
SafePO's single agent Shadow Hand task is not ready for use yet.
SafePO's single agent Isaac Gym task is not ready for use yet.
"""
# create native task and pass custom config
device_id = args.device_id
@@ -119,7 +119,7 @@ def make_sa_isaac_env(args, cfg, sim_params):

def make_ma_mujoco_env(scenario, agent_conf, seed, cfg_train):
"""
Creates and returns a multi-agent environment using Mujoco scenarios.
Creates and returns a multi-agent environment using MuJoCo scenarios.
Args:
args: Command-line arguments.
@@ -152,7 +152,7 @@ def init_env():

def make_ma_isaac_env(args, cfg, cfg_train, sim_params, agent_index):
"""
Creates and returns a multi-agent environment for the Shadow Hand task.
Creates and returns a multi-agent environment for the Isaac Gym task.
Args:
args: Command-line arguments.
@@ -162,7 +162,7 @@ def make_ma_isaac_env(args, cfg, cfg_train, sim_params, agent_index):
agent_index: Index of the agent within the multi-agent environment.
Returns:
env: A multi-agent environment for the Shadow Hand task.
env: A multi-agent environment for the Isaac Gym task.
"""
# create native task and pass custom config
device_id = args.device_id
