From 51818709f1bef340f1abc67ce3f6d096c75338c3 Mon Sep 17 00:00:00 2001
From: Jiayi Zhou <108712610+Gaiejj@users.noreply.github.com>
Date: Tue, 3 Sep 2024 23:47:50 +0800
Subject: [PATCH] docs: update appendix (#350)
---
.github/workflows/lint.yml | 4 -
docs/source/benchmark/case-study.md | 54 +
docs/source/benchmark/modelbased.md | 223 ++
docs/source/benchmark/off-policy.md | 938 ++++++
docs/source/benchmark/offline.md | 275 ++
docs/source/benchmark/on-policy.md | 2841 +++++++++++++++++
docs/source/index.rst | 15 +
docs/source/spelling_wordlist.txt | 11 +
docs/source/start/algo.md | 111 +
docs/source/start/efficiency.rst | 60 +
docs/source/start/exp-grid.md | 31 +
docs/source/start/features.md | 257 ++
omnisafe/adapter/modelbased_adapter.py | 2 +-
omnisafe/common/logger.py | 4 +-
omnisafe/common/offline/data_collector.py | 2 +-
.../envs/classic_control/envs_from_crabs.py | 2 +-
omnisafe/envs/safety_gymnasium_modelbased.py | 9 +-
omnisafe/evaluator.py | 2 +-
omnisafe/utils/plotter.py | 3 +-
pyproject.toml | 20 +-
20 files changed, 4842 insertions(+), 22 deletions(-)
create mode 100644 docs/source/benchmark/case-study.md
create mode 100644 docs/source/benchmark/modelbased.md
create mode 100644 docs/source/benchmark/off-policy.md
create mode 100644 docs/source/benchmark/offline.md
create mode 100644 docs/source/benchmark/on-policy.md
create mode 100644 docs/source/start/algo.md
create mode 100644 docs/source/start/efficiency.rst
create mode 100644 docs/source/start/exp-grid.md
create mode 100644 docs/source/start/features.md
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
index 80db45035..e6f178cb0 100644
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -45,10 +45,6 @@ jobs:
run: |
make pre-commit
- - name: ruff
- run: |
- make ruff
-
- name: flake8
run: |
make flake8
diff --git a/docs/source/benchmark/case-study.md b/docs/source/benchmark/case-study.md
new file mode 100644
index 000000000..28f93ba72
--- /dev/null
+++ b/docs/source/benchmark/case-study.md
@@ -0,0 +1,54 @@
+# Case Study
+
+One important motivation for SafeRL is to enable agents to explore and
+learn safely. Evaluating algorithm performance with respect to
+*procedural constraint violations* is therefore also important. We have
+selected representative experimental results, reported in Figure 1 and Figure 2:
+
+#### Radical vs. Conservative
+
+*Radical* policies often pursue higher rewards but violate more safety
+constraints, whereas *conservative* policies do the opposite.
+Figure 1 illustrates this: during training, CPO and
+PPOLag consistently pursue the highest rewards among all algorithms, as
+depicted in the first row. However, as shown in the second row, they
+experience significant fluctuations in constraint violations, especially
+PPOLag. They are therefore relatively radical, *i.e.,* higher rewards but
+higher costs. In comparison, while P3O achieves slightly lower rewards
+than PPOLag, it exhibits fewer oscillations in constraint violations,
+making it safer in adhering to safety constraints, as is evident from the
+smaller proportion of its distribution crossing the black dashed line. A
+similar pattern is observed when comparing PCPO with CPO.
+Therefore, P3O and PCPO are relatively conservative, *i.e.,* lower costs
+but lower rewards.
+
+
+
+
+**Figure 1:** PPOLag, P3O, CPO, and PCPO trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data cover 5 random seeds, and data points more than 3 standard deviations from the mean are filtered out. The black dashed line in the graph represents the preset `cost_limit`.
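+
+The 3-standard-deviation filtering mentioned in the captions can be reproduced with a few lines of NumPy. The snippet below is only a sketch of that post-processing step, not the plotting code used for these figures, and the helper name is ours:
+
```python
+# Hypothetical helper illustrating the 3-sigma filtering described in the
+# captions; episodic rewards or costs are assumed to be in a NumPy array.
+import numpy as np
+
+
+def filter_outliers(values: np.ndarray, num_std: float = 3.0) -> np.ndarray:
+    """Drop points farther than ``num_std`` standard deviations from the mean."""
+    mean, std = values.mean(), values.std()
+    return values[np.abs(values - mean) <= num_std * std]
```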
+
+
+#### Oscillation vs. Stability
+
+The oscillations in the degree of constraint violations during the
+training process can indicate the performance of SafeRL algorithms.
+These oscillations are quantified by *Extremes*, *i.e.,* the maximum
+constraint violation, and *Distributions*, *i.e.,* the frequency with
+which violations remain below a predefined `cost_limit`. As shown in
+Figure 2, PPOLag, a popular baseline in SafeRL,
+utilizes the Lagrangian multiplier for constraint handling. Despite its
+simplicity and ease of implementation, PPOLag often suffers from
+significant oscillations because appropriate initial values and learning
+rates are difficult to set. It consistently seeks higher rewards
+but often incurs larger extremes and unsafe distributions.
+Conversely, CPPOPID, which employs a PID controller for updating the
+Lagrangian multiplier, markedly reduces these extremes. CUP implements a
+two-stage projection method that constrains the distribution of violations
+below the `cost_limit`. Lastly, PPOSaute integrates state observations
+with constraints, resulting in smaller extremes and safer distributions
+of violations.
+
+
+
+
+**Figure 2:** PPOLag, CPPOPID, CUP, and PPOSaute trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data cover 5 random seeds, and data points more than 3 standard deviations from the mean are filtered out. The black dashed line in the graph represents the preset `cost_limit`.
diff --git a/docs/source/benchmark/modelbased.md b/docs/source/benchmark/modelbased.md
new file mode 100644
index 000000000..abbc1fce2
--- /dev/null
+++ b/docs/source/benchmark/modelbased.md
@@ -0,0 +1,223 @@
+# Model-based Algorithms
+
+The OmniSafe Navigation Benchmark for model-based algorithms evaluates the effectiveness of OmniSafe's model-based algorithms across two different environments from the [Safety-Gymnasium](https://github.com/PKU-Alignment/safety-gymnasium) task suite. For each supported algorithm and environment, we offer the following:
+
+- Default hyperparameters used for the benchmark and scripts that enable result replication.
+- Graphs and raw data that can be utilized for research purposes.
+- Detailed logs obtained during training.
+
+Supported algorithms are listed below:
+
+- **[NeurIPS 2018]** [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS)](https://arxiv.org/abs/1805.12114)
+- **[CoRL 2021]** [Learning Off-Policy with Online Planning (LOOP and SafeLOOP)](https://arxiv.org/abs/2008.10066)
+- **[AAAI 2022]** [Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning (CAP)](https://arxiv.org/abs/2112.07701)
+- **[ICML 2022 Workshop]** [Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method (RCE)](https://arxiv.org/abs/2010.07968)
+- **[NeurIPS 2018]** [Constrained Cross-Entropy Method for Safe Reinforcement Learning (CCE)](https://proceedings.neurips.cc/paper/2018/hash/34ffeb359a192eb8174b6854643cc046-Abstract.html)
+
+## Safety-Gymnasium
+
+We highly recommend using **Safety-Gymnasium** to run the following experiments. To install it on a Linux machine, run:
+
+```bash
+pip install safety_gymnasium
+```
+
+## Run the Benchmark
+
+You can set the main function of ``examples/benchmarks/experiment_grid.py`` as:
+
+```python
+if __name__ == '__main__':
+ eg = ExperimentGrid(exp_name='Model-Based-Benchmarks')
+
+ # set up the algorithms.
+ model_based_base_policy = ['LOOP', 'PETS']
+ model_based_safe_policy = ['SafeLOOP', 'CCEPETS', 'CAPPETS', 'RCEPETS']
+ eg.add('algo', model_based_base_policy + model_based_safe_policy)
+
+ # you can use wandb to monitor the experiment.
+ eg.add('logger_cfgs:use_wandb', [False])
+ # you can use tensorboard to monitor the experiment.
+ eg.add('logger_cfgs:use_tensorboard', [True])
+ eg.add('train_cfgs:total_steps', [1000000])
+
+ # set up the environment.
+ eg.add('env_id', [
+ 'SafetyPointGoal1-v0-modelbased',
+ 'SafetyCarGoal1-v0-modelbased',
+ ])
+ eg.add('seed', [0, 5, 10, 15, 20])
+
+    # the total number of experiments must be divisible by num_pool;
+    # users should choose this value according to their machine's resources
+ eg.run(train, num_pool=5)
+```
+
+After that, you can run the following command to run the benchmark:
+
+```bash
+cd examples/benchmarks
+python run_experiment_grid.py
+```
+
+You can then set the path to the results produced by ``examples/benchmarks/experiment_grid.py``, for example:
+
+```python
+path ='omnisafe/examples/benchmarks/exp-x/Model-Based-Benchmarks'
+```
+
+You can also plot the results by running the following command:
+
+```bash
+cd examples
+python analyze_experiment_results.py
+```
+
+**For detailed usage of OmniSafe's statistics tool, please refer to [this tutorial](https://omnisafe.readthedocs.io/en/latest/common/stastics_tool.html).**
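+
+As a rough illustration of that workflow, the analysis script loads the experiment-grid output directory and draws reward and cost curves grouped by a chosen parameter. The snippet below is a sketch only; the path is a placeholder, and the exact `StatisticsTools` arguments should be checked against the tutorial linked above.
+
```python
+# Sketch of driving OmniSafe's statistics tool on the grid results; verify
+# the exact interface against the statistics-tool tutorial.
+from omnisafe.common.statistics_tools import StatisticsTools
+
+if __name__ == '__main__':
+    st = StatisticsTools()
+    st.load_source('./exp-x/Model-Based-Benchmarks')  # placeholder path
+    st.draw_graph(parameter='algo', values=None, compare_num=2, cost_limit=None)
```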
+
+## OmniSafe Benchmark
+
+To demonstrate the high reliability of the algorithms implemented, OmniSafe offers performance insights within the Safety-Gymnasium environment. Note that all data were collected under the constraint of `cost_limit=1.00`. The results are presented in Table 1 and Figure 1.
+
+### Performance Table
+
+
+
+
+| Environment | PETS Reward | PETS Cost | LOOP Reward | LOOP Cost | SafeLOOP Reward | SafeLOOP Cost |
+|---|---|---|---|---|---|---|
+| SafetyCarGoal1-v0 | 33.07 ± 1.33 | 61.20 ± 7.23 | 25.41 ± 1.23 | 62.64 ± 8.34 | 22.09 ± 0.30 | 0.16 ± 0.15 |
+| SafetyPointGoal1-v0 | 27.66 ± 0.07 | 49.16 ± 2.69 | 25.08 ± 1.47 | 55.23 ± 2.64 | 22.94 ± 0.72 | 0.04 ± 0.07 |
+
+| Environment | CCEPETS Reward | CCEPETS Cost | RCEPETS Reward | RCEPETS Cost | CAPPETS Reward | CAPPETS Cost |
+|---|---|---|---|---|---|---|
+| SafetyCarGoal1-v0 | 27.60 ± 1.21 | 1.03 ± 0.29 | 29.08 ± 1.63 | 1.02 ± 0.88 | 23.33 ± 6.34 | 0.48 ± 0.17 |
+| SafetyPointGoal1-v0 | 24.98 ± 0.05 | 1.87 ± 1.27 | 25.39 ± 0.28 | 2.46 ± 0.58 | 9.45 ± 8.62 | 0.64 ± 0.77 |
+
+**Table 1:** The performance of OmniSafe model-based algorithms, encompassing both reward and cost, assessed within the Safety-Gymnasium environments. Note that all model-based algorithms were evaluated after 1e6 training steps.
+
+*(Figure panels: training curves for SafetyCarGoal1-v0 and SafetyPointGoal1-v0.)*
+
+**Figure 1:** Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms and safe learning algorithms mentioned in Table 1.
+| Environment | DDPG OmniSafe (Ours) | DDPG Tianshou | DDPG Stable-Baselines3 | TD3 OmniSafe (Ours) | TD3 Tianshou | TD3 Stable-Baselines3 | SAC OmniSafe (Ours) | SAC Tianshou | SAC Stable-Baselines3 |
+|---|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 308.60 ± 318.60 | 2654.58 ± 1738.21 | 5246.86 ± 580.50 | 5379.55 ± 224.69 | 3079.45 ± 1456.81 | 5456.31 ± 156.04 | 6012.30 ± 102.64 | 2404.50 ± 1152.65 |
+| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 12493.55 ± 437.54 | 7796.63 ± 3541.64 | 11246.12 ± 488.62 | 10246.77 ± 908.39 | 8631.27 ± 2869.15 | 11488.86 ± 513.09 | 12083.89 ± 564.51 | 7767.74 ± 3159.07 |
+| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 2018.97 ± 1045.20 | 2214.06 ± 1219.57 | 3404.41 ± 82.57 | 2682.53 ± 1004.84 | 2542.67 ± 1253.33 | 3597.70 ± 32.23 | 3546.59 ± 76.00 | 2158.54 ± 1343.24 |
+| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 124.96 ± 61.68 | 2276.92 ± 2299.68 | 5798.01 ± 160.72 | 3838.06 ± 1832.90 | 3511.06 ± 2214.12 | 6039.77 ± 167.82 | 5424.55 ± 118.52 | 2713.60 ± 2256.89 |
+| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 138.98 ± 8.60 | 210.40 ± 148.01 | 98.39 ± 32.28 | 94.43 ± 9.63 | 247.09 ± 131.69 | 46.44 ± 1.23 | 44.34 ± 2.01 | 247.33 ± 122.02 |
+| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 543.23 ± 316.10 | 3917.46 ± 1077.38 | 3034.83 ± 1374.72 | 4267.05 ± 678.65 | 4087.94 ± 755.10 | 4419.29 ± 232.06 | 4619.34 ± 274.43 | 3906.78 ± 795.48 |
+
+**Table 1:** The performance of OmniSafe, which was evaluated in relation to published baselines within the Safety-Gymnasium environments. Experimental outcomes, comprising mean and standard deviation, were derived from 10 assessment iterations encompassing multiple random seeds. A noteworthy distinction lies in the fact that Stable-Baselines3 employs distinct parameters tailored to each environment, while OmniSafe maintains a consistent parameter set across all environments.
+| Environment | DDPG Reward | DDPG Cost | TD3 Reward | TD3 Cost | SAC Reward | SAC Cost |
+|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 860.86 ± 198.03 | 234.80 ± 40.63 | 5246.86 ± 580.50 | 912.90 ± 93.73 | 5456.31 ± 156.04 | 943.10 ± 47.51 |
+| SafetyHalfCheetahVelocity-v1 | 11377.10 ± 75.29 | 980.93 ± 1.05 | 11246.12 ± 488.62 | 981.27 ± 0.31 | 11488.86 ± 513.09 | 981.93 ± 0.33 |
+| SafetyHopperVelocity-v1 | 1462.56 ± 591.14 | 429.17 ± 220.05 | 3404.41 ± 82.57 | 973.80 ± 4.92 | 3537.70 ± 32.23 | 975.23 ± 2.39 |
+| SafetyHumanoidVelocity-v1 | 1537.39 ± 335.62 | 48.79 ± 13.06 | 5798.01 ± 160.72 | 255.43 ± 437.13 | 6039.77 ± 167.82 | 41.42 ± 49.78 |
+| SafetySwimmerVelocity-v1 | 139.39 ± 11.74 | 200.53 ± 43.28 | 98.39 ± 32.28 | 115.27 ± 44.90 | 46.44 ± 1.23 | 40.97 ± 0.47 |
+| SafetyWalker2dVelocity-v1 | 1911.70 ± 395.97 | 318.10 ± 71.03 | 3034.83 ± 1374.72 | 606.47 ± 337.33 | 4419.29 ± 232.06 | 877.70 ± 8.95 |
+| SafetyCarCircle1-v0 | 44.64 ± 2.15 | 371.93 ± 38.75 | 44.57 ± 2.71 | 383.37 ± 62.03 | 43.46 ± 4.39 | 406.87 ± 78.78 |
+| SafetyCarGoal1-v0 | 36.99 ± 1.66 | 57.13 ± 38.40 | 36.26 ± 2.35 | 69.70 ± 52.18 | 35.71 ± 2.24 | 54.73 ± 46.74 |
+| SafetyPointCircle1-v0 | 113.67 ± 1.33 | 421.53 ± 142.66 | 115.15 ± 2.24 | 391.07 ± 38.34 | 115.06 ± 2.04 | 403.43 ± 44.78 |
+| SafetyPointGoal1-v0 | 25.55 ± 2.62 | 41.60 ± 37.17 | 27.28 ± 1.21 | 51.43 ± 33.05 | 27.04 ± 1.49 | 67.57 ± 32.13 |
+
+| Environment | DDPGLag Reward | DDPGLag Cost | TD3Lag Reward | TD3Lag Cost | SACLag Reward | SACLag Cost |
+|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 1271.48 ± 581.71 | 33.27 ± 13.34 | 1944.38 ± 759.20 | 63.27 ± 46.89 | 1897.32 ± 1213.74 | 5.73 ± 7.83 |
+| SafetyHalfCheetahVelocity-v1 | 2743.06 ± 21.77 | 0.33 ± 0.12 | 2741.08 ± 49.13 | 10.47 ± 14.45 | 2833.72 ± 3.62 | 0.00 ± 0.00 |
+| SafetyHopperVelocity-v1 | 1093.25 ± 81.55 | 15.00 ± 21.21 | 928.79 ± 389.48 | 40.67 ± 30.99 | 963.49 ± 291.64 | 20.23 ± 28.47 |
+| SafetyHumanoidVelocity-v1 | 2059.96 ± 485.68 | 19.71 ± 4.05 | 5751.99 ± 157.28 | 10.71 ± 23.60 | 5940.04 ± 121.93 | 17.59 ± 6.24 |
+| SafetySwimmerVelocity-v1 | 13.18 ± 20.31 | 28.27 ± 32.27 | 15.58 ± 16.97 | 13.27 ± 17.64 | 11.03 ± 11.17 | 22.70 ± 32.10 |
+| SafetyWalker2dVelocity-v1 | 2238.92 ± 400.67 | 33.43 ± 20.08 | 2996.21 ± 74.40 | 22.50 ± 16.97 | 2676.47 ± 300.43 | 30.67 ± 32.30 |
+| SafetyCarCircle1-v0 | 33.29 ± 6.55 | 20.67 ± 28.48 | 34.38 ± 1.55 | 2.25 ± 3.90 | 31.42 ± 11.67 | 22.33 ± 26.16 |
+| SafetyCarGoal1-v0 | 22.80 ± 8.75 | 17.33 ± 21.40 | 7.31 ± 5.34 | 33.83 ± 31.03 | 10.83 ± 11.29 | 22.67 ± 28.91 |
+| SafetyPointCircle1-v0 | 70.71 ± 13.61 | 22.00 ± 32.80 | 83.07 ± 3.49 | 7.83 ± 15.79 | 83.68 ± 3.32 | 12.83 ± 19.53 |
+| SafetyPointGoal1-v0 | 17.17 ± 10.03 | 20.33 ± 31.59 | 25.27 ± 2.74 | 28.00 ± 15.75 | 21.45 ± 6.97 | 19.17 ± 9.72 |
+
+| Environment | DDPGPID Reward | DDPGPID Cost | TD3PID Reward | TD3PID Cost | SACPID Reward | SACPID Cost |
+|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 2078.27 ± 704.77 | 18.20 ± 7.21 | 2410.46 ± 217.00 | 44.50 ± 38.39 | 1940.55 ± 482.41 | 13.73 ± 7.24 |
+| SafetyHalfCheetahVelocity-v1 | 2737.61 ± 45.93 | 36.10 ± 11.03 | 2695.64 ± 29.42 | 35.93 ± 14.03 | 2689.01 ± 15.46 | 21.43 ± 5.49 |
+| SafetyHopperVelocity-v1 | 1034.42 ± 350.59 | 29.53 ± 34.54 | 1225.97 ± 224.71 | 46.87 ± 65.28 | 812.80 ± 381.86 | 92.23 ± 77.64 |
+| SafetyHumanoidVelocity-v1 | 1082.36 ± 486.48 | 15.00 ± 19.51 | 6179.38 ± 105.70 | 5.60 ± 6.23 | 6107.36 ± 113.24 | 6.20 ± 10.14 |
+| SafetySwimmerVelocity-v1 | 23.99 ± 7.76 | 30.70 ± 21.81 | 28.62 ± 8.48 | 22.47 ± 7.69 | 7.50 ± 10.42 | 7.77 ± 8.48 |
+| SafetyWalker2dVelocity-v1 | 1378.75 ± 896.73 | 14.77 ± 13.02 | 2769.64 ± 67.23 | 6.53 ± 8.86 | 1251.87 ± 721.54 | 41.23 ± 73.33 |
+| SafetyCarCircle1-v0 | 26.89 ± 11.18 | 31.83 ± 33.59 | 34.77 ± 3.24 | 47.00 ± 39.53 | 34.41 ± 7.19 | 5.00 ± 11.18 |
+| SafetyCarGoal1-v0 | 19.35 ± 14.63 | 17.50 ± 21.31 | 27.28 ± 4.50 | 9.50 ± 12.15 | 16.21 ± 12.65 | 6.67 ± 14.91 |
+| SafetyPointCircle1-v0 | 71.63 ± 8.39 | 0.00 ± 0.00 | 70.95 ± 6.00 | 0.00 ± 0.00 | 75.15 ± 6.65 | 4.50 ± 4.65 |
+| SafetyPointGoal1-v0 | 19.85 ± 5.32 | 22.67 ± 13.73 | 18.76 ± 7.87 | 12.17 ± 9.39 | 15.87 ± 6.73 | 27.50 ± 15.25 |
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0.)*
+
+**Figure 1:** Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0.)*
+
+**Figure 2:** Training curves in Safety-Gymnasium environments, covering Lagrangian reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1, SafetyCarCircle1-v0, SafetyCarGoal1-v0, SafetyPointCircle1-v0, SafetyPointGoal1-v0.)*
+
+**Figure 3:** Training curves in Safety-Gymnasium environments, covering PID-Lagrangian reinforcement learning algorithms mentioned in Table 1 and safe reinforcement learning algorithms.
+| Environment | VAE-BC Reward | VAE-BC Cost | C-CRR Reward | C-CRR Cost | BCQLag Reward | BCQLag Cost | COptiDICE Reward | COptiDICE Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyPointCircle1-v0(beta=0.25) | 43.66 ± 0.90 | 109.86 ± 13.24 | 45.48 ± 0.87 | 127.30 ± 12.60 | 43.31 ± 0.76 | 113.39 ± 12.81 | 40.68 ± 0.93 | 67.11 ± 13.15 |
+| SafetyPointCircle1-v0(beta=0.50) | 42.84 ± 1.36 | 62.34 ± 14.84 | 45.99 ± 1.36 | 97.20 ± 13.57 | 44.68 ± 1.97 | 95.06 ± 33.07 | 39.55 ± 1.39 | 53.87 ± 13.27 |
+| SafetyPointCircle1-v0(beta=0.75) | 40.23 ± 0.75 | 41.25 ± 10.12 | 40.66 ± 0.88 | 49.90 ± 10.81 | 42.94 ± 1.04 | 85.37 ± 23.41 | 40.98 ± 0.89 | 70.40 ± 12.14 |
+| SafetyCarCircle1-v0(beta=0.25) | 19.62 ± 0.28 | 150.54 ± 7.63 | 18.53 ± 0.45 | 122.63 ± 13.14 | 18.88 ± 0.61 | 125.44 ± 15.68 | 17.25 ± 0.37 | 90.86 ± 10.75 |
+| SafetyCarCircle1-v0(beta=0.50) | 18.69 ± 0.33 | 125.97 ± 10.36 | 17.24 ± 0.43 | 89.47 ± 11.55 | 18.14 ± 0.96 | 108.07 ± 20.70 | 16.38 ± 0.43 | 70.54 ± 12.36 |
+| SafetyCarCircle1-v0(beta=0.75) | 17.31 ± 0.33 | 85.53 ± 11.33 | 15.74 ± 0.42 | 48.38 ± 10.31 | 17.10 ± 0.84 | 77.54 ± 14.07 | 15.58 ± 0.37 | 49.42 ± 8.70 |
+
+**Table 1:** The performance of OmniSafe offline algorithms, evaluated after 1e6 training steps under a cost limit of 25.00. We introduce a quantization parameter beta, defined from the perspective of safe trajectories, to control the trajectory distribution of the mixed dataset. This parameter indicates the difficulty of the dataset to a certain extent: the smaller beta is, the fewer safe trajectories the dataset contains and the less safety-relevant information is available for the algorithm to learn.
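+
+To make the role of beta concrete, the following sketch shows one way a mixed dataset could be composed from pools of safe and unsafe trajectories. It is an illustration of the idea only, not the data-collection code used for these benchmarks, and the function name is ours:
+
```python
+# Hypothetical illustration: beta controls the fraction of safe trajectories
+# in the mixed offline dataset.
+import random
+
+
+def mix_dataset(safe_trajs: list, unsafe_trajs: list, beta: float, size: int) -> list:
+    """Sample ``size`` trajectories, a ``beta`` fraction of which are safe."""
+    num_safe = int(beta * size)
+    mixed = random.sample(safe_trajs, num_safe) + random.sample(unsafe_trajs, size - num_safe)
+    random.shuffle(mixed)
+    return mixed
```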
+| Environment | Policy Gradient OmniSafe (Ours) | Policy Gradient Tianshou | Policy Gradient Stable-Baselines3 | PPO OmniSafe (Ours) | PPO Tianshou | PPO Stable-Baselines3 |
+|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 2769.45 ± 550.71 | 145.33 ± 127.55 | - | 4295.96 ± 658.2 | 2607.48 ± 1415.78 | 1780.61 ± 780.65 |
+| SafetyHalfCheetahVelocity-v1 | 2625.44 ± 1079.04 | 707.56 ± 158.59 | - | 3507.47 ± 1563.69 | 6299.27 ± 1692.38 | 5074.85 ± 2225.47 |
+| SafetyHopperVelocity-v1 | 1884.38 ± 825.13 | 343.88 ± 51.85 | - | 2679.98 ± 921.96 | 1834.7 ± 862.06 | 838.96 ± 351.10 |
+| SafetyHumanoidVelocity-v1 | 647.52 ± 154.82 | 438.97 ± 123.68 | - | 1106.09 ± 607.6 | 677.43 ± 189.96 | 762.73 ± 170.22 |
+| SafetySwimmerVelocity-v1 | 47.31 ± 16.19 | 27.12 ± 7.47 | - | 113.28 ± 20.22 | 37.93 ± 8.68 | 273.86 ± 87.76 |
+| SafetyWalker2dVelocity-v1 | 1665.00 ± 930.18 | 373.63 ± 129.2 | - | 3806.39 ± 1547.48 | 3748.26 ± 1832.83 | 3304.35 ± 706.13 |
+
+| Environment | NaturalPG OmniSafe (Ours) | NaturalPG Tianshou | NaturalPG Stable-Baselines3 | TRPO OmniSafe (Ours) | TRPO Tianshou | TRPO Stable-Baselines3 |
+|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 3793.70 ± 583.66 | 2062.45 ± 876.43 | - | 4362.43 ± 640.54 | 2521.36 ± 1442.10 | 3233.58 ± 1437.16 |
+| SafetyHalfCheetahVelocity-v1 | 4096.77 ± 1223.70 | 3430.9 ± 239.38 | - | 3313.31 ± 1048.78 | 4255.73 ± 1053.82 | 7185.06 ± 3650.82 |
+| SafetyHopperVelocity-v1 | 2590.54 ± 631.05 | 993.63 ± 489.42 | - | 2698.19 ± 568.80 | 1346.94 ± 984.09 | 2467.10 ± 1160.25 |
+| SafetyHumanoidVelocity-v1 | 3838.67 ± 1654.79 | 810.76 ± 270.69 | - | 1461.51 ± 602.23 | 749.42 ± 149.81 | 2828.18 ± 2256.38 |
+| SafetySwimmerVelocity-v1 | 116.33 ± 5.97 | 29.75 ± 12.00 | - | 105.08 ± 31.00 | 37.21 ± 4.04 | 258.62 ± 124.91 |
+| SafetyWalker2dVelocity-v1 | 4054.62 ± 1266.76 | 3372.59 ± 1049.14 | - | 4099.97 ± 409.05 | 3372.59 ± 961.74 | 4227.91 ± 760.93 |
+
+**Table 1:** The performance of OmniSafe, which was evaluated in relation to published baselines within the Safety-Gymnasium MuJoCo Velocity environments. Experimental outcomes, comprising mean and standard deviation, were derived from 10 assessment iterations encompassing multiple random seeds.
+| Environment | Policy Gradient Reward | Policy Gradient Cost | Natural PG Reward | Natural PG Cost | TRPO Reward | TRPO Cost | PPO Reward | PPO Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 5292.29 ± 913.44 | 919.42 ± 158.61 | 5547.20 ± 807.89 | 895.56 ± 77.13 | 6026.79 ± 314.98 | 933.46 ± 41.28 | 5977.73 ± 885.65 | 958.13 ± 134.5 |
+| SafetyHalfCheetahVelocity-v1 | 5188.46 ± 1202.76 | 896.55 ± 184.7 | 5878.28 ± 2012.24 | 847.74 ± 249.02 | 6490.76 ± 2507.18 | 734.26 ± 321.88 | 6921.83 ± 1721.79 | 919.2 ± 173.08 |
+| SafetyHopperVelocity-v1 | 3218.17 ± 672.88 | 881.76 ± 198.46 | 2613.95 ± 866.13 | 587.78 ± 220.97 | 2047.35 ± 447.33 | 448.12 ± 103.87 | 2337.11 ± 942.06 | 550.02 ± 237.70 |
+| SafetyHumanoidVelocity-v1 | 7001.78 ± 419.67 | 834.11 ± 212.43 | 8055.20 ± 641.67 | 946.40 ± 9.11 | 8681.24 ± 3934.08 | 718.42 ± 323.30 | 9115.93 ± 596.88 | 960.44 ± 7.06 |
+| SafetySwimmerVelocity-v1 | 77.05 ± 33.44 | 107.1 ± 60.58 | 120.19 ± 7.74 | 161.78 ± 17.51 | 124.91 ± 6.13 | 176.56 ± 15.95 | 119.77 ± 13.8 | 165.27 ± 20.15 |
+| SafetyWalker2dVelocity-v1 | 4832.34 ± 685.76 | 866.59 ± 93.47 | 5347.35 ± 436.86 | 914.74 ± 32.61 | 6096.67 ± 723.06 | 914.46 ± 27.85 | 6239.52 ± 879.99 | 902.68 ± 100.93 |
+| SafetyCarGoal1-v0 | 35.86 ± 1.97 | 57.46 ± 48.34 | 36.07 ± 1.25 | 58.06 ± 10.03 | 36.60 ± 0.22 | 55.58 ± 12.68 | 33.41 ± 2.89 | 58.06 ± 42.06 |
+| SafetyCarButton1-v0 | 19.76 ± 10.15 | 353.26 ± 177.08 | 22.16 ± 4.48 | 333.98 ± 67.49 | 21.98 ± 2.06 | 343.22 ± 24.60 | 17.51 ± 9.46 | 373.98 ± 156.64 |
+| SafetyCarGoal2-v0 | 29.43 ± 4.62 | 179.2 ± 84.86 | 30.26 ± 0.38 | 209.62 ± 29.97 | 32.17 ± 1.24 | 190.74 ± 21.05 | 29.88 ± 4.55 | 194.16 ± 106.2 |
+| SafetyCarButton2-v0 | 18.06 ± 10.53 | 349.82 ± 187.07 | 20.85 ± 3.14 | 313.88 ± 58.20 | 20.51 ± 3.34 | 316.42 ± 35.28 | 21.35 ± 8.22 | 312.64 ± 138.4 |
+| SafetyPointGoal1-v0 | 26.19 ± 3.44 | 201.22 ± 80.4 | 26.92 ± 0.58 | 57.92 ± 9.97 | 27.20 ± 0.44 | 45.88 ± 11.27 | 25.44 ± 5.43 | 55.72 ± 35.55 |
+| SafetyPointButton1-v0 | 29.98 ± 5.24 | 141.74 ± 75.13 | 31.95 ± 1.53 | 123.98 ± 32.05 | 30.61 ± 0.40 | 134.38 ± 22.06 | 27.03 ± 6.14 | 152.48 ± 80.39 |
+| SafetyPointGoal2-v0 | 25.18 ± 3.62 | 204.96 ± 104.97 | 26.19 ± 0.84 | 193.60 ± 18.54 | 25.61 ± 0.89 | 202.26 ± 15.15 | 25.49 ± 2.46 | 159.28 ± 87.13 |
+| SafetyPointButton2-v0 | 26.88 ± 4.38 | 153.88 ± 65.54 | 28.45 ± 1.49 | 160.40 ± 20.08 | 28.78 ± 2.05 | 170.30 ± 30.59 | 25.91 ± 6.15 | 166.6 ± 111.21 |
+
+| Environment | RCPO Reward | RCPO Cost | TRPOLag Reward | TRPOLag Cost | PPOLag Reward | PPOLag Cost | P3O Reward | P3O Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 3139.52 ± 110.34 | 12.34 ± 3.11 | 3041.89 ± 180.77 | 19.52 ± 20.21 | 3261.87 ± 80.00 | 12.05 ± 6.57 | 2636.62 ± 181.09 | 20.69 ± 10.23 |
+| SafetyHalfCheetahVelocity-v1 | 2440.97 ± 451.88 | 9.02 ± 9.34 | 2884.68 ± 77.47 | 9.04 ± 11.83 | 2946.15 ± 306.35 | 3.44 ± 4.77 | 2117.84 ± 313.55 | 27.6 ± 8.36 |
+| SafetyHopperVelocity-v1 | 1428.58 ± 199.87 | 11.12 ± 12.66 | 1391.79 ± 269.07 | 11.22 ± 9.97 | 961.92 ± 752.87 | 13.96 ± 19.33 | 1231.52 ± 465.35 | 16.33 ± 11.38 |
+| SafetyHumanoidVelocity-v1 | 6286.51 ± 151.03 | 19.47 ± 7.74 | 6551.30 ± 58.42 | 59.56 ± 117.37 | 6624.46 ± 25.9 | 5.87 ± 9.46 | 6342.47 ± 82.45 | 126.4 ± 193.76 |
+| SafetySwimmerVelocity-v1 | 61.29 ± 18.12 | 22.60 ± 1.16 | 81.18 ± 16.33 | 22.24 ± 3.91 | 64.74 ± 17.67 | 28.02 ± 4.09 | 38.02 ± 34.18 | 18.4 ± 12.13 |
+| SafetyWalker2dVelocity-v1 | 3064.43 ± 218.83 | 3.02 ± 1.48 | 3207.10 ± 7.88 | 14.98 ± 9.27 | 2982.27 ± 681.55 | 13.49 ± 14.55 | 2713.57 ± 313.2 | 20.51 ± 14.09 |
+| SafetyCarGoal1-v0 | 18.71 ± 2.72 | 23.10 ± 12.57 | 27.04 ± 1.82 | 26.80 ± 5.64 | 13.27 ± 9.26 | 21.72 ± 32.06 | -1.10 ± 6.851 | 50.58 ± 99.24 |
+| SafetyCarButton1-v0 | -2.04 ± 2.98 | 43.48 ± 31.52 | -0.38 ± 0.85 | 37.54 ± 31.72 | 0.33 ± 1.96 | 55.5 ± 89.64 | -2.06 ± 7.2 | 43.78 ± 98.01 |
+| SafetyCarGoal2-v0 | 2.30 ± 1.76 | 22.90 ± 16.22 | 3.65 ± 1.09 | 39.98 ± 20.29 | 1.58 ± 2.49 | 13.82 ± 24.62 | -0.07 ± 1.62 | 43.86 ± 99.58 |
+| SafetyCarButton2-v0 | -1.35 ± 2.41 | 42.02 ± 31.77 | -1.68 ± 2.55 | 20.36 ± 13.67 | 0.76 ± 2.52 | 47.86 ± 103.27 | 0.11 ± 0.72 | 85.94 ± 122.01 |
+| SafetyPointGoal1-v0 | 15.27 ± 4.05 | 30.56 ± 19.15 | 18.51 ± 3.83 | 22.98 ± 8.45 | 12.96 ± 6.95 | 25.80 ± 34.99 | 1.6 ± 3.01 | 31.1 ± 80.03 |
+| SafetyPointButton1-v0 | 3.65 ± 4.47 | 26.30 ± 9.22 | 6.93 ± 1.84 | 31.16 ± 20.58 | 4.60 ± 4.73 | 20.8 ± 35.78 | -0.34 ± 1.53 | 52.86 ± 85.62 |
+| SafetyPointGoal2-v0 | 2.17 ± 1.46 | 33.82 ± 21.93 | 4.64 ± 1.43 | 26.00 ± 4.70 | 1.98 ± 3.86 | 41.20 ± 61.03 | 0.34 ± 2.2 | 65.84 ± 195.76 |
+| SafetyPointButton2-v0 | 7.18 ± 1.93 | 45.02 ± 25.28 | 5.43 ± 3.44 | 25.10 ± 8.98 | 0.93 ± 3.69 | 33.72 ± 58.75 | 0.33 ± 2.44 | 28.5 ± 49.79 |
+
+| Environment | CUP Reward | CUP Cost | PCPO Reward | PCPO Cost | FOCOPS Reward | FOCOPS Cost | CPO Reward | CPO Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 3215.79 ± 346.68 | 18.25 ± 17.12 | 2257.07 ± 47.97 | 10.44 ± 5.22 | 3184.48 ± 305.59 | 14.75 ± 6.36 | 3098.54 ± 78.90 | 14.12 ± 3.41 |
+| SafetyHalfCheetahVelocity-v1 | 2850.6 ± 244.65 | 4.27 ± 4.46 | 1677.93 ± 217.31 | 19.06 ± 15.26 | 2965.2 ± 290.43 | 2.37 ± 3.5 | 2786.48 ± 173.45 | 4.70 ± 6.72 |
+| SafetyHopperVelocity-v1 | 1716.08 ± 5.93 | 7.48 ± 5.535 | 1551.22 ± 85.16 | 15.46 ± 9.83 | 1437.75 ± 446.87 | 10.13 ± 8.87 | 1713.71 ± 18.26 | 13.40 ± 5.82 |
+| SafetyHumanoidVelocity-v1 | 6109.94 ± 497.56 | 24.69 ± 20.54 | 5852.25 ± 78.01 | 0.24 ± 0.48 | 6489.39 ± 35.1 | 13.86 ± 39.33 | 6465.34 ± 79.87 | 0.18 ± 0.36 |
+| SafetySwimmerVelocity-v1 | 63.83 ± 46.45 | 21.95 ± 11.04 | 54.42 ± 38.65 | 17.34 ± 1.57 | 53.87 ± 17.9 | 29.75 ± 7.33 | 65.30 ± 43.25 | 18.22 ± 8.01 |
+| SafetyWalker2dVelocity-v1 | 2466.95 ± 1114.13 | 6.63 ± 8.25 | 1802.86 ± 714.04 | 18.82 ± 5.57 | 3117.05 ± 53.60 | 8.78 ± 12.38 | 2074.76 ± 962.45 | 21.90 ± 9.41 |
+| SafetyCarGoal1-v0 | 6.14 ± 6.97 | 36.12 ± 89.56 | 21.56 ± 2.87 | 38.42 ± 8.36 | 15.23 ± 10.76 | 31.66 ± 93.51 | 25.52 ± 2.65 | 43.32 ± 14.35 |
+| SafetyCarButton1-v0 | 1.49 ± 2.84 | 103.24 ± 123.12 | 0.36 ± 0.85 | 40.52 ± 21.25 | 0.21 ± 2.27 | 31.78 ± 47.03 | 0.82 ± 1.60 | 37.86 ± 27.41 |
+| SafetyCarGoal2-v0 | 1.78 ± 4.03 | 95.4 ± 129.64 | 1.62 ± 0.56 | 48.12 ± 31.19 | 2.09 ± 4.33 | 31.56 ± 58.93 | 3.56 ± 0.92 | 32.66 ± 3.31 |
+| SafetyCarButton2-v0 | 1.49 ± 2.64 | 173.68 ± 163.77 | 0.66 ± 0.42 | 49.72 ± 36.50 | 1.14 ± 3.18 | 46.78 ± 57.47 | 0.17 ± 1.19 | 48.56 ± 29.34 |
+| SafetyPointGoal1-v0 | 14.42 ± 6.74 | 19.02 ± 20.08 | 18.57 ± 1.71 | 22.98 ± 6.56 | 14.97 ± 9.01 | 33.72 ± 42.24 | 20.46 ± 1.38 | 28.84 ± 7.76 |
+| SafetyPointButton1-v0 | 3.5 ± 7.07 | 39.56 ± 54.26 | 2.66 ± 1.83 | 49.40 ± 36.76 | 5.89 ± 7.66 | 38.24 ± 42.96 | 4.04 ± 4.54 | 40.00 ± 4.52 |
+| SafetyPointGoal2-v0 | 1.06 ± 2.67 | 107.3 ± 204.26 | 1.06 ± 0.69 | 51.92 ± 47.40 | 2.21 ± 4.15 | 37.92 ± 111.81 | 2.50 ± 1.25 | 40.84 ± 23.31 |
+| SafetyPointButton2-v0 | 2.88 ± 3.65 | 54.24 ± 71.07 | 1.05 ± 1.27 | 41.14 ± 12.35 | 2.43 ± 3.33 | 17.92 ± 26.1 | 5.09 ± 1.83 | 48.92 ± 17.79 |
+
+| Environment | PPOSaute Reward | PPOSaute Cost | TRPOSaute Reward | TRPOSaute Cost | PPOSimmerPID Reward | PPOSimmerPID Cost | TRPOSimmerPID Reward | TRPOSimmerPID Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 2978.74 ± 93.65 | 16.77 ± 0.92 | 2507.65 ± 63.97 | 8.036 ± 0.39 | 2944.84 ± 60.53 | 16.20 ± 0.66 | 3018.95 ± 66.44 | 16.52 ± 0.23 |
+| SafetyHalfCheetahVelocity-v1 | 2901.40 ± 25.49 | 16.20 ± 0.60 | 2521.80 ± 477.29 | 7.61 ± 0.39 | 2922.17 ± 24.84 | 16.14 ± 0.14 | 2737.79 ± 37.53 | 16.44 ± 0.21 |
+| SafetyHopperVelocity-v1 | 1650.91 ± 152.65 | 17.87 ± 1.33 | 1368.28 ± 576.08 | 10.38 ± 4.38 | 1699.94 ± 24.25 | 17.04 ± 0.41 | 1608.41 ± 88.23 | 16.30 ± 0.30 |
+| SafetyHumanoidVelocity-v1 | 6401.00 ± 32.23 | 17.10 ± 2.41 | 5759.44 ± 75.73 | 15.84 ± 1.42 | 6401.85 ± 57.62 | 11.06 ± 5.35 | 6411.32 ± 44.26 | 13.04 ± 2.68 |
+| SafetySwimmerVelocity-v1 | 35.61 ± 4.37 | 3.44 ± 1.35 | 34.72 ± 1.37 | 10.19 ± 2.32 | 77.52 ± 40.20 | 0.98 ± 1.91 | 51.39 ± 40.09 | 0.00 ± 0.00 |
+| SafetyWalker2dVelocity-v1 | 2410.89 ± 241.22 | 18.88 ± 2.38 | 2548.82 ± 891.65 | 13.21 ± 6.09 | 3187.56 ± 32.66 | 17.10 ± 0.49 | 3156.99 ± 30.93 | 17.14 ± 0.54 |
+| SafetyCarGoal1-v0 | 7.12 ± 5.41 | 21.68 ± 29.11 | 16.67 ± 10.57 | 23.58 ± 26.39 | 8.45 ± 7.16 | 18.98 ± 25.63 | 15.08 ± 13.41 | 23.22 ± 19.80 |
+| SafetyCarButton1-v0 | -1.72 ± 0.89 | 51.88 ± 28.18 | -2.03 ± 0.40 | 6.24 ± 6.14 | -0.57 ± 0.63 | 49.14 ± 37.77 | -1.24 ± 0.47 | 17.26 ± 16.13 |
+| SafetyCarGoal2-v0 | 0.90 ± 1.20 | 19.98 ± 10.12 | 1.76 ± 5.20 | 31.50 ± 45.50 | 1.02 ± 1.41 | 27.32 ± 60.12 | 0.93 ± 2.21 | 26.66 ± 60.07 |
+| SafetyCarButton2-v0 | -1.89 ± 1.86 | 47.33 ± 28.90 | -2.60 ± 0.40 | 74.57 ± 84.95 | -1.31 ± 0.93 | 52.33 ± 19.96 | -0.99 ± 0.63 | 20.40 ± 12.77 |
+| SafetyPointGoal1-v0 | 7.06 ± 5.85 | 20.04 ± 21.91 | 16.18 ± 9.55 | 29.94 ± 26.68 | 8.30 ± 6.03 | 25.32 ± 31.91 | 11.64 ± 8.46 | 30.00 ± 27.67 |
+| SafetyPointButton1-v0 | -1.47 ± 0.98 | 22.60 ± 13.91 | -3.13 ± 3.51 | 9.04 ± 3.94 | -1.97 ± 1.41 | 12.80 ± 7.84 | -1.36 ± 0.37 | 2.14 ± 1.73 |
+| SafetyPointGoal2-v0 | 0.84 ± 2.93 | 14.06 ± 30.21 | 1.64 ± 4.02 | 19.00 ± 34.69 | 0.56 ± 2.52 | 12.36 ± 43.39 | 1.55 ± 4.68 | 14.90 ± 27.82 |
+| SafetyPointButton2-v0 | -1.38 ± 0.11 | 12.00 ± 8.60 | -2.56 ± 0.67 | 17.27 ± 10.01 | -1.70 ± 0.29 | 7.90 ± 3.30 | -1.66 ± 0.99 | 6.70 ± 4.74 |
+
+| Environment | CPPOPID Reward | CPPOPID Cost | TRPOPID Reward | TRPOPID Cost | PPOEarlyTerminated Reward | PPOEarlyTerminated Cost | TRPOEarlyTerminated Reward | TRPOEarlyTerminated Cost |
+|---|---|---|---|---|---|---|---|---|
+| SafetyAntVelocity-v1 | 3213.36 ± 146.78 | 14.30 ± 7.39 | 3052.94 ± 139.67 | 15.22 ± 3.68 | 2801.53 ± 19.66 | 0.23 ± 0.09 | 3052.63 ± 58.41 | 0.40 ± 0.23 |
+| SafetyHalfCheetahVelocity-v1 | 2837.89 ± 398.52 | 8.06 ± 9.62 | 2796.75 ± 190.84 | 11.16 ± 9.80 | 2447.25 ± 346.84 | 3.47 ± 4.90 | 2555.70 ± 368.17 | 0.06 ± 0.08 |
+| SafetyHopperVelocity-v1 | 1713.29 ± 10.21 | 8.96 ± 4.28 | 1178.59 ± 646.71 | 18.76 ± 8.93 | 1643.39 ± 2.58 | 0.77 ± 0.26 | 1646.47 ± 49.95 | 0.42 ± 0.84 |
+| SafetyHumanoidVelocity-v1 | 6579.26 ± 55.70 | 3.76 ± 3.61 | 6407.95 ± 254.06 | 7.38 ± 11.34 | 6321.45 ± 35.73 | 0.00 ± 0.00 | 6332.14 ± 89.86 | 0.00 ± 0.00 |
+| SafetySwimmerVelocity-v1 | 91.05 ± 62.68 | 19.12 ± 8.33 | 69.75 ± 46.52 | 20.48 ± 9.13 | 33.02 ± 7.26 | 24.23 ± 0.54 | 39.24 ± 5.01 | 23.20 ± 0.48 |
+| SafetyWalker2dVelocity-v1 | 2183.43 ± 1300.69 | 14.12 ± 10.28 | 2707.75 ± 980.56 | 9.60 ± 8.94 | 2195.57 ± 1046.29 | 7.63 ± 10.44 | 2079.64 ± 1028.73 | 13.74 ± 15.94 |
+| SafetyCarGoal1-v0 | 10.60 ± 2.51 | 30.66 ± 7.53 | 25.49 ± 1.31 | 28.92 ± 7.66 | 17.92 ± 1.54 | 21.60 ± 0.83 | 22.09 ± 3.07 | 17.97 ± 1.35 |
+| SafetyCarButton1-v0 | -1.36 ± 0.68 | 14.62 ± 9.40 | -0.31 ± 0.49 | 15.24 ± 17.01 | 4.47 ± 1.12 | 25.00 ± 0.00 | 4.34 ± 0.72 | 25.00 ± 0.00 |
+| SafetyCarGoal2-v0 | 0.13 ± 1.11 | 23.50 ± 1.22 | 1.77 ± 1.20 | 17.43 ± 12.13 | 6.59 ± 0.58 | 25.00 ± 0.00 | 7.12 ± 4.06 | 23.37 ± 1.35 |
+| SafetyCarButton2-v0 | -1.59 ± 0.70 | 39.97 ± 26.91 | -2.95 ± 4.03 | 27.90 ± 6.37 | 4.86 ± 1.57 | 25.00 ± 0.00 | 5.07 ± 1.24 | 25.00 ± 0.00 |
+| SafetyPointGoal1-v0 | 8.43 ± 3.43 | 25.74 ± 7.83 | 19.24 ± 3.94 | 21.38 ± 6.96 | 16.03 ± 8.60 | 19.17 ± 9.42 | 16.31 ± 6.99 | 22.10 ± 6.13 |
+| SafetyPointButton1-v0 | 1.18 ± 1.02 | 29.42 ± 12.10 | 6.40 ± 1.43 | 27.90 ± 13.27 | 7.48 ± 8.47 | 24.27 ± 3.95 | 9.52 ± 7.86 | 25.00 ± 0.00 |
+| SafetyPointGoal2-v0 | -0.56 ± 0.06 | 48.43 ± 40.55 | 1.67 ± 1.43 | 23.50 ± 11.17 | 6.09 ± 5.03 | 25.00 ± 0.00 | 8.62 ± 7.13 | 25.00 ± 0.00 |
+| SafetyPointButton2-v0 | 0.42 ± 0.63 | 28.87 ± 11.27 | 1.00 ± 1.00 | 30.00 ± 9.50 | 6.94 ± 4.47 | 25.00 ± 0.00 | 8.35 ± 10.44 | 25.00 ± 0.00 |
+
+**Table 2:** The performance of OmniSafe on-policy algorithms, encompassing both reward and cost, assessed within the Safety-Gymnasium environments. Note that all on-policy algorithms were evaluated after 1e7 training steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 1.1:** Training curves in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 1.2:** Training curves in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 1.3:** Training curves in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 2.1:** Training curves of second order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 2.2:** Training curves of second order algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 2.3:** Training curves of second order algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 3.1:** Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 3.2:** Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarCircle1-v0, SafetyCarCircle2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointCircle1-v0, SafetyPointCircle2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 3.3:** Training curves of Saute MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 4.1:** Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 4.2:** Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 4.3:** Training curves of Simmer MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 5.1:** Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 5.2:** Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 5.3:** Training curves of PID-Lagrangian algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 6.1:** Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e6 steps.
+
+*(Figure panels: SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetyWalker2dVelocity-v1, SafetySwimmerVelocity-v1.)*
+
+**Figure 6.2:** Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Velocity environments within 1e7 steps.
+
+*(Figure panels: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0.)*
+
+**Figure 6.3:** Training curves of early terminated MDP algorithms in Safety-Gymnasium MuJoCo Navigation environments within 1e7 steps.
+| Domains | Types | Algorithms Registry |
+|---|---|---|
+| On Policy | Primal-Dual | TRPOLag; PPOLag; PDO; RCPO; TRPOPID; CPPOPID |
+| | Convex Optimization | CPO; PCPO; FOCOPS; CUP |
+| | Penalty Function | IPO; P3O |
+| | Primal | OnCRPO |
+| Off Policy | Primal-Dual | DDPGLag; TD3Lag; SACLag; DDPGPID; TD3PID; SACPID |
+| | Control Barrier Function | DDPGCBF; SACRCBF; CRABS |
+| Model-based | Online Plan | SafeLOOP; CCEPETS; RCEPETS |
+| | Pessimistic Estimate | CAPPETS |
+| Offline | Q-Learning Based | BCQLag; C-CRR |
+| | DICE Based | COptiDICE |
+| Other MDP Formulations | ET-MDP | PPOEarlyTerminated; TRPOEarlyTerminated |
+| | SauteRL | PPOSaute; TRPOSaute |
+| | SimmerRL | PPOSimmerPID; TRPOSimmerPID |
+
+**Table 1:** OmniSafe supports a variety of SafeRL algorithms. From the perspective of classic RL, OmniSafe includes on-policy, off-policy, offline, and model-based algorithms; from the perspective of the SafeRL learning paradigm, OmniSafe supports primal-dual, projection, penalty-function, primal, and other approaches.
+| Features | OmniSafe | Tianshou | Stable-Baselines3 | SafePO | RL-Safety-Algorithms | Safety-starter-agents |
+|---|---|---|---|---|---|---|
+| Algorithm Tutorial | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| API Documentation | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
+| Command Line Interface | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| Custom Environment | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| Docker Support | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| GPU Support | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
+| Ipython / Notebook | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| PEP8 Code Style | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Statistics Tools | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+| Test Coverage | 97% | 91% | 96% | 91% | - | - |
+| Type Hints | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Vectorized Environments | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
+| Video Examples | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
+
+*(Figure panels: (a) SafetyPointGoal1-v0, (b) SafetyPointButton1-v0, (c) SafetyCarGoal1-v0, (d) SafetyCarButton1-v0.)*
+
+**Figure 4:** An example of OmniSafe's WandB video reports, showing PPO and PPOLag in the `SafetyPointGoal1-v0`, `SafetyPointButton1-v0`, `SafetyCarGoal1-v0`, and `SafetyCarButton1-v0` environments. The left of each sub-figure is PPO, while the right is PPOLag. Through these videos, we can intuitively see the difference between safe and unsafe behavior. This is exactly what OmniSafe pursues: not just a safe-looking training curve, but truly safe behavior.