Showing 20 changed files with 4,842 additions and 22 deletions.

```diff
@@ -45,10 +45,6 @@ jobs:
       run: |
         make pre-commit
     - name: ruff
       run: |
         make ruff
     - name: flake8
       run: |
         make flake8
```

# Case Study

One important motivation for SafeRL is to enable agents to explore and learn safely. Evaluating algorithm performance with respect to *procedural constraint violations* is therefore also important. We have selected representative experimental results, reported in <a href="#analys">Figure 1</a> and <a href="#analys_ppo">Figure 2</a>.

#### Radical vs. Conservative

*Radical* policies often explore higher rewards but violate more safety constraints, whereas *conservative* policies do the opposite. <a href="#analys">Figure 1</a> illustrates this: during training, CPO and PPOLag consistently pursue the highest rewards among all algorithms, as depicted in the first row. However, as shown in the second row, they experience significant fluctuations in constraint violations, especially PPOLag. They are therefore relatively radical, *i.e.,* higher rewards but higher costs. In comparison, while P3O achieves slightly lower rewards than PPOLag, it shows fewer oscillations in constraint violations and adheres to safety constraints more reliably, as is evident from the smaller proportion of its distribution crossing the black dashed line. A similar pattern is observed when comparing PCPO with CPO. Therefore, P3O and PCPO are relatively conservative, *i.e.,* lower costs but lower rewards.

<img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/on-policy/benchmarks/analys.png?raw=true" id="analys">
<br>

**Figure 1:** PPOLag, P3O, CPO, and PCPO trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data covers 5 random seeds, and data points beyond 3 standard deviations are filtered out. The black dashed line represents the preset `cost_limit`.
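
The 3-standard-deviation filtering mentioned in the caption can be sketched as follows (a minimal illustration; the helper name and the synthetic costs are ours, not from OmniSafe):

```python
import numpy as np

def filter_3sigma(values: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Drop points more than k standard deviations from the mean."""
    mu, sigma = values.mean(), values.std()
    return values[np.abs(values - mu) <= k * sigma]

# Example: clean a batch of episodic costs before plotting their distribution.
episodic_costs = np.concatenate([np.random.normal(20.0, 2.0, 500), [95.0]])
filtered = filter_3sigma(episodic_costs)  # the 95.0 spike lies far outside 3 sigma
```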

#### Oscillation vs. Stability

The oscillations in the degree of constraint violations during training can indicate the performance of SafeRL algorithms. These oscillations are quantified by *extremes*, *i.e.,* the maximum constraint violation, and *distributions*, *i.e.,* the frequency with which violations remain below a predefined `cost_limit`. As shown in <a href="#analys_ppo">Figure 2</a>, PPOLag, a popular baseline in SafeRL, uses the Lagrangian multiplier for constraint handling. Despite its simplicity and ease of implementation, PPOLag often suffers from significant oscillations because appropriate initial values and learning rates are hard to set. It consistently seeks higher rewards, but this leads to larger extremes and unsafe distributions. Conversely, CPPOPID, which employs a PID controller for updating the Lagrangian multiplier, markedly reduces these extremes. CUP implements a two-stage projection method that constrains the distribution of violations below the `cost_limit`. Lastly, PPOSaute integrates state observations with constraints, resulting in smaller extremes and safer distributions of violations.
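
To make these two measures concrete, here is a small sketch that computes them for one training run (the helper name and the default `cost_limit` of 25.0 are our own illustrative choices, not values from the benchmark):

```python
import numpy as np

def violation_stats(episodic_costs: np.ndarray, cost_limit: float = 25.0):
    """Summarize constraint-violation behavior for one training run.

    Returns the 'extreme' (maximum episodic cost) and the 'distribution'
    statistic (fraction of episodes whose cost stays below the limit).
    """
    extreme = float(episodic_costs.max())
    frac_below_limit = float((episodic_costs < cost_limit).mean())
    return extreme, frac_below_limit

# Synthetic example: an oscillating run vs. a stable one.
oscillating = np.abs(np.random.normal(25.0, 15.0, 1000))
stable = np.abs(np.random.normal(20.0, 3.0, 1000))
for name, run in [('oscillating', oscillating), ('stable', stable)]:
    extreme, frac = violation_stats(run)
    print(f'{name}: extreme={extreme:.1f}, below cost_limit={frac:.0%}')
```

By these statistics, a safer algorithm shows a smaller extreme and a larger fraction of episodes below the `cost_limit`.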

<img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/on-policy/benchmarks/analys_ppo.png?raw=true" id="analys_ppo">
<br>

**Figure 2:** PPOLag, CPPOPID, CUP, and PPOSaute trained on four tasks for 1e7 steps, showing the distribution of all episodic rewards and costs. All data covers 5 random seeds, and data points beyond 3 standard deviations are filtered out. The black dashed line represents the preset `cost_limit`.

# Model-based Algorithms

The OmniSafe Navigation Benchmark for model-based algorithms evaluates the effectiveness of OmniSafe's model-based algorithms across two environments from the [Safety-Gymnasium](https://github.com/PKU-Alignment/safety-gymnasium) task suite. For each supported algorithm and environment, we offer the following:

- Default hyperparameters used for the benchmark and scripts that enable result replication.
- Graphs and raw data that can be utilized for research purposes.
- Detailed logs obtained during training.

Supported algorithms are listed below:

- **[NeurIPS 2018]** [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS)](https://arxiv.org/abs/1805.12114)
- **[CoRL 2021]** [Learning Off-Policy with Online Planning (LOOP and SafeLOOP)](https://arxiv.org/abs/2008.10066)
- **[AAAI 2022]** [Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning (CAP)](https://arxiv.org/abs/2112.07701)
- **[ICML 2022 Workshop]** [Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method (RCE)](https://arxiv.org/abs/2010.07968)
- **[NeurIPS 2018]** [Constrained Cross-Entropy Method for Safe Reinforcement Learning (CCE)](https://proceedings.neurips.cc/paper/2018/hash/34ffeb359a192eb8174b6854643cc046-Abstract.html)

## Safety-Gymnasium

We highly recommend using **Safety-Gymnasium** to run the following experiments. To install it on a Linux machine, run:

```bash
pip install safety_gymnasium
```
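
As a quick sanity check that the installation works, you can roll out one random step (a minimal sketch assuming the standard Safety-Gymnasium API, in which `step` returns a cost signal alongside the reward):

```python
import safety_gymnasium

env = safety_gymnasium.make('SafetyPointGoal1-v0')
obs, info = env.reset(seed=0)
action = env.action_space.sample()
# Unlike plain Gymnasium, step() also returns a cost for constraint violations.
obs, reward, cost, terminated, truncated, info = env.step(action)
env.close()
```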

## Run the Benchmark

You can set the main function of ``examples/benchmarks/experiment_grid.py`` as:

```python
# ExperimentGrid and train are already provided by experiment_grid.py.
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Model-Based-Benchmarks')

    # Set up the algorithms.
    model_based_base_policy = ['LOOP', 'PETS']
    model_based_safe_policy = ['SafeLOOP', 'CCEPETS', 'CAPPETS', 'RCEPETS']
    eg.add('algo', model_based_base_policy + model_based_safe_policy)

    # You can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # You can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])
    eg.add('train_cfgs:total_steps', [1000000])

    # Set up the environment.
    eg.add('env_id', [
        'SafetyPointGoal1-v0-modelbased',
        'SafetyCarGoal1-v0-modelbased',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # The total number of experiments must be divisible by num_pool;
    # choose this value according to your machine.
    eg.run(train, num_pool=5)
```

After that, you can run the following command to start the benchmark:

```bash
cd examples/benchmarks
python run_experiment_grid.py
```

You can set the path in ``examples/benchmarks/experiment_grid.py``, for example:

```python
path = 'omnisafe/examples/benchmarks/exp-x/Model-Based-Benchmarks'
```

You can also plot the results by running the following command:

```bash
cd examples
python analyze_experiment_results.py
```

**For detailed usage of the OmniSafe statistics tool, please refer to [this tutorial](https://omnisafe.readthedocs.io/en/latest/common/stastics_tool.html).**
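
As a rough sketch of how the statistics tool is typically driven (the class path and argument names follow the tutorial above; treat them as illustrative and consult the tutorial for authoritative usage):

```python
from omnisafe.common.statistics_tools import StatisticsTools

# Path to the experiment grid output, e.g. the one shown above.
path = 'omnisafe/examples/benchmarks/exp-x/Model-Based-Benchmarks'

st = StatisticsTools()
st.load_source(path)
# Compare runs that differ in the 'algo' parameter and draw the curves.
st.draw_graph(parameter='algo', values=None, compare_num=2, cost_limit=None)
```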

## OmniSafe Benchmark

To demonstrate the high reliability of the algorithms it implements, OmniSafe offers performance insights within the Safety-Gymnasium environment. Note that all data is procured under the constraint of `cost_limit=1.00`. The results are presented in <a href="#performance_model_based">Table 1</a> and <a href="#curve_model_based">Figure 1</a>.

### Performance Table

<a id="performance_model_based"></a>

| **Environment** | **PETS** Reward | **PETS** Cost | **LOOP** Reward | **LOOP** Cost | **SafeLOOP** Reward | **SafeLOOP** Cost |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|
| SafetyCarGoal1-v0 | 33.07 ± 1.33 | 61.20 ± 7.23 | 25.41 ± 1.23 | 62.64 ± 8.34 | 22.09 ± 0.30 | 0.16 ± 0.15 |
| SafetyPointGoal1-v0 | 27.66 ± 0.07 | 49.16 ± 2.69 | 25.08 ± 1.47 | 55.23 ± 2.64 | 22.94 ± 0.72 | 0.04 ± 0.07 |

| **Environment** | **CCEPETS** Reward | **CCEPETS** Cost | **RCEPETS** Reward | **RCEPETS** Cost | **CAPPETS** Reward | **CAPPETS** Cost |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|
| SafetyCarGoal1-v0 | 27.60 ± 1.21 | 1.03 ± 0.29 | 29.08 ± 1.63 | 1.02 ± 0.88 | 23.33 ± 6.34 | 0.48 ± 0.17 |
| SafetyPointGoal1-v0 | 24.98 ± 0.05 | 1.87 ± 1.27 | 25.39 ± 0.28 | 2.46 ± 0.58 | 9.45 ± 8.62 | 0.64 ± 0.77 |

**Table 1:** The performance of OmniSafe model-based algorithms, covering both reward and cost, evaluated in the Safety-Gymnasium environments. Note that all model-based algorithms were evaluated after 1e6 training steps.

### Performance Curves

<table id="curve_model_based">
  <tr>
    <td style="text-align:center">
      <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/model-based/benchmarks/SafetyCarGoal1-v0-modelbased.png?raw=True">
      <br>
      <div>SafetyCarGoal1-v0</div>
    </td>
  </tr>
  <tr>
    <td style="text-align:center">
      <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="https://github.com/Gaiejj/omnisafe_benchmarks_cruve/blob/main/model-based/benchmarks/SafetyPointGoal1-v0-modelbased.png?raw=True">
      <br>
      <div>SafetyPointGoal1-v0</div>
    </td>
  </tr>
</table>

**Figure 1:** Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms and safe learning algorithms mentioned in <a href="#performance_model_based">Table 1</a>.