The code is provided under an Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Under this license, the code is provided royalty-free for non-commercial purposes only. The code may be covered by patents; if you want to use it for commercial purposes, please contact us for a different license.
A collection of sequential statistical hypothesis testing methods for two-by-two contingency tables.
The basic environment setup is shown below. A virtual or conda environment may be used; however, the requirements are quite lightweight, so this is likely unnecessary.
$ cd <some_directory>
$ git clone [email protected]:TRI-ML/sequentialized_barnard_tests.git
$ cd sequentialized_barnard_tests
$ pip install -r requirements.txt
$ pip install -e .
For potential contributors and developers, we recommend a virtualenv:
$ cd <some_directory>
$ git clone [email protected]:TRI-ML/sequentialized_barnard_tests.git
$ virtualenv --python=python3.10 <env_name>
$ source <env_name>/bin/activate
$ cd sequentialized_barnard_tests
$ pip install -r requirements.txt
$ pip install -e .
$ pre-commit install
We assume that any specified virtual / conda environment has been activated for all subsequent code snippets.
We include key notes for understanding the core ideas of the STEP code. Quick-start resources are included in both shell script and notebook form.
In order to synthesize a STEP policy for specific values of n_max and alpha, one additional set of parametric decisions is required: the user must set the risk budget shape, which is specified by a choice of function family (p-norm vs. zeta-function) and a real-valued shape parameter. The shape parameter is used directly for the zeta-function family and is exponentiated for the p-norm family.
For the p-norm family:
$$\text{Shape Parameter: } \lambda \in \mathbb{R}$$ $$\text{Accumulated Risk Budget}(n) = \alpha \cdot (\frac{n}{n_{max}})^{\exp{(\lambda)}}$$
For the zeta-function family:
$$\text{Shape Parameter: } \lambda \in \mathbb{R}$$ $$\text{Accumulated Risk Budget}(n) = \frac{\alpha}{Z(n_{max})} \cdot \sum_{i=1}^n (\frac{1}{i})^{\lambda}$$ $$Z(n_{max}) = \sum_{i=1}^{n_{max}} (\frac{1}{i})^{\lambda}$$
The user may confirm that in each case, evaluating the accumulated risk budget at $n = n_{max}$ yields exactly $\alpha$, the total allowed risk.
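Substituting $n = n_{max}$ into the two definitions above makes this explicit:

$$\alpha \cdot \left(\frac{n_{max}}{n_{max}}\right)^{\exp(\lambda)} = \alpha, \qquad \frac{\alpha}{Z(n_{max})} \cdot \sum_{i=1}^{n_{max}} \left(\frac{1}{i}\right)^{\lambda} = \frac{\alpha}{Z(n_{max})} \cdot Z(n_{max}) = \alpha$$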
In the codebase, the value of $\lambda$ is the {shape_parameter} argument (passed via the -pz flag in the synthesis script below).
The function family is selected by the Boolean {use_p_norm} variable:
- If it is True, the p-norm family is used.
- If it is False, the zeta-function family is used.
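For intuition, here is a minimal, self-contained sketch of the two budget families. The function name and signature are illustrative, not the library's API; it simply evaluates the formulas above and checks that each family spends the full budget $\alpha$ at $n = n_{max}$:

```python
import numpy as np


def accumulated_risk_budget(n, n_max, alpha, shape_parameter, use_p_norm):
    """Illustrative accumulated risk budget at step n (hypothetical helper,
    not the library's internal implementation)."""
    if use_p_norm:
        # p-norm family: the shape parameter is exponentiated.
        return alpha * (n / n_max) ** np.exp(shape_parameter)
    # Zeta-function family: normalized partial sums of i ** (-lambda).
    weights = np.arange(1, n_max + 1, dtype=float) ** (-shape_parameter)
    return alpha * weights[:n].sum() / weights.sum()


# Both families spend exactly alpha at n = n_max, and shape_parameter = 0.0
# recovers the linear budget in either family.
for use_p_norm in (True, False):
    assert np.isclose(accumulated_risk_budget(200, 200, 0.05, 0.0, use_p_norm), 0.05)
    assert np.isclose(accumulated_risk_budget(100, 200, 0.05, 0.0, use_p_norm), 0.025)
```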
In principle, the risk budget generalizes to any monotonically non-decreasing sequence that accumulates to $\alpha$ at $n_{max}$; the two parametric families above are the shapes provided here.
Having decided an appropriate form for the risk budget shape, policy synthesis is straightforward to run. From the base directory, the general command would be:
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {shape_parameter} -up {use_p_norm}
We recommend using the default linear risk budget, which is the shape used in the paper. This corresponds to {shape_parameter} $= 0.0$ in either family, so the following three invocations are equivalent:
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha}
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {0.0} -up "True"
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {0.0} -up "False"
Note: For {shape_parameter} $\neq 0.0$, the two families yield different budget shapes, so the choice of {use_p_norm} matters.
- Running the policy synthesis will save a durable policy to the user's local machine. This policy can be reused for all future settings requiring the same {n_max, alpha} combination. For {n_max} $< 500$, the required memory is under 5 MB. The policy is saved under:
sequentialized_barnard_tests/policies/
- At present, we have not tested extensively beyond {n_max} $= 500$. Going beyond this limit may lead to issues, and the likelihood grows the larger {n_max} is set to be. The code will also require increasing amounts of RAM as {n_max} is increased.
We now assume that a STEP policy has been constructed for the target problem. This can either be one of the default policies, or a newly constructed one following the recipe in the preceding section.
The data should be formatted into a numpy array of shape (N, 2), with one row per paired evaluation trial and one column per policy's binary outcome. Place the file under data/ as follows:
$ mkdir data/{new_project_dir}
$ cp path/to/{my_data_file.npy} data/{new_project_dir}/{my_data_file.npy}
As an example, the included data could have been staged as follows:
$ mkdir data/example_clean_spill
$ cp some/path/to/TRI_CLEAN_SPILL_v4.npy data/example_clean_spill/TRI_CLEAN_SPILL_v4.npy
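Alternatively, a correctly shaped file can be created directly with numpy. The sketch below is purely illustrative: the file name, success rates, and column convention are assumptions, not values from the library:

```python
import numpy as np

# Synthetic stand-in for an evaluation data file (all names and values are
# illustrative). Each row is one paired trial; each column is one policy's
# binary success outcome. Assumed column convention:
#   column 0 -> baseline policy, column 1 -> test policy.
rng = np.random.default_rng(seed=0)
n_trials = 200
data = np.stack(
    [
        rng.binomial(1, 0.70, size=n_trials),  # baseline successes (illustrative rate)
        rng.binomial(1, 0.85, size=n_trials),  # test-policy successes (illustrative rate)
    ],
    axis=1,
)
assert data.shape == (n_trials, 2)
np.save("data/example_clean_spill/synthetic_example.npy", data)
```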
Then the user need only run the evaluation script, which requires the project directory and file name in addition to the policy synthesis arguments:
$ python evaluation/run_step_on_evaluation_data.py -p "{new_project_dir}" -f "{my_data_file.npy}" -n {n_max} -a {alpha} -pz {shape_parameter} -up "{use_p_norm}"
This will print the evaluation result to the terminal and save key information to a timestamped JSON file.
We illustrate this via an evaluation on the default data:
$ python evaluation/run_step_on_evaluation_data.py -p "example_clean_spill" -f "TRI_CLEAN_SPILL_v4.npy" -n {200} -a {0.05} -pz {0.0} -up "False"
Scripts to generate and visualize STEP policies are included under:
scripts/
Any resulting visualizations are stored in:
media/
The evaluation environment for real data is included in:
evaluation/
and the associated evaluation data is stored in:
data/
@inproceedings{snyder2025step,
title = {Is Your Imitation Learning Policy Better Than Mine? Policy Comparison with Near-Optimal Stopping},
author = {Snyder, David and Hancock, Asher James and Badithela, Apurva and Dixon, Emma and Miller, Patrick and Ambrus, Rares Andrei and Majumdar, Anirudha and Itkina, Masha and Nishimura, Haruki},
booktitle = {Proceedings of the Robotics: Science and Systems Conference (RSS)},
year = {2025},
}