v.0.1.2 - Final merge dev to main (#63)
* Enable linting in pipeline (#34)
* configure and run flake8
* use sphinx-bibtext (#35)
* kdqTree: Added compatibility with pandas dataframes (#40)
  Transferred changes for issue #133 on kdqtree pandas from GL to GH
* Merge unit tests for example materials (#43)
* split src / example tests, reorganize test directories
* add notebook tester towards #8
* add venv/kernel steps to allow for nbconvert tests
* Add separate coverage test, workflow improvemets (#44)
* move coverage to separate workflow, fail under 100
* revert to 1 combined job for test + cov, fail under 99 (#5)
* separate lint workflow, add isort
* rework black step, remove isort
* fix black version
* run black, update README towards #44
* add badges to readme
* fix badge section
* Update .github/workflows/tests.yml
* Update README.rst
* Update README.rst
* Update README.rst
* Update README.rst
  Co-authored-by: Thomas Schill <[email protected]>
* Add unit tests for kdqTree (#47)
* Adds more unit tests for kdqTree
* add validation unit tests
* Add a unit test for KDQTreePartitioner
* add reset to set_reference
* use ref_data variable properly when drift occurs
* Update hdm docs (#50)
* updated docs
* closes issue 33
* apply black formatting
  Co-authored-by: Thomas Schill <[email protected]>
* Update README_dataCircleGSev3Sp3Train.txt (#52)
* Adds notebook versions of the examples to RTD. (#53)
* Started testing out python example notebooks with sphinx
* update conf.py to enable pandoc, move examples around
* Added example notebooks for data drift detectors
* Added example notebooks for modules
* remove extra notebooks, fix plotly plots
  Co-authored-by: Shashank Jarmale
  Co-authored-by: Shashank Jarmale
* add tox to dev install
* Fix kdq-tree batch example in documentation example notebook (#54)
  Duplicating the examples for the purpose of the documentation got us an older version pulled forward that I didn't catch during review of the PR.
* Minor updates to constrain Python version used in installation (#55)
* add tox python version checks towards #2
* Update .gitignore
* fix syntax
* add version notes
* remove older versions from tox
* Update README to include pyenv steps, towards #2
* remove pyenv section
  Co-authored-by: Thomas Schill <[email protected]>
* Merge new data module and reorganize data files (#56)
* add (untested) .python that duplicates make_example_data.R
* add TODO items
* reorganize tools => datasets towards #38
* further reorganize datasets module, add DataGenerator idea
* split DataGenerator idea, and fix bugs in make_example_batch_data
* update any example_data.csv script to now use function
* consolidate dataset descriptions into one README
* debug make_example_data
* delete outdated data files towards #38
* remove TODO comment
* satisfy formatting requirements
* add unit tests and comment out untested code
* comment out missing code, add single-line description towards #38
* minor formatting changes to trigger checks
* debug unit tests, re-satisfy formatting requirements
* update references in docs notebooks, add generator docstring
  also fixes some whitespace in cdbd.py
  Co-authored-by: Thomas Schill <[email protected]>
* Merge new streaming, batch ABCs and refactor KdqTree detector (#62)
* separate into streaming and batch detector ABCs (#46)
* split kdqtree into streaming/batch versions, update tests
* finish batch version of kdqtree
* begin using multiple inheritance scheme for kdqtree detectors (#46)
* establish commonly inherited functionality in new KdqTreeDetector class
* establish commonly inherited functionality in new KdqTree detector class (#46)
* deconstruct update to enable code reuse in KdqTreeDetector (#46)
* debug all failing tests in test_kdqtree (#46)
* update __init__, update refs in examples (#46)
* update outdated data refs
* add any missing docstrings (#46)
* format with black
* add unit test for new ABC drift setters
* updated the data_drift_examples notebook
* docstring formatting tweaks
* fix typo
  Co-authored-by: Thomas Schill <[email protected]>
* fix formatting in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* fix formatting in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* fix typo
  Co-authored-by: Thomas Schill <[email protected]>
* fix description in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* formatting
* remove double-documented attributes from docstring
* provide useful information in child docstrings
* move _drift_counter into KdqTreeStreaming
* delete coverage file
* toss ref data once processed
* format with black
  Co-authored-by: Thomas Schill <[email protected]>
* Switch to README.md for better rendering on github (#49)
* switch to README.md for better rendering on github
  - removes reference links from table
  - adds placeholder mermaid flow diagram
  - makes some tweaks to the README text
* update requirements in setup.cfg
* test mermaid rendering
* Add "Choosing a Detector" page to TOC
* tweak README text
* add RTD hyperlink
* Merge with current dev
* remove draft flow diagram
* Add CHANGELOG and pypi actions for release. (#51)
* add CHANGELOG, yaml
* add Action to push to pypi upon published release
* change name of workflow
* test adding security linter
* test bandit linting
* comments
* alphabetize setup.cfg.test
* increment version number
* change lint badge name
* address comments for main-dev PR 63
1 parent d754e07, commit 8bdac6d
Showing 76 changed files with 6,144 additions and 305,323 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# This workflow will set up the environment and run all scripts/notebooks found in /examples. | ||
|
||
name: examples | ||
|
||
on: | ||
  push: | ||
    branches: [ "main" ] | ||
  pull_request: | ||
    branches: [ "main" ] | ||
|
||
permissions: | ||
  contents: read | ||
|
||
jobs: | ||
  build: | ||
    runs-on: ubuntu-latest | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
      - name: Set up Python 3.10 | ||
        uses: actions/setup-python@v3 | ||
        with: | ||
          python-version: "3.10" | ||
      - name: Install dependencies | ||
        run: | | ||
          python -m venv ./venv | ||
          source venv/bin/activate | ||
          python -m pip install --upgrade pip | ||
          pip install -e .[test] | ||
      - name: Test examples | ||
        run: | | ||
          source venv/bin/activate | ||
          ipython kernel install --name "venv" --user | ||
          cd tests/examples | ||
          pytest |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# as described here: https://www.caktusgroup.com/blog/2021/02/11/automating-pypi-releases/ | ||
|
||
name: Upload Python Package | ||
|
||
on: | ||
  release: | ||
    types: [released] | ||
|
||
jobs: | ||
  deploy: | ||
    runs-on: ubuntu-20.04 | ||
|
||
    steps: | ||
      - uses: actions/checkout@v2 | ||
      - uses: actions/setup-python@v2 | ||
      - name: Install dependencies | ||
        run: | | ||
          python -m pip install --upgrade pip | ||
          pip install setuptools wheel twine | ||
      - name: Build and publish | ||
        env: | ||
          TWINE_USERNAME: __token__ | ||
          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} | ||
        run: | | ||
          python setup.py sdist bdist_wheel | ||
          twine upload dist/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# As described on https://stefanzweifel.io/posts/2021/11/13/introducing-the-changelog-updater-action | ||
|
||
name: 'Update Changelog' | ||
|
||
on: | ||
  release: | ||
    types: [released] | ||
|
||
jobs: | ||
  update: | ||
    runs-on: ubuntu-latest | ||
|
||
    steps: | ||
      - name: Checkout code | ||
        uses: actions/checkout@v2 | ||
        with: | ||
          ref: main | ||
|
||
      - name: Update Changelog | ||
        uses: stefanzweifel/changelog-updater-action@v1 | ||
        with: | ||
          release-notes: ${{ github.event.release.body }} | ||
          latest-version: ${{ github.event.release.name }} | ||
|
||
      - name: Commit updated CHANGELOG | ||
        uses: stefanzweifel/git-auto-commit-action@v4 | ||
        with: | ||
          branch: main | ||
          commit_message: Update CHANGELOG | ||
          file_pattern: CHANGELOG.md |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,3 +14,4 @@ _build | |
*.DS_Store | ||
.idea/ | ||
*.png | ||
*.tox* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,12 @@ | ||
version: 2 | ||
|
||
sphinx: | ||
configuration: docs/source/conf.py | ||
|
||
python: | ||
version: "3.8" | ||
setup_py_install: true | ||
version: "3.8" | ||
install: | ||
- method: pip | ||
path: . | ||
extra_requirements: | ||
- dev |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Changelog | ||
|
||
Notable changes to Menelaus will be documented here. | ||
|
||
## v0.1.1 - Jun 7, 2022 | ||
|
||
- Initial public release. | ||
- Published to pypi. | ||
- Published to readthedocs.io. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
[![tests](https://github.com/mitre/menelaus/actions/workflows/tests.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/tests.yml) | ||
[![Documentation Status](https://readthedocs.org/projects/menelaus/badge/?version=latest)](https://menelaus.readthedocs.io/en/latest/?badge=latest) | ||
[![examples](https://github.com/mitre/menelaus/actions/workflows/examples.yml/badge.svg?branch=main)](https://github.com/mitre/menelaus/actions/workflows/examples.yml) | ||
[![lint](https://github.com/mitre/menelaus/actions/workflows/format.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/format.yml) | ||
|
||
# Background | ||
|
||
Menelaus implements algorithms for the purposes of drift detection. Drift | ||
detection is a branch of machine learning focused on the detection of unforeseen | ||
shifts in data. The relationships between variables in a dataset are rarely | ||
static and can be affected by changes in both internal and external factors, | ||
e.g. changes in data collection techniques, external protocols, and/or | ||
population demographics. Both undetected changes in data and undetected model | ||
underperformance pose risks to the users thereof. The aim of this package is to | ||
enable monitoring of data and of model performance. | ||
|
||
The algorithms contained within this package were identified through a | ||
comprehensive literature survey. Menelaus\' aim was to implement drift detection | ||
algorithms that cover a range of statistical methodology. Of the algorithms | ||
identified, all are able to identify when drift is occurring; some can highlight | ||
suspicious regions of the data in which drift is more significant; and others | ||
can also provide model retraining recommendations. | ||
|
||
Menelaus implements drift detectors for both streaming and batch data. In a | ||
streaming setting, data is arriving continuously and is processed one | ||
observation at a time. Streaming detectors process the data with each new | ||
observation that arrives and are intended for use cases in which instant | ||
analytical results are desired. In a batch setting, information is collected | ||
over a period of time. Once the predetermined set is \"filled\", data is fed | ||
into and processed by the drift detection algorithm as a single batch. Within a | ||
batch, there is no meaningful ordering of the data with respect to time. Batch | ||
algorithms are typically used when it is more important to process large volumes | ||
of information simultaneously, where the speed of results after receiving data | ||
is of less concern. | ||
|
||
In The Odyssey, Menelaus seeks a prophecy known by the shapeshifter | ||
Proteus. Menelaus holds Proteus down as he takes the form of a lion, a | ||
serpent, water, and so on. Eventually, Proteus relents, and Menelaus | ||
gains the answers he sought. Accordingly, this library provides tools | ||
for \"holding\" data as it shifts. | ||
|
||
# Detector List | ||
|
||
Menelaus implements the following drift detectors. | ||
|
||
| Type | Detector | Abbreviation | Streaming | Batch | | ||
|------------------|---------------------------------------------------------------|--------------|-----------|-------| | ||
| Change detection | Cumulative Sum Test | CUSUM | x | | | ||
| Change detection | Page-Hinkley | PH | x | | | ||
| Concept drift | ADaptive WINdowing | ADWIN | x | | | ||
| Concept drift | Drift Detection Method | DDM | x | | | ||
| Concept drift | Early Drift Detection Method | EDDM | x | | | ||
| Concept drift | Linear Four Rates | LFR | x | | | ||
| Concept drift | Statistical Test of Equal Proportions to Detect concept drift | STEPD | x | | | ||
| Data drift | Confidence Distribution Batch Detection | CDBD | | x | | ||
| Data drift | Hellinger Distance Drift Detection Method | HDDDM | | x | | ||
| Data drift | kdq-Tree Detection Method | kdq-Tree | x | x | | ||
| Data drift | PCA-Based Change Detection | PCA-CD | x | | | ||
|
||
|
||
The three main types of detector are described below. More details, including | ||
references to the original papers, can be found in the respective module | ||
documentation on [ReadTheDocs](https://menelaus.readthedocs.io/en/latest/). | ||
|
||
- Change detectors monitor single variables in the streaming context, | ||
and alarm when that variable starts taking on values outside of a | ||
pre-defined range. | ||
- Concept drift detectors monitor the performance characteristics of a | ||
given model, trying to identify shifts in the joint distribution of | ||
the data\'s feature values and their labels. | ||
- Data drift detectors monitor the distribution of the features; in | ||
that sense, they are model-agnostic. Such changes in distribution | ||
might be to single variables or to the joint distribution of all the | ||
features. | ||
|
||
The detectors may be applied in two settings, as described in the Background | ||
section: | ||
|
||
- Streaming, in which each new observation that arrives is processed | ||
separately, as it arrives. | ||
- Batch, in which the data has no meaningful ordering with respect to time, | ||
and the goal is comparing two datasets as a whole. | ||
|
||
Additionally, the library implements a kdq-Tree partitioner, for support of the | ||
kdq-Tree Detection Method. This data structure partitions a given feature space, | ||
then maintains a count of the number of samples from the given dataset that fall | ||
into each section of that partition. More details are given in the respective | ||
module. | ||
|
||
A flowchart breaking down these contexts can be found on the ReadTheDocs page under "Choosing a Detector." | ||
|
||
# Installation | ||
|
||
Create a virtual environment as desired, then: | ||
|
||
```shell | ||
# for read-only, install from pypi: | ||
pip install menelaus | ||
|
||
# to allow editing, running tests, generating docs, etc. | ||
# First, clone the git repo, then: | ||
cd ./menelaus/ | ||
pip install -e .[dev] | ||
``` | ||
|
||
Menelaus should work with Python 3.8 or higher. | ||
|
||
# Getting Started | ||
|
||
Each detector implements the API defined by `menelaus.drift_detector`: | ||
an `update` method, which accepts new data; a `drift_state` attribute, | ||
which tells the user whether drift has been detected; and a `reset` | ||
method (generally called automatically by `update`), which clears | ||
`drift_state` along with (usually) some other attributes specific to | ||
the detector class. | ||
|
||
Generally, the workflow for using a detector, given some data, is as | ||
follows: | ||
|
||
```python | ||
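import pandas as pd | ||
from menelaus.concept_drift import ADWIN | ||
# load data containing the model's predictions and the true labels | ||
df = pd.read_csv('example.csv') | ||
detector = ADWIN() | ||
# feed the detector one observation at a time | ||
for i, row in df.iterrows(): | ||
    detector.update(row['y_predicted'], row['y_true']) | ||
    # drift_state stays None until the detector raises an alarm | ||
    if detector.drift_state is not None: | ||
        print("Drift has occurred!") | ||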
``` | ||
|
||
For this example, because ADWIN is a concept drift detector, it requires | ||
both a predicted value (`y_predicted`) and a true value (`y_true`), at | ||
each update step. This requirement does not apply to the | ||
detectors in other modules. More detailed examples, including code for | ||
visualizing drift locations, may be found in the `examples` directory as | ||
stand-alone Python scripts. The examples, along with their output, can also | ||
be viewed on the ReadTheDocs website. | ||
|
||
# Contributing | ||
Install the library using the `[dev]` option, as above. | ||
|
||
- **Testing** | ||
|
||
Unit tests can be run with the command `pytest`. By default, a | ||
coverage report with highlighting will be generated in `htmlcov/index.html`. | ||
These default settings are specified in `setup.cfg` under `[tool:pytest]`. | ||
|
||
- **Documentation** | ||
|
||
HTML documentation can be generated at | ||
`menelaus/docs/build/html/index.html` with: | ||
```shell | ||
cd docs/source | ||
sphinx-build . ../build | ||
``` | ||
|
||
- **Formatting** | ||
|
||
This project uses `black` for code formatting, `flake8` for linting, and | ||
`bandit` for security checks. To satisfy these requirements when contributing, you | ||
may use them as the linter/formatter in your IDE, or manually run the | ||
following from the root directory: | ||
```shell | ||
flake8 # linting | ||
bandit -r ./src # security checks | ||
black ./src/menelaus # formatting | ||
``` | ||
|
||
# Copyright | ||
|
||
Authors: Leigh Nicholl, Thomas Schill, India Lindsay, Anmol Srivastava, Kodie P McNamara, Shashank Jarmale.\ | ||
©2022 The MITRE Corporation. ALL RIGHTS RESERVED\ | ||
Approved for Public Release; Distribution Unlimited. Public Release\ | ||
Case Number 22-0244. |
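
The README's "Getting Started" section describes the detector interface as an `update` method, a `drift_state` attribute, and a `reset` method that `update` generally calls automatically. The sketch below illustrates that contract with a toy class. It is not Menelaus code: `ToyRangeDetector`, `find_drift_indices`, the `low`/`high` bounds, and the `"drift"` state string are all hypothetical names chosen for illustration, and the range check is only a stand-in for a real detection algorithm.

```python
class ToyRangeDetector:
    """Hypothetical detector: flags drift when a value leaves [low, high]."""

    def __init__(self, low, high):
        self.low = low
        self.high = high
        self.drift_state = None  # None means "no drift detected"
        self.samples_seen = 0

    def reset(self):
        # Clear the drift flag along with detector-specific bookkeeping.
        self.drift_state = None
        self.samples_seen = 0

    def update(self, value):
        # If the previous update raised an alarm, start fresh first.
        if self.drift_state is not None:
            self.reset()
        self.samples_seen += 1
        if not (self.low <= value <= self.high):
            self.drift_state = "drift"


def find_drift_indices(values, low, high):
    """Feed values one at a time; return the indices where drift was flagged."""
    detector = ToyRangeDetector(low, high)
    hits = []
    for i, value in enumerate(values):
        detector.update(value)
        if detector.drift_state is not None:
            hits.append(i)
    return hits
```

Because `update` resets the detector on the step after an alarm, the caller only needs to inspect `drift_state` after each call, which mirrors the loop shown in the README.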
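
The README describes the kdq-Tree partitioner as a structure that partitions a feature space and keeps a count of the samples falling into each section. The sketch below shows only the counting idea, and it swaps in a fixed uniform grid for the kdq-tree, so the cells here are not the ones a kdq-tree would produce. The names `grid_cell` and `count_by_cell` and the `lows`, `highs`, and `bins` parameters are hypothetical and do not come from the Menelaus API.

```python
from collections import Counter


def grid_cell(point, lows, highs, bins):
    """Map a point to the index tuple of the uniform grid cell containing it."""
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        frac = (x - lo) / (hi - lo)
        # Clamp so values on or beyond the boundary land in an edge cell.
        idx = min(max(int(frac * bins), 0), bins - 1)
        cell.append(idx)
    return tuple(cell)


def count_by_cell(points, lows, highs, bins):
    """Count how many points fall into each grid cell."""
    return Counter(grid_cell(p, lows, highs, bins) for p in points)
```

Comparing the per-cell counts of a reference dataset against those of a newer dataset is one way such a partition could be used to look for regions where the data has shifted.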