v.0.1.2 - Final merge dev to main (#63)

* Enable linting in pipeline (#34)
* configure and run flake8
* use sphinx-bibtext (#35)
* kdqTree: Added compatibility with pandas dataframes (#40)
  Transferred changes for issue #133 on kdqtree pandas from GL to GH
* Merge unit tests for example materials (#43)
* split src / example tests, reorganize test directories
* add notebook tester towards #8
* add venv/kernel steps to allow for nbconvert tests
* Add separate coverage test, workflow improvemets (#44)
* move coverage to separate workflow, fail under 100
* revert to 1 combined job for test + cov, fail under 99 (#5)
* separate lint workflow, add isort
* rework black step, remove isort
* fix black version
* run black, update README towards #44
* add badges to readme
* fix badge section
* Update .github/workflows/tests.yml
* Update README.rst
* Update README.rst
* Update README.rst
* Update README.rst
  Co-authored-by: Thomas Schill <[email protected]>
* Add unit tests for kdqTree (#47)
* Adds more unit tests for kdqTree
* add validation unit tests
* Add a unit test for KDQTreePartitioner
* add reset to set_reference
* use ref_data variable properly when drift occurs
* Update hdm docs (#50)
* updated docs
* closes issue 33
* apply black formatting
  Co-authored-by: Thomas Schill <[email protected]>
* Update README_dataCircleGSev3Sp3Train.txt (#52)
* Adds notebook versions of the examples to RTD. (#53)
* Started testing out python example notebooks with sphinx
* update conf.py to enable pandoc, move examples around
* Added example notebooks for data drift detectors
* Added example notebooks for modules
* remove extra notebooks, fix plotly plots
  Co-authored-by: Shashank Jarmale
  Co-authored-by: Shashank Jarmale
* add tox to dev install
* Fix kdq-tree batch example in documentation example notebook (#54)
  Duplicating the examples for the purpose of the documentation got us an older version pulled forward that I didn't catch during review of the PR.
* Minor updates to constrain Python version used in installation (#55)
* add tox python version checks towards #2
* Update .gitignore
* fix syntax
* add version notes
* remove older versions from tox
* Update README to include pyenv steps, towards #2
* remove pyenv section
  Co-authored-by: Thomas Schill <[email protected]>
* Merge new data module and reorganize data files (#56)
* add (untested) .python that duplicates make_example_data.R
* add TODO items
* reorganize tools => datasets towards #38
* further reorganize datasets module, add DataGenerator idea
* split DataGenerator idea, and fix bugs in make_example_batch_data
* update any example_data.csv script to now use function
* consolidate dataset descriptions into one README
* debug make_example_data
* delete outdated data files towards #38
* remove TODO comment
* satisfy formatting requirements
* add unit tests and comment out untested code
* comment out missing code, add single-line description towards #38
* minor formatting changes to trigger checks
* debug unit tests, re-satisfy formatting requirements
* update references in docs notebooks, add generator docstring
  also fixes some whitespace in cdbd.py
  Co-authored-by: Thomas Schill <[email protected]>
* Merge new streaming, batch ABCs and refactor KdqTree detector (#62)
* separate into streaming and batch detector ABCs (#46)
* split kdqtree into streaming/batch versions, update tests
* finish batch version of kdqtree
* begin using multiple inheritance scheme for kdqtree detectors (#46)
* establish commonly inherited functionality in new KdqTreeDetector class
* establish commonly inherited functionality in new KdqTree detector class (#46)
* deconstruct update to enable code reuse in KdqTreeDetector (#46)
* debug all failing tests in test_kdqtree (#46)
* update __init__, update refs in examples (#46)
* update outdated data refs
* add any missing docstrings (#46)
* format with black
* add unit test for new ABC drift setters
* updated the data_drift_examples notebook
* docstring formatting tweaks
* fix typo
  Co-authored-by: Thomas Schill <[email protected]>
* fix formatting in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* fix formatting in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* fix typo
  Co-authored-by: Thomas Schill <[email protected]>
* fix description in docstring
  Co-authored-by: Thomas Schill <[email protected]>
* formatting
* remove double-documented attributes from docstring
* provide useful information in child docstrings
* move _drift_counter into KdqTreeStreaming
* delete coverage file
* toss ref data once processed
* format with black
  Co-authored-by: Thomas Schill <[email protected]>
* Switch to README.md for better rendering on github (#49)
* switch to README.md for better rendering on github
  - removes reference links from table
  - adds placeholder mermaid flow diagram
  - makes some tweaks to the README text
* update requirements in setup.cfg
* test mermaid rendering
* Add "Choosing a Detector" page to TOC
* tweak README text
* add RTD hyperlink
* Merge with current dev
* remove draft flow diagram
* Add CHANGELOG and pypi actions for release. (#51)
* add CHANGELOG, yaml
* add Action to push to pypi upon published release
* change name of workflow
* test adding security linter
* test bandit linting
* comments
* alphabetize setup.cfg.test
* increment version number
* change lint badge name
* address comments for main-dev PR 63
mitre · Jul 11, 2022 · 8bdac6d
1 parent d754e07
commit 8bdac6d

Showing 76 changed files with 6,144 additions and 305,323 deletions.
diff --git a/.github/workflows/examples.yml b/.github/workflows/examples.yml
@@ -0,0 +1,34 @@
+# This workflow will set up the environment and run all scripts/notebooks found in /examples. 
+
+name: examples
+
+on:
+  push:
+    branches: [ "main" ] 
+  pull_request:
+    branches: [ "main" ]
+
+permissions:
+  contents: read
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python 3.10
+      uses: actions/setup-python@v3
+      with:
+        python-version: "3.10"
+    - name: Install dependencies
+      run: |
+        python -m venv ./venv
+        source venv/bin/activate
+        python -m pip install --upgrade pip
+        pip install -e .[test]
+    - name: Test examples
+      run: |
+        source venv/bin/activate
+        ipython kernel install --name "venv" --user
+        cd tests/examples
+        pytest
diff --git a/lint-test.yml b/.github/workflows/format.yml
@@ -1,11 +1,11 @@
-# This workflow will install Python dependencies, run tests and lint with a single version of Python
+# This workflow will lint with a single version of Python
 # For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
 
-name: Lint and Test
+name: linting | security
 
 on:
   push:
-    branches: [ "main", "dev"]
+    branches: [ "main", "dev" ]
   pull_request:
     branches: [ "main", "dev" ]
 
@@ -18,25 +18,33 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
+
     - uses: actions/checkout@v3
     - name: Set up Python 3.10
       uses: actions/setup-python@v3
       with:
         python-version: "3.10"
+
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install -e .[dev]
-    - name: Lint with black
-      run: |
-        black ./src/menelaus
+        pip install -e .[format]
+    
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names
         flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
         # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-    - name: Test with pytest
+        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics
+    
+    - name: Format with black
+      uses: psf/black@stable
+      with: 
+        options: "--check --verbose"
+        src: "./src/menelaus"
+        version: "22.3.0"
+
+    - name: Security check with bandit
       run: |
-        pytest --cov=src/ --cov-report term
-        coverage report -m
+        # exits with code 0 if there are no errors, otherwise complains
+        bandit -q -r ./src/
diff --git a/.github/workflows/pypi_push.yml b/.github/workflows/pypi_push.yml
@@ -0,0 +1,26 @@
+# as described here: https://www.caktusgroup.com/blog/2021/02/11/automating-pypi-releases/
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [released]
+
+jobs:
+  deploy:
+    runs-on: ubuntu-20.04
+
+    steps:
+    - uses: actions/checkout@v2
+    - uses: actions/setup-python@v2
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install setuptools wheel twine
+    - name: Build and publish
+      env:
+        TWINE_USERNAME: __token__
+        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+      run: |
+        python setup.py sdist bdist_wheel
+        twine upload dist/*
diff --git a/.github/workflows/github_ci.yml b/.github/workflows/tests.yml
@@ -1,11 +1,11 @@
 # This workflow will install Python dependencies, run tests and lint with a single version of Python
 # For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
 
-name: Lint and Test
+name: tests | coverage
 
 on:
   push:
-    branches: [ "main", "dev"]
+    branches: [ "main", "dev" ]
   pull_request:
     branches: [ "main", "dev" ]
 
@@ -27,16 +27,7 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         pip install -e .[dev]
-    - name: Format with black
-      run: |
-        black ./src/menelaus
-    - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        # flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-    - name: Test with pytest
+    - name: Unit and coverage tests
       run: |
-        pytest --cov=src/ --cov-report term
-        coverage report -m
+        pytest tests/menelaus --cov=src/ --cov-report term
+        coverage report -m --fail-under=100
diff --git a/.github/workflows/update-changelog.yaml b/.github/workflows/update-changelog.yaml
@@ -0,0 +1,30 @@
+# As described on https://stefanzweifel.io/posts/2021/11/13/introducing-the-changelog-updater-action
+
+name: 'Update Changelog'
+
+on:
+    release:
+        types: [released]
+
+jobs:
+    update:
+        runs-on: ubuntu-latest
+
+        steps:
+            - name: Checkout code
+              uses: actions/checkout@v2
+              with:
+                  ref: main
+
+            - name: Update Changelog
+              uses: stefanzweifel/changelog-updater-action@v1
+              with:
+                  release-notes: ${{ github.event.release.body }}
+                  latest-version: ${{ github.event.release.name }}
+
+            - name: Commit updated CHANGELOG
+              uses: stefanzweifel/git-auto-commit-action@v4
+              with:
+                  branch: main
+                  commit_message: Update CHANGELOG
+                  file_pattern: CHANGELOG.md
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,4 @@ _build
 *.DS_Store
 .idea/
 *.png
+*.tox*
diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -1,3 +1,12 @@
+version: 2
+
+sphinx:
+   configuration: docs/source/conf.py
+
 python:
-   version: "3.8"
-   setup_py_install: true
+  version: "3.8"
+  install:
+    - method: pip
+      path: .
+      extra_requirements:
+        - dev
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,9 @@
+# Changelog
+
+Notable changes to Menelaus will be documented here.
+
+## v0.1.1 - Jun 7, 2022
+
+- Initial public release.
+- Published to pypi.
+- Published to readthedocs.io.
diff --git a/README.md b/README.md
@@ -0,0 +1,174 @@
+[![tests](https://github.com/mitre/menelaus/actions/workflows/tests.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/tests.yml)
+[![Documentation Status](https://readthedocs.org/projects/menelaus/badge/?version=latest)](https://menelaus.readthedocs.io/en/latest/?badge=latest)
+[![examples](https://github.com/mitre/menelaus/actions/workflows/examples.yml/badge.svg?branch=main)](https://github.com/mitre/menelaus/actions/workflows/examples.yml)
+[![lint](https://github.com/mitre/menelaus/actions/workflows/format.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/format.yml)
+
+# Background
+
+Menelaus implements algorithms for the purposes of drift detection. Drift
+detection is a branch of machine learning focused on the detection of unforeseen
+shifts in data. The relationships between variables in a dataset are rarely
+static and can be affected by changes in both internal and external factors,
+e.g. changes in data collection techniques, external protocols, and/or
+population demographics. Both undetected changes in data and undetected model
+underperformance pose risks to the users thereof. The aim of this package is to
+enable monitoring of data and of model performance.
+
+The algorithms contained within this package were identified through a
+comprehensive literature survey. Menelaus' aim was to implement drift detection
+algorithms that cover a range of statistical methodology. Of the algorithms
+identified, all are able to identify when drift is occurring; some can highlight
+suspicious regions of the data in which drift is more significant; and others
+can also provide model retraining recommendations.
+
+Menelaus implements drift detectors for both streaming and batch data. In a
+streaming setting, data is arriving continuously and is processed one
+observation at a time. Streaming detectors process the data with each new
+observation that arrives and are intended for use cases in which instant
+analytical results are desired. In a batch setting, information is collected
+over a period of time. Once the predetermined set is "filled", data is fed
+into and processed by the drift detection algorithm as a single batch. Within a
+batch, there is no meaningful ordering of the data with respect to time. Batch
+algorithms are typically used when it is more important to process large volumes
+of information simultaneously, where the speed of results after receiving data
+is of less concern.
+
+In The Odyssey, Menelaus seeks a prophecy known by the shapeshifter
+Proteus. Menelaus holds Proteus down as he takes the form of a lion, a
+serpent, water, and so on. Eventually, Proteus relents, and Menelaus
+gains the answers he sought. Accordingly, this library provides tools
+for "holding" data as it shifts.
+
+# Detector List
+
+Menelaus implements the following drift detectors.
+
+| Type             | Detector                                                      | Abbreviation | Streaming | Batch |
+|------------------|---------------------------------------------------------------|--------------|-----------|-------|
+| Change detection | Cumulative Sum Test                                           | CUSUM        | x         |       |
+| Change detection | Page-Hinkley                                                  | PH           | x         |       |
+| Concept drift    | ADaptive WINdowing                                            | ADWIN        | x         |       |
+| Concept drift    | Drift Detection Method                                        | DDM          | x         |       |
+| Concept drift    | Early Drift Detection Method                                  | EDDM         | x         |       |
+| Concept drift    | Linear Four Rates                                             | LFR          | x         |       |
+| Concept drift    | Statistical Test of Equal Proportions to Detect concept drift | STEPD        | x         |       |
+| Data drift       | Confidence Distribution Batch Detection                       | CDBD         |           | x     |
+| Data drift       | Hellinger Distance Drift Detection Method                     | HDDDM        |           | x     |
+| Data drift       | kdq-Tree Detection Method                                     | kdq-Tree     | x         | x     |
+| Data drift       | PCA-Based Change Detection                                    | PCA-CD       | x         |       |
+
+
+The three main types of detector are described below. More details, including
+references to the original papers, can be found in the respective module
+documentation on [ReadTheDocs](https://menelaus.readthedocs.io/en/latest/).
+
+-   Change detectors monitor single variables in the streaming context,
+    and alarm when that variable starts taking on values outside of a
+    pre-defined range.
+-   Concept drift detectors monitor the performance characteristics of a
+    given model, trying to identify shifts in the joint distribution of
+    the data's feature values and their labels.
+-   Data drift detectors monitor the distribution of the features; in
+    that sense, they are model-agnostic. Such changes in distribution
+    might be to single variables or to the joint distribution of all the
+    features.
+
+The detectors may be applied in two settings, as described in the Background
+section:
+
+-   Streaming, in which each new observation that arrives is processed
+    separately, as it arrives.
+-   Batch, in which the data has no meaningful ordering with respect to time,
+    and the goal is comparing two datasets as a whole.
+
+Additionally, the library implements a kdq-Tree partitioner, for support of the
+kdq-Tree Detection Method. This data structure partitions a given feature space,
+then maintains a count of the number of samples from the given dataset that fall
+into each section of that partition. More details are given in the respective
+module.
+
+A flowchart breaking down these contexts can be found on the ReadTheDocs page under "Choosing a Detector."
+
+# Installation
+
+Create a virtual environment as desired, then:
+
+```shell
+# for read-only, install from pypi:
+pip install menelaus
+
+# to allow editing, running tests, generating docs, etc.
+# First, clone the git repo, then:
+cd ./menelaus/
+pip install -e .[dev] 
+```
+
+Menelaus should work with Python 3.8 or higher. 
+
+# Getting Started
+
+Each detector implements the API defined by `menelaus.drift_detector`:
+it has an `update` method which allows new data to be passed, a
+`drift_state` attribute which tells the user whether drift has been
+detected, and a `reset` method (generally called automatically by
+`update`) which clears the `drift_state` along with (usually) some other
+attributes specific to the detector class.
+
+Generally, the workflow for using a detector, given some data, is as
+follows:
+
+```python
+import pandas as pd
+from menelaus.concept_drift import ADWIN
+df = pd.read_csv('example.csv')
+detector = ADWIN()
+for i, row in df.iterrows():
+   detector.update(row['y_predicted'], row['y_true'])
+   if detector.drift_state is not None:
+      print("Drift has occurred!")
+```
+
+For this example, because ADWIN is a concept drift detector, it requires
+both a predicted value (`y_predicted`) and a true value (`y_true`), at
+each update step. Note that this requirement is not true for the
+detectors in other modules. More detailed examples, including code for
+visualizing drift locations, may be found in the `examples` directory, as
+stand-alone Python scripts. The examples along with output can also be viewed on
+the RTD website.
+
+# Contributing
+Install the library using the `[dev]` option, as above.
+
+- **Testing**
+
+  Unit tests can be run with the command `pytest`. By default, a
+  coverage report with highlighting will be generated in `htmlcov/index.html`.
+  These default settings are specified in `setup.cfg` under `[tool:pytest]`.
+
+- **Documentation**
+
+  HTML documentation can be generated at
+  `menelaus/docs/build/html/index.html` with:
+  ```shell
+  cd docs/source
+  sphinx-build . ../build
+  ```
+
+- **Formatting**
+
+  This project uses `black`, `bandit`, and `flake8` for code formatting,
+  security checks, and linting, respectively. To satisfy these requirements
+  when contributing, you may use them as the linter/formatter in your IDE, or
+  manually run the following from the root directory:
+  ```shell
+  flake8                 # linting
+  bandit -r ./src        # security checks
+  black ./src/menelaus   # formatting
+  ```  
+
+# Copyright
+
+Authors: Leigh Nicholl, Thomas Schill, India Lindsay, Anmol Srivastava, Kodie P McNamara, Shashank Jarmale.\
+©2022 The MITRE Corporation. ALL RIGHTS RESERVED\
+Approved for Public Release; Distribution Unlimited. Public Release\
+Case Number 22-0244.
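
The "Getting Started" section of the new README.md describes the detector API as an `update` method, a `drift_state` attribute, and a `reset` method that `update` generally calls automatically. The sketch below illustrates that contract with a toy change detector of the kind the README describes (it alarms when a value leaves a pre-defined range). `ToyRangeDetector`, its parameters, and the `"drift"` state string are hypothetical names chosen for illustration; they are not part of Menelaus and do not reproduce any of its algorithms.

```python
from typing import List, Optional


class ToyRangeDetector:
    """Hypothetical detector (not a Menelaus class) following the
    update / drift_state / reset pattern described in the README."""

    def __init__(self, lower: float, upper: float) -> None:
        self.lower = lower
        self.upper = upper
        # None means no drift; "drift" is an assumed label for an alarm.
        self.drift_state: Optional[str] = None
        self.samples_since_reset = 0

    def reset(self) -> None:
        """Clear the alarm and any per-run bookkeeping."""
        self.drift_state = None
        self.samples_since_reset = 0

    def update(self, value: float) -> None:
        """Process one observation from the stream."""
        # If the previous call raised an alarm, reset automatically
        # before handling the new observation.
        if self.drift_state == "drift":
            self.reset()
        self.samples_since_reset += 1
        if not (self.lower <= value <= self.upper):
            self.drift_state = "drift"


# Streaming usage: one observation at a time, checking state after each.
stream = [0.1, 0.4, 0.2, 3.5, 0.3]
detector = ToyRangeDetector(lower=0.0, upper=1.0)
alarms: List[int] = []
for i, value in enumerate(stream):
    detector.update(value)
    if detector.drift_state is not None:
        alarms.append(i)

print(alarms)
```

Because `update` clears a previous alarm before processing the next value, the caller only needs to inspect `drift_state` after each call, which mirrors the loop shown in the README's ADWIN example.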