v.0.1.2 - Final merge dev to main (#63)
* Enable linting in pipeline (#34)

* configure and run flake8

* use sphinx-bibtext (#35)

* kdqTree: Added compatibility with pandas dataframes (#40)

Transferred changes for issue #133 on kdqtree pandas from GL to GH

* Merge unit tests for example materials (#43)

* split src / example tests, reorganize test directories

* add notebook tester towards #8

* add venv/kernel steps to allow for nbconvert tests

* Add separate coverage test, workflow improvements (#44)

* move coverage to separate workflow, fail under 100

* revert to 1 combined job for test + cov, fail under 99 (#5)

* separate lint workflow, add isort

* rework black step, remove isort

* fix black version

* run black, update README towards #44

* add badges to readme

* fix badge section

* Update .github/workflows/tests.yml

* Update README.rst

* Update README.rst

* Update README.rst

* Update README.rst

Co-authored-by: Thomas Schill <[email protected]>

* Add unit tests for kdqTree (#47)

* Adds more unit tests for kdqTree

* add validation unit tests

* Add a unit test for KDQTreePartitioner

* add reset to set_reference

* use ref_data variable properly when drift occurs

* Update hdm docs (#50)

* updated docs

* closes issue 33

* apply black formatting

Co-authored-by: Thomas Schill <[email protected]>

* Update README_dataCircleGSev3Sp3Train.txt (#52)

* Adds notebook versions of the examples to RTD. (#53)

* Started testing out python example notebooks with sphinx

* update conf.py to enable pandoc, move examples around

* Added example notebooks for data drift detectors

* Added example notebooks for modules

* remove extra notebooks, fix plotly plots

Co-authored-by: Shashank Jarmale

* add tox to dev install

* Fix kdq-tree batch example in documentation example notebook (#54)

Duplicating the examples for the purpose of the documentation got us an older version pulled forward that I didn't catch during review of the PR.

* Minor updates to constrain Python version used in installation (#55)

* add tox python version checks towards #2

* Update .gitignore

* fix syntax

* add version notes

* remove older versions from tox

* Update README to include pyenv steps, towards #2

* remove pyenv section

Co-authored-by: Thomas Schill <[email protected]>

* Merge new data module and reorganize data files (#56)

* add (untested) .python that duplicates make_example_data.R

* add TODO items

* reorganize tools => datasets towards #38

* further reorganize datasets module, add DataGenerator idea

* split DataGenerator idea, and fix bugs in make_example_batch_data

* update any example_data.csv script to now use function

* consolidate dataset descriptions into one README

* debug make_example_data

* delete outdated data files towards #38

* remove TODO comment

* satisfy formatting requirements

* add unit tests and comment out untested code

* comment out missing code, add single-line description towards #38

* minor formatting changes to trigger checks

* debug unit tests, re-satisfy formatting requirements

* update references in docs notebooks, add generator docstring
 also fixes some whitespace in cdbd.py

Co-authored-by: Thomas Schill <[email protected]>

* Merge new streaming, batch ABCs and refactor KdqTree detector (#62)

* separate into streaming and batch detector ABCs (#46)

* split kdqtree into streaming/batch versions, update tests

* finish batch version of kdqtree

* begin using multiple inheritance scheme for kdqtree detectors (#46)

* establish commonly inherited functionality in new KdqTreeDetector class

* establish commonly inherited functionality in new KdqTree detector class (#46)

* deconstruct update to enable code reuse in KdqTreeDetector (#46)

* debug all failing tests in test_kdqtree (#46)

* update __init__, update refs in examples (#46)

* update outdated data refs

* add any missing docstrings (#46)

* format with black

* add unit test for new ABC drift setters

* updated the data_drift_examples notebook

* docstring formatting tweaks

* fix typo

Co-authored-by: Thomas Schill <[email protected]>

* fix formatting in docstring

Co-authored-by: Thomas Schill <[email protected]>

* fix formatting in docstring

Co-authored-by: Thomas Schill <[email protected]>

* fix typo

Co-authored-by: Thomas Schill <[email protected]>

* fix description in docstring

Co-authored-by: Thomas Schill <[email protected]>

* formatting

* remove double-documented attributes from docstring

* provide useful information in child docstrings

* move _drift_counter into KdqTreeStreaming

* delete coverage file

* toss ref data once processed

* format with black

Co-authored-by: Thomas Schill <[email protected]>

* Switch to README.md for better rendering on github (#49)

* switch to README.md for better rendering on github
 - removes reference links from table
 - adds placeholder mermaid flow diagram
 - makes some tweaks to the README text

* update requirements in setup.cfg

* test mermaid rendering

* Add "Choosing a Detector" page to TOC

* tweak README text

* add RTD hyperlink

* Merge with current dev

* remove draft flow diagram

* Add CHANGELOG and pypi actions for release.  (#51)

* add CHANGELOG, yaml

* add Action to push to pypi upon published release

* change name of workflow

* test adding security linter

* test bandit linting

* comments

* alphabetize setup.cfg.test

* increment version number

* change lint badge name

* address comments for main-dev PR 63
tms-bananaquit committed Jul 11, 2022
1 parent d754e07 commit 8bdac6d
Showing 76 changed files with 6,144 additions and 305,323 deletions.
34 changes: 34 additions & 0 deletions .github/workflows/examples.yml
@@ -0,0 +1,34 @@
# This workflow will set up the environment and run all scripts/notebooks found in /examples.

name: examples

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: "3.10"
    - name: Install dependencies
      run: |
        python -m venv ./venv
        source venv/bin/activate
        python -m pip install --upgrade pip
        pip install -e .[test]
    - name: Test examples
      run: |
        source venv/bin/activate
        ipython kernel install --name "venv" --user
        cd tests/examples
        pytest
30 changes: 19 additions & 11 deletions lint-test.yml → .github/workflows/format.yml
@@ -1,11 +1,11 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# This workflow will lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Lint and Test
name: linting | security

on:
push:
branches: [ "main", "dev"]
branches: [ "main", "dev" ]
pull_request:
branches: [ "main", "dev" ]

@@ -18,25 +18,33 @@ jobs:
runs-on: ubuntu-latest

steps:

- uses: actions/checkout@v3
- name: Set up Python 3.10
uses: actions/setup-python@v3
with:
python-version: "3.10"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .[dev]
- name: Lint with black
run: |
black ./src/menelaus
pip install -e .[format]
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics
- name: Format with black
uses: psf/black@stable
with:
options: "--check --verbose"
src: "./src/menelaus"
version: "22.3.0"

- name: Security check with bandit
run: |
pytest --cov=src/ --cov-report term
coverage report -m
# exits with code 0 if there are no errors, otherwise complains
bandit -q -r ./src/
26 changes: 26 additions & 0 deletions .github/workflows/pypi_push.yml
@@ -0,0 +1,26 @@
# as described here: https://www.caktusgroup.com/blog/2021/02/11/automating-pypi-releases/

name: Upload Python Package

on:
  release:
    types: [released]

jobs:
  deploy:
    runs-on: ubuntu-20.04

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
19 changes: 5 additions & 14 deletions .github/workflows/github_ci.yml → .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Lint and Test
name: tests | coverage

on:
push:
branches: [ "main", "dev"]
branches: [ "main", "dev" ]
pull_request:
branches: [ "main", "dev" ]

@@ -27,16 +27,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install -e .[dev]
- name: Format with black
run: |
black ./src/menelaus
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
- name: Unit and coverage tests
run: |
pytest --cov=src/ --cov-report term
coverage report -m
pytest tests/menelaus --cov=src/ --cov-report term
coverage report -m --fail-under=100
30 changes: 30 additions & 0 deletions .github/workflows/update-changelog.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# As described on https://stefanzweifel.io/posts/2021/11/13/introducing-the-changelog-updater-action

name: 'Update Changelog'

on:
  release:
    types: [released]

jobs:
  update:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          ref: main

      - name: Update Changelog
        uses: stefanzweifel/changelog-updater-action@v1
        with:
          release-notes: ${{ github.event.release.body }}
          latest-version: ${{ github.event.release.name }}

      - name: Commit updated CHANGELOG
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          branch: main
          commit_message: Update CHANGELOG
          file_pattern: CHANGELOG.md
1 change: 1 addition & 0 deletions .gitignore
@@ -14,3 +14,4 @@ _build
*.DS_Store
.idea/
*.png
*.tox*
13 changes: 11 additions & 2 deletions .readthedocs.yml
Original file line number Diff line number Diff line change
@@ -1,3 +1,12 @@
version: 2

sphinx:
configuration: docs/source/conf.py

python:
version: "3.8"
setup_py_install: true
version: "3.8"
install:
- method: pip
path: .
extra_requirements:
- dev
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,9 @@
# Changelog

Notable changes to Menelaus will be documented here.

## v0.1.1 - Jun 7, 2022

- Initial public release.
- Published to pypi.
- Published to readthedocs.io.
174 changes: 174 additions & 0 deletions README.md
@@ -0,0 +1,174 @@
[![tests](https://github.com/mitre/menelaus/actions/workflows/tests.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/tests.yml)
[![Documentation Status](https://readthedocs.org/projects/menelaus/badge/?version=latest)](https://menelaus.readthedocs.io/en/latest/?badge=latest)
[![examples](https://github.com/mitre/menelaus/actions/workflows/examples.yml/badge.svg?branch=main)](https://github.com/mitre/menelaus/actions/workflows/examples.yml)
[![lint](https://github.com/mitre/menelaus/actions/workflows/format.yml/badge.svg)](https://github.com/mitre/menelaus/actions/workflows/format.yml)

# Background

Menelaus implements algorithms for the purposes of drift detection. Drift
detection is a branch of machine learning focused on the detection of unforeseen
shifts in data. The relationships between variables in a dataset are rarely
static and can be affected by changes in both internal and external factors,
e.g. changes in data collection techniques, external protocols, and/or
population demographics. Both undetected changes in data and undetected model
underperformance pose risks to the users thereof. The aim of this package is to
enable monitoring of data and of model performance.

The algorithms contained within this package were identified through a
comprehensive literature survey. Menelaus' aim was to implement drift detection
algorithms that cover a range of statistical methodology. Of the algorithms
identified, all are able to identify when drift is occurring; some can highlight
suspicious regions of the data in which drift is more significant; and others
can also provide model retraining recommendations.

Menelaus implements drift detectors for both streaming and batch data. In a
streaming setting, data is arriving continuously and is processed one
observation at a time. Streaming detectors process the data with each new
observation that arrives and are intended for use cases in which instant
analytical results are desired. In a batch setting, information is collected
over a period of time. Once the predetermined set is "filled", data is fed
into and processed by the drift detection algorithm as a single batch. Within a
batch, there is no meaningful ordering of the data with respect to time. Batch
algorithms are typically used when it is more important to process large volumes
of information simultaneously, where the speed of results after receiving data
is of less concern.
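
To make the distinction concrete, here is a minimal sketch in plain Python. It does not use Menelaus: `stream` and `batches` are hypothetical helper names invented for this illustration, and the "processing" is only a running mean or a per-batch mean.

```python
from itertools import islice
from statistics import mean


def stream(observations):
    """Streaming: produce a result after every single observation."""
    total = 0.0
    for n, x in enumerate(observations, start=1):
        total += x
        yield total / n  # running mean so far


def batches(observations, size):
    """Batch: wait until `size` observations are collected, then process them together."""
    it = iter(observations)
    while True:
        chunk = list(islice(it, size))
        if len(chunk) < size:
            return  # in this sketch, an unfilled final batch is never processed
        yield mean(chunk)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
streaming_results = list(stream(data))  # one result per observation
batch_results = list(batches(data, 3))  # one result per filled batch
```

The streaming version yields seven results, one per observation, while the batch version yields only two, because the seventh observation never fills a third batch.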

In The Odyssey, Menelaus seeks a prophecy known by the shapeshifter
Proteus. Menelaus holds Proteus down as he takes the form of a lion, a
serpent, water, and so on. Eventually, Proteus relents, and Menelaus
gains the answers he sought. Accordingly, this library provides tools
for "holding" data as it shifts.

# Detector List

Menelaus implements the following drift detectors.

| Type | Detector | Abbreviation | Streaming | Batch |
|------------------|---------------------------------------------------------------|--------------|-----------|-------|
| Change detection | Cumulative Sum Test | CUSUM | x | |
| Change detection | Page-Hinkley | PH | x | |
| Concept drift | ADaptive WINdowing | ADWIN | x | |
| Concept drift | Drift Detection Method | DDM | x | |
| Concept drift | Early Drift Detection Method | EDDM | x | |
| Concept drift | Linear Four Rates | LFR | x | |
| Concept drift | Statistical Test of Equal Proportions to Detect concept drift | STEPD | x | |
| Data drift | Confidence Distribution Batch Detection | CDBD | | x |
| Data drift | Hellinger Distance Drift Detection Method | HDDDM | | x |
| Data drift | kdq-Tree Detection Method | kdq-Tree | x | x |
| Data drift | PCA-Based Change Detection | PCA-CD | x | |


The three main types of detector are described below. More details, including
references to the original papers, can be found in the respective module
documentation on [ReadTheDocs](https://menelaus.readthedocs.io/en/latest/).

- Change detectors monitor single variables in the streaming context,
and alarm when that variable starts taking on values outside of a
pre-defined range.
- Concept drift detectors monitor the performance characteristics of a
given model, trying to identify shifts in the joint distribution of
the data's feature values and their labels.
- Data drift detectors monitor the distribution of the features; in
that sense, they are model-agnostic. Such changes in distribution
might be to single variables or to the joint distribution of all the
features.
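
As an illustration of the change-detection idea, the sketch below implements a textbook one-sided CUSUM check in plain Python. It is not Menelaus' implementation, and it only watches for upward shifts. The function name `cusum_alarm` and the parameters `target`, `slack`, and `threshold` are assumptions chosen for this example.

```python
def cusum_alarm(values, target, slack, threshold):
    """Return the index of the first upward-shift alarm, or None if there is none.

    Deviations above `target + slack` are accumulated; the sum is clipped at
    zero so that in-range values do not build up credit.
    """
    cumulative = 0.0
    for i, x in enumerate(values):
        cumulative = max(0.0, cumulative + (x - target - slack))
        if cumulative > threshold:
            return i
    return None


stable = [0.1, -0.2, 0.0, 0.2, -0.1]
shifted = stable + [1.5, 1.4, 1.6, 1.5]  # the mean jumps after index 4

no_alarm = cusum_alarm(stable, target=0.0, slack=0.5, threshold=2.0)
first_alarm = cusum_alarm(shifted, target=0.0, slack=0.5, threshold=2.0)
```

With these inputs the stable series never alarms, and the shifted series alarms a few observations after the jump, once enough evidence has accumulated.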

The detectors may be applied in two settings, as described in the Background
section:

- Streaming, in which each new observation that arrives is processed
separately, as it arrives.
- Batch, in which the data has no meaningful ordering with respect to time,
and the goal is comparing two datasets as a whole.

Additionally, the library implements a kdq-Tree partitioner, for support of the
kdq-Tree Detection Method. This data structure partitions a given feature space,
then maintains a count of the number of samples from the given dataset that fall
into each section of that partition. More details are given in the respective
module.
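
The partition-then-count idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's partitioner: cells are halved at their midpoint along alternating axes, and the names `build_partition`, `count_in_cells`, `max_count`, and `min_width` are hypothetical.

```python
def build_partition(points, bounds, max_count, min_width, axis=0):
    """Recursively halve `bounds` until a cell is sparse or small enough.

    Returns a list of leaf cells, each a tuple of (low, high) pairs per axis.
    """
    low, high = bounds[axis]
    if len(points) <= max_count or (high - low) / 2 < min_width:
        return [tuple(bounds)]
    mid = (low + high) / 2
    left_bounds, right_bounds = list(bounds), list(bounds)
    left_bounds[axis] = (low, mid)
    right_bounds[axis] = (mid, high)
    left_pts = [p for p in points if p[axis] < mid]
    right_pts = [p for p in points if p[axis] >= mid]
    next_axis = (axis + 1) % len(bounds)  # cycle through the features
    return (
        build_partition(left_pts, left_bounds, max_count, min_width, next_axis)
        + build_partition(right_pts, right_bounds, max_count, min_width, next_axis)
    )


def count_in_cells(points, cells):
    """Count how many points fall into each leaf cell (half-open intervals)."""
    counts = [0] * len(cells)
    for p in points:
        for i, cell in enumerate(cells):
            if all(lo <= x < hi for x, (lo, hi) in zip(p, cell)):
                counts[i] += 1
                break
    return counts


reference = [(0.1, 0.1), (0.2, 0.2), (0.15, 0.3), (0.8, 0.9), (0.7, 0.2)]
new_data = [(0.9, 0.9), (0.6, 0.1), (0.3, 0.1), (0.9, 0.5)]

# The partition is built once from the reference data, then reused for counting.
cells = build_partition(reference, [(0.0, 1.0), (0.0, 1.0)], max_count=2, min_width=0.1)
reference_counts = count_in_cells(reference, cells)
new_counts = count_in_cells(new_data, cells)
```

Because both datasets are counted against the same cells, the two count vectors are directly comparable; how a detector scores the difference between them is not shown here.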

A flowchart breaking down these contexts can be found on the ReadTheDocs page under "Choosing a Detector."

# Installation

Create a virtual environment as desired, then:

```bash
# for read-only, install from pypi:
pip install menelaus

# to allow editing, running tests, generating docs, etc.
# First, clone the git repo, then:
cd ./menelaus/
pip install -e .[dev]
```

Menelaus should work with Python 3.8 or higher.

# Getting Started

Each detector implements the API defined by `menelaus.drift_detector`:
they have an `update` method which allows new data to be passed, a
`drift_state` attribute which tells the user whether drift has been
detected, and a `reset` method (generally called automatically by
`update`) which clears the `drift_state` along with (usually) some other
attributes specific to the detector class.
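
The contract described above can be sketched with a toy class. This is not a Menelaus detector: `ThresholdDetector` and its `limit` parameter are hypothetical, and representing detected drift with the string `"drift"` is an assumption made for the illustration.

```python
class ThresholdDetector:
    """Toy detector exposing `update`, `drift_state`, and `reset`."""

    def __init__(self, limit):
        self.limit = limit
        self.drift_state = None  # None means no drift has been flagged

    def reset(self):
        """Clear the drift flag so monitoring starts afresh."""
        self.drift_state = None

    def update(self, value):
        """Consume one observation, resetting first if drift was flagged last time."""
        if self.drift_state == "drift":
            self.reset()
        if abs(value) > self.limit:
            self.drift_state = "drift"


detector = ThresholdDetector(limit=3.0)
states = []
for value in [0.5, -1.0, 4.2, 0.3]:
    detector.update(value)
    states.append(detector.drift_state)
```

Here the third observation sets `drift_state`, and the next call to `update` clears it automatically through `reset`, which is why callers should inspect `drift_state` after each update.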

Generally, the workflow for using a detector, given some data, is as
follows:

```python
import pandas as pd
from menelaus.concept_drift import ADWIN

df = pd.read_csv('example.csv')
detector = ADWIN()
for i, row in df.iterrows():
    # feed one observation at a time
    detector.update(row['y_predicted'], row['y_true'])
    # check the detector's state after each update
    if detector.drift_state is not None:
        print("Drift has occurred!")
```

For this example, because ADWIN is a concept drift detector, it requires
both a predicted value (`y_predicted`) and a true value (`y_true`), at
each update step. Note that this requirement is not true for the
detectors in other modules. More detailed examples, including code for
visualizing drift locations, may be found in the `examples` directory, as
stand-alone python scripts. The examples along with output can also be viewed on
the RTD website.

# Contributing
Install the library using the `[dev]` option, as above.

- **Testing**

Unit tests can be run with the command `pytest`. By default, a
coverage report with highlighting will be generated in `htmlcov/index.html`.
These default settings are specified in `setup.cfg` under `[tool:pytest]`.

- **Documentation**

HTML documentation can be generated at
`menelaus/docs/build/html/index.html` with:
```bash
cd docs/source
sphinx-build . ../build
```

- **Formatting**:

This project uses `black` for code formatting, `flake8` for linting, and
`bandit` for security checks. To satisfy these requirements when contributing, you
may use them as the linter/formatter in your IDE, or manually run the
following from the root directory:
```bash
flake8 # linting
bandit -r ./src # security checks
black ./src/menelaus # formatting
```

# Copyright

Authors: Leigh Nicholl, Thomas Schill, India Lindsay, Anmol Srivastava, Kodie P McNamara, Shashank Jarmale.\
©2022 The MITRE Corporation. ALL RIGHTS RESERVED\
Approved for Public Release; Distribution Unlimited. Public Release\
Case Number 22-0244.