V0.3.0beta (#85)

* initial TeehrDataset class layout
* initial TEEHR Dataset functionality
* initial TEEHRDataset functionality
* removed test code
* removed comment
* adds first api and web app that basically works
* fixes attr add, ads new test data
* remove some frontend code that was holding on
* make a couple of small changes to dataclass
* adds a few small bug fixes in new and existing code
* updates web api/app to use new dataclass
* fixes geometry join
* adds gitignore for web work
* small refactor, update gitignore, fix bug in get_metrics again
* adds no qa
* homepage draft
* separate TEEHRDatasetAPI and TEEHRDatasetDB classes
* pydantic v2, add geometry to queries, re-org
* adding timeseries queries, fastapi endpoints
* tests and cleanup
* add filters
* add operators endpoint
* add timepicker
* flex filters
* update vscode settings
* fix bug in get metrics query
* make scripot work with new patterns
* add pydantic>2 to req.
* 81-integrate poetry (#83)
* initial poetry integration
* integrating poetry, upgrading pangeo, python3.11
* poetry.lock
* revert back to python3.10
* readme update
* minor edit
* uncommenting dockerfile section after GTS fix
* adds a .dockerignore
---------
Co-authored-by: Sam Lamont <[email protected]>
Co-authored-by: Matt Denno <[email protected]>
* add v0.3.0beta to teehr-hub
* update build action
* v0.3.0b geometry issues (#91)
* fixing include_geometry validation
* version bump
---------
Co-authored-by: Sam Lamont <[email protected]>
* hack fix for build process
* 88 comments on v030b dataset (#96)
* Updated doc strings for teehr dataset class
* Docstring updates, time series query deduplication
* additional comment
* typo
* increment beta version
* small update to get_timeseries() and get_timeseries_chars()
* didn't quite get it fixed with last commit
* timeseries_name docstring, profile_query update
---------
Co-authored-by: Sam Lamont <[email protected]>
Co-authored-by: Matt Denno <[email protected]>
* update teehr-hub
* fix pydantic 2 issues
* remove test db from repo
* update test to use temp db
* updates release docs, info changelog.md
* update teehr-hub config
---------
Co-authored-by: Sam Lamont <[email protected]>
Co-authored-by: Manuel Alvarado <[email protected]>
Co-authored-by: samlamont <[email protected]>
RTIInternational · Dec 8, 2023 · 92b1f46
1 parent 5618138
commit 92b1f46
Showing 77 changed files with 15,489 additions and 907 deletions.
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,13 @@
+.github
+.ipynb_checkpoints
+.pytest_cache
+.vscode
+dashboards
+dist
+docs
+examples
+frontend
+playground
+study_template
+teehr-hub
+tests
diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
@@ -54,7 +54,6 @@ jobs:
         with:
           cosign-release: 'v2.1.1' # optional
 
-
       # Workaround: https://github.com/docker/build-push-action/issues/461
       - name: Setup Docker buildx
         uses: docker/setup-buildx-action@79abd3f86f79a9d68a23c75a09a9a85889262adf

diff --git a/.gitignore b/.gitignore
@@ -129,4 +129,4 @@ dmypy.json
 .pyre/
 
 # Tests output
-temp/
+temp/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,19 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.3.0] - 2023-12-08
+
+### Added
+* Adds a dataclass and database that allow preprocessing of joined timeseries and attributes, as well as the addition of user-defined functions.
+* Adds an initial web service API that serves out `timeseries` and `metrics` along with some other supporting data.
+* Adds an initial interactive web application using the web service API.
+
+### Changed
+* Switches to poetry to manage Python venv
+* Upgrades to Pydantic 2+
+* Upgrades to Pangeo image `pangeo/pangeo-notebook:2023.09.11`
+
+
 ## [0.2.9] - 2023-12-08
 
 ### Added

diff --git a/Dockerfile b/Dockerfile
@@ -13,13 +13,14 @@ RUN TEEHR_VERSION=$(cat /teehr/version.txt) && \
 
 # Install TEEHR in the Pangeo Image
 # https://hub.docker.com/r/pangeo/pangeo-notebook/tags
-FROM pangeo/pangeo-notebook:2023.07.05
+# Subsequent images use python=3.11
+FROM pangeo/pangeo-notebook:2023.09.11
 
 USER root
 ENV DEBIAN_FRONTEND=noninteractive
 ENV PATH ${NB_PYTHON_PREFIX}/bin:$PATH
 
-# Needed for apt-key to work
+# Needed for apt-key to work -- Is this part needed?
 RUN apt-get update -qq --yes > /dev/null && \
     apt-get install --yes -qq gnupg2 > /dev/null && \
     rm -rf /var/lib/apt/lists/*

diff --git a/README.md b/README.md
@@ -14,29 +14,32 @@ assess their skill and performance.
 NOTE: THIS PROJECT IS UNDER DEVELOPMENT - EXPECT TO FIND BROKEN AND INCOMPLETE CODE.
 
 ## How to Install TEEHR
-Install with from source
-
+Install poetry
+```bash
+$ pipx install poetry
+```
+Install from source
 ```bash
 # Create and activate python environment, requires python >= 3.10
-$ python3 -m venv .venv
-$ source .venv/bin/activate
-$ python3 -m pip install --upgrade pip
-
-# Build and install from source
-$ python3 -m pip install --upgrade build
-$ python -m build
-$ python -m pip install dist/teehr-0.2.9.tar.gz
+$ poetry shell
+
+# Install from source
+$ poetry install
 ```
 
 Install from GitHub
 ```bash
+# Using pip
 $ pip install 'teehr @ git+https://github.com/RTIInternational/teehr@[BRANCH_TAG]'
+
+# Using poetry
+$ poetry add git+https://github.com/RTIInternational/teehr.git#[BRANCH_TAG]
 ```
 
 Use Docker
 ```bash
-$ docker build -t teehr:v0.2.9 .
-$ docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.2.9 jupyter lab --ip 0.0.0.0 $HOME
+$ docker build -t teehr:v0.3.0 .
+$ docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.3.0 jupyter lab --ip 0.0.0.0 $HOME
 ```
 
 ## Examples

diff --git a/docs/release_process.md b/docs/release_process.md
@@ -4,6 +4,7 @@ This document describes the release process which has some manual steps to compl
 Create branch with the following updated to the new version (find and replace version number):
 - `version.txt`
 - `README.md`
+- `pyproject.toml`
 
 Update the `CHANGELOG.md` to reflect the changes included in the release.
 

diff --git a/examples/loading/create_database.py b/examples/loading/create_database.py
@@ -0,0 +1,221 @@
+"""
+This script provides an example of how to create a TEEHR database
+and insert joined timeseries, append attributes, and add user
+defined fields.
+"""
+from pathlib import Path
+from teehr.database.teehr_dataset import TEEHRDatasetDB
+import time
+import datetime
+
+
+TEST_STUDY_DIR = Path("/home/matt/temp/huc1802_retro")
+PRIMARY_FILEPATH = Path(TEST_STUDY_DIR, "timeseries", "usgs.parquet")
+SECONDARY_FILEPATH = Path(TEST_STUDY_DIR, "timeseries", "nwm2*.parquet")
+CROSSWALK_FILEPATH = Path(TEST_STUDY_DIR, "geo", "usgs_nwm2*_crosswalk.parquet") # noqa
+ATTRIBUTES_FILEPATH = Path(TEST_STUDY_DIR, "geo", "usgs_attr_*.parquet")
+GEOMETRY_FILEPATH = Path(TEST_STUDY_DIR,  "geo", "usgs_geometry.parquet")
+DATABASE_FILEPATH = Path(TEST_STUDY_DIR, "huc1802_retro.db")
+
+# Test data
+# TEST_STUDY_DIR = Path("tests/data/test_study")
+# PRIMARY_FILEPATH = Path(TEST_STUDY_DIR, "timeseries", "test_short_obs.parquet") # noqa
+# SECONDARY_FILEPATH = Path(TEST_STUDY_DIR, "timeseries", "test_short_fcast.parquet") # noqa
+# CROSSWALK_FILEPATH = Path(TEST_STUDY_DIR, "geo", "crosswalk.parquet")
+# ATTRIBUTES_FILEPATH = Path(TEST_STUDY_DIR, "geo", "test_attr2.parquet")
+# GEOMETRY_FILEPATH = Path(TEST_STUDY_DIR,  "geo", "gages.parquet")
+# DATABASE_FILEPATH = Path(TEST_STUDY_DIR, "temp_test.db")
+
+
+def describe_inputs():
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+
+    # Check the parquet files and report some stats to the user (WIP)
+    df = tds.describe_inputs(
+        primary_filepath=PRIMARY_FILEPATH,
+        secondary_filepath=SECONDARY_FILEPATH
+    )
+
+    print(df)
+
+
+def create_db_add_timeseries():
+
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+
+    # Perform the join and insert into duckdb database
+    # NOTE: Right now this will re-join and overwrite
+    print("Creating joined table")
+    tds.insert_joined_timeseries(
+        primary_filepath=PRIMARY_FILEPATH,
+        secondary_filepath=SECONDARY_FILEPATH,
+        crosswalk_filepath=CROSSWALK_FILEPATH
+        )
+    tds.insert_geometry(geometry_filepath=GEOMETRY_FILEPATH)
+
+
+def add_attributes():
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+
+    # Join (one or more?) table(s) of attributes to the timeseries table
+    print("Adding attributes")
+    tds.join_attributes(ATTRIBUTES_FILEPATH)
+
+
+def add_fields():
+
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+
+    # Calculate and add a field based on some user-defined function (UDF).
+    def test_user_function(arg1: float, arg2: str) -> float:
+        """Function arguments are fields in joined_timeseries, and
+        should have the same data type.
+        Note: In the data model, attribute values are always str type"""
+        return float(arg1) / float(arg2)
+
+    parameter_names = ["primary_value", "upstream_area_km2"]
+    new_field_name = "primary_normalized_discharge"
+    new_field_type = "FLOAT"
+    tds.calculate_field(new_field_name=new_field_name,
+                        new_field_type=new_field_type,
+                        parameter_names=parameter_names,
+                        user_defined_function=test_user_function)
+
+    # Calculate and add a field based on some user-defined function (UDF).
+    def add_month_field(arg1: datetime.datetime) -> int:
+        """Function arguments are fields in joined_timeseries, and
+        should have the same data type.
+        Note: In the data model, attribute values are always str type"""
+        return arg1.month
+
+    parameter_names = ["value_time"]
+    new_field_name = "month"
+    new_field_type = "INTEGER"
+    tds.calculate_field(new_field_name=new_field_name,
+                        new_field_type=new_field_type,
+                        parameter_names=parameter_names,
+                        user_defined_function=add_month_field)
+
+    # Calculate and add a field based on some user-defined function (UDF).
+    def exceed_2yr_recurrence(arg1: float, arg2: float) -> bool:
+        """Function arguments are fields in joined_timeseries, and
+        should have the same data type.
+        Note: In the data model, attribute values are always str type"""
+        return float(arg1) > float(arg2)
+
+    parameter_names = ["primary_value", "retro_2yr_recurrence_flow_cms"]
+    new_field_name = "exceed_2yr_recurrence"
+    new_field_type = "BOOLEAN"
+    tds.calculate_field(new_field_name=new_field_name,
+                        new_field_type=new_field_type,
+                        parameter_names=parameter_names,
+                        user_defined_function=exceed_2yr_recurrence)
+    pass
+
+
+def run_metrics_query():
+
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+    # schema_df = tds.get_joined_timeseries_schema()
+    # print(schema_df[["column_name", "column_type"]])
+
+    # Get metrics
+    group_by = ["primary_location_id", "configuration"]
+    order_by = ["primary_location_id"]
+    include_metrics = ["mean_error", "bias"]
+    filters = [
+        # {
+        #     "column": "primary_location_id",
+        #     "operator": "=",
+        #     "value": "usgs-11337080"
+        # },
+        # {
+        #     "column": "month",
+        #     "operator": "=",
+        #     "value": 1
+        # },
+        # {
+        #     "column": "upstream_area_km2",
+        #     "operator": ">",
+        #     "value": 1000
+        # },
+        # {
+        #     "column": "exceed_2yr_recurrence",
+        #     "operator": "=",
+        #     "value": True
+        # }
+    ]
+
+    t1 = time.time()
+    df1 = tds.get_metrics(
+        group_by=group_by,
+        order_by=order_by,
+        filters=filters,
+        include_metrics=include_metrics,
+        include_geometry=True,
+        # return_query=True
+    )
+    print(df1)
+    print(f"Database query: {(time.time() - t1):.2f} secs")
+
+    pass
+
+
+def describe_database():
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+    df = tds.get_joined_timeseries_schema()
+    print(df)
+
+
+def run_raw_query():
+
+    tds = TEEHRDatasetDB(DATABASE_FILEPATH)
+    query = """
+        WITH joined as (
+            SELECT
+                *
+            FROM joined_timeseries
+        )
+        , metrics AS (
+            SELECT
+                joined.primary_location_id,joined.configuration
+                , sum(primary_value - secondary_value)/count(*) as bias
+                , sum(absolute_difference)/count(*) as mean_error
+            FROM
+                joined
+            GROUP BY
+                joined.primary_location_id,joined.configuration
+        )
+        SELECT
+            metrics.*
+            ,gf.geometry as geometry
+        FROM metrics
+        JOIN geometry gf
+            on primary_location_id = gf.id
+        ORDER BY
+            metrics.primary_location_id
+    ;
+    """
+    # query = f"""
+    #     COPY (
+    #         SELECT * FROM joined_timeseries
+    #     )
+    #     TO '{str(Path(TEST_STUDY_DIR, "huc1802_retro.parquet"))}' (
+    #         FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000
+    #     )
+    # ;"""
+    df = tds.query(query, format="df")
+    print(df)
+
+
+if __name__ == "__main__":
+    # create_db_add_timeseries()
+    # describe_inputs()
+    # describe_database()
+    # add_attributes()
+    # describe_database()
+    # add_fields()
+    # describe_database()
+    # run_metrics_query()
+    # run_raw_query()
+    pass
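
Note on the raw query in run_raw_query() above: it groups joined_timeseries by primary_location_id and configuration, then computes bias as the mean of (primary_value - secondary_value) and mean_error as the mean of absolute_difference. The same aggregation can be sketched in plain Python to check the arithmetic by hand. This is only an illustration, not TEEHR code: the sample rows and the compute_metrics helper are hypothetical, and absolute_difference is assumed to equal abs(primary_value - secondary_value).

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the joined_timeseries table:
# (primary_location_id, configuration, primary_value, secondary_value)
rows = [
    ("gage-A", "config-1", 10.0, 12.0),
    ("gage-A", "config-1", 8.0, 7.0),
    ("gage-B", "config-1", 5.0, 5.5),
]


def compute_metrics(rows):
    """Mirror the GROUP BY in the raw query for bias and mean_error."""
    # Per group: [sum of differences, sum of absolute differences, row count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for location_id, configuration, primary, secondary in rows:
        acc = sums[(location_id, configuration)]
        acc[0] += primary - secondary
        # Assumption: absolute_difference == abs(primary - secondary)
        acc[1] += abs(primary - secondary)
        acc[2] += 1
    # Sort by key to mimic ORDER BY primary_location_id
    return {
        key: {"bias": diff / n, "mean_error": abs_diff / n}
        for key, (diff, abs_diff, n) in sorted(sums.items())
    }


metrics = compute_metrics(rows)
for key, values in metrics.items():
    print(key, values)
```

For gage-A the differences are -2.0 and 1.0, so bias is -0.5 while mean_error is 1.5. Positive and negative differences cancel in bias but not in mean_error.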
diff --git a/frontend/teehr/.eslintrc.cjs b/frontend/teehr/.eslintrc.cjs
@@ -0,0 +1,20 @@
+module.exports = {
+  root: true,
+  env: { browser: true, es2020: true },
+  extends: [
+    'eslint:recommended',
+    'plugin:react/recommended',
+    'plugin:react/jsx-runtime',
+    'plugin:react-hooks/recommended',
+  ],
+  ignorePatterns: ['dist', '.eslintrc.cjs'],
+  parserOptions: { ecmaVersion: 'latest', sourceType: 'module' },
+  settings: { react: { version: '18.2' } },
+  plugins: ['react-refresh'],
+  rules: {
+    'react-refresh/only-export-components': [
+      'warn',
+      { allowConstantExport: true },
+    ],
+  },
+}
diff --git a/frontend/teehr/.gitignore b/frontend/teehr/.gitignore
@@ -1,2 +1,24 @@
-*
-!.gitignore
+# Logs
+logs
+*.log
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+pnpm-debug.log*
+lerna-debug.log*
+
+node_modules
+dist
+dist-ssr
+*.local
+
+# Editor directories and files
+.vscode/*
+!.vscode/extensions.json
+.idea
+.DS_Store
+*.suo
+*.ntvs*
+*.njsproj
+*.sln
+*.sw?
diff --git a/frontend/teehr/README.md b/frontend/teehr/README.md
@@ -0,0 +1,8 @@
+# React + Vite
+
+This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
+
+Currently, two official plugins are available:
+
+- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react/README.md) uses [Babel](https://babeljs.io/) for Fast Refresh
+- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh