Code from the climate data challenge in the challenge module cdk1.

julienkellerhals/klimadaten-api

Climate API Readme

Requirements

The following programs and modules are required to run the climate API:

  • Python 3.7
  • Postgres
  • Modules

Module                      Version
dash_bootstrap_components   0.12.2
SQLAlchemy_Utils            0.36.8
Flask                       1.1.2
dash                        1.19.0
msedge_selenium_tools       3.141.3
statsmodels                 0.12.0
urllib3                     1.25.11
plotly                      4.14.3
dash_core_components        1.3.1
numpy                       1.19.2
selenium                    3.141.0
SQLAlchemy                  1.4.15
requests                    2.24.0
dash_html_components        1.0.1
pytest                      0.0.0
lxml                        4.6.1
pandas                      1.1.3
python_dateutil             2.8.1
scikit_learn                0.24.2
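
The module list can be turned into pinned pip requirement lines so that all modules are installed in one step. The following is a minimal sketch, not part of the repository: the function name `to_requirements` is hypothetical, and treating version 0.0.0 as "no version recorded" is an assumption.

```python
def to_requirements(listing: str) -> list:
    """Turn 'name version' lines into pip requirement specifiers."""
    requirements = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, version = parts
        if not version[0].isdigit():
            continue  # skip the 'Module Version' header line
        if version == "0.0.0":
            # Assumption: 0.0.0 is a placeholder, so the module stays unpinned.
            requirements.append(name)
        else:
            requirements.append(f"{name}=={version}")
    return requirements


# A few rows copied from the module list above.
LISTING = """
Module Version
Flask 1.1.2
pytest 0.0.0
pandas 1.1.3
"""

if __name__ == "__main__":
    # Print lines that could be saved as requirements.txt.
    print("\n".join(to_requirements(LISTING)))
```

Saving the printed lines to a file (for example requirements.txt) allows installing everything with pip install -r requirements.txt.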

Python module installation with pip

If desired, install pipenv with the following command:

pip install --user pipenv

Installing dependencies with pipenv is done as follows:

pipenv install requests

Pipenv guide

Installing dependencies with pip is done as follows:

pip install numpy

If there are issues while starting the climate API, or conflicting versions, run the following in PowerShell:

pip freeze | %{$_.split('==')[0]} | %{pip install --upgrade $_}
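
The one-liner above relies on PowerShell syntax. A platform-independent variant can be sketched in Python as follows; the function names `package_names` and `upgrade_all` are hypothetical and not part of the repository. The sketch only considers `name==version` lines of `pip freeze`, which is an assumption about what should be upgraded.

```python
import subprocess
import sys


def package_names(freeze_output: str) -> list:
    """Extract package names from 'pip freeze' output."""
    names = []
    for line in freeze_output.splitlines():
        line = line.strip()
        # Skip blanks, comments and editable installs.
        if not line or line.startswith("#") or line.startswith("-e"):
            continue
        if "==" in line:
            names.append(line.split("==")[0])
    return names


def upgrade_all() -> None:
    """Upgrade every package reported by 'pip freeze', one at a time."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in package_names(frozen):
        # check=False: one failing package should not stop the others.
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "--upgrade", name],
            check=False,
        )
```

Calling `upgrade_all()` inside the active environment has the same effect as the PowerShell command. Note that upgrading moves the packages away from the versions listed under Requirements.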

Python module installation with conda

If desired, create and activate an environment before installing dependencies with the following commands:

conda create --no-default-packages -n myenv python
conda activate myenv

Anaconda environment documentation

Installing dependencies with conda is done as follows:

conda install numpy

If there are issues while starting the climate API, or conflicting versions, run:

conda upgrade --all

Installation & Usage

With provided database

  1. Install Python
  2. Install all required modules for the climate API (Requirements)
  3. Install Postgres
  4. Start flask app (Instruction)
  5. Open the web browser
  6. Navigate to localhost:5000
  7. Configure database connection string (Instruction)
  8. Navigate to database tab of administration overview
  9. Press action button to connect to database
  10. Stop server
  11. Open pgAdmin or equivalent
  12. Import database.db
  13. Start server (Instruction)
  14. Navigate to localhost:5000/admin
  15. Use app

Without provided database

  1. Install Python
  2. Install all required modules for climate API (Requirements)
  3. Install Postgres
  4. Start flask app (Instruction)
  5. Open the web browser
  6. Navigate to localhost:5000
  7. Configure database connection string (Instruction)
  8. Navigate to database tab of administration overview
  9. Press action button to connect to database
  10. Press action button to create database tables
  11. Run ETL load
  12. Run Core load
  13. Use app

With VSCode and Edge

  1. Install Python
  2. Install all required modules for climate API (Requirements)
  3. Install Postgres
  4. Open VSCode
  5. Launch debug config FE + Flask found in doc/launch.json
  6. Configure database connection string (Instruction)
  7. Navigate to database tab of administration overview
  8. Press action button to connect to database
  9. Press action button to create database tables
  10. Run ETL load
  11. Run Core load
  12. Use app

Start flask app

Command Prompt

> set FLASK_APP=app.py
> set FLASK_ENV=production
> flask run

PowerShell

> $env:FLASK_APP = "app.py"
> $env:FLASK_ENV = "production"
> flask run

Linux (untested)

export FLASK_APP=app.py
export FLASK_ENV=production
flask run

Important: activate the Anaconda or pip environment before starting the Flask app.
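
The shell variants above all set the same two variables and then call flask run. A small Python launcher can do the same on any platform. This is a sketch under assumptions: the function names `flask_environment` and `run_flask` are hypothetical and no such launcher is part of the repository.

```python
import os
import subprocess
import sys


def flask_environment(base) -> dict:
    """Return a copy of the environment with the Flask variables set."""
    env = dict(base)
    env["FLASK_APP"] = "app.py"
    env["FLASK_ENV"] = "production"
    return env


def run_flask() -> None:
    """Start 'flask run' with the interpreter of the active environment."""
    subprocess.run(
        [sys.executable, "-m", "flask", "run"],
        env=flask_environment(os.environ),
        check=True,
    )
```

Because `sys.executable` is used, `run_flask()` starts Flask from whichever environment the launcher itself runs in, so the environment still has to be activated first.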


Install Browser driver

To fetch new data, Selenium requires a browser driver to scrape websites.

The browser version is checked automatically.

The driver is installed as follows:

Edge

  1. Navigate to localhost:5000/admin
  2. Press Driver name
  3. Press Download driver

Chrome

  1. Navigate to localhost:5000/admin/driver/Chrome?headless=false

Configure database connection string

Postgres config page

  1. Navigate to localhost:5000/admin
  2. Choose database type in drop down (Only supports Postgres)
  3. Enter Database username
  4. Enter database password (Not encrypted!)
  5. Enter database location (Only supports localhost)
  6. Select port
  7. Submit form

The Postgres connection string is saved in plain text in config/config.json.
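
For orientation, the form fields above map onto a Postgres connection URL roughly as sketched below. The function `postgres_url` and the `database` parameter (with its default) are hypothetical; the actual layout of config/config.json is not described here and may differ.

```python
from urllib.parse import quote


def postgres_url(username, password, host="localhost", port=5432,
                 database="postgres"):
    """Build a connection URL of the form used by SQLAlchemy for Postgres."""
    # Percent-encode user and password so characters such as '@' or ':'
    # cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"
```

A URL built this way can be handed to `sqlalchemy.create_engine`. Since the password ends up readable inside the URL, the stored string should be treated as a secret.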

Extensions

Add new parameters

  1. Open idawebConfig.xml
  2. Add new parameter with name, group and granularity
  3. Restart Server
  4. Navigate to localhost:5000/admin/database
  5. Click on idaweb_t
  6. Run increment load

New login information

If the idaweb account is blocked:

  1. Open webscraping.py
  2. Change the login information at the start of the file

How it works

Scraping


webscraping.py contains both the meteoschweiz and the idaweb scraping functions.

Both scraping methods use Selenium to log in and navigate web pages. Selenium is currently configured to display the browser so that its activity can be checked. Navigation and click events on a page are done with either XPath or JavaScript paths.

Data downloads from idaweb are done with the Python requests module, without a browser window. The session is passed as an argument to each request.
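
One common way to hand a logged-in Selenium session over to requests is to copy the browser cookies. The sketch below shows that idea only; whether webscraping.py does it this way is an assumption, and the function `cookies_for_requests` is hypothetical.

```python
def cookies_for_requests(selenium_cookies):
    """Convert Selenium cookies into a plain name -> value dictionary.

    selenium_cookies is the list of dictionaries returned by
    WebDriver.get_cookies(); each entry has at least 'name' and 'value'.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Usage sketch (needs a logged-in Selenium driver and the requests module):
#
#   session = requests.Session()
#   session.cookies.update(cookies_for_requests(driver.get_cookies()))
#   response = session.get(download_url)  # download_url is a placeholder
```

With the cookies copied, the download no longer depends on the browser window, which keeps file transfers out of the Selenium-controlled page.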

API


app.py & API folder

app.py does the following things:

  1. Instantiates the blueprints for all sub-APIs in the API folder
  2. Contains the main routes for the API

The API folder contains the following:

  1. All blueprints for the different parts of the API
  2. The main admin page routes
  3. The database routes on the admin page
  4. All database interface routes
  5. All scraping routes
  6. Handling of all SSE streams to the front end

db.py does the following things:

  1. All interaction with database
    1. Database creation
    2. Table creation
    3. Selects
    4. Inserts
  2. Creates announcer for the front end
  3. Creates messages about the database status and sends them over SSE to the front end

Helper files:

  • Helper functions for POST and GET requests
  • Helper functions for idaweb file download
  • idaweb parameters to download and refresh
  • Temporary storage of configurations, used for development

messageAnnouncer.py does the following things:

  1. SSE (server-sent events) handling
  2. Message queueing
  3. Message formatting
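
The three responsibilities can be illustrated with a well-known pattern for server-sent events in Flask applications: one queue per listener plus a formatting helper. This is a minimal sketch of that pattern, not the code in messageAnnouncer.py; the queue size and all names are assumptions.

```python
import queue


def format_sse(data, event=None):
    """Format a message according to the server-sent events wire format."""
    message = f"data: {data}\n\n"
    if event is not None:
        message = f"event: {event}\n{message}"
    return message


class MessageAnnouncer:
    """Fan out messages to every connected listener."""

    def __init__(self):
        self.listeners = []

    def listen(self):
        """Register a new listener and return its queue."""
        listener = queue.Queue(maxsize=5)  # assumed size
        self.listeners.append(listener)
        return listener

    def announce(self, message):
        """Put a message on every queue; drop listeners whose queue is full."""
        # Iterate backwards so deleting an entry does not shift unvisited ones.
        for i in reversed(range(len(self.listeners))):
            try:
                self.listeners[i].put_nowait(message)
            except queue.Full:
                del self.listeners[i]
```

A streaming route would call `listen()` once per client and yield each message taken from the queue. A full queue is taken as a sign that the client has stopped reading, so that listener is dropped.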

responseDict.py does the following things:

  1. Response sending for the front end
  2. Button disabling for the front end
  3. Creating a progressbar for the front end
  4. Starting materialized view refresh after data inserts

abstractDriver.py handles all selenium driver interactions

  1. Driver installation
  2. Creating front end information about driver status

dashboard.py does the following things:

  1. Creation of the dashboard and its structure
  2. Selection of the data displayed on the dashboard
  3. Wrangling of the selected data
  4. Handling of user interaction using callbacks

story.py does the following things:

  1. Creation of the story and its structure
  2. Selection of data displayed in the story
  3. Wrangling of the selected data

Contains all unit tests for the web scraping

Contains all unit tests for the database

Database implementation


  • Database is divided into two main schemas, Stage and Core
  • All tables have corresponding materialized views for number of rows and last update
  • Data is copied from left to right
    • Text files / Web into stage tables
    • Stage tables into Core tables

Stage

The Stage schema contains all new data:

  • Can contain duplicate entries
  • Has no primary keys
  • Contains raw data

Core

  • Cannot contain duplicate entries, because duplicates would violate the natural primary key
  • Data is indexed for faster selects
  • idaweb_t and meteoschweiz_t are merged into measurements_t table
  • Columns get added for the description of the data source
  • Data gets parsed into the format used in future analysis
  • Core data is never deleted and serves as the basis when new data is added
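
The Stage to Core step can be sketched as follows. To stay self-contained, the example uses an in-memory SQLite database as a stand-in for Postgres, and the table and column names are hypothetical. In Postgres, the `INSERT OR IGNORE` shown here would typically be written as `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def create_tables(conn):
    """Create a stage table without keys and a core table with a natural key."""
    conn.executescript("""
        CREATE TABLE stage_measurements (
            station TEXT, meas_date TEXT, value REAL
        );
        CREATE TABLE core_measurements (
            station TEXT, meas_date TEXT, value REAL, source TEXT,
            PRIMARY KEY (station, meas_date)
        );
    """)


def load_core(conn, source):
    """Copy stage rows into core, skipping rows whose key already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO core_measurements "
        "(station, meas_date, value, source) "
        "SELECT station, meas_date, value, ? FROM stage_measurements",
        (source,),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM core_measurements").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    # The stage table accepts the duplicate row; the core table does not.
    conn.executemany(
        "INSERT INTO stage_measurements VALUES (?, ?, ?)",
        [("BAS", "2020-01-01", 1.5), ("BAS", "2020-01-01", 1.5),
         ("BER", "2020-01-01", 0.3)],
    )
    print(load_core(conn, "idaweb"))
```

Running the load a second time leaves the core table unchanged, which is what makes repeated or incremental loads safe.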

ERD

databaseERD
