DevLog
======

Ideas
~~~~~

- Integrate a map with one color per variable.
- Integrate information about events related to the measures, and
  introduce them in the timeline.
- Accelerate the process.
- Transforms can include a moving average, as an example.

Dev log
~~~~~~~

23/10
^^^^^

Strategy A: Creating the environment.

- ``python==3.9.13``
- PySpark needs Java: ``conda install openjdk==17.0.3``
- ``conda install pyspark==3.3.0``
- ``conda install ipykernel``
- ``python -m ipykernel install --user --name covid19``

On Windows I have some issues starting the SparkSession: it hangs
forever.
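
A minimal sketch of the session start-up used to check the environment
(the app name is just a placeholder):

.. code:: python

   from pyspark.sql import SparkSession

   # Minimal local session to verify the install; on Windows this is the
   # call that hangs for me.
   spark = (
       SparkSession.builder
       .master("local[*]")
       .appName("covid19-smoke-test")
       .getOrCreate()
   )

   print(spark.version)
   spark.stop()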

Strategy B: Docker machine

- ``docker pull jupyter/pyspark-notebook``
- ``docker run -p 10000:8888 -p 4040:4040 jupyter/pyspark-notebook``

The Docker image works fine, and the Spark dashboard is available at
``localhost:4040``.

.. _section-1:

24/10
^^^^^

After checking the db structure, the project can initially be organized
as follows:

.. mermaid::

   flowchart LR
      CT(Covid tracker JSON) --> PS(PySpark Loading)
      subgraph one[ETL]
         PS --> HV(HIVE DB)
         HV --> GP(Geoparquet)
      end
      GP --> BH(Bokeh App)

I found several boilerplates with good templates for data engineering
projects using PySpark:

- `PySpark Example
  Project <https://github.com/AlexIoannides/pyspark-example-project>`__
- `PySpark Project
  Template <https://github.com/hbaflast/pyspark-project-template>`__
- `PySpark Spotify
  ETL <https://github.com/Amaguk2023/Pyspark_Spotify_ETL>`__

Following the first example, this seems like a nice project structure to
start with:

.. code:: text

   root/
    |-- configs/
    |   |-- etl_config.json
    |-- dependencies/
    |   |-- logging.py
    |   |-- spark.py
    |-- jobs/
    |   |-- etl_job.py
    |-- tests/
    |   |-- test_data/
    |   |   |-- employees/
    |   |   |-- employees_report/
    |   |-- test_etl_job.py
    |-- build_dependencies.sh
    |-- packages.zip
    |-- Pipfile
    |-- Pipfile.lock
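
A rough sketch of how ``jobs/etl_job.py`` could be organised following
that structure (function names and paths here are my own placeholders,
not the template's exact API):

.. code:: python

   from pyspark.sql import DataFrame, SparkSession


   def extract_data(spark: SparkSession, path: str) -> DataFrame:
       """Read the raw JSON dump into a DataFrame."""
       return spark.read.json(path)


   def transform_data(df: DataFrame) -> DataFrame:
       """Apply the cleaning steps (kept trivial in this sketch)."""
       return df.dropDuplicates()


   def load_data(df: DataFrame, path: str) -> None:
       """Persist the transformed data as parquet."""
       df.write.mode("overwrite").parquet(path)


   def main() -> None:
       spark = SparkSession.builder.appName("covid19_etl").getOrCreate()
       df = extract_data(spark, "data/daily.json")
       load_data(transform_data(df), "data/covid19.parquet")
       spark.stop()


   if __name__ == "__main__":
       main()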

I managed to read the schema of the JSON for daily.json without issues.
However, when I try to create a DataFrame out of the JSON obtained via
requests, the parsed schema is reduced to the first two categories. As a
minimum viable proof, I decided to manually extract two variables,
``total_cases`` and ``date``, in order to continue with the structure of
the project.
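
Roughly what that manual extraction looks like (the endpoint URL and the
JSON key names are placeholders, not the exact ones from the tracker):

.. code:: python

   import requests
   from pyspark.sql import Row, SparkSession

   spark = SparkSession.builder.appName("covid19-mvp").getOrCreate()

   # Placeholder endpoint; the real daily.json URL goes here.
   records = requests.get("https://example.org/daily.json").json()

   # Keep only the two fields of interest; key names are assumed.
   rows = [Row(date=r["date"], total_cases=r["total_cases"]) for r in records]

   df = spark.createDataFrame(rows)
   df.printSchema()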

I'm exploring the idea of separating the project into 3 Docker
containers: one dedicated to the ETL, another to the HIVE database, and
a third for the interactive Bokeh app. In the latter, I want to include
two kinds of visualizations: one map-based, and another one for the time
series.

.. _section-2:

25/10
^^^^^

In order to simplify the development I decided to keep the
three-container idea as a future update and create a simpler version of
the workflow. The output from the ETL pipeline will be saved in a
parquet (geoparquet) file, and this will be picked up by Bokeh in order
to do the visualization.
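
A minimal sketch of that hand-off, assuming the ETL writes a plain
parquet file that the Bokeh side reads back with pandas (paths and
columns are illustrative):

.. code:: python

   import pandas as pd
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("covid19_etl").getOrCreate()

   # ETL side: persist the transformed data as parquet.
   df = spark.createDataFrame([("2020-10-25", 100)], ["date", "total_cases"])
   df.write.mode("overwrite").parquet("output/covid19.parquet")

   # Bokeh side: read the same parquet back with pandas for plotting.
   pdf = pd.read_parquet("output/covid19.parquet")
   print(pdf.head())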

Ideas for quality control:

- Missing values: interpolate values as an approximation, or use the
  mean value (``df.col_name.interpolate``, ``df.col_name.fillna``).
- Missing values, PySpark solution: ``pyspark.ml.feature.Imputer`` (see
  the sketch after this list).
  https://www.youtube.com/watch?v=K46pPG8Cepo&ab_channel=WebAgeSolutionsInc
- Data in an inconsistent format.
- Duplicate records.
- Outliers.
- Input data not normalized.
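
A small sketch of the ``Imputer`` option (column names and values are
made up):

.. code:: python

   from pyspark.ml.feature import Imputer
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("qc-imputer").getOrCreate()

   # Toy frame with a missing value in total_cases.
   df = spark.createDataFrame(
       [(1, 10.0), (2, None), (3, 30.0)], "day INT, total_cases DOUBLE"
   )

   # Replace nulls with the column mean (strategy could also be "median").
   imputer = Imputer(
       inputCols=["total_cases"],
       outputCols=["total_cases_imputed"],
       strategy="mean",
   )
   imputer.fit(df).transform(df).show()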

We can run SQL directly against the parquet data:

.. code:: python

   parqDF.createOrReplaceTempView("ParquetTable")
   parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")

.. _section-3:

26/10
^^^^^

After dealing with some problems related to the date/datetime format, I
got the first MVP of the pipeline. Data is now extracted, dates are
transformed into a proper datetime type, and the data is loaded into a
parquet db. The Bokeh app is able to read this data from the database
and plot a simple time-series plot in HTML. This is the candidate for
the first release.
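
The date fix boils down to something like this (the raw ``yyyyMMdd``
integer format is an assumption about the tracker data):

.. code:: python

   from pyspark.sql import SparkSession
   from pyspark.sql import functions as F

   spark = SparkSession.builder.appName("date-fix").getOrCreate()

   # Dates arrive as integers such as 20201026 (assumed format).
   df = spark.createDataFrame([(20201026, 100)], ["date", "total_cases"])

   # Cast to string and parse into a proper DateType column.
   df = df.withColumn("date", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))
   df.printSchema()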

.. _section-4:

27/10
^^^^^

I included some exception handling for the API request. Now the
database can be overwritten without duplicate issues. I also added
another transformation: a rolling mean. Next I will add unit tests for
those transformations using a small dataset.
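
A minimal sketch of the rolling-mean transformation with a window
function (window size and column names are illustrative):

.. code:: python

   from pyspark.sql import SparkSession
   from pyspark.sql import functions as F
   from pyspark.sql.window import Window

   spark = SparkSession.builder.appName("rolling-mean").getOrCreate()

   df = spark.createDataFrame(
       [("2020-10-24", 10), ("2020-10-25", 20), ("2020-10-26", 30)],
       ["date", "total_cases"],
   )

   # 3-row rolling mean ordered by date; with no partitionBy Spark moves
   # everything to a single partition, which explains the WindowExec
   # warning in the log further down.
   window = Window.orderBy("date").rowsBetween(-2, 0)
   df = df.withColumn("total_cases_rolling", F.avg("total_cases").over(window))
   df.show()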

The test works correctly for one transformation. Now tests need to be
written for every transformation.

The application runs smoothly with ``python -m covid19_project``, but
some warnings appeared:

.. code:: text

   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.
     warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   22/10/27 11:56:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation

To run the tests: ``python -m unittest test/test_*.py``

Still some work is required when using spark-submit with
``$SPARK_HOME/bin/spark-submit --master local[*] --files configs/config.json covid19_project/__main__.py``
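
For reference, a config shipped with ``--files`` is typically read back
inside the job roughly like this (a sketch, not the project's actual
loader):

.. code:: python

   import json

   from pyspark import SparkFiles
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("covid19_etl").getOrCreate()

   # Files passed via --files are distributed with the job;
   # SparkFiles.get resolves the local path to config.json.
   with open(SparkFiles.get("config.json")) as handle:
       config = json.load(handle)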

.. _section-5:

28/10
^^^^^

Applied some style corrections with flake8 and configured the Docker
container for mybinder correctly.