
Commit dfa29b4

Author: Carlos Vivar (committed)
Updated README and documentation.
1 parent 7a5ffa8 commit dfa29b4


49 files changed: +2585 -301 lines

README.md

Lines changed: 57 additions & 288 deletions
Large diffs are not rendered by default.

docs/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: c3c9e7ff42763b0efa8d10576ea1f52c
+config: 66b8a8e310af0293b0ba29582ed6d5be
 tags: 645f666f9bcd5a90fca523b33c5a78b7

docs/_sources/database.rst.txt

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
Database
========

The Covid Tracking Project Data API contains 2 main categories:
National Data, and State & Territories Data.

National Data
-------------

- Historic US values:

  - field_definitions

    - Total test results
    - Hospital discharges
    - Confirmed Cases
    - Cumulative hospitalized/Ever hospitalized
    - Cumulative in ICU/Ever in ICU
    - Cumulative on ventilator/Ever on ventilator
    - Currently hospitalized/Now hospitalized
    - Currently in ICU/Now in ICU
    - Currently on ventilator/Now on ventilator
    - Deaths (probable)
    - Deaths (confirmed)
    - Deaths (confirmed and probable)
    - Probable Cases
    - Last Update (ET)
    - New deaths
    - Date
    - States (**not reported**)

Every field is organized in 3 categories: cases, testing, and outcomes.
Each field can then be accessed with a dot after its category.

- Single Day of data:

  - Same information, but you don't need to download the whole dataset.
    This can be useful in order to parallelize the data retrieval.
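
As a minimal sketch of how these fields could be reached from Python, assuming the v2 national endpoint ``https://api.covidtracking.com/v2/us/daily.json`` (the URL and the exact field paths are illustrative assumptions, not part of this documentation):

.. code-block:: python

   import requests

   # Fetch the national historic data (v2 endpoint assumed)
   payload = requests.get("https://api.covidtracking.com/v2/us/daily.json").json()

   # Each record groups its fields under cases, testing, and outcomes
   latest = payload["data"][0]
   print(latest["date"])
   print(latest["cases"]["total"]["value"])              # confirmed + probable cases
   print(latest["outcomes"]["death"]["total"]["value"])  # cumulative deaths (assumed path)
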
State & Territories Data
------------------------

- All state metadata: Basic information about all states, including
  notes about our methodology and the websites we use to check for
  data.

  - field_definitions

    - state_code
    - COVID Tracking Project preferred total test units
    - COVID Tracking Project preferred total test field
    - State population (2019 census)
    - Tertiary source for state COVID data
    - Secondary source for state COVID data
    - Primary source for state COVID data
    - FIPS code
    - State (or territory)

- Single State Metadata: Same, but per state.

  - field_definitions

    - state_code
    - COVID Tracking Project preferred total test units
    - COVID Tracking Project preferred total test field
    - State population (2019 census)
    - Tertiary source for state COVID data
    - Secondary source for state COVID data
    - Primary source for state COVID data
    - FIPS code
    - State (or territory)

- Historic data for a state or territory:

  - field_definitions

    - Total test results
    - Hospital discharges
    - Confirmed Cases
    - Cumulative hospitalized/Ever hospitalized
    - Cumulative in ICU/Ever in ICU
    - Cumulative on ventilator/Ever on ventilator
    - Currently hospitalized/Now hospitalized
    - Currently in ICU/Now in ICU
    - Currently on ventilator/Now on ventilator
    - Deaths (probable)
    - Deaths (confirmed)
    - Deaths (confirmed and probable)
    - Probable Cases
    - Last Update (ET)
    - New deaths
    - Date

- Single day of data for a state or territory:

  - field_definitions

    - Total test results
    - Hospital discharges
    - Confirmed Cases
    - Cumulative hospitalized/Ever hospitalized
    - Cumulative in ICU/Ever in ICU
    - Cumulative on ventilator/Ever on ventilator
    - Currently hospitalized/Now hospitalized
    - Currently in ICU/Now in ICU
    - Currently on ventilator/Now on ventilator
    - Deaths (probable)
    - Deaths (confirmed)
    - Deaths (confirmed and probable)
    - Probable Cases
    - Last Update (ET)
    - New deaths
    - Date

docs/_sources/devlog.rst.txt

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
DevLog
======

Ideas
~~~~~

- Integrate a map with color per variable.
- Integrate information about events related to measures. Introduce them
  in the timeline.
- Accelerate the process.
- Transforms can include a moving average, for example.

Dev log
~~~~~~~

23/10
^^^^^

Strategy A: Creating the environment.

- ``python==3.9.13``
- PySpark uses Java: ``conda install openjdk==17.0.3``
- ``conda install pyspark==3.3.0``
- ``conda install ipykernel``
- ``python -m ipykernel install --user --name covid19``

On Windows I had some issues starting the SparkSession: it hangs
forever.

Strategy B: Docker machine

- ``docker pull jupyter/pyspark-notebook``
- ``docker run -p 10000:8888 -p 4040:4040 jupyter/pyspark-notebook``

The Docker image works fine, and we have access to the dashboard at
``localhost:4040``.

.. _section-1:

24/10
^^^^^

After checking the DB structure, the project can initially be organized
as follows:

.. mermaid::

   flowchart LR
       CT(Covid tracker JSON) --> PS(PySpark Loading)
       subgraph one[ETL]
           PS --> HV(HIVE DB)
           HV --> GP(Geoparquet)
       end
       GP --> BH(Bokeh App)

I found several boilerplates with good templates for a data engineering
project using PySpark.

- `PySpark Example
  Project <https://github.com/AlexIoannides/pyspark-example-project>`__
- `PySpark Project
  Template <https://github.com/hbaflast/pyspark-project-template>`__
- `PySpark Spotify
  ETL <https://github.com/Amaguk2023/Pyspark_Spotify_ETL>`__

Following the first example, this seems like a nice project structure to
start with:

.. code:: bash

   root/
    |-- configs/
    |   |-- etl_config.json
    |-- dependencies/
    |   |-- logging.py
    |   |-- spark.py
    |-- jobs/
    |   |-- etl_job.py
    |-- tests/
    |   |-- test_data/
    |   |-- | -- employees/
    |   |-- | -- employees_report/
    |   |-- test_etl_job.py
    |   build_dependencies.sh
    |   packages.zip
    |   Pipfile
    |   Pipfile.lock

I managed to read the schema of daily.json without issues. However,
when I try to create a DataFrame out of the JSON obtained via requests,
the parsing of the schema is reduced to the first 2 categories. As a
minimum viable proof, I decided to extract two variables,
``total_cases`` and ``date``, manually in order to follow the structure
of the project (see the sketch below).
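
A minimal sketch of that manual extraction, assuming the ``daily.json`` endpoint and field paths used in the database notes (both are assumptions, not the project's confirmed schema):

.. code:: python

   import requests

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("covid19_etl_sketch").getOrCreate()

   # Endpoint and field paths are assumed; only two fields are extracted by hand
   payload = requests.get("https://api.covidtracking.com/v2/us/daily.json").json()
   rows = [
       (record["date"], record["cases"]["total"]["value"])
       for record in payload["data"]
   ]

   df = spark.createDataFrame(rows, ["date", "total_cases"])
   df.show(5)
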
I'm exploring the idea of separating the project into 3 Docker
containers: one dedicated to the ETL, another to the HIVE database, and
a third for the interactive Bokeh app. In the latter, I want to include
2 kinds of visualizations: a map-based one and another for the time
series.

.. _section-2:

25/10
^^^^^

In order to simplify development I decided to keep the 3-container
Docker idea as a future update and create a simpler version of the
workflow. The output from the ETL pipeline will be saved in a parquet
(geoparquet) file, and this will be picked up by Bokeh in order to do
the visualization.

Ideas for quality control:

- Missing values: interpolate values as an approximation, or use the
  mean value (``df.col_name.interpolate``, ``df.col_name.fillna``).
- Missing values: PySpark solution via ``pyspark.ml.feature.Imputer``
  (see the sketch after this list).
  https://www.youtube.com/watch?v=K46pPG8Cepo&ab_channel=WebAgeSolutionsInc
- Data in an inconsistent format
- Duplicate records
- Outliers
- Non-normalized input data
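
A minimal sketch of the ``Imputer`` idea on a toy frame (column names are illustrative):

.. code:: python

   from pyspark.ml.feature import Imputer
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("qc_sketch").getOrCreate()

   # Toy frame with a missing value in the numeric column
   df = spark.createDataFrame(
       [("2020-03-01", 10.0), ("2020-03-02", None), ("2020-03-03", 30.0)],
       ["date", "total_cases"],
   )

   # Fill the gap with the column mean (strategy could also be "median")
   imputer = Imputer(
       inputCols=["total_cases"],
       outputCols=["total_cases_imputed"],
       strategy="mean",
   )
   df_imputed = imputer.fit(df).transform(df)
   df_imputed.show()
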
We can run SQL queries directly against the parquet data:

.. code:: python

   parqDF.createOrReplaceTempView("ParquetTable")
   parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")

.. _section-3:

26/10
^^^^^

After dealing with some problems related to the date/datetime format I
got the first MVP of the pipeline. Data is now extracted, dates are
transformed into a proper datetime type, and the data is loaded into a
parquet DB. The Bokeh app is able to read this data from the database
and plot a simple time-series plot in HTML. This is the first candidate
for the first release.
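
A minimal sketch of that read-and-plot step, assuming the parquet output lives under ``output/`` and exposes ``date`` and ``total_cases`` columns (path and column names are assumptions):

.. code:: python

   import pandas as pd
   from bokeh.plotting import figure, output_file, save

   # Read the ETL output back (path and columns assumed)
   df = pd.read_parquet("output/covid19.parquet")

   p = figure(x_axis_type="datetime", title="Total cases over time",
              x_axis_label="date", y_axis_label="total cases")
   p.line(df["date"], df["total_cases"], line_width=2)

   # Write a standalone HTML file with the interactive plot
   output_file("output/total_cases.html")
   save(p)
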
.. _section-4:

27/10
^^^^^

I included some exceptions for the API request. Now the database can be
overwritten without duplicate issues. And I added another
transformation: rolling mean (a sketch of it follows). Next I will
include some tests of those transformations on a small dataset for the
unittest suite.
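
A minimal sketch of the rolling-mean transformation with a PySpark window, assuming ``date`` and ``total_cases`` columns and a 7-day width (all illustrative). Note that an unpartitioned window is what triggers the ``WindowExec`` warning quoted further down.

.. code:: python

   from pyspark.sql import SparkSession, Window
   from pyspark.sql import functions as F

   spark = SparkSession.builder.appName("transform_sketch").getOrCreate()

   # Assumed ETL output location and columns
   df = spark.read.parquet("output/covid19.parquet")

   # 7-day rolling mean over the chronologically ordered series
   window = Window.orderBy("date").rowsBetween(-6, 0)
   df_rolled = df.withColumn("total_cases_7d_mean", F.avg("total_cases").over(window))
   df_rolled.show(5)
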
The test is working correctly for one transformation. Now tests need to
be generated for every transformation (see the sketch below).
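
A minimal sketch of such a test, assuming a hypothetical ``rolling_mean`` helper inside the project (module path and signature are assumptions):

.. code:: python

   import datetime
   import unittest

   from pyspark.sql import SparkSession

   from covid19_project.transforms import rolling_mean  # hypothetical helper


   class TestRollingMean(unittest.TestCase):
       """Run one transformation against a tiny in-memory dataset."""

       @classmethod
       def setUpClass(cls):
           cls.spark = (
               SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
           )

       @classmethod
       def tearDownClass(cls):
           cls.spark.stop()

       def test_row_count_is_preserved(self):
           df = self.spark.createDataFrame(
               [(datetime.date(2020, 3, d), float(d)) for d in range(1, 8)],
               ["date", "total_cases"],
           )
           result = rolling_mean(df, column="total_cases", window_size=7)
           self.assertEqual(result.count(), df.count())


   if __name__ == "__main__":
       unittest.main()
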
The application runs smoothly with ``python -m covid19_project``, but
some warnings appeared:

.. code:: text

   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.
     warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   22/10/27 11:56:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation

In order to test: ``python -m unittest test/test_*.py``

Some work is still required when using spark-submit with
``$SPARK_HOME/bin/spark-submit --master local[*] --files configs/config.json covid19_project/__main__.py``

.. _section-5:

28/10
^^^^^

Applied some style corrections with flake8 and configured the Docker
container correctly for mybinder.

docs/_sources/index.rst.txt

Lines changed: 5 additions & 0 deletions
@@ -6,11 +6,16 @@
 Welcome to Covid19 Visualization Project's documentation!
 =========================================================
 
+Text test.
+
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
+   install
+   database
    modules
+   devlog
 
 Indices and tables
 ==================

docs/_sources/install.rst.txt

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
Installation and run
====================

Installation
------------

Dependencies can be installed using pip.

.. code-block:: bash

   pip install -r requirements.txt

Run
---

In order to run the analysis we simply execute the Python program. This will create a folder named ``output`` containing the parquet database and the final HTML with the interactive visualization.

.. code-block:: bash

   python -m COVID19_project

Run tests
---------

Each transformation has its corresponding test. It is possible to run them with:

.. code-block:: bash

   python -m unittest tests/test_*.py

Run in Docker
-------------

It is possible to build a Docker container from the Dockerfile provided in the repository. This Docker image is built upon the jupyter-spark image and comes with a JupyterLab interface. In order to build the image you can run:

.. code-block:: bash

   docker build -t caviri/covid19:latest .

Then, in order to run the Docker image, you need to map the ports: Jupyter uses port ``8888`` and the PySpark UI uses port ``4040``.

.. code-block:: bash

   docker run -p 10001:8888 -p 4041:4040 caviri/covid19

As an alternative to building your own image, it is possible to pull an image from `Docker Hub <https://hub.docker.com/r/caviri/covid19>`__:

.. code-block:: bash

   docker pull caviri/covid19:latest
