Set up a data processing and visualization pipeline for COVID data.
## Database
The COVID Tracking Project compiled US COVID-19-related data from 02/2020 until 03/2021. It provides data by day and by state in three main areas: testing, hospitalization, and patient outcomes. The data is available through an [API](https://covidtracking.com/data/api/version-2) that returns `json` files.
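
As a quick illustration of how the data can be retrieved (a minimal sketch; the `/v2/us/daily.json` endpoint is taken from the API documentation linked above and is assumed here rather than copied from the project's own extraction code):

```python
import json
from urllib.request import urlopen

# Assumed endpoint for the national daily time series (see the API docs linked above).
URL = "https://api.covidtracking.com/v2/us/daily.json"

with urlopen(URL) as response:
    payload = json.load(response)

# The v2 API wraps the records in a "data" list; the exact layout may vary by endpoint.
print(len(payload["data"]), "daily records retrieved")
```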
As a proof of concept, this tool takes from that database the total number of COVID-19 cases by day. This metric is cumulative; therefore, to visualize daily COVID-19 cases, we need to transform the original data and calculate the day-to-day difference. Some US health bureaus reported cases only on weekdays, while others reported every day without interruption, which explains the drop in case counts seen during weekends. To correct this "noise", the tool calculates a rolling mean with a 7-day window. This transformation smooths the signal and corrects the artifact; however, it tends to hamper the detection of fast changes in the signal.
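
As an illustrative sketch only (not the project's actual code), the two transformations described above could look roughly like this in PySpark, assuming a DataFrame with `date` and `positive` (cumulative cases) columns:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid_poc").getOrCreate()

# Hypothetical input: one row per day with the cumulative case count.
df = spark.createDataFrame(
    [("2021-01-01", 100), ("2021-01-02", 130), ("2021-01-03", 130), ("2021-01-04", 190)],
    ["date", "positive"],
).withColumn("date", F.to_date("date"))

# Daily difference of the cumulative counter.
by_date = Window.orderBy("date")
df = df.withColumn("daily_cases", F.col("positive") - F.lag("positive", 1).over(by_date))

# 7-day rolling mean to smooth the weekend reporting artifact.
rolling = Window.orderBy("date").rowsBetween(-6, 0)
df = df.withColumn("daily_cases_7d_avg", F.avg("daily_cases").over(rolling))

df.show()
```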
## Solution
The schema of the solution proposed for this task is represented in the graph below. `PySpark` is used for the ETL job and `Bokeh` for generating the interactive visualization. Selected data is extracted from the database API. Then, after a data validation check, several transformations are applied to the data, such as the conversion of dates, the calculation of daily differences, and the smoothing of the time series via a rolling mean. Next, the data is loaded into a parquet database. This loading process checks for duplicates, so it can be run repeatedly without inserting duplicate records.
```mermaid
graph LR;
CT(Covid tracker JSON) --> PE(PySpark Extraction)
L --> BH(Bokeh Interactive Visualization)
```
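
The duplicate check in the load step is only described at a high level above; purely as a sketch (assuming the records are keyed by a `date` column and stored at a hypothetical `data/covid.parquet` path), it could be implemented along these lines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid_load").getOrCreate()
OUTPUT_PATH = "data/covid.parquet"  # hypothetical location of the parquet store

def load_without_duplicates(new_df, path=OUTPUT_PATH):
    """Append only the rows whose `date` is not already stored."""
    try:
        existing_dates = spark.read.parquet(path).select("date")
        fresh = new_df.join(existing_dates, on="date", how="left_anti")
    except Exception:
        # First run: nothing stored yet, so write everything.
        fresh = new_df
    if fresh.count() > 0:
        fresh.write.mode("append").parquet(path)
```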
## Output
Finally, the data previously loaded into the parquet database is used to generate an interactive Bokeh plot exported as an `html` file.
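
For illustration only (the column names and file paths below are assumptions, not the repository's actual code), such an `html` plot can be produced with Bokeh roughly as follows:

```python
import pandas as pd
from bokeh.plotting import figure, output_file, save

# Hypothetical path and column names; the real ones come from the load step.
df = pd.read_parquet("data/covid.parquet")

output_file("covid_daily_cases.html", title="Daily COVID-19 cases (7-day average)")

p = figure(x_axis_type="datetime", title="Daily COVID-19 cases, 7-day rolling mean",
           x_axis_label="Date", y_axis_label="Cases")
p.line(df["date"], df["daily_cases_7d_avg"], line_width=2)

save(p)
```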
In this proof of concept, I used the total number of COVID-19 cases. However, it can be adapted to any of the metrics available in the API.
## How to run this project
### Run in MyBinder
In order to run the analysis, we simply execute the Python program:

```bash
python -m COVID19_project
```
Different parameters can be passed to the tool through the config file `configs/config.json`. These parameters can then be passed on to the different transformation methods. This functionality still needs to be developed further.
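
The available config keys are not documented yet; purely as an illustration (the `rolling_window` key and the transformation call are hypothetical), loading the file and forwarding a parameter might look like:

```python
import json

# Load the project configuration (hypothetical key shown below).
with open("configs/config.json") as f:
    config = json.load(f)

window_size = config.get("rolling_window", 7)  # fall back to a 7-day window
# rolling_mean_transform(df, window=window_size)  # hypothetical transformation call
```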
### Run tests
Each transformation has its corresponding test. It is possible to run them with:

As an alternative to building your own image, it is possible to pull an image:

```bash
docker pull caviri/covid19:latest
```
## Project Structure
The structure of the project is inspired by this [repository](https://github.com/AlexIoannides/pyspark-example-project).
```bash
root/
| requirements.txt
```
The ETL task and the visualization tool are contained in `COVID19_project`. There, each file contains the methods required for one part of the project: extraction, transformation, loading, and visualization. Different parameters can be configured in `configs/config.json` and then used in the transformation methods. Additional modules that support the PySpark session and logging can be found in `dependencies`. Finally, unit test modules are stored in `tests`, next to small representative portions of input and output data. Each of the transformation methods has its own test function.
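
As an illustration of that testing pattern (the function and file names here are hypothetical stand-ins, not the repository's actual code), a transformation test might look like:

```python
# Hypothetical example of a test in `tests/`; real tests import the actual
# transformation methods from COVID19_project instead of defining them inline.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

def daily_difference(df):
    """Stand-in for a transformation method: daily delta of the cumulative count."""
    w = Window.orderBy("date")
    return df.withColumn("daily_cases", F.col("positive") - F.lag("positive").over(w))

def test_daily_difference():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    input_df = spark.createDataFrame(
        [("2021-01-01", 100), ("2021-01-02", 130)], ["date", "positive"]
    )
    result = {r["date"]: r["daily_cases"] for r in daily_difference(input_df).collect()}
    assert result["2021-01-02"] == 30
```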
## Ideas for further development
### Visualization
- Extract daily cases per state.
- Integrate the state databases with their geographic boundaries in a GeoParquet file.
- Develop a map visualization of the US with a colormap based on case counts.
### Database
- Implement a Hadoop/Hive database to test performance.
### Querying
- Allow custom SQL queries to retrieve information from the database.
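
One possible way to support the custom SQL queries mentioned above (a sketch only; the parquet path and column names are assumptions) would be to register the parquet data as a temporary view in Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid_query").getOrCreate()

# Hypothetical parquet location; expose it to SQL as a temporary view.
spark.read.parquet("data/covid.parquet").createOrReplaceTempView("covid")

# Example of a user-supplied query against the view (column names are assumed).
spark.sql(
    "SELECT date, daily_cases FROM covid WHERE daily_cases > 100000 ORDER BY date"
).show()
```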