
Commit 364a963

Carlos Vivar committed: Update README.md

1 parent b957bdb commit 364a963

File tree

2 files changed: +28 -4 lines changed

README.md

Lines changed: 27 additions & 3 deletions
@@ -16,10 +16,14 @@ Set up a data processing and visualization pipeline for COVID data. You will ret

## Database

-- Data API:https://covidtracking.com/data/api/version-2

The COVID Tracking Project compiled US COVID-19-related data from early 2020 until 03/2021. It provides data by day and state in three main areas: testing, hospitalization, and patient outcomes. Data is provided via an [API](https://covidtracking.com/data/api/version-2) that can be used to retrieve a `json` file.

As a proof of concept, this tool takes the total number of COVID-19 cases by day from that database. This metric is cumulative; therefore, to visualize daily COVID-19 cases, we need to transform the original data and calculate the day-to-day difference. Some US health bureaus reported cases only on weekdays, while others reported every day without interruption, which explains the drop in case counts during weekends. To correct this "noise", the tool calculates a rolling mean with a 7-day window. This transformation smooths the signal and corrects the artifact; however, it tends to hamper the detection of fast changes in the signal.
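To make the two transformations concrete, here is a minimal PySpark sketch of the daily difference and the 7-day rolling mean. The column names `date` and `positive` follow the API's daily totals, and `df` is assumed to already hold the extracted data; this is an illustration, not the project's exact code.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid_daily_cases").getOrCreate()

# `df` is assumed to hold one row per day with a `date` string and the
# cumulative `positive` case count (illustrative column names).
w_order = Window.orderBy("date")
w_rolling = Window.orderBy("date").rowsBetween(-6, 0)  # current day + 6 previous days

daily = (
    df.withColumn("date", F.to_date("date"))  # date conversion
      .withColumn("daily_cases",
                  F.col("positive") - F.lag("positive", 1).over(w_order))  # cumulative -> daily
      .withColumn("daily_cases_7d_mean", F.avg("daily_cases").over(w_rolling))  # smoothing
)
```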

## Solution:

The schema of the solution proposed for this task is represented in the graph below. `PySpark` will be used for the ETL job and `Bokeh` for generating the interactive visualization. Selected data is extracted from the database API. Then, after a data validation check, several transformations are applied to the data, such as conversion of dates, calculation of daily differences, and smoothing of the time series via a rolling mean. Next, the data is loaded into a parquet database. This loading process checks for duplicates, so it can be run repeatedly without affecting the database.

```mermaid
graph LR;
CT(Covid tracker JSON) --> PE(PySpark Extraction)
@@ -30,10 +34,12 @@ graph LR;
L --> BH(Bokeh Interactive Visualization)
```
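The duplicate check mentioned above could, for example, be implemented as an anti-join against the dates already stored before appending, which makes re-running the load idempotent. This is a sketch under assumed paths and column names, not the repository's actual loading code.

```python
from pyspark.sql import DataFrame, SparkSession

def load_to_parquet(spark: SparkSession, df: DataFrame, path: str = "data/covid.parquet") -> None:
    """Append only rows whose `date` is not already present in the parquet store."""
    try:
        existing_dates = spark.read.parquet(path).select("date")
        new_rows = df.join(existing_dates, on="date", how="left_anti")
    except Exception:
        # First run: the store does not exist yet, so write everything.
        new_rows = df
    if new_rows.take(1):  # only write when there is something new
        new_rows.write.mode("append").parquet(path)
```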

-## Output

Finally, the data previously loaded into the parquet database is used to generate an interactive Bokeh plot in `html`.
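A minimal sketch of how such an `html` plot can be produced with Bokeh; the file paths and column names are assumptions for illustration, not the project's exact code.

```python
import pandas as pd
from bokeh.plotting import figure, output_file, save

# Read back the data loaded by the ETL step (path and columns are assumed).
df = pd.read_parquet("data/covid.parquet").sort_values("date")

output_file("covid_plot.html")
p = figure(x_axis_type="datetime", width=800, height=400,
           title="Daily COVID-19 cases (7-day rolling mean)")
p.line(df["date"], df["daily_cases_7d_mean"], line_width=2)
save(p)  # writes the interactive plot to covid_plot.html
```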

![](sphinx/imgs/covid_plot.gif)

In this proof of concept, I used the total number of COVID-19 cases; however, the tool can be adapted to any of the metrics available in the API.
## How to run this project

### Run in MyBinder
@@ -56,6 +62,8 @@ In order to run the analysis we simply execute the python program. This will cre
python -m COVID19_project
```

Different parameters can be passed to the tool via the config file `configs/config.json`; these parameters can then be passed on to the different transformation methods. This functionality still needs further development.
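For illustration, the configuration could be read and forwarded to a transformation like this; `temporal_window` is the key actually present in `configs/config.json`, while the transformation call is a hypothetical signature.

```python
import json

# Load the job configuration shipped with the repository.
with open("configs/config.json") as fh:
    config = json.load(fh)

window_days = config.get("temporal_window", 7)  # 7-day rolling window by default
# Hypothetical call: forward the configured window to a transformation method.
# df = transform_rolling_mean(df, window=window_days)
```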

### Run tests

Each transformation has its corresponding test. It is possible to run them with:
@@ -83,7 +91,9 @@ As an alternative to build your own image it is possible to pull a image from do
docker pull caviri/covid19:latest
```

-## Structure of the project
+## Project Structure

The structure of the project is inspired by this [repository](https://github.com/AlexIoannides/pyspark-example-project).

```bash
root/
@@ -107,6 +117,20 @@ root/
| requirements.txt
```

The ETL task and the visualization tool are contained in `COVID19_project`. There, each file contains the methods required for one part of the project: extraction, transformation, loading, and visualization. Different parameters can be configured in `configs/config.json` and then used in the transformation methods. Additional modules that support the PySpark session and logging can be found in `dependencies`. Finally, unit-test modules are stored in `tests`, next to small, representative portions of input and output data. Each of the transformation methods has its own test function.
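To illustrate that testing pattern, a unit test for the daily-difference transformation could look like the sketch below; the imported module and function names are hypothetical stand-ins for the project's actual ones.

```python
import pytest
from pyspark.sql import SparkSession

# Hypothetical import; the real module and function names may differ.
from COVID19_project.transform import transform_daily_difference


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_daily_difference(spark):
    # Small, representative input: cumulative totals over three days.
    df_in = spark.createDataFrame(
        [("2021-01-01", 10), ("2021-01-02", 15), ("2021-01-03", 25)],
        ["date", "positive"],
    )
    result = transform_daily_difference(df_in)
    daily = [row["daily_cases"] for row in result.orderBy("date").collect()]
    # The first day has no previous value; the remaining days are simple differences.
    assert daily[1:] == [5, 10]
```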

## Ideas for further development

### Visualization

- Extract daily cases per state.
- Integrate the state databases with their geoboundaries in a GeoParquet file.
- Develop a map visualization of the US with a colormap based on case counts.

### Database

- Implement a Hadoop/Hive database to test performance.

### Querying

- Allow custom SQL queries to retrieve information from the database.

## License

configs/config.json

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 {
-"temporal_window": 5
+"temporal_window": 7
 }
