Set up a data processing and visualization pipeline for COVID data.
## Database
The COVID Tracking Project compiled US COVID-19-related data from 02/2020 until 03/2021. It provides data by day and by state in three main areas: testing, hospitalization, and patient outcomes. The data is available through an [API](https://covidtracking.com/data/api/version-2) that returns `json` files.
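
As a quick illustration of how the data can be retrieved (a minimal sketch; the `/v2/us/daily.json` endpoint is taken from the API documentation linked above and is assumed here rather than copied from the project's own extraction code):

```python
import json
from urllib.request import urlopen

# Assumed endpoint for the national daily time series (see the API docs linked above).
URL = "https://api.covidtracking.com/v2/us/daily.json"

with urlopen(URL) as response:
    payload = json.load(response)

# The v2 API wraps the records in a "data" list; the exact layout may vary by endpoint.
print(len(payload["data"]), "daily records retrieved")
```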
As a proof of concept, this tool takes from that database the total number of COVID-19 cases by day. This metric is cumulative; therefore, to visualize daily COVID-19 cases, we need to transform the original data and calculate the day-to-day difference. Some US health bureaus reported cases only on weekdays, while others reported every day without interruption, which explains the drop in case counts seen during weekends. To correct this "noise", the tool calculates a rolling mean with a 7-day window. This transformation smooths the signal and corrects the artifact; however, it tends to hamper the detection of fast changes in the signal.
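
As an illustrative sketch only (not the project's actual code), the two transformations described above could look roughly like this in PySpark, assuming a DataFrame with `date` and `positive` (cumulative cases) columns:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid_poc").getOrCreate()

# Hypothetical input: one row per day with the cumulative case count.
df = spark.createDataFrame(
    [("2021-01-01", 100), ("2021-01-02", 130), ("2021-01-03", 130), ("2021-01-04", 190)],
    ["date", "positive"],
).withColumn("date", F.to_date("date"))

# Daily difference of the cumulative counter.
by_date = Window.orderBy("date")
df = df.withColumn("daily_cases", F.col("positive") - F.lag("positive", 1).over(by_date))

# 7-day rolling mean to smooth the weekend reporting artifact.
rolling = Window.orderBy("date").rowsBetween(-6, 0)
df = df.withColumn("daily_cases_7d_avg", F.avg("daily_cases").over(rolling))

df.show()
```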
## Solution
The schema of the solution proposed for this task is represented in the graph below. `PySpark` is used for the ETL job and `Bokeh` for generating the interactive visualization. Selected data is extracted from the database API. Then, after a data validation check, several transformations are applied to the data, such as the conversion of dates, the calculation of daily differences, and the smoothing of the time series via a rolling mean. Next, the data is loaded into a parquet database. This loading process checks for duplicates, so it can be run repeatedly without inserting duplicate records.
```mermaid
graph LR;
CT(Covid tracker JSON) --> PE(PySpark Extraction)
L --> BH(Bokeh Interactive Visualization)
```
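
The duplicate check in the load step is only described at a high level above; purely as a sketch (assuming the records are keyed by a `date` column and stored at a hypothetical `data/covid.parquet` path), it could be implemented along these lines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid_load").getOrCreate()
OUTPUT_PATH = "data/covid.parquet"  # hypothetical location of the parquet store

def load_without_duplicates(new_df, path=OUTPUT_PATH):
    """Append only the rows whose `date` is not already stored."""
    try:
        existing_dates = spark.read.parquet(path).select("date")
        fresh = new_df.join(existing_dates, on="date", how="left_anti")
    except Exception:
        # First run: nothing stored yet, so write everything.
        fresh = new_df
    if fresh.count() > 0:
        fresh.write.mode("append").parquet(path)
```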
## Output
Finally, the data previously loaded into the parquet database is used to generate an interactive Bokeh plot exported as an `html` file.
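
For illustration only (the column names and file paths below are assumptions, not the repository's actual code), such an `html` plot can be produced with Bokeh roughly as follows:

```python
import pandas as pd
from bokeh.plotting import figure, output_file, save

# Hypothetical path and column names; the real ones come from the load step.
df = pd.read_parquet("data/covid.parquet")

output_file("covid_daily_cases.html", title="Daily COVID-19 cases (7-day average)")

p = figure(x_axis_type="datetime", title="Daily COVID-19 cases, 7-day rolling mean",
           x_axis_label="Date", y_axis_label="Cases")
p.line(df["date"], df["daily_cases_7d_avg"], line_width=2)

save(p)
```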
In this proof of concept, I used the total number of COVID-19 cases. However, it can be adapted to any of the metrics available in the API.
## How to run this project
### Run in MyBinder
In order to run the analysis, we simply execute the Python program:

```bash
python -m COVID19_project
```
Different parameters can be passed to the tool through the config file `configs/config.json`. These parameters can then be passed on to the different transformation methods. This functionality still needs to be developed further.
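
The available config keys are not documented yet; purely as an illustration (the `rolling_window` key and the transformation call are hypothetical), loading the file and forwarding a parameter might look like:

```python
import json

# Load the project configuration (hypothetical key shown below).
with open("configs/config.json") as f:
    config = json.load(f)

window_size = config.get("rolling_window", 7)  # fall back to a 7-day window
# rolling_mean_transform(df, window=window_size)  # hypothetical transformation call
```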
### Run tests
Each transformation has its corresponding test. It is possible to run them with:

As an alternative to building your own image, it is possible to pull an image:

```bash
docker pull caviri/covid19:latest
```
## Project Structure
The structure of the project is inspired by this [repository](https://github.com/AlexIoannides/pyspark-example-project).
```bash
root/
| requirements.txt
```
The ETL task and the visualization tool are contained in `COVID19_project`. There, each file contains the methods required for one part of the project: extraction, transformation, loading, and visualization. Different parameters can be configured in `configs/config.json` and then used in the transformation methods. Additional modules that support the PySpark session and logging can be found in `dependencies`. Finally, unit test modules are stored in `tests`, next to small representative portions of input and output data. Each of the transformation methods has its own test function.
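
As an illustration of that testing pattern (the function and file names here are hypothetical stand-ins, not the repository's actual code), a transformation test might look like:

```python
# Hypothetical example of a test in `tests/`; real tests import the actual
# transformation methods from COVID19_project instead of defining them inline.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

def daily_difference(df):
    """Stand-in for a transformation method: daily delta of the cumulative count."""
    w = Window.orderBy("date")
    return df.withColumn("daily_cases", F.col("positive") - F.lag("positive").over(w))

def test_daily_difference():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    input_df = spark.createDataFrame(
        [("2021-01-01", 100), ("2021-01-02", 130)], ["date", "positive"]
    )
    result = {r["date"]: r["daily_cases"] for r in daily_difference(input_df).collect()}
    assert result["2021-01-02"] == 30
```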
## Ideas for further development
### Visualization
- Extract daily cases per state.
- Integrate the state databases with their geographic boundaries in a GeoParquet file.
- Develop a map visualization of the US with a colormap based on case counts.
### Database
- Implement a Hadoop/Hive database to test performance.
### Querying
- Allow custom SQL queries to retrieve information from the database.
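
One possible way to support the custom SQL queries mentioned above (a sketch only; the parquet path and column names are assumptions) would be to register the parquet data as a temporary view in Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("covid_query").getOrCreate()

# Hypothetical parquet location; expose it to SQL as a temporary view.
spark.read.parquet("data/covid.parquet").createOrReplaceTempView("covid")

# Example of a user-supplied query against the view (column names are assumed).
spark.sql(
    "SELECT date, daily_cases FROM covid WHERE daily_cases > 100000 ORDER BY date"
).show()
```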