DevLog
======

Ideas
~~~~~

- Integrate a map with one color per variable.
- Integrate information about events related to the measures, and
  introduce them in the timeline.
- Accelerate the process.
- Transforms can include a moving average, as an example.

Dev log
~~~~~~~

23/10
^^^^^

Strategy A: Creating the environment.

- ``python==3.9.13``
- PySpark needs Java: ``conda install openjdk==17.0.3``
- ``conda install pyspark==3.3.0``
- ``conda install ipykernel``
- ``python -m ipykernel install --user --name covid19``

On Windows I have some issues starting the SparkSession: it hangs
forever.
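
A minimal sketch of the session start-up used to check the environment
(the app name is just a placeholder):

.. code:: python

   from pyspark.sql import SparkSession

   # Minimal local session to verify the install; on Windows this is the
   # call that hangs for me.
   spark = (
       SparkSession.builder
       .master("local[*]")
       .appName("covid19-smoke-test")
       .getOrCreate()
   )

   print(spark.version)
   spark.stop()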

Strategy B: Docker machine

- ``docker pull jupyter/pyspark-notebook``
- ``docker run -p 10000:8888 -p 4040:4040 jupyter/pyspark-notebook``

The Docker image works fine, and the Spark dashboard is available at
``localhost:4040``.

.. _section-1:

24/10
^^^^^

After checking the db structure, the project can initially be organized
as follows:

.. mermaid::

   flowchart LR
      CT(Covid tracker JSON) --> PS(PySpark Loading)
      subgraph one[ETL]
         PS --> HV(HIVE DB)
         HV --> GP(Geoparquet)
      end
      GP --> BH(Bokeh App)

I found several boilerplates with good templates for data engineering
projects using PySpark:

- `PySpark Example
  Project <https://github.com/AlexIoannides/pyspark-example-project>`__
- `PySpark Project
  Template <https://github.com/hbaflast/pyspark-project-template>`__
- `PySpark Spotify
  ETL <https://github.com/Amaguk2023/Pyspark_Spotify_ETL>`__

Following the first example, this seems like a nice project structure to
start with:

.. code:: text

   root/
    |-- configs/
    |   |-- etl_config.json
    |-- dependencies/
    |   |-- logging.py
    |   |-- spark.py
    |-- jobs/
    |   |-- etl_job.py
    |-- tests/
    |   |-- test_data/
    |   |   |-- employees/
    |   |   |-- employees_report/
    |   |-- test_etl_job.py
    |-- build_dependencies.sh
    |-- packages.zip
    |-- Pipfile
    |-- Pipfile.lock
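
A rough sketch of how ``jobs/etl_job.py`` could be organised following
that structure (function names and paths here are my own placeholders,
not the template's exact API):

.. code:: python

   from pyspark.sql import DataFrame, SparkSession


   def extract_data(spark: SparkSession, path: str) -> DataFrame:
       """Read the raw JSON dump into a DataFrame."""
       return spark.read.json(path)


   def transform_data(df: DataFrame) -> DataFrame:
       """Apply the cleaning steps (kept trivial in this sketch)."""
       return df.dropDuplicates()


   def load_data(df: DataFrame, path: str) -> None:
       """Persist the transformed data as parquet."""
       df.write.mode("overwrite").parquet(path)


   def main() -> None:
       spark = SparkSession.builder.appName("covid19_etl").getOrCreate()
       df = extract_data(spark, "data/daily.json")
       load_data(transform_data(df), "data/covid19.parquet")
       spark.stop()


   if __name__ == "__main__":
       main()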

I managed to read the schema of the JSON for daily.json without issues.
However, when I try to create a DataFrame out of the JSON obtained via
requests, the parsed schema is reduced to the first two categories. As a
minimum viable proof, I decided to manually extract two variables,
``total_cases`` and ``date``, in order to continue with the structure of
the project.
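
Roughly what that manual extraction looks like (the endpoint URL and the
JSON key names are placeholders, not the exact ones from the tracker):

.. code:: python

   import requests
   from pyspark.sql import Row, SparkSession

   spark = SparkSession.builder.appName("covid19-mvp").getOrCreate()

   # Placeholder endpoint; the real daily.json URL goes here.
   records = requests.get("https://example.org/daily.json").json()

   # Keep only the two fields of interest; key names are assumed.
   rows = [Row(date=r["date"], total_cases=r["total_cases"]) for r in records]

   df = spark.createDataFrame(rows)
   df.printSchema()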

I'm exploring the idea of separating the project into 3 Docker
containers: one dedicated to the ETL, another to the HIVE database, and
a third for the interactive Bokeh app. In the latter, I want to include
two kinds of visualizations: one map-based, and another one for the time
series.

.. _section-2:

25/10
^^^^^

In order to simplify the development I decided to keep the
three-container idea as a future update and create a simpler version of
the workflow. The output from the ETL pipeline will be saved in a
parquet (geoparquet) file, and this will be picked up by Bokeh in order
to do the visualization.
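
A minimal sketch of that hand-off, assuming the ETL writes a plain
parquet file that the Bokeh side reads back with pandas (paths and
columns are illustrative):

.. code:: python

   import pandas as pd
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("covid19_etl").getOrCreate()

   # ETL side: persist the transformed data as parquet.
   df = spark.createDataFrame([("2020-10-25", 100)], ["date", "total_cases"])
   df.write.mode("overwrite").parquet("output/covid19.parquet")

   # Bokeh side: read the same parquet back with pandas for plotting.
   pdf = pd.read_parquet("output/covid19.parquet")
   print(pdf.head())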

Ideas for quality control:

- Missing values: interpolate values as an approximation, or use the
  mean value (``df.col_name.interpolate``, ``df.col_name.fillna``).
- Missing values, PySpark solution: ``pyspark.ml.feature.Imputer`` (see
  the sketch after this list).
  https://www.youtube.com/watch?v=K46pPG8Cepo&ab_channel=WebAgeSolutionsInc
- Data in an inconsistent format.
- Duplicate records.
- Outliers.
- Input data not normalized.
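
A small sketch of the ``Imputer`` option (column names and values are
made up):

.. code:: python

   from pyspark.ml.feature import Imputer
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("qc-imputer").getOrCreate()

   # Toy frame with a missing value in total_cases.
   df = spark.createDataFrame(
       [(1, 10.0), (2, None), (3, 30.0)], "day INT, total_cases DOUBLE"
   )

   # Replace nulls with the column mean (strategy could also be "median").
   imputer = Imputer(
       inputCols=["total_cases"],
       outputCols=["total_cases_imputed"],
       strategy="mean",
   )
   imputer.fit(df).transform(df).show()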

We can run SQL directly against the parquet data:

.. code:: python

   parqDF.createOrReplaceTempView("ParquetTable")
   parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")

.. _section-3:

26/10
^^^^^

After dealing with some problems related to the date/datetime format, I
got the first MVP of the pipeline. Data is now extracted, dates are
transformed into a proper datetime type, and the data is loaded into a
parquet db. The Bokeh app is able to read this data from the database
and plot a simple time-series plot in HTML. This is the candidate for
the first release.
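
The date fix boils down to something like this (the raw ``yyyyMMdd``
integer format is an assumption about the tracker data):

.. code:: python

   from pyspark.sql import SparkSession
   from pyspark.sql import functions as F

   spark = SparkSession.builder.appName("date-fix").getOrCreate()

   # Dates arrive as integers such as 20201026 (assumed format).
   df = spark.createDataFrame([(20201026, 100)], ["date", "total_cases"])

   # Cast to string and parse into a proper DateType column.
   df = df.withColumn("date", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))
   df.printSchema()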

.. _section-4:

27/10
^^^^^

I included some exception handling for the API request. Now the
database can be overwritten without duplicate issues. I also added
another transformation: a rolling mean. Next I will add unit tests for
those transformations using a small dataset.
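
A minimal sketch of the rolling-mean transformation with a window
function (window size and column names are illustrative):

.. code:: python

   from pyspark.sql import SparkSession
   from pyspark.sql import functions as F
   from pyspark.sql.window import Window

   spark = SparkSession.builder.appName("rolling-mean").getOrCreate()

   df = spark.createDataFrame(
       [("2020-10-24", 10), ("2020-10-25", 20), ("2020-10-26", 30)],
       ["date", "total_cases"],
   )

   # 3-row rolling mean ordered by date; with no partitionBy Spark moves
   # everything to a single partition, which explains the WindowExec
   # warning in the log further down.
   window = Window.orderBy("date").rowsBetween(-2, 0)
   df = df.withColumn("total_cases_rolling", F.avg("total_cases").over(window))
   df.show()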

The test works correctly for one transformation. Now tests need to be
written for every transformation.

The application runs smoothly with ``python -m covid19_project``, but
some warnings appeared:

.. code:: text

   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
     for column, series in pdf.iteritems():
   /usr/local/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.
     warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   22/10/27 11:56:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation

To run the tests: ``python -m unittest test/test_*.py``

Still some work is required when using spark-submit with
``$SPARK_HOME/bin/spark-submit --master local[*] --files configs/config.json covid19_project/__main__.py``
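
For reference, a config shipped with ``--files`` is typically read back
inside the job roughly like this (a sketch, not the project's actual
loader):

.. code:: python

   import json

   from pyspark import SparkFiles
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("covid19_etl").getOrCreate()

   # Files passed via --files are distributed with the job;
   # SparkFiles.get resolves the local path to config.json.
   with open(SparkFiles.get("config.json")) as handle:
       config = json.load(handle)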

.. _section-5:

28/10
^^^^^

Applied some style corrections with flake8 and configured the Docker
container for mybinder correctly.