This challenge requires you to run a Docker container with Jupyter, Spark, and MySQL, then interact with the MySQL database using PySpark.
The Docker image already contains Jupyter, Spark, MySQL, and the MySQL drivers. Pay attention to the environment: any troubleshooting that comes up along the way is part of the challenge.
Execute the following command to start the container:
docker run -e NB_UID=1000 -e NB_GID=100 -p 8888:8888 pjunior1/jupyter-spark-data-enginerring
This will start Jupyter Notebook, Spark, and MySQL inside the container.
Once the container is running, open Jupyter Notebook in your browser by navigating to:
http://localhost:8888
Use the token provided in the terminal to log in.
Inside Jupyter Notebook, create a new notebook and start a Spark session that allows connecting to MySQL.
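As a rough sketch, a minimal session can be created as shown below. It assumes the MySQL connector JAR bundled with the image is already on Spark's classpath; the commented-out config lines are only a fallback if the connector or Delta Lake libraries are missing, and the package versions there are assumptions that must be matched to the Spark version inside the container.

```python
from pyspark.sql import SparkSession

# Minimal session. The image ships with the MySQL driver, so extra JARs
# should not be needed; uncomment the config lines only if the connector
# or Delta Lake are not already available (versions are assumptions).
spark = (
    SparkSession.builder
    .appName("mysql-challenge")
    # .config("spark.jars.packages",
    #         "mysql:mysql-connector-java:8.0.33,io.delta:delta-core_2.12:2.4.0")
    # .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    # .config("spark.sql.catalog.spark_catalog",
    #         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```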
Use PySpark to establish a connection to the MySQL database using the following details:
- Database URL: jdbc:mysql://localhost:3306/test_db?useSSL=false&allowPublicKeyRetrieval=true
- User: jovyan
- Password: password
- Driver: com.mysql.jdbc.Driver
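As a sketch, these details can be collected once and reused for every JDBC read and write; `spark` is the session created above, and the table name in the smoke test is a placeholder.

```python
jdbc_url = (
    "jdbc:mysql://localhost:3306/test_db"
    "?useSSL=false&allowPublicKeyRetrieval=true"
)
jdbc_props = {
    "user": "jovyan",
    "password": "password",
    "driver": "com.mysql.jdbc.Driver",
}

# Quick smoke test: read one table over JDBC. "some_table" is a placeholder --
# replace it with one of the tables you list in the next step.
df = spark.read.jdbc(url=jdbc_url, table="some_table", properties=jdbc_props)
df.show(5)
```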
Then complete the following tasks (a sketch covering all of them follows this list):
- List all available tables in the test_db database.
- Create a DataFrame that calculates the total spending per user.
- Write that DataFrame back to MySQL as a new table called Results, and display the results in the notebook.
- Write the same DataFrame to local disk as a Delta table.
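One way to approach these tasks, reusing `spark`, `jdbc_url`, and `jdbc_props` from above: the table and column names (purchases, user_id, amount) are assumptions about the schema of test_db and must be replaced with whatever the table listing actually returns, and the Delta output path is arbitrary.

```python
from pyspark.sql import functions as F

# 1. List the tables in test_db by querying information_schema over JDBC.
tables = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT table_name FROM information_schema.tables "
          "WHERE table_schema = 'test_db') AS t",
    properties=jdbc_props,
)
tables.show(truncate=False)

# 2. Total spending per user. Table/column names below are assumed --
#    adapt them to the real schema discovered above.
purchases = spark.read.jdbc(url=jdbc_url, table="purchases", properties=jdbc_props)
total_spending = (
    purchases.groupBy("user_id")
    .agg(F.sum("amount").alias("total_spent"))
)
total_spending.show()

# 3. Write the result back to MySQL as a new table called Results.
total_spending.write.jdbc(
    url=jdbc_url, table="Results", mode="overwrite", properties=jdbc_props
)

# 4. Write the same DataFrame to local disk as a Delta table (requires the
#    Delta Lake libraries mentioned earlier; the path is arbitrary).
total_spending.write.format("delta").mode("overwrite").save("/home/jovyan/results_delta")
```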
When you are done:
- Install Git if it's not already installed.
- Fork this repo and add your notebook along with the folder produced by the Delta table write.