- 🎥 5.1.1 Introduction to Batch Processing
- 🎥 5.1.2 Introduction to Spark
Follow these instructions to install Spark:
And follow this guide to run PySpark in Jupyter:
- 🎥 5.2.1 (Optional) Installing Spark (Linux)
Alternatively, if the setups above don't work, you can run Spark in Google Colab.
Note
It's advisable to invest some time in setting things up locally rather than jumping straight to this solution. A quick sanity check for a working local setup is sketched below.
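Once PySpark is importable from a notebook, a quick way to confirm the setup works end to end is to create a local SparkSession and build a tiny DataFrame. This is only a sanity-check sketch; the app name is arbitrary:

```python
import pyspark
from pyspark.sql import SparkSession

# Run Spark locally, using all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

print(pyspark.__version__)

# If this shows a two-row table, the installation works.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])
df.show()
```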
- 🎥 5.3.1 First Look at Spark/PySpark
- 🎥 5.3.2 Spark Dataframes
- 🎥 5.3.3 (Optional) Preparing Yellow and Green Taxi Data
Script to prepare the dataset: download_data.sh. A Python sketch of the same download loop follows below.
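For reference, the same idea sketched in Python. The base URL here is a placeholder, not the real source; check download_data.sh for the actual location and the months it covers:

```python
import os
import urllib.request

BASE_URL = "https://example.com/trip-data"  # placeholder; see download_data.sh

for taxi_type in ["yellow", "green"]:
    os.makedirs(f"data/raw/{taxi_type}", exist_ok=True)
    for month in range(1, 13):
        fname = f"{taxi_type}_tripdata_2021-{month:02d}.csv.gz"
        print(f"downloading {fname}")
        urllib.request.urlretrieve(f"{BASE_URL}/{fname}",
                                   f"data/raw/{taxi_type}/{fname}")
```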
Note
Another way to infer the schema for the CSV files (apart from using pandas) is to set the inferSchema option to true when reading the files in Spark; see the sketch below.
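A sketch of both approaches, assuming the session from the setup check above; the file path is a placeholder and the explicit schema lists only a few of the real green taxi columns:

```python
from pyspark.sql import types

path = "data/raw/green/green_tripdata_2021-01.csv.gz"

# Option 1: let Spark scan the file and guess the column types.
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(path)

# Option 2: declare the schema explicitly; faster, and the types are guaranteed.
schema = types.StructType([
    types.StructField("VendorID", types.IntegerType(), True),
    types.StructField("lpep_pickup_datetime", types.TimestampType(), True),
    types.StructField("PULocationID", types.IntegerType(), True),
    types.StructField("total_amount", types.DoubleType(), True),
])
df = spark.read.option("header", "true").schema(schema).csv(path)
df.printSchema()
```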
- 🎥 5.3.4 SQL with Spark
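The core pattern from 5.3.4 is to register a DataFrame as a temporary view and query it with regular SQL. A minimal sketch; the view name and columns are illustrative:

```python
# Make the DataFrame visible to Spark SQL under a view name.
df.createOrReplaceTempView("trips_data")

result = spark.sql("""
    SELECT
        PULocationID,
        COUNT(1) AS number_records,
        SUM(total_amount) AS amount
    FROM trips_data
    GROUP BY PULocationID
""")
result.show(5)
```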
- 🎥 5.4.1 Anatomy of a Spark Cluster
- 🎥 5.4.2 GroupBy in Spark
- 🎥 5.4.3 Joins in Spark
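A combined sketch for 5.4.2 and 5.4.3: aggregate each dataset per hour and pickup zone, then join the two aggregates. The DataFrame and column names are assumptions in the spirit of the taxi data, not the exact code from the videos:

```python
from pyspark.sql import functions as F

def revenue_per_hour_and_zone(df, pickup_col):
    """GroupBy: revenue and trip count per hour and pickup zone."""
    return df \
        .withColumn("hour", F.date_trunc("hour", pickup_col)) \
        .groupBy("hour", "PULocationID") \
        .agg(
            F.sum("total_amount").alias("amount"),
            F.count("*").alias("number_records"),
        )

df_green_agg = revenue_per_hour_and_zone(df_green, "lpep_pickup_datetime")
df_yellow_agg = revenue_per_hour_and_zone(df_yellow, "tpep_pickup_datetime")

# Join: an outer join keeps hours/zones that appear in only one dataset.
df_join = df_green_agg \
    .withColumnRenamed("amount", "green_amount") \
    .withColumnRenamed("number_records", "green_records") \
    .join(
        df_yellow_agg
            .withColumnRenamed("amount", "yellow_amount")
            .withColumnRenamed("number_records", "yellow_records"),
        on=["hour", "PULocationID"],
        how="outer",
    )
df_join.show(5)
```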
- 🎥 5.5.1 Operations on Spark RDDs
- 🎥 5.5.2 Spark RDD mapPartition
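A sketch covering both RDD videos: dropping from a DataFrame to its underlying RDD for map/filter/reduceByKey, and mapPartitions for processing a whole partition at a time. Column names again assume the green taxi data:

```python
# The DataFrame's underlying RDD of Row objects.
rdd = df.select("PULocationID", "total_amount").rdd

# 5.5.1-style operations: filter rows, map to key-value pairs, reduce by key.
revenue_per_zone = rdd \
    .filter(lambda row: row.total_amount > 0) \
    .map(lambda row: (row.PULocationID, row.total_amount)) \
    .reduceByKey(lambda a, b: a + b)
print(revenue_per_zone.take(5))

# 5.5.2: mapPartitions receives an iterator over an entire partition, which is
# handy for batch-wise work such as scoring chunks of rows with a model.
def rows_per_partition(rows):
    yield sum(1 for _ in rows)

print(rdd.mapPartitions(rows_per_partition).collect())
```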
- 🎥 5.6.1 Connecting to Google Cloud Storage
- 🎥 5.6.2 Creating a Local Spark Cluster
- 🎥 5.6.3 Setting up a Dataproc Cluster
- 🎥 5.6.4 Connecting Spark to BigQuery
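Since the 5.6.x videos are configuration-heavy, here are hedged sketches of the three Spark-side pieces. First, reading from Google Cloud Storage: this assumes you have downloaded the Hadoop GCS connector jar and have a service-account key file; jar versions and paths will differ on your machine:

```python
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

credentials = "google_credentials.json"  # path to your service-account key

conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("gcs-test") \
    .set("spark.jars", "./lib/gcs-connector-hadoop3-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials)

sc = SparkContext(conf=conf)
# Register the GCS filesystem implementation so gs:// paths resolve.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

spark = SparkSession.builder.config(conf=sc.getConf()).getOrCreate()
df = spark.read.parquet("gs://your-bucket/pq/green/*")  # bucket is a placeholder
```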
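Second, pointing a session at a local standalone cluster instead of local mode. This assumes a master and worker started with the scripts that ship with Spark; 7077 is the standalone default port:

```python
from pyspark.sql import SparkSession

# Assumes (in a terminal, on recent Spark versions):
#   $SPARK_HOME/sbin/start-master.sh
#   $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
spark = SparkSession.builder \
    .master("spark://localhost:7077") \
    .appName("cluster-test") \
    .getOrCreate()
```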
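Third, writing a result to BigQuery with the spark-bigquery connector. Here df_result stands for whatever DataFrame you want to save, and the dataset, table, and bucket names are placeholders; on Dataproc the connector can be pulled in with --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar when submitting the job:

```python
# The connector stages data in GCS before loading it into BigQuery.
spark.conf.set("temporaryGcsBucket", "your-temp-bucket")  # placeholder bucket

df_result.write \
    .format("bigquery") \
    .option("table", "your_dataset.reports") \
    .save()
```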
Did you take notes? You can share them here.
- Notes by Alvaro Navas
- Sandy's DE Learning Blog
- Notes by Alain Boisvert
- Alternative: Using docker-compose to launch Spark, by rafik
- Marcos Torregrosa's blog (Spanish)
- Notes by Victor Padilha
- Notes by Oscar Garcia
- Notes by HongWei
- 2024 videos transcript by Maria Fisher
- Add your notes here (above this line)