The project just got its own article on the Towards Data Science Medium blog! ✨
This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface, built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the provided Jupyter notebooks, which include examples of how to read, process and write data.
Download the Docker Compose file and start the cluster:

    curl -LO https://raw.githubusercontent.com/andre-marcos-perez/spark-standalone-cluster-on-docker/master/docker-compose.yml
    docker-compose up

| Application | URL | Description |
|---|---|---|
| JupyterLab | localhost:8888 | Cluster interface with built-in Jupyter notebooks | 
| Apache Spark Master | localhost:8080 | Spark Master node | 
| Apache Spark Worker I | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) | 
| Apache Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) | 
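Once the cluster is up, a quick way to confirm that each web UI listed in the table above is listening is to probe its port. The following is a minimal sketch in Python using only the standard library. The helper names (`is_port_open`, `cluster_status`) are illustrative and not part of this project, and the check only tells you that a TCP port accepts connections, not that the application behind it is healthy.

```python
import socket

# Ports taken from the cluster overview table; adjust them if you changed
# the port mappings in your docker-compose.yml.
CLUSTER_UIS = {
    "JupyterLab": 8888,
    "Apache Spark Master": 8080,
    "Apache Spark Worker I": 8081,
    "Apache Spark Worker II": 8082,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def cluster_status(host: str = "localhost") -> dict:
    """Map each application name to whether its UI port is reachable."""
    return {name: is_port_open(host, port) for name, port in CLUSTER_UIS.items()}


if __name__ == "__main__":
    for name, up in cluster_status().items():
        print(f"{name}: {'up' if up else 'down'}")
```

If a port is reported as down right after `docker-compose up`, the container may still be starting, so it can be worth retrying after a few seconds.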
- Install Docker and Docker Compose, and check the supported infrastructure versions;
- Download the source code or clone the repository;
- Edit the Docker Compose file with your preferred tech stack versions, and check the supported application versions;
- Build the cluster: `docker-compose up`;
- Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
- Stop the cluster by pressing Ctrl+C.
Note: Local builds are currently supported only on Linux distributions.
- Download the source code or clone the repository;
- Move to the build directory: `cd build`;
- Edit the build.yml file with your preferred tech stack versions;
- Match those versions in the Docker Compose file;
- Build the images: `chmod +x build.sh ; ./build.sh`;
- Build the cluster: `docker-compose up`;
- Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
- Stop the cluster by pressing Ctrl+C.
- Infrastructure
| Component | Version | 
|---|---|
| Docker Engine | 1.13.0+ | 
| Docker Compose | 1.10.0+ | 
| Python | 3.7.3 | 
| Scala | 2.12.11 | 
| R | 3.5.2 | 
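To check an installed Docker Engine or Docker Compose against the minimum versions in the table above, the version numbers need to be compared numerically rather than as text, since `1.9.0` sorts after `1.10.0` as a string. Below is a small sketch of that comparison in Python. The function names are hypothetical, and the sample strings in the demo are illustrative stand-ins for the output of `docker --version` and `docker-compose --version` on your machine.

```python
import re

# Minimum versions from the infrastructure table above.
MINIMUM_VERSIONS = {
    "Docker Engine": "1.13.0",
    "Docker Compose": "1.10.0",
}


def parse_version(text: str) -> tuple:
    """Extract the first x.y.z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare versions numerically, so that 1.9.0 counts as older than 1.10.0."""
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    # Illustrative strings only; replace them with the real command output.
    samples = {
        "Docker Engine": "Docker version 19.03.8",
        "Docker Compose": "docker-compose version 1.9.0",
    }
    for component, output in samples.items():
        ok = meets_minimum(output, MINIMUM_VERSIONS[component])
        print(f"{component}: {'ok' if ok else 'too old'}")
```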
- Jupyter Kernels
| Component | Version | Provider | 
|---|---|---|
| Python | 2.1.4 | Jupyter | 
| Scala | 0.10.0 | Almond | 
| R | 1.1.1 | IRkernel | 
- Applications
| Component | Version | Docker Tag | 
|---|---|---|
| Apache Spark | 2.4.0, 2.4.4, 3.0.0 | `<spark-version>-hadoop-2.7` |
| JupyterLab | 2.1.4 | `<jupyterlab-version>-spark-<spark-version>` |
The Apache Spark R API (SparkR) is only supported on Spark version 2.4.4. Full list can be found here.
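The Docker tag patterns in the applications table can be expanded mechanically from the versions you pick. As a rough illustration, the sketch below builds both tags and flags whether SparkR is available for a given Spark version. The function names are made up for this example, and the version list simply mirrors the table above, so it will go stale if the project adds releases.

```python
# Versions listed in the applications table above.
SUPPORTED_SPARK_VERSIONS = ("2.4.0", "2.4.4", "3.0.0")
SPARKR_SPARK_VERSION = "2.4.4"  # SparkR is only supported on this version


def spark_tag(spark_version: str) -> str:
    """Build the Spark image tag: <spark-version>-hadoop-2.7."""
    if spark_version not in SUPPORTED_SPARK_VERSIONS:
        raise ValueError(f"unsupported Spark version: {spark_version}")
    return f"{spark_version}-hadoop-2.7"


def jupyterlab_tag(jupyterlab_version: str, spark_version: str) -> str:
    """Build the JupyterLab image tag: <jupyterlab-version>-spark-<spark-version>."""
    if spark_version not in SUPPORTED_SPARK_VERSIONS:
        raise ValueError(f"unsupported Spark version: {spark_version}")
    return f"{jupyterlab_version}-spark-{spark_version}"


def supports_sparkr(spark_version: str) -> bool:
    """Return True if the SparkR notebooks can be used with this Spark version."""
    return spark_version == SPARKR_SPARK_VERSION


if __name__ == "__main__":
    for version in SUPPORTED_SPARK_VERSIONS:
        print(spark_tag(version), jupyterlab_tag("2.1.4", version), supports_sparkr(version))
```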
| Image | Size | Downloads | 
|---|---|---|
| JupyterLab | | |
| Spark Master | | |
| Spark Worker | | |
We'd love some help. To contribute, please read this file.
Starring us on GitHub is also an awesome way to show your support ⭐
- André Perez - dekoperez
