This project demonstrates distributed execution of a machine learning task with Apache Spark on a cluster of 4 instances. The task is to train a model that predicts wine quality; the SVC model achieves an F1 score of 0.7634 on the validation dataset. The instructions below cover the complete setup, from configuring the instances to running the Spark job with Docker.
Link to Docker image: https://hub.docker.com/r/shreyasshende/wine-quality-eval
Log into your 4 instances using SSH. Replace <instance-ip> with the IP address of each instance.
ssh -i /path/to/your/private-key.pem ubuntu@<instance-ip>
On each instance, generate an SSH key pair to enable passwordless communication.
ssh-keygen -t rsa -N "" -f /home/ubuntu/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub
Copy the public key from each instance and add it to the authorized_keys file of every other instance.
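The key exchange for one pair of nodes can be sketched as follows (a minimal sketch; `dd1` stands for any of the other instances, and this must be repeated for every node pair):

```shell
# Read this node's public key, then append it to the target node's
# authorized_keys with the permissions sshd requires (dd1 is illustrative)
PUBKEY=$(cat ~/.ssh/id_rsa.pub)
ssh ubuntu@dd1 "mkdir -p ~/.ssh && chmod 700 ~/.ssh \
  && echo '$PUBKEY' >> ~/.ssh/authorized_keys \
  && chmod 600 ~/.ssh/authorized_keys"
```

Alternatively, `ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@dd1` performs the same append in one step.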
On each instance, map the hostnames of all instances in the /etc/hosts file.
sudo vim /etc/hosts
Add the following entries (replace <ip-address> with the actual instance IPs):
<ip-address> nn
<ip-address> dd1
<ip-address> dd2
<ip-address> dd3
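To confirm the mappings took effect, name resolution can be checked from any node (an illustrative check, assuming the hostnames above):

```shell
# getent consults /etc/hosts, so each name should resolve to its mapped IP
getent hosts nn dd1 dd2 dd3
# Optionally confirm passwordless SSH works end to end
ssh dd1 hostname
```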
Install Java, Maven, and Spark on all instances.
Install Java:
sudo apt update
sudo apt install openjdk-8-jdk -y
Install Maven:
sudo apt install maven -y
Install Spark:
- Download and extract Spark:
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
tar -xvzf spark-3.4.1-bin-hadoop3.tgz
- Set environment variables:
echo "export SPARK_HOME=/home/ubuntu/spark-3.4.1-bin-hadoop3" >> ~/.bashrc
echo "export PATH=\$SPARK_HOME/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
Copy the workers.template file to workers and update it:
cp $SPARK_HOME/conf/workers.template $SPARK_HOME/conf/workers
vim $SPARK_HOME/conf/workers
Add the following lines:
localhost
dd1
dd2
dd3
(Use the hostnames mapped in /etc/hosts, or the instances' IP addresses.)
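With the workers file in place, the standalone cluster can be brought up from the master node (a sketch, assuming the default sbin scripts shipped with Spark 3.4.1):

```shell
# On the master node (nn): start the Spark master process
$SPARK_HOME/sbin/start-master.sh
# Start every worker listed in conf/workers (relies on passwordless SSH)
$SPARK_HOME/sbin/start-workers.sh
```

The master web UI should then be reachable at http://<master-ip>:8080, listing all connected workers.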
Create Training and Eval directories on all instances:
mkdir ~/Training
mkdir ~/Eval
Place the Java code files for training and evaluation into these directories.
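If the Java sources are Maven projects (as the SNAPSHOT jar names below suggest), each would be built in place before submission; a sketch, assuming a standard Maven layout:

```shell
# Build the training project; Maven places the jar under target/
cd ~/Training
mvn -q package
ls target/wine-quality-train-1.0-SNAPSHOT.jar
```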
Use the following command to execute the training code with Spark:
spark-submit --master spark://<master-ip>:7077 --class com.example.WineQualityEval /home/ubuntu/Training/wine-quality-train-1.0-SNAPSHOT.jar
Replace <master-ip> with the Spark master instance's IP address.
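The training code itself is not shown in this README. A hypothetical sketch of an SVC-based trainer with Spark ML follows; the class name, dataset path, separator, and column names are assumptions for illustration, not the project's actual code:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LinearSVC;
import org.apache.spark.ml.classification.OneVsRest;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WineQualityTrainSketch {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder()
                .appName("WineQualityTrain").getOrCreate();

        // Assumed input: a semicolon-separated CSV with numeric feature
        // columns plus a "quality" label column (path/layout not confirmed)
        Dataset<Row> train = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .option("sep", ";")
                .csv("/home/ubuntu/TrainingDataset.csv");

        // Assemble every non-label column into a single feature vector
        String[] features = Arrays.stream(train.columns())
                .filter(c -> !c.equals("quality"))
                .toArray(String[]::new);
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(features)
                .setOutputCol("features");

        // LinearSVC is binary, so OneVsRest lifts it to the multi-class label
        LinearSVC svc = new LinearSVC().setLabelCol("quality");
        OneVsRest ovr = new OneVsRest().setClassifier(svc).setLabelCol("quality");

        PipelineModel model = new Pipeline()
                .setStages(new PipelineStage[]{assembler, ovr})
                .fit(train);

        // Persist where the Dockerfile below expects the model
        model.write().overwrite().save("/home/ubuntu/WineQualityPredictionModel");
        spark.stop();
    }
}
```

The evaluation job would mirror this: load the saved PipelineModel, transform ValidationDataset.csv, and score it with MulticlassClassificationEvaluator using the "f1" metric.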
Create a Docker image to package your application.
Dockerfile:
# Use the official Spark image as a base image
FROM bitnami/spark:3.4.1
# Set the working directory inside the container
WORKDIR /app
# Copy WineQualityEval (containing the JAR) to the container
COPY WineQualityEval /app/WineQualityEval
# Copy WineQualityPredictionModel to /home/ubuntu
COPY WineQualityPredictionModel /home/ubuntu/WineQualityPredictionModel
# Copy ValidationDataset.csv to /home/ubuntu
COPY ValidationDataset.csv /home/ubuntu/ValidationDataset.csv
# Set the command to run your Spark job
CMD ["spark-submit", "--master", "local", "--class", "com.example.WineQualityEval", "/app/WineQualityEval/target/wine-quality-eval-1.0-SNAPSHOT.jar"]
Build and Push Docker Image:
sudo docker build -t shreyasshende/wine-quality-eval:latest .
sudo docker push shreyasshende/wine-quality-eval:latest
Pull the Docker image on each instance:
sudo docker pull shreyasshende/wine-quality-eval:latest
Run the container:
sudo docker run shreyasshende/wine-quality-eval:latest
The F1 score achieved on the validation dataset is:
F1 Score: 0.7634
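For context, F1 is the harmonic mean of precision and recall; Spark's MulticlassClassificationEvaluator averages it across the quality classes. A tiny self-contained illustration of the per-class formula (the counts below are made up, not taken from this project):

```java
public class F1Demo {
    // Per-class F1 from true positives, false positives, false negatives
    static double f1(int tp, int fp, int fn) {
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Example: 30 TP, 10 FP, 10 FN -> precision 0.75, recall 0.75
        System.out.printf("%.4f%n", f1(30, 10, 10)); // prints 0.7500
    }
}
```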