Here we'll show you how to install Spark 3.3.2 on Windows. We tested it on Windows 10 and 11 Home edition, but it should work on other versions as well.
In this tutorial, we'll use MINGW/Git Bash as the command line.
If you use WSL, follow the instructions in linux.md instead.
We'll use Java 11, which Spark 3.3.2 supports. Download it from here: https://www.oracle.com/de/java/technologies/javase/jdk11-archive-downloads.html. Select “Windows x64 Compressed Archive” (you may have to create an Oracle account for that).
Unpack it to a folder with no spaces in the path. We use C:/tools, so the full path to the JDK is /c/tools/jdk-11.0.13.
Now let’s configure it and add it to PATH:
export JAVA_HOME="/c/tools/jdk-11.0.13"
export PATH="${JAVA_HOME}/bin:${PATH}"
Check that Java works correctly:
java --version
Output:
java 11.0.13 2021-10-19 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.13+10-LTS-370)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.13+10-LTS-370, mixed mode)
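If several JDKs are installed, it can help to confirm that the one on PATH is really version 11. The following is a minimal sketch, not part of the original setup: the helper name java_major_version is hypothetical, and it assumes the first line of the version output looks like the one shown above (vendor name, then the version number).

```shell
# Extract the major version from the first line of `java --version` output.
# Expects text such as "java 11.0.13 2021-10-19 LTS" as its only argument.
java_major_version() {
  printf '%s\n' "$1" | head -n 1 | awk '{ print $2 }' | cut -d. -f1
}

# Hypothetical usage: warn when the active JDK is not version 11.
# major="$(java_major_version "$(java --version 2>&1)")"
# [ "${major}" = "11" ] || echo "Expected Java 11, found: ${major}"
```

The function takes the version text as an argument rather than calling java itself, so it can be tried on a pasted string first.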
Next, we need Hadoop binaries.
We'll use the Hadoop 3.2.0 build from the cdarlint/winutils repository on GitHub.
Create a folder (/c/tools/hadoop-3.2.0/bin), cd into it, and download the files there. Hadoop looks for winutils.exe inside the bin subfolder of HADOOP_HOME:
HADOOP_VERSION="3.2.0"
# No trailing slash here, so "${PREFIX}/${FILE}" does not produce a double slash
PREFIX="https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-${HADOOP_VERSION}/bin"
FILES="hadoop.dll hadoop.exp hadoop.lib hadoop.pdb libwinutils.lib winutils.exe winutils.pdb"
# Download each file into the current folder
for FILE in ${FILES}; do
  wget "${PREFIX}/${FILE}"
done
If you don't have wget, you can use curl:
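HADOOP_VERSION="3.2.0"
# No trailing slash here, so "${PREFIX}/${FILE}" does not produce a double slash
PREFIX="https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-${HADOOP_VERSION}/bin"
FILES="hadoop.dll hadoop.exp hadoop.lib hadoop.pdb libwinutils.lib winutils.exe winutils.pdb"
# Download each file into the current folder, keeping its original name
for FILE in ${FILES}; do
  curl -o "${FILE}" "${PREFIX}/${FILE}"
done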
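A failed download can leave a file missing or empty, which only shows up later as a confusing Spark error. As a rough sketch, the check below reports any of the seven files that is absent or has zero size. The function name check_hadoop_files is hypothetical and not part of any tool used here.

```shell
# List any expected winutils file that is missing or empty in the given folder.
# Returns 0 when all files are present and non-empty, 1 otherwise.
check_hadoop_files() {
  local dir="$1"
  local missing=0
  local file
  for file in hadoop.dll hadoop.exp hadoop.lib hadoop.pdb \
              libwinutils.lib winutils.exe winutils.pdb; do
    # -s is true only for files that exist and are larger than zero bytes
    if [ ! -s "${dir}/${file}" ]; then
      echo "missing or empty: ${dir}/${file}"
      missing=1
    fi
  done
  return "${missing}"
}
```

Run it from the folder where you downloaded the files, for example `check_hadoop_files .`, and re-download anything it lists.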
Add it to PATH:
export HADOOP_HOME="/c/tools/hadoop-3.2.0"
export PATH="${HADOOP_HOME}/bin:${PATH}"
Now download Spark. Select version 3.3.2:
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
Unpack it in a location without spaces in the path, e.g. C:/tools:
tar xzfv spark-3.3.2-bin-hadoop3.tgz
Let's also add it to PATH:
export SPARK_HOME="/c/tools/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
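The export commands in this tutorial only affect the current Git Bash session, so they are lost when you close the window. One way to keep them is to append them to a startup file. This is a sketch under two assumptions: your Git Bash reads ~/.bashrc on startup, and you used the same folders as this tutorial. The function name persist_spark_env is hypothetical.

```shell
# Append the tutorial's export lines to a shell startup file.
# The default target (~/.bashrc) is an assumption; pass another file to override it.
persist_spark_env() {
  local target="${1:-${HOME}/.bashrc}"
  # The quoted 'EOF' keeps ${...} unexpanded, so the variables are
  # resolved each time a new shell starts, not when this function runs.
  cat >> "${target}" <<'EOF'
export JAVA_HOME="/c/tools/jdk-11.0.13"
export HADOOP_HOME="/c/tools/hadoop-3.2.0"
export SPARK_HOME="/c/tools/spark-3.3.2-bin-hadoop3"
export PATH="${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${PATH}"
EOF
}
```

Call `persist_spark_env` once, then open a new Git Bash window. Calling it again appends the same lines a second time, so check the file before re-running.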
Go to the Spark directory:
cd spark-3.3.2-bin-hadoop3
And run spark-shell:
./bin/spark-shell.cmd
At this point you may get a prompt from Windows Firewall. Allow it.
You may also see some warnings like these:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/tools/spark-3.3.2-bin-hadoop3/jars/spark-unsafe_2.12-3.3.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
You can safely ignore them.
Now let's run a small test job in the Spark shell. It should return the numbers 1 through 9:
val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
Setting up PySpark is the same on all platforms. Continue with pyspark.md.