[3] Dataproc in Google Cloud
My slave nodes have quite typical hardware (HW) specifications: 16 vCores with 64 GB of memory. Most mid-size data calculations run well on my 4-node Spark/Hadoop cluster, along with free-form interactive scratch work in Jupyter Notebook.
But it is quite clear that I need various cluster configurations that are not available on my custom Spark cluster.
I have found that GCP provides a good suite of hardware (even Gargantua-memory machines):
Google Cloud Node HW Specifications
Here are the very basic steps for total beginners in GCP.
- Make a GCP account
Yeah, a free $300 balance in your billing account!
- Create a "project" and link your billing to this new project
- Configure gcloud:
$ gcloud auth login
$ gcloud config set project VALUE
$ gcloud config set account ACCOUNT
- Create a "cluster" instance in the dataproc option. You need to check the "version" of the "image" to select a compatible pyspark (at least if there are library version issues). If you would like to use a Jupyter notebook on GCP, there is an option to launch with the notebook. Here is the link for the dataproc image list.
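As a quick sanity check, you can compare your PySpark build against the Spark runtime on the cluster (a minimal sketch; run it in a pyspark shell or notebook on the master node):
import pyspark
from pyspark.sql import SparkSession

# Version of the PySpark library itself
print(pyspark.__version__)

# Version of the Spark runtime behind an active session
spark = SparkSession.builder.getOrCreate()
print(spark.version)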
- There are two (three?) ways to use your new cluster.
[1] ssh to the master node and run the pyspark code on it
[2] use command-line SDKs such as gcloud and gsutil on your local terminal. I prefer this. Do not install these via conda due to some root privilege issues. Install them following this link.
- Here is a basic example to run "the" PI program:
gcloud dataproc jobs submit pyspark --cluster spark-mini --region asia-northeast1 ./pi.py
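For reference, a pi.py along these lines would do: the classic Monte Carlo estimator (a minimal sketch, not necessarily the exact script; NUM_SAMPLES is an arbitrary choice):
# pi.py: Monte Carlo estimation of pi (sketch)
import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiEstimation").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 1000000  # arbitrary sample count

def inside(_):
    # Draw a random point in the unit square; count it if it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1.0 else 0

count = sc.parallelize(range(NUM_SAMPLES), 100).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

spark.stop()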
- For enabling the Jupyter notebook, please check this link. It seems the initialization action has changed since the release of image 1.3. Here is the new create script:
gcloud beta dataproc clusters create spark-mini \
--optional-components=ANACONDA,JUPYTER \
--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow matplotlib seaborn",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/python/conda-install.sh \
--enable-component-gateway \
--bucket shongdata \
--project pyspark-multiverse \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 32GB \
--worker-machine-type n1-highmem-8 --worker-boot-disk-size 32GB --num-workers 2 \
--image-version 1.4 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.7.0-spark2.4-s_2.11
I need a large Dataproc cluster with Spark/graphframes.
The default quota is too small: 24 cores with 256 GB (?!).
I had to submit a request to increase my quota to 192 cores with 2048 GB. It took only a day to get the quota raised. So... they work hard!
gcloud dataproc clusters create spark-mini \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-8 --worker-boot-disk-size 64GB --num-workers 2 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
gcloud dataproc clusters create spark-mid \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-32 --worker-boot-disk-size 64GB --num-workers 3 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
gcloud dataproc clusters create spark-large \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-10-65536 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-32 --worker-boot-disk-size 128GB --num-workers 6 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
gcloud dataproc clusters create spark-optimal \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-10-65536 --master-boot-disk-size 128GB \
--worker-machine-type n1-highmem-16 --worker-boot-disk-size 128GB --num-workers 4 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
The --packages option of a standalone cluster should be applied through the --properties option when creating a Dataproc cluster in GCP.
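The same property can also be set from Python when the session itself launches the JVM, e.g. in a local or standalone run; a minimal sketch:
from pyspark.sql import SparkSession

# Equivalent of spark-submit --packages, expressed as a Spark property.
# On Dataproc, the gcloud --properties flag sets this cluster-wide.
spark = SparkSession.builder\
    .appName("graphframesDemo")\
    .config("spark.jars.packages",
            "graphframes:graphframes:0.5.0-spark2.1-s_2.11")\
    .getOrCreate()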
INFO: all options for the command gcloud dataproc clusters create
INFO: release notes for all previous images
[Annoying Episode (1)]: I found that Spark 2.2 had a bug that prevented passing functools.partial to a UDF. So something like udf(partial(preudf, v1=bcastv1, v2=bcastv2)) raised an error in 2.2. Spark 2.3.x has resolved this issue. Arrrgghhh...
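For context, the failing pattern looked roughly like this (the names preudf, bcastv1, and bcastv2 are illustrative); on Spark 2.3.x and later it runs fine:
from functools import partial
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical broadcast variables bound into the UDF
bcastv1 = sc.broadcast(2.0)
bcastv2 = sc.broadcast(3.0)

def preudf(x, v1=None, v2=None):
    return x * v1.value + v2.value

# Binding the broadcasts with functools.partial, then wrapping as a UDF.
# This raised an error on Spark 2.2 but works on 2.3.x and later.
myudf = udf(partial(preudf, v1=bcastv1, v2=bcastv2), DoubleType())

spark.range(5).withColumn("y", myudf("id")).show()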
INFO: Apache Arrow for faster conversions between pandas/dataframe and spark/dataframe
Each Linux instance starts with very clean, empty settings. The options below will initialize instances with basic packages in a conda environment.
--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
Submit jobs like this:
gcloud dataproc jobs submit pyspark --cluster spark-optimal --region asia-northeast1 gcloud-run-vanilla-fullscale.py
INFO: options for submitting pyspark jobs
The total number of vertices is 300 million. The edges number roughly a billion.
create-dataproc-cluster.sh
gcloud dataproc clusters create spark-large \
--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type n1-highmem-16 --master-boot-disk-size 128GB \
--worker-machine-type n1-highmem-32 --worker-boot-disk-size 128GB --num-workers 5 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
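With graphframes loaded through --properties, a submitted script can then build the graph; here is a minimal sketch with toy data (the real vertex/edge DataFrames came from my own datasets):
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("gfDemo").getOrCreate()

# Toy vertex/edge DataFrames; the real ones hold ~300M vertices, ~1B edges.
v = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
e = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(v, e)
g.degrees.show()

# connectedComponents requires a checkpoint directory on graphframes 0.3+.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")
g.connectedComponents().show()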
Apache Arrow is now a built-in package in Spark 2.3.x. Enabling Arrow in the Spark Session:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("largeScaleGstat")\
.config("spark.driver.maxResultSize","8g")\
.config("spark.sql.execution.arrow.enabled","true")\
.config("spark.executor.memoryOverhead","42GB")\
.getOrCreate()
This will reduce the running time for ".toPandas()".
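Continuing from the session above, a quick sketch of the conversion that benefits from Arrow:
# With spark.sql.execution.arrow.enabled=true, this conversion is
# vectorized via Arrow instead of serializing row by row.
sdf = spark.range(1000000)
pdf = sdf.toPandas()
print(type(pdf), len(pdf))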
Another annoying fact: Spark has a notorious 2 GB issue related to ByteBuffer size. This problem is quite ubiquitous. Here is a thread about it: Various 2GB limits in Spark.
This 2 GB limit is gone as of the Spark 2.4.0 release. They work really hard. Now Spark better supports vectorized UDFs and Kubernetes, plus a new journey toward better support for Deep Learning. Yay! Though Spark is a good Big Data framework, it lacks Deep Learning/A.I. features. Hopefully, keras-tensorflow-gpu will work flawlessly with the spark-dataframe eventually.
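For example, a vectorized (pandas) UDF, introduced in Spark 2.3 and improved in 2.4, looks like this (a minimal sketch):
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double", PandasUDFType.SCALAR)
def times_two(s):
    # s is a whole pandas Series batch transferred via Arrow,
    # instead of one Python call per row as in a plain UDF.
    return s * 2.0

spark.range(5).select(times_two("id")).show()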
I guess many things have changed. I will write down all new follow-ups below.
- To install gsutil, follow this website
- Run gcloud init to set up your gcp-sdk CLI
- APIs & Services > Library
- Then, search for dataproc and enable it.