
[3] Dataproc in Google Cloud


============= Written in 2019 ==============

Let's try out Google Cloud Platform (GCP)

Why do I need Google Cloud?

My slave nodes have quite typical hardware (HW) specifications: 16 vCores with 64 GB of memory. Most mid-size data calculations run well on my 4-node Spark/Hadoop cluster, along with free interactive scratch work using Jupyter Notebook.


But I am quite sure that I will need various cluster configurations that are not available on my custom Spark cluster. I have found that GCP provides a good suite of hardware (even Gargantua-memory machines):
Google Cloud Node HW Specifications

Basic Steps Using Dataproc in Google Cloud Platform

Here are the very basic steps for total beginners in GCP.

  • Make a GCP account

Yeah, a free $300 balance in your billing account!

  • Create a "project" and link your billing to this new project

  • Configure gcloud: $ gcloud auth login, $ gcloud config set project VALUE, and $ gcloud config set account ACCOUNT

  • Create a "cluster" instance in the dataproc option.

You need to check the "version" of the "image" to select a compatible PySpark (at least if there are library version issues). If you would like to use a Jupyter notebook on GCP, there is an option to launch the cluster with the notebook. Here is the link for the Dataproc image list.
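To confirm which Spark and Python versions an image actually provides, a quick runtime check like the sketch below can help (the app name is arbitrary):

    # Quick sanity check of the runtime versions a Dataproc image provides.
    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("versionCheck").getOrCreate()
    # Image 1.3 ships Spark 2.3.x; image 1.4 ships Spark 2.4.x.
    print("Spark " + spark.version + ", Python " + sys.version.split()[0])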

  • There are two (three?) ways to use your new cluster.

[1] SSH to the master node and run the PySpark code on it

[2] Use command-line SDKs such as gcloud and gsutil from your local terminal. I prefer this. Do not install these via conda due to root privilege issues; install them following this link.

  • Here is a basic example to run "the" PI program:

gcloud dataproc jobs submit pyspark --cluster spark-mini --region asia-northeast1 ./pi.py
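For reference, a minimal pi.py in the spirit of the classic example might look like this (a sketch, not necessarily the exact file submitted above):

    # pi.py -- Monte Carlo estimate of pi (a minimal sketch of the classic example).
    import random
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    n = 1000000

    def inside(_):
        # Count a hit when a random point in the unit square falls inside the quarter circle.
        x, y = random.random(), random.random()
        return 1 if x * x + y * y < 1.0 else 0

    count = spark.sparkContext.parallelize(range(n), 10).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()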

  • To enable the Jupyter notebook, please check this link

It seems the initialization actions have changed since the release of image 1.3. Here is the new create script:

gcloud beta dataproc clusters create spark-mini \
--optional-components=ANACONDA,JUPYTER \
--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow matplotlib seaborn",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/python/conda-install.sh \
--enable-component-gateway \
--bucket shongdata \
--project pyspark-multiverse \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 32GB \
--worker-machine-type n1-highmem-8 --worker-boot-disk-size 32GB --num-workers 2 \
--image-version 1.4 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.7.0-spark2.4-s_2.11

My Real Walk-through for Pyspark in GCP

1. Creating an instance

I need a large dataproc cluster with spark/graphframes.

The default quota is too small: 24 cores with 256 GB (?!)

I had to submit a request to increase my quota to 192 cores with 2048 GB. It took only a day to get the quota raised. So... they work hard!

1.1 a mini cluster for testing purposes

gcloud dataproc clusters create spark-mini \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-8 --worker-boot-disk-size 64GB --num-workers 2 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
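Since the cluster is created with the graphframes package (via spark.jars.packages), a tiny smoke test on spark-mini can confirm that it loads; the toy vertices and edges below are made up:

    # Smoke test: confirm the graphframes package declared via spark.jars.packages loads.
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("gfSmokeTest").getOrCreate()
    v = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
    e = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])
    g = GraphFrame(v, e)
    g.inDegrees.show()   # expect in-degree 1 for both "b" and "c"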

1.2 a mid-size cluster for mid-size calculations

gcloud dataproc clusters create spark-mid \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-6-30720 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-32 --worker-boot-disk-size 64GB --num-workers 3 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11

1.3 a large-size cluster for handling full-scale multiverse data

gcloud dataproc clusters create spark-large \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-10-65536 --master-boot-disk-size 64GB \
--worker-machine-type n1-highmem-32 --worker-boot-disk-size 128GB --num-workers 6 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11

1.4 an optimal cluster for handling full-scale multiverse data

gcloud dataproc clusters create spark-optimal \
--metadata 'CONDA_PACKAGES="scipy numpy pandas",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a \
--master-machine-type custom-10-65536 --master-boot-disk-size 128GB \
--worker-machine-type n1-highmem-16 --worker-boot-disk-size 128GB --num-workers 4 \
--image-version 1.3 \
--properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11

The --packages option used with spark-submit on a standalone cluster should instead be passed through the --properties option (as spark:spark.jars.packages=...) when creating a Dataproc cluster in GCP.

INFO: all options for the command gcloud dataproc clusters create

INFO: release notes for all previous images

[Annoying Episode (1)]: I found that Spark 2.2 had a bug where a functools.partial could not be passed to a UDF. So something like udf(partial(preudf, v1=bcastv1, v2=bcastv2)) causes an ERROR in 2.2. Spark 2.3.x has resolved this issue. Arrrgghhh...
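For context, the pattern in question looks roughly like this (a sketch; preudf, bcastv1, and bcastv2 stand in for the real function and broadcast variables):

    # Sketch of the udf(partial(...)) pattern that failed on Spark 2.2 but works on 2.3+.
    from functools import partial
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("partialUdf").getOrCreate()
    bcastv1 = spark.sparkContext.broadcast(2.0)
    bcastv2 = spark.sparkContext.broadcast(0.5)

    def preudf(x, v1=None, v2=None):
        # Toy computation using the broadcast values.
        return x * v1.value + v2.value

    myudf = udf(partial(preudf, v1=bcastv1, v2=bcastv2), DoubleType())
    df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
    df.withColumn("y", myudf("x")).show()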

INFO: Apache Arrow for faster conversions between pandas/dataframe and spark/dataframe

2. Installing basic software

Each Linux instance starts out essentially bare. The options below initialize the instances with basic packages in a conda environment.

--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow",MINICONDA_VARIANT=2' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,\
gs://dataproc-initialization-actions/conda/install-conda-env.sh \
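To verify that the requested conda packages actually ended up on the workers, a tiny job can be submitted (a sketch; check_env.py is a hypothetical file name):

    # check_env.py -- hypothetical smoke test for the packages installed via CONDA_PACKAGES.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkCondaEnv").getOrCreate()

    def versions(_):
        # Import on an executor and report the versions found there.
        import scipy, numpy, pandas
        yield (scipy.__version__, numpy.__version__, pandas.__version__)

    print(spark.sparkContext.parallelize([0], 1).mapPartitions(versions).collect())
    spark.stop()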

3. Running PySpark code

Submit jobs like this:

gcloud dataproc jobs submit pyspark --cluster spark-optimal --region asia-northeast1 gcloud-run-vanilla-fullscale.py

INFO: options for submitting pyspark jobs
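The job script itself can read and write Cloud Storage directly, since Dataproc clusters come with the Cloud Storage connector; a minimal skeleton might look like this (the gs:// paths and column name are placeholders):

    # Skeleton of a submitted job: read from GCS, compute, write back to GCS.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcsJobSkeleton").getOrCreate()

    # Placeholder paths -- replace with real bucket/object names.
    df = spark.read.parquet("gs://YOUR_BUCKET/input/")
    result = df.groupBy("some_column").count()
    result.write.mode("overwrite").parquet("gs://YOUR_BUCKET/output/")
    spark.stop()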

Graph Analyses of Horizon Run 4 data

1. Estimating the cluster size

The total number of vertices is 300 million. The edges number roughly a billion.

1.1 Create a dataproc cluster

create-dataproc-cluster.sh

gcloud dataproc clusters create spark-large \
--metadata 'CONDA_PACKAGES="scipy numpy pandas pyarrow",MINICONDA_VARIANT=2' \
--initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh \
--region asia-northeast1 --zone asia-northeast1-a --master-machine-type n1-highmem-16 --master-boot-disk-size 128GB --worker-machine-type n1-highmem-32 --worker-boot-disk-size 128GB --num-workers 5 \
--image-version 1.3 --properties spark:spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
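With the cluster up, the graph is built from the vertex and edge DataFrames; a rough sketch of the setup (the gs:// paths and column names are placeholders, not the actual Horizon Run 4 files):

    # Sketch: build a GraphFrame from vertex/edge tables and run connected components.
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("hr4Graph").getOrCreate()
    # connectedComponents() requires a checkpoint directory.
    spark.sparkContext.setCheckpointDir("gs://YOUR_BUCKET/checkpoints/")

    # Placeholder inputs: vertices need an "id" column, edges need "src" and "dst".
    vertices = spark.read.parquet("gs://YOUR_BUCKET/hr4/vertices/")
    edges = spark.read.parquet("gs://YOUR_BUCKET/hr4/edges/")

    g = GraphFrame(vertices, edges)
    cc = g.connectedComponents()
    cc.groupBy("component").count().orderBy("count", ascending=False).show(10)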

Apache Arrow support is now built into Spark 2.3.x. Enable Arrow in the SparkSession:

    spark = SparkSession.builder.appName("largeScaleGstat")\
    .config("spark.driver.maxResultSize","8g")\
    .config("spark.sql.execution.arrow.enabled","true")\
    .config("spark.executor.memoryOverhead","42GB")\
    .getOrCreate()

This will reduce the running time for ".toPandas()".

1.2 Getting the result

Another annoying fact:

Spark has a notorious 2 GB issue related to the ByteBuffer size limit. The problem is quite ubiquitous. Here is a thread about it: Various 2GB limits in Spark.

This 2 GB limit is gone as of the Spark 2.4.0 release. They work really hard. Spark now has better support for vectorized UDFs and Kubernetes, plus a new push toward better deep learning support. Yay! Though Spark is a good big-data framework, it still lacks deep learning features. Hopefully, Keras/TensorFlow on GPUs will eventually work flawlessly with Spark DataFrames.
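For reference, a vectorized (pandas) UDF in the Spark 2.3/2.4 style looks roughly like this; the doubling function is just a toy example:

    # Minimal vectorized (pandas) UDF; it operates on pandas Series batches via Arrow.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder \
        .appName("vectorizedUdf") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .getOrCreate()

    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v):
        return v * 2.0   # v is a pandas Series, so this is a vectorized operation

    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
    df.withColumn("y", times_two("x")).show()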

============= UPDATE: July 2022 ==============

Anything New in GCP since 2019?

I guess many things have changed. I will write down all the new follow-ups below.

1. Install gsutil and run the basic setup commands

  • To install gsutil, follow the instructions on this website
  • Run gcloud init to set up your gcloud CLI (Cloud SDK)

2. Enable API services

  • APIs & Services > Library
  • Then, search for "dataproc" and enable it.

3. Create a Spark Cluster via Dataproc

  • Available HW specs: All, M1, E2
  • Dashboard > Dataproc > Cluster > Create Cluster
dataproc-create-cli