Dataproc Templates (Notebooks)

Getting Started

Notebooks in this folder demonstrate how to run Dataproc Templates from Jupyter Notebooks using Vertex AI.

Overview

Google recently extended Serverless Spark with serverless interactive development through Dataproc Sessions in Jupyter notebooks, natively integrated with Vertex AI Workbench.

Additionally, a data scientist can automate the execution of a Dataproc Template with Vertex AI Pipelines and the Serverless Spark Kubeflow components.

Deploying Dataproc Templates to Vertex AI

The best way to get started is to clone the Dataproc Templates repository to your Jupyter environment in Vertex AI, and run the notebook.

  1. Enable the Compute Engine API, Dataproc API, Vertex AI API, and Vertex Notebooks API in your GCP project.

  2. Create a User-Managed Notebook in Vertex AI Workbench.

    In this example, a User-Managed notebook is created using the Compute Engine default service account.

  3. Open the created notebook, clone the Dataproc Templates GitHub repository, and run the desired notebook located in the /notebooks folder.

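Step 1 can also be done from the command line. The sketch below only builds and prints a `gcloud services enable` command rather than running it. The service names are an assumption about which services correspond to the four APIs listed in step 1, and `my-project` is a placeholder, so check both against your project before running the printed command.

```python
import shlex

# Assumed mapping of the APIs named in step 1 to their service names.
REQUIRED_SERVICES = [
    "compute.googleapis.com",     # Compute Engine API
    "dataproc.googleapis.com",    # Dataproc API
    "aiplatform.googleapis.com",  # Vertex AI API
    "notebooks.googleapis.com",   # Notebooks API
]


def build_enable_command(project_id, services=REQUIRED_SERVICES):
    """Return the argv list for `gcloud services enable` on the given project."""
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


if __name__ == "__main__":
    # "my-project" is a placeholder; the command is printed, not executed.
    print(shlex.join(build_enable_command("my-project")))
```

Printing the command first makes it easy to review the service list before enabling anything in the project.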

Run notebooks programmatically

As an alternative to running a notebook manually, we developed a "parameterize" script, built on the papermill library, that runs notebooks programmatically from a Python script, with parameters.
The parameters specific to each notebook are listed in the README for that notebook type.

It is currently available for the following notebooks:

USAGE:

export GCP_PROJECT=<project>
export REGION=<region>
export GCS_STAGING_LOCATION=<gs://bucket-name>
export SUBNET=<subnet>

python run_notebook.py --script=<NOTEBOOK_NAME> \
   --log_level=<LOG_LEVEL> \
   --notebook.parameter1="<>" \
   --notebook.parameter2="<>"
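To illustrate the idea behind the usage above, here is a minimal sketch of how `--notebook.<name>=<value>` flags could be turned into papermill parameters. It is not the code of `run_notebook.py`: the helper names `missing_env_vars`, `parse_notebook_params`, and `run_with_papermill` are hypothetical, and treating all four exported variables as required is an assumption. The only papermill call used is `papermill.execute_notebook(input_path, output_path, parameters=...)`.

```python
import os

# Assumption: all four variables exported in the usage above are required.
REQUIRED_ENV_VARS = ("GCP_PROJECT", "REGION", "GCS_STAGING_LOCATION", "SUBNET")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


def parse_notebook_params(argv):
    """Collect `--notebook.<name>=<value>` flags into a {name: value} dict."""
    prefix = "--notebook."
    params = {}
    for arg in argv:
        if arg.startswith(prefix) and "=" in arg:
            # Split on the first "=" only, so values may contain "=".
            name, value = arg[len(prefix):].split("=", 1)
            params[name] = value
    return params


def run_with_papermill(input_path, output_path, argv):
    """Execute a notebook with papermill, injecting the parsed parameters."""
    import papermill as pm  # lazy import; requires `pip install papermill`

    missing = missing_env_vars()
    if missing:
        raise RuntimeError(f"Unset environment variables: {', '.join(missing)}")
    return pm.execute_notebook(
        input_path, output_path, parameters=parse_notebook_params(argv)
    )
```

In this sketch every parameter value reaches the notebook as a string, so a notebook expecting numbers or booleans would need to convert them itself.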

Deploying Notebook Directly to Colab

If you open a notebook directly in Colab, only the notebook file itself is pulled in. Our notebooks require certain packages that ship with the repository, such as util. Run the following commands to make the packages required by the imported notebook available.

!git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
!mv /content/dataproc-templates/notebooks/util /content/
!mv /content/dataproc-templates/java/ /content/
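After running the commands above, a quick check can confirm that the folders landed where the notebook expects them. This is a small sketch, assuming the Colab working directory is `/content` and that `util` and `java` are the only folders needed; `missing_dirs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_dirs(base="/content", names=("util", "java")):
    """Return the expected directory names that do not exist under `base`."""
    return [name for name in names if not (Path(base) / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("util and java are in place under /content")
```

If a notebook imports other folders from the repository, add their names to `names` and move them in the same way.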