- HiveToBigQuery (blogpost link)
- MsSqlToBigQuery (blogpost link)
- MySQLToSpanner (blogpost link)
- OracleToBigQuery
- OracleToPostgres (blogpost link)
- OracleToSpanner (blogpost link)
- SQLServerToPostgres
Notebooks in this folder demonstrate how to run Dataproc Templates from Jupyter Notebooks using Vertex AI.
Serverless Spark also supports serverless interactive development through Dataproc Sessions in Jupyter notebooks, natively integrated with Vertex AI Workbench.
Additionally, a data scientist can automate a Dataproc Template execution with Vertex AI Pipelines and Serverless Spark Kubeflow components.
The best way to get started is to clone the Dataproc Templates repository to your Jupyter environment in Vertex AI, and run the notebook.
- Enable the Compute Engine API, Dataproc API, Vertex AI API, and Vertex Notebooks API in your GCP project.
- Create a User-Managed Notebook in Vertex AI Workbench.
  In this example, the User-Managed notebook is created using the Compute Engine default service account.
- Open the created notebook, clone the Dataproc Templates GitHub repository, and run the desired notebook located in the /notebooks folder.
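The API-enabling step above can also be scripted rather than done in the console. The following is a minimal sketch that builds a `gcloud services enable` command from Python. The service names are our assumption of how the four listed APIs map to `*.googleapis.com` identifiers, and the helper names `build_enable_command` and `enable_apis` are hypothetical, not part of this repository.

```python
import subprocess

# Assumed service identifiers for the four APIs named in the setup steps.
REQUIRED_SERVICES = [
    "compute.googleapis.com",     # Compute Engine API
    "dataproc.googleapis.com",    # Dataproc API
    "aiplatform.googleapis.com",  # Vertex AI API
    "notebooks.googleapis.com",   # Notebooks API
]


def build_enable_command(project_id, services=REQUIRED_SERVICES):
    """Return the gcloud argv that enables the given services in a project."""
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


def enable_apis(project_id):
    """Run the command; needs an installed and authenticated gcloud CLI."""
    subprocess.run(build_enable_command(project_id), check=True)
```

Calling `enable_apis("<project>")` would then enable all four services in one step. Building the argument list separately makes it possible to inspect the command before running it.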
As an alternative to running a notebook manually, we developed a "parameterize" script, built on the papermill library, that runs notebooks programmatically from a Python script, with parameters.
The parameters specific to each notebook type are listed in that notebook type's README.
It is currently available for the following notebooks:
USAGE:
export GCP_PROJECT=<project>
export REGION=<region>
export GCS_STAGING_LOCATION=<gs://bucket-name>
export SUBNET=<subnet>
python run_notebook.py --script=<NOTEBOOK_NAME> \
--log_level=<LOG_LEVEL> \
--notebook.parameter1="<>" \
--notebook.parameter2="<>"
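To illustrate how a papermill-based wrapper of this kind can work, here is a rough sketch. It is not the repository's `run_notebook.py`. The helper names (`missing_env`, `notebook_parameters`, `run_notebook`) are hypothetical, and we assume that arguments of the form `--notebook.<name>=<value>` are forwarded to the notebook as papermill parameters and that the four exported variables are required.

```python
import os

# Environment variables exported in the usage example (assumed mandatory).
REQUIRED_ENV = ("GCP_PROJECT", "REGION", "GCS_STAGING_LOCATION", "SUBNET")
PREFIX = "--notebook."


def missing_env(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not env.get(name)]


def notebook_parameters(argv):
    """Collect --notebook.<name>=<value> arguments into a dict."""
    params = {}
    for arg in argv:
        if arg.startswith(PREFIX) and "=" in arg:
            # Split on the first "=" only, so values may contain "=".
            name, value = arg[len(PREFIX):].split("=", 1)
            params[name] = value
    return params


def run_notebook(input_path, output_path, argv):
    """Execute a notebook with papermill, injecting the parsed parameters."""
    missing = missing_env()
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    import papermill as pm  # lazy import: the helpers above work without it
    pm.execute_notebook(input_path, output_path,
                        parameters=notebook_parameters(argv))
```

papermill injects the `parameters` dict into the notebook cell tagged `parameters`, so the notebook itself needs no changes to be run this way.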
If you open a notebook directly in a Colab environment, only the notebook file itself is pulled in. Our notebooks require certain packages that ship with the repository, such as util. Follow the steps below to make each package required by the imported notebook available.
!git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
!mv /content/dataproc-templates/notebooks/util /content/
!mv /content/dataproc-templates/java/ /content/
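After running the commands above, a quick check can confirm that the moved directories are where the notebook expects them. This is a small sketch with hypothetical helper names. It assumes that the notebook imports from `util` relative to `/content` and that `util` and `java` are the only directories needed.

```python
import sys
from pathlib import Path


def missing_dirs(root="/content", names=("util", "java")):
    """Return the expected directories that are not present under root."""
    return [name for name in names if not (Path(root) / name).is_dir()]


def ensure_importable(root="/content"):
    """Put root first on sys.path so packages under it can be imported."""
    if root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
```

In a Colab cell, `missing_dirs()` returning an empty list suggests that the clone and both `mv` steps succeeded. `ensure_importable()` is only needed if `/content` is not already on the import path.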