From c07c56bdcef4cc43c3dc4852eb63514f1dc214f0 Mon Sep 17 00:00:00 2001
From: Louis Auneau
Date: Mon, 12 Aug 2024 16:27:29 -0400
Subject: [PATCH 1/4] Update python-models.md with Dataproc Serverless custom image usage.

Add description on how to setup dataproc serverless with a custom image in order to use third-party packages.
---
 website/docs/docs/build/python-models.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index 811379a0d2c..70faf4967f6 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -763,13 +763,17 @@ storage.objects.create
 storage.objects.delete
 ```

-**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
+**Installing packages:**

-Google recommends installing Python packages on Dataproc clusters via initialization actions:
-- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
-- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
-
-You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
+Depending on if you use Dataproc cluster or serverless, third-party packages installation is done differently.
+- When running in a **cluster**:
+  Google recommends installing Python packages while creating the cluster via initialization actions:
+  - [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
+  - [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
+
+  You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
+- When running **serverless**:
+  Google recommends using a [custom docker image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers) to install thrid-party packages. The image needs to be hosted in [Google Artifact Registry](https://cloud.google.com/artifact-registry/docs). It can then be used by providing the image path (using format `{hostname}/{project-id}/{image}:{tag}`) in dbt profiles, under the target key `dataproc_batch.runtime_config.container_image`.

From 3d77d585341cc5429199ebe15b7e17fd741b97af Mon Sep 17 00:00:00 2001
From: LouisAuneau
Date: Mon, 12 Aug 2024 17:00:48 -0400
Subject: [PATCH 2/4] improve dataproc severless package installation doc

---
 website/docs/docs/build/python-models.md | 37 ++++++++++++++++++------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index 70faf4967f6..e9d5a2743ac 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -765,15 +765,34 @@ storage.objects.delete

 **Installing packages:**

-Depending on if you use Dataproc cluster or serverless, third-party packages installation is done differently.
-- When running in a **cluster**:
-  Google recommends installing Python packages while creating the cluster via initialization actions:
-  - [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
-  - [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
-
-  You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
-- When running **serverless**:
-  Google recommends using a [custom docker image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers) to install thrid-party packages. The image needs to be hosted in [Google Artifact Registry](https://cloud.google.com/artifact-registry/docs). It can then be used by providing the image path (using format `{hostname}/{project-id}/{image}:{tag}`) in dbt profiles, under the target key `dataproc_batch.runtime_config.container_image`.
+Depending on if you use Dataproc cluster or serverless, third-party packages installation is done differently.
+
+- **Dataproc Cluster** — Google recommends installing Python packages while creating the cluster via initialization actions:
+  - [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
+  - [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
+
+  You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
+
+- **Dataproc Serverless** — Google recommends using a [custom docker image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers) to install third-party packages. The image needs to be hosted in [Google Artifact Registry](https://cloud.google.com/artifact-registry/docs). It can then be used by providing the image path in dbt profiles:
+
+  ```yml
+  my-profile:
+    target: dev
+    outputs:
+      dev:
+        type: bigquery
+        method: oauth
+        project: abc-123
+        dataset: my_dataset
+
+        # for dbt Python models to be run on Dataproc Serverless
+        gcs_bucket: dbt-python
+        dataproc_region: us-central1
+        submission_method: serverless
+        dataproc_batch:
+          runtime_config:
+            container_image: {HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}
+  ```

From e2ed75ab2ad91d06a41166d45d530b83fcaa03c7 Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Wed, 14 Aug 2024 16:11:30 -0400
Subject: [PATCH 3/4] Update website/docs/docs/build/python-models.md

---
 website/docs/docs/build/python-models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index e9d5a2743ac..ec652dc0938 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -765,7 +765,7 @@ storage.objects.delete

 **Installing packages:**

-Depending on if you use Dataproc cluster or serverless, third-party packages installation is done differently.
+Installation of third-party packages on Dataproc varies depending on whether it's a [cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) or [serverless](https://cloud.google.com/dataproc-serverless/docs).

 - **Dataproc Cluster** — Google recommends installing Python packages while creating the cluster via initialization actions:
   - [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)

From d2778f978cfb7bb00929aeb1ceb4c29ec8fcf5fd Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Wed, 14 Aug 2024 16:29:01 -0400
Subject: [PATCH 4/4] Update website/docs/docs/build/python-models.md

---
 website/docs/docs/build/python-models.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index ec652dc0938..09ed7a1c881 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -794,6 +794,8 @@ Installation of third-party packages on Dataproc varies depending on whether it'
             container_image: {HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}
   ```

+
+
 **Docs:**
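
Reviewer note: the `container_image` value in the profile follows the `{HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}` shape. As a minimal sketch (not part of dbt, the BigQuery adapter, or this patch), the value could be assembled and sanity-checked before it is pasted into `profiles.yml`. The helper name `build_container_image`, the validation patterns, and the example values are all hypothetical. It also assumes the image portion may carry a repository prefix, as Artifact Registry paths typically do.

```python
import re

# Hypothetical helper, for illustration only: it is not provided by dbt or by
# any Google Cloud library. It assembles the image path in the
# {hostname}/{project-id}/{image}:{tag} format described in the patch.

# Lowercase alphanumerics separated by single '.', '_' or '-' characters.
_SEGMENT = re.compile(r"^[a-z0-9]+(?:[._-][a-z0-9]+)*$")
# Docker tag syntax: up to 128 characters, must not start with '.' or '-'.
_TAG = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def build_container_image(hostname: str, project_id: str, image: str, tag: str) -> str:
    """Return a value for dataproc_batch.runtime_config.container_image."""
    for name, value in (("hostname", hostname), ("project_id", project_id)):
        if not _SEGMENT.match(value):
            raise ValueError(f"invalid {name}: {value!r}")
    # Assumption: `image` may include a repository prefix ("my-repo/my-image"),
    # so each '/'-separated part is validated on its own.
    for part in image.split("/"):
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid image: {image!r}")
    if not _TAG.match(tag):
        raise ValueError(f"invalid tag: {tag!r}")
    return f"{hostname}/{project_id}/{image}:{tag}"


if __name__ == "__main__":
    # Placeholder values; substitute your own registry host, project, and image.
    print(build_container_image("us-central1-docker.pkg.dev", "abc-123", "dbt-runtime", "1.0.0"))
```

The check is deliberately strict about whitespace and uppercase characters in the path, since those are the mistakes most likely to slip into a hand-edited profile. It does not verify that the image actually exists in the registry.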