
Notebook-scoped libraries feature not enabled in dbt-databricks #1320

@fedemgp

Description


Describe the bug

We are trying to use All-Purpose clusters for dbt Python models. The PythonCommandSubmitter class is currently agnostic of packages defined in the model config.

We were able to work around this by using the upload_notebook flag (internally handled by PythonNotebookUploader), which installs the packages at the cluster's system scope.

This represents a significant problem for us: as the project grows, different Python models will require different packages or versions, eventually leading to dependency collisions. While this is less of an issue in production (where we use ephemeral Job Clusters), it severely hinders the Development Experience (DX). We want to use shared All-Purpose clusters in development to speed up iteration without package conflicts.

Databricks provides Notebook-Scoped libraries specifically to avoid this. Since dbt can already compile the code into a Databricks Notebook, the submitters should be updated to leverage this feature.
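For illustration, the desired behavior could be sketched as a small helper that prepends a notebook-scoped install cell to the compiled model source. The helper name and the exact cell format below are assumptions for the sketch, not existing dbt-databricks code:

```python
def prepend_pip_install_cell(compiled_code: str, packages: list[str]) -> str:
    """Prepend a %pip install cell so packages are notebook-scoped
    rather than installed at the cluster's system scope."""
    if not packages:
        return compiled_code
    # In Databricks notebook source format, magic commands are written
    # as "# MAGIC" lines and cells are separated by "# COMMAND ----------".
    install_cell = "# MAGIC %pip install " + " ".join(packages)
    return install_cell + "\n\n# COMMAND ----------\n\n" + compiled_code

notebook_src = prepend_pip_install_cell(
    "df = spark.table('my_table')", ["rapidfuzz==3.13.0"]
)
```

Because %pip runs inside the notebook session, the installed packages are discarded when the session ends, which is exactly the isolation property described above.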

Steps To Reproduce

You can reproduce the issue by running this simple model:

import pandas as pd
from rapidfuzz import fuzz

def get_fuzz_ratio(a, b):
    return fuzz.ratio(a, b)

def model(dbt, session):
    """
    Simple test model to verify rapidfuzz works on the cluster.
    Should complete in under 5 seconds.
    """
    # Configuration - using same cluster and package
    dbt.config(
        materialized="table",
        cluster_id=<an_all_purpose_cluster_id>,
        packages=['rapidfuzz==3.13.0'],
        submission_method='all_purpose_cluster',
        create_notebook=True,
        timeout=600,
    )

    # Create a tiny test dataset (just 3 rows)
    test_data = pd.DataFrame({
        'id': [1, 2, 3],
        'text1': ['hello world', 'databricks test', 'rapidfuzz check'],
        'text2': ['hello world!', 'databrick test', 'rapid fuzz check']
    })

    # Compute simple fuzzy scores
    test_data['similarity_score'] = [
        get_fuzz_ratio(a, b)
        for a, b in zip(test_data['text1'], test_data['text2'])
    ]

    return test_data
  1. Set create_notebook to False using a "vanilla" cluster: the model fails because the package is not found.

  2. Set create_notebook to True: the model works, but the package is installed at the cluster system level (affecting all other users and models).

Expected behavior

The compiled code should include a %pip install <package> command at the beginning of the notebook/script.

Ideally, we could introduce a flag like use_notebook_scoped_libraries: true. This would ensure that:

  1. Packages are isolated to the specific dbt run.

  2. The solution works across all compute types (Job Clusters, Serverless, and All-Purpose) since %pip is the standard for modern Databricks runtimes.

  3. The solution works regardless of the data security access mode (shared or single user).
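A rough sketch of how the proposed flag might be consumed during compilation. Both the use_notebook_scoped_libraries flag and the cell-building logic here are hypothetical, offered only to make the expected behavior concrete:

```python
# Hypothetical model config; use_notebook_scoped_libraries is the
# proposed flag and does not exist in dbt-databricks today.
config = {
    "materialized": "table",
    "submission_method": "all_purpose_cluster",
    "create_notebook": True,
    "packages": ["rapidfuzz==3.13.0"],
    "use_notebook_scoped_libraries": True,
}

# When the flag is set, the compiled notebook would start with a
# %pip cell instead of installing libraries on the cluster itself.
if config.get("use_notebook_scoped_libraries"):
    first_cell = "%pip install " + " ".join(config["packages"])
else:
    first_cell = ""
```

Because %pip is interpreted by the notebook runtime itself, the same compiled output would work on Job Clusters, Serverless, and All-Purpose compute without per-cluster library configuration.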

Screenshots and log output

Running the model with create_notebook=True works, but the dependency persists in the cluster environment even after the job ends.


System information

The output of dbt --version:

(dbt-data-model) ➜  dbt_databricks git:(test/python-model) ✗ uv run dbt --version
Core:
  - installed: 1.11.2
  - latest:    1.11.2 - Up to date!

Plugins:
  - databricks: 1.11.4 - Up to date!
  - spark:      1.10.0 - Up to date!

The operating system you're using: MacOS Tahoe Version 26.2

The output of python --version: Python 3.11.14

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working)
