Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for UC Volumes to DABs #1762

Open
wants to merge 54 commits into
base: main
Choose a base branch
from
Open

Conversation

shreyas-goenka
Copy link
Contributor

@shreyas-goenka shreyas-goenka commented Sep 9, 2024

Changes

This PR adds support for UC volumes to DABs.

Can I use a UC Volume managed by DABs in artifact_path?

Yes, but we require the Volume to exist before being referenced in artifact_path. Otherwise you'll see an error that the Volume does not exist. For this case, this PR also adds a warning if we detect that the UC Volume is defined in the DAB itself, which informs the user to deploy the UC Volume in a separate deployment first before using it in artifact_path.

We cannot create the UC Volume and then upload the artifacts to it in the same bundle deploy because bundle deploy always uploads the artifacts to artifact_path before materializing any resources defined in the bundle. Supporting this in a single deployment requires us to migrate away from our dependency on the Databricks Terraform provider to manage the CRUD lifecycle of DABs resources.

Why do we not support preset.name_prefix for UC Volumes?

UV volumes will not have a dev_shreyas_goenka prefix added in mode: development. Configuring presets.name_prefix will be a no-op for UC Volumes. We have decided not to support prefixing for UC resources. This is because:

  1. UC provides its own namespace hierarchy that is independent of DABs.
  2. Users can always manually use ${workspace.current_user.short_name} to configure the prefixes manually.

Customers often manually set up a UC hierarchy for dev and prod, including a schema or catalog per developer. Thus, it's often unnecessary for us to add prefixing in mode: development by default for UC resources.

In retrospect, supporting prefixing for UC schemas and registered_models was a mistake and will be removed in a future release of DABs.

Tests

Unit, integration test and manually.

Manual Testing cases:

  1. UC volume does not exist:
➜  bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/my_volume that is configured in the artifact_path: Not Found
  1. UC Volume does not exist, but is defined in the DAB
➜  bundle-playground git:(master) ✗ cli bundle deploy
Error: failed to fetch metadata for the UC volume /Volumes/main/caps/managed_by_dab that is configured in the artifact_path: Not Found

Warning: You might be using a UC volume in your artifact_path that is managed by this bundle but which has not been deployed yet. Please deploy the UC volume in a separate bundle deploy before using it in the artifact_path.
  at resources.volumes.bar
  in databricks.yml:24:7

@shreyas-goenka shreyas-goenka marked this pull request as ready for review September 16, 2024 02:10

// TODO: Are there fields in the edit API or terraform that are not in this struct?
// If so call it out in the PR.
*catalog.CreateVolumeRequestContent
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notably OWNER is missing from the schema here. Terraform allows managing the owner, but it's not available in the create API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do owner assignment elsewhere, can you clarify the significance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real significance for DABs. I just wanted to call this out since it's a divergence from Terraform. In fact, this is nice because it clearly guarantees that the owner of the UC volume is the deploying user. We do not allow modifying the owner of any of our UC resources.


// We don't need to display any prompts in this case.
if len(dltActions) == 0 && len(schemaActions) == 0 {
if len(schemaActions) == 0 && len(dltActions) == 0 && len(volumeActions) == 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces yet again interactive flow to DABs. At this point, I think it's worth it to invest some time in adding proper end to end tests for this. I'd like to pick it up in the next ~month.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pick this next quarter as in scope for terminal UX improvements.

@shreyas-goenka
Copy link
Contributor Author

Nightlies passed once. Triggered another round after some minor updates.

Copy link

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 1762
  • Commit SHA: 039057fdd7af97e3307b4fe74f4bf01ebc1b3519

Checks will be approved automatically on success.

@eng-dev-ecosystem-bot
Copy link
Collaborator

Test Details: go/deco-tests/11897511217

@shreyas-goenka shreyas-goenka changed the title Add support for UC volumes in DABs Add support for UC Volumes to DABs Nov 19, 2024
}

// set location of volume definition in config.
bundletest.SetLocation(b, "resources.volumes.foo", []dyn.Location{{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We moved the bundletest package outside the internal because of it's use here. Otherwise go's import system by design would not let me import bundletest

@stefan-novak-brt
Copy link

Hello! Thanks for adding this new feature -- I look forward to adopting it for many future projects.

One use-case I have: create a DAB-managed UC Volume for storing checkpoints directories for various Spark structured streams. I expect to to have a bunch of code that looks like this:

.option('checkpointLocation', os.path.join('/', 'Volumes', catalog, schema, 'checkpoints', 'my_structured_stream_name'))

It would be nice if I could have something like this:

.option('checkpointLocation', os.path.join(dbutils.wigest.get('checkpoints_volume'), 'my_structured_stream_name'))

The checkpoints_volume parameter could be passed in via the DAB job configuration:

        parameters:
            - name: checkpoints_volume
              default: ${resources.volumes.checkpoints.mount_location} # or similar

Is it possible to include a property like resources.volumes.checkpoints.mount_location to the DAB configuration so that we don't have to reconstruct the volume mount path with other partition parameters (e.g, catalog and schema)?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants