Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] First-class catalog and schema setting #1979

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

lennartkats-db
Copy link
Contributor

@lennartkats-db lennartkats-db commented Dec 9, 2024

Changes

This is the latest, experimental way to add first-class 'catalog' and 'schema' notion.

The basic idea is that databricks.yml can say

targets:
  dev:
    ...
    presets:
      catalog: dev
      schema: ${workspace.current_user.short_name} # the user's name, e.g. lennart_kats

  prod:
    ...
    presets:
      catalog: prod
      schema: finance

which will then configure the default schema for all resources in the bundle (pipelines, jobs, model serving endpoints, etc.)

A caveat exists for notebooks, which need use parameters to configure the catalog and schema. While the catalog and schema parameter values are automatically passed to all job tasks, notebooks need to consume the parameter values. We check whether they do this, and otherwise show a recommendation:

Recommendation: Use the 'catalog' and 'schema' parameters provided via 'presets.catalog' and 'presets.schema' using

  dbutils.widgets.text('catalog')
  dbutils.widgets.text('schema')
  catalog = dbutils.widgets.get('catalog')
  schema = dbutils.widgets.get('schema')
  spark.sql(f'USE {catalog}.{schema}')

  in src/notebook.ipynb:1:1

Note that the code above also helps for interactive notebook development scenarios: users can use the parameter widgets to set the catalog and schema they use during development.

Similarly, for Python and Wheel tasks, users must add some extra code to process a catalog and schema parameter. For Python tasks we show a similar recommendation; for wheel tasks we can't directly check for this.

Tests

Tests based on reflection, making sure we have coverage for all current/future resources:

  • Each resource property that has a name like catalog, schema, parameters, etc. must have one test case or ignore rule
  • Each catalog/schema related property is tested, making sure they change to the default catalog/schema based on presets
  • Most tests are executed based on a declarative specification of the expected target state.

@lennartkats-db lennartkats-db changed the title [RFC] Presets catalog schema as params [RFC] First-class catalog and schema setting Dec 9, 2024
Copy link
Contributor

@fjakobs fjakobs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for having first-class support for catalog/schema. I don't have a strong opinion if it should be under presets or a global setting. I think both could work.

@eng-dev-ecosystem-bot
Copy link
Collaborator

Test Details: go/deco-tests/12253819228

@lennartkats-db lennartkats-db changed the title [RFC] First-class catalog and schema setting [WIP] First-class catalog and schema setting Dec 16, 2024
…esets-catalog-schema-as-params' by 67 commits. # (use "git push" to publish your

local commits) # # Changes to be committed: # modified:  dbt-sql/databricks_template_schema.json # modified:
default-python/databricks_template_schema.json # modified:  default-python/template/{{.project_name}}/databricks.yml.tmpl # modified:
default-python/template/{{.project_name}}/resources/{{.project_name}}.job.yml.tmpl # modified:
default-python/template/{{.project_name}}/resources/{{.project_name}}.pipeline.yml.tmpl # modified:
default-python/template/{{.project_name}}/scratch/exploration.ipynb.tmpl # modified:
default-python/template/{{.project_name}}/src/notebook.ipynb.tmpl # modified:
default-python/template/{{.project_name}}/src/{{.project_name}}/main.py.tmpl # modified:  default-sql/databricks_template_schema.json # # Untracked
files: # ../../../.cursorrules # ../../../bundle/config/resources/:tmp:tmp.py # ../../../delme.py # ../../../pr-cache-current-user-me #
../../../pr-cleanup-warnings.md # ../../../pr-contrib-templates.md # ../../../pr-cp-diag-ids-for-all.md # ../../../pr-cp-serverless-templates.md #
../../../pr-presets-catalog-schema-using-params.md # ../../../pr-update-sync-command-help.md #
Revert template changes for now
@lennartkats-db lennartkats-db changed the title [WIP] First-class catalog and schema setting First-class catalog and schema setting Dec 21, 2024
Copy link

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 1979
  • Commit SHA: 204d3b08d13b90b71f3236313412f72b5333908e

Checks will be approved automatically on success.

@lennartkats-db lennartkats-db marked this pull request as ready for review December 21, 2024 08:49
@@ -22,6 +22,8 @@ type applyPresets struct{}

// Apply all presets, e.g. the prefix presets that
// adds a prefix to all names of all resources.
//
// Note that the catalog/schema presets are applied in ApplyPresetsCatalogSchema.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apply_presets is becoming a bit too large. For the new addition, I moved everything to a separate file. Lots of boring but easy to read/test code there.

As a followup we should look into a further refactoring: either factoring the logic by preset feature or by resource. For some of the overarching tests it's certainly been better to do it by feature, but we could end up with some combination for the tests.

@pietern pietern requested a review from denik January 2, 2025 15:49
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend splitting out the GetLocalPath changes into a separate PR. This change should be purely additive and not refactor an unrelated piece of code.

As for the core of the change, I'm not yet comfortable with the magic of this change. I can imagine it being a positive in 90% of cases, but worried about the long tail where it isn't. E.g. where setting these fields to non-empty values has unintended side effects on the backend.

if pl.Schema == "" && pl.Target == "" {
// As of 2024-12, the Schema field isn't broadly supported yet in the pipelines API.
// Until it is, we set the Target field.
pl.Target = p.Schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a breaking change when swapped out?

if p.Catalog == "" && p.Schema == "" {
return diags
}
if (p.Schema == "" && p.Catalog != "") || (p.Catalog == "" && p.Schema != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify:

Suggested change
if (p.Schema == "" && p.Catalog != "") || (p.Catalog == "" && p.Schema != "") {
if !(p.Catalog != "" && p.Schema != "") {

// Until it is, we set the Target field.
pl.Target = p.Schema
}
if allSameCatalog && pl.Catalog == p.Catalog {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky because the code above changes pl.Catalog. Recommendations based on existing config should really be done outside the loop that mutates it.

// Check for existing catalog/schema parameters
hasCatalog := false
hasSchema := false
if job.Parameters != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superfluous; nil slice is equivalent to empty slice.


if !fileIncludesPattern(ctx, localPath, expected) {
diags = diags.Extend(diag.Diagnostics{{
Summary: "Use the 'catalog' and 'schema' parameters provided via 'presets.catalog' and 'presets.schema' using\n\n" + fix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary can be "Add 'catalog' and 'schema' parameters to task %q"

With the multi-line details and fix suggestion in the Detail section.

return true
}

matched, err := regexp.MatchString(expected, string(content))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match takes a []byte

if p == "" {
return "", "", fmt.Errorf("path cannot be empty")
}
if path.IsAbs(p) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comment from move:

	// We assume absolute paths point to a location in the workspace

if err != nil {
// Apparently this path is not a URL; this can happen for paths that
// have non-URL characters like './profit-%.csv'.
log.Warnf(ctx, "Failed to parse path as a URL '%s': %v", p, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more like a debug level message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing return.

// If the file path has scheme, it's a full path and we don't need to transform it
if url.Scheme != "" {
if localPath == "" {
// Skip absolute paths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know that it is absolute?

recordedPaths[path[:i]] = struct{}{}
path = path[i+1:]
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the intent of this, nor the unbounded loop.

@lennartkats-db lennartkats-db changed the title First-class catalog and schema setting [RFC] First-class catalog and schema setting Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants