Fix Monorepo Setup & Separate by Role #133

Merged 33 commits on Feb 6, 2024
Changes from 29 commits
ba02050
Update monorepo
arpitjasa-db Jan 25, 2024
45369ae
Fix docs
arpitjasa-db Jan 25, 2024
38c9896
Update CLI version
arpitjasa-db Jan 25, 2024
0e94fe8
minor fixes
arpitjasa-db Jan 25, 2024
49b0780
Update Feature Store
arpitjasa-db Jan 26, 2024
9d1cae3
Update ADO
arpitjasa-db Jan 26, 2024
9af651c
Apply comments
arpitjasa-db Feb 2, 2024
ecf542b
Update tests
arpitjasa-db Feb 2, 2024
5739ecb
Add CICD Zip to Pull Request
Feb 2, 2024
f9c8975
Update README
arpitjasa-db Feb 2, 2024
6076b7f
revert run-checks
arpitjasa-db Feb 3, 2024
983dab3
Pin pytest
arpitjasa-db Feb 3, 2024
27380ac
Fix tests
arpitjasa-db Feb 3, 2024
41bc995
Fix tests
arpitjasa-db Feb 3, 2024
56dbbdb
Add CICD Zip to Pull Request
Feb 3, 2024
c1ed5b2
Update zip generation
arpitjasa-db Feb 3, 2024
073d17b
test
arpitjasa-db Feb 3, 2024
72e8671
fix
arpitjasa-db Feb 3, 2024
613daa2
Fix more tests
arpitjasa-db Feb 3, 2024
85c2764
Fix more tests
arpitjasa-db Feb 3, 2024
857c0ab
Fix even more tests good thing we have them
arpitjasa-db Feb 3, 2024
15e627d
sigh
arpitjasa-db Feb 3, 2024
0f55f04
Could this be it
arpitjasa-db Feb 3, 2024
208943c
Sigh it wasnt
arpitjasa-db Feb 3, 2024
add1738
pls work
arpitjasa-db Feb 3, 2024
c0f2ab8
finally
arpitjasa-db Feb 3, 2024
20a8f43
fix generate zip workflow
arpitjasa-db Feb 3, 2024
18ddd4b
Update to glob pattern
arpitjasa-db Feb 3, 2024
4ccfac8
Fix ADO deploy CICD pipeline
arpitjasa-db Feb 5, 2024
25ae3e6
Apply deploy cicd nits
arpitjasa-db Feb 5, 2024
0df9ce5
Fix tests
arpitjasa-db Feb 6, 2024
82ede31
quotes
arpitjasa-db Feb 6, 2024
cdd39a4
more quotes i guess
arpitjasa-db Feb 6, 2024
36 changes: 36 additions & 0 deletions .github/workflows/generate-cicd-zip.yaml
@@ -0,0 +1,36 @@
+name: Generate CICD Zip
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - 'template/{{.input_root_dir}}/.github/workflows/{{.input_project_name}}**'
+      - 'template/{{.input_root_dir}}/.azure/devops-pipelines/{{.input_project_name}}**'
+
+defaults:
+  run:
+    working-directory: template/{{.input_root_dir}}
+
+
+jobs:
+  generate-zip:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          persist-credentials: false
+      - name: Generate CICD Zip
+        run: |
+          cp --parents .github/workflows/\{\{.input_project_name\}\}-* cicd/template
+          cp --parents .azure/devops-pipelines/\{\{.input_project_name\}\}-* cicd/template
+          tar -czvf cicd.tar.gz cicd
+      - name: Commit Zip Back to Branch
+        env:
+          GITHUB_TOKEN: ${{ secrets.ARPIT_TOKEN }}
+        run: |
+          git config --global user.name "GitHub Actions Bot"
+          git config --global user.email "[email protected]"
+          git config --global url.https://${{ secrets.ARPIT_TOKEN }}@github.com/.insteadOf https://github.com/
+          git add cicd.tar.gz
+          git commit -m "Generate CICD Zip File for ${{ github.sha }}"
+          git push
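For readers less familiar with `cp --parents`, the "Generate CICD Zip" step can be approximated in Python. This is only a sketch under stated assumptions: the function name `build_cicd_archive` and the constants are hypothetical, and the workflow itself uses the shell commands shown in the diff, not this code.

```python
import glob
import shutil
import tarfile
from pathlib import Path

# The template file names literally start with this Go-template placeholder.
PROJECT_PREFIX = "{{.input_project_name}}"
CICD_DIRS = (".github/workflows", ".azure/devops-pipelines")


def build_cicd_archive(root: Path) -> Path:
    """Copy matching CI/CD files under cicd/template, keeping their relative
    paths (the effect of `cp --parents`), then tar the cicd directory."""
    dest = root / "cicd" / "template"
    dest.mkdir(parents=True, exist_ok=True)
    for subdir in CICD_DIRS:
        # Escape the literal part so only the trailing "-*" acts as a wildcard.
        pattern = glob.escape(str(root / subdir / PROJECT_PREFIX)) + "-*"
        for src in sorted(glob.glob(pattern)):
            target = dest / Path(src).relative_to(root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
    archive = root / "cicd.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(root / "cicd"), arcname="cicd")
    return archive
```

Calling `build_cicd_archive(Path("template/{{.input_root_dir}}"))` from the repository root would correspond to the workflow's `working-directory` setting; files in the two CI/CD directories that do not start with the project-name placeholder are left out of the archive.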
8 changes: 8 additions & 0 deletions .github/workflows/run-checks.yaml
@@ -21,6 +21,14 @@ jobs:
         run: |
           python -m pip install --upgrade pip
           pip install -r dev-requirements.txt
+      - name: Generate CICD Zip
+        run: |
+          cd template/{{.input_root_dir}}
+          cp --parents .github/workflows/\{\{.input_project_name\}\}-* cicd/template
+          cp --parents .azure/devops-pipelines/\{\{.input_project_name\}\}-* cicd/template
+          tar -czvf cicd.tar.gz cicd
+          rm -rf cicd/template/.github cicd/template/.azure
+          cd ../..
       - name: Run tests with pytest
         run: |
           export GITHUB_TOKEN=${{ secrets.GITHUB_TOKEN }}
25 changes: 15 additions & 10 deletions README.md
@@ -53,9 +53,9 @@ https://github.com/databricks/mlops-stacks/assets/87999496/0d220d55-465e-4a69-bd

 ### Prerequisites
  - Python 3.8+
- - [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.211.0
+ - [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) >= v0.212.2

-[Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) v0.211.0 contains [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html) for the purpose of project creation.
+[Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) contains [Databricks asset bundle templates](https://docs.databricks.com/en/dev-tools/bundles/templates.html) for the purpose of project creation.

 Please follow [the instruction](https://docs.databricks.com/en/dev-tools/cli/databricks-cli-ref.html#install-the-cli) to install and set up databricks CLI. Releases of databricks CLI can be found in the [releases section](https://github.com/databricks/cli/releases) of databricks/cli repository.

@@ -68,16 +68,18 @@ To create a new project, run:

     databricks bundle init mlops-stacks

-This will prompt for parameters for project initialization. Some of these parameters are required to get started:
- * ``input_project_name``: name of the current project
- * ``input_root_dir``: name of the root directory. It is recommended to use the name of the current project as the root directory name, except in the case of a monorepo with other projects where the name of the monorepo should be used instead.
+This will prompt for parameters for initialization. Some of these parameters are required to get started:
+ * ``input_setup_cicd_and_project`` : Whether both CI/CD and the project should be set up, or only one of them.
+   * ``CICD_and_Project`` - Set up both CI/CD and the project; the default option.
+   * ``Project_Only`` - Set up the project only; easiest for Data Scientists to get started with.
+   * ``CICD_Only`` - Set up CI/CD only, likely for monorepo setups or for setting up CI/CD on an already initialized project.
+   We expect Data Scientists to specify ``Project_Only`` to get
+   started in a development capacity, and when ready to move the project to Staging/Production, CI/CD can be set up. We expect that step to be done by Machine Learning Engineers (MLEs) who can specify ``CICD_Only`` during initialization and use the provided workflow to set up CI/CD for one or more projects.
+ * ``input_root_dir``: name of the root directory. When initializing with ``CICD_and_Project``, this field will automatically be set to ``input_project_name``.
  * ``input_cloud``: Cloud provider you use with Databricks (AWS or Azure). Note that GCP is not supported at this time.
- * ``input_cicd_platform`` : CI/CD platform of choice (GitHub Actions or GitHub Actions for GitHub Enterprise Servers or Azure DevOps)

-Others must be correctly specified for CI/CD to work, and so can be left at their default values until you're
-ready to productionize a model. We recommend specifying any known parameters upfront (e.g. if you know
-``input_databricks_staging_workspace_host``, it's better to specify it upfront):
-
+Others must be correctly specified for CI/CD to work:
+ * ``input_cicd_platform`` : CI/CD platform of choice (GitHub Actions or GitHub Actions for GitHub Enterprise Servers or Azure DevOps)
  * ``input_databricks_staging_workspace_host``: URL of staging Databricks workspace, used to run CI tests on PRs and preview config changes before they're deployed to production.
    We encourage granting data scientists working on the current ML project non-admin (read) access to this workspace,
    to enable them to view and debug CI test results
@@ -86,6 +88,9 @@ ready to productionize a model. We recommend specifying any known parameters upf
  * ``input_default_branch``: Name of the default branch, where the prod and staging ML assets are deployed from and the latest ML code is staged.
  * ``input_release_branch``: Name of the release branch. The production jobs (model training, batch inference) defined in this
    repo pull ML code from this branch.
+
+Others are used for project initialization:
+ * ``input_project_name``: name of the current project
  * ``input_read_user_group``: User group name to give READ permissions to for project assets (ML jobs, integration test job runs, and machine learning assets). A group with this name must exist in both the staging and prod workspaces. Defaults to "users", which grants read permission to all users in the staging/prod workspaces. You can specify a custom group name e.g. to restrict read permissions to members of the team working on the current ML project.
  * ``input_include_models_in_unity_catalog``: If selected, models will be registered to [Unity Catalog](https://docs.databricks.com/en/mlflow/models-in-uc.html#models-in-unity-catalog). Models will be registered under a three-level namespace of `<catalog>.<schema_name>.<model_name>`, according to the target environment in which the model registration code is executed. Thus, if model registration code runs in the `prod` environment, the model will be registered to the `prod` catalog under the namespace `<prod>.<schema>.<model_name>`. This assumes that the respective catalogs exist in Unity Catalog (e.g. `dev`, `staging` and `prod` catalogs). Target environment names, and catalogs to be used are defined in the Databricks bundles files, and can be updated as needed.
  * ``input_schema_name``: If using [Models in Unity Catalog](https://docs.databricks.com/en/mlflow/models-in-uc.html#models-in-unity-catalog), specify the name of the schema under which the models should be registered; we recommend keeping the name the same as the project name. We default to using the same `schema_name` across catalogs, thus this schema must exist in each catalog used. For example, the training pipeline when executed in the staging environment will register the model to `staging.<schema_name>.<model_name>`, whereas the same pipeline executed in the prod environment will register the model to `prod.<schema_name>.<model_name>`. Also, be sure that the service principals in each respective environment have the right permissions to access this schema, which would be `USE_CATALOG`, `USE_SCHEMA`, `MODIFY`, `CREATE_MODEL`, and `CREATE_TABLE`.
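How the setup option interacts with `input_root_dir` and the resulting output directory can be sketched in Python. This is an illustration only: the helper name `resolve_output_dir` is hypothetical, and the logic is read off the `input_root_dir` default and the `success_message` template in `databricks_template_schema.json` from this PR rather than taken from the Databricks CLI.

```python
from typing import Optional

SETUP_OPTIONS = ("CICD_and_Project", "Project_Only", "CICD_Only")


def resolve_output_dir(
    setup: str,
    root_dir: Optional[str] = None,
    project_name: str = "my-mlops-project",
) -> str:
    """Return the directory named in the success message for a given setup option."""
    if setup not in SETUP_OPTIONS:
        raise ValueError(f"unknown setup option: {setup!r}")
    if setup == "CICD_and_Project" or root_dir is None:
        # The root_dir prompt is skipped for CICD_and_Project, so its default
        # (the project name) applies; the same default is used if no value is given.
        root_dir = project_name
    if setup == "CICD_Only":
        # Only CI/CD files are generated, directly under the root directory.
        return root_dir
    return f"{root_dir}/{project_name}"


print(resolve_output_dir("Project_Only", root_dir="my-monorepo", project_name="churn-model"))
```

Under these assumptions a Data Scientist choosing ``Project_Only`` inside an existing monorepo gets a project subdirectory, while an MLE choosing ``CICD_Only`` writes only to the monorepo root.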
200 changes: 158 additions & 42 deletions databricks_template_schema.json
@@ -1,126 +1,242 @@
 {
     "welcome_message": "Welcome to MLOps Stacks. For detailed information on project generation, see the README at https://github.com/databricks/mlops-stacks/blob/main/README.md.",
-    "min_databricks_cli_version": "v0.211.0",
+    "min_databricks_cli_version": "v0.212.2",
     "properties": {
-        "input_project_name": {
+        "input_setup_cicd_and_project": {
             "order": 1,
             "type": "string",
+            "description": "{{if false}}\n\nERROR: This template is not supported by your current Databricks CLI version.\nPlease hit control-C and go to https://docs.databricks.com/en/dev-tools/cli/install.html for instructions on upgrading the CLI to the minimum version supported by MLOps Stacks.\n\n\n{{end}}\nSelect if both CI/CD and the Project should be set up, or only one of them. You can always set up the other later by running initialization again",
+            "default": "CICD_and_Project",
+            "enum": ["CICD_and_Project", "Project_Only", "CICD_Only"]
+        },
+        "input_project_name": {
+            "order": 2,
+            "type": "string",
             "default": "my-mlops-project",
-            "description": "{{if false}}\n\nERROR: This template is no longer supported supported by CLI versions v0.211 and lower.\nPlease hit control-C and go to https://docs.databricks.com/en/dev-tools/cli/install.html for instructions on upgrading the CLI.\n\n\n{{end}}\nProject Name",
+            "description": "\nProject Name. Default",
             "pattern": "^[^ .\\\\/]{3,}$",
-            "pattern_match_failure_message": "Project name must be at least 3 characters long and cannot contain the following characters: \"\\\", \"/\", \" \" and \".\"."
-
+            "pattern_match_failure_message": "Project name must be at least 3 characters long and cannot contain the following characters: \"\\\", \"/\", \" \" and \".\".",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "CICD_Only"
+                    }
+                }
+            }
         },
         "input_root_dir": {
-            "order": 2,
+            "order": 3,
             "type": "string",
-            "default": "my-mlops-project",
-            "description": "\nRoot directory name. Use a name different from the project name if you intend to use monorepo"
+            "default": "{{ .input_project_name }}",
+            "description": "\nRoot directory name. For monorepos, this is the name of the root directory that contains all the projects. Default",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "CICD_and_Project"
+                    }
+                }
+            }
         },
         "input_cloud": {
-            "order": 3,
+            "order": 4,
             "type": "string",
             "description": "\nSelect cloud",
             "default": "azure",
             "enum": ["azure", "aws"]
         },
         "input_cicd_platform": {
-            "order": 4,
+            "order": 5,
             "type": "string",
             "description": "\nSelect CICD platform",
             "default": "github_actions",
-            "enum": ["github_actions", "github_actions_for_github_enterprise_servers", "azure_devops"]
+            "enum": ["github_actions", "github_actions_for_github_enterprise_servers", "azure_devops"],
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "Project_Only"
+                    }
+                }
+            }
         },
         "input_databricks_staging_workspace_host": {
-            "order": 5,
+            "order": 6,
             "type": "string",
             "default": "{{if eq .input_cloud `azure`}}https://adb-xxxx.xx.azuredatabricks.net{{else if eq .input_cloud `aws`}}https://your-staging-workspace.cloud.databricks.com{{end}}",
             "description": "\nURL of staging Databricks workspace, used to run CI tests on PRs and preview config changes before they're deployed to production. Default",
             "pattern": "^(https.*)?$",
-            "pattern_match_failure_message": "Databricks staging workspace host URLs must start with https. Got invalid workspace host."
+            "pattern_match_failure_message": "Databricks staging workspace host URLs must start with https. Got invalid workspace host.",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "Project_Only"
+                    }
+                }
+            }
         },
         "input_databricks_prod_workspace_host": {
-            "order": 6,
+            "order": 7,
             "type": "string",
             "default": "{{if eq .input_cloud `azure`}}https://adb-xxxx.xx.azuredatabricks.net{{else if eq .input_cloud `aws`}}https://your-prod-workspace.cloud.databricks.com{{end}}",
             "description": "\nURL of production Databricks workspace. Default",
             "pattern": "^(https.*)?$",
-            "pattern_match_failure_message": "Databricks production workspace host URLs must start with https. Got invalid workspace host."
+            "pattern_match_failure_message": "Databricks production workspace host URLs must start with https. Got invalid workspace host.",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "Project_Only"
+                    }
+                }
+            }
         },
         "input_default_branch": {
-            "order": 7,
+            "order": 8,
             "type": "string",
             "default": "main",
-            "description": "\nName of the default branch, where the prod and staging ML assets are deployed from and the latest ML code is staged. Default"
+            "description": "\nName of the default branch, where the prod and staging ML assets are deployed from and the latest ML code is staged. Default",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "Project_Only"
+                    }
+                }
+            }
         },
         "input_release_branch": {
-            "order": 8,
+            "order": 9,
             "type": "string",
             "default": "release",
-            "description": "\nName of the release branch. The production jobs (model training, batch inference) defined in this stack pull ML code from this branch. Default"
+            "description": "\nName of the release branch. The production jobs (model training, batch inference) defined in this stack pull ML code from this branch. Default",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "Project_Only"
+                    }
+                }
+            }
         },
         "input_read_user_group": {
-            "order": 9,
+            "order": 10,
             "type": "string",
             "default": "users",
-            "description": "\nUser group name to give READ permissions to for project assets (ML jobs, integration test job runs, and machine learning assets). A group with this name must exist in both the staging and prod workspaces. Default"
+            "description": "\nUser group name to give READ permissions to for project assets (ML jobs, integration test job runs, and machine learning assets). A group with this name must exist in both the staging and prod workspaces. Default",
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "CICD_Only"
+                    }
+                }
+            }
         },
         "input_include_models_in_unity_catalog": {
-            "order": 10,
+            "order": 11,
             "type": "string",
             "description": "\nWhether to use the Model Registry with Unity Catalog",
             "default": "yes",
-            "enum": ["yes", "no"]
+            "enum": ["yes", "no"],
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "CICD_Only"
+                    }
+                }
+            }
         },
         "input_schema_name": {
-            "order": 11,
+            "order": 12,
             "type": "string",
             "description": "\nName of schema to use when registering a model in Unity Catalog. \nNote that this schema must already exist, and we recommend keeping the name the same as the project name as well as giving the service principals the right access. Default",
-            "default": "my-mlops-project",
+            "default": "{{ .input_project_name }}",
             "pattern": "^[^ .\\/]*$",
             "pattern_match_failure_message": "Valid schema names cannot contain any of the following characters: \" \", \".\", \"\\\", \"/\"",
             "skip_prompt_if": {
-                "properties": {
-                    "input_include_models_in_unity_catalog": {
-                        "const": "no"
+                "anyOf":[
+                    {
+                        "properties": {
+                            "input_include_models_in_unity_catalog": {
+                                "const": "no"
+                            }
+                        }
+                    },
+                    {
+                        "properties": {
+                            "input_setup_cicd_and_project": {
+                                "const": "CICD_Only"
+                            }
+                        }
                     }
-                }
+                ]
             }
         },
         "input_unity_catalog_read_user_group": {
-            "order": 12,
+            "order": 13,
             "type": "string",
             "default": "account users",
             "description": "\nUser group name to give EXECUTE privileges to models in Unity Catalog. A group with this name must exist in the Unity Catalog that the staging and prod workspaces can access. Default",
             "skip_prompt_if": {
-                "properties": {
-                    "input_include_models_in_unity_catalog": {
-                        "const": "no"
+                "anyOf":[
+                    {
+                        "properties": {
+                            "input_include_models_in_unity_catalog": {
+                                "const": "no"
+                            }
+                        }
+                    },
+                    {
+                        "properties": {
+                            "input_setup_cicd_and_project": {
+                                "const": "CICD_Only"
+                            }
+                        }
                     }
-                }
+                ]
             }
         },
         "input_include_feature_store": {
-            "order": 13,
+            "order": 14,
             "type": "string",
             "description": "\nWhether to include Feature Store",
             "default": "no",
-            "enum": ["no", "yes"]
+            "enum": ["no", "yes"],
+            "skip_prompt_if": {
+                "properties": {
+                    "input_setup_cicd_and_project": {
+                        "const": "CICD_Only"
+                    }
+                }
+            }
         },
         "input_include_mlflow_recipes": {
-            "order": 14,
+            "order": 15,
             "type": "string",
             "description": "\nWhether to include MLflow Recipes",
             "default": "no",
             "enum": ["no", "yes"],
             "skip_prompt_if": {
-                "properties": {
-                    "input_include_models_in_unity_catalog": {
-                        "const": "yes"
+                "anyOf":[
+                    {
+                        "properties": {
+                            "input_include_models_in_unity_catalog": {
+                                "const": "yes"
+                            }
+                        }
+                    },
+                    {
+                        "properties": {
+                            "input_include_feature_store": {
+                                "const": "yes"
+                            }
+                        }
+                    },
+                    {
+                        "properties": {
+                            "input_setup_cicd_and_project": {
+                                "const": "CICD_Only"
+                            }
+                        }
                     }
-                }
+                ]
             }
         }
     },
-    "success_message" : "\n Your MLOps Stack has been created in the '{{.input_project_name}}' directory!\n\nPlease refer to the README.md of your project for further instructions on getting started."
+    "success_message" : "\n*** Your MLOps Stack has been created in the '{{.input_root_dir}}{{if not (eq .input_setup_cicd_and_project `CICD_Only`) }}/{{.input_project_name}}{{end}}' directory! ***\n\nPlease refer to the README.md for further instructions on getting started."
 }
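The `skip_prompt_if` conditions in this schema use only three shapes: a `properties` block, a `const` value, and an `anyOf` list. A minimal Python sketch of how such a condition could be evaluated against the answers given so far is shown here. It is not the Databricks CLI's implementation; the function name `condition_matches` and the `SCHEMA_NAME_SKIP` constant are hypothetical, and anything beyond those three shapes is ignored.

```python
from typing import Any, Dict


def condition_matches(condition: Dict[str, Any], answers: Dict[str, str]) -> bool:
    """Return True when the answers satisfy a skip_prompt_if condition."""
    if "anyOf" in condition:
        # anyOf: one matching alternative is enough to skip the prompt.
        return any(condition_matches(sub, answers) for sub in condition["anyOf"])
    props = condition.get("properties")
    if not props:
        # Assumption for this sketch: a condition with no rules never skips.
        return False
    # properties: every listed input must equal its "const" value.
    return all(
        name in answers and answers[name] == rule.get("const")
        for name, rule in props.items()
    )


# The skip condition of input_schema_name, copied from the schema.
SCHEMA_NAME_SKIP = {
    "anyOf": [
        {"properties": {"input_include_models_in_unity_catalog": {"const": "no"}}},
        {"properties": {"input_setup_cicd_and_project": {"const": "CICD_Only"}}},
    ]
}

answers = {
    "input_setup_cicd_and_project": "Project_Only",
    "input_include_models_in_unity_catalog": "yes",
}
print(condition_matches(SCHEMA_NAME_SKIP, answers))
```

With these answers neither alternative matches, so the schema-name prompt would still be shown; choosing `CICD_Only` or answering `no` to Unity Catalog would skip it.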
2 changes: 1 addition & 1 deletion dev-requirements.txt
@@ -1,3 +1,3 @@
-pytest
+pytest==7.4.4
 pytest-black
 mlflow==2.0.1