Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Lakehouse monitoring integration (#156)
* define lakehouse monitoring resource
* implement retraining workflow based on monitored metric violation check
* update monitoring table name
* update readme with monitoring specific information
* add explanation to the metric violation check sql query
* convert is_metric_violated flag to bool
* fix monitoring resource definition
* remove disallowed string
* incorporate review comments
  - Minor readme changes
  - Use default assets_dir path for monitoring
* incorporate review comments
  - Accept inference table name from CLI
  - Merge monitoring related resources into a single file
  - Parametrize the metric and validation threshold
* accept fully qualified inference table name
* updates
* nit
* apply comments
* Update CLI version
* Fix tests and add regex
* Fix test
* try to fix tests again
* try to fix tests again
* Fix query
* upgrade gh versions
* add assets-dir

---------

Co-authored-by: Arpit Jasapara <[email protected]>
1 parent 9632f3d, commit 4306c6b

Showing 23 changed files with 320 additions and 33 deletions.
```diff
@@ -57,7 +57,7 @@
 {{- end }}
 
 {{ define `cli_version` -}}
-v0.212.2
+v0.221.0
 {{- end }}
 
 {{ define `stacks_version` -}}
```
...oot_dir}}/{{template `project_name_alphanumeric_underscore` .}}/monitoring/README.md.tmpl (6 changes: 3 additions & 3 deletions)
```diff
@@ -1,5 +1,5 @@
 # Monitoring
 
-Databricks Data Monitoring is currently in Private Preview.
-
-Please contact a Databricks representative for more information.
+To enable monitoring as part of a scheduled Databricks workflow, please update all the TODOs in the [monitoring resource file](../resources/monitoring-resource.yml), and refer to
+[{{template `project_name_alphanumeric_underscore` .}}/resources/README.md](../resources/README.md). The implementation supports monitoring of batch inference tables directly.
+For real time inference tables, unpacking is required before monitoring can be attached.
```
...project_name_alphanumeric_underscore` .}}/monitoring/metric_violation_check_query.py.tmpl (84 changes: 84 additions & 0 deletions)
@@ -0,0 +1,84 @@

````python
# This file is used for the main SQL query that checks the last {num_evaluation_windows} metric violations and whether at least {num_violation_windows} of those runs violate the condition.

import sys
import pathlib

sys.path.append(str(pathlib.Path(__file__).parent.parent.parent.resolve()))

"""The SQL query is divided into three main parts. The first part selects the top {num_evaluation_windows}
values of the metric to be monitored, ordered by the time window, and saves as recent_metrics.
```sql
WITH recent_metrics AS (
    SELECT
        {metric_to_monitor},
        window
    FROM
        {table_name_under_monitor}_profile_metrics
    WHERE
        column_name = ":table"
        AND slice_key IS NULL
        AND model_id != "*"
        AND log_type = "INPUT"
    ORDER BY
        window DESC
    LIMIT
        {num_evaluation_windows}
)
```
The `column_name = ":table"` and `slice_key IS NULL` conditions ensure that the metric
is selected for the entire table within the given granularity. The `log_type = "INPUT"`
condition ensures that the primary table metrics are considered, but not the baseline
table metrics. The `model_id!= "*"` condition ensures that the metric aggregated across
all model IDs is not selected.

The second part of the query determines if the metric values have been violated with two cases.
The first case checks if the metric value is greater than the threshold for at least {num_violation_windows} windows:
```sql
(SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows}
```
The second case checks if the most recent metric value is greater than the threshold. This is to make sure we only trigger retraining
if the most recent window was violated, avoiding unnecessary retraining if the violation was in the past and the metric is now within the threshold:
```sql
(SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold}
```

The final part of the query sets the `query_result` to 1 if both of the above conditions are met, and 0 otherwise:
```sql
SELECT
    CASE
        WHEN
            # Check if the metric value is greater than the threshold for at least {num_violation_windows} windows
            AND
            # Check if the most recent metric value is greater than the threshold
        THEN 1
        ELSE 0
    END AS query_result
```
"""

sql_query = """WITH recent_metrics AS (
    SELECT
        {metric_to_monitor},
        window
    FROM
        {table_name_under_monitor}_profile_metrics
    WHERE
        column_name = ":table"
        AND slice_key IS NULL
        AND model_id != "*"
        AND log_type = "INPUT"
    ORDER BY
        window DESC
    LIMIT
        {num_evaluation_windows}
)
SELECT
    CASE
        WHEN
            (SELECT COUNT(*) FROM recent_metrics WHERE {metric_to_monitor} > {metric_violation_threshold}) >= {num_violation_windows}
            AND
            (SELECT {metric_to_monitor} FROM recent_metrics ORDER BY window DESC LIMIT 1) > {metric_violation_threshold}
        THEN 1
        ELSE 0
    END AS query_result
"""
````
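The decision rule in `sql_query` can be mirrored in plain Python, which may help when reasoning about or unit-testing threshold choices. This is a minimal sketch and is not part of the commit: it assumes the profile-metrics rows have already been filtered as in the `recent_metrics` CTE and arrive as hypothetical `(window_start, metric_value)` pairs, with `None` standing in for SQL `NULL`. The function name `is_metric_violated` is illustrative.

```python
from typing import Optional, Sequence, Tuple


def is_metric_violated(
    rows: Sequence[Tuple[str, Optional[float]]],
    metric_violation_threshold: float,
    num_evaluation_windows: int,
    num_violation_windows: int,
) -> bool:
    """Pure-Python mirror of the CASE expression in sql_query (illustrative sketch).

    rows: hypothetical (window_start, metric_value) pairs, already filtered like
    the recent_metrics CTE; window_start must sort chronologically (for example
    ISO-8601 strings). None plays the role of SQL NULL.
    """
    # ORDER BY window DESC LIMIT {num_evaluation_windows}
    recent = sorted(rows, key=lambda row: row[0], reverse=True)[:num_evaluation_windows]
    if not recent:
        # No rows: the scalar subquery yields NULL, so the CASE falls through to ELSE 0.
        return False

    def exceeds(value: Optional[float]) -> bool:
        # NULL > threshold is never true in SQL, so None never counts as a violation
        return value is not None and value > metric_violation_threshold

    # Condition 1: at least num_violation_windows of the recent windows exceed the threshold
    num_violations = sum(1 for _, value in recent if exceeds(value))
    # Condition 2: the most recent window itself exceeds the threshold
    latest_violated = exceeds(recent[0][1])
    return num_violations >= num_violation_windows and latest_violated
```

For example, with scores 90, 120, and 130 (oldest to newest), a threshold of 100 and `num_violation_windows=2`, the result is `True`. With scores 150, 140, and 80 it is `False` even though two earlier windows violated, which is the "avoid retraining on a past violation" behavior the docstring describes.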
...e_alphanumeric_underscore` .}}/monitoring/notebooks/MonitoredMetricViolationCheck.py.tmpl (68 changes: 68 additions & 0 deletions)
@@ -0,0 +1,68 @@

```python
# Databricks notebook source
##################################################################################
# This notebook runs a SQL query and sets the result as a job task value.
#
# This notebook has the following parameters:
#
# * table_name_under_monitor (required) - The name of a table that is currently being monitored
# * metric_to_monitor (required) - Metric to be monitored for threshold violation
# * metric_violation_threshold (required) - Threshold value for metric violation
# * num_evaluation_windows (required) - Number of windows to check for violation
# * num_violation_windows (required) - Number of windows that need to violate the threshold
##################################################################################

# List of input args needed to run the notebook as a job.
# Provide them via DB widgets or notebook arguments.
#
# Name of the table that is currently being monitored
dbutils.widgets.text(
    "table_name_under_monitor", "{{ .input_inference_table_name }}", label="Full (three-level) table name"
)
# Metric to be used for threshold violation check
dbutils.widgets.text(
    "metric_to_monitor", "root_mean_squared_error", label="Metric to be monitored for threshold violation"
)

# Threshold value to be checked
dbutils.widgets.text(
    "metric_violation_threshold", "100", label="Threshold value for metric violation"
)

# Number of most recent windows to evaluate
dbutils.widgets.text(
    "num_evaluation_windows", "5", label="Number of windows to check for violation"
)

# Minimum number of evaluated windows that must violate the threshold
dbutils.widgets.text(
    "num_violation_windows", "2", label="Number of windows that need to violate the threshold"
)

# COMMAND ----------

import os
import sys
notebook_path = '/Workspace/' + os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get())
%cd $notebook_path
%cd ..
sys.path.append("../..")

# COMMAND ----------

from metric_violation_check_query import sql_query

table_name_under_monitor = dbutils.widgets.get("table_name_under_monitor")
metric_to_monitor = dbutils.widgets.get("metric_to_monitor")
metric_violation_threshold = dbutils.widgets.get("metric_violation_threshold")
# The window counts must also be read; sql_query.format() below references them
num_evaluation_windows = dbutils.widgets.get("num_evaluation_windows")
num_violation_windows = dbutils.widgets.get("num_violation_windows")

formatted_sql_query = sql_query.format(
    table_name_under_monitor=table_name_under_monitor,
    metric_to_monitor=metric_to_monitor,
    metric_violation_threshold=metric_violation_threshold,
    num_evaluation_windows=num_evaluation_windows,
    num_violation_windows=num_violation_windows)
# Convert the 0/1 query result to a plain Python bool (task values must be JSON-representable)
is_metric_violated = bool(spark.sql(formatted_sql_query).toPandas()["query_result"][0])

dbutils.jobs.taskValues.set("is_metric_violated", is_metric_violated)
```
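The notebook formats `sql_query` with raw text-widget values, so a malformed table name or threshold only surfaces as a SQL error at run time, and free text is interpolated directly into the query. A pre-check along the lines below could run before `sql_query.format(...)`. It is a hypothetical sketch, not code from this commit: the helper name `validate_params` and both regular expressions are assumptions (for instance, backtick-quoted Unity Catalog names are deliberately rejected).

```python
import math
import re

# Hypothetical rules (not from the commit): each part of catalog.schema.table is
# letters, digits, or underscores; backtick-quoted names are deliberately rejected.
TABLE_NAME_RE = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+")
METRIC_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_params(table_name, metric, threshold, num_evaluation_windows, num_violation_windows):
    """Check raw widget strings and coerce them to the types the query expects.

    Returns a dict whose keys match the placeholders in sql_query, so it can be
    passed as sql_query.format(**params). Raises ValueError on bad input.
    """
    if not TABLE_NAME_RE.fullmatch(table_name):
        raise ValueError(f"expected catalog.schema.table, got {table_name!r}")
    if not METRIC_NAME_RE.fullmatch(metric):
        raise ValueError(f"invalid metric column name: {metric!r}")
    threshold_value = float(threshold)  # raises ValueError if the text is not numeric
    if not math.isfinite(threshold_value):
        raise ValueError("threshold must be a finite number")
    n_eval = int(num_evaluation_windows)  # raises ValueError if not an integer
    n_viol = int(num_violation_windows)
    if not 1 <= n_viol <= n_eval:
        raise ValueError("need 1 <= num_violation_windows <= num_evaluation_windows")
    return {
        "table_name_under_monitor": table_name,
        "metric_to_monitor": metric,
        "metric_violation_threshold": threshold_value,
        "num_evaluation_windows": n_eval,
        "num_violation_windows": n_viol,
    }
```

Requiring at least one violating window is a choice of this sketch; the SQL itself would accept `0`, in which case only the most-recent-window condition decides the result.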