Merge pull request #84 from databricks-industry-solutions/add-numerical-and-categorical-covariates

added dynamic future numerical and categorical
ryuta-yoshimatsu authored Jan 24, 2025
2 parents 1c0fb9b + f34ff6c commit ab0e195
Showing 12 changed files with 178 additions and 107 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -89,7 +89,7 @@ run_forecast(
#### Parameters description:

- ```train_data``` is a delta table name that stores the input dataset.
- ```scoring_data``` is a delta table name that stores the [dynamic future regressors](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html#3-training-with-exogenous-variables). If not provided, or if the same name as ```train_data``` is provided, the models will ignore the dynamic future regressors.
- ```scoring_data``` is a delta table name that stores the [dynamic future regressors](https://nixtlaverse.nixtla.io/statsforecast/docs/how-to-guides/exogenous.html). If not provided, or if the same name as ```train_data``` is provided, the models will ignore the dynamic future regressors.
- ```scoring_output``` is a delta table where you write your forecasting output. This table will be created if it does not exist.
- ```evaluation_output``` is a delta table where you write the evaluation results from all backtesting trials across all time series and all models. This table will be created if it does not exist.
- ```group_id``` is a column storing the unique id that identifies each time series in your dataset.
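
For orientation, here is a minimal sketch of a `run_forecast` call wired up with the parameters described above. The table names, column names, and model list are illustrative only, and additional arguments may be needed in practice.

```python
from mmf_sa import run_forecast

# Minimal illustrative call; table and column names are placeholders.
run_forecast(
    train_data="catalog.schema.daily_train",              # input dataset
    scoring_data="catalog.schema.daily_scoring",           # dynamic future regressors
    scoring_output="catalog.schema.daily_forecast",        # created if it does not exist
    evaluation_output="catalog.schema.daily_evaluation",   # created if it does not exist
    group_id="store_id",
    date_col="date",
    target="sales",
    freq="D",
    prediction_length=10,
    backtest_months=1,
    stride=10,
    active_models=["StatsForecastBaselineWindowAverage"],
)
```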
4 changes: 2 additions & 2 deletions examples/global_external_regressors_daily.py
@@ -90,7 +90,7 @@
# COMMAND ----------

# MAGIC %md
# MAGIC Note that `rossmann_daily_train` contains our target variable `Sales` but `rossmann_daily_test` does not. This is because `rossmann_daily_test` will be used as our `scoring_data`, which stores the `dynamic_future` variables for the future dates. When you adapt this notebook to your use case, make sure your datasets follow the same format. See neuralforecast's [documentation](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html) for more detail on exogenous regressors.
# MAGIC Note that `rossmann_daily_train` contains our target variable `Sales` but `rossmann_daily_test` does not. This is because `rossmann_daily_test` will be used as our `scoring_data`, which stores the `dynamic_future_categorical` variables for the future dates. When you adapt this notebook to your use case, make sure your datasets follow the same format. See neuralforecast's [documentation](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html) for more detail on exogenous regressors.
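
As a rough sketch of how such a pair of tables could be produced (assuming `df` is a Spark DataFrame holding the raw Rossmann data with a `Store` id column; the cutoff date is arbitrary):

```python
import pyspark.sql.functions as F

cutoff = "2015-06-30"  # illustrative split date

# Training table keeps the target `Sales`.
train_df = df.filter(F.col("Date") <= cutoff)

# Scoring table covers the future horizon and carries only the
# known-in-advance covariates -- no `Sales` column.
test_df = (
    df.filter(F.col("Date") > cutoff)
      .select("Store", "Date", "DayOfWeek", "Open", "Promo", "SchoolHoliday")
)

train_df.write.mode("overwrite").saveAsTable("rossmann_daily_train")
test_df.write.mode("overwrite").saveAsTable("rossmann_daily_test")
```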

# COMMAND ----------

@@ -118,7 +118,7 @@

# MAGIC %md ### Run MMF
# MAGIC
# MAGIC Now, we run the evaluation and forecasting using the `run_forecast` function, providing the training and scoring table names. If `scoring_data` is not provided, or if the same name as `train_data` is provided, the models will ignore the `dynamic_future` regressors. Note that this time we are providing a covariate field (i.e. `dynamic_future`) to the `run_forecast` function called in [examples/run_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/run_external_regressors_daily.py). There are also other covariate fields, namely `static_features` and `dynamic_historical`, which you can provide. Read more about these covariates in [neuralforecast's documentation](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html).
# MAGIC Now, we run the evaluation and forecasting using the `run_forecast` function, providing the training and scoring table names. If `scoring_data` is not provided, or if the same name as `train_data` is provided, the models will ignore the `dynamic_future_numerical` and `dynamic_future_categorical` regressors. Note that this time we are providing a covariate field (i.e. `dynamic_future_numerical` or `dynamic_future_categorical`) to the `run_forecast` function called in [examples/run_external_regressors_daily.py](https://github.com/databricks-industry-solutions/many-model-forecasting/blob/main/examples/run_external_regressors_daily.py). There are also other covariate fields, namely `static_features`, `dynamic_historical_numerical`, and `dynamic_historical_categorical`, which you can provide. Read more about these covariates in [neuralforecast's documentation](https://nixtlaverse.nixtla.io/neuralforecast/examples/exogenous_variables.html).
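
A hedged sketch of what passing several covariate fields together could look like for a global model. `StoreType`, `CompetitionDistance`, and `Customers` are hypothetical column choices, `Store` is an assumed id column, and arguments such as `active_models`, `backtest_months`, and `stride` are omitted for brevity.

```python
from mmf_sa import run_forecast

run_forecast(
    train_data="rossmann_daily_train",
    scoring_data="rossmann_daily_test",
    scoring_output="rossmann_daily_forecast",          # illustrative output tables
    evaluation_output="rossmann_daily_evaluation",
    group_id="Store",
    date_col="Date",
    target="Sales",
    freq="D",
    prediction_length=10,
    static_features=["StoreType"],                     # hypothetical static attribute
    dynamic_future_numerical=["CompetitionDistance"],  # hypothetical numerical covariate
    dynamic_future_categorical=["Open", "Promo", "SchoolHoliday"],
    dynamic_historical_numerical=["Customers"],        # known only for past dates
)
```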

# COMMAND ----------

6 changes: 3 additions & 3 deletions examples/local_univariate_external_regressors_daily.py
@@ -90,7 +90,7 @@
# COMMAND ----------

# MAGIC %md
# MAGIC Note that `rossmann_daily_train` contains our target variable `Sales` but `rossmann_daily_test` does not. This is because `rossmann_daily_test` will be used as our `scoring_data`, which stores the `dynamic_future` variables for the future dates. When you adapt this notebook to your use case, make sure your datasets follow the same format. See statsforecast's [documentation](https://nixtlaverse.nixtla.io/statsforecast/docs/how-to-guides/exogenous.html) for more detail on exogenous regressors.
# MAGIC Note that `rossmann_daily_train` contains our target variable `Sales` but `rossmann_daily_test` does not. This is because `rossmann_daily_test` will be used as our `scoring_data`, which stores the `dynamic_future_categorical` variables for the future dates. When you adapt this notebook to your use case, make sure your datasets follow the same format. See statsforecast's [documentation](https://nixtlaverse.nixtla.io/statsforecast/docs/how-to-guides/exogenous.html) for more detail on exogenous regressors.

# COMMAND ----------

@@ -134,7 +134,7 @@

# MAGIC %md ### Run MMF
# MAGIC
# MAGIC Now, we run the evaluation and forecasting using the `run_forecast` function, providing the training and scoring table names. If `scoring_data` is not provided, or if the same name as `train_data` is provided, the models will ignore the `dynamic_future` regressors. Note that this time we are providing a covariate field (i.e. `dynamic_future`). There are also other covariate fields, namely `static_features` and `dynamic_historical`, but these are relevant only for the global models.
# MAGIC Now, we run the evaluation and forecasting using the `run_forecast` function, providing the training and scoring table names. If `scoring_data` is not provided, or if the same name as `train_data` is provided, the models will ignore the `dynamic_future_numerical` and `dynamic_future_categorical` regressors. Note that this time we are providing a covariate field (i.e. `dynamic_future_numerical` or `dynamic_future_categorical`). There are also other covariate fields, namely `static_features`, `dynamic_historical_numerical`, and `dynamic_historical_categorical`, but these are relevant only for the global models.

# COMMAND ----------

@@ -148,7 +148,7 @@
date_col="Date",
target="Sales",
freq="D",
dynamic_future=["DayOfWeek", "Open", "Promo", "SchoolHoliday"],
dynamic_future_categorical=["DayOfWeek", "Open", "Promo", "SchoolHoliday"],
prediction_length=10,
backtest_months=1,
stride=10,
2 changes: 1 addition & 1 deletion examples/run_external_regressors_daily.py
@@ -36,7 +36,7 @@
date_col="Date",
target="Sales",
freq="D",
dynamic_future=["DayOfWeek", "Open", "Promo", "SchoolHoliday"],
dynamic_future_categorical=["DayOfWeek", "Open", "Promo", "SchoolHoliday"],
prediction_length=10,
backtest_months=1,
stride=10,
24 changes: 16 additions & 8 deletions mmf_sa/__init__.py
@@ -28,8 +28,10 @@ def run_forecast(
model_output: str = None,
use_case_name: str = None,
static_features: List[str] = None,
dynamic_future: List[str] = None,
dynamic_historical: List[str] = None,
dynamic_future_numerical: List[str] = None,
dynamic_future_categorical: List[str] = None,
dynamic_historical_numerical: List[str] = None,
dynamic_historical_categorical: List[str] = None,
active_models: List[str] = None,
accelerator: str = "cpu",
backtest_retrain: bool = None,
@@ -63,8 +65,10 @@
model_output (str): A string specifying the output path for the model.
use_case_name (str): A string specifying the use case name.
static_features (List[str]): A list of strings specifying the static features.
dynamic_future (List[str]): A list of strings specifying the dynamic future features.
dynamic_historical (List[str]): A list of strings specifying the dynamic historical features.
dynamic_future_numerical (List[str]): A list of strings specifying the dynamic future features that are numerical.
dynamic_future_categorical (List[str]): A list of strings specifying the dynamic future features that are categorical.
dynamic_historical_numerical (List[str]): A list of strings specifying the dynamic historical features that are numerical.
dynamic_historical_categorical (List[str]): A list of strings specifying the dynamic historical features that are categorical.
active_models (List[str]): A list of strings specifying the active models.
accelerator (str): A string specifying the accelerator to use: cpu or gpu. Default is cpu.
backtest_retrain (bool): A boolean specifying whether to retrain the model during backtesting. Currently not supported.
@@ -137,10 +141,14 @@
_conf["data_quality_check"] = data_quality_check
if static_features is not None:
_conf["static_features"] = static_features
if dynamic_future is not None:
_conf["dynamic_future"] = dynamic_future
if dynamic_historical is not None:
_conf["dynamic_historical"] = dynamic_historical
if dynamic_future_numerical is not None:
_conf["dynamic_future_numerical"] = dynamic_future_numerical
if dynamic_future_categorical is not None:
_conf["dynamic_future_categorical"] = dynamic_future_categorical
if dynamic_historical_numerical is not None:
_conf["dynamic_historical_numerical"] = dynamic_historical_numerical
if dynamic_historical_categorical is not None:
_conf["dynamic_historical_categorical"] = dynamic_historical_categorical
if run_id is not None:
_conf["run_id"] = run_id

34 changes: 23 additions & 11 deletions mmf_sa/data_quality_checks.py
@@ -43,8 +43,10 @@ def _external_regressors_check(self):
"""
if (
self.conf.get("static_features", None)
or self.conf.get("dynamic_future", None)
or self.conf.get("dynamic_historical", None)
or self.conf.get("dynamic_future_numerical", None)
or self.conf.get("dynamic_future_categorical", None)
or self.conf.get("dynamic_historical_numerical", None)
or self.conf.get("dynamic_historical_categorical", None)
):
if self.conf.get("resample"):
raise Exception(
@@ -77,19 +79,29 @@ def _multiple_checks(

# 1. Checking for nulls in external regressors
static_features = conf.get("static_features", None)
dynamic_future = conf.get("dynamic_future", None)
dynamic_historical = conf.get("dynamic_historical", None)
dynamic_future_numerical = conf.get("dynamic_future_numerical", None)
dynamic_future_categorical = conf.get("dynamic_future_categorical", None)
dynamic_historical_numerical = conf.get("dynamic_historical_numerical", None)
dynamic_historical_categorical = conf.get("dynamic_historical_categorical", None)
if static_features:
if _df[static_features].isnull().values.any():
# Removing: null in static categoricals
# Removing: null in static categorical
return pd.DataFrame()
if dynamic_future:
if _df[dynamic_future].isnull().values.any():
# Removing: null in dynamic future
if dynamic_future_numerical:
if _df[dynamic_future_numerical].isnull().values.any():
# Removing: null in dynamic future numerical
return pd.DataFrame()
if dynamic_historical:
if _df[dynamic_historical].isnull().values.any():
# Removing: null in dynamic historical
if dynamic_future_categorical:
if _df[dynamic_future_categorical].isnull().values.any():
# Removing: null in dynamic future categorical
return pd.DataFrame()
if dynamic_historical_numerical:
if _df[dynamic_historical_numerical].isnull().values.any():
# Removing: null in dynamic historical numerical
return pd.DataFrame()
if dynamic_historical_categorical:
if _df[dynamic_historical_categorical].isnull().values.any():
# Removing: null in dynamic historical categorical
return pd.DataFrame()

# 2. Checking for training period length
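To make the null checks above concrete, here is a small self-contained sketch (the DataFrame and column names are made up; only the check logic mirrors the code in this diff):

```python
import pandas as pd

# A group with a null in one of its dynamic future numerical columns
# is dropped: the check returns an empty DataFrame for that group.
group_df = pd.DataFrame({
    "Date": pd.date_range("2025-01-01", periods=3, freq="D"),
    "Sales": [10.0, 12.0, 11.0],
    "Promo": [1.0, None, 0.0],  # null in a dynamic_future_numerical column
})

dynamic_future_numerical = ["Promo"]
if group_df[dynamic_future_numerical].isnull().values.any():
    group_df = pd.DataFrame()   # the series is excluded

print(group_df.empty)  # True
```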
8 changes: 6 additions & 2 deletions mmf_sa/forecasting_conf.yaml
@@ -11,12 +11,16 @@ accelerator: cpu
static_features:
#- State

dynamic_future:
dynamic_future_numerical:

dynamic_future_categorical:
#- Open
#- Promo
#- DayOfWeek

dynamic_historical:
dynamic_historical_numerical:

dynamic_historical_categorical:

active_models:
- StatsForecastBaselineWindowAverage
6 changes: 4 additions & 2 deletions mmf_sa/models/models_conf.yaml
@@ -10,8 +10,10 @@ promoted_props:
- backtest_months
- stride
- static_features
- dynamic_future
- dynamic_historical
- dynamic_future_numerical
- dynamic_future_categorical
- dynamic_historical_numerical
- dynamic_historical_categorical

models:

(Diffs for the remaining 4 changed files are not shown here.)
