Merge pull request #58 from winedarksea/dev
0.3.1
winedarksea authored Mar 24, 2021
2 parents 360775e + 14e7140 commit ec48749
Showing 56 changed files with 1,060 additions and 649 deletions.
27 changes: 25 additions & 2 deletions README.md
@@ -21,7 +21,7 @@ For other time series needs, check out the list [here](https://github.com/MaxBen
* Allows automatic ensembling of best models
* 'horizontal' ensembling on multivariate series - learning the best model for each series
* Multiple cross validation options
* 'seasonal' validation allows forecasts to be optimized for the season of your forecast period
* 'seasonal' validation allows forecasts to be optimized for the seasonality of the data
* Subsetting and weighting to improve speed and relevance of search on large datasets
* 'constraint' parameter can be used to assure forecasts don't drift beyond historic boundaries
* Option to use one or a combination of metrics for model selection
@@ -72,12 +72,15 @@ model = model.fit(
id_col='series_id' if long else None,
)

prediction = model.predict()
# Print the details of the best model
print(model)

prediction = model.predict()
# point forecasts dataframe
forecasts_df = prediction.forecast
# upper and lower forecasts
forecasts_up, forecasts_low = prediction.upper_forecast, prediction.lower_forecast

# accuracy of all tried model results
model_results = model.results()
# and aggregated from cross validation
@@ -88,6 +91,26 @@ The lower-level API, in particular the large section of time series transformers

Check out [extended_tutorial.md](https://winedarksea.github.io/AutoTS/build/html/source/tutorial.html) for a more detailed guide to features!


## Tips for Speed and Large Data:
* Use appropriate model lists, especially the predefined lists:
* `superfast` (simple naive models) and `fast` (more complex but still faster models)
* `fast_parallel` (a combination of `fast` and `parallel`) or `parallel`, given many CPU cores are available
* `n_jobs` usually gets pretty close with `n_jobs='auto'`, but adjust as necessary for the environment
* see a dict of predefined lists (some defined for internal use) with `from autots.models.model_list import model_lists`
* Use the `subset` parameter when there are many similar series, `subset=100` will often generalize well for tens of thousands of similar series.
* if using `subset`, passing `weights` for series will weight subset selection towards higher priority series.
* if limited by RAM, the search can easily be distributed by running multiple instances of AutoTS on different batches of data, having first imported a template (pretrained on all the data) as a shared starting point.
* Set `model_interrupt=True`, which skips over the current model when a `KeyboardInterrupt` (i.e. `ctrl+c`) is raised (although if the interrupt falls between generations, it will stop the entire training).
* Use the `result_file` parameter of `.fit()`, which will save progress after each generation - helpful if a long training is being done. Use `import_results` to recover.
* While Transformations are pretty fast, setting `transformer_max_depth` to a lower number (say, 2) will increase speed. Also utilize `transformer_list`.
* Ensembles are naturally slower to predict because they run many models; 'distance' ensembles are about 2x slower, and 'simple' ensembles 3x-5x slower.
* `ensemble='horizontal-max'` with `model_list='no_shared_fast'` can scale relatively well given many CPU cores, because each model is only run on the series it is needed for.
* Reducing `num_validations` and `models_to_validate` will decrease runtime but may lead to poorer model selections.
* For datasets with many records, aggregating to a coarser frequency (for example, from daily to monthly) can reduce training time if appropriate.
* this can be done by adjusting `frequency` and `aggfunc` but is probably best done before passing data into AutoTS.
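The pre-aggregation suggested above can be done with plain pandas before the data ever reaches AutoTS. A minimal sketch with made-up daily data (the `sales` column and date range are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative daily data; in practice this would be your own series.
idx = pd.date_range("2021-01-01", periods=365, freq="D")
daily = pd.DataFrame(
    {"sales": np.random.default_rng(0).random(365)}, index=idx
)

# Aggregate to month-start frequency so the model search sees 12 rows
# instead of 365; totals are preserved by summing.
monthly = daily.resample("MS").sum()
print(len(monthly))  # 12 rows for one year of daily data
```

A similar effect can be had inside AutoTS via `frequency` and `aggfunc`, but doing it up front keeps the search itself lighter.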


## How to Contribute:
* Give feedback on where you find the documentation confusing
* Use AutoTS and...
35 changes: 13 additions & 22 deletions TODO.md
@@ -15,23 +15,18 @@
* Forecasts are desired for the future immediately following the most recent data.

# Latest
* **breaking change** to model templates: transformers structure change
* grouping no longer used
* parameter generation for transformers allowing more possible combinations
* transformer_max_depth parameter
* Horizontal Ensembles are now much faster by only running models on the subset of series they apply to
* general starting template improved and updated to new transformer format
* change many np.random to random
* random.choices further necessitates python 3.6 or greater
* bug fix in Detrend transformer
* bug fix in SeasonalDifference transformer
* SPL bug fix when NaN in test set
* inverse_transform now fills NaN with zero for upper/lower forecasts
* expanded model_list aliases, with dedicated module
* bug fix (creating 0,0 order) and tuning of VARMAX
* Fix export_template bug
* restructuring of some lower-level function locations

* Additional models to GluonTS
* GeneralTransformer transformation_params - now handle None or empty dict
* cleaning up of the appropriately named 'ModelMonster'
* improving MotifSimulation
* better error message for all models
* enable histgradientboost regressor (left out earlier on the assumption it wouldn't stay experimental this long)
* import_template now has slightly better `method` input style
* allow `ensemble` parameter to be a list
* NumericTransformer
* add .fit_transform method
* generally more options and speed improvement
* added NumericTransformer to future_regressors, should now coerce if they have different dtypes

# Known Errors:
DynamicFactor holidays Exceptions 'numpy.ndarray' object has no attribute 'values'
@@ -64,12 +59,8 @@ Tensorflow GPU backend may crash on occasion.
* Remove 'horizontal' sanity check run, takes too long (only if metric weights are x)?
* Horizontal and BestN runtime variant, where speed is highly important in model selection
* total runtime for .fit() as attribute (not just manual sum but capture in ModelPrediction)
* allow Index to be other datetime not just DatetimeIndex
* cleanse similar models out first, before horizontal ensembling
* BestNEnsemble Add 5 or more model option
* allow best_model to be specified and entirely bypass the .fit() stage.
* drop duplicates as function of TemplateEvalObject
* improve test.py script for actual testing of many features
* Convert 'Holiday' regressors into Datepart + Holiday 2d
* export and import of results includes all model parameters (but not templates?)
* Option to use full traceback in errors in table
@@ -94,7 +85,7 @@ Tensorflow GPU backend may crash on occasion.
* Probabilistic:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html
* GluonTS
* Add support for future_regressor
* Add support for future_regressor (potentially PCA down to 1 feature, then use?)
* Modify GluonStart if lots of NaN at start of that series
* GPU and CPU ctx
* implement 'borrow' Genetic Recombination for ComponentAnalysis
4 changes: 3 additions & 1 deletion autots/__init__.py
@@ -9,13 +9,14 @@
load_monthly,
load_yearly,
load_weekly,
load_weekdays,
)

from autots.evaluator.auto_ts import AutoTS
from autots.tools.transform import GeneralTransformer, RandomTransform
from autots.tools.shaping import long_to_wide

__version__ = '0.3.0'
__version__ = '0.3.1'

TransformTS = GeneralTransformer

@@ -25,6 +26,7 @@
'load_yearly',
'load_hourly',
'load_weekly',
'load_weekdays',
'AutoTS',
'TransformTS',
'GeneralTransformer',
10 changes: 9 additions & 1 deletion autots/datasets/__init__.py
@@ -6,5 +6,13 @@
from autots.datasets._base import load_yearly
from autots.datasets._base import load_hourly
from autots.datasets._base import load_weekly
from autots.datasets._base import load_weekdays

__all__ = ['load_daily', 'load_monthly', 'load_yearly', 'load_hourly', 'load_weekly']
__all__ = [
'load_daily',
'load_monthly',
'load_yearly',
'load_hourly',
'load_weekly',
'load_weekdays',
]
32 changes: 32 additions & 0 deletions autots/datasets/_base.py
@@ -172,3 +172,35 @@ def load_weekly(long: bool = True):
aggfunc='first',
)
return df_wide


def load_weekdays(long: bool = False, categorical: bool = True, periods: int = 180):
"""Test edge cases by creating a Series with values as day of week.
Args:
long (bool):
if True, return a df with columns "value" and "datetime"
if False, return a Series with dt index
categorical (bool): if True, return str/object, else return int
periods (int): number of periods, ie length of data to generate
"""
idx = pd.date_range(end=pd.Timestamp.today(), periods=periods, freq="D")
df_wide = pd.Series(idx.weekday, index=idx, name="value")
df_wide.index.name = "datetime"
if categorical:
df_wide = df_wide.replace(
{
0: "Mon",
1: "Tues",
2: "Wed",
3: "Thor's",
4: "Fri",
5: "Sat",
6: "Sun",
7: "Mon",
}
)
if long:
return df_wide.reset_index()
else:
return df_wide
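As a quick sanity check of the new loader's logic, the same series can be rebuilt standalone with only pandas, mirroring the mapping in the diff above rather than calling AutoTS itself:

```python
import pandas as pd

# Rebuild the series load_weekdays produces: a daily index whose
# values are the day of week (Monday=0 ... Sunday=6).
idx = pd.date_range(end=pd.Timestamp.today(), periods=180, freq="D")
weekdays = pd.Series(idx.weekday, index=idx, name="value")

# Same label mapping as in the diff above.
labels = {0: "Mon", 1: "Tues", 2: "Wed", 3: "Thor's", 4: "Fri", 5: "Sat", 6: "Sun"}
categorical = weekdays.map(labels)
print(categorical.nunique())  # 7 - all weekdays appear across 180 days
```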
