PandasUDF with AutoTS #76
Replies: 3 comments 6 replies
-
Instead of using a Spark cluster, just use a bigger VM. You can fit a lot of time series into 120 GB (or more) of RAM! You can also manually split your time series into chunks and run AutoTS on each chunk on a different node. You could also use joblib's Dask backend to spread n_jobs across a Dask cluster for many of the models. I'm not a big fan of Spark, so it's unlikely I'll be adding support myself anytime soon.
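The manual chunking idea above can be sketched with plain pandas: split the set of series ids into fixed-size chunks, then run AutoTS (or any per-chunk job) on each chunk independently on a different node. This is an illustrative helper, not part of AutoTS; the `series_id` column name and the chunk size are assumptions.

```python
import pandas as pd

def chunk_series(df: pd.DataFrame, id_col: str = "series_id", chunk_size: int = 500):
    """Yield sub-DataFrames, each holding at most `chunk_size` distinct series.

    Each chunk can then be shipped to a different node/process and fed to
    AutoTS on its own (multivariate models then only see series in that chunk).
    """
    ids = sorted(df[id_col].unique())
    for start in range(0, len(ids), chunk_size):
        subset = ids[start:start + chunk_size]
        yield df[df[id_col].isin(subset)]

# toy example: 1200 series, 2 rows each -> 3 chunks of at most 500 series
toy = pd.DataFrame({
    "series_id": [f"s{i}" for i in range(1200) for _ in range(2)],
    "value": 1.0,
})
chunks = list(chunk_series(toy))
print([c["series_id"].nunique() for c in chunks])  # [500, 500, 200]
```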
-
I needed to run some code in Databricks. Here is a script that worked for me:

```python
from pyspark.sql.functions import to_date, substring, lit
from pyspark.sql.types import StructType, StructField, StringType, DateType, FloatType
import pandas as pd
from autots import AutoTS

# {load your df from source}
# rename the source data columns to something more generic
df = (
    df.withColumnRenamed("id", "series_id")
    .withColumnRenamed("MonthDateKey", "datetime")
    .withColumnRenamed("UnitSales", "value")
)
df = df.withColumn("datetime", to_date(df.datetime, "yyyy-MM-dd"))
df = df.withColumn("value", df.value.cast("float"))
# there is no product category column here, but the first few characters of the
# product id form a group of sorts (by manufacturer).
# AutoTS runs on each chunk, which allows multivariate/global models within that
# chunk only, so choose a reasonable chunk of similar series.
# The grouping lets a big cluster distribute the work; if you are only using one
# node, set it to a constant: lit("single_group")
df = df.withColumn("prod_group", substring("ProductID", 0, 3))
df.describe().show()
display(df)
df.printSchema()

# equivalent StructType schema; had issues trying to make datetime come out as a
# date type, so the string schema below is what actually gets passed in
schema = StructType([
    StructField("series_id", StringType(), True),
    StructField("datetime", DateType(), True),
    StructField("value", FloatType(), True),
    StructField("prediction_interval", StringType(), True),
])
schema_text = "datetime string, series_id string, value float, prediction_interval string"

def forecast_func(df_long: pd.DataFrame) -> pd.DataFrame:
    col_order = ["datetime", "series_id", "value", "prediction_interval"]
    forecast_length = 6
    try:
        model = AutoTS(
            forecast_length=forecast_length,
            frequency="infer",
            model_list="superfast",  # or fast_parallel_no_arima
            transformer_list="fast",
            ensemble=["horizontal-max"],
            validation_method="backwards",
            max_generations=20,
            num_validations=1,  # the sample data here has pretty short history, FYI
            no_negatives=True,
            verbose=0,
        )
        model = model.fit(df_long, date_col="datetime", value_col="value", id_col="series_id")
        prediction = model.predict(forecast_length=forecast_length)
        forecasts_df = prediction.long_form_results(
            id_name="series_id", value_name="value", interval_name="prediction_interval"
        )
        forecasts_df.index.name = "datetime"
        forecasts_df = forecasts_df.reset_index(drop=False)
        forecasts_df["datetime"] = forecasts_df["datetime"].dt.strftime("%Y-%m-%d")
        return forecasts_df[col_order]
    except Exception as e:
        print(repr(e))
        return pd.DataFrame(columns=col_order)

# you could also group by series_id if you only want univariate models/forecasts
# and for the work to be highly distributed
out_df = df.groupby("prod_group").applyInPandas(forecast_func, schema=schema_text)
display(out_df)
```
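To debug the grouped-UDF plumbing without a cluster (or without waiting on an AutoTS search), the same pattern can be exercised with plain pandas: a grouped apply with a naive stand-in forecaster that just repeats each series' last value. The column names match the script above; the naive model, monthly frequency, and fixed `'50%'` interval label are placeholder assumptions for testing the column order and date formatting only.

```python
import pandas as pd

def naive_forecast_func(df_long: pd.DataFrame, forecast_length: int = 6) -> pd.DataFrame:
    """Stand-in for forecast_func: repeat each series' last observed value.

    Returns the same long-form columns the Spark string schema expects, so the
    plumbing (column order, dtypes, date formatting) can be checked locally.
    """
    col_order = ["datetime", "series_id", "value", "prediction_interval"]
    out = []
    for sid, grp in df_long.groupby("series_id"):
        grp = grp.sort_values("datetime")
        last_date = pd.to_datetime(grp["datetime"].iloc[-1])
        # next `forecast_length` month starts after the last observation
        future = pd.date_range(last_date, periods=forecast_length + 1, freq="MS")[1:]
        out.append(pd.DataFrame({
            "datetime": future.strftime("%Y-%m-%d"),
            "series_id": sid,
            "value": float(grp["value"].iloc[-1]),
            "prediction_interval": "50%",  # placeholder label
        }))
    return pd.concat(out, ignore_index=True)[col_order]

# two short monthly series
toy = pd.DataFrame({
    "series_id": ["a"] * 3 + ["b"] * 3,
    "datetime": ["2023-01-01", "2023-02-01", "2023-03-01"] * 2,
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
fc = toy.groupby("series_id", group_keys=False).apply(naive_forecast_func)
print(fc.shape)  # (12, 4): 2 series x 6 steps, 4 columns
```

Once the output matches the declared schema here, swapping the naive function for the real AutoTS `forecast_func` inside `applyInPandas` is a small change.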
-
Does anyone have an example of using AutoTS with Snowpark?
-
Hello,
I would like to use AutoTS with larger datasets. Does anyone know whether AutoTS works with pandas UDFs, and has anyone already written working example code?
I am currently struggling to do it myself, as I get a lot of error messages along the way.
Thank you very much!