tft gap detection on each time series independently #770

int-chaos · 2022-10-19T04:34:30Z

Perform gap detection on each time series independently for panel time series datasets.

sonichi · 2022-10-19T16:39:35Z

flaml/automl.py

+            unique_ids = dataframe[group_ids].value_counts().reset_index()[group_ids]
+            for _, row in unique_ids.iterrows():
+                df = dataframe.copy()
+                for id in group_ids:
+                    ts = df.loc[df[id] == row[id]]
+            ts_series = pd.to_datetime(ts[ts.columns[0]])
+            inferred_freq = pd.infer_freq(ts_series)
+            if inferred_freq is None:
+                logger.warning(
+                    "Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. "
+                )


There are multiple problems:

Does prophet handle panel data? Does TFT handle missing timestamps? If TFT can handle missing timestamps, this check is not necessary. If prophet can't handle panel data, this message shouldn't suggest adding prophet.

The for loop is not efficient. There should be a functional way using groupby().

Do you intend to infer the frequency for each series? The current code only does it for the last one.
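A groupby-based version of the per-series check could look roughly like the sketch below. It is only an illustration of the idea raised in the review, not the code that ended up in the PR; the helper name `infer_freq_per_series` and the column names in the usage are hypothetical.

```python
import pandas as pd


def infer_freq_per_series(dataframe, time_col, group_ids):
    """Map each series key to its inferred frequency (None if it has gaps)."""

    def _infer(ts):
        stamps = pd.DatetimeIndex(pd.to_datetime(ts)).sort_values()
        # pd.infer_freq raises on fewer than 3 timestamps, so treat
        # very short series as "frequency unknown".
        if len(stamps) < 3:
            return None
        return pd.infer_freq(stamps)

    # One frequency per group, instead of only the last group's frequency.
    return {
        key: _infer(group[time_col])
        for key, group in dataframe.groupby(group_ids)
    }


# Usage: series "a" is complete, series "b" is missing 2022-01-03.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"]
            + ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"]
        ),
        "series_id": ["a"] * 5 + ["b"] * 4,
    }
)
freqs = infer_freq_per_series(df, "date", "series_id")
series_with_gaps = [key for key, freq in freqs.items() if freq is None]
```

Any key left in `series_with_gaps` is a series for which a warning could be logged.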

Yes, the logger message is incorrect. There must be no missing timestamps in order to create the "time_idx" column; see the add_time_idx_col(X) function in data.py.

I also looked into how TFT handles missing data a little more. By default, allow_missing_timesteps for the TimeSeriesDataset object is False, so it currently does not handle missing timesteps. It would be reasonable to turn it on. Another thing to consider: by default TimeSeriesDataset uses a forward-fill strategy (filling from previous data) to handle missing data, but it also allows a constant_fill_strategy, where users pass a "dictionary of column names with constants to fill in missing values if there are gaps in the sequence".

Perhaps with this, we should allow for missing timesteps and find a different solution to creating a "time_idx" column. @EgorKraevTransferwise @markharley what do you think? We had a conversation about this before in our first meeting and Egor suggested to just assume no missing time steps for simplicity of code.

I did try to use groupby at first but could not find a way. Will try again.

Yes, the indentation was wrong: lines 1037 to 1042 should be indented.

To clarify, TimeSeriesDataset only handles missing timestamps; it does not handle NA values.
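The difference between a missing timestamp (the row is absent) and an NA value (the row exists but holds NaN), and the two fill strategies mentioned above, can be illustrated in plain pandas. This is a sketch of the concepts only, not how the TFT dataset class implements them; `fill_missing_timesteps` and the column names are hypothetical.

```python
import pandas as pd


def fill_missing_timesteps(ts, freq, constant_fill=None):
    """Insert rows for absent timestamps and fill only those new rows.

    Columns named in constant_fill get the given constant; every other
    column is forward-filled. NaN values in rows that already existed
    are left untouched.
    """
    constant_fill = constant_fill or {}
    full_index = pd.date_range(ts.index.min(), ts.index.max(), freq=freq)
    out = ts.reindex(full_index)
    inserted = ~full_index.isin(ts.index)  # True for rows added by reindex
    for col in out.columns:
        if col in constant_fill:
            out.loc[inserted, col] = constant_fill[col]
        else:
            out.loc[inserted, col] = out[col].ffill()[inserted]
    return out


# 2022-01-03 is a missing timestamp; the NaN on 2022-01-02 is an NA value.
ts = pd.DataFrame(
    {"sales": [1.0, float("nan"), 4.0], "promo": [1.0, 1.0, 0.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04"]),
)
filled = fill_missing_timesteps(ts, "D", constant_fill={"promo": 0.0})
```

In this example the inserted row gets `promo` set to the constant, while the existing NaN in `sales` stays NaN, which is why NA values would still need separate handling.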

@EgorKraevTransferwise and @markharley Our current solution to this issue is to either require no missing timestamps or else have the user provide a freq argument.

EgorKraevTransferwise · 2022-10-21T06:09:29Z

Here is how we handled it in our TFT wrapper (in our internal time series library that was a precursor to our PR to FLAML): pivot by timestamp and dimension, add a time index and fill any gaps, then melt back into the original shape. We do use the frequency argument in there, but Pandas can infer that.

def add_time_idx_new(data: BasicDataset, time_col: str) -> pd.DataFrame:

    pivoted = data.data.pivot(
        index=time_col,
        columns=data.metadata["dimensions"],
        values=data.metadata["metrics"],
    ).fillna(0.0)

    dt_index = pd.date_range(
        start=pivoted.index.min(),
        end=pivoted.index.max(),
        freq=data.metadata["frequency"],
    )
    indexes = pd.DataFrame(dt_index).reset_index()
    indexes.columns = ["time_idx", time_col]

    # this join is just to make sure that we get all the row timestamps in
    out = pd.merge(
        indexes, pivoted, left_on=time_col, right_index=True, how="left"
    ).fillna(0.0)

    # now flatten back

    melted = pd.melt(out, id_vars=[time_col, "time_idx"])

    for i, d in enumerate(["metric"] + data.metadata["dimensions"]):
        melted[d] = melted["variable"].apply(lambda x: x[i])

    # and finally, move metrics to separate columns
    re_pivoted = melted.pivot(
        index=["time_idx", time_col] + data.metadata["dimensions"],
        columns="metric",
        values="value",
    ).reset_index()

    return re_pivoted
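For comparison, if gaps may be left unfilled (for example, if missing timesteps were allowed as discussed earlier in the thread), a time index can be assigned without pivoting by looking up each timestamp's position on a regular grid. This is a different, lighter technique than the wrapper above, sketched under the assumption that the frequency is known; `add_time_idx` and the column names are hypothetical.

```python
import pandas as pd


def add_time_idx(df, time_col, freq):
    """Add an integer time_idx column that preserves gaps in the data."""
    stamps = pd.DatetimeIndex(pd.to_datetime(df[time_col]))
    # Regular grid covering the whole panel at the given frequency.
    grid = pd.date_range(stamps.min(), stamps.max(), freq=freq)
    # Position of every timestamp on the grid; -1 means "not on the grid".
    positions = grid.get_indexer(stamps)
    if (positions < 0).any():
        raise ValueError("some timestamps do not fall on the frequency grid")
    out = df.copy()
    out["time_idx"] = positions
    return out


# Usage: series "b" has no row for 2022-01-03, so its time_idx skips 2.
panel = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"]
            + ["2022-01-01", "2022-01-02", "2022-01-04"]
        ),
        "series_id": ["a"] * 4 + ["b"] * 3,
    }
)
indexed = add_time_idx(panel, "date", "D")
```

The index is shared across series, so equal timestamps always map to the same `time_idx` value.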

int-chaos · 2022-10-25T20:18:47Z

Understood. The current implementation uses pd.infer_freq() as well under the assumption that the user's data has no missing timestamps. However, the concern is that if there are missing timestamps, pd.infer_freq would return None and it won't be possible to use pd.date_range. We discussed this issue before but want to revisit it now since TFT handles missing timestamps and we want to leverage that ability. (See #771 for more information)

So, in the case that there are missing timestamps, should we just tell the user to pass in a freq argument? This way we can still use pd.date_range to get the dates we need for the time series and create the time_idx column.

EgorKraevTransferwise · 2022-11-08T14:41:35Z

I still think we should be guessing the frequency from the user input, even with missing data. This could work as follows:

1. Get the ordered set of all distinct timestamps, so we'll only get a gap if there's a gap in all the time series
2. Calculate the diffs between adjacent timestamps
3. Get the median diff (most of them will be equal anyway so this should be quite robust)
4. Generate a new sequence of timestamps by starting at the first original timestamp and recursively adding the median diff to it, like 20 times
5. Apply pd.infer_freq() to that sequence

All in all, like 5 lines of code and should be quite robust

Or if you just want to use date_range, you can just take the sequence from 4. directly instead
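The steps above could be sketched as follows. This is one reading of the proposal, not code from the PR; the function name `infer_freq_with_gaps` is hypothetical. Note that the median diff is a fixed-width timedelta, so this sketch suits fixed-width frequencies such as days; calendar-based frequencies such as months need extra handling.

```python
import pandas as pd


def infer_freq_with_gaps(timestamps, periods=20):
    """Guess the frequency of possibly gappy timestamps via the median diff."""
    # Step 1: ordered set of all distinct timestamps across every series.
    stamps = pd.Series(pd.to_datetime(timestamps)).drop_duplicates().sort_values()
    # Steps 2-3: diffs between adjacent timestamps, then their median.
    median_diff = stamps.diff().dropna().median()
    # Step 4: regular sequence starting at the first original timestamp.
    grid = pd.date_range(start=stamps.iloc[0], periods=periods, freq=median_diff)
    # Step 5: the regenerated sequence has no gaps, so infer_freq can work.
    return median_diff, pd.infer_freq(grid)


# Usage: two daily series that are both missing the same day.
a = pd.date_range("2022-01-01", periods=10, freq="D").delete(3)
b = pd.date_range("2022-01-01", periods=10, freq="D").delete(3)
median_diff, freq = infer_freq_with_gaps(list(a) + list(b))
```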

int-chaos · 2022-11-28T04:21:01Z

@EgorKraevTransferwise Do you mean something like this?

import random
import pandas as pd

# Monthly timestamps with one randomly chosen timestamp removed to simulate a gap
ts = list(pd.date_range("2017-01-01", periods=36, freq="M"))
ts.pop(random.randint(0, len(ts) - 1))
# Median difference between adjacent timestamps
med_diff = pd.Series(ts).diff().median()
# Rebuild the timestamps using the median difference as a fixed-width frequency
print(pd.date_range(start=ts[0], end=ts[-1], freq=med_diff))

For some time frequencies, like year and month, the diff is a number of days (i.e. med_diff for months is 31D and med_diff for years is 365D). For these cases, would we just use conditions, like if med_diff is 31D then freq=M? This was a concern previously as well, which is why we did not go with this method.

EgorKraevTransferwise · 2022-11-28T10:20:24Z

Yes, it's a pretty small lookup table for the common ones, and those will cover 95% of all use cases; for the less common ones, we should leave the user the option to manually override the frequency.
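Such a lookup table with a user override might look like the sketch below. The table contents, the day ranges, and the name `guess_freq` are assumptions for illustration. The aliases follow the "M" spelling used in this thread, with "Y" assumed for years; newer pandas releases spell month-end and year-end as "ME" and "YE".

```python
import pandas as pd

# Hypothetical lookup table: (min_days, max_days, frequency alias).
CALENDAR_FREQS = [
    (28, 31, "M"),    # monthly data: median diff of 28 to 31 days
    (365, 366, "Y"),  # yearly data: median diff of 365 or 366 days
]


def guess_freq(median_diff, user_freq=None):
    """Turn a median diff into a frequency, unless the user supplied one."""
    # A user-provided frequency always wins over the guess.
    if user_freq is not None:
        return user_freq
    days = median_diff / pd.Timedelta(days=1)
    for min_days, max_days, alias in CALENDAR_FREQS:
        if min_days <= days <= max_days:
            return alias
    # Not a calendar frequency: the fixed-width diff can be used directly,
    # since pd.date_range accepts a Timedelta as freq.
    return median_diff


monthly = guess_freq(pd.Timedelta(days=31))
yearly = guess_freq(pd.Timedelta(days=365))
daily = guess_freq(pd.Timedelta(days=1))
overridden = guess_freq(pd.Timedelta(days=31), user_freq="W")
```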

int-chaos · 2022-11-29T21:29:54Z

got it!

int-chaos added 3 commits October 19, 2022 00:31

update automl.py - tft gap detection on each time series independently

de9c30e

update automl.py

5fc6255

update automl.py - fix bugs

99b7c1b

sonichi linked an issue Oct 19, 2022 that may be closed by this pull request

Time series gap detection for TFT tasks #754

Open

sonichi reviewed Oct 19, 2022

sonichi requested review from EgorKraevTransferwise and markharley October 19, 2022 16:53

int-chaos added 2 commits October 19, 2022 14:36

update automl.py - use groupby

31b326f

update automl.py - fix issue

5fdd8fa

qingyun-wu assigned sonichi Oct 31, 2022

int-chaos mentioned this pull request Jan 9, 2023

allow missing data for "ts_forecast_panel" task #878

Open
