Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tft gap detection on each time series independently #770

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

int-chaos
Copy link
Collaborator

perform gap detection on each time series independently for panel time series datasets

#754

@sonichi sonichi linked an issue Oct 19, 2022 that may be closed by this pull request
flaml/automl.py Outdated
Comment on lines 1032 to 1042
unique_ids = dataframe[group_ids].value_counts().reset_index()[group_ids]
for _, row in unique_ids.iterrows():
df = dataframe.copy()
for id in group_ids:
ts = df.loc[df[id] == row[id]]
ts_series = pd.to_datetime(ts[ts.columns[0]])
inferred_freq = pd.infer_freq(ts_series)
if inferred_freq is None:
logger.warning(
"Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. "
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple problems:

  1. Does prophet handle panel data? Does TFT handle missing timestamps? If TFT can handle missing timestamps, this check is not necessary. If prophet can't handle panel data, this message shouldn't suggest adding prophet.
  2. For loop is not efficient. There should be a functional way using groupby().
  3. Do you intend to infer the frequency for each series? The current code only does it for the last one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yea, the logger message is incorrect. We need there to be no missing timestamps in order to create the "time_idx" column. See add_time_idx_col(X) function in data.py.

Also looked into TFT handling missing data a little bit more. By default, allow_missing_timesteps for the TimeSeriesDataset object is False, so currently it does not handle missing timesteps. It would be reasonable if we turn it on. Also, another thing to consider now: by default TimeSeriesDataset uses forward fill strategy (fill based on previous data) to handle missing data, but also allow constant_fill_strategy where users input a "dictionary of column names with constants to fill in missing values if there are gaps in the sequence".

Perhaps with this, we should allow for missing timesteps and find a different solution to creating a "time_idx" column. @EgorKraevTransferwise @markharley what do you think? We had a conversation about this before in our first meeting and Egor suggested to just assume no missing time steps for simplicity of code.

  1. I did try to use groupby at first but could not find a way. Will try again.
  2. Yea, indentation was wrong... 1037 to 1042 should be indented.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, TimeSeriesDataset only handle timestamps that are missing, but do not handle NA values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EgorKraevTransferwise and @markharley Our current solution to this issue will be to either require no missing timestamps or else user have to provide a freq argument.

@EgorKraevTransferwise
Copy link
Collaborator

EgorKraevTransferwise commented Oct 21, 2022

Here is how we handled it in our TFT wrapper (in our internal time series library that was a precursor to our PR to FLAML), pivot by timestamp and dimension, add time index and fill any gaps, melt back into the original shape. We do use the frequency argument in there, but Pandas can infer that.

def add_time_idx_new(data: BasicDataset, time_col: str) -> pd.DataFrame:

    pivoted = data.data.pivot(
        index=time_col,
        columns=data.metadata["dimensions"],
        values=data.metadata["metrics"],
    ).fillna(0.0)

    dt_index = pd.date_range(
        start=pivoted.index.min(),
        end=pivoted.index.max(),
        freq=data.metadata["frequency"],
    )
    indexes = pd.DataFrame(dt_index).reset_index()
    indexes.columns = ["time_idx", time_col]

    # this join is just to make sure that we get all the row timestamps in
    out = pd.merge(
        indexes, pivoted, left_on=time_col, right_index=True, how="left"
    ).fillna(0.0)

    # now flatten back

    melted = pd.melt(out, id_vars=[time_col, "time_idx"])

    for i, d in enumerate(["metric"] + data.metadata["dimensions"]):
        melted[d] = melted["variable"].apply(lambda x: x[i])

    # and finally, move metrics to separate columns
    re_pivoted = melted.pivot(
        index=["time_idx", time_col] + data.metadata["dimensions"],
        columns="metric",
        values="value",
    ).reset_index()

    return re_pivoted

@int-chaos
Copy link
Collaborator Author

int-chaos commented Oct 25, 2022

Understood. The current implementation uses pd.infer_freq() as well under the assumption that the user's data has no missing timestamps. However, the concern is that if there are missing timestamps, pd.infer_freq would return None and it won't be possible to use pd.date_range. We discussed this issue before but want to revisit it now since TFT handles missing timestamps and we want to leverage that ability. (See #771 for more information)

So, in the case that there are missing timestamps, should we just tell the user to pass in a freq argument. This way we can still use pd.date_range to get the dates we need for the time series and create the time_idx column.

@EgorKraevTransferwise
Copy link
Collaborator

EgorKraevTransferwise commented Nov 8, 2022

I still think we should be guessing the frequency from the user input, even with missing data. This could work as follows:

  1. get the ordered set of all distinct timestamps, so we'll only get a gap if there's a gap in all the time series
  2. Calculate the diffs between adjacent timestamps
  3. Get the median diff (most of them will be equal anyway so this should be quite robust)
  4. Generate a new sequence of timestamps by starting at the first original timestamp and recursively adding the median diff to it, like 20 times
  5. Apply pd.infer_freq() to that sequence

All in all, like 5 lines of code and should be quite robust

Or if you just want to use date_range, you can just take the sequence from 4. directly instead

@int-chaos
Copy link
Collaborator Author

int-chaos commented Nov 28, 2022

I still think we should be guessing the frequency from the user input, even with missing data. This could work as follows:

  1. get the ordered set of all distinct timestamps, so we'll only get a gap if there's a gap in all the time series
  2. Calculate the diffs between adjacent timestamps
  3. Get the median diff (most of them will be equal anyway so this should be quite robust)
  4. Generate a new sequence of timestamps by starting at the first original timestamp and recursively adding the median diff to it, like 20 times
  5. Apply pd.infer_freq() to that sequence

All in all, like 5 lines of code and should be quite robust

Or if you just want to use date_range, you can just take the sequence from 4. directly instead

@EgorKraevTransferwise Do you mean something like this?

import pandas as pd
import numpy as np
import random
ts = list(pd.date_range("2017-01-01", periods=36, freq="M"))
ts.pop(random.randint(0, len(ts)-1))
diff = [ts[idx+1] - ts[idx] for idx in range(len(ts) - 1)]
med_diff = np.median(diff)
print(pd.date_range(start=ts[0], end=ts[-1], freq=med_diff))

For some time frequencies like year and month, the diff is the number of days (i.e. med_diff for months is 31D and med_diff for years is 365D). For these case, would we just use conditions, like if med_diff is 31D then freq=M? This was a previous concern as well, which is why we did not go with this method

@EgorKraevTransferwise
Copy link
Collaborator

Yes, it's a pretty small lookup table for the common ones, and those will cover 95% of all usecases; we should leave the option for the user to manually override the frequency instead, for the less common ones.

@int-chaos
Copy link
Collaborator Author

Yes, it's a pretty small lookup table for the common ones, and those will cover 95% of all usecases; we should leave the option for the user to manually override the frequency instead, for the less common ones.

got it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Time series gap detection for TFT tasks
3 participants