-
Notifications
You must be signed in to change notification settings - Fork 516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tft gap detection on each time series independently #770
base: main
Are you sure you want to change the base?
Conversation
flaml/automl.py
Outdated
unique_ids = dataframe[group_ids].value_counts().reset_index()[group_ids] | ||
for _, row in unique_ids.iterrows(): | ||
df = dataframe.copy() | ||
for id in group_ids: | ||
ts = df.loc[df[id] == row[id]] | ||
ts_series = pd.to_datetime(ts[ts.columns[0]]) | ||
inferred_freq = pd.infer_freq(ts_series) | ||
if inferred_freq is None: | ||
logger.warning( | ||
"Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. " | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple problems:
- Does prophet handle panel data? Does TFT handle missing timestamps? If TFT can handle missing timestamps, this check is not necessary. If prophet can't handle panel data, this message shouldn't suggest adding prophet.
- For loop is not efficient. There should be a functional way using
groupby()
. - Do you intend to infer the frequency for each series? The current code only does it for the last one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yea, the logger message is incorrect. We need there to be no missing timestamps in order to create the
"time_idx"
column. Seeadd_time_idx_col(X)
function in data.py.
Also looked into TFT handling missing data a little bit more. By default, allow_missing_timesteps
for the TimeSeriesDataset object is False
, so currently it does not handle missing timesteps. It would be reasonable if we turn it on. Also, another thing to consider now: by default TimeSeriesDataset uses forward fill strategy (fill based on previous data) to handle missing data, but also allow constant_fill_strategy
where users input a "dictionary of column names with constants to fill in missing values if there are gaps in the sequence".
Perhaps with this, we should allow for missing timesteps and find a different solution to creating a "time_idx"
column. @EgorKraevTransferwise @markharley what do you think? We had a conversation about this before in our first meeting and Egor suggested to just assume no missing time steps for simplicity of code.
- I did try to use groupby at first but could not find a way. Will try again.
- Yea, indentation was wrong... 1037 to 1042 should be indented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, TimeSeriesDataset only handle timestamps that are missing, but do not handle NA
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorKraevTransferwise and @markharley Our current solution to this issue will be to either require no missing timestamps or else user have to provide a freq
argument.
Here is how we handled it in our TFT wrapper (in our internal time series library that was a precursor to our PR to FLAML), pivot by timestamp and dimension, add time index and fill any gaps, melt back into the original shape. We do use the frequency argument in there, but Pandas can infer that.
|
Understood. The current implementation uses So, in the case that there are missing timestamps, should we just tell the user to pass in a |
I still think we should be guessing the frequency from the user input, even with missing data. This could work as follows:
All in all, like 5 lines of code and should be quite robust Or if you just want to use date_range, you can just take the sequence from 4. directly instead |
@EgorKraevTransferwise Do you mean something like this? import pandas as pd
import numpy as np
import random
ts = list(pd.date_range("2017-01-01", periods=36, freq="M"))
ts.pop(random.randint(0, len(ts)-1))
diff = [ts[idx+1] - ts[idx] for idx in range(len(ts) - 1)]
med_diff = np.median(diff)
print(pd.date_range(start=ts[0], end=ts[-1], freq=med_diff)) For some time frequencies like year and month, the diff is the number of days (i.e. |
Yes, it's a pretty small lookup table for the common ones, and those will cover 95% of all usecases; we should leave the option for the user to manually override the frequency instead, for the less common ones. |
got it! |
perform gap detection on each time series independently for panel time series datasets
#754