How to preprocess the Time-MMD dataset? #2

Open
Ironieser opened this issue Oct 9, 2024 · 3 comments

Comments

@Ironieser

Thank you for your great work.

Currently, I am confused about preprocessing the Time-MMD dataset.

In the data you provide, data/Public_Health/US_FLURATIO_Week.csv, I do not know how to obtain the six columns 'prior_history_avg', 'prior_history_std', 'Final_Search_2', 'Final_Search_4', 'Final_Search_6', and 'Final_Output'.

Running data/DataPre_ClosedSourceLLM/Prepare.ipynb produces Final_Output, but how can we get the other five columns?

Any suggestion will help me a lot, thank you!

@ranlychan

I'm also wondering what prior_history_avg means.

@ranlychan

ranlychan commented Nov 28, 2024

After reading the paper Time-MMD: Multi-Domain Multimodal Dataset for Time Series Analysis and analyzing the original data from https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html, I believe that prior_history_avg in many of the datasets is obtained by a seasonal grouped average. Specifically, in data/Public_Health/US_FLURATIO_Week.csv, the authors take a seasonal period (denoted $p$) of 51 weeks and a group window size (denoted $n$) of 1, so prior_history_avg at time step $t$ is $x_{t-51}$ according to the following formula:

$$ \text{prior history avg}(t) = \frac{1}{n}\sum_{i=1}^{n} x_{t - i \cdot p} $$

Here $x_t$ is the %UNWEIGHTED ILI value in US_FLURATIO_Week.csv at time step $t$.
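As a sanity check on the formula, here is a minimal sketch that evaluates it on a made-up series. The function name `prior_history_avg`, the toy values, and the choice to return `None` when the history is too short are all assumptions for illustration; how the dataset authors treat the first rows is not stated in this thread.

```python
def prior_history_avg(x, t, p, n):
    """Mean of the n values observed p, 2p, ..., n*p steps before t.

    Returns None when any lag would reach before the start of the series
    (an assumption; the authors' handling of this case is not known).
    """
    if t - n * p < 0:
        return None
    return sum(x[t - i * p] for i in range(1, n + 1)) / n


# Toy series, made up for illustration, with period p = 3.
x = [10.0, 20.0, 30.0, 12.0, 22.0, 32.0, 14.0, 24.0, 34.0]

print(prior_history_avg(x, t=6, p=3, n=1))  # x[3] = 12.0
print(prior_history_avg(x, t=6, p=3, n=2))  # (x[3] + x[0]) / 2 = 11.0
print(prior_history_avg(x, t=2, p=3, n=1))  # None: no value one period back
```

With $n=1$ the average collapses to a single lagged value, which is why the flu dataset case reduces to $x_{t-51}$.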

Accordingly, I wrote the following code to compute prior_history_avg:

import pandas as pd

def seasonal_group_average(df, seasonal_period=51, group_window_size=1, target='%UNWEIGHTED ILI'):
    """
    Compute the `prior_history_avg` based on seasonal grouped average.

    Args:
        df (pd.DataFrame): DataFrame containing the time series data.
        seasonal_period (int): Seasonal period, e.g., 51 weeks.
        group_window_size (int): Size of the group window for averaging.
        target (str): Column name of the target time series.

    Returns:
        pd.DataFrame: The same DataFrame (modified in place) with a new
        column `prior_history_avg`; returned unchanged if `target` is missing.
    """
    if target in df.columns:
        values = df[target]
        df['prior_history_avg'] = [
            (
                # Mean of the values 1, 2, ..., group_window_size seasons back.
                # Lags that reach before the first row are clamped to row 0.
                sum(
                    values.iloc[max(0, t - i * seasonal_period)]
                    for i in range(1, group_window_size + 1)
                ) / group_window_size
                # Rows with less than one full season of history get 0.0.
                if t >= seasonal_period else 0.0
            )
            for t in range(len(df))
        ]
    return df

# header=1: column names are read from the second line of the file.
ili_data_df = pd.read_csv('ILINet.csv', header=1)
seasonal_grouped_df = seasonal_group_average(df=ili_data_df)
seasonal_grouped_df.to_csv('test.csv')
seasonal_grouped_df

However, the result produced by my code doesn't match the prior_history_avg in the authors' US_FLURATIO_Week.csv in some places, where the data appears to have been changed or shifted. The reason for these manual adjustments remains unclear to me.
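To see exactly where a computed column departs from the published one, a row-by-row comparison helps. Below is a minimal sketch, assuming the two columns are already aligned on the same rows. `find_mismatches`, the tolerance, and the toy values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd


def find_mismatches(computed, reference, atol=1e-6):
    """Return the rows where two aligned series differ by more than atol."""
    computed = pd.Series(computed, dtype=float).reset_index(drop=True)
    reference = pd.Series(reference, dtype=float).reset_index(drop=True)
    if len(computed) != len(reference):
        raise ValueError("series must have the same length")
    # rtol=0 so that only the absolute tolerance decides; NaN == NaN counts as equal.
    close = np.isclose(
        computed.to_numpy(), reference.to_numpy(),
        rtol=0.0, atol=atol, equal_nan=True,
    )
    report = pd.DataFrame({
        "computed": computed,
        "reference": reference,
        "diff": computed - reference,
    })
    return report[~close]


# Toy values, made up for illustration: only row 2 differs.
mine = [0.0, 1.5, 2.0, 2.5]
theirs = [0.0, 1.5, 2.1, 2.5]
bad = find_mismatches(mine, theirs)
print(bad)
```

Looking at the indices of the returned rows (isolated points versus a contiguous run) may hint at whether individual values were revised or a whole stretch was shifted.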

@ranlychan

ranlychan commented Nov 29, 2024

In US_VMT_Month.csv, $p=12$ and $n=2$, so prior_history_avg at time step $t$ is $\frac{1}{n}(x_{t-12}+x_{t-24})$. For this file, the output of my code is identical to the authors' values.
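The same lagged average can also be written with `Series.shift`, which avoids the explicit loop over rows. This is only a sketch on a made-up monthly series: `seasonal_mean_shift` is a hypothetical helper, and unlike the loop-based code above it leaves NaN (rather than 0.0 or a clamped value) wherever fewer than $n \cdot p$ past steps exist, so its first rows will not necessarily match the published column.

```python
import pandas as pd


def seasonal_mean_shift(series, period, n_groups):
    """Average of series.shift(period), ..., series.shift(n_groups * period).

    Rows without a full history come out as NaN.
    """
    lags = [series.shift(i * period) for i in range(1, n_groups + 1)]
    # skipna=False keeps NaN whenever any of the lags is missing.
    return pd.concat(lags, axis=1).mean(axis=1, skipna=False)


# Toy monthly series 0, 1, ..., 35, made up for illustration.
x = pd.Series(range(36), dtype=float)
avg = seasonal_mean_shift(x, period=12, n_groups=2)

print(avg.iloc[30])                # (x[18] + x[6]) / 2 = 12.0
print(avg.iloc[:24].isna().all())  # True: fewer than 24 months of history
```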