Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] error message is not correct #565

Open
ngupta23 opened this issue Dec 12, 2024 · 10 comments
Open

[bug] error message is not correct #565

ngupta23 opened this issue Dec 12, 2024 · 10 comments
Labels

Comments

@ngupta23
Copy link
Member

ngupta23 commented Dec 12, 2024

What happened + What you expected to happen

When passing a missing column name to id_col or time_col, the error message is either not correct or could be improved.

Versions / Dependencies

Click to expand Dependencies: nixtla 0.6.4

Reproducible example

ID Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# ID column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', id_col="Sth")
ValueError: Series contain missing or duplicate timestamps, or the timestamps do not match the provided frequency.
Please make sure that all series have a single observation from the first to the last timestamp and that the provided frequency matches the timestamps'.
You can refer to https://docs.nixtla.io/docs/tutorials-missing_values for an end to end example.

In this case, the error message is incorrect. There should be some indication that the id_col is incorrect.

Time Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# Time column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', time_col="Sth")
KeyError: 'Sth'

In this case while the error is somewhat correct, it would be a nicer UX to catch this error and relay the error in a better way to the user (indicating that the time_col is missing)

Issue Severity

Low: It annoys or frustrates me.

@ngupta23 ngupta23 added the bug label Dec 12, 2024
@ngupta23
Copy link
Member Author

It is possible that we are not checking if that id_col exists or not. This is possibly due to the fact that the default value of id_col is unique_id and it need not be present in the data as in the air passenger dataset even though the argument value passed is unique_id.

In my opinion, it would be better to make the default value of the id_col argument = None and let the user specify it if it exists in the data. Then we should add a check to make sure that if a user specifies this (not None), then it should exist in the dataset before we proceed further.

@jmoralez
Copy link
Member

jmoralez commented Dec 16, 2024

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

Changing the default would mean breaking the correct workflow, which is having the ID as a column and providing its name.

@elephaint
Copy link
Contributor

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

+1 on voting for deprecating the support for Pandas indices.

@jmoralez
Copy link
Member

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

@ngupta23
Copy link
Member Author

ngupta23 commented Dec 16, 2024

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

Can we do this - Change the default for id_col to None and then

  1. If the data does not have a ID column and the user leaves the id_col=None, then we treat this as a single time series.
  2. If the data has an ID column and the user specifies this column name in the id_col argument, then we check for it and flag an error if it does not exist.
  3. If the user has an ID column but does not specify it in id_col, it is treated as an exogenous column and will flag error if it is categorical.

Do you see any issues with it?

@jmoralez
Copy link
Member

How do we know if we're in number 3?

@jmoralez
Copy link
Member

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

@ngupta23
Copy link
Member Author

3. If the user has an ID column but does not specify it in id_col

id_col would be None in this case (not specified by the user). i.e. the user has to explicitly specify which ID col they want to use (if any).

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

Could you clarify which standard this is and what other libraries are we referring to here?

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

This would be an acceptable compromise if changing the default to None is not acceptable.

@jmoralez
Copy link
Member

https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html

The input to StatsForecast is always a data frame in long format with three columns: unique_id, ds and y

https://nixtlaverse.nixtla.io/mlforecast/docs/getting-started/quick_start_local.html#data-format

The data is expected to be a pandas dataframe in long format, that is, each row represents an observation of a single serie at a given time, with at least three columns:
id_col: column that identifies each serie.
target_col: column that has the series values at each timestamp.
time_col: column that contains the time the series value was observed. These are usually timestamps, but can also be consecutive integers.

https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/quickstart.html#2-loading-airpassengers-data

The core.NeuralForecast class contains shared, fit, predict and other methods that take as inputs pandas DataFrames with columns ['unique_id', 'ds', 'y']

@ngupta23
Copy link
Member Author

Thanks @jmoralez - In that case, your suggestion of not changing the default but accepting the value of None would be good to implement. We should also add checks in that case if the column specified by the user exists or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants