[bug] error message is not correct #565

ngupta23 · 2024-12-12T22:11:27Z

What happened + What you expected to happen

When passing a missing column name to id_col or time_col, the error message is either not correct or could be improved.

Versions / Dependencies

Click to expand

Dependencies: nixtla 0.6.4

Reproducible example

ID Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# ID column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', id_col="Sth")

ValueError: Series contain missing or duplicate timestamps, or the timestamps do not match the provided frequency.
Please make sure that all series have a single observation from the first to the last timestamp and that the provided frequency matches the timestamps'.
You can refer to https://docs.nixtla.io/docs/tutorials-missing_values for an end to end example.

In this case, the error message is incorrect. There should be some indication that the id_col is incorrect.

Time Column is missing

import pandas as pd
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key="")
nixtla_client.validate_api_key()

# Read the data
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
)

# Time column is incorrect
forecast_df = nixtla_client.forecast(df=df, h=24, freq='h', time_col="Sth")

KeyError: 'Sth'

In this case while the error is somewhat correct, it would be a nicer UX to catch this error and relay the error in a better way to the user (indicating that the time_col is missing)

Issue Severity

Low: It annoys or frustrates me.

The text was updated successfully, but these errors were encountered:

ngupta23 · 2024-12-16T18:22:35Z

It is possible that we are not checking if that id_col exists or not. This is possibly due to the fact that the default value of id_col is unique_id and it need not be present in the data as in the air passenger dataset even though the argument value passed is unique_id.

In my opinion, it would be better to make the default value of the id_col argument = None and let the user specify it if it exists in the data. Then we should add a check to make sure that if a user specifies this (not None), then it should exist in the dataset before we proceed further.

jmoralez · 2024-12-16T18:29:05Z

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

Changing the default would mean breaking the correct workflow, which is having the ID as a column and providing its name.

elephaint · 2024-12-16T19:26:34Z

This is because we support having no ID and ID as the index, so we have no idea where the ID is and we try to guess, which fills me with joy.

+1 on voting for deprecating the support for Pandas indices.

jmoralez · 2024-12-16T19:35:28Z

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

ngupta23 · 2024-12-16T19:48:01Z

Never mind, what can be in the index is the timestamps. So the only problem is that the data can have no ID and in that case we assign a constant id, so if the data does have an ID but the user doesn't set that as id_col there's no way for us to know.

Can we do this - Change the default for id_col to None and then

If the data does not have a ID column and the user leaves the id_col=None, then we treat this as a single time series.
If the data has an ID column and the user specifies this column name in the id_col argument, then we check for it and flag an error if it does not exist.
If the user has an ID column but does not specify it in id_col, it is treated as an exogenous column and will flag error if it is categorical.

Do you see any issues with it?

jmoralez · 2024-12-16T20:05:04Z

How do we know if we're in number 3?

jmoralez · 2024-12-16T20:21:23Z

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

ngupta23 · 2024-12-17T17:15:03Z

3. If the user has an ID column but does not specify it in id_col

id_col would be None in this case (not specified by the user). i.e. the user has to explicitly specify which ID col they want to use (if any).

Changing the default to None would be a breaking change to everyone who follows the standard we defined in the other libraries (which this library decided not to follow).

Could you clarify which standard this is and what other libraries are we referring to here?

What I would support would be allowing the id_col to be None and require users to set it to None if they don't have it and they're providing a single serie, which would also be a breaking change.

This would be an acceptable compromise if changing the default to None is not acceptable.

jmoralez · 2024-12-17T17:29:19Z

https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html

The input to StatsForecast is always a data frame in long format with three columns: unique_id, ds and y

https://nixtlaverse.nixtla.io/mlforecast/docs/getting-started/quick_start_local.html#data-format

The data is expected to be a pandas dataframe in long format, that is, each row represents an observation of a single serie at a given time, with at least three columns:
id_col: column that identifies each serie.
target_col: column that has the series values at each timestamp.
time_col: column that contains the time the series value was observed. These are usually timestamps, but can also be consecutive integers.

https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/quickstart.html#2-loading-airpassengers-data

The core.NeuralForecast class contains shared, fit, predict and other methods that take as inputs pandas DataFrames with columns ['unique_id', 'ds', 'y']

ngupta23 · 2024-12-17T18:15:50Z

Thanks @jmoralez - In that case, your suggestion of not changing the default but accepting the value of None would be good to implement. We should also add checks in that case if the column specified by the user exists or not.

ngupta23 added the bug label Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] error message is not correct #565

[bug] error message is not correct #565

ngupta23 commented Dec 12, 2024 •

edited

Loading

ngupta23 commented Dec 16, 2024

jmoralez commented Dec 16, 2024 •

edited

Loading

elephaint commented Dec 16, 2024

jmoralez commented Dec 16, 2024

ngupta23 commented Dec 16, 2024 •

edited

Loading

jmoralez commented Dec 16, 2024

jmoralez commented Dec 16, 2024

ngupta23 commented Dec 17, 2024

jmoralez commented Dec 17, 2024

ngupta23 commented Dec 17, 2024

[bug] error message is not correct #565

[bug] error message is not correct #565

Comments

ngupta23 commented Dec 12, 2024 • edited Loading

What happened + What you expected to happen

Versions / Dependencies

Reproducible example

ID Column is missing

Time Column is missing

Issue Severity

ngupta23 commented Dec 16, 2024

jmoralez commented Dec 16, 2024 • edited Loading

elephaint commented Dec 16, 2024

jmoralez commented Dec 16, 2024

ngupta23 commented Dec 16, 2024 • edited Loading

jmoralez commented Dec 16, 2024

jmoralez commented Dec 16, 2024

ngupta23 commented Dec 17, 2024

jmoralez commented Dec 17, 2024

ngupta23 commented Dec 17, 2024

ngupta23 commented Dec 12, 2024 •

edited

Loading

jmoralez commented Dec 16, 2024 •

edited

Loading

ngupta23 commented Dec 16, 2024 •

edited

Loading