
Average of different prediction horizons as a metric? #66

Open
santoshatchi opened this issue Feb 22, 2023 · 3 comments

Comments

@santoshatchi

Hello Authors,

Could you please clarify the use of the average over different prediction horizons as a benchmarking metric? Why was it chosen, and how is its validity justified?

I am working on a similar project and trying to report values at different horizons. My model is not reaching values close to those reported by SOTA (top-5) models like yours. Could you please share the intuition behind reporting the average rather than individual horizons?

Thanks
Santosh

@jakegrigsby
Member

IIRC that's a convention inherited from Informer and the follow-up works that have come out since this repo's initial release and before its more recent versions. The accuracy at individual timesteps into the future can be arbitrary and hard to interpret. One-step predictions are too easy, but distant predictions can be very difficult given a fixed-length context window that may be too short. In highly periodic domains some distant horizons can also be easy (such as 24 hours ahead in a dataset with clear daily periodicity, like weather forecasting). So reporting a metric for every horizon takes a lot of explaining, requires large tables, and can be misleading. Averaging gives a better sense of the model's performance over the entire duration we care about.

At a few points during this project I hacked together logging of accuracy at each individual timestep as a sanity check. In my experience you can expect error to increase roughly linearly as you predict further into the future.
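
For anyone who wants to reproduce that sanity check, here is a minimal sketch (not the repo's actual logging code) of computing the error at each forecast step alongside the single averaged number, assuming predictions and targets are arrays shaped `(windows, horizon, variables)`:

```python
import numpy as np

def per_step_and_average_mse(preds: np.ndarray, targets: np.ndarray):
    """Return the MSE at each forecast step and the single averaged metric."""
    sq_err = (preds - targets) ** 2          # (windows, horizon, variables)
    per_step = sq_err.mean(axis=(0, 2))      # one value per horizon step
    averaged = sq_err.mean()                 # the number usually reported in tables
    return per_step, averaged

# Random stand-in data purely to show the shapes; real forecasts go here.
rng = np.random.default_rng(0)
preds, targets = rng.normal(size=(128, 24, 7)), rng.normal(size=(128, 24, 7))
per_step, avg = per_step_and_average_mse(preds, targets)
print(per_step.round(3))  # with a real model this tends to grow roughly linearly
print(round(avg, 3))      # equals per_step.mean() because every step is weighted equally
```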

As far as replicating the results on these datasets in your own project, double-check that you aren't counting missing datapoints in the metrics. This can make a huge difference and is something a lot of the literature (and early versions of this codebase) get wrong.
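
A hedged sketch of what "not counting missing datapoints" can look like; it assumes missing target values are encoded as NaN (or, as in some traffic datasets, as zeros), so adjust `null_val` to whatever convention your data actually uses:

```python
import numpy as np

def masked_mae(preds: np.ndarray, targets: np.ndarray, null_val=np.nan) -> float:
    """MAE computed only over positions where the target was actually observed."""
    if np.isnan(null_val):
        mask = ~np.isnan(targets)
    else:
        mask = targets != null_val
    return float(np.abs(preds[mask] - targets[mask]).mean())

# Naive version for contrast: missing values silently drag the score around.
def unmasked_mae(preds: np.ndarray, targets: np.ndarray) -> float:
    return float(np.abs(preds - targets).mean())
```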

@steve3nto

I agree with Jake: averaging over the whole prediction horizon makes sense so that single numbers can be compared as a metric.
It is a pity though that different benchmarks use different metrics.
For example, check here for PEMS-Bay:
https://paperswithcode.com/sota/traffic-prediction-on-pems-bay

They report RMSE (I guess this is averaged over the whole horizon)
and MAE @ 12 steps (a single prediction 12 steps into the future).

It would be good to have more standardized metrics.
In the paper there is no RMSE for PEMS-Bay; there are MAE, MSE, and MAPE, but unfortunately PapersWithCode does not report those.
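
For what it's worth, the two leaderboard conventions differ only in how the reduction is done. A rough sketch, where the array shapes `(windows, horizon, sensors)` and function names are assumptions for illustration, not this repo's API:

```python
import numpy as np

def rmse_over_horizon(preds: np.ndarray, targets: np.ndarray) -> float:
    """RMSE pooled over every forecast step (presumably the averaged leaderboard RMSE)."""
    return float(np.sqrt(np.mean((preds - targets) ** 2)))

def mae_at_step(preds: np.ndarray, targets: np.ndarray, step: int = 12) -> float:
    """MAE at a single forecast step, e.g. 'MAE @ 12 steps' (step is 1-indexed)."""
    return float(np.mean(np.abs(preds[:, step - 1] - targets[:, step - 1])))
```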

This is not a question, just a comment, sorry for the spam! 😁

@jakegrigsby
Member

Yeah, the traffic datasets/literature are the main example where reporting multiple horizons is the default. The longest horizon there is 12 timesteps, so this can be feasible. Once you get longer than that, it stops making sense to report arbitrary intervals in tables, in my opinion. It would be interesting if the convention for reporting forecasting results were a plot of error over forecast duration for each dataset. That wasn't necessary at the time (2021), but I think it is probably what I would do if I were to redo this project today...
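
A sketch of that kind of plot, using clearly labeled illustrative numbers in place of real per-step errors (in practice these would come from evaluating the model at each forecast step, as in the earlier snippet):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative per-step errors only, standing in for a real test-set evaluation.
steps = np.arange(1, 13)
mse_per_step = 0.30 + 0.04 * steps  # stand-in for the roughly linear growth mentioned above

plt.plot(steps, mse_per_step, marker="o")
plt.xlabel("Forecast step")
plt.ylabel("Test MSE")
plt.title("Error vs. forecast duration (illustrative data)")
plt.tight_layout()
plt.show()
```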
