add option to remove windows with poor data quality #1059
@jasminerienecker Thanks for the PR, I think I understand what you're trying to do. Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?
@elephaint Going through the code, it seems the base_windows class assumes all the timesteps are available. For example, if your data is at one-minute resolution but there is a gap of 10 minutes, the windows are created as if no timesteps were missing. I think this means you'd have to keep the rows with missing values in the dataset, but if there are longer chunks of missing data (as in, most of the values in a window are NaN), this could interfere with the model training. This solution was a way of keeping the temporal information while not training the model on windows where the majority of the data is not available. Please let me know if there's something I've missed though!
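To make the gap problem concrete, here is a minimal sketch in plain Python (not neuralforecast code). The helper name `fill_time_grid` and the tuple layout are hypothetical and chosen only for illustration; the sketch assumes that `available_mask` uses 1 for an observed point and 0 for a missing one. It puts a one-minute series with a 10-minute gap back onto a complete time grid, so that every row stands for exactly one timestep.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (timestamp, value or None, available_mask) -- hypothetical row layout
Row = Tuple[datetime, Optional[float], int]


def fill_time_grid(
    observations: List[Tuple[datetime, float]], step: timedelta
) -> List[Row]:
    """Place observations on a regular grid, flagging missing steps with mask 0."""
    observed = dict(observations)
    current, end = min(observed), max(observed)
    rows: List[Row] = []
    while current <= end:
        if current in observed:
            rows.append((current, observed[current], 1))
        else:
            # Keep the timestep so window positions stay aligned in time.
            rows.append((current, None, 0))
        current += step
    return rows


start = datetime(2024, 1, 1, 0, 0)
observations = [
    (start, 1.0),
    (start + timedelta(minutes=1), 2.0),
    (start + timedelta(minutes=11), 3.0),  # 10-minute gap before this point
]
rows = fill_time_grid(observations, timedelta(minutes=1))
available_mask = [mask for _, _, mask in rows]
```

Without the filled rows, a window taken over the three raw observations would treat minute 11 as if it directly followed minute 1.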
@jasminerienecker Thanks; I think this PR could be a generalization of #1036 (@jose-moralez). I have to think about the behaviour and we also would have to include the changes in the other Base classes.
@marcopeix Now that I've had more time to think about it, I think this is a nice addition, wdyt?
@elephaint @marcopeix any further thoughts on this?
Hey @jasminerienecker, sorry for the delay in getting this merged. I think it's a valuable addition to the repo, I just want @marcopeix's opinion on it too. There are a couple of things that are open (I'm happy to do this btw):
@marcopeix anything else?
Sorry for the late reply! Why is this only for BaseWindow models, and not for multivariate and recurrent as well?
@marcopeix I think we should be able to apply this logic in the same way to the BaseMultivariate models by adjusting the sample and available conditions.
I found the behaviour for the Recurrent models a lot more uncertain though. I suppose it'd only really apply when input_size > 0, as otherwise we'd end up excluding the whole sample. In this case the window is created using a random time across the whole batch. This makes the logic of how you'd want to include/exclude samples less clear, as different samples in the batch would likely have different data quality over different time windows.
I would personally include it for all models. Anyway, input_size should always be greater than 0 for all models. @elephaint, do you agree too?
This PR adds the parameter data_availability_threshold (defaults to 0.0 to maintain current functionality) to all models that inherit from the BaseWindows class. This parameter allows us to discard windows where the percentage of good-quality data points is below the threshold. The quality of a data point is determined by the corresponding value in the available_mask column.
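As a rough illustration of the filtering rule described above, here is a minimal sketch in plain Python. It is not the code in this PR: the helper names `window_availability` and `select_windows` are hypothetical, and it assumes the threshold is expressed as a fraction between 0 and 1 and that `available_mask` holds 1 for a good data point and 0 otherwise.

```python
from typing import List, Sequence


def window_availability(available_mask: Sequence[int], window_size: int) -> List[float]:
    """Fraction of available points in every sliding window over the mask."""
    n_windows = len(available_mask) - window_size + 1
    return [
        sum(available_mask[start:start + window_size]) / window_size
        for start in range(n_windows)
    ]


def select_windows(
    available_mask: Sequence[int],
    window_size: int,
    data_availability_threshold: float = 0.0,
) -> List[int]:
    """Start indices of windows whose availability is not below the threshold."""
    fractions = window_availability(available_mask, window_size)
    return [
        start
        for start, fraction in enumerate(fractions)
        if fraction >= data_availability_threshold
    ]


mask = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
kept_default = select_windows(mask, window_size=4)  # threshold 0.0 keeps every window
kept_half = select_windows(mask, window_size=4, data_availability_threshold=0.5)
```

With the default of 0.0 no window is dropped, which matches the stated goal of keeping existing behaviour unchanged; raising the threshold to 0.5 drops the windows that fall mostly inside the gap.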
This is functionality I currently require, as my dataset has many large gaps and I don't want to train the model on these gaps.
I have added a test to the end of the base_windows notebook.