Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add option to remove windows with poor data quality #1059

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jasminerienecker
Copy link
Contributor

@jasminerienecker jasminerienecker commented Jul 9, 2024

This review adds the parameters data_availability_threshold (defaults to 0.0 to maintain currently functionality) to all models that inherit the BaseWindows class. This parameters allows us to discard windows where the percentage of good quality data points is below the threshold. The quality of a data point is determined by the corresponding value in the available_mask column.

This is a functionality I currently require as my dataset has many large gaps and I don't want to be training the model using these gaps.

I have added a test to the end of base_windows notebook.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@elephaint
Copy link
Contributor

@jasminerienecker Thanks for the PR, I think I understand what you're trying to do.

Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?

@jasminerienecker
Copy link
Contributor Author

jasminerienecker commented Jul 11, 2024

@jasminerienecker Thanks for the PR, I think I understand what you're trying to do.

Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?

@elephaint Going through the code it seems the base_windows class assumes all the timesteps are available. For example if your data is at one minute resolution but there is a gap of 10 minutes, the windows are created as if no timesteps are missing.

This means I think you'd have to keep the rows with missing values in the dataset, but if there are longer chunks of missing data (as in most of the values in a window are NaN) this could interfere with the model training. This solution was a way of keeping the temporal information while not training the model on windows where the majority of the data is not available.

Please let me know if there's something I've missed though!

@elephaint
Copy link
Contributor

@jasminerienecker Thanks; I think this PR could be a generalization of #1036 (@jose-moralez). I have to think about the behaviour and we also would have to include the changes in the other Base classes.

@elephaint
Copy link
Contributor

@marcopeix Now that I've had more time to think about it, I think this is a nice addition, wdyt?

@elephaint elephaint requested a review from marcopeix September 30, 2024 19:47
@jasminerienecker
Copy link
Contributor Author

@elephaint @marcopeix any further thoughts on this?

@elephaint
Copy link
Contributor

elephaint commented Oct 22, 2024

@elephaint @marcopeix any further thoughts on this?

Hey @jasminerienecker, sorry for the delay in getting this merged. I think it's a valuable addition to the repo, I just want @marcopeix opinion on it too.

There's a couple of things that are open (I'm happy to do this btw):

  • Include all base_windows models
  • Resolve conflicts
  • Decide on whether to include it in base_multivariate and base_recurrent too (my opinion: no)

@marcopeix anything else?

@marcopeix
Copy link
Contributor

Sorry for the late reply! Why is this only for BaseWindow models, and not for multivariate and recurrent as well?

@jasminerienecker
Copy link
Contributor Author

@marcopeix I think we should be able to apply this logic in the same way to the BaseMultivariate models by adjusting the sample and available conditions

final_condition = (sample_condition > 0) & (available_condition > 0) 

I found the behaviour for the Recurrent models a lot more uncertain though. I suppose it'd only really apply when input_size > 0 as otherwise we'd end up excluding the whole sample. In this case the window is created using a random time across the whole batch. This makes the logic of how you'd want to include/exclude samples less clear as different samples in the batch would likely have different data quality over different time windows.

if (step == "train") and (self.input_size > 0):
            input_size = self.input_size
            if (input_size > 0) and (n_windows > input_size):
                max_sampleable_time = n_windows - self.input_size + 1
                start = np.random.choice(max_sampleable_time)
                windows = windows[:, :, start : (start + input_size), :]

@marcopeix
Copy link
Contributor

I would personally include it for all models. Anyway, input_size should always be greater than 0 for all models. @elephaint, you agree too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants