Batch filtering and resampling #270

Open
wants to merge 2 commits into
base: main

@s-kganz s-kganz commented Apr 25, 2025

Description of proposed changes

This PR implements batch filtering and resampling. I'm making this PR because I have been using these features in my own work for some time and thought they would be useful to others. Also, filtering has been requested in #158 and #162.

In both operations, the user provides a function that either accepts or rejects a batch (for filtering) or assigns a sample weight to each batch (for resampling). Both functions take the dataset and the dict of slice objects, so the user can write them strategically to minimize computation on dask arrays. The changes are all in BatchGenerator, since BatchSchema seems primarily intended as a representation of windowing parameters.
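To make the two callback shapes concrete, here is a minimal, library-free sketch. It assumes a plain Python list as a stand-in for the dataset and a hypothetical dimension name `"x"`; the function bodies, and the choice to filter first and then resample, are illustrative only and do not show how BatchGenerator applies the callbacks internally.

```python
import random

# Stand-in "dataset": a 1-D series along a hypothetical dimension "x".
# None marks a missing value. (The real callbacks receive the dataset itself.)
data = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0]

# Each batch is described by a dict mapping dimension name -> slice.
batch_slices = [{"x": slice(i, i + 2)} for i in range(0, len(data), 2)]


def filter_fn(ds, slices):
    """Accept a batch only if it contains no missing values."""
    return all(v is not None for v in ds[slices["x"]])


def resample_fn(ds, slices):
    """Assign a sample weight: here, the sum of the batch's valid values."""
    return sum(v for v in ds[slices["x"]] if v is not None)


# Filtering: keep only the batches the callback accepts.
kept = [s for s in batch_slices if filter_fn(data, s)]

# Resampling: draw batches with replacement, proportional to their weights.
weights = [resample_fn(data, s) for s in kept]
rng = random.Random(0)  # seeded so the draw is reproducible
resampled = rng.choices(kept, weights=weights, k=len(kept))
```

Because each callback only sees the slices for one batch, it can index just the variables it needs, which is what keeps the work on lazily evaluated arrays small.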

I was not able to get the asv tests to run in my development environment, but the original behavior is unchanged when resample_fn and filter_fn are not provided, so I do not expect a performance penalty. Filtering and resampling happen independently, but you could approximate doing both in one shot by having resample_fn return 0 for invalid batches. That would be a little faster than "checking" each batch twice in two separate functions.
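The one-shot approach can be sketched as follows, again with a plain list standing in for the dataset and a hypothetical `"x"` dimension. The name `combined_resample_fn` is made up for illustration; the point is only that a weight of 0 means a batch is never drawn.

```python
import random

# Stand-in "dataset" along a hypothetical dimension "x"; None is a missing value.
data = [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0]
batch_slices = [{"x": slice(i, i + 2)} for i in range(0, len(data), 2)]


def combined_resample_fn(ds, slices):
    """Return weight 0 for invalid batches, otherwise a positive weight."""
    values = ds[slices["x"]]
    if any(v is None for v in values):
        return 0.0  # invalid batch: can never be selected
    return float(sum(values))


# Each batch is inspected once, instead of once per callback.
weights = [combined_resample_fn(data, s) for s in batch_slices]

rng = random.Random(0)  # seeded so the draw is reproducible
drawn = rng.choices(batch_slices, weights=weights, k=100)
```

This is only an approximation of filtering: invalid batches are excluded, but the valid ones are drawn with replacement, so some may repeat and others may not appear at all.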

codecov bot commented Apr 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.10%. Comparing base (43c9135) to head (2e721fc).
Report is 33 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #270      +/-   ##
==========================================
+ Coverage   96.25%   97.10%   +0.84%     
==========================================
  Files           6        6              
  Lines         347      414      +67     
  Branches       82       63      -19     
==========================================
+ Hits          334      402      +68     
+ Misses          8        6       -2     
- Partials        5        6       +1     

@maxrjones
Member

Thanks for your PR! I'm traveling these next two weeks and will take a look after I'm back, unless a different maintainer has time for a review before then.


2 participants