Added filtering logic for memmap files. #375
Pull Request Overview
Adds a new `filter_dataset` function to apply sample-level filtering on packed memmap datasets, refactors header-writing logic into a shared helper, and introduces tests to verify filtering behavior.
- Implements `filter_dataset` in `filter_packed_data.py` to write a subset of samples based on a user-provided predicate.
- Extracts `_update_data_length_in_pre_allocated_header` from `create_packed_data.py` into a standalone function and updates its callers.
- Adds three tests covering output file creation, filtered length, and content correctness.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/dataloader/test_filter_packed_data.py | Adds tests for the new filter_dataset function |
src/modalities/dataloader/filter_packed_data.py | New module implementing filtering logic and header update |
src/modalities/dataloader/create_packed_data.py | Refactors header‐update helper into a top‐level function |
Co-authored-by: Copilot <[email protected]>
Great work! Left a few minor comments regarding stability for edge cases.
What does this PR do?
This PR adds a filtering function for the packed data files produced by `create_packed_data.py`.
The filtering uses a filter function that takes the index and content of each sample and returns `True` iff that sample should be retained in the data.
`filter_dataset` iterates over the data in the file and writes out a new file containing only the retained samples.
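
The filtering step described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's implementation: the names `filter_samples` and `FilterFn`, the representation of samples as lists of token ids, and the `(offset, length)` index layout are all assumptions made for the example. It only shows the predicate contract (index and content in, `bool` out) and why the index and total data length have to be recomputed for the output.

```python
from typing import Callable, Iterable

# Hypothetical predicate type: receives the sample index and the sample content
# and returns True if the sample should be kept.
FilterFn = Callable[[int, list[int]], bool]


def filter_samples(
    samples: Iterable[list[int]], keep: FilterFn
) -> tuple[list[int], list[tuple[int, int]]]:
    """Return the concatenated retained tokens and a rebuilt (offset, length) index."""
    data: list[int] = []
    index: list[tuple[int, int]] = []
    for i, sample in enumerate(samples):
        if not keep(i, sample):
            continue  # dropped samples contribute nothing to the output
        # The offset is the position in the *filtered* data, not in the original.
        index.append((len(data), len(sample)))
        data.extend(sample)
    return data, index


# Toy input: three samples of token ids.
samples = [[1, 2, 3], [4], [5, 6]]

# Keep only samples with more than one token.
data, index = filter_samples(samples, lambda i, sample: len(sample) > 1)

# A real writer would also store the new data length in the file header.
data_length = len(data)
```

Because offsets shift once samples are dropped, the output needs its own index and data length rather than a copy of the original ones.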
General Changes
Checklist before submitting final PR
- Tests run (`python tests/tests.py`)
- Changelog updated (`CHANGELOG_DEV.md`)