Skip to content

Added filtering logic for memmap files. #375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

BlueCrescent
Copy link
Collaborator

What does this PR do?

This PR adds a filtering function for the files produced by create pack data.
The filtering uses a filter function that takes the index and content of each sample and returns True iff that sample should be retained in the data.
iterates over the data in the file and writes out a new file containing only the retained samples.

General Changes

  • New filtering
  • Corresponding tests

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@BlueCrescent BlueCrescent requested a review from Copilot June 12, 2025 08:49
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a new filter_dataset function to apply sample-level filtering on packed memmap datasets, refactors header‐writing logic into a shared helper, and introduces tests to verify filtering behavior.

  • Implements filter_dataset in filter_packed_data.py to write a subset of samples based on a user‐provided predicate.
  • Extracts _update_data_length_in_pre_allocated_header from create_packed_data.py into a standalone function and updates its callers.
  • Adds three tests covering output file creation, filtered length, and content correctness.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/dataloader/test_filter_packed_data.py Adds tests for the new filter_dataset function
src/modalities/dataloader/filter_packed_data.py New module implementing filtering logic and header update
src/modalities/dataloader/create_packed_data.py Refactors header‐update helper into a top‐level function

@le1nux le1nux self-requested a review June 12, 2025 09:50
Copy link
Member

@le1nux le1nux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Left a few minor comments regarding stability for edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants