Add Grain support to image_dataset_from_directory and text_dataset_from_directory #21593

Merged

james77777778
Contributor

This PR adds Grain support to image_dataset_from_directory and text_dataset_from_directory by including format="grain" as an option.

Note that with format="grain" the text is returned as Python strings, because TensorFlow is the only backend that supports string tensors.
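
A rough usage sketch, not taken from the demo colab: the snippet below builds a toy directory with one subfolder per class and wraps the call with format="grain". The helper names make_toy_text_directory and load_as_grain are made up for illustration, and load_as_grain assumes a Keras build that includes this PR with the grain package installed.

```python
import os
import tempfile


def make_toy_text_directory(root):
    """Create root/<class_name>/<index>.txt, one subfolder per class."""
    samples = {
        "neg": ["bad movie", "not great"],
        "pos": ["great movie", "really good"],
    }
    for class_name, texts in samples.items():
        class_dir = os.path.join(root, class_name)
        os.makedirs(class_dir, exist_ok=True)
        for i, text in enumerate(texts):
            with open(os.path.join(class_dir, f"{i}.txt"), "w") as f:
                f.write(text)
    return sorted(samples)


def load_as_grain(directory, batch_size=2):
    """Load the directory with format="grain" (assumes keras and grain are installed)."""
    import keras  # imported lazily so the directory helper works without Keras

    return keras.utils.text_dataset_from_directory(
        directory, batch_size=batch_size, format="grain"
    )


root = tempfile.mkdtemp()
class_names = make_toy_text_directory(root)
```

With a suitable Keras install, iterating over load_as_grain(root) should yield (text, label) batches in which the text is plain Python strings rather than TensorFlow string tensors, per the note above.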

Here is a demo colab:
https://colab.research.google.com/drive/13RNE_KTbc13Shp5c1uIKzCbo-Ztfe5Ik?usp=sharing

cc @divyashreepathihalli @mattdangerw

Contributor

gemini-code-assist bot left a comment

Summary of Changes

Hello @james77777778, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Grain support into the image_dataset_from_directory and text_dataset_from_directory utilities in Keras. Users can now specify format="grain" to get a Grain data pipeline as an alternative to the default TensorFlow dataset format. The change refactors core dataset utilities to be backend-agnostic, introduces new Grain-specific helpers, and updates file system handling to support both local and remote storage locations.

Highlights

  • Added format argument to dataset utilities: The image_dataset_from_directory and text_dataset_from_directory utilities now accept a new format argument. This allows users to choose between tf.data.Dataset (default) and grain.IterDataset for their data loading needs.
  • New grain_utils module for Grain-specific utilities: A new module, grain_utils.py, has been introduced to house helper functions specifically for handling Grain datasets, including make_batch and make_string_batch for efficient data batching.
  • Improved file system handling for local and remote paths: The internal logic for handling file paths in dataset_utils.py has been enhanced to dynamically use os or tf.io.gfile based on the path prefix. This enables seamless support for both local and remote file systems (e.g., Google Cloud Storage, HDFS).
  • Refactored data loading and label processing for multi-backend support: The data loading and label processing functions within image_dataset_utils.py and text_dataset_utils.py have been refactored to support both TensorFlow and Grain backends, ensuring compatibility and flexibility.
  • Expanded test coverage for new format argument: Test suites for both image and text dataset utilities have been updated to include parameterized tests, thoroughly validating the new format argument across different scenarios and ensuring correct behavior for both TensorFlow and Grain outputs.
  • Grain text data returned as Python strings: When using format="grain" with text_dataset_from_directory, the text data will be returned as standard Python strings, as Grain does not natively support TensorFlow string tensors.
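
The file system bullet above can be sketched roughly as follows. This is an illustration only, not the code in dataset_utils.py: the prefix list and the names REMOTE_PREFIXES, is_remote_path, and list_directory are assumptions, and the remote branch needs TensorFlow installed.

```python
import os

# Hypothetical prefix list for illustration; the check used in Keras may differ.
REMOTE_PREFIXES = ("gs://", "hdfs://")


def is_remote_path(path):
    """Return True when the path looks like a remote (non-local) location."""
    return str(path).startswith(REMOTE_PREFIXES)


def list_directory(path):
    """List entries with tf.io.gfile for remote paths and os for local ones."""
    if is_remote_path(path):
        # Imported lazily so local paths work without TensorFlow.
        import tensorflow as tf

        return sorted(tf.io.gfile.listdir(path))
    return sorted(os.listdir(path))
```

Keeping the TensorFlow import inside the remote branch means a local-only workflow never has to load TensorFlow just to walk a directory.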
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request adds support for the grain data format to image_dataset_from_directory and text_dataset_from_directory by introducing a format='grain' option. The implementation is well-structured, creating parallel functions for grain and tensorflow data loading paths. The changes also include a refactoring in dataset_utils.py to handle different filesystems (local vs. GCS/HDFS) in a more generic way. My feedback includes a couple of suggestions for further refactoring to improve code clarity, performance, and maintainability by reducing code duplication and leveraging existing utilities.
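
The "parallel functions" structure mentioned in the review could look something like the sketch below. It is a toy stand-in, not the PR's code: the builder functions return plain dicts instead of real datasets, and every name here (_build_tf_dataset, _build_grain_dataset, dataset_from_paths) is hypothetical.

```python
def _build_tf_dataset(paths):
    # Placeholder standing in for the tf.data code path.
    return {"backend": "tf", "paths": list(paths)}


def _build_grain_dataset(paths):
    # Placeholder standing in for the Grain code path.
    return {"backend": "grain", "paths": list(paths)}


_BUILDERS = {"tf": _build_tf_dataset, "grain": _build_grain_dataset}


def dataset_from_paths(paths, format="tf"):
    """Validate `format` once, then hand off to the matching builder."""
    if format not in _BUILDERS:
        raise ValueError(
            f"`format` must be one of {sorted(_BUILDERS)}. Received: {format!r}"
        )
    return _BUILDERS[format](paths)
```

Validating the argument in one place keeps the two loading paths independent, so each can be tested and refactored separately.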

james77777778 force-pushed the add-grain-support-in-dataset-utils branch from 4196615 to 4b791ab on August 17, 2025 at 09:26

codecov-commenter commented Aug 17, 2025

Codecov Report

❌ Patch coverage is 73.26733% with 54 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@89a8676). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| keras/src/utils/image_dataset_utils.py | 65.55% | 17 Missing and 14 partials ⚠️ |
| keras/src/utils/dataset_utils.py | 65.85% | 11 Missing and 3 partials ⚠️ |
| keras/src/utils/text_dataset_utils.py | 89.36% | 1 Missing and 4 partials ⚠️ |
| keras/src/utils/grain_utils.py | 81.81% | 2 Missing and 2 partials ⚠️ |
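
As a sanity check on the figures above, the sketch below recomputes the patch coverage. It assumes that Codecov counts both missing and partial lines as uncovered and that patch coverage is covered lines divided by total changed lines. The total of 202 changed lines is inferred from the reported percentage, not stated in the report.

```python
# (missing, partial) counts per file, copied from the Codecov table above.
uncovered_by_file = {
    "keras/src/utils/image_dataset_utils.py": (17, 14),
    "keras/src/utils/dataset_utils.py": (11, 3),
    "keras/src/utils/text_dataset_utils.py": (1, 4),
    "keras/src/utils/grain_utils.py": (2, 2),
}

# Assumption: partial lines count as uncovered, like missing lines.
uncovered = sum(missing + partial for missing, partial in uncovered_by_file.values())


def patch_coverage(total_lines, uncovered_lines):
    """Percentage of changed lines that are covered."""
    return 100.0 * (total_lines - uncovered_lines) / total_lines


# Infer the total number of changed lines from the reported percentage.
reported_percent = 73.26733
total_lines = round(uncovered / (1 - reported_percent / 100))

print(uncovered, total_lines, round(patch_coverage(total_lines, uncovered), 5))
```

Under those assumptions the per-file counts add up to the 54 uncovered lines in the headline, and 148 covered lines out of an inferred 202 reproduces the reported 73.26733%.
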
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #21593   +/-   ##
=========================================
  Coverage          ?   82.70%           
=========================================
  Files             ?      568           
  Lines             ?    56881           
  Branches          ?     8889           
=========================================
  Hits              ?    47045           
  Misses            ?     7642           
  Partials          ?     2194           
| Flag | Coverage Δ |
|---|---|
| keras | 82.51% <71.78%> (?) |
| keras-jax | 63.65% <69.30%> (?) |
| keras-numpy | 58.24% <69.30%> (?) |
| keras-openvino | 34.55% <13.86%> (?) |
| keras-tensorflow | 64.21% <71.28%> (?) |
| keras-torch | 63.80% <69.80%> (?) |

gbaned requested a review from mattdangerw on August 18, 2025 at 07:22
gbaned added this to PR Queue on Aug 18, 2025
github-project-automation bot moved this to Assigned Reviewer in PR Queue on Aug 18, 2025
Collaborator

fchollet left a comment

Awesome work! The code looks good to me.

google-ml-butler bot added the kokoro:force-run and ready to pull labels on Aug 18, 2025
github-project-automation bot moved this from Assigned Reviewer to Approved by Reviewer in PR Queue on Aug 18, 2025
fchollet merged commit 7da416d into keras-team:master on Aug 19, 2025
11 checks passed
google-ml-butler bot removed the awaiting review and ready to pull labels on Aug 19, 2025
github-project-automation bot moved this from Approved by Reviewer to Merged in PR Queue on Aug 19, 2025
james77777778 deleted the add-grain-support-in-dataset-utils branch on August 19, 2025 at 23:40