
Supporting data access to hugging face data sets #964

Open · 2 tasks done
blublinsky opened this issue Jan 23, 2025 · 4 comments
Labels: enhancement (New feature or request)

Comments

@blublinsky (Collaborator)

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

Feature

Currently, DPK supports two data location options: the local file system and S3-compatible storage. At the same time, one of the largest collections of public datasets is the Hugging Face (HF) Hub. Natively supporting data access there opens up many capabilities for users and is also quite important for the AI Alliance.
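For context, huggingface_hub already exposes the Hub as an fsspec-style filesystem, so HF could plausibly slot in next to the existing local and S3 back ends. A minimal sketch of read access (the repo ID and file path below are placeholders, not real datasets):

```python
# Sketch: reading a file from the HF Hub through its fsspec interface.
# Requires `pip install huggingface_hub`; the dataset path is a placeholder.
from huggingface_hub import HfFileSystem

fs = HfFileSystem()  # anonymous access works for public datasets

# List the files in a (hypothetical) dataset repo.
files = fs.ls("datasets/some-org/some-dataset", detail=False)
print(files)

# Read one parquet file as raw bytes, the same way a local/S3 reader would.
with fs.open("datasets/some-org/some-dataset/data/train.parquet", "rb") as f:
    raw = f.read()
```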

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
blublinsky added the enhancement (New feature or request) label on Jan 23, 2025
@touma-I (Collaborator) commented on Jan 29, 2025

This writeup does not give me enough to act on or prioritize. Please provide specifics. Here are some things you may want to address (your writeup does not have to be limited to these): describe the use case you want to enable, what dataset you need to access, whether you need to write data back to HF, what other methods you have used to access (read or write) the data, which methods you tried before, and why what you are proposing will work better.

@deanwampler (Contributor)

I'll spell out the most important use cases for Open Trusted Data Initiative (OTDI):

  1. Read datasets from HF to analyze them for conformance to our requirements for openness, provenance, and governance.
  2. Optionally write back updates to the metadata for those datasets, reflecting the analysis.
  3. Generate new derived datasets from HF or other datasets and write them back to HF as new datasets.

I would think it is obvious that users need direct R/W access to the world's most important AI dataset repository. Without this feature, the Alliance will either have to fork DPK to use it or adopt an alternative.
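To make these concrete, here is a hedged sketch of what the three use cases could look like with the plain huggingface_hub client API. All repo IDs and file names are hypothetical placeholders, not OTDI's actual datasets:

```python
# Sketch of the three OTDI use cases with huggingface_hub's client API.
# All repo IDs and file names below are hypothetical placeholders.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()  # picks up the HF_TOKEN environment variable if set

# 1. Read a dataset file from the Hub for conformance analysis.
local_path = hf_hub_download(
    repo_id="some-org/some-dataset",
    filename="README.md",
    repo_type="dataset",
)

# 2. Write updated metadata (e.g. an amended dataset card) back.
api.upload_file(
    path_or_fileobj="README.updated.md",
    path_in_repo="README.md",
    repo_id="some-org/some-dataset",
    repo_type="dataset",
    commit_message="Update metadata after conformance analysis",
)

# 3. Publish a derived dataset as a new repo.
api.create_repo("some-org/some-dataset-derived", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="derived.parquet",
    path_in_repo="data/derived.parquet",
    repo_id="some-org/some-dataset-derived",
    repo_type="dataset",
)
```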

cc: @nirmdesai

@touma-I (Collaborator) commented on Jan 30, 2025

@deanwampler Thanks. There is ample documentation out there showing how you can do all three using the Hugging Face APIs. A few questions so we can add clarity on what you are trying to do:

1. Where have you used the HF APIs in your recipe, and where did they fall short in letting you ingest the initial data and then write it back at the end?
2. Do you need to save snapshots of intermediate results to Hugging Face, and have you used the snapshotting API from HF to do that?
3. Are you thinking of implementing any sort of governance on use case 3?
4. What volume of data transactions do you want to achieve? Can you provide some information on non-functional requirements?
5. There are several recipes out there that show how one can use the Hugging Face File System in a Python application. Have you done any initial exploration of how that could be used?

Any clarification on how you have tried to solve the problem in the past using readily available tools would be helpful.
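(For reference, the snapshotting API mentioned in question 2 is presumably huggingface_hub.snapshot_download; a minimal sketch, with a placeholder repo ID:)

```python
# Sketch: materializing a full local snapshot of a dataset repo,
# one way to checkpoint state pulled from the Hub.
# The repo ID is a hypothetical placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="some-org/some-dataset",
    repo_type="dataset",
)
print(f"Snapshot downloaded to: {local_dir}")
```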

@blublinsky (Collaborator, Author)

  1. The whole implementation is based on the HF file system APIs.
  2. We use the same HF fs APIs for both read and write; they have all been tested.
  3. It is the same as for all other transforms.
  4. See 1 and 2.

The HF fs APIs work great; now we want to use them as part of DPK.
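As an illustration of the symmetric read/write pattern described above (not the actual DPK code; the repo and paths are placeholders, and writes require a token with write access):

```python
# Sketch of read/write through HfFileSystem, mirroring how a DPK
# data-access layer could treat HF like local or S3 storage.
# The repo below is a hypothetical placeholder.
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token="hf_...")  # a token is needed for writes

# Read an input file.
with fs.open("datasets/some-org/some-dataset/input/part-0.parquet", "rb") as f:
    data = f.read()

# ... transform `data` ...

# Write the transformed output back; the write is committed when the
# file handle is closed.
with fs.open("datasets/some-org/some-dataset/output/part-0.parquet", "wb") as f:
    f.write(data)
```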
