
Supporting data access to hugging face data sets #964

Open · 2 tasks done
blublinsky opened this issue Jan 23, 2025 · 4 comments
Labels: enhancement (New feature or request)

Comments

@blublinsky (Collaborator)

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

Feature

Currently, DPK supports two data location options: the local file system and S3-compatible storage. At the same time, one of the largest collections of public datasets is the Hugging Face (HF) Hub. Natively supporting data access there opens up many capabilities for users and is also quite important for the AI Alliance.
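For context, huggingface_hub already exposes the Hub as an fsspec-style filesystem, so HF could plausibly slot in next to the existing local and S3 back ends. A minimal sketch of read access (the repo ID and file path below are placeholders, not real datasets):

```python
# Sketch: reading a file from the HF Hub through its fsspec interface.
# Requires `pip install huggingface_hub`; the dataset path is a placeholder.
from huggingface_hub import HfFileSystem

fs = HfFileSystem()  # anonymous access works for public datasets

# List the files in a (hypothetical) dataset repo.
files = fs.ls("datasets/some-org/some-dataset", detail=False)
print(files)

# Read one parquet file as raw bytes, the same way a local/S3 reader would.
with fs.open("datasets/some-org/some-dataset/data/train.parquet", "rb") as f:
    raw = f.read()
```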

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
blublinsky added the enhancement (New feature or request) label on Jan 23, 2025
@touma-I (Collaborator) commented on Jan 29, 2025

This writeup does not give me enough to act on or prioritize. Please provide specifics. Here are some things you may want to address (your writeup does not have to be limited to these): describe the use case you want to enable, what dataset you need to access, whether you need to write data back to HF, what other methods you have used to access (read or write) the data, which methods you tried before, and why what you are proposing will work better.

@deanwampler (Contributor)

I'll spell out the most important use cases for Open Trusted Data Initiative (OTDI):

  1. Read datasets from HF to analyze them for conformance to our requirements for openness, provenance, and governance.
  2. Optionally write back updates to the metadata for those datasets, reflecting the analysis.
  3. Generate new derived datasets from HF or other datasets and write them back to HF as new datasets.

I would think it is obvious that users need direct R/W access to the world's most important AI dataset repository. Without this feature, the Alliance will either have to fork DPK to use it or adopt an alternative.
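To make these concrete, here is a hedged sketch of what the three use cases could look like with the plain huggingface_hub client API. All repo IDs and file names are hypothetical placeholders, not OTDI's actual datasets:

```python
# Sketch of the three OTDI use cases with huggingface_hub's client API.
# All repo IDs and file names below are hypothetical placeholders.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()  # picks up the HF_TOKEN environment variable if set

# 1. Read a dataset file from the Hub for conformance analysis.
local_path = hf_hub_download(
    repo_id="some-org/some-dataset",
    filename="README.md",
    repo_type="dataset",
)

# 2. Write updated metadata (e.g. an amended dataset card) back.
api.upload_file(
    path_or_fileobj="README.updated.md",
    path_in_repo="README.md",
    repo_id="some-org/some-dataset",
    repo_type="dataset",
    commit_message="Update metadata after conformance analysis",
)

# 3. Publish a derived dataset as a new repo.
api.create_repo("some-org/some-dataset-derived", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="derived.parquet",
    path_in_repo="data/derived.parquet",
    repo_id="some-org/some-dataset-derived",
    repo_type="dataset",
)
```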

cc: @nirmdesai

@touma-I (Collaborator) commented on Jan 30, 2025

@deanwampler Thanks. There is ample documentation out there showing how you can do all three using the Hugging Face APIs. A few questions so we can add clarity on what you are trying to do:

1. Where have you used the HF APIs in your recipe, and where did they fall short in letting you ingest the initial data and then write it back at the end?
2. Do you need to save snapshots of intermediate results to Hugging Face, and have you used the snapshotting API from HF to do that?
3. Are you thinking of implementing any sort of governance on use case 3?
4. What volume of data transactions do you want to achieve? Can you provide some information on non-functional requirements?
5. There are several recipes out there that show how one can use the Hugging Face File System in a Python application. Have you done any initial exploration of how that could be used?

Any clarification on how you have tried to solve the problem in the past using readily available tools would be helpful.
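(For reference, the snapshotting API mentioned in question 2 is presumably huggingface_hub.snapshot_download; a minimal sketch, with a placeholder repo ID:)

```python
# Sketch: materializing a full local snapshot of a dataset repo,
# one way to checkpoint state pulled from the Hub.
# The repo ID is a hypothetical placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="some-org/some-dataset",
    repo_type="dataset",
)
print(f"Snapshot downloaded to: {local_dir}")
```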

@blublinsky (Collaborator, Author)

  1. The whole implementation is based on the HF file system APIs.
  2. We use the same HF fs APIs for both read and write; they have all been tested.
  3. It is the same as for all other transforms.
  4. See 1 and 2.

The HF fs APIs work great; now we want to use them as part of DPK.
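As an illustration of the symmetric read/write pattern described above (not the actual DPK code; the repo and paths are placeholders, and writes require a token with write access):

```python
# Sketch of read/write through HfFileSystem, mirroring how a DPK
# data-access layer could treat HF like local or S3 storage.
# The repo below is a hypothetical placeholder.
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token="hf_...")  # a token is needed for writes

# Read an input file.
with fs.open("datasets/some-org/some-dataset/input/part-0.parquet", "rb") as f:
    data = f.read()

# ... transform `data` ...

# Write the transformed output back; the write is committed when the
# file handle is closed.
with fs.open("datasets/some-org/some-dataset/output/part-0.parquet", "wb") as f:
    f.write(data)
```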
