motivation for HF datasets? #830

arendu · 2025-04-03T01:15:28Z

arendu
Apr 3, 2025
Collaborator

The motivation behind these kind of "datasets" is very odd imo.

Why not just enforce a single dataset class - and call it a day! Anyone can write a simple script to download a dataset from HF and convert it to open-ai format, right? Also, there is almost zero usecase where you just take one dataset and train on it (at least to build high quality aligned models) its always a combination of a huge set of datasets, each with different formats etc. This is all data-munging work that a toolkit should not be entangled in.

Premature notions of "convenience" just end up being code debt.

terrykong · 2025-04-03T16:19:14Z

terrykong
Apr 3, 2025
Maintainer

Allowing HF datasets in process is just to allow us to test things E2E and give users an easy way to get started.

I am very anti-conversion scripts because they tend to be tech debt and have issues with maintenance and accumulate bugs due to being neglected.

We are open to other input formats. Perhaps you can share your format and we can discuss an API.

Premature notions of "convenience"

We want the community to be able to reproduce our results, so that's something we also care about, so we can't just enforce a data schema and a conversion script and hope that the community can reproduce our results.

0 replies

titu1994 · 2025-04-03T18:56:00Z

titu1994
Apr 3, 2025
Collaborator

I don't want dataset conversion scripts. HF + off the shelf jsonl support (using Datasets("json", ...)) similar to OpenRLHF is sufficient for most general tasks.

If there are custom datasets, then writing a dataset class is anyway required (for multimodal for example)

0 replies

terrykong · 2025-04-03T19:09:51Z

terrykong
Apr 3, 2025
Maintainer

off the shelf jsonl support (using Datasets("json", ...))

I think the schema for this is something we can discuss using this issue. I agree we should support a direct jsonl format for at least text datasets

0 replies

ashors1 · 2025-04-03T19:20:24Z

ashors1
Apr 3, 2025
Collaborator

Even though our only existing examples are using HuggingFace datasets, that's definitely not all we plan to support. As described in the SFT documentation, the current design just expects your processed data to conform to the HuggingFace chat format to make it simple to use HuggingFace's tokenizer.apply_chat_template. How you massage your data into that format is up to you. You're totally free to work with arbitrary json files. @arendu let me know if you have concerns about this design. I'd also be happy to work with you on a specific use-case to showcase how to work with json files directly.

0 replies

arendu · 2025-04-03T20:58:10Z

arendu
Apr 3, 2025
Collaborator Author

[The closer I look at it, I'm seeing this is can be somewhat minor/stylistic in nature, so my view is not as strong as it might seem from the first comment]
[Also I'm only coming from the text-only dataset point of view]

I guess Im getting mixed messages from the docs and the fact that data/hf_datasets/squad.py is a class and not a stand-alone toy script even though the line just before it says "If your data is not in the correct format, simply write a preprocessing script".

To me, if it is a preprocessing script there should be no references to SquadDataset in the rest of the repo.

Just something like ChatDataset is what I'd expect for a class reference in the rest of the repo. The e2e tests can certiainly have some logic to download HF and process it to fit into ChatDataset.

that's definitely not all we plan to support.

I'm actually advocating for actively supporting less :) I worry that it is not clear that SquadDataset is just for unit-tests and it looks like we are setting a precedence for "add your own dataset class for any new dataset you want to work with". (Which is not what the doc says, but the code looks like it saying that, at least to me).

0 replies

2025-08-03T02:16:56Z

github-actions[bot]
bot Aug 3, 2025

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

0 replies

yuki-666 · 2025-08-13T08:34:47Z

yuki-666
Aug 13, 2025
Collaborator

Hi @arendu @titu1994 , thanks for the interest in NeMo-RL!
Have a design for dataset refactor at #909, please let me know if you have any suggestions or concerns.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

motivation for HF datasets? #830

Uh oh!

{{title}}

Uh oh!

Replies: 7 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

motivation for HF datasets? #830

Uh oh!

arendu Apr 3, 2025 Collaborator

Replies: 7 comments

Uh oh!

terrykong Apr 3, 2025 Maintainer

Uh oh!

Uh oh!

titu1994 Apr 3, 2025 Collaborator

Uh oh!

terrykong Apr 3, 2025 Maintainer

Uh oh!

ashors1 Apr 3, 2025 Collaborator

Uh oh!

Uh oh!

arendu Apr 3, 2025 Collaborator Author

Uh oh!

github-actions[bot] bot Aug 3, 2025

Uh oh!

yuki-666 Aug 13, 2025 Collaborator

arendu
Apr 3, 2025
Collaborator

terrykong
Apr 3, 2025
Maintainer

titu1994
Apr 3, 2025
Collaborator

terrykong
Apr 3, 2025
Maintainer

ashors1
Apr 3, 2025
Collaborator

arendu
Apr 3, 2025
Collaborator Author

github-actions[bot]
bot Aug 3, 2025

yuki-666
Aug 13, 2025
Collaborator