Replies: 7 comments
-
Allowing HF datasets in process is just to allow us to test things E2E and give users an easy way to get started. I am very anti-conversion scripts because they tend to be tech debt and have issues with maintenance and accumulate bugs due to being neglected. We are open to other input formats. Perhaps you can share your format and we can discuss an API.
We want the community to be able to reproduce our results, so that's something we also care about, so we can't just enforce a data schema and a conversion script and hope that the community can reproduce our results. |
Beta Was this translation helpful? Give feedback.
-
I don't want dataset conversion scripts. HF + off the shelf jsonl support (using Datasets("json", ...)) similar to OpenRLHF is sufficient for most general tasks. If there are custom datasets, then writing a dataset class is anyway required (for multimodal for example) |
Beta Was this translation helpful? Give feedback.
-
I think the schema for this is something we can discuss using this issue. I agree we should support a direct jsonl format for at least text datasets |
Beta Was this translation helpful? Give feedback.
-
Even though our only existing examples are using HuggingFace datasets, that's definitely not all we plan to support. As described in the SFT documentation, the current design just expects your processed data to conform to the HuggingFace chat format to make it simple to use HuggingFace's |
Beta Was this translation helpful? Give feedback.
-
[The closer I look at it, I'm seeing this is can be somewhat minor/stylistic in nature, so my view is not as strong as it might seem from the first comment] I guess Im getting mixed messages from the docs and the fact that data/hf_datasets/squad.py is a class and not a stand-alone toy script even though the line just before it says "If your data is not in the correct format, simply write a preprocessing script". To me, if it is a preprocessing script there should be no references to Just something like
I'm actually advocating for actively supporting less :) I worry that it is not clear that |
Beta Was this translation helpful? Give feedback.
-
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Beta Was this translation helpful? Give feedback.
-
Hi @arendu @titu1994 , thanks for the interest in NeMo-RL! |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
The motivation behind these kind of "datasets" is very odd imo.
Why not just enforce a single dataset class - and call it a day! Anyone can write a simple script to download a dataset from HF and convert it to open-ai format, right? Also, there is almost zero usecase where you just take one dataset and train on it (at least to build high quality aligned models) its always a combination of a huge set of datasets, each with different formats etc. This is all data-munging work that a toolkit should not be entangled in.
Premature notions of "convenience" just end up being code debt.
Beta Was this translation helpful? Give feedback.
All reactions