Suggestion / Feature Request
Pyvene is a library built around interchange interventions. Its datasets often contain two sets of input_ids and, optionally, two sets of labels. When training on batched datasets, this creates a collator problem: no existing collator pads both sets of input_ids, which may differ in length, at the same time.
The Hugging Face transformers collators only pad the "input_ids" entries in the dataset.
In addition to the above, DataCollatorForSeq2Seq also pads "labels", but nothing else.
As a result, dataset entries like "source_input_ids" are left unpadded, so sequences of different lengths cannot be stacked into a single batch.
Adding a utility that supports this would help pyvene development in general.
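To make the request concrete, here is a minimal sketch of what such a utility could look like. It is not part of pyvene or transformers: the function name `pad_multi_key_batch`, the key `source_labels`, and the pad id `0` are assumptions made for illustration, and `-100` is used for labels only because that is the usual ignore index. It works on plain Python lists and leaves attention masks and tensor conversion out.

```python
from typing import Dict, List, Sequence


def pad_multi_key_batch(
    features: Sequence[Dict[str, List[int]]],
    pad_values: Dict[str, int],
    padding_side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Pad each key in pad_values to that key's longest sequence in the batch.

    Hypothetical helper: every key is padded independently, so "input_ids"
    and "source_input_ids" may end up with different padded lengths.
    """
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    batch: Dict[str, List[List[int]]] = {}
    for key, pad_value in pad_values.items():
        # Every example must provide the key, otherwise rows would misalign.
        sequences = [example[key] for example in features]
        max_len = max(len(seq) for seq in sequences)

        padded = []
        for seq in sequences:
            filler = [pad_value] * (max_len - len(seq))
            padded.append(seq + filler if padding_side == "right" else filler + seq)
        batch[key] = padded
    return batch


# Two examples whose base and source sequences all differ in length.
features = [
    {"input_ids": [5, 6, 7], "source_input_ids": [8, 9],
     "labels": [7], "source_labels": [9, 10]},
    {"input_ids": [5], "source_input_ids": [8, 9, 10, 11],
     "labels": [1, 2], "source_labels": [3]},
]

# Assumed pad values: 0 for token ids, -100 for labels.
pad_values = {
    "input_ids": 0,
    "source_input_ids": 0,
    "labels": -100,
    "source_labels": -100,
}

batch = pad_multi_key_batch(features, pad_values)
```

A real collator would likely take the pad id from the tokenizer and return tensors; the point here is only that each key is padded to its own maximum length.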