Hello,

I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering whether I am using the library in the wrong way or whether this is expected.

cpu type: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16

I created array_record data and tried iterating over it:
- Iterating with grain takes around 6 hours.
- Iterating over array_record directly, excluding grain, takes around 27 hours.
- Iterating over the same data as a Hugging Face dataset takes about 1 minute.

With the default array_record parameters, compression is approximately 1.8x-2x better than the Hugging Face dataset. maxtext, the most optimized open-source library for training LLMs, uses grain and array_record for its training. I was wondering whether, because large-model forward passes take much more time than a single iteration step, this is not a performance issue in practice, or am I doing something terribly wrong here?
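For reference, the timings above can be reproduced with a small stdlib-only harness like the sketch below. It measures wall-clock time for one pass over any iterable; the `time_iteration` helper and the in-memory example are illustrative, not part of either library. In the actual benchmark the iterable would be an `ArrayRecordDataSource`, a grain data loader, or a Hugging Face dataset.

```python
import time


def time_iteration(iterable, limit=None):
    """Measure wall-clock time and record count for one pass over `iterable`.

    `limit`, if given, stops the pass early after that many records.
    """
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
        if limit is not None and count >= limit:
            break
    elapsed = time.perf_counter() - start
    return count, elapsed


# Illustrative run on an in-memory range; substitute the data source under test.
n, secs = time_iteration(range(1_000_000))
print(f"{n} records in {secs:.3f}s ({n / secs:,.0f} records/s)")
```

Comparing the records/s figure across the three data sources on the same machine gives the per-source iteration cost directly.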