Hello,
I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering if I am using the library in the wrong way or if this is expected.

CPU: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16
Created the array_record data:

```python
import json

from array_record.python.array_record_module import ArrayRecordWriter
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
writer = ArrayRecordWriter('stories.array_record', 'group_size:1')
for row in tqdm(ds['train']):
    data = json.dumps(row)
    writer.write(data.encode())
writer.close()
```

Now let's try just iterating over it using grain.
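As an aside, each record written above is a UTF-8 encoded JSON blob; a quick round-trip check of that encoding, using a hypothetical row (the real rows come from the dataset):

```python
import json

# Hypothetical TinyStories-style row, for illustration only.
row = {"text": "Once upon a time there was a little robot."}

# Encode the way the writer loop does, then decode it back.
blob = json.dumps(row).encode()
decoded = json.loads(blob.decode())
assert decoded == row  # lossless round trip
```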
This takes around 6 hours to iterate over
```python
import grain.python as grain
from array_record.python.array_record_data_source import ArrayRecordDataSource
from tqdm import tqdm

source = ArrayRecordDataSource('stories.array_record')
index_sampler_example = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=False,
    seed=0)
loader = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=2,
    worker_buffer_size=2,
)
for element in tqdm(loader, total=len(source)):
    a = element
```

Okay, now let's use array_record on its own and exclude grain; this takes around 27 hours:
```python
from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')
for i in tqdm(range(reader.num_records())):
    a = reader.read([i])
```

Let's see what happens if I use the Hugging Face dataset; this takes about 1 minute:
```python
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
for element in tqdm(ds['train']):
    a = element
```

With the default array_record parameters, compression is approximately 1.8x-2x better than the Hugging Face dataset. MaxText, the most optimized open-source library for training LLMs, uses grain and array_record for its training. I was wondering whether this is not a performance issue in practice, because large-model forward passes take much more time than iterating, or whether I am doing something terribly wrong here?
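One detail that may matter in the direct ArrayRecordReader loop above: `reader.read([i])` is issued once per record, even though `read` accepts a list of indices. Batching those index lists might amortize per-call overhead. Here is a minimal sketch of the batching pattern, with a plain Python list standing in for the reader so the snippet runs anywhere; the 1000-record batch size is an arbitrary choice:

```python
def read_in_batches(read_fn, num_records, batch_size=1000):
    """Yield records by passing index lists to read_fn in chunks.

    read_fn mimics ArrayRecordReader.read: it takes a list of indices
    and returns the corresponding records.
    """
    for start in range(0, num_records, batch_size):
        indices = list(range(start, min(start + batch_size, num_records)))
        yield from read_fn(indices)

# Stand-in "reader": records are just byte strings in a list.
records = [f"record-{i}".encode() for i in range(2500)]
fake_read = lambda idxs: [records[i] for i in idxs]

out = list(read_in_batches(fake_read, len(records)))
assert out == records  # all records recovered, in order
```

Whether this actually closes the gap would need measuring; it only changes how many read calls are issued, not what is read.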