
Comparing array_record's ArrayRecordDataSource (+ grain data loader) to Hugging Face #143

Closed

Description

@ZurabDz

Hello,
I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering whether I am using the library the wrong way or whether this is expected.

CPU: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16

First, I created the array_record data:

import json

from array_record.python.array_record_module import ArrayRecordWriter
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
writer = ArrayRecordWriter('stories.array_record', 'group_size:1')

# Serialize each row as JSON and append it as one record.
for row in tqdm(ds['train']):
    data = json.dumps(row)
    writer.write(data.encode())

writer.close()
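As a quick sanity check that the file round-trips (a sketch I added for clarity; the reader API is the same one used further below, and the decode step just mirrors the write path):

from array_record.python.array_record_module import ArrayRecordReader
import json

reader = ArrayRecordReader('stories.array_record')
# read() takes a list of indices and returns a list of raw byte records.
first = json.loads(reader.read([0])[0].decode())
print(first)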

Now let's try just iterating over it using grain. This takes around 6 hours:

from array_record.python.array_record_data_source import ArrayRecordDataSource
from tqdm import tqdm
import grain.python as grain

source = ArrayRecordDataSource('stories.array_record')

# Sequential sampler: one epoch, one shard, no shuffling.
index_sampler_example = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=False,
    seed=0)

loader = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=2,
    worker_buffer_size=2,
)

for element in tqdm(loader, total=len(source)):
    a = element
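For what it's worth, the knobs I would try tuning next are worker_count and grain's read options. This is an untested sketch; grain.ReadOptions(num_threads=..., prefetch_buffer_size=...) is my reading of the API, so the exact names and defaults may be off:

import grain.python as grain

# Hypothetical tuning: more workers, plus more reader threads per worker.
loader_tuned = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=8,            # e.g. one per physical core
    worker_buffer_size=2,
    read_options=grain.ReadOptions(num_threads=16, prefetch_buffer_size=500),
)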

Okay, now let's use array_record directly and exclude grain. This takes around 27 hours:

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')

# Read one record at a time by index.
for i in tqdm(range(reader.num_records())):
    a = reader.read([i])
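One thing I noticed while writing this up: read() already accepts a list of indices, so the per-call overhead can be amortized by batching. A batched variant looks like this (batch_size is an arbitrary value I picked, and I have not timed it):

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')
num_records = reader.num_records()
batch_size = 1024  # arbitrary; amortizes the per-call overhead

for start in tqdm(range(0, num_records, batch_size)):
    indices = list(range(start, min(start + batch_size, num_records)))
    batch = reader.read(indices)  # one call returns a list of raw records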

Let's see what happens if I use the Hugging Face dataset. This takes about 1 minute:

from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
for element in tqdm(ds['train']):
    a = element

With array_record's default parameters, compression is approximately 1.8x-2x better than the Hugging Face dataset. MaxText, the most optimized open-source library for training LLMs, uses grain and array_record for its training. I was wondering: since forward passes of large models take much more time than iterating over data, is this simply not a performance issue in practice, or am I doing something terribly wrong here?
