
comparing array_record's ArrayRecordDataSource + (grain data loader) to huggingface #143

Open
ZurabDz opened this issue Jan 3, 2025 · 1 comment


@ZurabDz

ZurabDz commented Jan 3, 2025

Hello,
I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering if I am using the library in the wrong way or if this is expected.

CPU: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16

First, I created the array_record data:

import json

from array_record.python.array_record_module import ArrayRecordWriter
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
writer = ArrayRecordWriter('stories.array_record', 'group_size:1')

# Serialize each row as JSON and write it as one record.
for row in tqdm(ds['train']):
    data = json.dumps(row)
    writer.write(data.encode())

writer.close()
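To sanity-check the file, here is a quick read-back of the first record (a minimal sketch; it uses the reader API that comes up later in this thread):

from array_record.python.array_record_module import ArrayRecordReader
import json

reader = ArrayRecordReader('stories.array_record')
# read() with a list of indices returns the matching serialized records.
first = reader.read([0])[0]
print(json.loads(first))  # should match ds['train'][0]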

Now let's try just iterating over it using grain. This takes around 6 hours:

from array_record.python.array_record_data_source import ArrayRecordDataSource
import grain.python as grain
from tqdm import tqdm

source = ArrayRecordDataSource('stories.array_record')

index_sampler_example = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=False,
    seed=0)

loader = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=2,
    worker_buffer_size=2,
)

for element in tqdm(loader, total=len(source)):
    a = element

Okay, now let's use array_record directly and exclude grain. This takes around 27 hours:

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')

# Reads a single record per call.
for i in tqdm(range(reader.num_records())):
    a = reader.read([i])

Let's see what happens if I use the Hugging Face dataset instead: about 1 minute.

from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
for element in tqdm(ds['train']):
    a = element

With default array_record parameters, compression is approximately 1.8x-2x better than the Hugging Face dataset. maxtext, one of the most optimized open-source libraries for training LLMs, uses grain and array_record for its training. I was wondering: since large-model forward passes take much more time than iterating over data, is this not a performance issue in practice, or am I doing something terribly wrong here?

@dryman
Collaborator

dryman commented Jan 5, 2025

ArrayRecord works best when using its ParallelRead methods, which utilize its internal threadpool.

In Python, these methods are exposed by supplying a list of indices, a range to read, or by calling read_all:

  1. records = reader.read(list(range(100)))  # reads records by indices 0..99; any list of indices would do
  2. records = reader.read(0, 99)             # reads records 0..99
  3. all_records = reader.read_all()          # reads all records

You should be able to see a performance boost after switching to these methods.
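For example, here is a minimal sketch of the iteration benchmark above rewritten to read in chunks through read() with a list of indices (the chunk size of 1024 is an arbitrary choice, not a recommended value):

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')

chunk_size = 1024  # arbitrary; tune for your workload
num_records = reader.num_records()

for start in tqdm(range(0, num_records, chunk_size)):
    end = min(start + chunk_size, num_records)
    # A single read() call with many indices goes through the internal
    # threadpool, unlike the one-record-per-call loop in the benchmark.
    records = reader.read(list(range(start, end)))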
