Hello,
I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering whether I am using the library in the wrong way or whether this is expected.
cpu type: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16
First, I created the array_record data:
import json

from array_record.python.array_record_module import ArrayRecordWriter
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
writer = ArrayRecordWriter('stories.array_record', 'group_size:1')
# Serialize each row as JSON and write it as one record.
for row in tqdm(ds['train']):
    data = json.dumps(row)
    writer.write(data.encode())
writer.close()
Now let's try just iterating over it using grain. This takes around 6 hours:
from array_record.python.array_record_data_source import ArrayRecordDataSource
import grain.python as grain
from tqdm import tqdm

source = ArrayRecordDataSource('stories.array_record')
# Single shard, one epoch, sequential order (no shuffling).
index_sampler_example = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=False,
    seed=0)
loader = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=2,
    worker_buffer_size=2,
)
for element in tqdm(loader, total=len(source)):
    a = element
Okay, now let's just use array_record and exclude grain. This takes around 27 hours:
from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')
# Read one record per call, by index.
for i in tqdm(range(0, reader.num_records())):
    a = reader.read([i])
Let's see what happens if I use the Hugging Face dataset. This takes 1 minute:
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
for element in tqdm(ds['train']):
    a = element
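Since the full passes above take hours, it may be easier to compare the three loops on a fixed number of records and extrapolate. Below is a minimal sketch using only the standard library. The helper names `measure_throughput` and `estimate_total_seconds` are hypothetical (they are not part of grain, array_record, or datasets), and the extrapolation assumes the read rate stays roughly constant over the whole file, which may not hold.

```python
import itertools
import time


def measure_throughput(iterable, max_records=10_000):
    """Iterate over at most max_records items.

    Returns (count, elapsed_seconds, records_per_second).
    """
    start = time.perf_counter()
    count = 0
    for _ in itertools.islice(iterable, max_records):
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


def estimate_total_seconds(total_records, rate):
    """Extrapolate a full pass from a measured rate (assumes a constant rate)."""
    return total_records / rate


# Hypothetical usage with the objects from the snippets above:
#   _, _, rate = measure_throughput(loader)
#   _, _, rate = measure_throughput(
#       reader.read([i]) for i in range(reader.num_records()))
#   _, _, rate = measure_throughput(ds['train'])
#   print(estimate_total_seconds(len(source), rate))
```

Running the same helper over each of the three iterables gives records-per-second figures that can be compared directly, without waiting for a complete epoch.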
With array_record's default parameters, compression is approximately 1.8x-2x better than with the Hugging Face dataset. maxtext, the most optimized open source library for training LLMs, uses grain and array_record for its training. I was wondering whether this is simply not a performance issue, because large model forward passes take much more time than iterating over the data, or whether I am doing something terribly wrong here.