
Comparing array_record's ArrayRecordDataSource (+ grain data loader) to Hugging Face #143

Closed

Description

@ZurabDz

Hello,
I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering whether I am using the library the wrong way or whether this is expected.

CPU: 11th Gen Intel® Core™ i7-11800H @ 2.30GHz × 16

First, I created the array_record data:

import json

from array_record.python.array_record_module import ArrayRecordWriter
from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
writer = ArrayRecordWriter('stories.array_record', 'group_size:1')

# Serialize each row as JSON and append it as one record.
for row in tqdm(ds['train']):
    data = json.dumps(row)
    writer.write(data.encode())

writer.close()
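As a quick sanity check that the file round-trips (a sketch I added for clarity; the reader API is the same one used further below, and the decode step just mirrors the write path):

from array_record.python.array_record_module import ArrayRecordReader
import json

reader = ArrayRecordReader('stories.array_record')
# read() takes a list of indices and returns a list of raw byte records.
first = json.loads(reader.read([0])[0].decode())
print(first)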

Now let's try just iterating over it using grain. This takes around 6 hours:

from array_record.python.array_record_data_source import ArrayRecordDataSource
from tqdm import tqdm
import grain.python as grain

source = ArrayRecordDataSource('stories.array_record')

# Sequential sampler: one epoch, one shard, no shuffling.
index_sampler_example = grain.IndexSampler(
    num_records=len(source),
    num_epochs=1,
    shard_options=grain.ShardOptions(
        shard_index=0, shard_count=1, drop_remainder=True),
    shuffle=False,
    seed=0)

loader = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=2,
    worker_buffer_size=2,
)

for element in tqdm(loader, total=len(source)):
    a = element
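For what it's worth, the knobs I would try tuning next are worker_count and grain's read options. This is an untested sketch; grain.ReadOptions(num_threads=..., prefetch_buffer_size=...) is my reading of the API, so the exact names and defaults may be off:

import grain.python as grain

# Hypothetical tuning: more workers, plus more reader threads per worker.
loader_tuned = grain.DataLoader(
    data_source=source,
    sampler=index_sampler_example,
    worker_count=8,            # e.g. one per physical core
    worker_buffer_size=2,
    read_options=grain.ReadOptions(num_threads=16, prefetch_buffer_size=500),
)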

Okay, now let's use array_record directly and exclude grain. This takes around 27 hours:

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')

# Read one record at a time by index.
for i in tqdm(range(reader.num_records())):
    a = reader.read([i])
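One thing I noticed while writing this up: read() already accepts a list of indices, so the per-call overhead can be amortized by batching. A batched variant looks like this (batch_size is an arbitrary value I picked, and I have not timed it):

from array_record.python.array_record_module import ArrayRecordReader
from tqdm import tqdm

reader = ArrayRecordReader('stories.array_record')
num_records = reader.num_records()
batch_size = 1024  # arbitrary; amortizes the per-call overhead

for start in tqdm(range(0, num_records, batch_size)):
    indices = list(range(start, min(start + batch_size, num_records)))
    batch = reader.read(indices)  # one call returns a list of raw records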

Let's see what happens if I use the Hugging Face dataset. This takes about 1 minute:

from datasets import load_dataset
from tqdm import tqdm

ds = load_dataset("roneneldan/TinyStories")
for element in tqdm(ds['train']):
    a = element

With array_record's default parameters, compression is approximately 1.8x-2x better than the Hugging Face dataset. MaxText, the most optimized open-source library for training LLMs, uses grain and array_record for its training. I was wondering: since forward passes of large models take much more time than iterating over data, is this simply not a performance issue in practice, or am I doing something terribly wrong here?
