
Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_* #285

Merged

Conversation

@praateekmahajan (Collaborator) commented on Oct 8, 2024

This PR introduces a new codepath that leverages dask_cudf.read_json / read_parquet directly rather than our existing from_map implementation.

Important

The newer implementation supports fewer use cases than the original; however, it covers the most frequently used ones (i.e., jsonl and parquet files without add_filename).

New code gotchas

  1. blocksize > filesize doesn't work when backend='cpu' and filetype='jsonl', since pandas doesn't support reading multiple files together.
  2. If the underlying data has an inconsistent schema / metadata, the read might fail. See Supporting inconsistent schemas in read_json dask/dask#11595.

Existing (fpp implementation) unexpected behavior

As we added more tests for the existing code, we uncovered a few cases where it doesn't work as expected:

  1. filename column can't be selected when backend='pandas', filetype='parquet' and add_columns=True
  2. Specifying input_meta with pandas jsonl didn't select only the subset of columns in input_meta. That behavior has been fixed.

Differences between the old and new implementations

| Discussion point | New implementation | Existing implementation |
| --- | --- | --- |
| Underlying API | dask_cudf.read_* | dd.from_map(read_single_partition) |
| backend | Works with pandas / cudf | Works with pandas / cudf |
| filetype | Only supports jsonl and parquet | Supports json, along with jsonl and parquet |
| add_filename | Only when filetype is jsonl | Supports all filetypes |
| input_meta | Returns only columns in input_meta.keys() | Returns only columns in input_meta.keys() |
| meta as **kwarg | Not required, since the first file is used to infer the schema | Required; otherwise can result in OOM (see benchmark row 2 below) |
| Inconsistent schema | Doesn't always work (see dask/dask#11595) | Supported |
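
To illustrate the meta difference, here is a simplified sketch of the two underlying calls (pseudocode, not the exact NeMo Curator internals; files, read_single_partition, and expected_meta are placeholders):

import dask.dataframe as dd
import dask_cudf

# New path: read the files directly; the schema is inferred from the first file,
# so no explicit meta is required.
ddf_new = dask_cudf.read_json(files, lines=True, blocksize="2gb")

# Existing path: map a per-partition reader over the files. Without an explicit meta,
# Dask may materialize data to infer it, which can OOM on large inputs.
ddf_old = dd.from_map(read_single_partition, files, meta=expected_meta)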

Follow-up issues

  1. filename column inaccessible with pandas backend and parquet #427
  2. Supporting inconsistent schemas in read_json dask/dask#11595
  3. Add include_path_column to read_parquet dask/dask#6575
  4. https://github.com/rapidsai/cudf/pull/17554/files

Benchmarking

Reading 6,000 files of ~25 MB each, i.e. ~145 GB total, over 8 GPUs

| add_filename | partition_size | input_meta | Using dask.read_json (#285) | Providing meta in dask.from_map (#291) |
| --- | --- | --- | --- | --- |
| False | 2gb | Specified | 24.9 s ± 330 ms | 25.9 s ± 520 ms |
| False | 2gb | None | 24.9 s ± 470 ms | OOM |
| True | 2gb | Specified | 55 s ± 177 ms | 53.2 s ± 350 ms per loop |
| True | 2gb | None | 54.8 s ± 248 ms | 64 s ± 289 ms per loop |
[GPU utilization screenshots: left, using dask.read_json (#285); right, providing meta in dask.from_map (#291). In the left screenshot the first two runs are add_filename=False and the latter two are add_filename=True, where utilization is lower; in the right screenshot the first run is add_filename=False and the latter runs are add_filename=True, again with lower utilization.]

Usage

# Add snippet demonstrating usage
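
A rough sketch of possible usage (illustrative only; the path is a placeholder, and the backend / blocksize / input_meta parameter names are taken from the PR discussion, so the exact signature may differ):

from nemo_curator.datasets import DocumentDataset

# Read a directory of jsonl files into ~2 GB partitions via the new blocksize code path,
# which dispatches to dask_cudf.read_json under the hood.
dataset = DocumentDataset.read_json(
    "/path/to/jsonl_dir",
    backend="cudf",
    blocksize="2gb",
    input_meta='{"id": "str", "text": "str"}',  # optional; only these columns are returned
)
df = dataset.df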

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@praateekmahajan changed the title Trying dasks' read_json → Trying dask_cudf's read_json on Oct 8, 2024
@praateekmahajan changed the title Trying dask_cudf's read_json → [DRAFT] Trying dask_cudf's read_json on Oct 9, 2024
@praateekmahajan changed the title [DRAFT] Trying dask_cudf's read_json → [DRAFT] Trying dask_cudf's read_json / read_parquet on Oct 9, 2024
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@praateekmahajan changed the title [DRAFT] Trying dask_cudf's read_json / read_parquet → Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_* on Nov 15, 2024
@praateekmahajan marked this pull request as ready for review November 15, 2024 09:34
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
for record_id in range(NUM_RECORDS):
    # 100 rows are ~5kb
    f.write(
        f'{{"id": "id_{file_id}_{record_id}", "text": "A longish string {file_id}_{record_id}"}}\n'
    )
praateekmahajan (author) commented:

Observed an interesting thing here with pandas: when "id" is f"{file_id}_{record_id}", pd.read_json(..., lines=True).dtypes["id"] returns int.
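
A minimal repro sketch of that behavior (illustrative; worth verifying against your pandas version):

import io

import pandas as pd

# Two jsonl records whose "id" strings contain only digits and underscores.
jsonl = '{"id": "0_1", "text": "a"}\n{"id": "0_2", "text": "b"}\n'
df = pd.read_json(io.StringIO(jsonl), lines=True)
# Reportedly inferred as an integer dtype here; prefixing the value with "id_"
# (as the test above does) keeps the column as object.
print(df.dtypes["id"])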

pytest.param(
    "pandas",
    "parquet",
    marks=pytest.mark.xfail(
praateekmahajan (author) commented:

Marked it as xfail but this is something that we should look at. I would imagine that the errors in test_read_data_select_columns are related, but I could be wrong (or just tired looking at the same errors)

A reviewer (Collaborator) commented:

Is this demonstrating the note in the README, "filename column cannot be selected when backend="pandas", filetype="parquet" and add_columns=True"?

This is saying that the "filename" column does not get added when backend="pandas" and file_type="parquet"? Or are you just saying it gets dropped by the select_columns function?

praateekmahajan (author) replied:

df.columns doesn't return filename as a column, and I believe that's because filename is a reserved field in the pyarrow schema. However, if we were to change our filename field to file_name / path, things would work as expected. So if that sounds good, I'll create a follow-up issue and do the renaming in the next PR (and also remove the xfail).

The reviewer replied:

Yes, that works for me. Thanks for the explanation.

columns.append("filename")
df = df[columns]

df = df[sorted(df.columns)]
praateekmahajan (author) commented:

The sorting also isn't respected under the read_files_fpp path. Removing the sorting still results in test_read_data_select_columns failures but far fewer than when we sort.

The reviewer replied:

I like this a lot; the underlying data can have columns whose order isn't consistent, so doing this makes a lot of sense, especially for formats like jsonl.

@pytest.mark.parametrize(
    "cols_to_select", [None, ["id"], ["text", "id"], ["id", "text"]]
)
def test_read_data_select_columns(
praateekmahajan (author) commented:

These are the tests that are failing right now

columns=None,
**read_kwargs,
)

praateekmahajan (author) commented:

Please take a look here, as passing input_meta results in different outputs depending on the fpp vs. blocksize path.

The reviewer replied:

Okay, I think I understand what is being demonstrated here.

Even though Pandas read_data_fpp and Pandas/cuDF read_data_blocksize cannot prune columns while reading, can't we still have it prune the columns after reading? I know that doesn't save on I/O but then at least the user is getting the resulting DataFrame that they would expect.
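
A sketch of what that post-read pruning could look like (a hypothetical helper, not the actual NeMo Curator code path):

import dask.dataframe as dd

def read_then_prune(paths, input_meta=None, **read_kwargs):
    # Read everything first (no I/O savings), then keep only the columns the caller asked for.
    df = dd.read_json(paths, lines=True, **read_kwargs)
    if input_meta:
        df = df[list(input_meta.keys())]
    return df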

praateekmahajan (author) replied:

That's a good point. Would you want to do a select_columns([input_meta.keys()]) when input_type == 'jsonl' (since input_meta in its current form only supports json)? We can do that, though I worry that it'll become conflated with columns (and maybe even add_filename).

The reviewer replied:

I think as long as we are thoroughly checking all the conditions we expect to have to do this, it should be okay. Happy to discuss this offline if anything is unclear here.

praateekmahajan (author) commented on Dec 13, 2024:

@sarahyurick I made the changes in the last commit; however, test_io now fails, since the test does

dataset = DocumentDataset.read_json(
    temp_file.name, input_meta='{"id": "float"}'
)

output_meta = str({col: str(dtype) for col, dtype in dataset.df.dtypes.items()})

expected_meta = (
    "{'date': 'datetime64[ns, UTC]', 'id': 'float64', 'text': 'object'}"
)

I was going to change it but just wanted to confirm that it's okay. FWIW, that test only passes in the pandas backend case, and would have failed if the backend had been cudf, i.e., the new tests are more thorough and caught the behavior difference.
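
Presumably the expectation would need to change to reflect that input_meta now selects only its keys (per the differences table above), e.g. something like:

expected_meta = "{'id': 'float64'}"  # assumed new expectation: only the input_meta column remains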

Signed-off-by: Praateek <[email protected]>
@praateekmahajan requested review from VibhuJawa and removed the request for sarahyurick December 7, 2024 04:41
@sarahyurick (Collaborator) left a review comment:

Added a couple more nits, then should be good on my end. Thanks!

praateekmahajan and others added 10 commits December 15, 2024 02:36
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
…kmahajan/NeMo-Curator into praateek/try-dask-cudf-read-json
def test_read_data_blocksize_add_filename_parquet(mock_multiple_parquet_files, backend):
    with pytest.raises(
        ValueError,
        match="add_filename and blocksize cannot be set at the same time for parquet files",
A reviewer (Collaborator) commented:

Suggested change:
- match="add_filename and blocksize cannot be set at the same time for parquet files",
+ match="add_filename and blocksize cannot be set at the same time for Parquet files",

Signed-off-by: Praateek <[email protected]>
@sarahyurick added the gpuci (Run GPU CI/CD on PR) label Dec 16, 2024
@praateekmahajan added and removed the gpuci (Run GPU CI/CD on PR) label Dec 16, 2024
@sarahyurick (Collaborator) left a review comment:

#434 should fix the gpuCI build errors, then we can make sure all GPU tests pass, then merge!

@praateekmahajan added and removed the gpuci (Run GPU CI/CD on PR) label Dec 17, 2024
@sarahyurick added and removed the gpuci (Run GPU CI/CD on PR) label Dec 17, 2024
@sarahyurick merged commit e820b8b into NVIDIA:main Dec 17, 2024
5 checks passed