
Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_* #285

Merged

Conversation

@praateekmahajan (Collaborator) commented on Oct 8, 2024

This PR introduces a new codepath that leverages dask_cudf.read_json / read_parquet directly rather than our existing from_map implementation.

Important

The newer implementation supports fewer use cases than the original; however, it covers the most frequently used ones (i.e., jsonl and parquet files without add_filename).

New code gotchas

  1. blocksize > filesize doesn't work when backend='cpu' and filetype='jsonl', since pandas doesn't support reading multiple files together.
  2. If the underlying data has an inconsistent schema / metadata, the read might fail. See Supporting inconsistent schemas in read_json dask/dask#11595.

Existing (fpp implementation) unexpected behavior

As we added more tests for the existing code, we uncovered a few cases where it doesn't work as expected:

  1. filename column can't be selected when backend='pandas', filetype='parquet' and add_columns=True
  2. Specifying input_meta with pandas jsonl didn't select only the subset of columns in input_meta. That behavior has been fixed.

Differences between the old and new implementations

| Discussion point | New implementation | Existing implementation |
| --- | --- | --- |
| Underlying API | dask_cudf.read_* | dd.from_map(read_single_partition) |
| backend | Works with pandas / cudf | Works with pandas / cudf |
| filetype | Only supports jsonl and parquet | Supports json, along with jsonl and parquet |
| add_filename | Only when filetype is jsonl | Supports all filetypes |
| input_meta | Returns only columns in input_meta.keys() | Returns only columns in input_meta.keys() |
| meta as **kwarg | Not required, since the first file is used to infer the schema | Required; otherwise can result in OOM (see benchmark row 2 below) |
| Inconsistent schema | Doesn't always work (see dask/dask#11595) | Supported |
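
To illustrate the meta difference, here is a simplified sketch of the two underlying calls (pseudocode, not the exact NeMo Curator internals; files, read_single_partition, and expected_meta are placeholders):

import dask.dataframe as dd
import dask_cudf

# New path: read the files directly; the schema is inferred from the first file,
# so no explicit meta is required.
ddf_new = dask_cudf.read_json(files, lines=True, blocksize="2gb")

# Existing path: map a per-partition reader over the files. Without an explicit meta,
# Dask may materialize data to infer it, which can OOM on large inputs.
ddf_old = dd.from_map(read_single_partition, files, meta=expected_meta)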

Follow-up issues

  1. filename column inaccessible with pandas backend and parquet #427
  2. Supporting inconsistent schemas in read_json dask/dask#11595
  3. Add include_path_column to read_parquet dask/dask#6575
  4. https://github.com/rapidsai/cudf/pull/17554/files

Benchmarking

Reading 6,000 files of ~25 MB each, i.e. ~145 GB total, over 8 GPUs

| add_filename | partition_size | input_meta | Using dask.read_json (#285) | Providing meta in dask.from_map (#291) |
| --- | --- | --- | --- | --- |
| False | 2gb | Specified | 24.9 s ± 330 ms | 25.9 s ± 520 ms |
| False | 2gb | None | 24.9 s ± 470 ms | OOM |
| True | 2gb | Specified | 55 s ± 177 ms | 53.2 s ± 350 ms per loop |
| True | 2gb | None | 54.8 s ± 248 ms | 64 s ± 289 ms per loop |
[GPU utilization screenshots: left, using dask.read_json (#285); right, providing meta in dask.from_map (#291). In the left screenshot the first two runs are add_filename=False and the latter two are add_filename=True, where utilization is lower; in the right screenshot the first run is add_filename=False and the latter runs are add_filename=True, again with lower utilization.]

Usage

# Add snippet demonstrating usage
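
A rough sketch of possible usage (illustrative only; the path is a placeholder, and the backend / blocksize / input_meta parameter names are taken from the PR discussion, so the exact signature may differ):

from nemo_curator.datasets import DocumentDataset

# Read a directory of jsonl files into ~2 GB partitions via the new blocksize code path,
# which dispatches to dask_cudf.read_json under the hood.
dataset = DocumentDataset.read_json(
    "/path/to/jsonl_dir",
    backend="cudf",
    blocksize="2gb",
    input_meta='{"id": "str", "text": "str"}',  # optional; only these columns are returned
)
df = dataset.df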

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@praateekmahajan changed the title Trying dasks' read_json → Trying dask_cudf's read_json on Oct 8, 2024
@praateekmahajan changed the title Trying dask_cudf's read_json → [DRAFT] Trying dask_cudf's read_json on Oct 9, 2024
@praateekmahajan changed the title [DRAFT] Trying dask_cudf's read_json → [DRAFT] Trying dask_cudf's read_json / read_parquet on Oct 9, 2024
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@praateekmahajan changed the title [DRAFT] Trying dask_cudf's read_json / read_parquet → Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_* on Nov 15, 2024
@praateekmahajan marked this pull request as ready for review November 15, 2024 09:34
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
for record_id in range(NUM_RECORDS):
    # 100 rows are ~5kb
    f.write(
        f'{{"id": "id_{file_id}_{record_id}", "text": "A longish string {file_id}_{record_id}"}}\n'
    )
praateekmahajan (author) commented:

Observed an interesting thing here with pandas: when "id" is f"{file_id}_{record_id}", pd.read_json(..., lines=True).dtypes["id"] returns int.
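
A minimal repro sketch of that behavior (illustrative; worth verifying against your pandas version):

import io

import pandas as pd

# Two jsonl records whose "id" strings contain only digits and underscores.
jsonl = '{"id": "0_1", "text": "a"}\n{"id": "0_2", "text": "b"}\n'
df = pd.read_json(io.StringIO(jsonl), lines=True)
# Reportedly inferred as an integer dtype here; prefixing the value with "id_"
# (as the test above does) keeps the column as object.
print(df.dtypes["id"])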

pytest.param(
    "pandas",
    "parquet",
    marks=pytest.mark.xfail(
praateekmahajan (author) commented:

Marked it as xfail but this is something that we should look at. I would imagine that the errors in test_read_data_select_columns are related, but I could be wrong (or just tired looking at the same errors)

A reviewer (Collaborator) commented:

Is this demonstrating the note in the README, "filename column cannot be selected when backend="pandas", filetype="parquet" and add_columns=True"?

This is saying that the "filename" column does not get added when backend="pandas" and file_type="parquet"? Or are you just saying it gets dropped by the select_columns function?

praateekmahajan (author) replied:

df.columns doesn't return filename as a column, and I believe that's because filename is a reserved field in the pyarrow schema. However, if we were to change our filename field to file_name / path, things would work as expected. So if that sounds good, I'll create a follow-up issue and do the renaming in the next PR (and also remove the xfail).

The reviewer replied:

Yes, that works for me. Thanks for the explanation.

columns.append("filename")
df = df[columns]

df = df[sorted(df.columns)]
praateekmahajan (author) commented:

The sorting also isn't respected under the read_files_fpp path. Removing the sorting still results in test_read_data_select_columns failures but far fewer than when we sort.

The reviewer replied:

I like this a lot; the underlying data can have columns whose order isn't consistent, so doing this makes a lot of sense, especially for formats like jsonl.

@pytest.mark.parametrize(
    "cols_to_select", [None, ["id"], ["text", "id"], ["id", "text"]]
)
def test_read_data_select_columns(
praateekmahajan (author) commented:

These are the tests that are failing right now

columns=None,
**read_kwargs,
)

praateekmahajan (author) commented:

Please take a look here, as passing input_meta results in different outputs depending on the fpp vs. blocksize path.

The reviewer replied:

Okay, I think I understand what is being demonstrated here.

Even though Pandas read_data_fpp and Pandas/cuDF read_data_blocksize cannot prune columns while reading, can't we still have it prune the columns after reading? I know that doesn't save on I/O but then at least the user is getting the resulting DataFrame that they would expect.
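
A sketch of what that post-read pruning could look like (a hypothetical helper, not the actual NeMo Curator code path):

import dask.dataframe as dd

def read_then_prune(paths, input_meta=None, **read_kwargs):
    # Read everything first (no I/O savings), then keep only the columns the caller asked for.
    df = dd.read_json(paths, lines=True, **read_kwargs)
    if input_meta:
        df = df[list(input_meta.keys())]
    return df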

praateekmahajan (author) replied:

That's a good point. Would you want to do a select_columns([input_meta.keys()]) when input_type == 'jsonl' (since input_meta in its current form only supports json)? We can do that, though I worry that it'll become conflated with columns (and maybe even add_filename).

The reviewer replied:

I think as long as we are thoroughly checking all the conditions we expect to have to do this, it should be okay. Happy to discuss this offline if anything is unclear here.

praateekmahajan (author) commented on Dec 13, 2024:

@sarahyurick I made the changes in the last commit; however, test_io now fails, since the test does

dataset = DocumentDataset.read_json(
    temp_file.name, input_meta='{"id": "float"}'
)

output_meta = str({col: str(dtype) for col, dtype in dataset.df.dtypes.items()})

expected_meta = (
    "{'date': 'datetime64[ns, UTC]', 'id': 'float64', 'text': 'object'}"
)

I was going to change it but just wanted to confirm that it's okay. FWIW, that test only passes in the pandas backend case, and would have failed if the backend had been cudf, i.e., the new tests are more thorough and caught the behavior difference.
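
Presumably the expectation would need to change to reflect that input_meta now selects only its keys (per the differences table above), e.g. something like:

expected_meta = "{'id': 'float64'}"  # assumed new expectation: only the input_meta column remains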

Signed-off-by: Praateek <[email protected]>
@praateekmahajan requested review from VibhuJawa and removed the request for sarahyurick December 7, 2024 04:41
@sarahyurick (Collaborator) left a review comment:

Added a couple more nits, then should be good on my end. Thanks!

praateekmahajan and others added 10 commits December 15, 2024 02:36
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
…kmahajan/NeMo-Curator into praateek/try-dask-cudf-read-json
def test_read_data_blocksize_add_filename_parquet(mock_multiple_parquet_files, backend):
    with pytest.raises(
        ValueError,
        match="add_filename and blocksize cannot be set at the same time for parquet files",
A reviewer (Collaborator) commented:

Suggested change:
- match="add_filename and blocksize cannot be set at the same time for parquet files",
+ match="add_filename and blocksize cannot be set at the same time for Parquet files",

Signed-off-by: Praateek <[email protected]>
@sarahyurick added the gpuci (Run GPU CI/CD on PR) label Dec 16, 2024
@praateekmahajan added and removed the gpuci (Run GPU CI/CD on PR) label Dec 16, 2024
@sarahyurick (Collaborator) left a review comment:

#434 should fix the gpuCI build errors, then we can make sure all GPU tests pass, then merge!

@praateekmahajan added and removed the gpuci (Run GPU CI/CD on PR) label Dec 17, 2024
@sarahyurick added and removed the gpuci (Run GPU CI/CD on PR) label Dec 17, 2024
@sarahyurick merged commit e820b8b into NVIDIA:main Dec 17, 2024
5 checks passed