Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_* #285
Conversation
```python
for record_id in range(NUM_RECORDS):
    # 100 rows are ~5kb
    f.write(
        f'{{"id": "id_{file_id}_{record_id}", "text": "A longish string {file_id}_{record_id}"}}\n'
```
Observed an interesting thing here with pandas: when "id" is f"{file_id}_{record_id}" (no "id_" prefix), pd.read_json(..., lines=True).dtypes["id"] returns int.
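A quick way to reproduce the surprise (a minimal sketch, assuming a recent pandas; the exact inferred dtype can vary by version):

```python
import io
import pandas as pd

# With ids like "0_1", int()/float() accept the underscore as a digit separator,
# so pandas' JSON dtype inference can coerce the whole column to an integer dtype.
raw = '{"id": "0_1", "text": "a"}\n{"id": "0_2", "text": "b"}\n'
print(pd.read_json(io.StringIO(raw), lines=True).dtypes["id"])  # int64 (observed)

# Prefixing with a non-numeric token keeps the column as object (string).
raw = '{"id": "id_0_1", "text": "a"}\n{"id": "id_0_2", "text": "b"}\n'
print(pd.read_json(io.StringIO(raw), lines=True).dtypes["id"])  # object
```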
```python
pytest.param(
    "pandas",
    "parquet",
    marks=pytest.mark.xfail(
```
Marked it as xfail, but this is something we should look at. I would imagine that the errors in test_read_data_select_columns are related, but I could be wrong (or just tired of looking at the same errors).
This is demonstrating the note in the README: "filename column cannot be selected when backend="pandas", filetype="parquet" and add_columns=True"?
This is saying that the "filename" column does not get added when backend="pandas" and file_type="parquet"? Or are you just saying it gets dropped by the select_columns function?
df.columns doesn't return filename as a column, and I believe that's because filename is a reserved field in the pyarrow schema. However, if we were to change our filename field to file_name / path, things would work as expected. So if that sounds good, I'll create a follow-up issue and do the renaming in the next PR (and also remove the xfail).
Yes, that works for me. Thanks for the explanation.
```python
columns.append("filename")
df = df[columns]

df = df[sorted(df.columns)]
```
The sorting also isn't respected under the read_files_fpp path. Removing the sorting still results in test_read_data_select_columns failures, but far fewer than when we sort.
I like this a lot; the underlying data can have columns whose order isn't respected, so doing this makes a lot of sense, especially for formats like jsonl.
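A small illustration of the point (a sketch using pandas; the data is made up):

```python
import io
import pandas as pd

# Two jsonl records that list their keys in different orders.
raw = '{"text": "a", "id": 1}\n{"id": 2, "text": "b"}\n'
df = pd.read_json(io.StringIO(raw), lines=True)

# Sorting gives a deterministic column order regardless of how the records were written.
df = df[sorted(df.columns)]
print(list(df.columns))  # ['id', 'text']
```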
```python
@pytest.mark.parametrize(
    "cols_to_select", [None, ["id"], ["text", "id"], ["id", "text"]]
)
def test_read_data_select_columns(
```
These are the tests that are failing right now
```python
        columns=None,
        **read_kwargs,
    )
```
Please take a look here, as the behavior when we pass input_meta results in different outputs depending on fpp vs blocksize.
Okay, I think I understand what is being demonstrated here. Even though Pandas read_data_fpp and Pandas/cuDF read_data_blocksize cannot prune columns while reading, can't we still have them prune the columns after reading? I know that doesn't save on I/O, but then at least the user gets the resulting DataFrame they would expect.
That's a good point. Would you want to do a select_columns(list(input_meta.keys())) when input_type == 'jsonl' (since input_meta in its current setting only supports json)? We can do that, though I worry that it'll become entangled with columns (and maybe even add_filename).
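A rough sketch of what that post-read pruning could look like (a hypothetical helper, not this PR's actual code; the parameter names mirror the existing reader options):

```python
def prune_columns(df, input_meta=None, columns=None, add_filename=False):
    """Keep only the columns the caller asked for, applied after the read.

    `columns` wins if given, otherwise fall back to the keys of `input_meta`;
    keep the filename column when it was requested via `add_filename`.
    """
    selected = list(columns) if columns is not None else list((input_meta or {}).keys())
    if not selected:
        return df
    if add_filename and "filename" not in selected:
        selected.append("filename")
    return df[selected]
```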
I think as long as we are thoroughly checking all the conditions we expect to have to do this, it should be okay. Happy to discuss this offline if anything is unclear here.
@sarahyurick I made the changes in the last commit; however, test_io now fails because the test does

```python
dataset = DocumentDataset.read_json(
    temp_file.name, input_meta='{"id": "float"}'
)
output_meta = str({col: str(dtype) for col, dtype in dataset.df.dtypes.items()})
expected_meta = (
    "{'date': 'datetime64[ns, UTC]', 'id': 'float64', 'text': 'object'}"
)
```

I was going to change it but just wanted to confirm that it's okay. FWIW, that test only passes in the pandas backend case; it would have failed if the backend had been cudf, i.e. the new tests are more thorough and caught the behavior difference.
Added a couple more nits, then should be good on my end. Thanks!
tests/test_read_data.py (Outdated)
```python
def test_read_data_blocksize_add_filename_parquet(mock_multiple_parquet_files, backend):
    with pytest.raises(
        ValueError,
        match="add_filename and blocksize cannot be set at the same time for parquet files",
```
match="add_filename and blocksize cannot be set at the same time for parquet files", | |
match="add_filename and blocksize cannot be set at the same time for Parquet files", |
#434 should fix the gpuCI build errors, then we can make sure all GPU tests pass, then merge!
This PR introduces a new codepath that leverages dask_cudf.read_json / read_parquet directly rather than our existing from_map implementation.

Important
The newer implementation supports fewer use-cases compared to the original; however, it supports the most frequently used ones (i.e. jsonl and parquet files without add_filename).
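Roughly, the new path hands the file list and a blocksize straight to dask_cudf (an illustrative sketch under that assumption; the real wiring lives in this PR's diff):

```python
import dask_cudf

# Line-delimited JSON: let dask_cudf split/aggregate files into ~256 MiB partitions.
jsonl_ddf = dask_cudf.read_json("data/*.jsonl", lines=True, blocksize="256 MiB")

# Parquet goes through dask_cudf.read_parquet in the same spirit.
parquet_ddf = dask_cudf.read_parquet("data/*.parquet")
```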
New code gotchas
- blocksize > filesize doesn't work when backend='cpu' and filetype='jsonl', since pandas doesn't support reading multiple files together.

Existing (fpp implementation) unexpected behavior
As we added more tests for the existing code, we uncovered a few cases where it doesn't work as expected:
- The filename column can't be selected when backend='pandas', filetype='parquet', and add_columns=True.
- input_meta with the pandas jsonl path didn't select only a subset of columns. That behavior has been fixed.

Differences between Old vs New
The new dask_cudf.read_* path and the old dd.from_map(read_single_partition) path differ in a few ways:
- Supported file types: the old path also handles json, along with jsonl and parquet; the new path handles jsonl and parquet.
- How input_meta.keys() is used to select columns when input_meta is passed (see the input_meta discussion above).
- How extra reader options are forwarded as **kwargs to the underlying reader.
Follow up Issues
Benchmarking
Reading 6000 files of ~25 MB each, i.e. ~145 GB, over 8 GPUs:
- dask.read_json (#285)
- dask.from_map (#291)
In each case the first run uses add_filename=False and the latter runs use True, where we see a lower utilization.
where we see a lower utilizationUsage
# Add snippet demonstrating usage
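A sketch of what usage could look like with the new argument (illustrative only; blocksize and the read_* entry points follow this PR, everything else is an assumption):

```python
from nemo_curator.datasets import DocumentDataset

# JSONL: combine/split many small files into ~256 MB partitions on the GPU backend.
dataset = DocumentDataset.read_json(
    "/path/to/jsonl_dir/",
    backend="cudf",       # or "pandas" for the CPU path
    blocksize="256MB",    # new in this PR
)

# Parquet works similarly; note that add_filename is not supported on this path.
parquet_dataset = DocumentDataset.read_parquet(
    "/path/to/parquet_dir/",
    backend="cudf",
    blocksize="256MB",
)

print(dataset.df.head())
```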
Checklist