Pii Modifier should work with DocumentDataset on cudf #418

Open
praateekmahajan opened this issue Dec 10, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@praateekmahajan
Collaborator

Is your feature request related to a problem? Please describe.

(Not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset filtering, e.g. in #417.)

I noticed an oddity in the PII examples / scripts / docs: PII modification doesn't work when the dataset is read with DocumentDataset.read_*(backend="cudf"). The reason is that:

  1. We call text.tolist() here
  2. cudf.Series doesn't support tolist() (here)

All of the examples / scripts / docs read the dataset using dask (pandas), yet pass device='gpu' to the Modifier.

Describe the solution you'd like
The code should work with a DocumentDataset read with backend="cudf".
I think we might just need to_arrow().to_pylist() instead of tolist() when the series is a cudf type.
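
A minimal sketch of that idea, in case it helps the discussion. The helper name `series_to_list` is hypothetical and not part of the existing code; it assumes the series is either a pandas Series or a cudf Series, and detects cudf by module name so that cudf does not have to be imported on CPU-only machines.

```python
def series_to_list(series):
    """Return the values of a pandas or cudf Series as a plain Python list.

    Hypothetical helper for illustration only; not an existing function
    in the codebase.
    """
    # cudf refuses Series.tolist(), so check which library the object
    # comes from by looking at the top-level module of its type.
    if type(series).__module__.split(".")[0] == "cudf":
        # Copy the device data to host as an Arrow array, then build the list.
        return series.to_arrow().to_pylist()
    # pandas Series (host memory) keeps the existing code path.
    return series.tolist()
```

The existing text.tolist() call could then become series_to_list(text), which keeps the pandas path unchanged. Note that this still copies the whole partition to host memory, consistent with the spill-to-host caveat above.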

praateekmahajan added the enhancement label Dec 10, 2024