Describe the bug
In the Data curation for DAPT tutorial (tutorials/dapt-curation), when a file must be decoded with an encoding that the runtime does not support (e.g., Vietnamese VISCII in this case), the program raises a LookupError, which the current exception handling does not catch. This causes the program to fail unexpectedly, and the whole repo goes unparsed.
Steps/Code to reproduce bug
I have created a repo that contains only the file that triggers this error, available here: ggcr/nvidia-nemo-error-report. To reproduce, follow these steps:
Clone NeMo-Curator
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/
Add the GitHub repo, which contains a standalone file made to reproduce this issue, to the list of repos to curate:
$ cd tutorials/dapt-curation/code
$ python3 main.py --n-workers 2
In my case, the run logs the following output:
Args: Namespace(device='cpu', files_per_partition=2, n_workers=2, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1)
Download directory: /private/tmp/NeMo-Curator/tutorials/dapt-curation/code/data/raw/wikipedia
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/HVM'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Parallel%20computing'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Number%20Assignment%20Module'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Separation%20of%20concerns'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Operand%20forwarding'...
...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Memory%20rank'...
Traceback (most recent call last):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 257, in <module>
    main()
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 240, in main
    text_files, code_files = download_sources(100, 100, 100)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 73, in download_sources
    github_dir = download_github_sources(
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/downloaders.py", line 168, in download_github_sources
    dataset.persist()
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/datasets/doc_dataset.py", line 38, in persist
    return DocumentDataset(self.df.persist())
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask_expr/_collection.py", line 447, in persist
    return DaskMethodsMixin.persist(out, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 345, in persist
    (result,) = persist(self, traverse=False, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 999, in persist
    results = schedule(dsk, keys, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/download/doc_builder.py", line 127, in _download_and_extract_single_partition
    for item in iterator.iterate(downloaded_file):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 335, in iterate
    parsed = self.parse_file(zip_ref, file_info)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 285, in parse_file
    content = content.decode(encoding)
LookupError: unknown encoding: VISCII
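The failure can be reproduced in isolation, without running the tutorial pipeline. The following is a minimal sketch, assuming the Python runtime ships no codec named VISCII (as the traceback above shows for Python 3.10). The helper names `try_decode` and `codec_available` are hypothetical and are not part of NeMo-Curator:

```python
import codecs


def try_decode(content: bytes, encoding: str) -> str:
    """Mimic a handler that only catches UnicodeDecodeError (hypothetical helper)."""
    try:
        return content.decode(encoding)
    except UnicodeDecodeError:
        # Reached only when the codec exists but the bytes are invalid for it.
        return ""


def codec_available(encoding: str) -> bool:
    """Return True if the runtime has a codec registered under this name."""
    try:
        codecs.lookup(encoding)
        return True
    except LookupError:
        return False
```

Since LookupError is not a subclass of UnicodeDecodeError, calling `try_decode(b"abc", "VISCII")` propagates the LookupError to the caller, which matches the crash reported above.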
Proposed solution
In the current implementation of parse_file (NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py, lines 275 to 288 in 7272ca0), the exception handling only catches UnicodeDecodeError. This can be updated to also catch LookupError.
While this works, decoding will still fail for any encoding that is not available in the runtime system. It would be nicer to parse line by line instead: that way we would only skip the offending line and not ditch the whole GitHub repo from the curation process. However, parsing files line by line might introduce a lot of overhead for big repos.
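As a rough illustration of both ideas, here is a minimal sketch. It is not the actual parse_file code: `safe_decode` and `decode_lines` are hypothetical helpers, and returning None to signal "skip this input" is an assumption about how a caller might handle the failure.

```python
from typing import List, Optional


def safe_decode(content: bytes, encoding: Optional[str]) -> Optional[str]:
    """Decode bytes, returning None instead of raising when decoding is impossible."""
    if not encoding:
        return None
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are invalid for this codec.
        # LookupError: the runtime has no codec with this name (e.g. VISCII).
        return None


def decode_lines(content: bytes, encoding: str) -> List[str]:
    """Decode line by line, dropping only the lines that fail to decode.

    Splitting raw bytes on newlines assumes an ASCII-compatible encoding.
    """
    decoded = []
    for raw_line in content.splitlines():
        text = safe_decode(raw_line, encoding)
        if text is not None:
            decoded.append(text)
    return decoded
```

Note that line-by-line decoding only helps when individual lines contain invalid bytes. When the codec itself is unknown, every line fails with the same LookupError, so `decode_lines` returns an empty list and the file is effectively skipped either way.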
Environment overview (please complete the following information)
Environment location: Local (MacBook M3)
Method of NeMo-Curator install: new conda env with Python 3.10, then pip install
Environment details
OS version: macOS 14.3
Dask version: dask 2024.12.0
Python version: Python 3.10.15