LookupError not caught during Encoding handling #411

Open
ggcr opened this issue Dec 6, 2024 · 0 comments
Labels
bug Something isn't working


ggcr commented Dec 6, 2024

Describe the bug

In the Data curation for DAPT tutorial (tutorials/dapt-curation), when attempting to decode a file with an encoding that is not supported by the system (e.g., Vietnamese VISCII in this case), the program raises a LookupError that is not caught by the current exception handling. This causes the program to fail unexpectedly and, in this case, to skip parsing the whole repo.
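
Outside of NeMo-Curator, this is plain Python behavior: decoding with a codec name the runtime does not know raises LookupError rather than UnicodeDecodeError, e.g.:

>>> b"some bytes".decode("VISCII")
Traceback (most recent call last):
  ...
LookupError: unknown encoding: VISCII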

Steps/Code to reproduce bug

I have created a repo that contains only the file triggering this error, available here: ggcr/nvidia-nemo-error-report. To reproduce, I follow these steps:

  1. Clone NeMo-Curator:
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/
  2. Add the GitHub repo containing a standalone file that reproduces this issue to the list of repos to curate:
$ echo '"ggcr/nvidia-nemo-error-report"' >> tutorials/dapt-curation/code/sources/github_repos.jsonl
  3. Run the tutorial:
$ cd tutorials/dapt-curation/code
$ python3 main.py --n-workers 2

In my case, the run produces the following output:

Args:  Namespace(device='cpu', files_per_partition=2, n_workers=2, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1)
Download directory:  /private/tmp/NeMo-Curator/tutorials/dapt-curation/code/data/raw/wikipedia
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/HVM'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Parallel%20computing'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Number%20Assignment%20Module'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Separation%20of%20concerns'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Operand%20forwarding'...
...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Memory%20rank'...
Traceback (most recent call last):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 257, in <module>
    main()
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 240, in main
    text_files, code_files = download_sources(100, 100, 100)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 73, in download_sources
    github_dir = download_github_sources(
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/downloaders.py", line 168, in download_github_sources
    dataset.persist()
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/datasets/doc_dataset.py", line 38, in persist
    return DocumentDataset(self.df.persist())
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask_expr/_collection.py", line 447, in persist
    return DaskMethodsMixin.persist(out, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 345, in persist
    (result,) = persist(self, traverse=False, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 999, in persist
    results = schedule(dsk, keys, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/download/doc_builder.py", line 127, in _download_and_extract_single_partition
    for item in iterator.iterate(downloaded_file):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 335, in iterate
    parsed = self.parse_file(zip_ref, file_info)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 285, in parse_file
    content = content.decode(encoding)
LookupError: unknown encoding: VISCII

Proposed solution

In the current implementation of parse_file, the exception handling only catches UnicodeDecodeError.

# Open the file and read its content. Determine the encoding using cchardet. Skip over binary files.
with zip_ref.open(file_info, "r") as file:
    content = file.read()
    # Determine the encoding of the file
    encoding = chardet.detect(content)["encoding"]
    if not encoding:
        return None
    try:
        content = content.decode(encoding)
    except UnicodeDecodeError:
        # If the file cannot be decoded, return None
        return None

This can be updated to also catch LookupError.

except (UnicodeDecodeError, LookupError):
    return None
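
For context, the two exceptions sit on separate branches of Python's exception hierarchy (UnicodeDecodeError derives from ValueError, while the unknown-codec error derives from LookupError), so the existing handler can never catch this case:

>>> issubclass(LookupError, UnicodeDecodeError)
False
>>> issubclass(UnicodeDecodeError, LookupError)
False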

While this works, the decode still fails whenever the detected encoding is not available on the runtime system, and the whole file is skipped. It would be very nice to parse line by line instead; this way we would only skip the offending lines rather than dropping the whole file (or, as happens now, the whole GitHub repo) from the curation process. However, parsing the file line by line might introduce a lot of overhead for big repos.
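
A rough sketch of that line-by-line idea, assuming the same content bytes and chardet-detected encoding already used in parse_file (the decode_lines helper name is hypothetical, not part of the tutorial code):

import codecs

def decode_lines(content: bytes, encoding: str):
    # Fall back to UTF-8 if the detected codec is not available on this system
    # (codecs.lookup raises LookupError for unknown codec names).
    try:
        codecs.lookup(encoding)
    except LookupError:
        encoding = "utf-8"
    decoded = []
    for line in content.splitlines(keepends=True):
        try:
            decoded.append(line.decode(encoding))
        except UnicodeDecodeError:
            # Skip only the undecodable line instead of dropping the whole file.
            continue
    return "".join(decoded) or None

This keeps the repo and the readable parts of the file in the curation set, at the cost of one decode call per line.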

Environment overview

  • Environment location: Local (MacBook M3)
  • Method of NeMo-Curator install: created a new conda env with Python 3.10 and installed via pip

Environment details

  • OS version: macOS 14.3
  • Dask version: dask 2024.12.0
  • Python version: Python 3.10.15
ggcr added the bug label Dec 6, 2024