LookupError not caught during Encoding handling #411

Open
ggcr opened this issue Dec 6, 2024 · 0 comments
Labels
bug Something isn't working


ggcr commented Dec 6, 2024

Describe the bug

In the Data curation for DAPT tutorial (tutorials/dapt-curation), when attempting to decode a file with an encoding that is not supported by the system (e.g., Vietnamese VISCII in this case), the program raises a LookupError that is not caught by the current exception handling. This causes the program to fail unexpectedly and, in this case, to skip parsing the whole repo.
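
Outside of NeMo-Curator, this is plain Python behavior: decoding with a codec name the runtime does not know raises LookupError rather than UnicodeDecodeError, e.g.:

>>> b"some bytes".decode("VISCII")
Traceback (most recent call last):
  ...
LookupError: unknown encoding: VISCII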

Steps/Code to reproduce bug

I have created a repo that contains only the file triggering this error, available here: ggcr/nvidia-nemo-error-report. To reproduce, I follow these steps:

  1. Clone NeMo-Curator:
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/
  2. Add the GitHub repo containing a standalone file that reproduces this issue to the list of repos to curate:
$ echo '"ggcr/nvidia-nemo-error-report"' >> tutorials/dapt-curation/code/sources/github_repos.jsonl
  3. Run the tutorial:
$ cd tutorials/dapt-curation/code
$ python3 main.py --n-workers 2

In my case, the run produces the following output:

Args:  Namespace(device='cpu', files_per_partition=2, n_workers=2, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1)
Download directory:  /private/tmp/NeMo-Curator/tutorials/dapt-curation/code/data/raw/wikipedia
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/HVM'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Parallel%20computing'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Number%20Assignment%20Module'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Separation%20of%20concerns'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Operand%20forwarding'...
...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Memory%20rank'...
Traceback (most recent call last):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 257, in <module>
    main()
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 240, in main
    text_files, code_files = download_sources(100, 100, 100)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 73, in download_sources
    github_dir = download_github_sources(
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/downloaders.py", line 168, in download_github_sources
    dataset.persist()
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/datasets/doc_dataset.py", line 38, in persist
    return DocumentDataset(self.df.persist())
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask_expr/_collection.py", line 447, in persist
    return DaskMethodsMixin.persist(out, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 345, in persist
    (result,) = persist(self, traverse=False, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 999, in persist
    results = schedule(dsk, keys, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/download/doc_builder.py", line 127, in _download_and_extract_single_partition
    for item in iterator.iterate(downloaded_file):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 335, in iterate
    parsed = self.parse_file(zip_ref, file_info)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 285, in parse_file
    content = content.decode(encoding)
LookupError: unknown encoding: VISCII

Proposed solution

In the current implementation of parse_file, the exception handling only catches UnicodeDecodeError.

# Open the file and read its content. Determine the encoding using cchardet. Skip over binary files.
with zip_ref.open(file_info, "r") as file:
    content = file.read()
    # Determine the encoding of the file
    encoding = chardet.detect(content)["encoding"]
    if not encoding:
        return None
    try:
        content = content.decode(encoding)
    except UnicodeDecodeError:
        # If the file cannot be decoded, return None
        return None

This can be updated to also catch LookupError.

except (UnicodeDecodeError, LookupError):
    return None
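
For context, the two exceptions sit on separate branches of Python's exception hierarchy (UnicodeDecodeError derives from ValueError, while the unknown-codec error derives from LookupError), so the existing handler can never catch this case:

>>> issubclass(LookupError, UnicodeDecodeError)
False
>>> issubclass(UnicodeDecodeError, LookupError)
False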

While this works, the decode still fails whenever the detected encoding is not available on the runtime system, and the whole file is skipped. It would be very nice to parse line by line instead; this way we would only skip the offending lines rather than dropping the whole file (or, as happens now, the whole GitHub repo) from the curation process. However, parsing the file line by line might introduce a lot of overhead for big repos.
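
A rough sketch of that line-by-line idea, assuming the same content bytes and chardet-detected encoding already used in parse_file (the decode_lines helper name is hypothetical, not part of the tutorial code):

import codecs

def decode_lines(content: bytes, encoding: str):
    # Fall back to UTF-8 if the detected codec is not available on this system
    # (codecs.lookup raises LookupError for unknown codec names).
    try:
        codecs.lookup(encoding)
    except LookupError:
        encoding = "utf-8"
    decoded = []
    for line in content.splitlines(keepends=True):
        try:
            decoded.append(line.decode(encoding))
        except UnicodeDecodeError:
            # Skip only the undecodable line instead of dropping the whole file.
            continue
    return "".join(decoded) or None

This keeps the repo and the readable parts of the file in the curation set, at the cost of one decode call per line.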

Environment overview

  • Environment location: Local (MacBook M3)
  • Method of NeMo-Curator install: created a new conda env with Python 3.10 and installed via pip

Environment details

  • OS version: macOS 14.3
  • Dask version: dask 2024.12.0
  • Python version: Python 3.10.15
ggcr added the bug label Dec 6, 2024