NoneType for images in pdf.py #72

HuyLe82US · 2024-10-23T05:10:54Z

When I tried to OCR a .pdf file, I got this error.
Here is the log:

ERROR:root:Error converting PDF to images: Unable to get page count. Is poppler installed and in PATH?
Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 149, in zerox
    results = await process_pages_in_batches(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\processor\pdf.py", line 104, in process_pages_in_batches
    for image in images
                 ^^^^^^
TypeError: 'NoneType' object is not iterable

I have already installed poppler-utils, and I also checked that the package is present in the project.
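
Reading the log above, the first line (the Poppler error) looks like the real failure, and the TypeError appears to be a follow-on: the image list is None by the time it is iterated. A minimal sketch of a guard that would surface the root cause earlier is below. The helper name `require_images` is hypothetical and not part of pyzerox.

```python
from typing import List, Optional


def require_images(images: Optional[List[str]]) -> List[str]:
    """Fail early with a clear message if the conversion step returned None.

    Hypothetical helper for illustration; not part of pyzerox.
    """
    if images is None:
        raise RuntimeError(
            "PDF-to-image conversion returned no images; "
            "check that Poppler is installed and on PATH."
        )
    return images
```

Called right after the conversion step, this would replace the confusing "'NoneType' object is not iterable" with a message that points at Poppler.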


HuyLe82US · 2024-10-23T05:30:33Z

I found out the cause and here is the solution from ChatGPT:

Steps to Resolve

Install Poppler:
- Poppler is required for converting PDF pages into images. You need to install it on your system.
On Windows:
- Download the Poppler binaries from Poppler for Windows.
- Extract the zip file to a folder (e.g., C:\poppler).
On macOS:
- You can install Poppler via Homebrew:
```
brew install poppler
```
On Linux (Debian/Ubuntu):
- Install Poppler using the package manager:
```
sudo apt-get install poppler-utils
```
Add Poppler to System PATH:

If you're on Windows, you'll need to add the bin folder from the Poppler installation to your system's PATH.

Adding Poppler to PATH (Windows):
1. Right-click on This PC or My Computer and go to Properties.
2. Click on Advanced system settings.
3. In the System Properties window, click on the Environment Variables button.
4. Under System variables, find the Path variable, and click Edit.
5. Click New and add the path to the Poppler bin directory (e.g., C:\poppler\bin).
6. Click OK to close all the windows.
Verify Poppler Installation:

After installing Poppler and adding it to the PATH, verify that it’s correctly set up by running the following command in your terminal (command prompt or shell):
```
pdftoppm -h
```
This should display help information for pdftoppm, one of the tools included with Poppler. If you see this, Poppler is correctly installed and added to the PATH.
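
The same check can be done from inside Python, which is useful because an IDE such as PyCharm may still be running with the old PATH until it is restarted. A small sketch using only the standard library (the function name `find_poppler_tool` is just for illustration):

```python
import shutil
from typing import Optional


def find_poppler_tool(tool: str = "pdftoppm") -> Optional[str]:
    """Return the full path of a Poppler tool if it is on PATH, else None."""
    return shutil.which(tool)


if __name__ == "__main__":
    path = find_poppler_tool()
    if path is None:
        print("pdftoppm not found on PATH; Poppler is missing or PATH is not updated.")
    else:
        print(f"pdftoppm found at: {path}")
```

If this prints "not found" while `pdftoppm -h` works in a fresh terminal, the Python process was probably started before PATH was changed.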
Retry Running Your Script:

After ensuring Poppler is installed and available in the PATH, retry running your Python script. The error related to Poppler should be resolved.

Additional Debugging:

If you still encounter issues, make sure:

- Your PDF file is valid and not corrupted.
- The file_path provided to your script is correct.
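
Both of these checks can be roughly automated. The sketch below, with a made-up helper name `looks_like_pdf`, only confirms that the file exists and starts with the `%PDF-` header. It is a quick sanity check under that assumption, not full PDF validation.

```python
from pathlib import Path


def looks_like_pdf(file_path: str) -> bool:
    """Return True if file_path exists and begins with the PDF header bytes."""
    path = Path(file_path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        # Well-formed PDF files normally start with "%PDF-" followed by a version.
        return f.read(5) == b"%PDF-"
```

Running `looks_like_pdf(file_path)` before calling the OCR step separates "wrong path" problems from Poppler problems.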

HuyLe82US · 2024-10-23T05:37:02Z

After fixing that, I have another issue, this time with encoding:

Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 169, in zerox
    await f.write("\n\n".join(aggregated_markdown))
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u1edc' in position 1: character maps to <undefined>

I've already set PYTHONIOENCODING=utf-8 in System Variables.
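
A note on why that variable may not help here: PYTHONIOENCODING only affects stdin, stdout, and stderr, while the traceback shows the failure inside a file write that falls back to the Windows default codec, cp1252. The sketch below reproduces the failure with the character from the traceback and shows that writing with an explicit `encoding="utf-8"` avoids it. It uses plain `open` rather than aiofiles to stay self-contained, so treat it as an illustration of the mechanism rather than a patch for pyzerox.

```python
import os
import tempfile

text = "\u1edc"  # the character named in the traceback

# cp1252 has no mapping for this character, so encoding fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# An explicit UTF-8 encoding does not depend on the system default codec.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.md")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(out_path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(cp1252_ok, round_trip == text)
```

Without changing library code, enabling Python's UTF-8 mode (the `PYTHONUTF8=1` environment variable or `python -X utf8`) changes the default encoding used by `open`, which may be enough to work around this.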

pradhyumna85 · 2024-10-25T07:12:24Z

@HuyLe82US, please don't follow the INSTRUCTIONS ON THE ABOVE LINK SHARED BY ummm288

@tylermaran, @annapo23 please block the previous comment; the link contains malware.

Also report the user.

Vamshi-Madineni · 2024-11-13T04:09:13Z

@HuyLe82US, did this issue get resolved? If not, could you try setting the errors='ignore' parameter when reading the PDF? This will skip any special characters that can't be encoded.
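
For reference, here is what `errors='ignore'` does at the codec level, using the character from the traceback. This is a sketch of the standard library behavior only; whether pyzerox exposes such a parameter is not confirmed anywhere in this thread, and the function name `encode_lossy` is made up.

```python
def encode_lossy(text: str, encoding: str = "cp1252") -> bytes:
    """Encode text, silently dropping characters the codec cannot represent."""
    return text.encode(encoding, errors="ignore")


print(encode_lossy("\u1edc abc"))  # b' abc'
```

The unencodable character is removed from the output, so for Vietnamese text this avoids the exception at the cost of losing content.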

HuyLe82US closed this as completed Oct 23, 2024

HuyLe82US reopened this Oct 23, 2024
