NoneType for images in pdf.py #72

Open
HuyLe82US opened this issue Oct 23, 2024 · 4 comments

@HuyLe82US
HuyLe82US commented Oct 23, 2024

When I tried to OCR a .pdf file, I got this error.
Here is the log:

ERROR:root:Error converting PDF to images: Unable to get page count. Is poppler installed and in PATH?
Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 149, in zerox
    results = await process_pages_in_batches(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\processor\pdf.py", line 104, in process_pages_in_batches
    for image in images
                 ^^^^^^
TypeError: 'NoneType' object is not iterable

I have already installed poppler-utils, and I also checked that the package is present in the project.
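
Reading the log, the final TypeError looks like a follow-on symptom: the first line reports that the PDF-to-image conversion failed, and the traceback shows `images` was `None` when `process_pages_in_batches` iterated over it. A minimal sketch of that pattern and a guard against it follows. The function names are hypothetical stand-ins, not pyzerox's actual code.

```python
from typing import List, Optional


def convert_pdf_to_images(pdf_path: str) -> Optional[List[str]]:
    """Hypothetical converter that logs the error and returns None on failure."""
    try:
        # Simulate the failure reported in the log above.
        raise RuntimeError("Unable to get page count. Is poppler installed and in PATH?")
    except Exception as exc:
        print(f"Error converting PDF to images: {exc}")
        return None


def process_pages(images: Optional[List[str]]) -> List[str]:
    """Fail with a clear message instead of iterating over None."""
    if images is None:
        raise RuntimeError("PDF conversion produced no images; check the Poppler setup first")
    return [f"processed {image}" for image in images]
```

Without the `None` check, the list comprehension would raise the same `'NoneType' object is not iterable` error seen in the traceback.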

@HuyLe82US
Author

I found out the cause and here is the solution from ChatGPT:

Steps to Resolve

  1. Install Poppler:

    • Poppler is required for converting PDF pages into images. You need to install it on your system.

    On Windows:

    • Download the Poppler binaries from Poppler for Windows.
    • Extract the zip file to a folder (e.g., C:\poppler).

    On macOS:

    • You can install Poppler via Homebrew:
      brew install poppler

    On Linux (Debian/Ubuntu):

    • Install Poppler using the package manager:
      sudo apt-get install poppler-utils
  2. Add Poppler to System PATH:

    If you're on Windows, you'll need to add the bin folder from the Poppler installation to your system's PATH.

    Adding Poppler to PATH (Windows):

    1. Right-click on This PC or My Computer and go to Properties.
    2. Click on Advanced system settings.
    3. In the System Properties window, click on the Environment Variables button.
    4. Under System variables, find the Path variable, and click Edit.
    5. Click New and add the path to the Poppler bin directory (e.g., C:\poppler\bin).
    6. Click OK to close all the windows.
  3. Verify Poppler Installation:

    After installing Poppler and adding it to the PATH, verify that it’s correctly set up by running the following command in your terminal (command prompt or shell):

    pdftoppm -h

    This should display help information for pdftoppm, one of the tools included with Poppler. If you see this, Poppler is correctly installed and added to the PATH.

  4. Retry Running Your Script:

    After ensuring Poppler is installed and available in the PATH, retry running your Python script. The error related to Poppler should be resolved.
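
The verification step can also be run from Python, which checks the PATH that the interpreter itself sees. This is a minimal sketch using only the standard library; the helper names are hypothetical. An IDE or terminal that was started before PATH was edited keeps the old environment, so it may need a restart before this check passes.

```python
import shutil
import subprocess
from typing import Optional


def find_pdftoppm() -> Optional[str]:
    """Return the full path of pdftoppm if it is on PATH, else None."""
    return shutil.which("pdftoppm")


def poppler_status() -> str:
    """Describe whether pdftoppm is reachable from this Python process."""
    path = find_pdftoppm()
    if path is None:
        return "pdftoppm not found on PATH"
    # Ask the tool for its version; read both streams since the output
    # stream can differ between builds.
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    lines = (result.stderr or result.stdout).strip().splitlines()
    version = lines[0] if lines else "unknown version"
    return f"pdftoppm found at {path}: {version}"


if __name__ == "__main__":
    print(poppler_status())
```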

Additional Debugging:

If you still encounter issues, make sure:

  • Your PDF file is valid and not corrupted.
  • The file_path provided to your script is correct.
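
Both checks above can be sketched in a few lines. This is a cheap sanity check under the assumption that a usable PDF carries the `%PDF-` marker near the start of the file; it is not a full validation, and `check_pdf` is a hypothetical helper name.

```python
from pathlib import Path


def check_pdf(file_path: str) -> str:
    """Return 'missing', 'not a PDF', or 'ok' for the given path."""
    path = Path(file_path)
    if not path.is_file():
        return "missing"
    with path.open("rb") as f:
        header = f.read(1024)
    # Look for the PDF marker in the first bytes of the file.
    return "ok" if b"%PDF-" in header else "not a PDF"
```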

@HuyLe82US HuyLe82US reopened this Oct 23, 2024
@HuyLe82US
Author

After fixing that, I have another issue with encoding:

Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 169, in zerox
    await f.write("\n\n".join(aggregated_markdown))
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u1edc' in position 1: character maps to <undefined>

I've already set PYTHONIOENCODING=utf-8 in System Variables.
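
The traceback ends in `encodings\cp1252.py` during `f.write`, which suggests the output file was opened with the Windows default encoding. `PYTHONIOENCODING` only applies to stdin, stdout, and stderr, so it would not change how files are opened. A minimal sketch of the difference, using plain `open()` instead of the aiofiles call in the traceback, with hypothetical helper names:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_and_read_utf8(path: str, text: str) -> str:
    """Write text with an explicit UTF-8 encoding and read it back."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


sample = "\u1edc"  # the character named in the traceback
print(can_encode(sample, "cp1252"))  # cp1252 has no mapping for it
print(can_encode(sample, "utf-8"))
```

If the library's file-opening call cannot be changed, enabling Python's UTF-8 mode (the `PYTHONUTF8=1` environment variable, or `python -X utf8`) makes UTF-8 the default for `open()`. Whether that is enough for this script has not been confirmed in this thread.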

@pradhyumna85
Contributor

pradhyumna85 commented Oct 25, 2024

@HuyLe82US, please don't follow the INSTRUCTIONS ON THE ABOVE LINK SHARED BY ummm288

@tylermaran, @annapo23 please block the previous comment; the link contains malware.

Also report the user.

@Vamshi-Madineni

@HuyLe82US, did this issue get resolved? If not, could you try setting the errors='ignore' parameter when reading the PDF? This will skip any special characters that can't be encoded.
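
For reference, here is a small sketch of what `errors='ignore'` does at the encoding step. Where the parameter would go in pyzerox is not stated in this thread, so the example only uses `str.encode` from the standard library. It shows the trade-off: the error goes away, but unencodable characters are silently dropped, whereas UTF-8 keeps them.

```python
text = "\u1edc abc"  # starts with the character from the traceback

# errors="ignore" drops anything cp1252 cannot represent.
lossy = text.encode("cp1252", errors="ignore").decode("cp1252")

# UTF-8 can represent the full string, so nothing is lost.
lossless = text.encode("utf-8").decode("utf-8")

print(repr(lossy))
print(lossless == text)
```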
