Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for reading PDF files using pypdf #80

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

Vyaas99
Copy link

@Vyaas99 Vyaas99 commented Dec 30, 2024

This PR adds functionality to ingest and process PDF files in the repository. It introduces the following changes:

Enhancements:
    Added _is_pdf_file to detect .pdf files.
    Implemented _read_pdf_content for extracting text using PyPDF2.
    Updated _read_file_content to call _read_pdf_content for .pdf files.
Dependencies:
    Updated requirements.txt to include PyPDF2.

Please let me know if further changes are required!

Closes #74

@cyclotruc cyclotruc added the planned This will be implemented label Dec 30, 2024
@cyclotruc
Copy link
Owner

Thank you so much!
I'll review and test ASAP

Copy link
Owner

@cyclotruc cyclotruc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but I think we should follow the import stated in their official docs

src/gitingest/ingest_from_query.py Outdated Show resolved Hide resolved
@cyclotruc cyclotruc changed the title Add support for reading PDF files using PyPDF2 Add support for reading PDF files using pypdf Jan 2, 2025
Copy link
Owner

@cyclotruc cyclotruc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vyaas99 Thank you very much

There's a few things to change before we can merge this:
It seems like the pdf reading feature is not functional, I tried with: https://github.com/cyclotruc/test

  • In ignore_patterns.py there's a .pdf filter that needs to be removed

  • _is_text_file returns False on a pdf, this function needs to be adapted in order to accept pdfs (maybe a simple check on filename)
    image

If you can make those changes and run pre-commit to ensure CI checks passes I would glaadly merge this!

@cyclotruc cyclotruc added work in progress This PR is not ready yet but is being worked on changes requested A review requested changes before merging and removed planned This will be implemented changes requested A review requested changes before merging work in progress This PR is not ready yet but is being worked on labels Jan 2, 2025
@cyclotruc cyclotruc changed the title Add support for reading PDF files using pypdf feat: add support for reading PDF files using pypdf Jan 2, 2025
@joydeep049
Copy link
Contributor

Hello @Vyaas99,
Could you mention the related issue which you are solving, so that it can auto-merge.
It would be convenient if the linked pull requests are visible in the issues tab

Thanx!

@Vyaas99
Copy link
Author

Vyaas99 commented Jan 2, 2025

@joydeep049, thanks for informing me about this. I have linked the pull request to the issue now.

@cyclotruc, I have written separate functions to check for and read PDF files because more work can be done on it apart from just extracting text if the need arises. I have removed .pdf from ignore_patterns.py as well and ran the pre-commit hooks.

@cyclotruc
Copy link
Owner

Hello @Vyaas99, Could you mention the related issue which you are solving, so that it can auto-merge. It would be convenient if the linked pull requests are visible in the issues tab

Thanx!

I didn't know about that thank you for the idea!

@joydeep049
Copy link
Contributor

joydeep049 commented Jan 2, 2025

Hello @Vyaas99, Could you mention the related issue which you are solving, so that it can auto-merge. It would be convenient if the linked pull requests are visible in the issues tab
Thanx!

I didn't know about that thank you for the idea!

@cyclotruc I had some more ideas to make the commit structure better, we can discuss about it on discord or I can open a separate issue for discussion.

@Vyaas99
Copy link
Author

Vyaas99 commented Jan 2, 2025

@joydeep049, sure, we can discuss on discord. My username is werwet10.

@joydeep049
Copy link
Contributor

@joydeep049, sure, we can discuss on discord. My username is werwet10.

Oh, I thought I could discuss it with @cyclotruc first and then file an issue

@Vyaas99
Copy link
Author

Vyaas99 commented Jan 2, 2025

@joydeep049, oh okay. You didn't tag in your original comment and I misunderstood.

@joydeep049
Copy link
Contributor

@joydeep049, oh okay. You didn't tag in your original comment and I misunderstood.

Yeah, I realised it later. My mistake!

Copy link
Owner

@cyclotruc cyclotruc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

This unfortunately doesn't seem to work yet
Did you test with something specific that made it work for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changes requested A review requested changes before merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: include PDF files in digest
3 participants