feat: add support for reading PDF files using pypdf #80
base: main
Modified to read PDF files
Thank you so much!
Looks good but I think we should follow the import stated in their official docs
@Vyaas99 Thank you very much
There are a few things to change before we can merge this:
It seems the PDF reading feature is not functional; I tried it with: https://github.com/cyclotruc/test
- In `ignore_patterns.py` there's a `.pdf` filter that needs to be removed
- `_is_text_file` returns False on a PDF; this function needs to be adapted to accept PDFs (maybe a simple check on the filename)
If you can make those changes and run `pre-commit` to ensure the CI checks pass, I would gladly merge this!
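For illustration, the "simple check on filename" suggested above could look roughly like the sketch below. The names `is_pdf_file`, `looks_like_text`, and `is_ingestible` are hypothetical and are not taken from the repository; `looks_like_text` is only a stand-in for whatever `_is_text_file` actually does.

```python
from pathlib import Path


def is_pdf_file(path: str) -> bool:
    """Hypothetical helper: decide by extension only, case-insensitively."""
    return Path(path).suffix.lower() == ".pdf"


def looks_like_text(head: bytes) -> bool:
    """Stand-in text check (assumption): no NUL bytes and valid UTF-8."""
    if b"\x00" in head:
        return False
    try:
        head.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_ingestible(path: str, head: bytes) -> bool:
    """Accept PDFs by filename, everything else by the text check."""
    return is_pdf_file(path) or looks_like_text(head)
```

An extension check is cheap and avoids opening the file, at the cost of trusting the filename; checking for the `%PDF-` magic bytes would be a stricter alternative.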
Hello @Vyaas99, Thanx!
@joydeep049, thanks for informing me about this. I have linked the pull request to the issue now. @cyclotruc, I have written separate functions to check for and read PDF files because more work can be done on it apart from just extracting text if the need arises. I have removed .pdf from ignore_patterns.py as well and ran the pre-commit hooks.
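As a rough sketch of what a separate PDF-reading function might look like (not the code in this PR), the example below uses the `from pypdf import PdfReader` import form shown in the pypdf documentation. The function name `read_pdf_text` and the `reader_factory` parameter are assumptions added here so the helper can be exercised without a real PDF.

```python
from typing import Callable, Optional


def read_pdf_text(path: str, reader_factory: Optional[Callable] = None) -> str:
    """Hypothetical helper: return the text of every page, joined by newlines."""
    if reader_factory is None:
        # Imported lazily so this module still loads when pypdf is absent.
        from pypdf import PdfReader

        reader_factory = PdfReader
    reader = reader_factory(path)
    # Pages with no extractable text (e.g. scanned images) contribute "".
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

Keeping the reader in its own function leaves room for later work, such as per-page output or metadata, without touching the file-type check.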
I didn't know about that, thank you for the idea!
@cyclotruc I had some more ideas to make the commit structure better. We can discuss them on Discord, or I can open a separate issue for discussion.
@joydeep049, sure, we can discuss on Discord. My username is werwet10.
Oh, I thought I could discuss it with @cyclotruc first and then file an issue |
@joydeep049, oh okay. You didn't tag in your original comment and I misunderstood.
Yeah, I realised it later. My mistake! |
This PR adds functionality to ingest and process PDF files in the repository. It introduces the following changes:
Please let me know if further changes are required!
Closes #74