Extracting Full URLs from PDFs #771
fabiocordeiro started this conversation in General
Hello everyone,
I’m currently using the Docling library in Python to extract text from PDF files. It works well for visible text, but for hyperlinks it extracts only the short display text and does not capture the full URLs embedded in the PDF.
This is problematic because I need to extract both the visible text and the actual URLs. For example, if a link in the PDF displays as "Click here" but points to "https://example.com", I can only retrieve "Click here" using Docling.
I was wondering if there is a way to make Docling extract this information or if anyone has encountered a similar issue and found a workaround. If Docling doesn’t support this, are there any recommended libraries or techniques to extract full URLs from PDF files? I’ve read that tools like PyMuPDF or PyPDF2 might help, but I’m looking for the best way to integrate this functionality with my existing Docling-based workflow.
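To make the question concrete, here is a rough sketch of the kind of workaround I have in mind, assuming PyMuPDF (`fitz`) is installed and that the PDF stores its hyperlinks as URI link annotations. The names `PdfLink`, `iter_pdf_links`, and `annotate_markdown` are my own hypothetical helpers, not part of Docling or PyMuPDF. The matching of link text to Markdown is a naive string search, so it may miss or mismatch links whose text is split or repeated differently in the export.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class PdfLink:
    page: int  # 1-based page number
    text: str  # visible text under the link rectangle (may be empty)
    uri: str   # target URL stored in the link annotation


def iter_pdf_links(pdf_path: str) -> Iterator[PdfLink]:
    """Yield every external (URI) link in the PDF with its display text."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for link in page.get_links():
                uri = link.get("uri")
                if not uri:
                    continue  # skip internal jumps and other non-URI links
                # "from" is the clickable rectangle; read the text it covers.
                text = page.get_textbox(link["from"]).strip()
                yield PdfLink(page_number, text, uri)


def annotate_markdown(markdown: str, links: Iterable[PdfLink]) -> str:
    """Rewrite link display text in `markdown` as [text](uri).

    Assumption: links arrive in roughly the same order as their text
    appears in the Markdown. Each link is matched once, searching only
    after the previous match; links that cannot be found are skipped.
    """
    parts, cursor = [], 0
    for link in links:
        if not link.text:
            continue
        idx = markdown.find(link.text, cursor)
        if idx == -1:
            continue
        parts.append(markdown[cursor:idx])
        parts.append(f"[{link.text}]({link.uri})")
        cursor = idx + len(link.text)
    parts.append(markdown[cursor:])
    return "".join(parts)
```

The idea would be to take the Markdown from the existing Docling step (for example `DocumentConverter().convert(path).document.export_to_markdown()`, assuming that is how the text is currently exported) and pass it through `annotate_markdown(markdown, iter_pdf_links(path))`. I am not sure this is the best way to combine the two, which is why I am asking.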
Any advice, examples, or pointers would be greatly appreciated!
Thank you in advance for your help!