Extracting Full URLs from PDFs #771

fabiocordeiro · 2025-01-17T22:54:26Z


Hello everyone,

I’m currently using the Docling library in Python to extract text from PDF files. It works well for retrieving visible text, but I’ve noticed an issue with hyperlinks: the library extracts only the short display text of each link and doesn’t capture the full URL embedded in the PDF.

This is problematic because I need to extract both the visible text and the actual URLs. For example, if a link in the PDF displays as "Click here" but points to "https://example.com", I can only retrieve "Click here" using Docling.

I was wondering if there is a way to make Docling extract this information or if anyone has encountered a similar issue and found a workaround. If Docling doesn’t support this, are there any recommended libraries or techniques to extract full URLs from PDF files? I’ve read that tools like PyMuPDF or PyPDF2 might help, but I’m looking for the best way to integrate this functionality with my existing Docling-based workflow.
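To make the question concrete, here is a minimal sketch of the kind of side-by-side workaround being asked about. It assumes PyMuPDF is installed and uses its `Page.get_links()` and `Page.get_textbox()` calls to read link annotations. The names `PdfLink`, `extract_links`, and `append_link_section` are hypothetical helpers invented for illustration and are not part of Docling or PyMuPDF. Whether Docling can do this natively is still the open question.

```python
from typing import NamedTuple


class PdfLink(NamedTuple):
    page: int  # 1-based page number
    text: str  # visible text covered by the link rectangle
    uri: str   # target URL stored in the link annotation


def extract_links(pdf_path: str) -> list[PdfLink]:
    """Collect external (URI) links from a PDF using PyMuPDF."""
    # Imported lazily so the formatting helper below works without PyMuPDF.
    import fitz  # PyMuPDF

    found = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for link in page.get_links():
                uri = link.get("uri")
                if not uri:  # internal jumps carry no "uri" key
                    continue
                # "from" is the clickable rectangle; read the text it covers.
                text = page.get_textbox(link["from"]).strip()
                found.append(PdfLink(page.number + 1, text, uri))
    return found


def append_link_section(markdown: str, links: list[PdfLink]) -> str:
    """Append a de-duplicated Markdown list of links to extracted text."""
    if not links:
        return markdown
    seen = set()
    lines = []
    for link in links:
        key = (link.text, link.uri)
        if key in seen:
            continue
        seen.add(key)
        label = link.text or link.uri  # fall back to the URL if no text was found
        lines.append(f"- [{label}]({link.uri}) (page {link.page})")
    return markdown.rstrip() + "\n\n" + "\n".join(lines) + "\n"
```

In a Docling-based workflow, the `markdown` argument would be whatever text Docling already produces for the same file, so the link list is simply appended to it rather than merged inline. Matching each URL back to its exact position in the Docling output would need extra work that this sketch does not attempt.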

Any advice, examples, or pointers would be greatly appreciated!

Thank you in advance for your help!
