Docling lost the link URLs embedded in the text while parsing the PDF content. #585

Open
Crespo522 opened this issue Dec 13, 2024 · 1 comment
Labels: enhancement (New feature or request), priority:low

Comments

@Crespo522

Bug

Docling lost the link URLs embedded in the text while parsing the PDF content.
...

Steps to reproduce

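import logging
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)

input_doc_path = Path("Test Docling.pdf")

# Enable OCR and table-structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Convert the PDF and time the conversion.
start_time = time.time()
conv_result = doc_converter.convert(input_doc_path)
elapsed = time.time() - start_time

_log.info(f"Document converted in {elapsed:.2f} seconds.")

# Export results
output_dir = Path("scratch")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_result.input.file.stem

# Export text format; the link URLs from the PDF are missing in this output.
with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_text())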

...

Docling version

docling 2.10.0
python 3.10
...

Test Docling.pdf

Crespo522 added the bug label on Dec 13, 2024.
@dolfim-ibm
Contributor

Our PDF reader currently does not extract any styling or meta information about the text, so embedded link URLs are not carried over into the output.
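
Until the PDF reader exposes link information, one possible stopgap is to pull the URL targets straight from the PDF file, independently of Docling. The sketch below is not part of Docling: `find_link_uris` is a hypothetical helper that scans the raw file bytes for `/URI (...)` action entries using only the standard library. It assumes the link annotations are stored uncompressed with literal-string targets; files that pack them into compressed object streams will yield nothing, and a dedicated PDF library would be more robust.

```python
import re
from pathlib import Path

# Matches a URI action with a literal-string target, e.g. "/URI (https://example.com)".
# Assumption: annotation dictionaries are stored uncompressed in the file.
# Targets containing unescaped nested parentheses are not matched.
_URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_link_uris(pdf_bytes: bytes) -> list[str]:
    """Return the distinct URI targets found in raw PDF bytes, in file order."""
    uris: list[str] = []
    for match in _URI_PATTERN.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo PDF literal-string escaping of backslashes and parentheses.
        raw = re.sub(rb"\\([\\()])", rb"\1", raw)
        uri = raw.decode("latin-1")
        if uri not in uris:
            uris.append(uri)
    return uris


if __name__ == "__main__":
    # Same input file as in the reproduction steps; skipped if it is absent.
    pdf_path = Path("Test Docling.pdf")
    if pdf_path.exists():
        for uri in find_link_uris(pdf_path.read_bytes()):
            print(uri)
```

This only lists the URLs; it does not map them back to the positions of the linked text in the output exported by Docling.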

dolfim-ibm added the enhancement and priority:low labels and removed the bug label on Dec 13, 2024.