Multi-column sections extending to multiple pages fail #106

dilshans2k · 2024-08-12T06:31:42Z

First of all, I want to thank you for writing such a robust algorithm for PDF parsing. It handles most cases, but fails in a few.
One such case is a multi-column section that extends onto another page, which is not parsed correctly.
Here is the PDF used:
SMJ-63-753.PMC9875870.pdf

I have written a small parser that uses the llmsherpa SDK to convert the parsed document to Markdown (easier to debug):

from llmsherpa.readers import LayoutPDFReader, Block, Paragraph, Section, LayoutReader

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=yes&applyOcr=no"

def convert_to_markdown(section: Section, traversed: list, level=2):
    """Recursively convert a section and its children to Markdown."""
    if section.block_idx in traversed:
        return ""
    markdown_output = ""

    # Handle different types of tags
    if section.tag == 'header':
        markdown_output += f"{'#' * level} {section.title}\n\n"
    elif section.tag == 'para':
        markdown_output += f"{section.to_text(include_children=False, recurse=False)}\n\n"
    elif section.tag == 'list_item':
        markdown_output += f"- {section.to_text(include_children=False, recurse=False)}\n"
    elif section.tag == 'table':
        markdown_output += section.to_text(include_children=False, recurse=False) + "\n"
    
    traversed.append(section.block_idx)
    # Recursively process children
    print("Children: ", section.block_idx, [child.block_idx for child in section.children])
    for child in section.children:
        # The recursive call records the child's block_idx itself, so no extra append is needed here
        markdown_output += convert_to_markdown(child, traversed, level + 1)
    
    return markdown_output

pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf("input.pdf")
markdown_output = ""
traversed = []
for section in doc.sections():
    markdown_output += convert_to_markdown(section, traversed)

with open("output-multi.md", "w") as md_file:
    md_file.write(markdown_output)
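
To make the traversal easier to reason about without a running server, here is a minimal sketch of the same de-duplicating walk on stand-in objects. `FakeBlock`, `to_markdown`, and `render` are hypothetical names used only for illustration and are not part of the llmsherpa SDK. It assumes, as the `traversed` guard above suggests, that a nested section can show up both as a child of its parent and as its own entry in the list of sections.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBlock:
    """Hypothetical stand-in for a parsed block; not an llmsherpa class."""
    block_idx: int
    tag: str
    text: str
    children: list = field(default_factory=list)


def to_markdown(block, seen, level=2):
    """Render a block and its children once, skipping blocks already seen."""
    if block.block_idx in seen:
        return ""
    seen.add(block.block_idx)

    if block.tag == "header":
        out = f"{'#' * level} {block.text}\n\n"
    elif block.tag == "list_item":
        out = f"- {block.text}\n"
    else:
        out = f"{block.text}\n\n"

    for child in block.children:
        out += to_markdown(child, seen, level + 1)
    return out


def render(sections):
    """Render a list of sections that may contain nested sections twice."""
    seen = set()  # set membership is O(1), unlike a list
    return "".join(to_markdown(section, seen) for section in sections)


intro = FakeBlock(1, "para", "Body text.")
sub = FakeBlock(2, "header", "Methods", [FakeBlock(3, "list_item", "Step one")])
top = FakeBlock(0, "header", "Title", [intro, sub])

# `sub` is listed both as a child of `top` and as its own section
markdown = render([top, sub])
print(markdown)
```

Because each block is recorded the moment it is rendered, the second appearance of `sub` contributes nothing. This only rules out duplication in the converter; the misordered text described in this issue would still have to come from the parsed block order itself.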

The output for the input PDF has a few issues.

In this case, llmsherpa extracted: An exception was the art given the context.....
But it should have been: An exception was the art psychotherapy group, ....

Similarly,

In this case, llmsherpa extracted: iii. Participation in the online ... engage fully and safely <Section from left column> iv. It is important...
But it should have been: iii. Participation in the online... iv. It is important ...

Thanks and regards
