Multi-column sections extending to multiple pages fail #106

Open
dilshans2k opened this issue Aug 12, 2024 · 0 comments

First of all, I want to thank you for writing such a robust algorithm for PDF parsing. It handles most cases, but it fails in a few specific ones.
One such case: a multi-column section that extends onto another page is not parsed correctly.
Here is the PDF used.
SMJ-63-753.PMC9875870.pdf

I have written a small parser that uses the llm-sherpa SDK to convert the parsed document to Markdown (easier to debug).

from llmsherpa.readers import LayoutPDFReader, Section

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=yes&applyOcr=no"

def convert_to_markdown(section: Section, traversed: list, level=2):
    """Recursively convert a section and its children to Markdown."""
    if section.block_idx in traversed:
        return ""
    markdown_output = ""

    # Handle different types of tags
    if section.tag == 'header':
        markdown_output += f"{'#' * level} {section.title}\n\n"
    elif section.tag == 'para':
        markdown_output += f"{section.to_text(include_children=False, recurse=False)}\n\n"
    elif section.tag == 'list_item':
        markdown_output += f"- {section.to_text(include_children=False, recurse=False)}\n"
    elif section.tag == 'table':
        markdown_output += section.to_text(include_children=False, recurse=False) + "\n"
    
    traversed.append(section.block_idx)
    # Recursively process children
    print("Children: ", section.block_idx, [child.block_idx for child in section.children])
    for child in section.children:
        # The recursive call records child.block_idx in `traversed` itself
        markdown_output += convert_to_markdown(child, traversed, level + 1)
    
    return markdown_output

pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf("input.pdf")
markdown_output = ""
traversed = []
for section in doc.sections():
    markdown_output += convert_to_markdown(section, traversed)

with open("output-multi.md", "w") as md_file:
    md_file.write(markdown_output)

The output for the input PDF has a few issues.

In the first case, llmsherpa extracted: An exception was the art given the context.....
But it should have been: An exception was the art psychotherapy group, ....

Similarly, in the second case, llmsherpa extracted: iii. Participation in the online ... engage fully and safely <Section from left column> iv. It is important...
But it should have been: iii. Participation in the online... iv. It is important ...
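
To narrow down where the reading order breaks, a small diagnostic along the following lines might help. It is only a sketch under stated assumptions: `BlockInfo` is a hypothetical stand-in class (not part of llmsherpa), its `page_idx` and `left` fields are assumed to be filled from whatever page index and left-edge coordinate the parser reports for each block, and the `column_gap` threshold is an arbitrary illustrative value. It flags places where the emitted order jumps from a right-hand column back to a left-hand one on the same page, which is the pattern seen in the second example.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    """Hypothetical stand-in for a parsed block; not an llmsherpa class."""
    block_idx: int
    page_idx: int
    left: float  # x-coordinate of the block's left edge (assumed available)


def find_column_backtracks(blocks, column_gap=100.0):
    """Return (previous, current) block_idx pairs where the emitted order
    jumps from a right-hand column back to a left-hand one on the same page."""
    backtracks = []
    for prev, curr in zip(blocks, blocks[1:]):
        same_page = prev.page_idx == curr.page_idx
        # A large leftward jump on the same page suggests interleaved columns.
        if same_page and prev.left - curr.left > column_gap:
            backtracks.append((prev.block_idx, curr.block_idx))
    return backtracks


# Illustrative data only; the coordinates are made up.
blocks = [
    BlockInfo(0, 3, 50.0),   # left column
    BlockInfo(1, 3, 320.0),  # right column
    BlockInfo(2, 3, 50.0),   # left column again on the same page: suspicious
    BlockInfo(3, 4, 50.0),   # next page, left column: expected
]
print(find_column_backtracks(blocks))  # → [(1, 2)]
```

A flagged pair is only a hint, since full-width elements such as titles or tables can also cause a leftward jump; the block indices it reports can be compared against the `Children:` debug output of the parser above.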

Thanks and regards
