First of all, I want to thank you for writing such a robust algorithm for PDF parsing. It handles most cases but fails in a few particular ones.
One such case is a multi-column section that extends onto another page, which is not parsed correctly.
Here is the PDF used: SMJ-63-753.PMC9875870.pdf
I have written a small parser that uses the llm-sherpa SDK to convert the parsed document to Markdown (easier to debug):
from llmsherpa.readers import LayoutPDFReader, Block, Paragraph, Section, LayoutReader

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=yes&applyOcr=no"


def convert_to_markdown(section: Section, traversed: list, level=2):
    """Recursively convert a section and its children to Markdown."""
    # Skip blocks that were already emitted
    if section.block_idx in traversed:
        return ""
    markdown_output = ""
    # Handle different types of tags
    if section.tag == 'header':
        markdown_output += f"{'#' * level} {section.title}\n\n"
    elif section.tag == 'para':
        markdown_output += f"{section.to_text(include_children=False, recurse=False)}\n\n"
    elif section.tag == 'list_item':
        markdown_output += f"- {section.to_text(include_children=False, recurse=False)}\n"
    elif section.tag == 'table':
        markdown_output += section.to_text(include_children=False, recurse=False) + "\n"
    traversed.append(section.block_idx)
    # Recursively process children
    print("Children: ", section.block_idx, [child.block_idx for child in section.children])
    for child in section.children:
        markdown_output += convert_to_markdown(child, traversed, level + 1)
        traversed.append(child.block_idx)
    return markdown_output


pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf("input.pdf")
markdown_output = ""
traversed = []
for section in doc.sections():
    markdown_output += convert_to_markdown(section, traversed)
with open("output-multi.md", "w") as md_file:
    md_file.write(markdown_output)
The output for the input PDF has a few issues.
In this case, llmsherpa extracted: An exception was the art given the context.....
But it should have been: An exception was the art psychotherapy group, ....
Similarly,
In this case, llmsherpa extracted: iii. Participation in the online ... engage fully and safely <Section from left column> iv. It is important...
But it should have been: iii. Participation in the online... iv. It is important ...
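To help narrow down where the column order goes wrong, a check along the following lines might be useful. This is only a rough sketch, not part of llmsherpa: `find_column_order_jumps` is a hypothetical helper, it works on plain dicts rather than SDK objects, and the `page_idx` and `left` keys, the page width of 612 points, and the sample values are all assumptions made for illustration. It flags places where the emitted order goes from the right column back to the left column on the same page, which is the pattern seen between items iii and iv.

```python
from typing import Dict, List, Tuple


def find_column_order_jumps(blocks: List[Dict], page_width: float) -> List[Tuple[int, int]]:
    """Return (previous_block_idx, block_idx) pairs where the reading order
    moves from the right column back to the left column on the same page.

    Hypothetical helper: each block is a dict with the assumed keys
    'block_idx', 'page_idx' and 'left' (x position of the block's left edge).
    """
    middle = page_width / 2
    jumps = []
    for prev, cur in zip(blocks, blocks[1:]):
        same_page = prev["page_idx"] == cur["page_idx"]
        prev_in_right_column = prev["left"] >= middle
        cur_in_left_column = cur["left"] < middle
        if same_page and prev_in_right_column and cur_in_left_column:
            jumps.append((prev["block_idx"], cur["block_idx"]))
    return jumps


# Made-up sample mimicking the second issue: a left-column block
# emitted between two right-column list items.
sample_blocks = [
    {"block_idx": 10, "page_idx": 3, "left": 320.0},  # item iii (right column)
    {"block_idx": 11, "page_idx": 3, "left": 50.0},   # stray left-column section
    {"block_idx": 12, "page_idx": 3, "left": 320.0},  # item iv (right column)
]

print(find_column_order_jumps(sample_blocks, page_width=612.0))  # prints [(10, 11)]
```

A full-width block such as a page header would also trigger this check, so any flagged pair is only a hint about where to look, not proof of a parsing error.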
Thanks and regards