Pattern Recognition #12634

Hello @standenman, quite an interesting problem indeed. Having worked with PDF documents for a bit, I can recommend two processing tools that should help you get the job done:

  • EDS-PDF provides a framework to extract text from PDF documents. It is developed at AP-HP (Greater Paris University Hospitals), so it was designed with medical documents in mind, but it is general-purpose.
  • If you're looking for a lower-level library, PDFMiner is a pure-Python PDF parsing tool.

Either way, you'll be able to use relevant information such as position within the page, font and style of the area of interest, etc. to help you with your task.
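
To make the PDFMiner route a bit more concrete, here is a rough sketch of pulling each text line together with its position and dominant font. It assumes the maintained fork, pdfminer.six (`pip install pdfminer.six`), and uses its `extract_pages` layout API. The `Line` dataclass and the `dominant_style` and `pick_headings` helpers are names I made up for illustration, and the "large or bold font means heading" rule is only a guess at what might work for your documents, so treat it as a starting point rather than a recipe.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Line:
    """One text line with layout metadata (hypothetical container)."""
    page: int
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at bottom-left
    font: str
    size: float


def dominant_style(styles: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the most frequent (font name, size) pair, or ("", 0.0) if empty."""
    if not styles:
        return ("", 0.0)
    return Counter(styles).most_common(1)[0][0]


def iter_lines(pdf_path: str) -> Iterator[Line]:
    """Yield every text line of the PDF with its bounding box and dominant font."""
    # Imported lazily so the pure helpers work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for block in page:
            if not isinstance(block, LTTextContainer):
                continue  # skip figures, curves, images, ...
            for line in block:
                if not isinstance(line, LTTextLine):
                    continue
                # Each LTChar carries the font name and size of one glyph.
                styles = [
                    (ch.fontname, round(ch.size, 1))
                    for ch in line
                    if isinstance(ch, LTChar)
                ]
                font, size = dominant_style(styles)
                yield Line(page_no, line.get_text().strip(), line.bbox, font, size)


def pick_headings(lines: Iterable[Line], min_size: float = 12.0) -> List[Line]:
    """Heuristic (an assumption): keep lines that are large or use a bold font."""
    return [
        ln for ln in lines
        if ln.text and (ln.size >= min_size or "bold" in ln.font.lower())
    ]
```

Usage would be something like `pick_headings(iter_lines("report.pdf"))`, where `report.pdf` is a placeholder path. Once you have the bounding boxes and font names, you can swap the heuristic for whatever rule matches the area of interest in your documents, e.g. "text below a given heading" or "text within a given region of the page".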

Answer selected by danieldk
Labels: third-party (Third-party packages and services)