Pattern Recognition #12634

standenman · 2023-05-14T15:31:56Z

I seek to pre-process a single PDF file of medical records covering a large date range of office visits. I want to group the records, split into separate PDFs, based on treatment date. The problem is that the "treatment date" can be presented in the text in a variety of ways. I am wondering if one way to come at it is to identify the group of pages that belong to one visit date simply based on the layout, format or pattern. For example, a page whose bottom half is blank is likely a "spill over" page from a prior page, so we know those two pages belong together. Or the text for a given visit date likely follows the same pattern/repetition of text sections like "Patient Date", "Medical History", "diagnoses", "notes" etc. I want to be able to take advantage of the reality that a given doctor visit is documented using a medical record system template of some kind. Once a group of pages is identified as belonging to the same patient visit, it may be easier to home in on which date text in that group of pages actually refers to the treatment date. Any ideas on an approach and which tools would be best?

Answered by bdura


danieldk · 2023-05-15T07:28:16Z

That's an interesting problem! It's not really a spaCy question, but more a question about how to process and split PDFs.

I'll ask one of the PDF preprocessing experts in our team to see if they have any pointers and report back!

bdura · 2023-05-15T09:12:47Z

Hello @standenman, quite an interesting problem indeed. Having worked with PDF documents for a bit, I can recommend two processing tools that should help you get the job done:

EDS-PDF provides a framework to extract text from PDF documents. It is developed at AP-HP (Greater Paris University Hospitals) and was therefore designed with medical documents in mind, although it is general-purpose.
If you're looking for a lower-level library, PDFMiner is a pure-Python PDF parsing tool.

Either way, you'll be able to use relevant information such as position within the page, font and style of the area of interest, etc. to help you with your task.
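
As a rough illustration of using position within the page, here is a minimal sketch built on pdfminer.six (`extract_pages` and `LTTextContainer`). It is not a tested solution for real medical records. The `PageInfo` class, the `read_page_info` and `group_pages` helpers, and the 0.5 blank-space threshold are all hypothetical names and values chosen for the example. It assumes that a visit starts on a page whose first line matches a known template header, and that a page ending with a large blank area is the last page of a visit, which matches the "spill over" idea from the question.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class PageInfo:
    number: int          # 0-based page index
    first_line: str      # first line of the topmost text box on the page
    blank_bottom: float  # fraction of the page height left empty below the text


def read_page_info(pdf_path: str) -> List[PageInfo]:
    """Collect simple layout features per page with pdfminer.six."""
    # Imported here so the grouping logic below works without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    infos = []
    for number, page in enumerate(extract_pages(pdf_path)):
        boxes = [el for el in page
                 if isinstance(el, LTTextContainer) and el.get_text().strip()]
        if not boxes:
            infos.append(PageInfo(number, "", 1.0))  # page without any text
            continue
        # PDF coordinates start at the bottom-left corner of the page.
        top_box = max(boxes, key=lambda b: b.y1)
        lowest_y = min(b.y0 for b in boxes)
        first_line = top_box.get_text().strip().splitlines()[0]
        blank = max(0.0, (lowest_y - page.y0) / page.height)
        infos.append(PageInfo(number, first_line, blank))
    return infos


def group_pages(pages: Iterable[PageInfo],
                header_markers: Sequence[str],
                blank_threshold: float = 0.5) -> List[List[int]]:
    """Group page numbers into visits using two layout heuristics (assumptions)."""
    markers = tuple(m.lower() for m in header_markers)
    groups: List[List[int]] = []
    prev = None
    for page in pages:
        opens_visit = page.first_line.lower().startswith(markers)
        prev_ended_short = prev is not None and prev.blank_bottom >= blank_threshold
        if not groups or opens_visit or prev_ended_short:
            groups.append([page.number])      # start a new visit
        else:
            groups[-1].append(page.number)    # continuation of the current visit
        prev = page
    return groups
```

With a real file, the two steps would be chained as `group_pages(read_page_info("records.pdf"), ["Patient Date"])`, where `records.pdf` is a placeholder path and the header marker has to be adapted to the template actually used in the records. Font and style information is available from the same layout objects and could be added as further signals.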
