Pattern Recognition #12634
I am trying to pre-process a single PDF file of medical records covering a large date range of office visits. I want to group the records (split them into separate PDFs) based on treatment date. The problem is that the "treatment date" can be presented in the text in a variety of ways. I am wondering whether one way to approach it is to identify the group of pages that belong to one visit date simply based on layout, format, or pattern. For example, a page whose bottom half is blank is likely a "spill over" page from the prior page, so we know those two pages belong together. Or the text for a given visit date likely follows the same pattern of repeated sections, like "Patient Date", "Medical History", "diagnoses", "notes", etc. I want to be able to take advantage of the fact that a given doctor visit is documented using a medical record system template of some kind. Once a group of pages is identified as belonging to the same patient visit, it may be easier to home in on which date text in that group of pages actually refers to the treatment date. Any ideas on an approach and which tools would be best?
Replies: 2 comments
That's an interesting problem! It's not really a spaCy question, but more a question about how to process and split PDFs. I'll ask one of the PDF preprocessing experts on our team to see if they have any pointers and report back!
Hello @standenman, quite an interesting problem indeed. Having worked with PDF documents for a bit, I can recommend two processing tools that should help you get the job done:
Either way, you'll be able to use relevant information such as position within the page, font and style of the area of interest, etc. to help you with your task.
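To make the position-based idea a bit more concrete, here is a minimal sketch in plain Python. It assumes you have already extracted, for each page, the text blocks together with their vertical coordinates (measured from the top of the page) using whichever PDF library you pick. The `Block` and `Page` classes, the `OPENING_HEADER` value, the 50% blank threshold, and the date patterns are all assumptions made up for this example, not part of any library or of your actual record template.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    top: float      # distance from the top of the page (assumed units: points)
    bottom: float   # distance from the top of the page to the block's lower edge


@dataclass
class Page:
    height: float
    blocks: List[Block]


# Hypothetical values: adjust to whatever the real template looks like.
OPENING_HEADER = "patient date"   # first section of a visit, per the question
BLANK_FRACTION = 0.5              # "bottom half is blank" heuristic

# A few common date spellings; certainly not exhaustive.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}"
    r"|\d{4}-\d{2}-\d{2}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4})\b"
)


def opens_with_header(page: Page) -> bool:
    """True if the topmost block starts with the template's opening header."""
    if not page.blocks:
        return False
    first = min(page.blocks, key=lambda b: b.top)
    return first.text.strip().lower().startswith(OPENING_HEADER)


def ends_early(page: Page) -> bool:
    """True if the bottom BLANK_FRACTION of the page has no text."""
    if not page.blocks:
        return True
    lowest = max(b.bottom for b in page.blocks)
    return lowest <= page.height * (1 - BLANK_FRACTION)


def group_pages(pages: List[Page]) -> List[List[int]]:
    """Group page indices into visits using the two layout heuristics."""
    groups: List[List[int]] = []
    closed = True  # the very first page always opens a group
    for index, page in enumerate(pages):
        if closed or opens_with_header(page):
            groups.append([index])
        else:
            groups[-1].append(index)
        # A page that ends early is treated as the last page of its visit.
        closed = ends_early(page)
    return groups


def date_candidates(pages: List[Page], group: List[int]) -> List[str]:
    """Collect date-like strings from every block of the pages in one group."""
    found: List[str] = []
    for index in group:
        for block in pages[index].blocks:
            found.extend(DATE_RE.findall(block.text))
    return found
```

Calling `group_pages(pages)` gives lists of page indices you could then write out as separate PDFs, and `date_candidates(pages, group)` narrows the date search to a single visit. Real records will likely need extra signals (font size, repeated footers, page numbering), so treat the two heuristics here as a starting point only.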