This project extracts data from PDF snapshots and writes it to a CSV file.

It uses 2 methods to detect the boxes of a table. The first method relies on the table borders; check its results in "focus_border_Images". The second method relies on the text; check its results in "focus_text_Images".

It was tested on Windows and uses multi-threading in 2 stages to speed things up:
- Extract JPGs from the PDF
- Extract data from the JPGs

A pretrained Tesseract model is used in this project. A separate project uses a custom CNN model instead.
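
The two multi-threaded stages could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: `run_stage`, `run_pipeline`, and the two worker callables are hypothetical names, and the workers are passed in so the sketch stays independent of any PDF or OCR library.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(worker, items, max_workers=4):
    """Apply worker to every item in a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def run_pipeline(pdf_paths, pdf_to_images, image_to_data, max_workers=4):
    """Stage 1: PDFs -> lists of images. Stage 2: images -> extracted data."""
    image_lists = run_stage(pdf_to_images, pdf_paths, max_workers)
    # Flatten the per-PDF lists so stage 2 can process all images in parallel.
    images = [image for per_pdf in image_lists for image in per_pdf]
    return run_stage(image_to_data, images, max_workers)
```

Threads fit this kind of workload when most of the time is spent in external tools (poppler, Tesseract) rather than in Python itself.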
Requirements
- poppler-0.68.0
- Anaconda3-2020.11-Windows-x86_64.exe
- tesseract-ocr-w64-setup-v5.1.0.20220510.exe
Installation
- Extract poppler-0.68.0 and copy it into C:\Program Files\
- Install Anaconda and add its path to the environment variables
- Install tesseract-ocr-w64-setup-v5.1.0.20220510.exe
- Download this repository
- Install the packages listed in requirements.txt from the project root directory
Convert PDF to images
Define the parameters first. You can check the parameters here. Then run the conversion command in the project root directory.
The extracted JPG files are stored in the "pdf_img" folder.
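
For illustration, the conversion step could look like the sketch below. It assumes the `pdf2image` package (a Python wrapper around poppler) is what performs the conversion, which this README does not state; `page_image_name`, `pdf_to_jpgs`, and the file naming scheme are hypothetical.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build a JPG file name such as 'report_page_003.jpg' (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.jpg"


def pdf_to_jpgs(pdf_path, out_dir="pdf_img", dpi=200, poppler_path=None):
    """Render every page of pdf_path to a JPG in out_dir and return the paths."""
    # Imported here so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # poppler_path should point at poppler's bin folder if it is not on PATH.
    pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    saved = []
    for number, page in enumerate(pages, start=1):
        target = out / page_image_name(pdf_path, number)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```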
Extract boxes by border
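
The idea behind border-based detection can be sketched without any image library. The example below is an assumption-laden simplification, not this project's implementation: it takes an already binarized image as nested lists (1 = dark pixel), finds rows and columns that are almost entirely dark, treats them as table borders, and returns the rectangles between neighbouring borders. A real pipeline would more likely use an image library such as OpenCV. `find_border_lines` and `detect_boxes` are hypothetical names.

```python
def find_border_lines(profile, length, min_fraction=0.8):
    """Return the centre index of each run where at least min_fraction of the
    pixels along the line are dark."""
    lines, start = [], None
    for i, count in enumerate(profile):
        is_line = count >= min_fraction * length
        if is_line and start is None:
            start = i
        elif not is_line and start is not None:
            lines.append((start + i - 1) // 2)
            start = None
    if start is not None:
        lines.append((start + len(profile) - 1) // 2)
    return lines


def detect_boxes(binary):
    """binary: list of pixel rows, 1 = dark. Returns (left, top, right, bottom)
    boxes, ordered row by row and left to right."""
    height, width = len(binary), len(binary[0])
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(binary[y][x] for y in range(height)) for x in range(width)]
    ys = find_border_lines(row_profile, width)   # horizontal borders
    xs = find_border_lines(col_profile, height)  # vertical borders
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]
```

This only works for tables whose borders span the full width and height of the image, so a page would first need to be cropped to the table.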
Extract boxes by text
The results are stored in the "focus_text_Images" folder.
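
One plausible way to build boxes from text is to start from word bounding boxes (for example, as reported by an OCR engine) and merge neighbouring words on the same line. The sketch below illustrates that idea only; it is an assumption about how a text-based method might work, and `merge_words_into_boxes`, `max_gap`, and `line_tolerance` are hypothetical.

```python
def merge_words_into_boxes(words, max_gap=15, line_tolerance=10):
    """words: (left, top, right, bottom) tuples. Words on roughly the same line
    whose horizontal gap is at most max_gap pixels are merged into one box."""
    # Group words into lines by comparing each top with the line's first word.
    lines = []
    for word in sorted(words, key=lambda w: w[1]):
        if lines and abs(word[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    boxes = []
    for line in lines:
        current = None
        for left, top, right, bottom in sorted(line):  # left to right
            if current is not None and left - current[2] <= max_gap:
                # Close enough: grow the current box to cover this word.
                current = (current[0], min(current[1], top),
                           max(current[2], right), max(current[3], bottom))
            else:
                if current is not None:
                    boxes.append(current)
                current = (left, top, right, bottom)
        boxes.append(current)
    return boxes
```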
- You can get the final result by running only the command below after running
This script extracts the boxes by border and gets the OCR result with pytesseract. The results are stored in the "output_img" folder and the "table_1.csv" file.
In this project, I used a table that has 7 columns.
Columns 1, 4, and 5 can't be recognized by pytesseract.
The boxes of these columns are stored in the "output_img" folder as JPGs, and their file names are added to the CSV file.
You can check an example of the "output_img" folder here.
The other columns can be recognized by pytesseract, and their results are stored in the CSV directly.
The CSV keeps the table structure of the original PDF.
I attached an example "output_img" folder and "table_1.csv".
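
The column handling described above can be sketched as follows. This is an illustration under assumptions, not the project's code: `build_csv_row`, `write_table`, and the image naming scheme are hypothetical, and the OCR and image-saving steps are passed in as callables. With pytesseract, `ocr` could be `pytesseract.image_to_string`, and `save_image` could save the cropped cell into "output_img".

```python
import csv
import io

# 1-based columns that are stored as JPG file names instead of OCR text.
IMAGE_COLUMNS = {1, 4, 5}


def build_csv_row(row_index, cells, ocr, save_image):
    """cells: the cropped cell images of one table row, left to right.
    Returns the CSV row, mixing image file names and OCR text."""
    row = []
    for column, cell in enumerate(cells, start=1):
        if column in IMAGE_COLUMNS:
            name = f"row{row_index}_col{column}.jpg"  # hypothetical naming scheme
            save_image(cell, name)
            row.append(name)
        else:
            row.append(ocr(cell).strip())
    return row


def write_table(rows, handle):
    """Write one CSV line per table row so the table structure is preserved."""
    csv.writer(handle).writerows(rows)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open("table_1.csv", "w", newline="")`.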
To do
- Develop extraction based on the text of the table.
- Improve accuracy
- Extract data from any table.
Please give me a star if this project was helpful to your startup project. :)