This is a project that extracts data from pdf snapshot and enters data into csv This project is using 2 methods to detect boxes from table. First method depends on the border of table. Check the result of first method in "focus_border_Images" Second method depends on the text. Check the result of first method in "focus_text_Images" It was tested on Window and is using multi-threading on 2 stages to speed up.
- Extract JPGs from pdf
- Extract data from JPG Pretrained Tesseract model is used in this project. You can find the project that is used custom model by CNN on https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN
- poppler-0.68.0
- Anaconda3-2020.11-Windows-x86_64.exe
- tesseract-ocr-w64-setup-v5.1.0.20220510.exe
- Extract poppler-0.68.0 and copy it into C:\Program Files\
- Install ananconda and add path to environment variables
- Install tesseract-ocr-w64-setup-v5.1.0.20220510.exe
- Download this repository
- Install requirements.txt in project root directory
-
Convert pdf to images Open the pdf_image.py and define the parameters. You can check the parameters here. https://pdf2image.readthedocs.io/en/latest/reference.html Then, run this command in project root dirctory
python pdf_img.py
Extracted JPG files are stored in "pdf_img" folder
-
Extract boxes by border
python focus_border.py
-
Extract boxes by text
python focus_text.py
The result is stored in "focus_text_Images" folder
- You can get the final result by running only below command after running pdf_img.py
This script extracts the boxes by border and get the OCR result by pytesseract The results are stored in "output_img" folder and "table_1.csv" file.
python ocr_border.py
In this project, I used the table that has 7 columns
1, 4, and 5 columns can't be recognized by pytesseract.
The boxes of these columns are stored in "output_img" folder as JPG and added their file name to csv file.
You can check example of "output_img" folder here.
https://drive.google.com/drive/folders/1nrns5zZkzfVP9o8aCyjkj9O4O-TwJAvP?usp=sharing
Other columns can be recognized by pytesseract and the results are stored in csv directly.
The csv keeps table structure of original pdf
I attached example "output_img" folder and "table_1.csv"
- Development of ocr_text.py based on text of table.
- Improve accuracy
- Extract data from any table.
1.0.2
Please give me star if this project was helpful to your startup project. :)