Ocr_pdf_table_csv

This is a project that extracts data from pdf snapshot and enters data into csv This project is using 2 methods to detect boxes from table. First method depends on the border of table. Check the result of first method in "focus_border_Images" Second method depends on the text. Check the result of first method in "focus_text_Images" It was tested on Window and is using multi-threading on 2 stages to speed up.

Extract JPGs from pdf
Extract data from JPG Pretrained Tesseract model is used in this project. You can find the project that is used custom model by CNN on https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN

Dependencies

poppler-0.68.0
Anaconda3-2020.11-Windows-x86_64.exe
tesseract-ocr-w64-setup-v5.1.0.20220510.exe

Installing

Extract poppler-0.68.0 and copy it into C:\Program Files\
Install ananconda and add path to environment variables
Install tesseract-ocr-w64-setup-v5.1.0.20220510.exe
Download this repository
Install requirements.txt in project root directory

How to Run

Convert pdf to images Open the pdf_image.py and define the parameters. You can check the parameters here. https://pdf2image.readthedocs.io/en/latest/reference.html Then, run this command in project root dirctory
```
python pdf_img.py
```
Extracted JPG files are stored in "pdf_img" folder
Extract boxes by border
```
python focus_border.py
```
The result is stored in "focus_border_Images" folder
Extract boxes by text
```
python focus_text.py
```
The result is stored in "focus_text_Images" folder

You can get the final result by running only below command after running pdf_img.py
```
python ocr_border.py
```
This script extracts the boxes by border and get the OCR result by pytesseract The results are stored in "output_img" folder and "table_1.csv" file.

Result Description

In this project, I used the table that has 7 columns 1, 4, and 5 columns can't be recognized by pytesseract. The boxes of these columns are stored in "output_img" folder as JPG and added their file name to csv file. You can check example of "output_img" folder here. https://drive.google.com/drive/folders/1nrns5zZkzfVP9o8aCyjkj9O4O-TwJAvP?usp=sharing Other columns can be recognized by pytesseract and the results are stored in csv directly. The csv keeps table structure of original pdf I attached example "output_img" folder and "table_1.csv"

Futher development

Development of ocr_text.py based on text of table.
Improve accuracy
Extract data from any table.

Version

1.0.2

Please give me star if this project was helpful to your startup project. :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ocr_pdf_table_csv

Dependencies

Installing

How to Run

Result Description

Futher development

Version

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
focus_border_Images		focus_border_Images
focus_text_Images		focus_text_Images
output_img		output_img
pdf_img		pdf_img
.gitignore		.gitignore
README.md		README.md
focus_border.py		focus_border.py
focus_border_images.png		focus_border_images.png
focus_border_img.jpg		focus_border_img.jpg
focus_text.py		focus_text.py
ocr_border.py		ocr_border.py
pdf_to_img.py		pdf_to_img.py
requirements.txt		requirements.txt
table_1.csv		table_1.csv
table_1.jpg		table_1.jpg

Supernova1024/29OCR-PDF-to-CSV-Table-by-Tesseract

Folders and files

Latest commit

History

Repository files navigation

Ocr_pdf_table_csv

Dependencies

Installing

How to Run

Result Description

Futher development

Version

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages