GUI to assist in fine tuning/teaching Tesseract on scanned images #529

Open
AvtechScientific opened this issue Jul 9, 2021 · 3 comments

Comments

@AvtechScientific
Contributor

AvtechScientific commented Jul 9, 2021

It would be nice to have GUI elements that assist in fine-tuning/teaching Tesseract on scanned images, similar to what jTessBoxEditor does, as described in this article [1]. Mainly, this means creating the .tiff and .box files...

java -jar jTessBoxEditor.jar

tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox

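# Create font_properties containing the line below
# (format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>)
nano font_properties
font 0 0 0 0 0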


# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train

# Create a unicharset file
unicharset_extractor font_name.font.exp0.box

# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create a normproto file
cntraining font_name.font.exp0.tr

mv shapetable font_name.shapetable
mv normproto font_name.normproto
mv pffmtable font_name.pffmtable
mv inttemp font_name.inttemp

combine_tessdata font_name.

Now copy font_name.traineddata to the tessdata directory:
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Now test the new traineddata:
tesseract test_numbers.png stdout -l font_name
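
For anyone who wants to script the steps above before a GUI exists, here is a minimal sketch that chains them in one shell function. The names `run`, `train_font` and `DRY_RUN` are hypothetical helpers, not part of Tesseract or gImageReader. It assumes the `<lang>.<font>.exp0` file naming used above, and that the `.tif`, the corrected `.box` and `font_properties` already exist in the current directory. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the training commands listed above.
# run, train_font and DRY_RUN are hypothetical names for illustration.

# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# train_font <lang> <font>, e.g. train_font font_name font
train_font() {
    lang=$1
    font=$2
    base="$lang.$font.exp0"

    # Each step runs only if the previous one succeeded.
    run tesseract "$base.tif" "$base" nobatch box.train &&
    run unicharset_extractor "$base.box" &&
    run shapeclustering -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" &&
    run mftraining -F font_properties -U unicharset -O "$lang.unicharset" "$base.tr" &&
    run cntraining "$base.tr" &&
    for f in shapetable normproto pffmtable inttemp; do
        # Prefix the generated files so combine_tessdata can find them.
        run mv "$f" "$lang.$f" || return 1
    done &&
    run combine_tessdata "$lang."
}

# When called as a script with two arguments, run the pipeline.
if [ "$#" -ge 2 ]; then
    train_font "$1" "$2"
fi
```

Saved as, say, `train.sh` (a made-up file name), `bash train.sh font_name font` prints the ten commands, and `DRY_RUN=0 bash train.sh font_name font` actually executes them.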

Footnotes

  1. Not all the commands listed in the article worked for me. The commands above are my slightly corrected versions.

@AKmatiAK

Yes, this is one of the basic features necessary for an OCR program. If it gets added, I can donate to support development. Just make a simple GUI for modifying the Tesseract configuration file, with a short description of each parameter on hover.

@manisandro
Owner

Probably the fastest way to achieve this is for someone to contribute the code via a PR. For my part, I won't have the capacity to work on this in the near future.

@khashashin

I created a simple Python script that extracts the boxes from the HTML file. In gImageReader you should export the edited image as HTML and then use the script to extract the boxes: https://github.com/khashashin/chechen_ocr
