It would be nice to have GUI elements that assist in fine-tuning/training Tesseract on scanned images, similar to what jTessBoxEditor does, as described in this article [1]. Mainly creating the .tiff and .box files...
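# Launch jTessBoxEditor (used to create and edit the .tif and .box files)
java -jar jTessBoxEditor.jar
# Generate an initial .box file from the training image
tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox
# Create font_properties with one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
nano font_properties
font 0 0 0 0 0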
# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train
# Create a unicharset file
unicharset_extractor font_name.font.exp0.box
# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create a normproto file
cntraining font_name.font.exp0.tr
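# Rename the outputs so they share the font_name. prefix
mv shapetable font_name.shapetable
mv normproto font_name.normproto
mv pffmtable font_name.pffmtable
mv inttemp font_name.inttemp
# Combine the prefixed files into font_name.traineddata
combine_tessdata font_name.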
Now copy font_name.traineddata to the tessdata directory:
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
Now test the new traineddata:
tesseract test_numbers.png stdout -l font_name
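The combine_tessdata step only works if every renamed file from the earlier steps is present. A minimal sketch of a pre-flight check, assuming the five files produced above are the ones needed; the function name check_tessdata_inputs is hypothetical and not part of Tesseract:

```shell
# Hypothetical helper (not a Tesseract tool): verify that the files
# produced by the steps above exist before running combine_tessdata.
# Usage: check_tessdata_inputs PREFIX   (e.g. PREFIX=font_name)
check_tessdata_inputs() {
    prefix=$1
    missing=0
    # Assumed list, taken from the outputs of the commands above.
    for ext in unicharset shapetable normproto pffmtable inttemp; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

It could be used as a guard, for example: check_tessdata_inputs font_name && combine_tessdata font_name.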
Footnotes
[1] Not all the commands listed in the article worked for me. The commands above are my slightly corrected versions.
Yes, this is one of the basic features necessary for an OCR program. If it gets added, I can donate to support development. Just make a simple GUI to modify the Tesseract configuration file, with a short description of each parameter on hover.
Probably the fastest way to achieve this is for someone to contribute the code via a PR. For my part, I won't have the capacity to work on this in the near future.
I created a simple Python script that extracts the boxes from the HTML file. In gImageReader you should export the edited image as HTML and then use the script to extract the boxes: https://github.com/khashashin/chechen_ocr
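The linked script is not reproduced here. As a rough illustration of the coordinate conversion such an extraction involves, here is a sketch that assumes the exported HTML is hOCR, where each bbox is given as x0 y0 x1 y1 measured from the top-left corner. Tesseract box files instead list symbol, left, bottom, right, top and page, measured from the bottom-left corner, so the y values have to be flipped using the image height. The helper name hocr_bbox_to_box is hypothetical:

```shell
# Hypothetical helper: turn one hOCR bbox (top-left origin) into a
# Tesseract box-file line (bottom-left origin).
# Usage: hocr_bbox_to_box TEXT X0 Y0 X1 Y1 IMAGE_HEIGHT [PAGE]
hocr_bbox_to_box() {
    text=$1; x0=$2; y0=$3; x1=$4; y1=$5; height=$6; page=${7:-0}
    # Box line layout: symbol left bottom right top page
    printf '%s %d %d %d %d %d\n' \
        "$text" "$x0" "$((height - y1))" "$x1" "$((height - y0))" "$page"
}
```

The image height has to be supplied by the caller, since a single bbox does not carry it.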