GUI to assist in fine tuning/teaching Tesseract on scanned images #529

AvtechScientific · 2021-07-09T14:53:32Z

It would be nice to have GUI elements that assist in fine-tuning/teaching Tesseract on scanned images, similar to what jTessBoxEditor does, as described in this article¹. Mainly this means creating the .tiff and .box files...

java -jar jTessBoxEditor.jar

tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox

nano font_properties
font 0 0 0 0 0


# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train

# Create a unicharset file
unicharset_extractor font_name.font.exp0.box

# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create a normproto file
cntraining font_name.font.exp0.tr

mv shapetable font_name.shapetable
mv normproto font_name.normproto
mv pffmtable font_name.pffmtable
mv inttemp font_name.inttemp

combine_tessdata font_name.

Now copy font_name.traineddata to the tessdata directory:
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Now test the new traineddata:
tesseract test_numbers.png stdout -l font_name

¹ Not all the commands listed in the article worked for me. The commands above are my slightly corrected versions.


AKmatiAK · 2022-03-30T12:53:47Z

Yes, this is one of the basic features necessary for an OCR program. If it gets added, I can donate to support development. Just make a simple GUI to modify the Tesseract configuration file, with a short description of each parameter on hover.

manisandro · 2022-03-30T13:03:27Z

Probably the fastest way to achieve this is if someone contributed the code via PR. On my part I won't have the capacity to work on this in the near future.

khashashin · 2023-03-25T23:54:09Z

I created a simple Python script that extracts the boxes from the HTML file. In gImageReader you should export the edited image as HTML and then use the script to extract the boxes: https://github.com/khashashin/chechen_ocr
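For readers who want a feel for what such a conversion involves, here is a minimal sketch. It is not the linked script. It assumes the exported HTML is hOCR-style, with an `ocr_page` element and `ocrx_word` spans that carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, and it assumes the page bbox starts at `0 0`. The names `HocrWords` and `hocr_to_box_lines` are made up for illustration. hOCR measures y from the top of the image while Tesseract box files measure it from the bottom, so the y coordinates are flipped using the page height.

```python
import re
from html.parser import HTMLParser

# Assumption: bounding boxes appear in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) per ocrx_word span, plus the page height."""

    def __init__(self):
        super().__init__()
        self.page_height = None
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return
        box = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            # Assumes the page bbox is "0 0 width height".
            self.page_height = box[3]
        elif "ocrx_word" in classes:
            self._bbox = box
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, *self._bbox))
            self._bbox = None


def hocr_to_box_lines(hocr, page=0):
    """Return box-file style lines: '<text> <left> <bottom> <right> <top> <page>'."""
    parser = HocrWords()
    parser.feed(hocr)
    if parser.page_height is None:
        raise ValueError("no ocr_page bbox found; the image height is needed")
    height = parser.page_height
    lines = []
    for text, x0, y0, x1, y1 in parser.words:
        # Flip the y axis: hOCR counts from the top edge, box files from the bottom.
        lines.append(f"{text} {x0} {height - y1} {x1} {height - y0} {page}")
    return lines
```

Calling `hocr_to_box_lines(open("page.html", encoding="utf-8").read())` (file name hypothetical) would give one line per word. Box files for the legacy training steps shown earlier in this thread normally list one character per line, so word-level boxes like these would probably still need splitting or manual correction in a box editor.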
