Box Files (Tesseract 4.1.1, Tesseract 5.0.0)

LSTM training accepts several box file formats, all of which differ from the format used by Tesseract 3.

text2image

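Generated by text2image using Unicode fonts and training text. Each line holds one character followed by its own bounding box (left, bottom, right, top) and the page number.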

I 114 4655 120 4691 0
n 127 4655 150 4682 0
f 152 4655 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4681 0
m 214 4654 250 4681 0
a 255 4654 280 4681 0
t 282 4654 295 4689 0
i 298 4654 304 4690 0
o 308 4654 333 4681 0
n 337 4654 360 4681 0
  360 4653 378 4691 0
G 378 4653 413 4691 0
r 418 4653 434 4680 0
o 434 4653 459 4680 0
u 463 4653 486 4679 0
p 491 4643 515 4680 0
s 517 4653 540 4680 0
  540 4653 555 4690 0
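
For readers who want to inspect such a file programmatically, the following is a minimal Python sketch of reading per-character box lines. The `Box` type and the `parse_box_line` helper are hypothetical names introduced for illustration and are not part of Tesseract. The sketch assumes fields are separated by single spaces and splits from the right, so that a space used as the character itself is preserved.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box file entry (hypothetical helper type, not a Tesseract API)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right: the last five fields are always numbers, so
    # whatever remains on the left is the character, even if it is a space.
    char, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


# Three entries taken from the example above, including the space entry.
sample = (
    "I 114 4655 120 4691 0\n"
    "n 127 4655 150 4682 0\n"
    "  360 4653 378 4691 0\n"
)
boxes = [parse_box_line(line) for line in sample.splitlines() if line.strip()]

for box in boxes:
    print(repr(box.char), box.right - box.left, box.top - box.bottom)
```

Splitting from the left with a plain `split()` would silently drop the space entry, which is why the sketch uses `rsplit(" ", 5)`.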

lstmbox

Generated by tesseract from image files using the lstmbox config. Each character uses the coordinates of its entire line. This format is also generated by the tesstrain makefile.

I 114 4640 1912 4692 0
n 114 4640 1912 4692 0
f 114 4640 1912 4692 0
o 114 4640 1912 4692 0
r 114 4640 1912 4692 0
m 114 4640 1912 4692 0
a 114 4640 1912 4692 0
t 114 4640 1912 4692 0
i 114 4640 1912 4692 0
o 114 4640 1912 4692 0
n 114 4640 1912 4692 0
  114 4640 1912 4692 0
G 114 4640 1912 4692 0
r 114 4640 1912 4692 0
o 114 4640 1912 4692 0
u 114 4640 1912 4692 0
p 114 4640 1912 4692 0
s 114 4640 1912 4692 0
  114 4640 1912 4692 0
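
Because every character of a line shares the same coordinates, the line text can be recovered by grouping consecutive entries with identical boxes. The Python sketch below illustrates that idea. `lines_from_lstmbox` is a hypothetical helper, not part of Tesseract or tesstrain, and the second line in the sample data is made up for the demonstration.

```python
from itertools import groupby


def lines_from_lstmbox(text):
    """Group consecutive entries that share a box into (line_text, box) pairs."""
    entries = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank or whitespace-only lines
        # Split from the right so a space used as the character survives.
        char, *numbers = raw.rsplit(" ", 5)
        box = tuple(int(n) for n in numbers[:4])  # left, bottom, right, top
        entries.append((char, box))
    return [
        ("".join(char for char, _ in group), box)
        for box, group in groupby(entries, key=lambda entry: entry[1])
    ]


# First three entries follow the example above; the last two use an
# invented box to stand in for a second text line.
sample = (
    "G 114 4640 1912 4692 0\n"
    "r 114 4640 1912 4692 0\n"
    "  114 4640 1912 4692 0\n"
    "o 112 4544 2020 4592 0\n"
    "k 112 4544 2020 4592 0\n"
)
lines = lines_from_lstmbox(sample)
for line_text, box in lines:
    print(repr(line_text), box)
```

This assumes that two adjacent text lines never have exactly the same box, which holds for the example data but is not guaranteed in general.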
  

wordstrbox

Generated by tesseract from image files using the wordstrbox config. Each entry uses WordStr followed by the coordinates and the text of a whole line. This format is also generated by the tesstrain makefile for Indic scripts. WordStr box files are easier to create and correct, especially for complex scripts.

WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION 
	 1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS 
	 2016 4544 2020 4592 0

See the example in Devanagari script below.

WordStr 0 0 235 28 0 #मनुष्यबाह्यगतये कृतज्ञाय शिखण्डिने
	 0 0 235 28 0
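
Since each WordStr entry carries a whole line of text, reviewing or correcting a file mostly means reading the text after the `#`. The following Python sketch extracts the box and text of each entry. `parse_wordstr` is a hypothetical helper, not a Tesseract API. It assumes that the tab-prefixed line after each entry only marks the end of the line and can be skipped, and that trailing whitespace in the text is not significant.

```python
def parse_wordstr(text):
    """Return a list of (box, page, line_text) tuples from WordStr box data."""
    parsed = []
    for raw in text.splitlines():
        if not raw.startswith("WordStr "):
            continue  # skip tab-prefixed end-of-line entries and blank lines
        # Everything after the first '#' is the line text; it may itself
        # contain further '#' characters.
        head, _, line_text = raw.partition("#")
        _, left, bottom, right, top, page = head.split()
        box = (int(left), int(bottom), int(right), int(top))
        parsed.append((box, int(page), line_text.rstrip()))
    return parsed


# The Devanagari example from above, including its end-of-line entry.
sample = (
    "WordStr 0 0 235 28 0 #मनुष्यबाह्यगतये कृतज्ञाय शिखण्डिने\n"
    "\t 0 0 235 28 0\n"
)
parsed = parse_wordstr(sample)
for box, page, line_text in parsed:
    print(box, page, line_text)
```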

Note that box files generated with the makebox config file are suitable for training legacy models but not for LSTM training.

See the Creating Training Data section for more details.