How can I use the models for Fraktur (German) ? #46

Open
Hermann12 opened this issue Apr 11, 2021 · 10 comments

@Hermann12

I would like to use your model for Fraktur. How do I set this up, or is it just a special command-line option?

@stweil
Member

stweil commented Apr 11, 2021

  • Download the desired model file(s) (*.traineddata), in either the fast variant (recommended for recognition) or the best variant (required for additional training)
  • Install the model file(s) in your local tessdata directory or a subdirectory of that directory
  • Optionally rename the model file(s)
  • Run Tesseract and specify the name of the model file (-l MODEL), prefixed with the subdirectory if you used one, and without the trailing .traineddata
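
As a rough sketch of the last step, the following Python helper derives the value for -l from an installed model file and builds the command line. The paths (/usr/share/tessdata, the frak subdirectory, page.png) and the function names are made up for illustration; only the tesseract options -l and --tessdata-dir are real.

```python
from pathlib import PurePosixPath

SUFFIX = ".traineddata"

def model_name(model_file, tessdata_dir):
    """Return the value for -l: path relative to tessdata, without .traineddata."""
    rel = PurePosixPath(model_file).relative_to(PurePosixPath(tessdata_dir))
    text = rel.as_posix()
    if not text.endswith(SUFFIX):
        raise ValueError("not a .traineddata file: " + model_file)
    return text[: -len(SUFFIX)]

def tesseract_command(image, output_base, model, tessdata_dir=None):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l MODEL [--tessdata-dir DIR]."""
    cmd = ["tesseract", image, output_base, "-l", model]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd

# Hypothetical installation: model renamed to frak2021.traineddata in a "frak" subdirectory.
model = model_name("/usr/share/tessdata/frak/frak2021.traineddata", "/usr/share/tessdata")
print(" ".join(tesseract_command("page.png", "page", model)))
# prints: tesseract page.png page -l frak/frak2021
```

The resulting list can be passed to subprocess.run(cmd, check=True) once Tesseract is installed; with these arguments the recognized text is written to page.txt.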

Models are available from these URLs:

We used https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_fast/frak2021_1.069_755545_3685930.traineddata (CER 1.069 % on selected ground truth) for our latest own OCR, but depending on your texts other models might give better results.

@Hermann12
Author

Really good. Thank you very much! I see only one problem with my test file, concerning "ſ" => "s". But anyway, really good compared with my previous tests.
Example:
Vorrede.
Belehrt durch die Erfahrung, wie leicht der Zuhörer Urtheil
über die Geiſteserzeugniſſe ihres Predigers durch ſo manche

@stweil
Member

stweil commented Apr 11, 2021

We train our models to detect the long s as "ſ", so if you want an "s", that requires a simple search and replace operation on the results.
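
A minimal sketch of that search and replace in Python, applied to one of the lines quoted above. The function name is only illustrative; the same replacement can be done in any text editor.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

def replace_long_s(text):
    """Replace every long s with a round s."""
    return text.replace(LONG_S, "s")

line = "über die Geiſteserzeugniſſe ihres Predigers durch ſo manche"
print(replace_long_s(line))
# prints: über die Geisteserzeugnisse ihres Predigers durch so manche
```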

@Hermann12
Author

Hermann12 commented Apr 11, 2021

OK, thanks, understood. As I said before, I am very happy with this result! I found another issue: "oͤ" instead of "ö", but not always. Maybe my bad scan is the reason; the paper is very rough. Do you prefer .jpg or .png as the source?
I will figure out for my project whether improving my images is good enough, or whether I have to improve the traineddata. The second is probably the more difficult option.

@stweil
Member

stweil commented Apr 11, 2021

The model was trained on a wide range of historic texts (from early prints to early 20th century) which include both umlaut variants "oͤ" and "ö". Tesseract does not care which image format you provide: it works with jpg, png and other image formats.

@Hermann12
Author

My print is from 1828. I see both variants on the same page, even though there is only a single form of "ö" on the paper.

stweil added the question label May 7, 2021
@stweil
Member

stweil commented May 19, 2021

Can you provide example images?

@Hermann12
Author

Hermann12 commented May 19, 2021

Source:
[image: grafik]

Result: see line 24, where the same line contains different characters.
beſtehenden allerhoͤchſten Vorſchriften kräftig zu fördern: um

[image: grafik]

@stweil
Member

stweil commented May 20, 2021

Line 24 does indeed contain both variants of "ö", so the OCR result is correct in distinguishing them. "allerhöchsten" is printed with a lower case "o" combined with a small "e" above it. That's what the OCR should detect.
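
If the modern spelling is wanted in the output, the older variant can be normalized afterwards. The sketch below assumes the OCR text encodes it as the base letter followed by U+0364 (COMBINING LATIN SMALL LETTER E); that encoding is an assumption and should be checked against the actual output. The function name is illustrative.

```python
import unicodedata

COMBINING_SMALL_E = "\u0364"    # small "e" printed above the base letter
COMBINING_DIAERESIS = "\u0308"  # the two dots of the modern umlaut

def modernize_umlauts(text):
    """Turn letter + combining small e into the precomposed modern umlaut."""
    swapped = text.replace(COMBINING_SMALL_E, COMBINING_DIAERESIS)
    # NFC composes "o" + U+0308 into "ö" and leaves the long s untouched.
    return unicodedata.normalize("NFC", swapped)

word = "allerho\u0364ch\u017ften"
print(unicodedata.name(word[7]))  # prints: COMBINING LATIN SMALL LETTER E
print(modernize_umlauts(word) == "allerh\u00f6ch\u017ften")  # prints: True
```

This replaces every U+0364 in the text, so it is only appropriate when that character occurs solely as an umlaut marker.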

@Hermann12
Author

Hermann12 commented May 20, 2021

Line 24:
case 1: "allerhöchsten" => "o" & "e"
case 2 (same line): "fördern" => "ö"
Why?
