HTTP request's "config_vars" without effect when using open-ocr-2 / Tesseract 4.00 #99

Open
danielpater opened this issue Jan 4, 2018 · 1 comment

Comments


danielpater commented Jan 4, 2018

The HTTP request's config_vars property has no effect when using the Docker setup with Docker image tleyden5iwx/open-ocr-2 (Version 2 / Tesseract 4.00).

With Version 1, based on the original Docker image tleyden5iwx/open-ocr, the config_vars property lets you pass command-line config arguments such as tessedit_char_whitelist to Tesseract. This does not work with tleyden5iwx/open-ocr-2.

Example request that restricts the result character set to digits:
$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract", "engine_args":{"config_vars":{"tessedit_char_whitelist":"0123456789"}, "psm":"3"}}' http://$DOCKER_HOST:$HTTP_PORT/ocr

Response body with tleyden5iwx/open-ocr:
011 2111 01133126 10031 31 111165 1 01 116 1 1 61 11635 1211 11 116 811111 13126 13 1 1 3 3161 116 31 211113 1131116 1211 1 1 5 5 3711 211 1311 16 112111135 1121 16 0 1 3 001111 053 1 0 311 1121111111161 6113111012813 5111 21 116 11111 161 8001 6 111 116 63211111 16 1 610 1 1121 16 118611 1 1155 31 1 0115 1131 01 1 1 01 31 211113 112111133

Response body with tleyden5iwx/open-ocr-2:
You can create local variables for the pipelines within the template by prefixing the variable name with a "$" sign. Variable names have to be composed of alphanumeric characters and the underscore. In the example below I have used a few variations that work for variable names.

Owner

tleyden commented Jan 4, 2018

I think open-ocr must be calling tesseract incorrectly when it comes to newer versions of tesseract.

Thinking out loud ... this is the call to tesseract:

// build args array
cflags := engineArgs.Export()
cmdArgs := []string{inputFilename, tmpOutFileBaseName}
cmdArgs = append(cmdArgs, cflags...)
logg.LogTo("OCR_TESSERACT", "cmdArgs: %v", cmdArgs)

and this is how it builds the args:

// return a slice that can be passed to tesseract binary as command line
// args, eg, ["-c", "tessedit_char_whitelist=0123456789", "-c", "foo=bar"]
func (t TesseractEngineArgs) Export() []string {
	result := []string{}
	for k, v := range t.configVars {
		result = append(result, "-c")
		keyValArg := fmt.Sprintf("%s=%s", k, v)
		result = append(result, keyValArg)
	}
	if t.pageSegMode != "" {
		result = append(result, "-psm")
		result = append(result, t.pageSegMode)
	}
	if t.lang != "" {
		result = append(result, "-l")
		result = append(result, t.lang)
	}
	return result
}

I wonder if this works for tesseract 4?

-c tessedit_char_whitelist=0123456789
