README.md update.
maxim2266 committed Jun 20, 2017
1 parent 1dbbdf6 commit b257b53
Showing 1 changed file with 13 additions and 7 deletions.
20 changes: 13 additions & 7 deletions README.md
@@ -2,7 +2,7 @@
A simple driver script for the `tesseract` OCR tool.

### The tool
Given an input file in either `pdf` or `djvu` format, the tool extracts images from
the input file using the `pdfimages` or `ddjvu` tool, and then converts the images to
plain text using the `tesseract` tool.
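
The choice of extraction tool described above can be sketched as follows. This is only an illustration: the helper name `pick_extractor` is hypothetical, and selecting by file extension is an assumption, not necessarily what the script does.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the script: pick the extraction tool
# from the input file's extension.
pick_extractor() {
    case "$1" in
        *.pdf)  echo pdfimages ;;
        *.djvu) echo ddjvu ;;
        *)      echo "unsupported input: $1" >&2; return 1 ;;
    esac
}

pick_extractor some.pdf    # prints "pdfimages"
pick_extractor book.djvu   # prints "ddjvu"
```

For a PDF, the two stages roughly correspond to running `pdfimages -f 12 -l 26 -tiff some.pdf page` and then `tesseract <image> stdout -l rus` on each extracted image; the options the script actually passes may differ.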

@@ -21,7 +21,8 @@ Command line options:
```

##### Example

The following command processes a document `some.pdf` in Russian, from page 12 to page 26 (inclusive),
storing the result in the file `document.txt`:
```
./ocr -f 12 -l 26 -L rus -o document.txt some.pdf
@@ -33,9 +34,14 @@ Tested on Linux Mint 18.1, will probably work on other Debian-based distribution

#### External tools

Internally, the script relies on the `pdfimages` and `ddjvu` tools for extracting images,
and on the `tesseract` program for the actual OCR. The `pdfimages` tool is usually part of the `poppler-utils`
package, `ddjvu` comes from the `djvulibre-bin` package, and `tesseract` is included in the `tesseract-ocr`
package. By default, `tesseract` comes with English language support only; other languages must
be installed separately. For example, run `sudo apt install tesseract-ocr-rus` to install Russian
language support. To find out which languages are currently installed, type `tesseract --list-langs`.
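
A quick pre-flight check for these dependencies might look like the sketch below. The function names `missing_tools` and `has_lang` are hypothetical, and the language check assumes that `tesseract --list-langs` prints one language code per line after a header line.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Succeed if the language code $1 appears as a whole line in a
# `tesseract --list-langs` listing read from stdin (assumed format).
has_lang() {
    grep -qx -- "$1"
}

missing=$(missing_tools pdfimages ddjvu tesseract)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
fi
```

With the tools installed, `tesseract --list-langs 2>&1 | has_lang rus` would then tell whether Russian support is available; the `2>&1` is there because some versions print the list to standard error.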

#### Known limitations

The tool may produce somewhat messy output from `.pdf` files composed of images with masks. No simple
workaround is known at this time. To check an input file beforehand, run `pdfimages -list`, which prints
the type of each embedded image.
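
A minimal sketch for spotting such files, assuming the table printed by `pdfimages -list` keeps the image type in its third column and marks masks as `mask` or `smask`. The function name `count_masks` is hypothetical and not part of the script.

```shell
#!/bin/sh
# Count mask images in a `pdfimages -list` listing read from stdin.
# Assumes the third column holds the image type (image, mask, smask, ...).
count_masks() {
    awk '$3 == "mask" || $3 == "smask" { n++ } END { print n + 0 }'
}
```

Running `pdfimages -list some.pdf | count_masks` would print a non-zero number for a file that may trigger the limitation.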
