README.md update.

maxim2266 · Jun 20, 2017 · b257b53
1 parent 1dbbdf6
Showing 1 changed file with 13 additions and 7 deletions.
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 A simple driver script for "tesseract" OCR tool.
 
 ### The tool
-Given an input file in either `pdf` or `djvu` format, the tool extracts images from 
+Given an input file in either `pdf` or `djvu` format, the tool extracts images from
 the input files using `pdfimages` or `ddjvu` tool, and then converts the images to
 plain text using `tesseract` tool.
 
@@ -21,7 +21,8 @@ Command line options:
 ```
 
 ##### Example
-The following command processes a document `some.pdf` in Russian, from page 12 to page 26 (inclusive), 
+
+The following command processes a document `some.pdf` in Russian, from page 12 to page 26 (inclusive),
 storing the result in the file `document.txt`:
 ```
 ./ocr -f 12 -l 26 -L rus -o document.txt some.pdf
@@ -33,9 +34,14 @@ Tested on Linux Mint 18.1, will probably work on other Debian-based distribution
 
 #### External tools
 
-Internally the script relies on `pdfimages` and `ddjvu` tools for extracting images, 
-and on `tesseract` program for the actual OCR'ing. The tool `pdfimages` is usually a part of `poppler-utils` 
-package, the tool `ddjvu` comes from `djvulibre-bin` package, and `tesseract` is included in `tesseract-ocr` 
-package. By default, `tesseract` comes with the English language support only, other languages should 
-be installed separately, for example, run `sudo apt install tesseract-ocr-rus` to install the Russian 
+Internally the script relies on `pdfimages` and `ddjvu` tools for extracting images,
+and on `tesseract` program for the actual OCR'ing. The tool `pdfimages` is usually a part of `poppler-utils`
+package, the tool `ddjvu` comes from `djvulibre-bin` package, and `tesseract` is included in `tesseract-ocr`
+package. By default, `tesseract` comes with the English language support only, other languages should
+be installed separately, for example, run `sudo apt install tesseract-ocr-rus` to install the Russian
 language support. To find out what languages are currently installed type `tesseract --list-langs`.
+
+#### Known limitations
+
+The tool may produce somewhat messy output from `.pdf` files composed of images with masks. No simple
+workaround is known at this time. Check the input with `pdfinfo` first.
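
Note on the "External tools" and "Known limitations" text added above: a minimal sketch of how the prerequisites and the mask case might be checked before running `./ocr`. The helper names `missing_tools` and `count_masks` are hypothetical and not part of the script. The sketch assumes the "images with masks" mentioned in the README appear as `mask` or `smask` entries in the third (type) column of `pdfimages -list` output; `pdfinfo` is not used here because it reports document-level metadata rather than per-image details.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the ocr script itself.

# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Read `pdfimages -list` output on stdin and print how many rows have
# "mask" or "smask" in the type column (assumed to be the third column).
count_masks() {
    awk '$3 == "mask" || $3 == "smask" { n++ } END { print n + 0 }'
}

# Usage (input.pdf is a placeholder file name):
#   missing_tools pdfimages ddjvu tesseract
#   pdfimages -list input.pdf | count_masks
```

A non-zero count would only suggest that the file may fall under the limitation described above; it does not predict how messy the OCR output will be.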