OCR Settings and User-Input Dictionaries

Overview

The settings file defines the path structure for PDF processing, along with the image-processing and OCR parameters that should be tuned for each dataset. To learn what OCR is, how it is used, and what the image-processing pipeline does, see this overview of OCR and Pytesseract, the Tesseract manual, and this guide to image quality improvement.

User-input words and patterns

Two files allow a user to amend the default dictionary with patterns or words specific to their application: eng.my-patterns and eng.my-words. Each file should contain patterns or words, one per line. For a simple example, see the Tesseract documentation. Patterns are written in a limited regular-expression syntax; for details on valid patterns, see the description here. If you edit these files, make sure they keep their .my-patterns and .my-words extensions and stay plain text, rather than being converted to something like eng.my-patterns.rtf! Blacklisting or whitelisting characters may also be helpful, and is described below.
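
As a rough illustration of the file format described above, the sketch below writes both files from Python, which guarantees plain text with one entry per line. The helper name `write_user_dictionaries`, the sample words, and the sample pattern are hypothetical. It also assumes the files are handed to Tesseract through the `--user-words` and `--user-patterns` options; the project's own scripts may wire them up differently.

```python
import tempfile
from pathlib import Path


def write_user_dictionaries(folder, words, patterns):
    """Write eng.my-words and eng.my-patterns as plain text, one entry per line.

    Returns a Tesseract option string pointing at both files
    (hypothetical helper, not part of the project).
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    words_path = folder / "eng.my-words"
    patterns_path = folder / "eng.my-patterns"
    words_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    patterns_path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    # Paths are not quoted here, so use a folder without spaces in its name.
    return f"--user-words {words_path} --user-patterns {patterns_path}"


# Example entries (assumptions for illustration only).
# In Tesseract's pattern syntax, \A is an uppercase letter and \d is a digit,
# so this pattern describes IDs shaped like "AB-1234".
demo_dir = tempfile.mkdtemp()
user_config = write_user_dictionaries(
    demo_dir,
    words=["Pytesseract", "deskew"],
    patterns=[r"\A\A-\d\d\d\d"],
)

# The option string could then be passed along to PyTesseract, for example:
# text = pytesseract.image_to_string(image, lang="eng", config=user_config)
```

Generating the files this way avoids the editor silently saving them as rich text, which is the failure mode warned about above.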

Parameters

When using pdf_to_text_script.py, make sure you check that the following parameters in settings.py work well for your dataset:

Parameter	Description
DPI	resolution, in dots per inch, used when converting each PDF page to an image
KERNEL	denoising kernel size; change from 3 to another odd number if text is fuzzy or disappearing (see OpenCV Image Transformations)
THRESH	black-and-white image threshold; pixels with values above THRESH are set to 255 (see OpenCV Image Thresholding)
N_LINES	total number of possible lines of text in your PDF file, including empty lines; used for binning
BLACKLIST	list of characters that the OCR should try to exclude from its results
CONFIG	PyTesseract configuration parameters; change the page segmentation mode (PSM) if desired
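
To make the table more concrete, here is a minimal sketch of what these values might look like and how they could be sanity-checked before a long run. Every value shown is a placeholder rather than the project's default, and the `check_settings` and `binarize` helpers are hypothetical. The sketch also assumes that BLACKLIST is stored as a string, that it is applied through Tesseract's `tessedit_char_blacklist` variable, and that pixels at or below THRESH are set to 0.

```python
# Placeholder values for illustration only; tune each one for your dataset.
DPI = 300          # resolution used when rendering PDF pages to images
KERNEL = 3         # denoising kernel size, must be odd
THRESH = 200       # black-and-white threshold on a 0-255 grayscale
N_LINES = 60       # possible lines of text per file, including empty lines
BLACKLIST = "|~_"  # characters the OCR should try to avoid producing
# --psm 6 asks Tesseract to assume a single uniform block of text.
CONFIG = f"--psm 6 -c tessedit_char_blacklist={BLACKLIST}"


def check_settings(dpi, kernel, thresh, n_lines):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if dpi <= 0:
        problems.append("DPI must be positive")
    if kernel < 1 or kernel % 2 == 0:
        problems.append("KERNEL must be a positive odd number")
    if not 0 <= thresh <= 255:
        problems.append("THRESH must be between 0 and 255")
    if n_lines < 1:
        problems.append("N_LINES must be at least 1")
    return problems


def binarize(gray_rows, thresh):
    """Pure-Python picture of binary thresholding on rows of 0-255 pixels.

    Pixels strictly above thresh become 255 (white); the rest become 0.
    """
    return [[255 if pixel > thresh else 0 for pixel in row] for row in gray_rows]


settings_problems = check_settings(DPI, KERNEL, THRESH, N_LINES)
sample_row = binarize([[0, 200, 201, 255]], THRESH)
```

Raising THRESH turns more mid-gray pixels black, which can thicken faint text but also amplify background noise, so it is worth inspecting a few thresholded pages after each change.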
