USER_SETTINGS.md
OCR Settings and User-Input Dictionaries

Overview

The settings file defines the path structure for PDF processing, along with the image-processing and OCR parameters that should be tuned for each dataset. To learn what OCR is, how it is used, and what the image-processing pipeline does, see this overview of OCR and Pytesseract, the Tesseract manual, and this guide to image quality improvement.

User-input words and patterns

Two files allow a user to amend the default dictionary with patterns or words specific to their application: eng.my-patterns and eng.my-words. Each file should contain patterns or words, one per line. For a simple example, see the Tesseract documentation. Patterns are written in a limited regular-expression syntax; for details on valid patterns, see the description here. If you edit these files, make sure they keep their .my-patterns and .my-words extensions and stay plain text; some editors will silently save them as, for example, eng.my-patterns.rtf. Blacklisting or whitelisting characters may also help; see the BLACKLIST parameter below.
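As a minimal sketch of how these two files could be generated from a script (which avoids the rich-text problem entirely), the snippet below writes one entry per line as plain UTF-8 text and refuses any other extension. The function names and the example entries are hypothetical and not part of this repository. The two patterns assume the token meanings given in the Tesseract documentation, where `\d` stands for a digit and `\A` for an upper-case letter.

```python
from pathlib import Path

# Hypothetical entries for illustration; replace with terms from your own dataset.
MY_WORDS = ["Pytesseract", "deskew"]
# Assumed Tesseract pattern tokens: \d = digit, \A = upper-case letter.
MY_PATTERNS = [r"\d\d\d-\d\d\d\d", r"\A\A-\d\d\d\d"]

ALLOWED_SUFFIXES = {".my-words", ".my-patterns"}


def write_user_file(path, entries):
    """Write one entry per line as plain text and return the resulting path."""
    path = Path(path)
    # Reject names such as eng.my-patterns.rtf, whose final suffix is ".rtf".
    if path.suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unexpected extension: {path.name}")
    path.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return path


def write_user_dictionaries(directory):
    """Write eng.my-words and eng.my-patterns into the given directory."""
    directory = Path(directory)
    words = write_user_file(directory / "eng.my-words", MY_WORDS)
    patterns = write_user_file(directory / "eng.my-patterns", MY_PATTERNS)
    return words, patterns
```

Calling `write_user_dictionaries(".")` would create both files in the current directory; point it at whichever directory your setup expects them in.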

Parameters

When using pdf_to_text_script.py, check that the following parameters in settings.py work well for your dataset:

| Parameter | Description |
| --- | --- |
| DPI | Resolution, in dots per inch, used when converting PDF pages to images |
| KERNEL | Denoising kernel size (default 3); change it to another odd number if text looks fuzzy or disappears, see OpenCV Image Transformations |
| THRESH | Black-and-white image threshold; pixels with values above THRESH are set to 255, see OpenCV Image Thresholding |
| N_LINES | Total number of possible lines of text per page in your PDF file, including empty lines; used for binning |
| BLACKLIST | List of characters that the OCR should try to exclude from its results |
| CONFIG | PyTesseract configuration parameters; change the page segmentation mode (PSM) here if desired |
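
To make the parameters above more concrete, here is a rough sketch of what each one controls. All values and helper names are hypothetical and are not taken from the repository's settings.py. The sketch assumes binary thresholding (pixels at or below THRESH go to 0), assumes N_LINES splits the page height into equal bins, and assumes CONFIG is a Tesseract command-line string using the `--psm` flag and the `tessedit_char_blacklist` variable.

```python
# Hypothetical values for illustration only; tune them for your own dataset.
DPI = 300          # dots per inch when rendering PDF pages to images
KERNEL = 3         # denoising kernel size, must be odd
THRESH = 200       # black-and-white threshold
N_LINES = 60       # possible text lines per page, including empty ones
BLACKLIST = "|~_"  # characters the OCR should try to avoid
PSM = 6            # Tesseract page segmentation mode


def page_size_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def check_kernel(kernel):
    """Return the kernel size unchanged, or raise if it is not a positive odd number."""
    if kernel < 1 or kernel % 2 == 0:
        raise ValueError("KERNEL must be a positive odd number")
    return kernel


def threshold_row(row, thresh):
    """Binary threshold on one row of grayscale pixels: > thresh -> 255, else 0."""
    return [255 if pixel > thresh else 0 for pixel in row]


def line_bin(y, page_height, n_lines):
    """Map a vertical pixel position to one of n_lines equal-height bins."""
    return min(int(y / page_height * n_lines), n_lines - 1)


def build_config(psm, blacklist):
    """Assemble a Tesseract config string from a PSM and a character blacklist."""
    config = f"--psm {psm}"
    if blacklist:
        config += f" -c tessedit_char_blacklist={blacklist}"
    return config


CONFIG = build_config(PSM, BLACKLIST)
```

A string like `CONFIG` is the kind of value that can be passed as the `config` argument of `pytesseract.image_to_string`. For a US Letter page (8.5 by 11 inches), `page_size_pixels(8.5, 11, DPI)` shows how quickly image size grows with DPI, which is worth keeping in mind when trading OCR accuracy against processing time.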