The settings file defines the path structure of the PDF processing and various image processing and OCR parameters that should be tuned for each dataset. To learn about what OCR is, how it is used, and what the image processing pipeline is doing, see this overview of OCR and Pytesseract, the Tesseract manual, and this guide to image quality improvement.
Two files allow a user to amend the default dictionary with patterns or words specific to their application: `eng.my-patterns` and `eng.my-words`. Each file should contain patterns or words, one per line. For a simple example, see the Tesseract documentation. Patterns are entered in a limited regular-expression syntax; for details on valid patterns, see the description here. If you edit these files, make sure they keep their original extensions (`.my-patterns` and `.my-words`) and are not converted to rich text, for example `eng.my-patterns.rtf`! Blacklisting or whitelisting characters may also be helpful and is described below.
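
As a rough illustration of the one-entry-per-line format, the following Python sketch writes both files as plain UTF-8 text so that no editor can silently append `.rtf`. The helper `write_user_list`, the example words and patterns, and the temporary output directory are all hypothetical. The pattern tokens (`\A` for an uppercase letter, `\d` for a digit) are assumed from the user-patterns syntax described in the Tesseract source comments and should be checked against the Tesseract documentation.

```python
from pathlib import Path
import tempfile


def write_user_list(path, entries):
    """Write one entry per line as plain UTF-8 text and return the path."""
    path = Path(path)
    path.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return path


# Hypothetical entries; replace with terms from your own dataset.
words = ["pytesseract", "OpenCV", "thresholding"]

# Assumed pattern syntax: \A = uppercase letter, \d = digit.
patterns = [
    r"\A\A\d\d\d\d",      # e.g. a two-letter, four-digit identifier
    r"\d\d\d-\d\d\d\d",   # e.g. a seven-digit number with a hyphen
]

# A temporary directory is used here only so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())
words_file = write_user_list(out_dir / "eng.my-words", words)
patterns_file = write_user_list(out_dir / "eng.my-patterns", patterns)

# The extension should still be .my-words / .my-patterns, not .rtf or .txt.
print(words_file.name, patterns_file.name)
```

In practice the files would be written to wherever your Tesseract setup expects them rather than to a temporary directory.
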
When using `pdf_to_text_script.py`, make sure you check that the following parameters in `settings.py` work well for your dataset:
Parameter | Description |
---|---|
DPI | Resolution, in dots per inch, used to convert each PDF page to an image |
KERNEL | Denoising kernel size; change from 3 to another odd number if text is fuzzy or disappearing, see OpenCV Image Transformations |
THRESH | Black-and-white image threshold; pixels > THRESH go to 255, see OpenCV Image Thresholding |
N_LINES | Maximum number of lines of text in your PDF file, including empty lines; used for binning |
BLACKLIST | List of characters to try to exclude from recognition |
CONFIG | PyTesseract configuration parameters; change the page segmentation mode (PSM) if desired |
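
To make the table more concrete, here is a minimal sketch of how such settings might be declared and sanity-checked. Every value below is a placeholder, and `PSM`, `build_config`, `check_settings`, and `binarize` are hypothetical names that may not match the real `settings.py`. The sketch assumes that `CONFIG` is a Tesseract command-line string using the `--psm` option and the `tessedit_char_blacklist` variable, and that thresholding is a simple binary threshold in which pixels at or below `THRESH` go to 0.

```python
# Hypothetical values; tune each one for your dataset.
DPI = 300          # resolution for PDF-to-image conversion
KERNEL = 3         # denoising kernel size, must be odd
THRESH = 150       # pixels > THRESH go to 255
N_LINES = 60       # total possible lines of text, including empty lines
BLACKLIST = "|~_"  # characters to try to exclude
PSM = 6            # page segmentation mode: a single uniform block of text


def build_config(psm, blacklist):
    """Assemble a Tesseract config string (assumed format for CONFIG)."""
    parts = [f"--psm {psm}"]
    if blacklist:
        parts.append(f"-c tessedit_char_blacklist={blacklist}")
    return " ".join(parts)


def check_settings(dpi, kernel, thresh, n_lines):
    """Raise ValueError if a setting is outside its sensible range."""
    if dpi <= 0:
        raise ValueError("DPI must be positive")
    if kernel < 1 or kernel % 2 == 0:
        raise ValueError("KERNEL must be a positive odd number")
    if not 0 <= thresh <= 255:
        raise ValueError("THRESH must be between 0 and 255")
    if n_lines < 1:
        raise ValueError("N_LINES must be at least 1")


def binarize(pixels, thresh):
    """Binary threshold on a flat list of 8-bit grayscale values."""
    return [255 if p > thresh else 0 for p in pixels]


check_settings(DPI, KERNEL, THRESH, N_LINES)
CONFIG = build_config(PSM, BLACKLIST)
print(CONFIG)
```

Building `CONFIG` from its parts keeps the blacklist and the page segmentation mode in one place each, so changing either does not require editing a long string by hand. Blacklist characters that are special to a shell-style parser, such as quotes or spaces, would need extra escaping that this sketch does not handle.
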