Phase Walkthrough

Features used in Phase 1:

crop.py - If an image contains characters from the previous or next page, they may interfere with the OCR. This script crops all images in a directory so that only the text area remains. It is a basic preprocessing step, and it is highly recommended to crop the images and remove any stray characters before running any of the scripts below.

hocr_output.sh - This shell script produces an hOCR file for each image file in a folder.

hocr_parsing.py - This Python script converts the hOCR output for a single image into cropped images, based on the bounding box of each entry.

hocr_to_crop.py - This Python script converts hOCR output into cropped images, based on the bounding box of each entry, for a folder containing many images.

Note: Bash files will not run directly in the Windows Command Prompt or PowerShell. On Windows, run the file in Git Bash instead. Please refer here to learn more.

Dependencies in Phase 1:

1. BeautifulSoup - pip install beautifulsoup4
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree. Since hOCR output is in an XML/HTML format, the relevant data can be extracted from it using BeautifulSoup.

2. Pillow - pip install pillow
Pillow is the actively maintained fork of the Python Imaging Library (PIL). It is a free library for the Python programming language that adds support for opening, manipulating, and saving many different image file formats.

3. Tesseract-OCR
A. To install Tesseract on Mac, first ensure that you have Homebrew installed. To install Homebrew, refer to this link. Once Homebrew is installed, run the command: brew install tesseract
B. To install Tesseract on Linux, run the command: sudo apt-get install tesseract-ocr
C. To install Tesseract on Windows, please refer to the documentation here.

4. lxml - pip install lxml
XML/HTML parser for Python. lxml is a feature-rich and easy-to-use library for processing XML and HTML in the Python language, and BeautifulSoup can use it as its underlying parser.

Pillow, BeautifulSoup, and lxml can be installed with the pip install commands above. Before installing, ensure that the virtual environment is activated by typing source <virtualenv-name>/bin/activate.
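
To confirm that the dependencies listed above are available before running the scripts, a small check along the following lines may help. It is only a sketch and not part of the repository: the helper names missing_modules and tesseract_on_path are hypothetical, and the check assumes Tesseract is installed under the executable name tesseract.

```python
import importlib.util
import shutil

# Import names differ from the pip package names:
# beautifulsoup4 -> bs4, pillow -> PIL, lxml -> lxml.
REQUIRED_MODULES = ["bs4", "PIL", "lxml"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return True if an executable named `tesseract` is found on PATH."""
    return shutil.which("tesseract") is not None
```

Calling missing_modules(REQUIRED_MODULES) inside the activated virtual environment returns an empty list when all three Python packages are installed.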

Expected Output:

crop.py - This script crops all images in a directory so that only the text area remains. The width and height of the cropped region are the same for all images.
python3 crop.py <source_directory> <destination_directory>

Note: The paths shown in these commands are not absolute. Adjust them relative to where your folders and files are saved.
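
As an illustration of this kind of fixed-size cropping, the sketch below crops every image in a directory to one target width and height using Pillow. It is not the code in crop.py: the centred placement of the crop box, the function names centered_box and crop_directory, and the list of image extensions are all assumptions made for the example.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def centered_box(image_width, image_height, target_width, target_height):
    """Return a (left, upper, right, lower) box of the target size, centred
    in the image and clamped so it never exceeds the image bounds."""
    width = min(target_width, image_width)
    height = min(target_height, image_height)
    left = (image_width - width) // 2
    upper = (image_height - height) // 2
    return (left, upper, left + width, upper + height)


def crop_directory(source_dir, destination_dir, target_width, target_height):
    """Crop every image in source_dir and save it under the same file name."""
    from PIL import Image  # imported here so centered_box works without Pillow

    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(source_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        with Image.open(path) as image:
            box = centered_box(image.width, image.height, target_width, target_height)
            image.crop(box).save(destination / path.name)
```

For a 1000 x 800 image and a 600 x 400 target, centered_box returns (200, 200, 800, 600), which is the box format that Pillow's Image.crop expects.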

hocr_output.sh - The directory should contain the image files for which you want to produce hOCR files. The shell script produces an hOCR file for every image in the directory. The directory path is given as an argument. Before you can execute the shell script, you need to give it execute permission:
chmod +x hocr_output.sh
Then run the script:
./hocr_output.sh <source_directory>
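
For readers who cannot run Bash, the same idea can be sketched in Python. This is an assumption about what the shell script does rather than a translation of it: it calls the tesseract command line once per image with the hocr output config, which in recent Tesseract versions writes a file with the .hocr extension next to the output base name. The function names hocr_command and run_hocr and the extension list are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed set of image extensions; adjust to match your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def hocr_command(image_path):
    """Build the Tesseract command that writes the hOCR file next to the image."""
    image_path = Path(image_path)
    # Tesseract appends the output extension itself, so pass the path without one.
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "hocr"]


def run_hocr(source_dir):
    """Run Tesseract on every image in source_dir, stopping on the first failure."""
    for image_path in sorted(Path(source_dir).iterdir()):
        if image_path.suffix.lower() in IMAGE_SUFFIXES:
            subprocess.run(hocr_command(image_path), check=True)
```

Keeping the command construction in its own function makes it easy to print and inspect the exact command before running it on a whole directory.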

hocr_parsing.py - This script produces the cropped entries for a single image. Its inputs are the source image, the hOCR file for that image, and the destination directory. It offers two algorithms, both described in the file, and you need to choose which one to use.
python3 hocr_parsing.py <path_to_image> <path_to_hocr_file> <destination_path>
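
The sketch below shows one way bounding boxes can be read from an hOCR file with BeautifulSoup. It does not reproduce either of the two algorithms in hocr_parsing.py. It relies on the hOCR convention that each element stores its geometry in the title attribute as "bbox x0 y0 x1 y1", and it assumes that an entry corresponds to an element of class ocr_line, which may not match what the script treats as an entry. The function names are hypothetical.

```python
import re

# hOCR stores geometry in the title attribute,
# e.g. "bbox 36 92 580 122; baseline 0 -5".
BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return (left, upper, right, lower) from an hOCR title string, or None."""
    match = BBOX_PATTERN.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def extract_boxes(hocr_text, element_class="ocr_line"):
    """Return the bounding boxes of all elements with the given hOCR class."""
    from bs4 import BeautifulSoup  # local import: parse_bbox needs no third-party code

    soup = BeautifulSoup(hocr_text, "lxml")
    boxes = []
    for element in soup.find_all(class_=element_class):
        box = parse_bbox(element.get("title", ""))
        if box is not None:
            boxes.append(box)
    return boxes
```

Each returned tuple is in the (left, upper, right, lower) order that Pillow's Image.crop accepts, so a box can be passed to it directly to save one cropped entry.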

hocr_to_crop.py - This is a batch script that finds the cropped entries using bounding boxes. Its inputs are the source directory containing the images to be cropped, the path containing the hOCR files for those images, and the destination path in which all the cropped images are stored.
python3 hocr_to_crop.py <source_directory> <path_to_hocr_files> <destination_path>
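
A batch run needs to know which hOCR file belongs to which image. The sketch below pairs them by file name without the extension. Whether hocr_to_crop.py matches files this way is an assumption, as are the .hocr suffix and the function name pair_images_with_hocr.

```python
from pathlib import Path


def pair_images_with_hocr(image_dir, hocr_dir, hocr_suffix=".hocr"):
    """Return (image_path, hocr_path) pairs that share the same file stem.

    Images without a matching hOCR file are skipped.
    """
    hocr_by_stem = {path.stem: path for path in Path(hocr_dir).glob(f"*{hocr_suffix}")}
    pairs = []
    for image_path in sorted(Path(image_dir).iterdir()):
        # Skip sub-directories and the hOCR files themselves, in case both
        # kinds of file live in the same directory.
        if not image_path.is_file() or image_path.suffix.lower() == hocr_suffix:
            continue
        hocr_path = hocr_by_stem.get(image_path.stem)
        if hocr_path is not None:
            pairs.append((image_path, hocr_path))
    return pairs
```

Each pair can then be processed in the same way as a single image and its hOCR file.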
