A simple OCR script to convert PDF files to DOCX or TXT formats using Tesseract OCR. This script is designed to extract text without preserving the original formatting of the document.
Clone the repository to your local machine:
git clone https://github.com/lancer1911/pdf2text.git
cd pdf2text
Alternatively, you can download the ZIP file and extract it to your desired location.
Open your terminal and create a new Conda environment:
conda create -n ocr_env python=3.9
Activate the newly created environment:
conda activate ocr_env
Install the required libraries using the requirements.txt file:
pip install -r requirements.txt
On Windows:
- Download and install Tesseract OCR from Tesseract at UB Mannheim.
- Download and install Poppler from Poppler for Windows.
- Add Tesseract and Poppler to your system PATH.
- Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal (the change applies to newly opened terminals):
setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata"
On macOS, install Tesseract OCR and Poppler using Homebrew:
brew install tesseract
brew install tesseract-lang
brew install poppler
Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal:
export TESSDATA_PREFIX=/usr/local/share/
On Linux, install Tesseract OCR and Poppler using your package manager. For example, on Ubuntu:
sudo apt update
sudo apt install tesseract-ocr poppler-utils
To install additional language packs for Tesseract, you can use the following command:
sudo apt install tesseract-ocr-<language-code>
Replace <language-code> with the code for the language you want to install. Here are some examples:
- eng for English
- chi-sim for Simplified Chinese
- chi-tra for Traditional Chinese
- deu for German
- jpn for Japanese
- fra for French
To install multiple languages at once, you can use:
sudo apt install tesseract-ocr-eng tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-deu tesseract-ocr-jpn tesseract-ocr-fra
Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal:
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
In the directory containing pdf2text.py, use the following commands to run the script:
- To perform OCR with Chinese and English recognition and write DOCX output, specifying the PDF file on the command line:
  python pdf2text.py -f path_to_file.pdf -o docx -l ce
- To perform OCR with Chinese recognition and write TXT output, specifying the PDF file on the command line:
  python pdf2text.py -f path_to_file.pdf -o txt -l c
- To perform OCR with the default English recognition and write DOCX output, using a GUI to select the PDF file:
  python pdf2text.py
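If one of the commands above fails with a missing-executable or language-data error, a quick check of the prerequisites can narrow it down. The sketch below is not part of pdf2text.py. The function name `check_ocr_environment` is hypothetical, and it assumes the script relies on the `tesseract` executable and on `pdftoppm`, which ships with Poppler.

```python
import os
import shutil


def check_ocr_environment(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: the script needs both executables to be reachable through PATH.
    for tool in ("tesseract", "pdftoppm"):
        if which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")

    # The README asks for TESSDATA_PREFIX to point at the language data directory.
    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX points to a missing directory: {tessdata}")

    return problems


if __name__ == "__main__":
    issues = check_ocr_environment()
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("Tesseract, Poppler, and TESSDATA_PREFIX look usable.")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment. Called with no arguments, it inspects the current shell session.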
 
The script pdf2text.py performs the following steps:
- Converts the PDF to images.
- Applies OCR on each image using Tesseract.
- Outputs the extracted text to either a DOCX or TXT file.
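The three steps above can be sketched roughly as follows. This is an illustration, not the code in pdf2text.py. It assumes the pdf2image, pytesseract, and python-docx packages, and the function names (`ocr_images`, `write_output`, `pdf_to_text`) and the `dpi` default are hypothetical.

```python
from pathlib import Path


def ocr_images(images, lang="eng", ocr=None):
    """Run OCR on each page image and return one cleaned text string per page."""
    if ocr is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    # Tesseract typically ends a page with a form feed, which DOCX cannot store.
    return [ocr(image, lang=lang).replace("\f", "").strip() for image in images]


def write_output(page_texts, out_path):
    """Write the page texts as plain paragraphs to a .docx or text file."""
    out_path = Path(out_path)
    if out_path.suffix.lower() == ".docx":
        from docx import Document  # provided by the python-docx package
        document = Document()
        for text in page_texts:
            document.add_paragraph(text)
        document.save(str(out_path))
    else:
        out_path.write_text("\n\n".join(page_texts), encoding="utf-8")
    return out_path


def pdf_to_text(pdf_path, out_path, lang="eng", dpi=300):
    """Convert a PDF to page images, OCR them, and write the result."""
    from pdf2image import convert_from_path  # requires Poppler on PATH
    images = convert_from_path(str(pdf_path), dpi=dpi)
    return write_output(ocr_images(images, lang=lang), out_path)
```

A call such as `pdf_to_text("path_to_file.pdf", "out.docx", lang="chi_sim+eng")` would run all three steps. Tesseract joins multiple languages with `+` and names Simplified Chinese `chi_sim`. Because each page is written as plain paragraphs, layout, fonts, and tables from the PDF are not carried over.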
 
This script is intended for basic text extraction and does not retain the original formatting of the PDF.
This project is licensed under the MIT License.