A simple command-line client for LLMWhisperer, a powerful document extraction service from Unstract that converts complex documents (PDFs, images, scanned files) into LLM-ready text.
- Extract text from PDFs, images, and scanned documents
- Multiple extraction modes for different document types
- Table structure preservation with optional border recreation
- Page-specific extraction
- Save output to file or display in terminal
- Environment-based API key configuration
- Python 3.7 or higher
- pip package manager
- Clone this repository:
git clone https://github.com/Zipstack/llmwhisperer-cli-test-script.git
cd llmwhisperer-cli-test-script
- Create and activate a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure your API key:
Create a .env file in the project directory:
LLMWHISPERER_API_KEY=your_api_key_here
Alternatively, set it as an environment variable:
export LLMWHISPERER_API_KEY=your_api_key_here
Get your API key from Unstract LLMWhisperer
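The lookup described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script's actual code: the helper names `parse_env_file` and `resolve_api_key` are hypothetical, and the precedence (process environment first, then the .env file) is an assumption.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=value lines from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(env_file=".env", environ=None):
    """Return the API key; the environment wins over the file (assumed order)."""
    environ = os.environ if environ is None else environ
    key = environ.get("LLMWHISPERER_API_KEY")
    if not key:
        key = parse_env_file(env_file).get("LLMWHISPERER_API_KEY")
    if not key:
        raise RuntimeError("LLMWHISPERER_API_KEY not found")
    return key
```

Passing `environ` explicitly makes the lookup easy to test without touching the real process environment.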
Extract text from a document:
python llmwhisperer_cli.py document.pdf
python llmwhisperer_cli.py document.pdf -o extracted_text.txt
LLMWhisperer supports different extraction modes optimized for various document types:
- native_text: For digitally created PDFs with embedded text (fastest)
- low_cost: For clean, printed documents with good scan quality
- high_quality (default): For challenging documents, including handwritten text
- form: For documents with forms, checkboxes, and structured layouts
- table: For documents with dense table structures
Example:
python llmwhisperer_cli.py document.pdf -m table
Extract specific pages or page ranges:
# Extract pages 1-5
python llmwhisperer_cli.py document.pdf -p "1-5"
# Extract pages 1-5 and page 7
python llmwhisperer_cli.py document.pdf -p "1-5,7"
# Extract from page 21 to end
python llmwhisperer_cli.py document.pdf -p "21-"
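To make the page syntax concrete, the sketch below expands a specification such as "1-5,7,21-" into explicit page numbers. It is only an illustration of how the three forms (range, single page, open-ended range) can be read; the function `expand_pages` is hypothetical, and the service presumably interprets the string itself.

```python
def expand_pages(spec, total_pages):
    """Expand a page spec such as "1-5,7,21-" into a sorted list of pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, _, end_text = part.partition("-")
            start = int(start_text)
            # An empty end, as in "21-", means "through the last page".
            end = int(end_text) if end_text else None
        else:
            start = int(part)
            end = start
        if start < 1 or (end is not None and end < start):
            raise ValueError(f"invalid page range: {part!r}")
        last = total_pages if end is None else min(end, total_pages)
        pages.update(range(start, last + 1))
    return sorted(pages)


# For a 23-page document:
# expand_pages("1-5,7,21-", 23) returns [1, 2, 3, 4, 5, 7, 21, 22, 23]
```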
For better table structure preservation:
# Add vertical borders
python llmwhisperer_cli.py document.pdf --vert
# Add both vertical and horizontal borders
python llmwhisperer_cli.py document.pdf --vert --horiz
Note: --horiz requires --vert to be enabled.
Extract tables from specific pages with borders and save to file:
python llmwhisperer_cli.py financial_report.pdf \
-m table \
-p "10-15" \
--vert --horiz \
-o tables_output.txt
Option | Description | Default |
---|---|---|
file_path | Path to the document to process | Required |
-o, --output | Output file to save extracted text | None (prints to console) |
-m, --mode | Extraction mode (see modes above) | high_quality |
-p, --pages | Pages to extract (e.g., "1-5,7,21-") | All pages |
--vert | Recreate vertical table borders | False |
--horiz | Recreate horizontal table borders | False |
-h, --help | Show help message | - |
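The options in the table map naturally onto an `argparse` parser. The following is a sketch of what such a parser could look like, assuming the script uses `argparse`; the helper names `build_parser` and `parse_args` are hypothetical, and the real script may be organised differently. It also enforces the rule that --horiz requires --vert.

```python
import argparse

# Mode names as listed in this README.
MODES = ["native_text", "low_cost", "high_quality", "form", "table"]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a document with LLMWhisperer."
    )
    parser.add_argument("file_path", help="Path to the document to process")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file to save extracted text")
    parser.add_argument("-m", "--mode", choices=MODES, default="high_quality",
                        help="Extraction mode")
    parser.add_argument("-p", "--pages", default=None,
                        help='Pages to extract (e.g., "1-5,7,21-")')
    parser.add_argument("--vert", action="store_true",
                        help="Recreate vertical table borders")
    parser.add_argument("--horiz", action="store_true",
                        help="Recreate horizontal table borders")
    return parser


def parse_args(argv=None):
    """Parse arguments and reject --horiz without --vert."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.horiz and not args.vert:
        # parser.error prints the message and exits with status 2.
        parser.error("--horiz requires --vert")
    return args
```

Using `choices=MODES` lets `argparse` reject an unknown mode before any request is sent.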
The client provides:
- Extracted text (to console or file)
- Total number of pages processed
- Processing status and progress indicators
The client uses the following environment variables:
- LLMWHISPERER_API_KEY: Your API key (required)
- LLMWHISPERER_BASE_URL_V2: API endpoint (optional, defaults to US region)
For EU region, set:
LLMWHISPERER_BASE_URL_V2=https://llmwhisperer-api.eu-west.unstract.com/api/v2
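As a rough sketch of how the optional endpoint variable might be handled, the hypothetical helper below returns the configured URL, or `None` when the variable is unset so that the caller can fall back to the default (US) endpoint. Only the EU URL is taken from this README; the function name and the `None` convention are assumptions.

```python
import os

# EU endpoint as given in this README.
EU_BASE_URL = "https://llmwhisperer-api.eu-west.unstract.com/api/v2"


def resolve_base_url(environ=None):
    """Return LLMWHISPERER_BASE_URL_V2 if set, else None (use the default)."""
    environ = os.environ if environ is None else environ
    url = environ.get("LLMWHISPERER_BASE_URL_V2", "").strip()
    # An empty or missing value means "no override".
    return url or None
```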
python llmwhisperer_cli.py scanned_document.pdf -m low_cost
python llmwhisperer_cli.py application_form.pdf -m form -o form_data.txt
python llmwhisperer_cli.py data_tables.pdf -m table --vert --horiz
python llmwhisperer_cli.py manual.pdf -p "1-10,50-55" -o summary.txt
- "LLMWHISPERER_API_KEY not found": Ensure your .env file is in the same directory as the script, or set the environment variable.
- "--horiz requires --vert": Horizontal borders can only be added when vertical borders are enabled.
- Timeout errors: For large documents, the default timeout is 200 seconds. The script will wait for processing to complete.
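The waiting behaviour behind timeout errors can be pictured as a polling loop with a deadline. The sketch below is a generic illustration, not the script's implementation: `wait_for_completion`, the `check_status` callback, the "processed" status string, and the 5-second polling interval are all assumptions; only the 200-second default comes from this README.

```python
import time


def wait_for_completion(check_status, timeout=200.0, interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll check_status() until it returns "processed" or the timeout expires.

    check_status is a hypothetical callable that returns a status string.
    """
    deadline = clock() + timeout
    while True:
        status = check_status()
        if status == "processed":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)
```

The `clock` and `sleep` parameters exist so the loop can be exercised with a fake clock instead of real waiting.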
This project is provided under the MIT license.