LLMWhisperer CLI Test Script

A simple command-line client for LLMWhisperer, a document extraction service from Unstract that converts complex documents (PDFs, images, scanned files) into LLM-ready text.

Features

Extract text from PDFs, images, and scanned documents
Multiple extraction modes for different document types
Table structure preservation with optional border recreation
Page-specific extraction
Save output to file or display in terminal
Environment-based API key configuration

Installation

Prerequisites

Python 3.7 or higher
pip package manager

Setup

Clone this repository:

git clone https://github.com/Zipstack/llmwhisperer-cli-test-script.git
cd llmwhisperer-cli-test-script

Create and activate a virtual environment:

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configure your API key:

Create a .env file in the project directory:

LLMWHISPERER_API_KEY=your_api_key_here

Alternatively, set it as an environment variable:

export LLMWHISPERER_API_KEY=your_api_key_here

Get your API key from Unstract LLMWhisperer

Usage

Basic Usage

Extract text from a document:

python llmwhisperer_cli.py document.pdf

Save Output to File

python llmwhisperer_cli.py document.pdf -o extracted_text.txt

Extraction Modes

LLMWhisperer supports different extraction modes optimized for various document types:

native_text: For digitally created PDFs with embedded text (fastest)
low_cost: For clean, printed documents with good scan quality
high_quality (default): For challenging documents including handwritten text
form: For documents with forms, checkboxes, and structured layouts
table: For documents with dense table structures

Example:

python llmwhisperer_cli.py document.pdf -m table

Page Selection

Extract specific pages or page ranges:

# Extract pages 1-5
python llmwhisperer_cli.py document.pdf -p "1-5"

# Extract pages 1-5 and page 7
python llmwhisperer_cli.py document.pdf -p "1-5,7"

# Extract from page 21 to end
python llmwhisperer_cli.py document.pdf -p "21-"
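
To make the page syntax concrete, here is a minimal sketch of how a specification such as "1-5,7,21-" could be expanded into individual page numbers. It is illustrative only and is not taken from the script or the service: the helper name `expand_pages` and the `total_pages` argument are assumptions introduced for this example.

```python
def expand_pages(spec, total_pages):
    """Expand a page spec such as "1-5,7,21-" into a sorted list of page numbers.

    Hypothetical helper for illustration. An open-ended range ("21-") runs
    to total_pages, and ranges are clipped to total_pages.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)  # a missing start raises ValueError
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)


print(expand_pages("1-5,7", 30))  # [1, 2, 3, 4, 5, 7]
print(expand_pages("21-", 23))    # [21, 22, 23]
```

Overlapping ranges collapse because the pages are collected in a set before sorting.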

Table Border Recreation

For better table structure preservation:

# Add vertical borders
python llmwhisperer_cli.py document.pdf --vert

# Add both vertical and horizontal borders
python llmwhisperer_cli.py document.pdf --vert --horiz

Note: --horiz requires --vert to be enabled.

Complete Example

Extract tables from specific pages with borders and save to file:

python llmwhisperer_cli.py financial_report.pdf \
  -m table \
  -p "10-15" \
  --vert --horiz \
  -o tables_output.txt

Command-Line Options

| Option | Description | Default |
|---|---|---|
| `file_path` | Path to the document to process | Required |
| `-o, --output` | Output file to save extracted text | None (prints to console) |
| `-m, --mode` | Extraction mode (see modes above) | `high_quality` |
| `-p, --pages` | Pages to extract (e.g., "1-5,7,21-") | All pages |
| `--vert` | Recreate vertical table borders | False |
| `--horiz` | Recreate horizontal table borders | False |
| `-h, --help` | Show help message | - |
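
The options above map naturally onto Python's `argparse`. The following is a minimal sketch of such a parser, not the script's actual source: the names `build_parser`, `parse_args`, and `MODES` are assumptions, while the flags, defaults, and the rule that --horiz requires --vert come from this README.

```python
import argparse

# Extraction modes listed in this README.
MODES = ["native_text", "low_cost", "high_quality", "form", "table"]


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a document with LLMWhisperer"
    )
    parser.add_argument("file_path", help="Path to the document to process")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file to save extracted text")
    parser.add_argument("-m", "--mode", choices=MODES, default="high_quality",
                        help="Extraction mode")
    parser.add_argument("-p", "--pages", default=None,
                        help='Pages to extract (e.g., "1-5,7,21-")')
    parser.add_argument("--vert", action="store_true",
                        help="Recreate vertical table borders")
    parser.add_argument("--horiz", action="store_true",
                        help="Recreate horizontal table borders")
    return parser


def parse_args(argv=None):
    """Parse argv and enforce that --horiz is only used together with --vert."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.horiz and not args.vert:
        parser.error("--horiz requires --vert")  # prints usage and exits with code 2
    return args


args = parse_args(["financial_report.pdf", "-m", "table", "-p", "10-15",
                   "--vert", "--horiz", "-o", "tables_output.txt"])
print(args.mode, args.pages, args.vert, args.horiz, args.output)
```

`argparse` adds `-h, --help` automatically, so the sketch does not declare it.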

Output

The client provides:

Extracted text (to console or file)
Total number of pages processed
Processing status and progress indicators

API Configuration

The client uses the following environment variables:

LLMWHISPERER_API_KEY: Your API key (required)
LLMWHISPERER_BASE_URL_V2: API endpoint (optional, defaults to US region)

For EU region, set:

LLMWHISPERER_BASE_URL_V2=https://llmwhisperer-api.eu-west.unstract.com/api/v2
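
As a rough illustration of how these two variables could be resolved, the sketch below reads them from an environment-style mapping. The function name `resolve_settings` is an assumption, and returning `None` for the base URL is simply this example's way of saying "use the default endpoint"; the script itself may handle this differently.

```python
import os

# EU endpoint as given in this README.
EU_BASE_URL = "https://llmwhisperer-api.eu-west.unstract.com/api/v2"


def resolve_settings(env=None):
    """Return (api_key, base_url) from an environment-style mapping.

    Hypothetical helper. base_url is None when LLMWHISPERER_BASE_URL_V2 is
    unset or empty, meaning the default (US) endpoint should be used.
    """
    env = os.environ if env is None else env
    api_key = env.get("LLMWHISPERER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("LLMWHISPERER_API_KEY not found")
    base_url = env.get("LLMWHISPERER_BASE_URL_V2", "").strip() or None
    return api_key, base_url


# Example with an explicit mapping instead of the real environment.
print(resolve_settings({
    "LLMWHISPERER_API_KEY": "your_api_key_here",
    "LLMWHISPERER_BASE_URL_V2": EU_BASE_URL,
}))
```

Passing the mapping as an argument keeps the lookup easy to test without touching the real environment.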

Examples

Extract text from a scanned document

python llmwhisperer_cli.py scanned_document.pdf -m low_cost

Process a form with checkboxes

python llmwhisperer_cli.py application_form.pdf -m form -o form_data.txt

Extract tables with structure preservation

python llmwhisperer_cli.py data_tables.pdf -m table --vert --horiz

Extract specific pages from a large document

python llmwhisperer_cli.py manual.pdf -p "1-10,50-55" -o summary.txt

Troubleshooting

"LLMWHISPERER_API_KEY not found": Ensure your .env file is in the same directory as the script or set the environment variable.
"--horiz requires --vert": Horizontal borders can only be added when vertical borders are enabled.
Timeout errors: The script waits for processing to complete, with a default timeout of 200 seconds. Large documents can take longer than that.
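
Waiting with a timeout can be pictured as a polling loop with a deadline. The sketch below is a generic illustration under stated assumptions, not the script's implementation: `wait_for_result`, `check_status`, and the 2-second polling interval are hypothetical, and only the 200-second default comes from this README.

```python
import time


def wait_for_result(check_status, timeout=200.0, interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll check_status() until it returns a result or the timeout elapses.

    check_status is a hypothetical callable that returns None while the
    document is still being processed and the result once it is ready.
    """
    deadline = clock() + timeout
    while True:
        result = check_status()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"processing did not finish within {timeout} seconds"
            )
        sleep(interval)


# Simulated job that is ready on the third status check; sleeping is stubbed out.
responses = iter([None, None, "extracted text"])
print(wait_for_result(lambda: next(responses), sleep=lambda seconds: None))
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.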

License

This project is provided under the MIT license.
