LLMWhisperer CLI Test Script

A simple command-line client for LLMWhisperer, a powerful document extraction service from Unstract that converts complex documents (PDFs, images, scanned files) into LLM-ready text.

Features

  • Extract text from PDFs, images, and scanned documents
  • Multiple extraction modes for different document types
  • Table structure preservation with optional border recreation
  • Page-specific extraction
  • Save output to file or display in terminal
  • Environment-based API key configuration

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Setup

  1. Clone this repository:
git clone https://github.com/Zipstack/llmwhisperer-cli-test-script.git
cd llmwhisperer-cli-test-script
  2. Create and activate a virtual environment:
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Configure your API key:

Create a .env file in the project directory:

LLMWHISPERER_API_KEY=your_api_key_here

Alternatively, set it as an environment variable:

export LLMWHISPERER_API_KEY=your_api_key_here

Get your API key from Unstract LLMWhisperer.
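
For readers curious how a key lookup like this can work, here is a minimal standard-library sketch. It is not the script's actual code: the function name `load_api_key` and the precedence (environment variable first, then the .env file) are assumptions made for illustration, and the script itself may rely on a helper library instead.

```python
import os
from pathlib import Path


def load_api_key(env_path=".env", environ=None):
    """Return LLMWHISPERER_API_KEY, checking the environment before a .env file.

    Hypothetical helper: the lookup order is an assumption, not taken from the script.
    """
    environ = os.environ if environ is None else environ

    # 1. An exported environment variable wins if it is set and non-empty.
    key = environ.get("LLMWHISPERER_API_KEY")
    if key:
        return key

    # 2. Otherwise look for a NAME=value line in the .env file.
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            name, _, value = line.partition("=")
            if name.strip() == "LLMWHISPERER_API_KEY":
                return value.strip().strip("'\"")

    raise RuntimeError("LLMWHISPERER_API_KEY not found")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.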

Usage

Basic Usage

Extract text from a document:

python llmwhisperer_cli.py document.pdf

Save Output to File

python llmwhisperer_cli.py document.pdf -o extracted_text.txt

Extraction Modes

LLMWhisperer supports different extraction modes optimized for various document types:

  • native_text: For digitally created PDFs with embedded text (fastest)
  • low_cost: For clean, printed documents with good scan quality
  • high_quality (default): For challenging documents including handwritten text
  • form: For documents with forms, checkboxes, and structured layouts
  • table: For documents with dense table structures

Example:

python llmwhisperer_cli.py document.pdf -m table

Page Selection

Extract specific pages or page ranges:

# Extract pages 1-5
python llmwhisperer_cli.py document.pdf -p "1-5"

# Extract pages 1-5 and page 7
python llmwhisperer_cli.py document.pdf -p "1-5,7"

# Extract from page 21 to end
python llmwhisperer_cli.py document.pdf -p "21-"
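
The page specification is passed to the service as a string. To make its meaning concrete, the sketch below expands a specification into explicit page numbers. The function `expand_pages` is hypothetical and exists only for illustration. It assumes pages are 1-based and ranges are inclusive, as the comments above suggest, and that an open-ended range such as "21-" runs to the last page.

```python
def expand_pages(spec, total_pages):
    """Expand a page spec such as "1-5,7,21-" into a sorted list of page numbers.

    Illustrative only: assumes 1-based, inclusive ranges; "N-" means page N to the end.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, _, end_s = part.partition("-")
            start = int(start_s)
            # An empty right-hand side means "through the last page".
            end = int(end_s) if end_s else total_pages
            if end_s and start > end:
                raise ValueError(f"invalid range: {part!r}")
        else:
            start = end = int(part)
        if start < 1:
            raise ValueError(f"page numbers start at 1: {part!r}")
        # Clamp to the document length; pages past the end are ignored.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)


print(expand_pages("1-5,7", total_pages=10))  # [1, 2, 3, 4, 5, 7]
print(expand_pages("21-", total_pages=23))    # [21, 22, 23]
```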

Table Border Recreation

For better table structure preservation:

# Add vertical borders
python llmwhisperer_cli.py document.pdf --vert

# Add both vertical and horizontal borders
python llmwhisperer_cli.py document.pdf --vert --horiz

Note: --horiz requires --vert to be enabled.

Complete Example

Extract tables from specific pages with borders and save to file:

python llmwhisperer_cli.py financial_report.pdf \
  -m table \
  -p "10-15" \
  --vert --horiz \
  -o tables_output.txt
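
As a rough picture of what the command above asks for, the sketch below turns the CLI options into a dictionary of request parameters. The helper `build_request_params` is hypothetical, and the parameter names (`pages_to_extract`, `mark_vertical_lines`, `mark_horizontal_lines`) are assumptions about the LLMWhisperer v2 API rather than something this README states. Check the official API documentation before relying on them.

```python
def build_request_params(mode="high_quality", pages=None, vert=False, horiz=False):
    """Map CLI options to a dict of request parameters.

    Hypothetical helper; the key names are assumed, not confirmed by this README.
    """
    params = {
        "mode": mode,                    # -m / --mode
        "mark_vertical_lines": vert,     # --vert
        "mark_horizontal_lines": horiz,  # --horiz
    }
    if pages:
        params["pages_to_extract"] = pages  # -p / --pages; omitted means all pages
    return params


# Equivalent of: -m table -p "10-15" --vert --horiz
params = build_request_params(mode="table", pages="10-15", vert=True, horiz=True)
```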

Command-Line Options

| Option | Description | Default |
|---|---|---|
| file_path | Path to the document to process | Required |
| -o, --output | Output file to save extracted text | None (prints to console) |
| -m, --mode | Extraction mode (see modes above) | high_quality |
| -p, --pages | Pages to extract (e.g., "1-5,7,21-") | All pages |
| --vert | Recreate vertical table borders | False |
| --horiz | Recreate horizontal table borders | False |
| -h, --help | Show help message | - |
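
The options above could be declared with Python's standard `argparse` module roughly as follows. This is a sketch reconstructed from the option list, not a copy of llmwhisperer_cli.py. The function names and help strings are illustrative, and the real script may be structured differently.

```python
import argparse

MODES = ["native_text", "low_cost", "high_quality", "form", "table"]


def build_parser():
    """Declare the options listed above (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(description="LLMWhisperer CLI client (sketch)")
    parser.add_argument("file_path", help="Path to the document to process")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file; prints to console when omitted")
    parser.add_argument("-m", "--mode", choices=MODES, default="high_quality",
                        help="Extraction mode")
    parser.add_argument("-p", "--pages", default=None,
                        help='Pages to extract, e.g. "1-5,7,21-"')
    parser.add_argument("--vert", action="store_true",
                        help="Recreate vertical table borders")
    parser.add_argument("--horiz", action="store_true",
                        help="Recreate horizontal table borders")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce the --horiz/--vert dependency."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.horiz and not args.vert:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--horiz requires --vert")
    return args
```

Checking the flag dependency right after parsing keeps the error next to the usage message, before any request is sent.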

Output

The client provides:

  • Extracted text (to console or file)
  • Total number of pages processed
  • Processing status and progress indicators

API Configuration

The client uses the following environment variables:

  • LLMWHISPERER_API_KEY: Your API key (required)
  • LLMWHISPERER_BASE_URL_V2: API endpoint (optional, defaults to US region)

For EU region, set:

LLMWHISPERER_BASE_URL_V2=https://llmwhisperer-api.eu-west.unstract.com/api/v2
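
A small sketch of how an optional endpoint override can be resolved. The helper `resolve_base_url` is hypothetical. Because this README does not spell out the default (US) endpoint, the caller supplies it as `default_url` instead of the sketch guessing at it.

```python
import os

# EU endpoint as given above.
EU_BASE_URL = "https://llmwhisperer-api.eu-west.unstract.com/api/v2"


def resolve_base_url(default_url, environ=None):
    """Return LLMWHISPERER_BASE_URL_V2 if set and non-empty, else default_url.

    Hypothetical helper; default_url stands in for the US endpoint.
    """
    environ = os.environ if environ is None else environ
    return environ.get("LLMWHISPERER_BASE_URL_V2") or default_url
```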

Examples

Extract text from a scanned document

python llmwhisperer_cli.py scanned_document.pdf -m low_cost

Process a form with checkboxes

python llmwhisperer_cli.py application_form.pdf -m form -o form_data.txt

Extract tables with structure preservation

python llmwhisperer_cli.py data_tables.pdf -m table --vert --horiz

Extract specific pages from a large document

python llmwhisperer_cli.py manual.pdf -p "1-10,50-55" -o summary.txt

Troubleshooting

  1. "LLMWHISPERER_API_KEY not found": Ensure your .env file is in the same directory as the script or set the environment variable.

  2. "--horiz requires --vert": Horizontal borders can only be added when vertical borders are enabled.

  3. Timeout errors: For large documents, the default timeout is 200 seconds. The script will wait for processing to complete.
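
To illustrate what waiting with a timeout means, here is a generic polling loop. It is a sketch, not the script's implementation. The function `wait_for_result`, the `check_status` callback, the "processed" status string, and the 5-second polling interval are all assumptions. Only the 200-second default comes from the text above.

```python
import time


def wait_for_result(check_status, timeout=200.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it returns "processed" or the timeout elapses.

    Hypothetical helper: the status string and interval are illustrative.
    """
    deadline = clock() + timeout
    while True:
        status = check_status()
        if status == "processed":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"processing did not finish within {timeout} seconds")
        sleep(interval)  # back off before asking again
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.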

License

This project is provided under the MIT license.

About

A simple CLI test script for LLMWhisperer, which can also double as a reference client implementation
