A Streamlit application that extracts structured information from academic PDFs using various LLM providers (OpenAI, Anthropic Claude, or Meta Llama). The app allows users to customize extraction prompts, manage categories, and export results to Excel.
- OpenAI (GPT-3.5, GPT-4)
- Anthropic Claude (Claude 3 Haiku, Claude 3 Sonnet)
- Meta Llama (Meta-Llama-3.1-8B-Instruct)
- Organize prompts into customizable categories
- Add, edit, and delete categories
- Add, edit, and delete prompts within categories
- Specify exact format requirements for each prompt
- Process multiple PDFs in batch
- Extract information based on customized prompts
- Format validation for extracted data
- Progress tracking during extraction
- View results in an interactive table
- Export results to Excel
- Each PDF gets its own row, with columns matching the prompt titles (see the sketch below)
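The exported workbook mirrors that layout: one row per PDF, one column per prompt title. Below is a minimal sketch of how the results table and Excel download could be wired up with pandas, openpyxl, and Streamlit; the `offer_excel_download` helper and the shape of `results` are assumptions for illustration, not the app's exact code:

```python
# Sketch only: the helper name and the shape of `results` are assumptions.
import io

import pandas as pd
import streamlit as st

def offer_excel_download(results: list[dict]) -> None:
    """Show one row per PDF (keys = prompt titles) and offer an Excel download."""
    df = pd.DataFrame(results)   # one row per PDF, one column per prompt title
    st.dataframe(df)             # interactive results table

    buffer = io.BytesIO()
    df.to_excel(buffer, index=False, engine="openpyxl")  # openpyxl writes .xlsx
    st.download_button(
        label="Download results as Excel",
        data=buffer.getvalue(),
        file_name="extraction_results.xlsx",
        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    )
```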
- Clone the repository:

```bash
git clone https://github.com/yourusername/pdf-data-extractor.git
cd pdf-data-extractor
```

- Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install the required packages:

```bash
pip install streamlit pandas PyPDF2 anthropic openai openpyxl
```

- Start the application:

```bash
streamlit run app.py
```
- Open your web browser and navigate to the provided URL (typically http://localhost:8501)
- Configure the application:
  - Select your preferred LLM provider
  - Enter an API key if required (not needed for Llama)
  - Customize categories and prompts if needed
- Upload PDFs and extract data (a sketch of this loop follows the list):
- Click "Upload PDF Files" to select one or more PDFs
- Click "Extract Data" to begin processing
- Monitor progress in the progress bar
- View results in the interactive table
- Download results as Excel file
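Under the hood, the upload-and-extract loop can be as simple as reading each PDF with PyPDF2, running every prompt against the extracted text, and updating a Streamlit progress bar. The sketch below assumes a `st.session_state["prompts"]` layout of category name to prompt list and a hypothetical `extract_with_llm` helper that calls the selected provider:

```python
# Sketch of the extraction loop; extract_with_llm() and the session-state
# layout are assumptions, not the app's exact implementation.
import streamlit as st
from PyPDF2 import PdfReader

uploaded_files = st.file_uploader("Upload PDF Files", type="pdf", accept_multiple_files=True)

if uploaded_files and st.button("Extract Data"):
    results = []
    progress = st.progress(0.0)
    for i, pdf_file in enumerate(uploaded_files):
        reader = PdfReader(pdf_file)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)

        row = {"File": pdf_file.name}
        for category_prompts in st.session_state["prompts"].values():
            for prompt in category_prompts:
                row[prompt["title"]] = extract_with_llm(text, prompt)  # hypothetical helper
        results.append(row)
        progress.progress((i + 1) / len(uploaded_files))

    st.session_state["results"] = results
```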
The application comes with three default categories of prompts (including "Study characteristics"), covering items such as:
- First author last name
- Publication year
- Journal
- Country of corresponding author
- Funding source
- Author financial conflicts of interest
- Main eligibility criteria
- Country(ies) of participants
- N included
- N (%) females/women
- Trial arm name
- Group description
- Click "Add New Category"
- Enter category name
- Click "Save Category"
To add a new prompt:
- Navigate to the desired category
- Click "Add New Prompt"
- Fill in:
  - Title: the column name in the results table
  - Prompt: the instructions sent to the LLM
  - Format: the expected format of the response
- Click "Save Prompt"
- Use "Edit" buttons to modify existing categories
- Use expanders to modify existing prompts
- Click "Update" to save changes
- OpenAI: obtain an API key from https://platform.openai.com/api-keys (required for the GPT-3.5 and GPT-4 models)
- Anthropic: obtain an API key from https://console.anthropic.com/ (required for the Claude models)
- Meta Llama: no API key required; uses the provided endpoint http://3.15.181.146:8000/v1/ (see the sketch below)
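The `/v1/` path suggests an OpenAI-compatible server, so, assuming that, the endpoint can be reached with the `openai` client by overriding `base_url`. This is a sketch; the placeholder `api_key` string is only there because the client requires one, not because the server checks it:

```python
# Sketch: calling the Llama endpoint through the OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(
    base_url="http://3.15.181.146:8000/v1/",
    api_key="not-needed",  # placeholder; the endpoint itself requires no key
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "State the last name of the first author only."}],
)
print(response.choices[0].message.content)
```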
The application uses several configuration dictionaries that can be modified in the code:
```python
MODELS = {
    "OpenAI": {
        "name": "OpenAI GPT-3.5",
        "models": ["gpt-3.5-turbo", "gpt-4"],
        "requires_key": True,
        "base_url": None
    },
    # ... other providers
}

DEFAULT_PROMPTS = {
    "Study characteristics": [
        {
            "title": "First author last name",
            "prompt": "State the last name of first author only...",
            "format": "Text with first letter capitalized"
        },
        # ... other prompts
    ],
    # ... other categories
}
```
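For example, a new default category can be added by extending `DEFAULT_PROMPTS`; the "Outcomes" category and its prompt below are hypothetical, not part of the shipped defaults:

```python
# Hypothetical addition of a new default category.
DEFAULT_PROMPTS["Outcomes"] = [
    {
        "title": "Primary outcome",
        "prompt": "State the primary outcome of the trial as reported by the authors.",
        "format": "Short phrase",
    },
]
```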
The application includes error handling (illustrated in the sketch after this list) for:
- Invalid API keys
- Failed API calls
- PDF processing errors
- Duplicate category names
- Missing required fields
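One possible pattern for the API-call side of this, shown here with the OpenAI client as an example (the `safe_extract` helper and the message layout are assumptions):

```python
# Illustrative error handling around a provider call; the helper name and
# prompt formatting are assumptions, shown with the OpenAI client as an example.
import streamlit as st
from openai import OpenAI, OpenAIError

def safe_extract(client: OpenAI, text: str, prompt: dict, model: str = "gpt-3.5-turbo") -> str:
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"{prompt['prompt']}\n\nFormat: {prompt['format']}\n\n{text}",
            }],
        )
        return response.choices[0].message.content
    except OpenAIError as exc:  # invalid key, failed request, model-side errors
        st.error(f"Extraction failed for '{prompt['title']}': {exc}")
        return "ERROR"
```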
- PDF text extraction quality depends on the PDF format
- Maximum context length varies by model (see the truncation sketch below)
- Processing time increases with document length
- Session state is not persistent between restarts
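One simple way to stay within a model's context window is to clip the extracted text to a rough character budget before sending it; the budgets below are illustrative assumptions, not the models' exact token limits:

```python
# Rough per-model character budgets (assumptions, not exact token limits).
MAX_CHARS = {
    "gpt-3.5-turbo": 12_000,
    "gpt-4": 24_000,
}

def truncate_for_model(text: str, model: str, default: int = 12_000) -> str:
    """Clip extracted PDF text so the prompt is more likely to fit the model's context."""
    return text[: MAX_CHARS.get(model, default)]
```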
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Streamlit
- Uses OpenAI, Anthropic, and Meta Llama APIs
- PDF processing with PyPDF2
- Excel export with openpyxl