
Product Requirements Document: Financial Report Processing API

1. Overview

1.1 Product Purpose

The Financial Report Processing API automates the extraction and consolidation of financial data from PDF reports using AI-powered document processing. The system processes reports in batches, extracts appendix sections, and generates structured JSON outputs based on predefined field configurations.

1.2 Target Users

  • Financial analysts
  • Data processing systems
  • Business intelligence platforms
  • Automated reporting pipelines

2. Functional Requirements

2.1 Core Processing Flow

2.1.1 Initial Validation (Step 1)

  • Requirement: Check modification timestamp consistency
  • Process:
    • Read ./processed/result.json file
    • Compare its last_modified field value with the LAST_MODIFIED value in the .env file
    • If values match: skip to Step 6 (response generation)
    • If values differ: proceed to Step 2 (batch processing)
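The Step 1 cache check above can be sketched as follows (function name and the line-based .env parsing are illustrative choices, not part of the spec; the result.json field is assumed to be `last_modified` per section 3.4.2):

```python
import json
import os

def is_cache_valid(result_path="./processed/result.json", env_path=".env"):
    """Return True when result.json's last_modified matches .env's LAST_MODIFIED."""
    if not os.path.exists(result_path):
        return False  # no cached result yet: full batch processing required
    with open(result_path) as f:
        cached = json.load(f).get("last_modified")
    env_value = None
    with open(env_path) as f:
        for line in f:
            if line.startswith("LAST_MODIFIED="):
                env_value = line.split("=", 1)[1].strip()
    return cached is not None and cached == env_value
```

When this returns True the API skips directly to Step 6; otherwise it falls through to batch processing.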

2.1.2 Parallel File Processing (Step 2)

  • Requirement: Process all files in ./reports directory concurrently
  • Scope: Each PDF file in the reports directory
  • Execution: Parallel processing for optimal performance

2.1.3 Sequential File Operations (Step 3)

For each report file, execute the following sub-steps sequentially:

2.1.3.1 Processed File Validation
  • Check if [filename].json exists in ./processed folder
  • If exists: skip to sub-step 6 (save processed data)
  • If not exists: continue to sub-step 2
2.1.3.2 Preprocessing File Validation
  • Check if [filename] exists in ./preprocessing folder
  • If exists: skip to sub-step 5 (Gemini data extraction)
  • If not exists: continue to sub-step 3
2.1.3.3 Appendix Detection
  • AI Integration: Send PDF file to Gemini API
  • Request: "Identify the page numbers where the Appendix section begins and ends in this report"
  • Input: Original PDF file from ./reports directory
  • Output: Page range for appendix section
2.1.3.4 PDF Appendix Extraction
  • Process: Extract pages from appendix start to end
  • Output: Save extracted PDF to ./preprocessing folder with same filename as source
  • Format: Maintain original filename convention
2.1.3.5 Data Extraction with Gemini
  • AI Integration: Process appendix PDF with configuration
  • Inputs:
    • Appendix PDF from ./preprocessing folder
    • values.json configuration from ./config folder
  • Request: "Extract field values from the PDF according to the JSON format specified in values.json. Complete the JSON structure with information found in the PDF. Do not include source citations."
  • Output: Structured JSON data
2.1.3.6 Individual Result Storage
  • Process: Save Gemini response as JSON file
  • Location: ./processed folder
  • Filename: [original_pdf_filename].json
  • Format: Valid JSON structure
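The skip logic in sub-steps 1 and 2 can be sketched as a dispatch function (names are illustrative; this assumes `[filename].json` means the PDF's stem plus a `.json` extension, and omits the Gemini and PDF-extraction calls themselves):

```python
import os

def first_substep(report_filename, processed_dir="./processed",
                  preprocessing_dir="./preprocessing"):
    """Return the sub-step at which processing resumes for one report file."""
    stem, _ = os.path.splitext(report_filename)
    if os.path.exists(os.path.join(processed_dir, stem + ".json")):
        return 6  # result already extracted: skip to individual result storage
    if os.path.exists(os.path.join(preprocessing_dir, report_filename)):
        return 5  # appendix already extracted: skip to Gemini data extraction
    return 3      # nothing cached: start with appendix detection
```

Centralizing this check keeps the caching rules in one place and makes the resume behavior easy to unit-test.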

2.1.4 Completion Synchronization (Step 4)

  • Requirement: Wait for all individual JSON files to be generated
  • Process: Monitor completion of all parallel processing tasks
  • Validation: Ensure all expected JSON files exist before proceeding
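Steps 2 and 4 together map naturally onto a fan-out/join pattern, sketched here with `concurrent.futures` (`process_report` is a placeholder for the per-file sub-steps in 2.1.3):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_all_reports(process_report, reports_dir="./reports"):
    """Run process_report on every PDF concurrently and wait for all to finish."""
    pdfs = sorted(Path(reports_dir).glob("*.pdf"))
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(process_report, pdf) for pdf in pdfs]
        # Step 4: block until every per-file task completes; .result()
        # re-raises any worker exception so failures surface as errors.
        return [f.result() for f in futures]
```

Threads suffice here because the per-file work is dominated by I/O (Gemini API calls and file reads); a process pool would be an alternative if local PDF manipulation turns out to be CPU-bound.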

2.1.5 Result Consolidation (Step 5)

  • Process: Create consolidated result file
  • Structure: Nested JSON with company names as keys
  • Data Source: All individual JSON files from ./processed folder
  • Metadata: Add last_modified field with value from .env LAST_MODIFIED field
  • Output: Save as ./processed/result.json
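A minimal consolidation sketch for Step 5 (it assumes each per-file JSON exposes the company name under a `company_name` key; that exact field name is an assumption, since the spec only requires a company name for the consolidation key):

```python
import json
from pathlib import Path

def consolidate(processed_dir="./processed", last_modified=""):
    """Merge all per-file JSONs into result.json, keyed by company name."""
    result = {}
    for path in sorted(Path(processed_dir).glob("*.json")):
        if path.name == "result.json":
            continue  # ignore a previously consolidated file
        data = json.loads(path.read_text())
        result[data["company_name"]] = data
    result["last_modified"] = last_modified  # value from .env LAST_MODIFIED
    out = Path(processed_dir) / "result.json"
    out.write_text(json.dumps(result, indent=2))
    return result
```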

2.1.6 API Response Generation (Step 6)

  • Source: Read ./processed/result.json
  • Process: Return all content except the last_modified field
  • Format: JSON response
  • Status: 200 OK
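Step 6 reduces to loading the consolidated file and stripping the cache metadata before serialization; a framework-agnostic sketch (wiring it to an actual 200 OK response is left to the chosen web framework):

```python
import json

def build_response_body(result_path="./processed/result.json"):
    """Return result.json content with the last_modified metadata removed."""
    with open(result_path) as f:
        payload = json.load(f)
    payload.pop("last_modified", None)  # strip the cache-control field
    return payload
```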

2.2 Directory and File Management

2.2.1 Directory Validation

  • Required Directories:
    • ./reports (source PDF files)
    • ./processed (output JSON files)
    • ./preprocessing (intermediate PDF files)
    • ./config (configuration files)
  • Behavior: Create directories if they don't exist

2.2.2 File Validation

  • Required Files:
    • ./config/values.json (field configuration)
    • .env (environment configuration with LAST_MODIFIED field)
  • Error Handling: Return 500 error if required files are missing
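The startup validation in 2.2 can be sketched as one function (the exception type is illustrative; the API layer would translate it into a 500 Internal Server Error):

```python
import os

REQUIRED_DIRS = ["./reports", "./processed", "./preprocessing", "./config"]
REQUIRED_FILES = ["./config/values.json", ".env"]

def validate_workspace(dirs=REQUIRED_DIRS, files=REQUIRED_FILES):
    """Create missing directories; fail if any required file is absent."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # directories are auto-created
    missing = [f for f in files if not os.path.exists(f)]
    if missing:
        # required files are never auto-created; the caller maps this to a 500
        raise FileNotFoundError(f"missing required files: {missing}")
```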

2.3 Error Handling

2.3.1 File System Errors

  • Missing Required Files: 500 Internal Server Error
  • Permission Issues: 500 Internal Server Error
  • Disk Space Issues: 500 Internal Server Error

2.3.2 AI Processing Errors

  • Gemini API Failures: 500 Internal Server Error
  • Invalid PDF Format: 500 Internal Server Error
  • Extraction Failures: 500 Internal Server Error

2.3.3 Data Processing Errors

  • JSON Parsing Errors: 500 Internal Server Error
  • Invalid Configuration: 500 Internal Server Error

3. Technical Requirements

3.1 External Dependencies

  • Gemini API: For PDF analysis and data extraction
  • PDF Processing Library: For page extraction and manipulation
  • File System Access: For directory and file operations

3.2 Configuration Files

3.2.1 values.json Structure

  • Location: ./config/values.json
  • Purpose: Define fields and structure for data extraction
  • Format: JSON schema defining expected output format
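A minimal illustrative values.json might look like the fragment below; the field names shown are examples only, since the spec does not enumerate the extracted fields:

```json
{
  "company_name": "",
  "fiscal_year": "",
  "total_revenue": "",
  "net_income": ""
}
```

Gemini is asked to complete this structure with values found in the appendix PDF, so the keys double as the extraction schema.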

3.2.2 Environment Configuration

  • File: .env
  • Required Fields: LAST_MODIFIED timestamp field
  • Purpose: Track processing timestamps for cache validation

3.3 Performance Requirements

  • Parallel Processing: Concurrent processing of multiple PDF files
  • Caching: Skip processing if timestamps match (Step 1 validation)
  • Efficiency: Reuse preprocessed files when available

3.4 Data Format Requirements

3.4.1 Individual JSON Files

  • Filename: [original_pdf_name].json
  • Content: Structured data extracted from appendix
  • Required Field: Company name (for consolidation key)

3.4.2 Consolidated Result File

  • Filename: result.json
  • Structure: { "company_name_1": {...}, "company_name_2": {...}, "last_modified": "timestamp" }
  • Purpose: Single source for all processed data

4. API Specification

4.1 Endpoint

  • Method: Not specified (recommend GET or POST)
  • Purpose: Process financial reports and return consolidated data

4.2 Response Format

  • Success: 200 OK with JSON data (result.json content minus the last_modified field)
  • Error: 500 Internal Server Error for any processing failures

4.3 Processing Behavior

  • Idempotent: Same results for repeated calls with unchanged data
  • Cacheable: Uses timestamp comparison for efficient processing
  • Batch-oriented: Processes all reports in single API call

5. Success Criteria

5.1 Functional Success

  • Successfully processes all PDF files in reports directory
  • Accurately extracts appendix sections using AI
  • Generates valid JSON outputs according to configuration
  • Provides consolidated results in single API response

5.2 Performance Success

  • Parallel processing reduces total processing time
  • Caching mechanism prevents unnecessary reprocessing
  • Handles multiple files efficiently

5.3 Reliability Success

  • Robust error handling for file system and AI processing errors
  • Consistent results across multiple API calls
  • Proper validation of all required dependencies