
Product Requirements Document: Financial Report Processing API

1. Overview

1.1 Product Purpose

The Financial Report Processing API automates the extraction and consolidation of financial data from PDF reports using AI-powered document processing. The system processes reports in batches, extracts appendix sections, and generates structured JSON outputs based on predefined field configurations.

1.2 Target Users

  • Financial analysts
  • Data processing systems
  • Business intelligence platforms
  • Automated reporting pipelines

2. Functional Requirements

2.1 Core Processing Flow

2.1.1 Initial Validation (Step 1)

  • Requirement: Check modification timestamp consistency
  • Process:
    • Read ./processed/result.json file
    • Compare its last_modified field value with the LAST_MODIFIED value in the .env file
    • If values match: skip to Step 6 (response generation)
    • If values differ: proceed to Step 2 (batch processing)
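The Step 1 cache check above can be sketched as follows (function name and the line-based .env parsing are illustrative choices, not part of the spec; the result.json field is assumed to be `last_modified` per section 3.4.2):

```python
import json
import os

def is_cache_valid(result_path="./processed/result.json", env_path=".env"):
    """Return True when result.json's last_modified matches .env's LAST_MODIFIED."""
    if not os.path.exists(result_path):
        return False  # no cached result yet: full batch processing required
    with open(result_path) as f:
        cached = json.load(f).get("last_modified")
    env_value = None
    with open(env_path) as f:
        for line in f:
            if line.startswith("LAST_MODIFIED="):
                env_value = line.split("=", 1)[1].strip()
    return cached is not None and cached == env_value
```

When this returns True the API skips directly to Step 6; otherwise it falls through to batch processing.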

2.1.2 Parallel File Processing (Step 2)

  • Requirement: Process all files in ./reports directory concurrently
  • Scope: Each PDF file in the reports directory
  • Execution: Parallel processing for optimal performance

2.1.3 Sequential File Operations (Step 3)

For each report file, execute the following sub-steps sequentially:

2.1.3.1 Processed File Validation
  • Check if [filename].json exists in ./processed folder
  • If exists: skip to sub-step 6 (save processed data)
  • If not exists: continue to sub-step 2
2.1.3.2 Preprocessing File Validation
  • Check if [filename] exists in ./preprocessing folder
  • If exists: skip to sub-step 5 (Gemini data extraction)
  • If not exists: continue to sub-step 3
2.1.3.3 Appendix Detection
  • AI Integration: Send PDF file to Gemini API
  • Request: "Identify the page numbers where the Appendix section begins and ends in this report"
  • Input: Original PDF file from ./reports directory
  • Output: Page range for appendix section
2.1.3.4 PDF Appendix Extraction
  • Process: Extract pages from appendix start to end
  • Output: Save extracted PDF to ./preprocessing folder with same filename as source
  • Format: Maintain original filename convention
2.1.3.5 Data Extraction with Gemini
  • AI Integration: Process appendix PDF with configuration
  • Inputs:
    • Appendix PDF from ./preprocessing folder
    • values.json configuration from ./config folder
  • Request: "Extract field values from the PDF according to the JSON format specified in values.json. Complete the JSON structure with information found in the PDF. Do not include source citations."
  • Output: Structured JSON data
2.1.3.6 Individual Result Storage
  • Process: Save Gemini response as JSON file
  • Location: ./processed folder
  • Filename: [original_pdf_filename].json
  • Format: Valid JSON structure
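The skip logic in sub-steps 1 and 2 can be sketched as a dispatch function (names are illustrative; this assumes `[filename].json` means the PDF's stem plus a `.json` extension, and omits the Gemini and PDF-extraction calls themselves):

```python
import os

def first_substep(report_filename, processed_dir="./processed",
                  preprocessing_dir="./preprocessing"):
    """Return the sub-step at which processing resumes for one report file."""
    stem, _ = os.path.splitext(report_filename)
    if os.path.exists(os.path.join(processed_dir, stem + ".json")):
        return 6  # result already extracted: skip to individual result storage
    if os.path.exists(os.path.join(preprocessing_dir, report_filename)):
        return 5  # appendix already extracted: skip to Gemini data extraction
    return 3      # nothing cached: start with appendix detection
```

Centralizing this check keeps the caching rules in one place and makes the resume behavior easy to unit-test.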

2.1.4 Completion Synchronization (Step 4)

  • Requirement: Wait for all individual JSON files to be generated
  • Process: Monitor completion of all parallel processing tasks
  • Validation: Ensure all expected JSON files exist before proceeding
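Steps 2 and 4 together map naturally onto a fan-out/join pattern, sketched here with `concurrent.futures` (`process_report` is a placeholder for the per-file sub-steps in 2.1.3):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_all_reports(process_report, reports_dir="./reports"):
    """Run process_report on every PDF concurrently and wait for all to finish."""
    pdfs = sorted(Path(reports_dir).glob("*.pdf"))
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(process_report, pdf) for pdf in pdfs]
        # Step 4: block until every per-file task completes; .result()
        # re-raises any worker exception so failures surface as errors.
        return [f.result() for f in futures]
```

Threads suffice here because the per-file work is dominated by I/O (Gemini API calls and file reads); a process pool would be an alternative if local PDF manipulation turns out to be CPU-bound.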

2.1.5 Result Consolidation (Step 5)

  • Process: Create consolidated result file
  • Structure: Nested JSON with company names as keys
  • Data Source: All individual JSON files from ./processed folder
  • Metadata: Add last_modified field with value from .env LAST_MODIFIED field
  • Output: Save as ./processed/result.json
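A minimal consolidation sketch for Step 5 (it assumes each per-file JSON exposes the company name under a `company_name` key; that exact field name is an assumption, since the spec only requires a company name for the consolidation key):

```python
import json
from pathlib import Path

def consolidate(processed_dir="./processed", last_modified=""):
    """Merge all per-file JSONs into result.json, keyed by company name."""
    result = {}
    for path in sorted(Path(processed_dir).glob("*.json")):
        if path.name == "result.json":
            continue  # ignore a previously consolidated file
        data = json.loads(path.read_text())
        result[data["company_name"]] = data
    result["last_modified"] = last_modified  # value from .env LAST_MODIFIED
    out = Path(processed_dir) / "result.json"
    out.write_text(json.dumps(result, indent=2))
    return result
```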

2.1.6 API Response Generation (Step 6)

  • Source: Read ./processed/result.json
  • Process: Return all content except the last_modified field
  • Format: JSON response
  • Status: 200 OK
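Step 6 reduces to loading the consolidated file and stripping the cache metadata before serialization; a framework-agnostic sketch (wiring it to an actual 200 OK response is left to the chosen web framework):

```python
import json

def build_response_body(result_path="./processed/result.json"):
    """Return result.json content with the last_modified metadata removed."""
    with open(result_path) as f:
        payload = json.load(f)
    payload.pop("last_modified", None)  # strip the cache-control field
    return payload
```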

2.2 Directory and File Management

2.2.1 Directory Validation

  • Required Directories:
    • ./reports (source PDF files)
    • ./processed (output JSON files)
    • ./preprocessing (intermediate PDF files)
    • ./config (configuration files)
  • Behavior: Create directories if they don't exist

2.2.2 File Validation

  • Required Files:
    • ./config/values.json (field configuration)
    • .env (environment configuration with LAST_MODIFIED field)
  • Error Handling: Return 500 error if required files are missing
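The startup validation in 2.2 can be sketched as one function (the exception type is illustrative; the API layer would translate it into a 500 Internal Server Error):

```python
import os

REQUIRED_DIRS = ["./reports", "./processed", "./preprocessing", "./config"]
REQUIRED_FILES = ["./config/values.json", ".env"]

def validate_workspace(dirs=REQUIRED_DIRS, files=REQUIRED_FILES):
    """Create missing directories; fail if any required file is absent."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # directories are auto-created
    missing = [f for f in files if not os.path.exists(f)]
    if missing:
        # required files are never auto-created; the caller maps this to a 500
        raise FileNotFoundError(f"missing required files: {missing}")
```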

2.3 Error Handling

2.3.1 File System Errors

  • Missing Required Files: 500 Internal Server Error
  • Permission Issues: 500 Internal Server Error
  • Disk Space Issues: 500 Internal Server Error

2.3.2 AI Processing Errors

  • Gemini API Failures: 500 Internal Server Error
  • Invalid PDF Format: 500 Internal Server Error
  • Extraction Failures: 500 Internal Server Error

2.3.3 Data Processing Errors

  • JSON Parsing Errors: 500 Internal Server Error
  • Invalid Configuration: 500 Internal Server Error

3. Technical Requirements

3.1 External Dependencies

  • Gemini API: For PDF analysis and data extraction
  • PDF Processing Library: For page extraction and manipulation
  • File System Access: For directory and file operations

3.2 Configuration Files

3.2.1 values.json Structure

  • Location: ./config/values.json
  • Purpose: Define fields and structure for data extraction
  • Format: JSON schema defining expected output format
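A minimal illustrative values.json might look like the fragment below; the field names shown are examples only, since the spec does not enumerate the extracted fields:

```json
{
  "company_name": "",
  "fiscal_year": "",
  "total_revenue": "",
  "net_income": ""
}
```

Gemini is asked to complete this structure with values found in the appendix PDF, so the keys double as the extraction schema.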

3.2.2 Environment Configuration

  • File: .env
  • Required Fields: LAST_MODIFIED timestamp field
  • Purpose: Track processing timestamps for cache validation

3.3 Performance Requirements

  • Parallel Processing: Concurrent processing of multiple PDF files
  • Caching: Skip processing if timestamps match (Step 1 validation)
  • Efficiency: Reuse preprocessed files when available

3.4 Data Format Requirements

3.4.1 Individual JSON Files

  • Filename: [original_pdf_name].json
  • Content: Structured data extracted from appendix
  • Required Field: Company name (for consolidation key)

3.4.2 Consolidated Result File

  • Filename: result.json
  • Structure: { "company_name_1": {...}, "company_name_2": {...}, "last_modified": "timestamp" }
  • Purpose: Single source for all processed data

4. API Specification

4.1 Endpoint

  • Method: Not specified (recommend GET or POST)
  • Purpose: Process financial reports and return consolidated data

4.2 Response Format

  • Success: 200 OK with JSON data (result.json content minus the last_modified field)
  • Error: 500 Internal Server Error for any processing failures

4.3 Processing Behavior

  • Idempotent: Same results for repeated calls with unchanged data
  • Cacheable: Uses timestamp comparison for efficient processing
  • Batch-oriented: Processes all reports in single API call

5. Success Criteria

5.1 Functional Success

  • Successfully processes all PDF files in reports directory
  • Accurately extracts appendix sections using AI
  • Generates valid JSON outputs according to configuration
  • Provides consolidated results in single API response

5.2 Performance Success

  • Parallel processing reduces total processing time
  • Caching mechanism prevents unnecessary reprocessing
  • Handles multiple files efficiently

5.3 Reliability Success

  • Robust error handling for file system and AI processing errors
  • Consistent results across multiple API calls
  • Proper validation of all required dependencies