The Financial Report Processing API automates the extraction and consolidation of financial data from PDF reports using AI-powered document processing. The system processes reports in batches, extracts appendix sections, and generates structured JSON outputs based on predefined field configurations. Its intended consumers include:
- Financial analysts
- Data processing systems
- Business intelligence platforms
- Automated reporting pipelines
- Requirement: Check modification timestamp consistency
- Process:
  - Read the `./processed/result.json` file
  - Compare its `LAST_MODIFIED` field value with the `.env` file's corresponding field
  - If the values match: skip to Step 6 (response generation)
  - If the values differ: proceed to Step 2 (batch processing)
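The cache check above can be sketched as follows. This is a minimal illustration, not the prescribed implementation: it assumes the consolidated file stores its timestamp under `last_modified` (matching the consolidated-output structure described later) and that `.env` uses plain `KEY=VALUE` lines; the helper name `needs_reprocessing` is hypothetical.

```python
import json
import os

def needs_reprocessing(result_path="./processed/result.json", env_path=".env"):
    """Return True when the cached result is missing or its timestamp
    differs from the LAST_MODIFIED value in the .env file."""
    if not os.path.exists(result_path):
        return True  # no cache yet, so a full batch run is required
    with open(result_path, encoding="utf-8") as f:
        cached = json.load(f).get("last_modified")
    env_value = None
    with open(env_path, encoding="utf-8") as f:
        for line in f:
            key, sep, value = line.strip().partition("=")
            if sep and key == "LAST_MODIFIED":
                env_value = value
    return cached != env_value
```

When this returns `False`, the service can serve the cached `result.json` directly and skip the entire batch pipeline.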
- Requirement: Process all files in the `./reports` directory concurrently
- Scope: Each PDF file in the reports directory
- Execution: Parallel processing for optimal performance
For each report file, execute the following sub-steps sequentially:
- Check if `[filename].json` exists in the `./processed` folder
  - If it exists: skip to sub-step 6 (save processed data)
  - If not: continue to sub-step 2
- Check if `[filename]` exists in the `./preprocessing` folder
  - If it exists: skip to sub-step 5 (Gemini data extraction)
  - If not: continue to sub-step 3
- AI Integration: Send PDF file to Gemini API
- Request: "Identify the page numbers where the Appendix section begins and ends in this report"
- Input: Original PDF file from the `./reports` directory
- Output: Page range for the Appendix section
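Whatever the model replies must be turned into concrete page numbers before extraction can run. The sketch below assumes (this is not in the spec) that the prompt asks the model to answer with a JSON object such as `{"start_page": 42, "end_page": 57}`, and falls back to the first two integers in a free-form reply:

```python
import json
import re

def parse_page_range(reply_text):
    """Pull 1-based start/end page numbers out of a model reply.
    Prefers an embedded JSON object like {"start_page": 42, "end_page": 57};
    falls back to the first two integers found in the text."""
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            return int(data["start_page"]), int(data["end_page"])
        except (ValueError, KeyError, TypeError):
            pass  # malformed JSON: fall through to the plain-text heuristic
    numbers = [int(n) for n in re.findall(r"\d+", reply_text)]
    if len(numbers) >= 2:
        return numbers[0], numbers[1]
    raise ValueError("could not determine appendix page range")
```

Raising on an unparseable reply fits the spec's error model, where extraction failures surface as 500 responses.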
- Process: Extract pages from appendix start to end
- Output: Save the extracted PDF to the `./preprocessing` folder under the same filename as the source
- Format: Maintain the original filename convention
- AI Integration: Process appendix PDF with configuration
- Inputs:
  - Appendix PDF from the `./preprocessing` folder
  - `values.json` configuration from the `./config` folder
- Request: "Extract field values from the PDF according to the JSON format specified in values.json. Complete the JSON structure with information found in the PDF. Do not include source citations."
- Output: Structured JSON data
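Since the model is asked to complete the JSON structure defined in `values.json`, the reply should be validated against that template before it is saved. A minimal sketch, assuming `values.json` is a flat JSON object whose top-level keys name the expected fields (the helper name `validate_extraction` is hypothetical):

```python
import json

def validate_extraction(response_text, template):
    """Parse the model's JSON reply and check it against the values.json
    template: every configured top-level field must be present.
    `template` is the dict loaded from ./config/values.json."""
    data = json.loads(response_text)
    missing = set(template) - set(data)
    if missing:
        raise ValueError(f"model response is missing fields: {sorted(missing)}")
    return data
```

A `json.JSONDecodeError` or missing-field error here corresponds to the spec's "JSON Parsing Errors" 500 case.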
- Process: Save Gemini response as JSON file
- Location: `./processed` folder
- Filename: `[original_pdf_filename].json`
- Format: Valid JSON structure
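The filename mapping in this sub-step (`acme_2024.pdf` becomes `./processed/acme_2024.json`) can be sketched as follows; `save_processed` is a hypothetical helper name:

```python
import json
from pathlib import Path

def save_processed(pdf_path, extracted, processed_dir="./processed"):
    """Write extracted data under the source report's name:
    ./reports/acme_2024.pdf -> ./processed/acme_2024.json."""
    out = Path(processed_dir) / (Path(pdf_path).stem + ".json")
    out.parent.mkdir(parents=True, exist_ok=True)  # create ./processed on demand
    out.write_text(json.dumps(extracted, indent=2), encoding="utf-8")
    return out
```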
- Requirement: Wait for all individual JSON files to be generated
- Process: Monitor completion of all parallel processing tasks
- Validation: Ensure all expected JSON files exist before proceeding
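The parallel dispatch (Step 2) and the wait-for-completion step above can be sketched together with the standard `concurrent.futures` module; `process_report` stands in for the six per-file sub-steps:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(report_paths, process_report, max_workers=4):
    """Run process_report over every file concurrently and block until
    all results are in. Any worker exception is re-raised here, which
    maps naturally onto the API's blanket 500 error handling."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_report, path): path for path in report_paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # raises if the worker failed
    return results
```

Because the per-file work is dominated by Gemini API calls and file I/O, threads are a reasonable choice here; a process pool would matter only if local PDF parsing became CPU-bound.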
- Process: Create consolidated result file
- Structure: Nested JSON with company names as keys
- Data Source: All individual JSON files from the `./processed` folder
- Metadata: Add a `last_modified` field with the value from the `.env` file's `LAST_MODIFIED` field
- Output: Save as `./processed/result.json`
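The consolidation step above can be sketched like this. It assumes each per-file JSON exposes the required company name under a `company_name` key (the spec requires a company-name field but does not fix its exact name), and that `.env` uses `KEY=VALUE` lines:

```python
import json
from pathlib import Path

def consolidate(processed_dir="./processed", env_path=".env"):
    """Merge every per-report JSON into result.json, keyed by company
    name, and stamp it with the .env LAST_MODIFIED value."""
    merged = {}
    for path in sorted(Path(processed_dir).glob("*.json")):
        if path.name == "result.json":
            continue  # never fold a previous consolidated result into itself
        record = json.loads(path.read_text(encoding="utf-8"))
        merged[record["company_name"]] = record
    for line in Path(env_path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "LAST_MODIFIED":
            merged["last_modified"] = value.strip()
    out = Path(processed_dir) / "result.json"
    out.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged
```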
- Source: Read `./processed/result.json`
- Process: Return all content except the `LAST_MODIFIED` field
- Format: JSON response
- Status: 200 OK
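Response generation reduces to loading the consolidated file and dropping the timestamp. The spec names the field both `last_modified` (in the consolidated structure) and `LAST_MODIFIED` (here), so this sketch tolerantly strips either casing:

```python
import json
from pathlib import Path

def build_response(result_path="./processed/result.json"):
    """Load the consolidated file and drop the internal timestamp
    before returning it as the 200 OK body."""
    data = json.loads(Path(result_path).read_text(encoding="utf-8"))
    data.pop("last_modified", None)
    data.pop("LAST_MODIFIED", None)  # spec uses both casings; strip either
    return data
```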
- Required Directories:
  - `./reports` (source PDF files)
  - `./processed` (output JSON files)
  - `./preprocessing` (intermediate PDF files)
  - `./config` (configuration files)
- Behavior: Create directories if they don't exist
- Required Files:
  - `./config/values.json` (field configuration)
  - `.env` (environment configuration with a `LAST_MODIFIED` field)
- Error Handling: Return 500 error if required files are missing
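The directory and file requirements above suggest a startup check along these lines; the directory and file names come from the spec, while the helper itself (`validate_environment`) is hypothetical:

```python
import os

REQUIRED_DIRS = ["./reports", "./processed", "./preprocessing", "./config"]
REQUIRED_FILES = ["./config/values.json", ".env"]

def validate_environment(base="."):
    """Create any missing required directories, then return the list of
    missing required files. A non-empty list should be surfaced to the
    caller as a 500 Internal Server Error."""
    for directory in REQUIRED_DIRS:
        os.makedirs(os.path.join(base, directory), exist_ok=True)
    return [path for path in REQUIRED_FILES
            if not os.path.isfile(os.path.join(base, path))]
```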
- Missing Required Files: 500 Internal Server Error
- Permission Issues: 500 Internal Server Error
- Disk Space Issues: 500 Internal Server Error
- Gemini API Failures: 500 Internal Server Error
- Invalid PDF Format: 500 Internal Server Error
- Extraction Failures: 500 Internal Server Error
- JSON Parsing Errors: 500 Internal Server Error
- Invalid Configuration: 500 Internal Server Error
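Since every failure class above collapses to a 500 response, one way to centralize that (a sketch with a hypothetical `error_response` helper, not a mandated design) is to keep the status fixed and preserve only the failure category for the body and logs:

```python
def error_response(exc):
    """Translate any processing failure into the 500 body the spec
    prescribes, keeping a coarse category label for diagnostics."""
    categories = {
        FileNotFoundError: "Missing Required Files",
        PermissionError: "Permission Issues",
        ValueError: "JSON Parsing Errors",
    }
    label = categories.get(type(exc), "Processing Failure")
    return 500, {"error": label, "detail": str(exc)}
```

Collapsing everything to 500 keeps the contract simple, at the cost of clients being unable to distinguish a misconfiguration from a transient Gemini outage; the category label in the body softens that somewhat.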
- Gemini API: For PDF analysis and data extraction
- PDF Processing Library: For page extraction and manipulation
- File System Access: For directory and file operations
- Location: `./config/values.json`
- Purpose: Define the fields and structure for data extraction
- Format: JSON schema defining expected output format
- File: `.env`
- Required Fields: `LAST_MODIFIED` timestamp field
- Purpose: Track processing timestamps for cache validation
- Parallel Processing: Concurrent processing of multiple PDF files
- Caching: Skip processing if timestamps match (Step 1 validation)
- Efficiency: Reuse preprocessed files when available
- Filename: `[original_pdf_name].json`
- Content: Structured data extracted from the appendix
- Required Field: Company name (for consolidation key)
- Filename: `result.json`
- Structure: `{ "company_name_1": {...}, "company_name_2": {...}, "last_modified": "timestamp" }`
- Purpose: Single source for all processed data
- Method: Not specified (recommend GET or POST)
- Purpose: Process financial reports and return consolidated data
- Success: 200 OK with JSON data (result.json content minus LAST_MODIFIED field)
- Error: 500 Internal Server Error for any processing failures
- Idempotent: Same results for repeated calls with unchanged data
- Cacheable: Uses timestamp comparison for efficient processing
- Batch-oriented: Processes all reports in single API call
- Successfully processes all PDF files in reports directory
- Accurately extracts appendix sections using AI
- Generates valid JSON outputs according to configuration
- Provides consolidated results in single API response
- Parallel processing reduces total processing time
- Caching mechanism prevents unnecessary reprocessing
- Handles multiple files efficiently
- Robust error handling for file system and AI processing errors
- Consistent results across multiple API calls
- Proper validation of all required dependencies