A modern, scalable REST API for OCR identity document processing with S3 storage, database metadata management, and background task processing using Celery.
- Features
- Architecture Overview
- Entity Relationship Diagram (ERD)
- Prerequisites
- Quick Start
- Docker Hub
- Logging
- CI/CD Pipeline
- Multi-Database Setup
- Configuration
- API Endpoints
- Background Processing
- Docker Services
- Development
- Monitoring
- Security
- Production Deployment
- Changelog
- Contributing
- License
- Support
- OCR Processing: Extract text from identity documents (passports, ID cards, driver's licenses)
- spaCy NER: Advanced Named Entity Recognition for accurate information extraction
- S3 Storage: Secure file storage with MinIO (S3-compatible)
- Multi-Database Support: Connect to multiple databases with automatic routing
- Database Metadata: PostgreSQL with comprehensive media management
- Background Processing: Celery workers for async OCR and media processing
- Polymorphic Media: Flexible media relationships across models
- REST API: FastAPI with automatic documentation
- Microservices: Docker containers for each service
- Queue-based Processing: Redis-backed Celery for background tasks
- Object Storage: S3-compatible storage with MinIO
- Multi-Database: PostgreSQL 17 with support for multiple databases
- Database Routing: Automatic model routing to appropriate databases
- Caching: Redis for session and task management
- Email Testing: Mailpit for development email testing
- Dependency Management: Poetry for modern Python packaging
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │    │ Celery Workers  │    │   PostgreSQL    │
│   (Port 8000)   │    │  (OCR/Media)    │    │   (Port 5432)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   MinIO (S3)    │    │     Redis       │    │    Mailpit      │
│   (Port 9000)   │    │   (Port 6379)   │    │   (Port 8025)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
- Docker and Docker Compose
- Python 3.11+ (for local development)
- Poetry (for dependency management)
- Git
git clone git@github.com:turahe/ocr-identity-rest-api.git
cd ocr-identity-rest-api
# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -
# Or using pip
pip install poetry
# Copy environment template
cp config.example.env .env
# Edit environment variables
nano .env
# Install all dependencies (including dev)
poetry install
# Or install only production dependencies
poetry install --only=main
# Navigate to docker directory
cd docker
# Start development environment
./start-dev.sh
# Or use Docker Compose directly
docker-compose up -d
# View logs
docker-compose logs -f
# Check service status
docker-compose ps
# Run migrations
poetry run alembic upgrade head
# Create MinIO bucket
poetry run python scripts/setup_minio.py
# Download spaCy models
poetry run python scripts/download_spacy_models.py
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- MinIO Console: http://localhost:9001
- Mailpit: http://localhost:8025
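Once the containers are running, a quick check against the documented health endpoint (see API Endpoints below) confirms the API is responding:

# Verify the API is up
curl http://localhost:8000/health

# Then open the interactive docs at http://localhost:8000/docs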
The project includes scripts and commands for building and uploading Docker images to Docker Hub.
The scripts/docker-hub-upload.sh script provides a complete workflow for building and uploading Docker images:
# Basic usage
./scripts/docker-hub-upload.sh
# With version and Docker Hub username
./scripts/docker-hub-upload.sh 2.0.0 your-username
# Show help
./scripts/docker-hub-upload.sh --help
# Login to Docker Hub
make docker-hub-login
# Build images for Docker Hub
make docker-hub-build
# Upload images to Docker Hub
make docker-hub-upload VERSION=2.0.0 USERNAME=your-username
# Push images to Docker Hub
make docker-hub-push VERSION=2.0.0 USERNAME=your-username
# Set Docker Hub password (optional, will prompt if not set)
export DOCKER_PASSWORD="your-dockerhub-password"
The script creates multiple tags for each upload:
- turahe/ocr-identity-api:&lt;version&gt; - Production image with a specific version
- turahe/ocr-identity-api:latest - Latest production image
- turahe/ocr-identity-api:&lt;version&gt;-dev - Development image with a specific version
- turahe/ocr-identity-api:latest-dev - Latest development image
- turahe/ocr-identity-api:v&lt;version&gt; - Versioned tag (if version != latest)
- turahe/ocr-identity-api:v&lt;version&gt;-dev - Versioned development tag
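Once published, any of these tags can be pulled directly. For example, assuming version 2.0.0 has been uploaded:

# Pull the latest production image
docker pull turahe/ocr-identity-api:latest

# Pull a specific development build
docker pull turahe/ocr-identity-api:2.0.0-dev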
- Authentication: Automatic Docker Hub login with password prompt
- Multi-target builds: Builds both production and development images
- Version tagging: Creates multiple version tags automatically
- Error handling: Comprehensive error checking and reporting
- Cleanup options: Optional local image cleanup after upload
- Colored output: Clear, colored status messages
# Upload latest version
./scripts/docker-hub-upload.sh latest myusername
# Upload specific version
./scripts/docker-hub-upload.sh 2.0.0 myusername
# Upload with environment variable
export DOCKER_PASSWORD="mypassword"
./scripts/docker-hub-upload.sh 2.0.0 myusername
The application includes comprehensive logging with structured output, multiple handlers, and environment-specific configurations.
The project includes comprehensive GitHub Actions workflows for automated testing, building, and deployment.
- Triggers: Push to main/develop, Pull requests, Release published
- Features: Testing, linting, Docker building, automated deployment
- Jobs: Test & Quality Checks, Security Scan, Docker Build, Deploy
- Triggers: Manual workflow dispatch
- Purpose: Manual deployment to staging or production
- Features: Environment selection, version specification, health checks
- Triggers: Release published
- Features: Release asset creation, Docker image versioning
- Jobs: Build Release Assets, Docker Release
- Triggers: Weekly schedule, Manual dispatch, Push to main/develop
- Features: Comprehensive security scanning, vulnerability checks
- Jobs: Security Analysis, Dependency Scan, Container Scan
- Triggers: Changes to docs/, Manual dispatch
- Features: Documentation building, link validation, GitHub Pages deployment
- Jobs: Build Documentation, Check Links, Deploy Docs
# Docker Hub
DOCKER_USERNAME=your-docker-username
DOCKER_PASSWORD=your-docker-password
# Production Environment
PRODUCTION_HOST=your-production-server-ip
PRODUCTION_USERNAME=your-production-username
PRODUCTION_SSH_KEY=your-production-ssh-private-key
PRODUCTION_URL=https://your-production-domain.com
# Staging Environment
STAGING_HOST=your-staging-server-ip
STAGING_USERNAME=your-staging-username
STAGING_SSH_KEY=your-staging-ssh-private-key
STAGING_URL=https://your-staging-domain.com
# Security Tools
SNYK_TOKEN=your-snyk-token
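These secrets can be added in the repository's Settings → Secrets and variables page, or, as one possible approach if the GitHub CLI is available, from the command line (secret names as listed above; values are placeholders):

# Set a secret from a literal value
gh secret set DOCKER_PASSWORD --body "your-docker-password"

# Set a secret from a file (e.g. an SSH private key)
gh secret set PRODUCTION_SSH_KEY < ~/.ssh/production_deploy_key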
- Go to Actions tab in GitHub
- Select "Manual Deploy" workflow
- Choose environment (staging/production)
- Optionally specify Docker image version
- Click "Run workflow"
- Create a new release in GitHub
- Tag with semantic version (e.g., v2.0.0)
- Publish the release
- Workflows automatically build assets and Docker images
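One way to cut a release matching these steps, assuming the GitHub CLI is installed (creating the release in the web UI works equally well):

# Tag with a semantic version and publish the release
git tag v2.0.0
git push origin v2.0.0
gh release create v2.0.0 --title "v2.0.0" --generate-notes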
For detailed workflow documentation, see .github/workflows/README.md.
- app.log: General application logs
- error.log: Error and exception logs
- access.log: API request/response logs
- app.json: JSON-formatted logs (production only)
- DEBUG: Detailed information for debugging
- INFO: General application information
- WARNING: Warning messages
- ERROR: Error messages and exceptions
- CRITICAL: Critical system errors
# Setup logging
make setup-logging
# View logs in real-time
make view-logs # Application logs
make view-errors # Error logs
make view-access # Access logs
# Log management
make clean-logs # Clean all log files
make log-stats # Show log file statistics
# Get list of log files
GET /logging/logs
# View log content
GET /logging/logs/{filename}?lines=100
# Download log file
GET /logging/logs/{filename}/download
# Get log statistics
GET /logging/stats
# Clear all logs (admin only)
POST /logging/clear
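For example, to read the application log over the API (assuming the same bearer-token authentication used by the database endpoints, and the app.log file listed above):

# List available log files
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/logging/logs

# View the last 100 lines of app.log
curl -H "Authorization: Bearer YOUR_TOKEN" "http://localhost:8000/logging/logs/app.log?lines=100"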
- Structured Logging: JSON format in production
- Colored Output: Colored console output in development
- Log Rotation: Automatic log file rotation (10MB max, 5 backups)
- Request Tracking: Request IDs for tracing
- Performance Monitoring: Request timing and performance metrics
- Service Logging: Dedicated loggers for S3, Redis, Email, Celery
- Database Logging: SQL query logging and performance tracking
- Development: Colored console output, DEBUG level
- Staging: Standard output, INFO level
- Production: JSON format, INFO level, file rotation
The application supports connecting to multiple databases with automatic model routing.
# Use multi-database configuration
cp config.multi_db.example.env .env
# Start multi-database services
./start-multi-db.sh
# Test multi-database functionality
poetry run python scripts/test_multi_database.py
- Database Routing: Models automatically routed to appropriate databases
- Health Monitoring: Monitor all database connections
- Statistics: Get detailed statistics for each database
- Query Execution: Execute queries on specific databases
- Backup Information: Get backup information for all databases
- Default: Main application data (User, People, Media, etc.)
- Analytics: Analytics and reporting data
- Logging: Application logs and audit data
- Archive: Archived and historical data
# Health check all databases
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/database/health
# Get database statistics
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/database/stats
# Get configured databases
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/database/configured
For detailed multi-database documentation, see docs/MULTI_DATABASE.md.
DB_HOST=postgres
DB_PORT=5432
DB_USERNAME=postgres
DB_PASSWORD=postgres
DB_NAME=ocr_identity_db
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_REGION=us-east-1
S3_BUCKET_NAME=ocr-identity-bucket
S3_ENDPOINT_URL=http://minio:9000
S3_USE_SSL=false
S3_VERIFY_SSL=false
SECRET_KEY=your-secret-key-change-in-production
DEBUG=false
ENVIRONMENT=development
POST /upload-image/
Content-Type: multipart/form-data
file: [image file]
user_id: [optional]
Response:
{
"status": "uploading",
"task_id": "abc123-def456",
"filename": "passport.jpg",
"content_type": "image/jpeg",
"message": "File upload started. Use task_id to check status."
}
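A typical call with curl might look like this (file name and the optional user_id are placeholders):

curl -X POST http://localhost:8000/upload-image/ \
  -F "file=@passport.jpg" \
  -F "user_id=123"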
POST /upload-image-sync/
Content-Type: multipart/form-data
file: [image file]
user_id: [optional]
Response:
{
"status": "success",
"media_id": "uuid-here",
"s3_key": "uploads/abc123.jpg",
"s3_url": "http://minio:9000/bucket/uploads/abc123.jpg",
"file_hash": "abc123...",
"file_size": 1024000,
"ocr_task_id": "def456-ghi789",
"message": "File uploaded and OCR processing started"
}
GET /task/{task_id}
Response:
{
"task_id": "abc123-def456",
"state": "SUCCESS",
"result": {
"status": "success",
"media_id": "uuid-here",
"ocr_job_id": "job-uuid",
"extracted_text": "PASSPORT...",
"processing_time_ms": 1500
}
}
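The task_id returned by the upload call can be polled until the state reaches SUCCESS (or FAILURE):

# Check the task state once
curl http://localhost:8000/task/abc123-def456

# Or re-check every 2 seconds
watch -n 2 "curl -s http://localhost:8000/task/abc123-def456"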
GET /media/{media_id}
GET /media/{media_id}/ocr
DELETE /media/{media_id}
GET /health
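Example calls for the media endpoints above (media_id is a placeholder UUID; add an Authorization header if your deployment requires one):

# Fetch media metadata
curl http://localhost:8000/media/uuid-here

# Fetch the OCR result for a media record
curl http://localhost:8000/media/uuid-here/ocr

# Delete a media record
curl -X DELETE http://localhost:8000/media/uuid-here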
# Start OCR worker
poetry run python scripts/start_celery_worker.py --worker --queue ocr --concurrency 2
# Start media worker
poetry run python scripts/start_celery_worker.py --worker --queue media --concurrency 4
# Start beat scheduler
poetry run python scripts/start_celery_worker.py --beat
OCR tasks:
- process_ocr_image: Process single image OCR
- process_bulk_ocr: Process multiple images
- cleanup_failed_ocr_jobs: Clean up failed jobs

Media tasks:
- upload_media_to_s3: Upload file to S3
- process_media_batch: Process multiple files
- cleanup_orphaned_media: Clean up orphaned records
- generate_media_thumbnails: Generate image thumbnails
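To confirm the workers have registered these tasks, the standard Celery inspection commands can be run against the app module referenced in the Monitoring section:

# List the task names registered on the running workers
poetry run celery -A app.core.celery_app inspect registered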
The project now uses an optimized Docker setup with environment-specific configurations:
docker/
├── docker-compose.yml            # Main configuration
├── docker-compose.dev.yml        # Development overrides
├── docker-compose.staging.yml    # Staging overrides
├── docker-compose.prod.yml       # Production overrides
├── docker-compose.multi-db.yml   # Multi-database setup
├── start-dev.sh                  # Development startup script
└── README.md                     # Docker documentation
- app: FastAPI application
- postgres: PostgreSQL 17 database
- redis: Redis cache and message broker
- minio: S3-compatible object storage
- mailpit: Email testing service
- celery_worker_ocr: OCR processing worker
- celery_worker_media: Media processing worker
- celery_beat: Task scheduler
- Development: Hot reload, debug mode, reduced resources
- Staging: Testing optimizations, separate ports
- Production: High performance, security optimizations
- Multi-Database: Separate databases for different purposes
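Each environment is started by layering its override file on top of the base compose file (the same pattern used for the multi-database setup below); the start-*.sh scripts shown later presumably wrap this. For example, a production-style start might look like:

cd docker
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d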
# Install development dependencies
poetry install
# Run migrations
poetry run alembic upgrade head
# Start services
docker compose up -d postgres redis minio
# Run application
poetry run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
# Show all available commands
make help
# Quick development setup
make dev-setup
# Quick start with Docker
make quick-start
# Run tests
make test
# Code quality checks
make check
# Development environment
cd docker && ./start-dev.sh
# Staging environment
cd docker && ./start-staging.sh
# Production environment
cd docker && ./start-prod.sh
# Multi-database setup
cd docker && docker-compose -f docker-compose.yml -f docker-compose.multi-db.yml up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
# Install dependencies
poetry install
# Add new dependency
poetry add package-name
# Add development dependency
poetry add --group dev package-name
# Update dependencies
poetry update
# Run commands in Poetry environment
poetry run python main.py
poetry run pytest
poetry run black app/
# Activate Poetry shell
poetry shell
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=app --cov-report=html
# Run specific test categories
poetry run pytest tests/test_media_models.py
poetry run pytest tests/test_ocr_tasks.py
poetry run pytest tests/test_integration.py
# Format code
poetry run black app/ tests/ scripts/
poetry run isort app/ tests/ scripts/
# Lint code
poetry run flake8 app/ tests/ scripts/
# Type checking
poetry run mypy app/ scripts/
# Check Celery worker status
poetry run celery -A app.core.celery_app inspect active
# Monitor task queues
poetry run celery -A app.core.celery_app inspect stats
# Connect to database
docker compose exec postgres psql -U postgres -d ocr_identity_db
# Check media records
SELECT COUNT(*) FROM media;
# Check OCR jobs
SELECT job_status, COUNT(*) FROM ocr_jobs GROUP BY job_status;
- Access MinIO Console: http://localhost:9001
- Default credentials: minioadmin/minioadmin
- Check bucket contents and access logs
- File size limits (configurable)
- File type validation
- Hash-based deduplication
- Secure S3 access with presigned URLs
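To check presigned access against the local MinIO endpoint, a URL can be generated with the AWS CLI using the development credentials shown earlier (bucket and object key are placeholders taken from the examples above):

# Generate a presigned URL valid for one hour
AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin AWS_DEFAULT_REGION=us-east-1 \
aws s3 presign s3://ocr-identity-bucket/uploads/abc123.jpg \
  --endpoint-url http://localhost:9000 \
  --expires-in 3600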
- Connection pooling
- Prepared statements
- Soft delete for data retention
- Audit logging
- Input validation
- Error handling
- Rate limiting (configurable)
- CORS configuration
# Production environment
ENVIRONMENT=production
DEBUG=false
SECRET_KEY=<strong-secret-key>
# Database
DB_PASSWORD=<strong-password>
REDIS_PASSWORD=<strong-password>
# S3 (AWS or other provider)
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
S3_BUCKET_NAME=<your-bucket>
# Scale workers
docker compose up -d --scale celery_worker_ocr=3 --scale celery_worker_media=2
# Scale application
docker compose up -d --scale app=3
- Set up logging aggregation
- Configure health checks
- Monitor resource usage
- Set up alerting
- S3 Storage: Replaced local file storage with S3/MinIO
- Database Metadata: Added comprehensive media management
- Background Processing: Implemented Celery for async tasks
- Polymorphic Media: Added flexible media relationships
- API Enhancement: New endpoints for task and media management
- Docker Optimization: Multi-service architecture with Celery workers
- Poetry Migration: Modern dependency management with Poetry
- Basic OCR functionality
- Local file storage
- Simple REST API
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue on GitHub
- Check the documentation
- Review the API documentation at /docs
Note: This is a production-ready OCR identity document processing API with modern architecture, scalable design, and comprehensive testing. The system is designed to handle high-volume document processing with background task management and secure file storage.