A Python application that reads Gmail .mbox files (from Gmail backup/export) and stores email data with metadata in a PostgreSQL database. LumiBox is evolving into an AI-powered email intelligence platform with natural language search, conversation analysis, and privacy-first local processing.
The goal is to transform LumiBox from a simple backup tool into a comprehensive email intelligence system:
- Complete Gmail Backup: Secure local storage with full fidelity
- AI-Powered Search: Natural language queries using local LLMs
- Privacy-First: All processing happens on your infrastructure
- Actionable Insights: Email analytics, summaries, and relationship mapping
- Mbox File Processing: Reads Gmail .mbox files and extracts email metadata
- PostgreSQL Storage: Stores emails with comprehensive metadata in PostgreSQL
- Configuration Management: Uses YAML configuration files and environment variables
- Batch Processing: Processes multiple .mbox files from a directory
- Error Handling: Robust error handling with detailed logging
- Duplicate Prevention: Prevents duplicate emails using message ID
- Connection Pooling: Efficient database connection management
- Natural Language Search: "Find emails about the contract negotiation with Acme Corp"
- Intelligent Summaries: AI-generated email thread summaries and insights
- Semantic Search: Find emails by meaning, not just keywords
- Email Analytics: Communication patterns, relationship mapping, productivity insights
- Conversational Interface: Chat with your email history using local LLMs
- Privacy-First AI: All AI processing happens locally on your machine
See PROJECT_ROADMAP.md for the complete feature development plan.
- Python 3.11+
- PostgreSQL 15+
- Git
- Clone and set up:
git clone <repository-url>
cd LumiBox
pip install -r requirements.txt
- Configure the database:
# Create PostgreSQL database
createdb gmail_mbox
# Set up environment
cp .env.example .env
# Edit .env with your database credentials
- Process your Gmail backup:
python example_usage.py /path/to/your/mbox/files
- Read the PROJECT_ROADMAP.md to understand the AI features coming next
- Follow the development progress for natural language email search
- Check the Issues to contribute or suggest features
LumiBox/
├── src/
│   └── mbox_processor.py    # Main MboxProcessor class
├── config/
│   └── database.yaml        # Database and processing configuration
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── example_usage.py         # Example usage script
├── PROJECT_ROADMAP.md       # Complete development roadmap and AI features plan
└── README.md                # This file
- Clone the repository (if not already done):
git clone <repository-url>
cd LumiBox
- Install Python dependencies:
pip install -r requirements.txt
- Set up PostgreSQL database:
  - Install PostgreSQL if not already installed
  - Create a database for storing emails:
CREATE DATABASE gmail_mbox;
- Configure environment variables:
cp .env.example .env
Edit .env file with your database credentials:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=gmail_mbox
DB_USER=your_username
DB_PASSWORD=your_password
LOG_LEVEL=INFO
To export .mbox files from Gmail:
- Go to Google Takeout
- Select "Mail"
- Choose "Include all messages in Mail"
- Select format as "mbox"
- Download and extract the archive
- Use the extracted .mbox files with LumiBox
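Once the archive is extracted, it can help to locate the .mbox files and sanity-check them before importing. The sketch below uses only the Python standard library; `find_mbox_files` and `count_messages` are hypothetical helper names for illustration and are not part of LumiBox.

```python
import mailbox
from pathlib import Path


def find_mbox_files(root):
    """Return every .mbox file under root (searched recursively), sorted by path."""
    return sorted(p for p in Path(root).rglob("*.mbox") if p.is_file())


def count_messages(mbox_path):
    """Count messages in one .mbox file by scanning its message boundaries."""
    # create=False: raise instead of silently creating an empty file
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        return sum(1 for _ in box.iterkeys())
    finally:
        box.close()
```

A quick count per file gives a baseline to compare against the processed-email statistics after an import.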
The processor extracts comprehensive metadata from each email:
- Headers: All email headers including custom Gmail headers
- Content: Both plain text and HTML versions
- Attachments: Count and metadata (content can be stored)
- Gmail Labels: Extracted from X-Gmail-Labels header
- Thread Information: Gmail thread IDs
- Dates: Both original send date and processing timestamp
- Natural Language Queries: "Show me emails about budget discussions from Q4"
- Semantic Understanding: Find emails by meaning, not just keywords
- Context-Aware Results: Understanding email threads and relationships
- Multi-Modal Search: Search by content, attachments, dates, and relationships
- Thread Summarization: AI-generated summaries of long email conversations
- Action Item Extraction: Automatically identify tasks and deadlines
- Sentiment Analysis: Understand the tone and urgency of communications
- Relationship Mapping: Visualize communication patterns and networks
- Local Processing: All AI operations happen on your machine
- No Data Transmission: Emails never leave your infrastructure
- Offline Capable: Works without internet connection
- Open Source Models: Use local LLMs like Llama, Mistral, etc.
The application automatically creates the following tables:
Emails table:
- id: Primary key (auto-increment)
- message_id: Unique email message ID
- subject: Email subject
- sender: Sender email address
- recipient: Recipient email addresses
- date_sent: Original send date
- date_received: Processing timestamp
- body_text: Plain text body
- body_html: HTML body
- attachments_count: Number of attachments
- labels: Gmail labels (array)
- thread_id: Gmail thread ID
- raw_headers: All email headers (JSON)
- created_at: Record creation timestamp
- updated_at: Record update timestamp
Attachments table:
- id: Primary key (auto-increment)
- email_id: Foreign key to emails table
- filename: Attachment filename
- content_type: MIME type
- size_bytes: File size
- content: Binary content
- created_at: Record creation timestamp
- Using the example script:
python example_usage.py /path/to/mbox/files
- Using the MboxProcessor class directly:
from src.mbox_processor import MboxProcessor

# Initialize processor
processor = MboxProcessor()

# Process a single .mbox file
stats = processor.process_mbox_file('/path/to/file.mbox')
print(f"Processed {stats['processed_emails']} emails")

# Process all .mbox files in a directory
stats = processor.process_mbox_directory('/path/to/mbox/directory')
print(f"Processed {stats['processed_emails']} total emails")

# Always close when done
processor.close()
# Process a single .mbox file
python src/mbox_processor.py /path/to/file.mbox
# Process all .mbox files in a directory
python src/mbox_processor.py /path/to/mbox/directory
# Using the example script (interactive)
python example_usage.py
# Using the example script with path argument
python example_usage.py /path/to/mbox/files
Variable | Description | Default |
---|---|---|
DB_HOST | PostgreSQL host | localhost |
DB_PORT | PostgreSQL port | 5432 |
DB_NAME | Database name | gmail_mbox |
DB_USER | Database username | - |
DB_PASSWORD | Database password | - |
LOG_LEVEL | Logging level | INFO |
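To make the defaults concrete, here is a minimal sketch of turning these variables into connection settings. `load_db_settings` is a hypothetical helper, not LumiBox's actual code, and failing fast when DB_USER or DB_PASSWORD is missing is a design choice of this sketch. In the real application, python-dotenv would presumably load .env into the environment first.

```python
import os


def load_db_settings(env=None):
    """Build connection settings from the documented environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "gmail_mbox"),
        "user": env["DB_USER"],          # no default: raises KeyError if unset
        "password": env["DB_PASSWORD"],  # no default: raises KeyError if unset
    }
```

The keys follow standard PostgreSQL connection parameter names, so the resulting dict could be passed to `psycopg2.connect(**settings)`.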
The YAML configuration file contains:
- Database connection pool settings
- Table schemas
- Processing batch size
- Retry configuration
- Logging format
You can modify these settings as needed for your environment.
LumiBox is evolving beyond simple backup to become a comprehensive email intelligence platform:
- Vector database integration for semantic search
- Natural language query processing
- Local LLM integration with Ollama
- Context-aware email search
- Conversation thread analysis
- AI-powered email summaries
- Communication pattern analysis
- Relationship mapping
- Productivity insights
Get Involved:
- Check PROJECT_ROADMAP.md for detailed plans
- Report issues or suggest features
- Contribute to the AI integration development
- Duplicate Prevention: Uses message ID to prevent duplicates
- Encoding Handling: Properly decodes various character encodings
- Malformed Emails: Gracefully handles corrupted or malformed emails
- Database Errors: Comprehensive error handling with rollback
- Logging: Detailed logging for debugging and monitoring
- Connection Pooling: Efficient database connection management
- Batch Processing: Configurable batch sizes for large datasets
- Progress Tracking: Regular progress updates during processing
- Memory Efficient: Processes emails one at a time to manage memory
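The duplicate-prevention, rollback, and batching behaviour described above can be sketched as follows. To keep the example runnable without a server, it uses SQLite from the standard library as a stand-in for PostgreSQL; the table is reduced to two columns, and `create_schema` and `insert_emails` are hypothetical names. With PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT (message_id) DO NOTHING`.

```python
import sqlite3


def create_schema(conn):
    """Create a cut-down emails table with a unique message_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails ("
        "id INTEGER PRIMARY KEY, "
        "message_id TEXT UNIQUE NOT NULL, "
        "subject TEXT)"
    )


def insert_emails(conn, emails, batch_size=100):
    """Insert (message_id, subject) pairs one at a time, skipping known IDs.

    Commits every batch_size rows, rolls back the open batch on a database
    error, and returns the number of rows actually inserted.
    """
    inserted = 0
    pending = 0
    try:
        for message_id, subject in emails:
            cur = conn.execute(
                "INSERT OR IGNORE INTO emails (message_id, subject) VALUES (?, ?)",
                (message_id, subject),
            )
            inserted += cur.rowcount  # 1 for a new row, 0 for a skipped duplicate
            pending += 1
            if pending >= batch_size:
                conn.commit()
                pending = 0
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # discard the uncommitted part of the current batch
        raise
    return inserted
```

Because uniqueness is enforced by the database, re-running an import over the same .mbox file inserts nothing new, and a smaller batch_size keeps fewer uncommitted rows in flight at the cost of more commits.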
- Database Connection Error:
  - Verify PostgreSQL is running
  - Check database credentials in .env
  - Ensure database exists
- Permission Errors:
  - Check file permissions on .mbox files
  - Ensure database user has necessary privileges
- Memory Issues with Large Files:
  - Reduce batch size in config/database.yaml
  - Process files individually instead of entire directories
- Encoding Errors:
  - The processor handles most encoding issues automatically
  - Check logs for specific encoding problems
The application provides detailed logging. To increase verbosity:
LOG_LEVEL=DEBUG
Logs include:
- Processing progress
- Error details
- Database operations
- Performance metrics
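For illustration, a LOG_LEVEL setting like the one above might be applied as follows. This is a sketch using the standard logging module: `configure_logging` is a hypothetical name, and the format string is an assumption, since the real format lives in config/database.yaml.

```python
import logging
import os


def configure_logging(env=None):
    """Configure the root logger from LOG_LEVEL, falling back to INFO."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to the documented default
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",  # assumed format
        force=True,  # replace any handlers configured earlier
    )
    return level
```

Accepting the level name case-insensitively means LOG_LEVEL=debug and LOG_LEVEL=DEBUG behave the same.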
- psycopg2-binary: PostgreSQL adapter
- python-dotenv: Environment variable management
- PyYAML: YAML configuration parsing
- email-validator: Email validation utilities
This project is licensed under the MIT License - see the LICENSE file for details.
From Simple Backup to Intelligent Email Platform
LumiBox started as a Gmail backup tool but is evolving into something much more powerful:
- Phase 1 (Current): Reliable Gmail backup and storage
- Phase 2 (Next): AI-powered search and natural language queries
- Phase 3 (Future): Complete email intelligence platform with analytics
Why This Matters:
- Privacy Control: Your email data stays on your infrastructure
- AI Without Compromise: Get AI benefits while maintaining privacy
- Future-Proof: Own your data as AI capabilities continue to evolve
- Open Source: Transparent, auditable, and extensible
Join the Journey: Star this repo and watch for updates as we build the future of private email intelligence!
Have ideas for AI features? Check out PROJECT_ROADMAP.md and join the discussion!
We're actively developing AI-powered features! Priority areas:
- Vector Search Implementation: Help integrate ChromaDB or Qdrant
- LLM Integration: Ollama setup and local model management
- RAG Pipeline: Context-aware search and retrieval
- Web UI Development: React-based search interface
- Fork the repository
- Create a feature branch
- Check PROJECT_ROADMAP.md for current priorities
- Make your changes and add tests
- Submit a pull request
# Clone and set up development environment
git clone <your-fork-url>
cd LumiBox
pip install -r requirements.txt
pip install -r requirements-dev.txt # Coming soon
# Run tests
python -m pytest tests/ # Coming soon
# Start development server (future web UI)
npm run dev # Coming soon
See PROJECT_ROADMAP.md for:
- Detailed feature development timeline
- Technical architecture plans
- AI integration roadmap
- Success metrics and milestones
For questions, issues, or contributions:
- GitHub Issues - Bug reports and feature requests
- Discussions - General questions and ideas
- Email: [your-email] - Direct contact for sensitive issues
- Wiki - Extended documentation and guides