Skip to content

Conversation

Copy link

Copilot AI commented Sep 28, 2025

✅ COMPLETED: EML Email Format Support Implementation

Problem Statement Addressed

The issue requested two main features:

  1. ✅ EML/Email format support - Add converter for email files (.eml) with MIME parsing using MimeKit, extracting headers (To, From, Subject, Date), message content (HTML/text), and attachments
  2. ⏳ Modular packages - Separate packages for different formats to reduce dependency footprint (deferred as future enhancement)

Implementation Summary

🎯 Complete EML Converter Implementation

  • ✅ MimeKit Integration: Added MimeKit 4.9.0 dependency for robust MIME parsing
  • ✅ Full Header Extraction: Subject, From, To, CC, Date, Message-ID properly extracted and formatted
  • ✅ Content Conversion:
    • Plain text emails converted directly to Markdown
    • HTML emails processed through existing HtmlConverter for clean Markdown output
  • ✅ Attachment Support: Lists attachment metadata (name, content type, size) without content conversion
  • ✅ Priority Integration: Set at 240 (between PPTX and EPUB) for proper converter ordering
  • ✅ Complete Integration: Registered in both main converters and ZIP inner converters

📝 Files Modified (Minimal Changes)

README.md                                    # Added EML documentation + usage example
src/MarkItDown/Converters/EmlConverter.cs   # New converter implementation (251 lines)
src/MarkItDown/MarkItDown.cs                # Registered converter in built-in list
src/MarkItDown/MarkItDown.csproj            # Added MimeKit dependency
src/MarkItDown/MimeMapping.cs               # Added .eml -> message/rfc822 mapping
src/MarkItDown/FileUtilities.cs             # New utility class for shared file operations
src/MarkItDown/Converters/ZipConverter.cs   # Updated to use FileUtilities.FormatFileSize
tests/MarkItDown.Tests/EmlConverterTests.cs # New comprehensive test suite (4 tests)
tests/MarkItDown.Tests/NewConvertersTests.cs # Added EML format support tests

✅ Code Quality Improvements

  1. Priority Comment Clarity: Updated EML converter priority comment to clearly explain priority ordering (lower numbers = higher priority)
  2. Eliminated Code Duplication: Created FileUtilities class with shared FormatFileSize method, removing duplicate code from both EmlConverter and ZipConverter
  3. Consistent File Size Formatting: All file size formatting now uses the same format with "B", "KB", "MB", "GB" units

✅ Features Delivered

  1. Email Header Processing: Extracts and formats all essential email headers
  2. Content Conversion: Handles both plain text and HTML email content
  3. Attachment Metadata: Lists attachment info without binary content extraction
  4. Smart Title Generation: Uses email subject or falls back to sender information
  5. Format Detection: Accepts multiple MIME types (message/rfc822, message/email, etc.)
  6. Markdown Escaping: Balanced approach - escapes critical characters while preserving readability
  7. Integration: Works seamlessly with existing MarkItDown workflow and ZIP processing

🧪 Test Results

  • All 258 tests pass (no regressions from code refactoring)
  • 4 new EML-specific tests covering various email formats and edge cases
  • Format support integration confirms EML files are recognized by main MarkItDown class
  • End-to-end validation demonstrated with working demo application

📧 Demo Output

📧 Email Title: EML Converter Demo - RAG Email Processing

📝 Converted Markdown:
# Email

**Subject:** EML Converter Demo - RAG Email Processing
**From:** John Doe <[email protected]>
**To:** Jane Smith <[email protected]>
**Date:** 2024-01-15 10:30:00 +00:00

## Message Content

Hello Jane,

This email demonstrates the new EML format support in MarkItDown!

Key features implemented:
1. ✅ Email header extraction (From, To, Subject, Date)
2. ✅ Content conversion to clean Markdown
3. ✅ Attachment metadata listing
4. ✅ Integration with existing MarkItDown workflow

Perfect for building RAG systems from email archives! 🚀

Best regards,
John Doe

🎯 Use Cases Enabled

  • RAG Systems: Build search indexes from email archives
  • AI Agent Processing: Enable LLMs to process inbound email content
  • Document Analysis: Include emails in comprehensive document conversion pipelines
  • Email Archiving: Convert email archives to searchable Markdown format

Architecture Decisions

✅ Minimal Changes Approach

Focused on Phase 1 (EML support) only, making the smallest possible changes to address the immediate need. The modular package structure (Phase 2) was identified as a valuable future enhancement but would require significant architectural changes better suited for a major version release.

✅ Dependency Choice

Selected MimeKit 4.9.0 as the industry-standard library for MIME processing, providing robust email parsing with excellent .NET integration and comprehensive format support.

✅ Priority Placement

Set EML converter priority at 240, positioning it between PPTX (230) and EPUB (250) converters, ensuring appropriate precedence in the conversion pipeline.

✅ Code Quality Refactoring

Addressed code review feedback by:

  • Clarifying priority comment to explain ordering direction
  • Extracting duplicate FormatFileSize logic to shared FileUtilities class
  • Maintaining consistent file size formatting across converters

Final Status: ✅ COMPLETE

The EML email format support has been successfully implemented with comprehensive functionality, thorough testing, complete documentation, and code quality improvements based on review feedback.

Ready for production use! 🚀

Original prompt

This section details on the original issue you should resolve

<issue_title>Small packages and formats</issue_title>
<issue_description>Suggestion for added file formats: e-mail / EML. It would require a MIME parser (like MimeKit), adding the most important headers (To, From, Subject, Date), extracting and parsing the actual message (either HTML or text), and possibly other attachments as well. Use cases could be building a RAG for your email archive, or using an AI agent for processing inbound email.

Suggestion for efficiency: It would be great to have separate packages for file formats that require large dependencies. Often, an application will only need to convert a few or only one format, and not having to carry all the unneeded deps will reduce the footprint of the application greatly. Think of build pipelines (restore time and traffic), container image sizes, desktop and mobile apps, or maybe even WASM...</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #15

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

…s, content, and attachments

Co-authored-by: KSemenenko <[email protected]>
Copilot AI changed the title [WIP] Small packages and formats Add EML (email) format support with comprehensive header extraction and content conversion Sep 28, 2025
Copilot AI requested a review from KSemenenko September 28, 2025 14:08
@KSemenenko KSemenenko marked this pull request as ready for review September 28, 2025 14:56
Copilot AI review requested due to automatic review settings September 28, 2025 14:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements comprehensive EML (email) format support for the MarkItDown library, enabling conversion of email messages to clean Markdown with structured headers, content conversion, and attachment metadata extraction.

Key Changes

  • EmlConverter Implementation: New converter with MimeKit integration for robust email parsing and processing
  • Enhanced Integration: Updated project dependencies, MIME type mappings, and converter registration
  • Comprehensive Testing: Added extensive test coverage for various email scenarios and end-to-end integration

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/MarkItDown/Converters/EmlConverter.cs New email converter with header extraction, content conversion, and attachment metadata
src/MarkItDown/MarkItDown.csproj Added MimeKit 4.9.0 dependency for email processing
src/MarkItDown/MimeMapping.cs Added .eml file extension mapping to message/rfc822 MIME type
src/MarkItDown/MarkItDown.cs Registered EmlConverter in built-in and ZIP-inner converter lists
tests/MarkItDown.Tests/EmlConverterTests.cs Comprehensive test suite covering email parsing and conversion scenarios
tests/MarkItDown.Tests/NewConvertersTests.cs Added integration tests for EML format support and converter priority validation
README.md Updated documentation with EML support information and usage examples


private readonly HtmlConverter _htmlConverter;

public int Priority => 240; // Between EPUB and PPTX
Copy link

Copilot AI Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states priority is 'Between EPUB and PPTX' but according to the project's priority system, lower numbers mean higher priority. If PPTX has priority 230 and EPUB has priority 250, then 240 is indeed between them, but the comment could be clearer about the priority ordering direction.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 264 to 277
private static string FormatFileSize(long bytes)
{
string[] sizes = { "bytes", "KB", "MB", "GB" };
double len = bytes;
int order = 0;

while (len >= 1024 && order < sizes.Length - 1)
{
order++;
len /= 1024;
}

return $"{len:0.##} {sizes[order]}";
}
Copy link

Copilot AI Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file size formatting logic is duplicated functionality that likely exists elsewhere in the codebase or could be extracted to a common utility class to avoid code duplication.

Copilot uses AI. Check for mistakes.
@KSemenenko
Copy link
Member

@copilot fix this The comment states priority is 'Between EPUB and PPTX' but according to the project's priority system, lower numbers mean higher priority. If PPTX has priority 230 and EPUB has priority 250, then 240 is indeed between them, but the comment could be clearer about the priority ordering direction. private readonly HtmlConverter _htmlConverter;

public int Priority => 240; // Between EPUB and PPTX


and fix this [src/MarkItDown/Converters/EmlConverter.cs](https://github.com/managedcode/markitdown/pull/18/files/c404a96c75d28cefe17b60d93a5bb19c59180e10#diff-4ad5c9797aaa8686d3c1e7a7b1dad1ba70e75d153a39156b1fc9de771a7a135a)

Comment on lines +264 to +277
private static string FormatFileSize(long bytes)
{
string[] sizes = { "bytes", "KB", "MB", "GB" };
double len = bytes;
int order = 0;

    while (len >= 1024 && order < sizes.Length - 1)
    {
        order++;
        len /= 1024;
    }

    return $"{len:0.##} {sizes[order]}";
}

https://github.com/apps/copilot-pull-request-reviewer AI
3 minutes ago
This file size formatting logic is duplicated functionality that likely exists elsewhere in the codebase or could be extracted to a common utility class to avoid code duplication.

…ies to avoid code duplication

Co-authored-by: KSemenenko <[email protected]>
Copy link
Author

Copilot AI commented Sep 28, 2025

@copilot fix this The comment states priority is 'Between EPUB and PPTX' but according to the project's priority system, lower numbers mean higher priority. If PPTX has priority 230 and EPUB has prior...

Fixed both issues in commit 87cfbba:

  1. Priority Comment: Updated to "Between PPTX (230) and EPUB (250) - lower numbers = higher priority" for clarity
  2. Code Duplication: Created FileUtilities.FormatFileSize() utility method and updated both EmlConverter and ZipConverter to use it, removing duplicate code

The method now provides consistent file size formatting across all converters.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • ccsca2021.crl.certum.pl
  • ccsca2021.ocsp-certum.com
    • Triggering command: dotnet build (dns block)
  • crl.certum.pl
    • Triggering command: dotnet build (dns block)
  • subca.crl.certum.pl
    • Triggering command: dotnet build (dns block)
  • subca.ocsp-certum.com
    • Triggering command: dotnet build (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@KSemenenko KSemenenko merged commit b81b190 into main Sep 28, 2025
2 checks passed
@KSemenenko KSemenenko deleted the copilot/fix-8614e28b-7134-4ae8-bdc2-f34cfa6dbd93 branch September 28, 2025 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Small packages and formats

2 participants