diff --git a/README.md b/README.md
index 9a437c476..6d1f18070 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,38 @@
[](https://www.nuget.org/packages/ManagedCode.MarkItDown)
[](https://opensource.org/licenses/MIT)
-A modern C#/.NET library for converting a wide range of document formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, etc.) into high-quality Markdown suitable for Large Language Models (LLMs), search indexing, and text analytics. The project mirrors the original Microsoft Python implementation while embracing .NET idioms, async APIs, and new integrations.
+🚀 **Transform any document into LLM-ready Markdown with this powerful C#/.NET library!**
+
+MarkItDown is a comprehensive document conversion library that transforms diverse file formats (HTML, PDF, DOCX, XLSX, EPUB, archives, URLs, and more) into clean, high-quality Markdown. Perfect for AI workflows, RAG (Retrieval-Augmented Generation) systems, content processing pipelines, and text analytics applications.
+
+**Why MarkItDown for .NET?**
+- 🎯 **Built for modern C# developers** - Native .NET 9 library with async/await throughout
+- 🧠 **LLM-optimized output** - Clean Markdown with little markup noise, ready to pass to a model
+- 📦 **Zero-friction NuGet package** - Just `dotnet add package ManagedCode.MarkItDown` and go
+- 🔄 **Stream-based processing** - Handle large documents efficiently without temporary files
+- 🛠️ **Highly extensible** - Add custom converters or integrate with AI services for captions/transcription
+
+This is a high-fidelity C# port of Microsoft's original [MarkItDown Python library](https://github.com/microsoft/markitdown), adapted to the .NET ecosystem with async APIs, stream-based processing, and logging and configuration hooks.
+
+## 🌟 Why Choose MarkItDown?
+
+### For AI & LLM Applications
+- **Perfect for RAG systems** - Convert documents to searchable, contextual Markdown chunks
+- **Token-efficient** - Clean output avoids spending LLM tokens on leftover markup
+- **Structured data preservation** - Tables, headers, and lists maintain semantic meaning
+- **Metadata extraction** - Rich document properties for enhanced context
+
+### For .NET Developers
+- **Native performance** - Built from the ground up for .NET, not a wrapper
+- **Modern async/await** - Non-blocking I/O with full cancellation support
+- **Memory efficient** - Stream-based processing avoids loading entire files into memory
+- **Enterprise ready** - Proper error handling, logging, and configuration options
+
+### For Content Processing
+- **22+ file formats supported** - From Office documents to web pages to archives
+- **Batch processing ready** - Handle hundreds of documents efficiently
+- **Extensible architecture** - Add custom converters for proprietary formats
+- **Smart format detection** - Automatic MIME type and encoding detection
## Table of Contents
@@ -152,13 +183,113 @@ Install-Package ManagedCode.MarkItDown
dotnet add package ManagedCode.MarkItDown
# PackageReference (add to your .csproj)
-
+
```
### Prerequisites
- .NET 9.0 SDK or later
- Compatible with .NET 9 apps and libraries
+### 🏃♂️ 60-Second Quick Start
+
+```csharp
+using MarkItDown;
+
+// Create converter instance
+var markItDown = new MarkItDown();
+
+// Convert any file to Markdown
+var result = await markItDown.ConvertAsync("document.pdf");
+Console.WriteLine(result.Markdown);
+
+// That's it! MarkItDown handles format detection automatically
+```
+
+### 📚 Real-World Examples
+
+**RAG System Document Ingestion**
+```csharp
+using MarkItDown;
+using Microsoft.Extensions.Logging;
+
+// Set up logging to track conversion progress
+using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
+var logger = loggerFactory.CreateLogger<MarkItDown>();
+var markItDown = new MarkItDown(logger: logger);
+
+// Convert documents for vector database ingestion
+string[] documents = { "report.pdf", "data.xlsx", "webpage.html" };
+var markdownChunks = new List<string>();
+
+foreach (var doc in documents)
+{
+ try
+ {
+ var result = await markItDown.ConvertAsync(doc);
+ markdownChunks.Add($"# Document: {result.Title ?? Path.GetFileName(doc)}\n\n{result.Markdown}");
+ logger.LogInformation("Converted {Document} ({Length} characters)", doc, result.Markdown.Length);
+ }
+ catch (UnsupportedFormatException ex)
+ {
+ logger.LogWarning("Skipped unsupported file {Document}: {Error}", doc, ex.Message);
+ }
+}
+
+// markdownChunks now ready for embedding and vector storage
+```
+
+**Batch Email Processing**
+```csharp
+using MarkItDown;
+
+var markItDown = new MarkItDown();
+var emailFolder = @"C:\Emails\Exports";
+var outputFolder = @"C:\ProcessedEmails";
+Directory.CreateDirectory(outputFolder); // Ensure the output folder exists before writing
+
+foreach (var emlFile in Directory.EnumerateFiles(emailFolder, "*.eml"))
+{
+ var result = await markItDown.ConvertAsync(emlFile);
+
+ // Extract metadata
+ Console.WriteLine($"Email: {result.Title}");
+ Console.WriteLine($"Converted to {result.Markdown.Length} characters of Markdown");
+
+ // Save processed version
+ var outputPath = Path.Combine(outputFolder, Path.ChangeExtension(Path.GetFileName(emlFile), ".md"));
+ await File.WriteAllTextAsync(outputPath, result.Markdown);
+}
+```
+
+**Web Content Processing**
+```csharp
+using MarkItDown;
+using Microsoft.Extensions.Logging;
+
+using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
+using var httpClient = new HttpClient();
+
+var markItDown = new MarkItDown(
+ logger: loggerFactory.CreateLogger<MarkItDown>(),
+ httpClient: httpClient);
+
+// Convert web pages directly
+var urls = new[]
+{
+ "https://en.wikipedia.org/wiki/Machine_learning",
+ "https://docs.microsoft.com/en-us/dotnet/csharp/",
+ "https://github.com/microsoft/semantic-kernel"
+};
+
+foreach (var url in urls)
+{
+ var result = await markItDown.ConvertFromUrlAsync(url);
+ Console.WriteLine($"📄 {result.Title}");
+ Console.WriteLine($"🔗 Source: {url}");
+ Console.WriteLine($"📝 Content: {result.Markdown.Length} characters");
+ Console.WriteLine("---");
+}
+```
+
### Optional Dependencies for Advanced Features
- **PDF Support**: Provided via PdfPig (bundled)
- **Office Documents**: Provided via DocumentFormat.OpenXml (bundled)
@@ -297,6 +428,254 @@ var markItDown = new MarkItDown();
markItDown.RegisterConverter(new MyCustomConverter());
```
+## 🎯 Advanced Usage Patterns
+
+### Custom Format Converters
+
+```csharp
+using MarkItDown;
+
+public class PowerBIConverter : IDocumentConverter
+{
+ public int Priority => 150; // Between HTML and PlainText
+
+ public bool AcceptsInput(StreamInfo streamInfo) =>
+ streamInfo.Extension?.ToLowerInvariant() == ".pbix" ||
+ streamInfo.MimeType?.Contains("powerbi") == true;
+
+ public async Task<DocumentConverterResult> ConvertAsync(
+ Stream stream,
+ StreamInfo streamInfo,
+ CancellationToken cancellationToken = default)
+ {
+ // Custom PowerBI file processing logic here
+ var markdown = await ProcessPowerBIFile(stream, cancellationToken);
+ return new DocumentConverterResult(markdown, "PowerBI Report");
+ }
+
+ private async Task<string> ProcessPowerBIFile(Stream stream, CancellationToken cancellationToken)
+ {
+ // Implementation details...
+ await Task.Delay(100, cancellationToken); // Placeholder
+ return "# PowerBI Report\n\nProcessed PowerBI content here...";
+ }
+}
+```
+
+### Batch Processing with Progress Tracking
+
+```csharp
+using MarkItDown;
+using Microsoft.Extensions.Logging;
+
+public class DocumentProcessor
+{
+ private readonly MarkItDown _markItDown;
+ private readonly ILogger _logger;
+
+ public DocumentProcessor(ILogger logger)
+ {
+ _logger = logger;
+ _markItDown = new MarkItDown(logger: logger);
+ }
+
+ public async Task<List<ProcessedDocument>> ProcessDirectoryAsync(
+ string directoryPath,
+ string outputPath,
+ IProgress<ProcessingProgress>? progress = null)
+ {
+ var files = Directory.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories)
+ .Where(f => !Path.GetFileName(f).StartsWith('.'))
+ .ToList();
+
+ var results = new List<ProcessedDocument>();
+ var processed = 0;
+
+ await Parallel.ForEachAsync(files, new ParallelOptions
+ {
+ MaxDegreeOfParallelism = Environment.ProcessorCount
+ },
+ async (file, cancellationToken) =>
+ {
+ try
+ {
+ var result = await _markItDown.ConvertAsync(file, cancellationToken: cancellationToken);
+ var outputFile = Path.Combine(outputPath,
+ Path.ChangeExtension(Path.GetRelativePath(directoryPath, file), ".md"));
+
+ Directory.CreateDirectory(Path.GetDirectoryName(outputFile)!);
+ await File.WriteAllTextAsync(outputFile, result.Markdown, cancellationToken);
+
+ lock (results)
+ {
+ results.Add(new ProcessedDocument(file, outputFile, result.Markdown.Length));
+ processed++;
+ progress?.Report(new ProcessingProgress(processed, files.Count, file));
+ }
+ }
+ catch (Exception ex)
+ {
+ _logger.LogError(ex, "Failed to process {File}", file);
+ }
+ });
+
+ return results;
+ }
+}
+
+public record ProcessedDocument(string InputPath, string OutputPath, int CharacterCount);
+public record ProcessingProgress(int Processed, int Total, string CurrentFile);
+```
+
+### Integration with Vector Databases
+
+```csharp
+using MarkItDown;
+using Microsoft.Extensions.VectorData;
+using System.Text; // StringBuilder, used by SplitIntoChunks
+
+public class DocumentIndexer
+{
+ private readonly MarkItDown _markItDown;
+ private readonly IVectorStore _vectorStore;
+
+ public DocumentIndexer(IVectorStore vectorStore)
+ {
+ _vectorStore = vectorStore;
+ _markItDown = new MarkItDown();
+ }
+
+ public async Task IndexDocumentAsync<T>(string filePath) where T : class
+ {
+ // Convert to Markdown
+ var result = await _markItDown.ConvertAsync(filePath);
+
+ // Split into chunks for better vector search
+ var chunks = SplitIntoChunks(result.Markdown, maxChunkSize: 500);
+
+ var collection = _vectorStore.GetCollection<string, DocumentChunk>("documents");
+
+ for (int i = 0; i < chunks.Count; i++)
+ {
+ var document = new DocumentChunk
+ {
+ Id = $"{Path.GetFileName(filePath)}_{i}",
+ Content = chunks[i],
+ Title = result.Title ?? Path.GetFileName(filePath),
+ Source = filePath,
+ ChunkIndex = i
+ };
+
+ await collection.UpsertAsync(document);
+ }
+ }
+
+ private List<string> SplitIntoChunks(string markdown, int maxChunkSize)
+ {
+ // Line-based chunking: starts a new chunk once maxChunkSize would be exceeded
+ var chunks = new List<string>();
+ var lines = markdown.Split('\n');
+ var currentChunk = new StringBuilder();
+
+ foreach (var line in lines)
+ {
+ if (currentChunk.Length + line.Length > maxChunkSize && currentChunk.Length > 0)
+ {
+ chunks.Add(currentChunk.ToString().Trim());
+ currentChunk.Clear();
+ }
+ currentChunk.AppendLine(line);
+ }
+
+ if (currentChunk.Length > 0)
+ chunks.Add(currentChunk.ToString().Trim());
+
+ return chunks;
+ }
+}
+
+public class DocumentChunk
+{
+ public string Id { get; set; } = "";
+ public string Content { get; set; } = "";
+ public string Title { get; set; } = "";
+ public string Source { get; set; } = "";
+ public int ChunkIndex { get; set; }
+}
+```
+
+### Cloud Function Integration
+
+```csharp
+// Azure Functions example
+using Microsoft.Azure.Functions.Worker;
+using Microsoft.Azure.Functions.Worker.Http;
+using Microsoft.Extensions.Logging;
+using MarkItDown;
+
+public class DocumentConversionFunction
+{
+ private readonly MarkItDown _markItDown;
+ private readonly ILogger _logger;
+
+ public DocumentConversionFunction(ILogger logger)
+ {
+ _logger = logger;
+ _markItDown = new MarkItDown(logger: logger);
+ }
+
+ [Function("ConvertDocument")]
+ public async Task<HttpResponseData> ConvertDocument(
+ [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
+ {
+ try
+ {
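+ // Assumes a multipart form-reading helper (ReadFormAsync) is available for HttpRequestData in your project
+ var formData = await req.ReadFormAsync();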
+ var file = formData.Files.FirstOrDefault();
+
+ if (file == null)
+ {
+ var badResponse = req.CreateResponse(System.Net.HttpStatusCode.BadRequest);
+ await badResponse.WriteStringAsync("No file uploaded");
+ return badResponse;
+ }
+
+ var streamInfo = new StreamInfo(
+ extension: Path.GetExtension(file.FileName),
+ fileName: file.FileName,
+ mimeType: file.ContentType
+ );
+
+ var result = await _markItDown.ConvertAsync(file.OpenReadStream(), streamInfo);
+
+ var response = req.CreateResponse(System.Net.HttpStatusCode.OK);
+ response.Headers.Add("Content-Type", "application/json");
+
+ await response.WriteAsJsonAsync(new
+ {
+ title = result.Title,
+ markdown = result.Markdown,
+ characterCount = result.Markdown.Length
+ });
+
+ return response;
+ }
+ catch (UnsupportedFormatException ex)
+ {
+ var response = req.CreateResponse(System.Net.HttpStatusCode.UnsupportedMediaType);
+ await response.WriteStringAsync($"Unsupported file format: {ex.Message}");
+ return response;
+ }
+ catch (Exception ex)
+ {
+ _logger.LogError(ex, "Document conversion failed");
+ var response = req.CreateResponse(System.Net.HttpStatusCode.InternalServerError);
+ await response.WriteStringAsync("Internal server error");
+ return response;
+ }
+ }
+}
+```
+
## 🏗️ Architecture
### Core Components
@@ -309,17 +688,27 @@ markItDown.RegisterConverter(new MyCustomConverter());
### Built-in Converters
-- **`PlainTextConverter`** - Handles text, JSON, NDJSON, Markdown, etc.
-- **`HtmlConverter`** - Converts HTML to Markdown using AngleSharp
-- **`PdfConverter`** - PdfPig-based extraction with Markdown heuristics
-- **`Docx/Xlsx/Pptx` Converters** - Office Open XML processing
-- **`ImageConverter`** - Exif metadata + optional captions
-- **`AudioConverter`** - Metadata + optional transcription
-- **`WikipediaConverter`** - Article-only extraction from Wikipedia
-- **`BingSerpConverter`** - Summaries for Bing search result pages
-- **`YouTubeUrlConverter`** - Video metadata markdown
-- **`ZipConverter`** - Recursive archive handling
-- **`RssFeedConverter`**, **`JsonConverter`**, **`CsvConverter`**, **`XmlConverter`**, **`JupyterNotebookConverter`**, **`EpubConverter`**
+MarkItDown includes these converters in priority order:
+
+- **`YouTubeUrlConverter`** - Video metadata from YouTube URLs
+- **`HtmlConverter`** - HTML to Markdown using AngleSharp
+- **`WikipediaConverter`** - Clean article extraction from Wikipedia pages
+- **`BingSerpConverter`** - Search result summaries from Bing
+- **`RssFeedConverter`** - RSS/Atom feeds with article processing
+- **`JsonConverter`** - Structured JSON data with formatting
+- **`JupyterNotebookConverter`** - Python notebooks with code and markdown cells
+- **`CsvConverter`** - CSV files as Markdown tables
+- **`EpubConverter`** - E-book content and metadata
+- **`EmlConverter`** - Email files with headers and attachments
+- **`XmlConverter`** - XML documents with structure preservation
+- **`ZipConverter`** - Archive processing with recursive conversion
+- **`PdfConverter`** - PDF text extraction using PdfPig
+- **`DocxConverter`** - Microsoft Word documents
+- **`XlsxConverter`** - Microsoft Excel spreadsheets
+- **`PptxConverter`** - Microsoft PowerPoint presentations
+- **`AudioConverter`** - Audio metadata and optional transcription
+- **`ImageConverter`** - Image metadata via ExifTool and optional captions
+- **`PlainTextConverter`** - Plain text, Markdown, and other text formats (fallback)
### Converter Priority & Detection
@@ -327,8 +716,129 @@ markItDown.RegisterConverter(new MyCustomConverter());
- Automatic stream sniffing via `StreamInfoGuesser`
- Manual overrides via `MarkItDownOptions` or `StreamInfo`
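+
+The priority ordering can be pictured with a minimal, self-contained sketch. It does not use the library's types: the tuple list, the `SelectConverter` helper, and the priority values (100, 150, 1000) are hypothetical, and it assumes that lower priority values are tried first and that the first converter accepting the input wins.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical stand-ins for registered converters: a name, a priority, and an
+// "accepts" predicate over the file extension. These are NOT the library's types.
+var converters = new List<(string Name, int Priority, Func<string, bool> Accepts)>
+{
+    ("PlainText", 1000, ext => ext is ".txt" or ".md" or ".html"),
+    ("Html", 100, ext => ext is ".html" or ".htm"),
+    ("Custom", 150, ext => ext == ".pbix"),
+};
+
+// Assumption: lower priority values are tried first, so specific converters
+// are consulted before the generic fallback.
+string SelectConverter(string extension) =>
+    converters
+        .OrderBy(c => c.Priority)
+        .Where(c => c.Accepts(extension.ToLowerInvariant()))
+        .Select(c => c.Name)
+        .FirstOrDefault() ?? "(no converter)";
+
+Console.WriteLine(SelectConverter(".HTML")); // Html
+Console.WriteLine(SelectConverter(".pbix")); // Custom
+Console.WriteLine(SelectConverter(".xyz"));  // (no converter)
+```
+
+In this sketch the generic fallback also accepts `.html`, but the more specific converter is chosen because it sorts earlier.
+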
+## 🚨 Error Handling & Troubleshooting
+
+### Common Exceptions
+
+```csharp
+using MarkItDown;
+
+var markItDown = new MarkItDown();
+
+try
+{
+ var result = await markItDown.ConvertAsync("document.pdf");
+ Console.WriteLine(result.Markdown);
+}
+catch (UnsupportedFormatException ex)
+{
+ // File format not supported by any converter
+ Console.WriteLine($"Cannot process this file type: {ex.Message}");
+}
+catch (FileNotFoundException ex)
+{
+ // File path doesn't exist
+ Console.WriteLine($"File not found: {ex.Message}");
+}
+catch (UnauthorizedAccessException ex)
+{
+ // Permission issues
+ Console.WriteLine($"Access denied: {ex.Message}");
+}
+catch (MarkItDownException ex)
+{
+ // General conversion errors (corrupt files, parsing issues, etc.)
+ Console.WriteLine($"Conversion failed: {ex.Message}");
+ if (ex.InnerException != null)
+ Console.WriteLine($"Details: {ex.InnerException.Message}");
+}
+```
+
+### Troubleshooting Tips
+
+**File Format Detection Issues:**
+```csharp
+// Force specific format detection
+var streamInfo = new StreamInfo(
+ mimeType: "application/pdf", // Explicit MIME type
+ extension: ".pdf", // Explicit extension
+ fileName: "document.pdf" // Original filename
+);
+
+var result = await markItDown.ConvertAsync(stream, streamInfo);
+```
+
+**Timeouts for Large or Complex Files:**
+```csharp
+// Use a cancellation token to bound processing time (this limits duration, not memory use)
+using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
+
+try
+{
+ var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
+}
+catch (OperationCanceledException)
+{
+ Console.WriteLine("Conversion timed out - file may be too large or complex");
+}
+```
+
+**Network Issues (URLs):**
+```csharp
+// Configure HttpClient for better reliability
+using var httpClient = new HttpClient();
+httpClient.Timeout = TimeSpan.FromSeconds(30);
+httpClient.DefaultRequestHeaders.Add("User-Agent", "MarkItDown/1.0");
+
+var markItDown = new MarkItDown(httpClient: httpClient);
+```
+
+**Logging for Diagnostics:**
+```csharp
+using Microsoft.Extensions.Logging;
+
+using var loggerFactory = LoggerFactory.Create(builder =>
+ builder.AddConsole().SetMinimumLevel(LogLevel.Debug));
+
+var logger = loggerFactory.CreateLogger<MarkItDown>();
+var markItDown = new MarkItDown(logger: logger);
+
+// Now you'll see detailed conversion progress in console output
+```
+
## 🔄 Development & Contributing
+### Migration from Python MarkItDown
+
+If you're familiar with the original Python library, here are the key differences:
+
+| Python | C#/.NET | Notes |
+|---------|---------|--------|
+| `MarkItDown()` | `new MarkItDown()` | Similar constructor |
+| `markitdown.convert("file.pdf")` | `await markItDown.ConvertAsync("file.pdf")` | Async pattern |
+| `markitdown.convert(stream, file_extension=".pdf")` | `await markItDown.ConvertAsync(stream, streamInfo)` | StreamInfo object |
+| `markitdown.convert_url("https://...")` | `await markItDown.ConvertFromUrlAsync("https://...")` | Async URL conversion |
+| `llm_client=...` parameter | `ImageCaptioner`, `AudioTranscriber` delegates | More flexible callback system |
+| Plugin system | Not yet implemented | Planned for future release |
+
+**Example Migration:**
+
+```python
+# Python version
+import markitdown
+md = markitdown.MarkItDown()
+result = md.convert("document.pdf")
+print(result.text_content)
+```
+
+```csharp
+// C# version
+using MarkItDown;
+var markItDown = new MarkItDown();
+var result = await markItDown.ConvertAsync("document.pdf");
+Console.WriteLine(result.Markdown);
+```
+
### Building from Source
```bash
@@ -361,15 +871,14 @@ HTML or Markdown dashboards.
```
├── src/
-│ ├── MarkItDown/ # Core library
-│ │ ├── Converters/ # Format-specific converters (HTML, PDF, audio, etc.)
-│ │ ├── MarkItDown.cs # Main conversion engine
-│ │ ├── StreamInfoGuesser.cs # MIME/charset/extension detection helpers
-│ │ ├── MarkItDownOptions.cs # Runtime configuration flags
-│ │ └── ... # Shared utilities (UriUtilities, MimeMapping, etc.)
-│ └── MarkItDown.Cli/ # CLI host (under active development)
+│ └── MarkItDown/ # Core library
+│ ├── Converters/ # Format-specific converters (HTML, PDF, audio, etc.)
+│ ├── MarkItDown.cs # Main conversion engine
+│ ├── StreamInfoGuesser.cs # MIME/charset/extension detection helpers
+│ ├── MarkItDownOptions.cs # Runtime configuration flags
+│ └── ... # Shared utilities (UriUtilities, MimeMapping, etc.)
├── tests/
-│ └── MarkItDown.Tests/ # xUnit + Shouldly tests, Python parity vectors (WIP)
+│ └── MarkItDown.Tests/ # xUnit + Shouldly tests, Python parity vectors
├── Directory.Build.props # Shared build + packaging settings
└── README.md # This document
```
@@ -387,55 +896,176 @@ HTML or Markdown dashboards.
### 🎯 Near-Term
- Azure Document Intelligence converter (options already scaffolded)
-- Outlook `.msg` ingestion via MIT-friendly dependencies
-- Expanded CLI commands (batch mode, globbing, JSON output)
-- Richer regression suite mirroring Python test vectors
+- Outlook `.msg` ingestion via MIT-friendly dependencies
+- Performance optimizations and memory usage improvements
+- Enhanced test coverage mirroring Python test vectors
### 🎯 Future Ideas
-- Plugin discovery & sandboxing
-- Built-in LLM caption/transcription providers
-- Incremental/streaming conversion APIs
-- Cloud-native samples (Functions, Containers, Logic Apps)
+- Plugin discovery & sandboxing for custom converters
+- Built-in LLM caption/transcription providers (OpenAI, Azure AI)
+- Incremental/streaming conversion APIs for large documents
+- Cloud-native integration samples (Azure Functions, AWS Lambda)
+- Command-line interface (CLI) for batch processing
## 📈 Performance
-MarkItDown is designed for high performance with:
-- **Stream-based processing** – Avoids writing temporary files by default
-- **Async/await everywhere** – Non-blocking I/O with cancellation support
-- **Minimal allocations** – Smart buffer reuse and pay-for-play converters
-- **Fast detection** – Lightweight sniffing before converter dispatch
-- **Extensible hooks** – Offload captions/transcripts to background workers
+MarkItDown is designed for high-performance document processing in production environments:
+
+### 🚀 Performance Characteristics
+
+| Feature | Benefit | Impact |
+|---------|---------|--------|
+| **Stream-based processing** | No temporary files created | Faster I/O, lower disk usage |
+| **Async/await throughout** | Non-blocking operations | Better scalability, responsive UIs |
+| **Memory efficient** | Smart buffer reuse | Lower memory footprint for large documents |
+| **Fast format detection** | Lightweight MIME/extension sniffing | Quick routing to appropriate converter |
+| **Parallel processing ready** | Thread-safe converter instances | Handle multiple documents concurrently |
+
+### 📊 Performance Considerations
+
+MarkItDown's performance depends on:
+- **Document size and complexity** - Larger files with more formatting take longer to process
+- **File format** - Some formats (like PDF) require more processing than others (like plain text)
+- **Available system resources** - Memory, CPU, and I/O capabilities
+- **Optional services** - Image captioning and audio transcription add processing time
+
+Performance will vary based on your specific documents and environment. For production workloads, we recommend benchmarking with your actual document types and sizes.
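+
+A minimal timing harness along these lines may be a reasonable starting point for such a benchmark. It is a sketch, not part of the library: the `convert` delegate is a hypothetical stand-in that only reads a text file, and `MeasureAsync` is an illustrative helper. Swap in a real conversion call to measure your own documents.
+
+```csharp
+using System;
+using System.Diagnostics;
+using System.IO;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for a real conversion call, for example
+// async path => (await markItDown.ConvertAsync(path)).Markdown.
+// Here it simply reads the file as text so the sketch runs on its own.
+Func<string, Task<string>> convert = path => File.ReadAllTextAsync(path);
+
+// Create a small sample file to measure.
+var sample = Path.Combine(Path.GetTempPath(), "markitdown-bench-sample.txt");
+await File.WriteAllTextAsync(sample, "# Sample\n\nHello, benchmark.");
+
+// Times a single conversion and reports the size of the output.
+async Task<(TimeSpan Elapsed, int Length)> MeasureAsync(string path)
+{
+    var stopwatch = Stopwatch.StartNew();
+    var markdown = await convert(path);
+    stopwatch.Stop();
+    return (stopwatch.Elapsed, markdown.Length);
+}
+
+var (elapsed, length) = await MeasureAsync(sample);
+Console.WriteLine($"{Path.GetFileName(sample)}: {length} chars in {elapsed.TotalMilliseconds:F1} ms");
+
+File.Delete(sample); // Clean up the temporary sample file
+```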
+
+### ⚡ Optimization Tips
+
+```csharp
+// 1. Reuse MarkItDown instances (they're thread-safe)
+var markItDown = new MarkItDown();
+await Task.WhenAll(
+ markItDown.ConvertAsync("file1.pdf"),
+ markItDown.ConvertAsync("file2.docx"),
+ markItDown.ConvertAsync("file3.html")
+);
+
+// 2. Use cancellation tokens for timeouts
+using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
+var result = await markItDown.ConvertAsync("large-file.pdf", cancellationToken: cts.Token);
+
+// 3. Configure HttpClient for web content (reuse connections)
+using var httpClient = new HttpClient();
+var webMarkItDown = new MarkItDown(httpClient: httpClient);
+
+// 4. Pre-specify StreamInfo to skip format detection (stream is an open Stream you supply)
+var streamInfo = new StreamInfo(mimeType: "application/pdf", extension: ".pdf");
+var streamResult = await markItDown.ConvertAsync(stream, streamInfo);
+```
## 🔧 Configuration
+### Basic Configuration
+
+```csharp
+var options = new MarkItDownOptions
+{
+ EnableBuiltins = true, // Use built-in converters (default: true)
+ EnablePlugins = false, // Plugin system (reserved for future use)
+ ExifToolPath = "/usr/local/bin/exiftool" // Path to exiftool binary (optional)
+};
+
+var markItDown = new MarkItDown(options);
+```
+
+### Advanced AI Integration
+
```csharp
+using Azure;
+using OpenAI;
+
var options = new MarkItDownOptions
{
- EnableBuiltins = true,
- EnablePlugins = false,
- ExifToolPath = "/usr/local/bin/exiftool",
+ // Azure AI Vision for image captions
ImageCaptioner = async (bytes, info, token) =>
{
- // Call your preferred vision or LLM service here
- return await Task.FromResult("A scenic mountain landscape at sunset.");
+ var client = new VisionServiceClient("your-endpoint", new AzureKeyCredential("your-key"));
+ var result = await client.AnalyzeImageAsync(bytes, token);
+ return $"Image: {result.Description?.Captions?.FirstOrDefault()?.Text ?? "Visual content"}";
},
+
+ // OpenAI Whisper for audio transcription
AudioTranscriber = async (bytes, info, token) =>
{
- // Route to speech-to-text provider
- return await Task.FromResult("Welcome to the MarkItDown demo.");
+ var client = new OpenAIClient("your-api-key");
+ using var stream = new MemoryStream(bytes);
+ var result = await client.AudioEndpoint.CreateTranscriptionAsync(
+ stream,
+ Path.GetFileName(info.FileName) ?? "audio",
+ cancellationToken: token);
+ return result.Text;
+ },
+
+ // Azure Document Intelligence for enhanced PDF/form processing
+ DocumentIntelligence = new DocumentIntelligenceOptions
+ {
+ Endpoint = "https://your-resource.cognitiveservices.azure.com/",
+ Credential = new AzureKeyCredential("your-document-intelligence-key"),
+ ApiVersion = "2023-10-31-preview"
}
};
var markItDown = new MarkItDown(options);
```
+### Production Configuration with Error Handling
+
+```csharp
+using Microsoft.Extensions.Logging;
+using Microsoft.Extensions.DependencyInjection;
+
+// Set up dependency injection
+var services = new ServiceCollection();
+services.AddLogging(builder => builder.AddConsole().SetMinimumLevel(LogLevel.Information));
+services.AddHttpClient();
+
+var serviceProvider = services.BuildServiceProvider();
+var logger = serviceProvider.GetRequiredService<ILogger<MarkItDown>>();
+var httpClientFactory = serviceProvider.GetRequiredService<IHttpClientFactory>();
+
+var options = new MarkItDownOptions
+{
+ // Graceful degradation for image processing
+ ImageCaptioner = async (bytes, info, token) =>
+ {
+ try
+ {
+ // Your AI service call here
+ return await CallVisionServiceAsync(bytes, token);
+ }
+ catch (Exception ex)
+ {
+ logger.LogWarning("Image captioning failed: {Error}", ex.Message);
+ return $"[Image: {info.FileName ?? "unknown"}]"; // Fallback
+ }
+ }
+};
+
+var markItDown = new MarkItDown(options, logger, httpClientFactory.CreateClient());
+```
+
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
-This project is a C# conversion of the original [Microsoft MarkItDown](https://github.com/microsoft/markitdown) Python library. The original project was created by the Microsoft AutoGen team.
+This project is a comprehensive C# port of the original [Microsoft MarkItDown](https://github.com/microsoft/markitdown) Python library, created by the Microsoft AutoGen team. We've reimagined it specifically for the .NET ecosystem while maintaining compatibility with the original's design philosophy and capabilities.
+
+**Key differences in this .NET version:**
+- 🎯 **Native .NET performance** - Built from scratch in C#, not a Python wrapper
+- 🔄 **Modern async patterns** - Full async/await support with cancellation tokens
+- 📦 **NuGet ecosystem integration** - Easy installation and dependency management
+- 🛠️ **Enterprise features** - Comprehensive logging, error handling, and configuration
+- 🚀 **Enhanced performance** - Stream-based processing and memory optimizations
+
+- **Maintained by:** [ManagedCode](https://github.com/managedcode) team
+- **Original inspiration:** Microsoft AutoGen team
+- **License:** MIT (same as the original Python version)
+
+We're committed to maintaining feature parity with the upstream Python project while delivering the performance and developer experience that .NET developers expect.
## 📞 Support