Add Microsoft OneDrive DOCX/Markdown conversion functions #548
Walkthrough
This update introduces two new methods to the Microsoft OneDrive connector for bidirectional conversion between Markdown and DOCX formats, utilizing the md2docx-python library.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant MicrosoftOnedrive
participant OneDrive
participant md2docx-python
Client->>MicrosoftOnedrive: create_docx_from_markdown(markdown_data, folder_id, filename)
MicrosoftOnedrive->>md2docx-python: Convert Markdown to DOCX (temp files)
md2docx-python-->>MicrosoftOnedrive: DOCX file
MicrosoftOnedrive->>OneDrive: Upload DOCX file
OneDrive-->>MicrosoftOnedrive: Upload metadata
MicrosoftOnedrive-->>Client: Return metadata and stats
Client->>MicrosoftOnedrive: read_markdown_from_docx(item_id)
MicrosoftOnedrive->>OneDrive: Download DOCX file
OneDrive-->>MicrosoftOnedrive: DOCX file
MicrosoftOnedrive->>md2docx-python: Convert DOCX to Markdown (temp files)
md2docx-python-->>MicrosoftOnedrive: Markdown content
MicrosoftOnedrive-->>Client: Return Markdown and metadata
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~15 minutes
✨ No issues found! Your code is sparkling clean! ✨
cubic analysis
No issues found across 4 files.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/aci/server/app_connectors/microsoft_onedrive.py (1)
251-259: Improve error handling in temporary file cleanup
The current cleanup silently ignores OSError exceptions. Consider logging these errors for debugging purposes.
 try:
     os.unlink(md_file_path)
     os.unlink(docx_file_path)
 except OSError:
-    pass  # Files already cleaned up
+    logger.debug(f"Failed to clean up temporary files: {md_file_path}, {docx_file_path}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
- backend/aci/server/app_connectors/microsoft_onedrive.py (2 hunks)
- backend/apps/microsoft_onedrive/functions.json (1 hunks)
- backend/pyproject.toml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: cubic · AI code reviewer
- GitHub Check: Format, Lint, and Test
- GitHub Check: Compose Tests
- GitHub Check: Format & Lint
🔇 Additional comments (4)
backend/pyproject.toml (2)
145-148: LGTM! MyPy override follows existing patterns
The mypy override for md2docx_python.* is correctly configured and follows the established pattern in the file.
38-38: md2docx-python dependency verified
- md2docx-python is actively maintained (latest release v1.0.0 on 2025-03-14).
- No known security vulnerabilities are reported in its PyPI metadata.
No further action required.
backend/apps/microsoft_onedrive/functions.json (1)
1131-1180: LGTM! Well-structured function definitions
The new function definitions for MICROSOFT_ONEDRIVE__CREATE_DOCX_FROM_MARKDOWN and MICROSOFT_ONEDRIVE__READ_MARKDOWN_FROM_DOCX are properly structured with:
- Clear descriptions
- Appropriate parameter schemas with validation
- Consistent pattern with other connector functions
backend/aci/server/app_connectors/microsoft_onedrive.py (1)
295-296: Good validation of file extensions
The case-insensitive check for Word document extensions is appropriate and helps prevent processing non-Word files.
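A note on the extension check: the connector code in this PR accepts both `.docx` and `.doc` names, yet legacy `.doc` files are a binary format rather than the ZIP-based container that DOCX uses, so a DOCX converter may not be able to read them. The following is a minimal sketch, not the PR's implementation, of a stricter check that combines the extension with a look inside the container. The helper name `looks_like_docx` is hypothetical.

```python
import io
import zipfile


def looks_like_docx(name: str, content: bytes) -> bool:
    """Heuristic: .docx extension plus a ZIP container holding an OPC content-types part."""
    # Only accept the ZIP-based format; legacy .doc is rejected here by design.
    if not name.lower().endswith(".docx"):
        return False
    buffer = io.BytesIO(content)
    # DOCX files are ZIP archives, so anything that is not a ZIP cannot be a DOCX.
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        # Open Packaging Convention packages carry a [Content_Types].xml part.
        return "[Content_Types].xml" in archive.namelist()
```

Checking the bytes as well as the name catches renamed files before they reach the converter, at the cost of holding the download in memory once more.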
    def create_docx_from_markdown(
        self, markdown_data: str, parent_folder_id: str, filename: str | None = None
    ) -> dict[str, str | int]:
        """
        Convert Markdown text to a formatted DOCX document and save it to OneDrive.
        Uses the md2docx-python library for robust conversion.
        Args:
            markdown_data: The Markdown text as a string to convert
            parent_folder_id: The identifier of the parent folder where the DOCX file will be created
            filename: Optional custom name for the DOCX file (without .docx extension)
        Returns:
            dict: Response containing the created DOCX file metadata
        """
        logger.info(f"Creating DOCX file from Markdown on OneDrive, folder: {parent_folder_id}")

        try:
            from md2docx_python.src.md2docx_python import markdown_to_word

            # Determine filename
            if not filename:
                filename = "converted_document"

            # Ensure .docx extension
            if not filename.endswith(".docx"):
                filename += ".docx"

            # Create temporary files for conversion
            with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as md_file:
                md_file.write(markdown_data)
                md_file_path = md_file.name

            with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as docx_file:
                docx_file_path = docx_file.name

            try:
                # Convert markdown to DOCX using the well-maintained library
                markdown_to_word(md_file_path, docx_file_path)

                # Read the generated DOCX file
                with open(docx_file_path, "rb") as docx_file:
                    docx_bytes = docx_file.read()

                # Upload DOCX file to OneDrive
                upload_url = (
                    f"{self.base_url}/me/drive/items/{parent_folder_id}:/{filename}:/content"
                )

                headers = {
                    "Authorization": f"Bearer {self.access_token}",
                    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                }

                upload_response = requests.put(
                    upload_url, headers=headers, data=docx_bytes, timeout=60
                )
                upload_response.raise_for_status()

                result = upload_response.json()

                # Count some basic stats for the response
                lines = markdown_data.split("\n")
                word_count = len(markdown_data.split())

                logger.info(
                    f"Successfully created DOCX file: {filename}, ID: {result.get('id', '')}"
                )

                return {
                    "id": result.get("id", ""),
                    "name": result.get("name", ""),
                    "path": result.get("parentReference", {}).get("path", "")
                    + "/"
                    + result.get("name", ""),
                    "size": result.get("size", 0),
                    "mime_type": result.get("file", {}).get("mimeType", ""),
                    "created_datetime": result.get("createdDateTime", ""),
                    "modified_datetime": result.get("lastModifiedDateTime", ""),
                    "download_url": result.get("@microsoft.graph.downloadUrl", ""),
                    "lines_converted": len(lines),
                    "word_count": word_count,
                    "note": "DOCX file created successfully from Markdown using md2docx-python library.",
                }

            finally:
                # Clean up temporary files
                import os

                try:
                    os.unlink(md_file_path)
                    os.unlink(docx_file_path)
                except OSError:
                    pass  # Files already cleaned up

        except Exception as e:
            logger.error(f"Failed to create DOCX file from Markdown data: {e}")
            raise Exception(f"Failed to create DOCX file: {e}") from e
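One detail in the hunk above that may be worth a look: the file name is interpolated into the upload URL as-is, so names with spaces, `#`, `?`, or `/` could change how the path is parsed. A minimal sketch of building the same path-addressed URL with a percent-encoded name follows. The helper name `build_upload_url` is hypothetical, the case-insensitive extension check is a deliberate variation on the PR's `endswith(".docx")`, and whether Graph needs any further escaping is not verified here.

```python
from urllib.parse import quote


def build_upload_url(base_url: str, parent_folder_id: str, filename: str) -> str:
    """Build the path-addressed upload URL, percent-encoding the file name."""
    # Append the extension only if it is missing (case-insensitive, unlike the PR).
    if not filename.lower().endswith(".docx"):
        filename += ".docx"
    # safe="" also encodes "/", so a name cannot introduce extra path segments.
    encoded_name = quote(filename, safe="")
    return f"{base_url}/me/drive/items/{parent_folder_id}:/{encoded_name}:/content"
```

Keeping URL construction in a small pure function also makes it testable without network access.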
💡 Verification agent
🧩 Analysis chain
Move imports to module level and verify import paths
Several issues with imports and code organization:
- Import statements should be at the module level, not inside methods
- The import path md2docx_python.src.md2docx_python seems unusual - typically src directories are not part of the import path
- The os module is imported inside the finally block but should be at module level
Apply this refactor to move imports to module level:
 import csv
 import io
+import os
 import tempfile
 from typing import override
 import requests
+from md2docx_python import markdown_to_word
+from md2docx_python import word_to_markdown
And update the method to remove internal imports:
-            from md2docx_python.src.md2docx_python import markdown_to_word
             # Determine filename
             finally:
                 # Clean up temporary files
-                import os
-
                 try:
🏁 Script executed:
#!/bin/bash
# Verify the correct import path for md2docx-python library
pip show md2docx-python | grep Location || echo "Package not found"
# Try to find the correct import structure
python -c "import md2docx_python; print(dir(md2docx_python))" 2>/dev/null || echo "Failed to import md2docx_python"
# Check if the src path is part of the package
find . -name "*.py" -type f -exec grep -l "from md2docx_python" {} \; 2>/dev/null | head -5
Length of output: 412
Ensure module-level imports and correct md2docx-python dependency
The inline import and unusual path for the md2docx-python library will break at runtime (the package isn’t found in your environment). Please:
- Add md2docx-python to your project’s dependencies (e.g. requirements.txt or pyproject.toml) so it can be installed.
- Move all imports out of the method to the top of backend/aci/server/app_connectors/microsoft_onedrive.py:
import os
import tempfile
import requests
# Adjust the import to match the installed package’s API:
# e.g. from md2docx_python import markdown_to_word
# or from md2docx_python.src.md2docx_python import markdown_to_word
from md2docx_python import markdown_to_word
- Remove the from md2docx_python.src.md2docx_python import markdown_to_word inside create_docx_from_markdown and the import os in the finally block.
- Verify the correct top-level import path by consulting the library’s documentation and your installed package.
These changes will prevent import errors, follow best practices, and make the code easier to maintain.
🤖 Prompt for AI Agents
In backend/aci/server/app_connectors/microsoft_onedrive.py around lines 166 to
263, the import of md2docx-python is done inside the create_docx_from_markdown
method using an unusual path that will cause runtime errors. To fix this, add
md2docx-python to your project dependencies (requirements.txt or
pyproject.toml), move all imports including os, tempfile, requests, and the
correct import of markdown_to_word from md2docx_python to the top of the file,
remove the inline import inside the method and the os import in the finally
block, and verify the correct import path for markdown_to_word by checking the
installed package or its documentation.
    def read_markdown_from_docx(self, item_id: str) -> dict[str, str | int]:
        """
        Convert a DOCX file from OneDrive to Markdown text.
        Uses the md2docx-python library for robust conversion.
        Args:
            item_id: The identifier of the DOCX file in OneDrive to convert
        Returns:
            dict: Response containing the markdown content and metadata
        """
        logger.info(f"Converting DOCX file to Markdown from OneDrive: {item_id}")

        try:
            from md2docx_python.src.docx2md_python import word_to_markdown

            # Download the DOCX file from OneDrive
            download_url = f"{self.base_url}/me/drive/items/{item_id}/content"
            headers = {"Authorization": f"Bearer {self.access_token}"}

            download_response = requests.get(download_url, headers=headers, timeout=30)
            download_response.raise_for_status()

            # Get file metadata for response details
            metadata_url = f"{self.base_url}/me/drive/items/{item_id}"
            metadata_response = requests.get(metadata_url, headers=headers, timeout=30)
            metadata_response.raise_for_status()
            metadata = metadata_response.json()

            # Verify it's a DOCX file
            if not metadata.get("name", "").lower().endswith((".docx", ".doc")):
                raise Exception(f"File '{metadata.get('name', '')}' is not a Word document")

            # Create temporary files for conversion
            with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as docx_file:
                docx_file.write(download_response.content)
                docx_file_path = docx_file.name

            with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as md_file:
                md_file_path = md_file.name

            try:
                # Convert DOCX to Markdown using the well-maintained library
                word_to_markdown(docx_file_path, md_file_path)

                # Read the generated Markdown file
                with open(md_file_path, encoding="utf-8") as md_file:
                    markdown_content = md_file.read()

                # Count some basic stats for the response
                lines = markdown_content.split("\n")
                word_count = len(markdown_content.split())

                logger.info(
                    f"Successfully converted DOCX to Markdown: {item_id}, {len(markdown_content)} characters"
                )

                return {
                    "content": markdown_content,
                    "id": metadata.get("id", ""),
                    "name": metadata.get("name", ""),
                    "path": metadata.get("parentReference", {}).get("path", "")
                    + "/"
                    + metadata.get("name", ""),
                    "size": metadata.get("size", 0),
                    "mime_type": metadata.get("file", {}).get("mimeType", ""),
                    "created_datetime": metadata.get("createdDateTime", ""),
                    "modified_datetime": metadata.get("lastModifiedDateTime", ""),
                    "lines_extracted": len(lines),
                    "word_count": word_count,
                    "note": "DOCX file successfully converted to Markdown using md2docx-python library.",
                }

            finally:
                # Clean up temporary files
                import os

                try:
                    os.unlink(docx_file_path)
                    os.unlink(md_file_path)
                except OSError:
                    pass  # Files already cleaned up

        except Exception as e:
            logger.error(f"Failed to convert DOCX file to Markdown: {item_id}, error: {e}")
            raise Exception(f"Failed to convert DOCX file: {e}") from e
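A small observation on the stats in both methods: counting lines with `split("\n")` includes a trailing empty segment when the text ends with a newline, and reports one line for an empty string. The sketch below contrasts that with `str.splitlines()`. The function name `markdown_stats` and the returned keys are illustrative only and are not part of the PR.

```python
def markdown_stats(text: str) -> dict[str, int]:
    """Compare the PR's line-counting approach with str.splitlines()."""
    return {
        # Same expression as in the PR: counts the empty segment after a final "\n".
        "lines_via_split": len(text.split("\n")),
        # splitlines() ignores the trailing newline and returns [] for empty input.
        "lines_via_splitlines": len(text.splitlines()),
        # Whitespace-separated tokens, as in the PR.
        "word_count": len(text.split()),
    }
```

For `"# Title\n\nHello world\n"` the two line counts differ by one (4 versus 3), which may or may not matter for the reported `lines_converted` and `lines_extracted` values.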
🛠️ Refactor suggestion
Refactor duplicate code patterns
This method has the same issues as create_docx_from_markdown:
- Import statements inside the method
- Silent exception handling in cleanup
- Duplicate cleanup logic
Consider extracting the temporary file cleanup logic into a helper method:
    def _cleanup_temp_files(self, *file_paths: str) -> None:
        """Clean up temporary files, logging any errors."""
        for file_path in file_paths:
            try:
                os.unlink(file_path)
            except OSError as e:
                logger.debug(f"Failed to clean up temporary file {file_path}: {e}")
Then use it in both methods:
             finally:
-                # Clean up temporary files
-                import os
-
-                try:
-                    os.unlink(docx_file_path)
-                    os.unlink(md_file_path)
-                except OSError:
-                    pass  # Files already cleaned up
+                self._cleanup_temp_files(docx_file_path, md_file_path)
🤖 Prompt for AI Agents
In backend/aci/server/app_connectors/microsoft_onedrive.py around lines 265 to
350, the method read_markdown_from_docx has import statements inside the method,
silent exception handling during temporary file cleanup, and duplicate cleanup
logic similar to create_docx_from_markdown. To fix this, move the import
statements to the top of the file, extract the temporary file cleanup code into
a new helper method _cleanup_temp_files that accepts file paths and logs any
cleanup errors, then replace the existing cleanup code in
read_markdown_from_docx and create_docx_from_markdown with calls to this helper
method.
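As an alternative to a cleanup helper, the temporary files could live inside a `tempfile.TemporaryDirectory`, which removes the directory and its contents when the `with` block exits, so no manual `os.unlink` calls are needed. This is only a sketch of that idea: `convert_via_temp_dir` is a hypothetical name, and the `converter` argument stands in for a two-path function such as `markdown_to_word(md_file_path, docx_file_path)` as it is called in this PR.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path


def convert_via_temp_dir(source_text: str, converter: Callable[[str, str], None]) -> bytes:
    """Write source_text to a temp .md file, run converter(src, dst), return the dst bytes."""
    # Both files live in one directory that is deleted when the with-block exits,
    # including when the converter raises.
    with tempfile.TemporaryDirectory() as temp_dir:
        md_path = Path(temp_dir) / "input.md"
        docx_path = Path(temp_dir) / "output.docx"
        # Explicit UTF-8 avoids depending on the platform default encoding.
        md_path.write_text(source_text, encoding="utf-8")
        converter(str(md_path), str(docx_path))
        return docx_path.read_bytes()
```

The output is read into memory before the directory disappears, which matches how the PR already reads the generated file before uploading it.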
🏷️ Ticket
https://www.notion.so/Microsoft-Word-Integration-23b8378d6a47800f86f5fd06708d7643?source=copy_link
📝 Description
- create_docx_from_markdown function to convert Markdown text to DOCX files on OneDrive
- read_markdown_from_docx function to extract Markdown content from DOCX files
- md2docx-python dependency for robust document conversion
🎥 Demo (if applicable)
📸 Screenshots (if applicable)
✅ Checklist
Summary by cubic
Added functions to convert Markdown to DOCX and extract Markdown from DOCX files in Microsoft OneDrive, enabling easy document format conversion.
New Features
- create_docx_from_markdown to generate DOCX files from Markdown and save them to OneDrive.
- read_markdown_from_docx to extract Markdown content from DOCX files stored in OneDrive.
Dependencies
- md2docx-python for reliable Markdown and DOCX conversion.
Summary by CodeRabbit
New Features
Chores