@jiwei-aipolabs jiwei-aipolabs commented Jul 25, 2025

🏷️ Ticket

https://www.notion.so/Microsoft-Word-Integration-23b8378d6a47800f86f5fd06708d7643?source=copy_link

📝 Description

  • Added create_docx_from_markdown function to convert Markdown text to DOCX files on OneDrive
  • Added read_markdown_from_docx function to extract Markdown content from DOCX files
  • Added md2docx-python dependency for robust document conversion
docker compose exec runner python -m aci.cli fuzzy-test-function-execution \
  --function-name MICROSOFT_ONEDRIVE__CREATE_DOCX_FROM_MARKDOWN \
  --linked-account-owner-id jiwei \
  --aci-api-key e1739ca6580c787275a2c53c31e12139fabf9f6a1c3b4d76d250ac5e145361d1 \
  --prompt "Create a Word document from this markdown content in my root OneDrive folder: # Test Document\n\nThis is a **test document** with:\n\n- *Italic text*\n- **Bold text**\n- A code block:\n\n\`\`\`python\nprint('Hello World')\n\`\`\`\n\n## Section 2\n\nSome more content here."

docker compose exec runner python -m aci.cli fuzzy-test-function-execution \
  --function-name MICROSOFT_ONEDRIVE__READ_MARKDOWN_FROM_DOCX \
  --linked-account-owner-id jiwei \
  --aci-api-key e1739ca6580c787275a2c53c31e12139fabf9f6a1c3b4d76d250ac5e145361d1 \
  --prompt "Read the markdown content from the DOCX file with ID: 7006ADAF2D3C1355\!s33b7a92aca244913b6c4cf0e6afc186c"

🎥 Demo (if applicable)

📸 Screenshots (if applicable)

✅ Checklist

  • I have signed the Contributor License Agreement (CLA) and read the contributing guide (required)
  • I have linked this PR to an issue or a ticket (required)
  • I have updated the documentation related to my change if needed
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • All checks on CI passed

Summary by cubic

Added functions to convert Markdown to DOCX and extract Markdown from DOCX files in Microsoft OneDrive, enabling easy document format conversion.

  • New Features

    • Added create_docx_from_markdown to generate DOCX files from Markdown and save them to OneDrive.
    • Added read_markdown_from_docx to extract Markdown content from DOCX files stored in OneDrive.
  • Dependencies

    • Added md2docx-python for reliable Markdown and DOCX conversion.

Summary by CodeRabbit

  • New Features

    • Added the ability to convert Markdown text into DOCX files and save them directly to Microsoft OneDrive.
    • Added the ability to extract and download Markdown content from DOCX files stored in OneDrive.
  • Chores

    • Added a new dependency to support Markdown and DOCX file conversion.

vercel bot commented Jul 25, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
aci-dev-portal ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 25, 2025 2:15pm

coderabbitai bot commented Jul 25, 2025

Walkthrough

This update introduces two new methods to the Microsoft OneDrive connector for bidirectional conversion between Markdown and DOCX formats, utilizing the md2docx-python library. The corresponding function definitions are added to the app's JSON schema, and the necessary library dependency is declared in the project configuration.

Changes

File(s) | Change Summary
backend/aci/server/app_connectors/microsoft_onedrive.py | Added create_docx_from_markdown and read_markdown_from_docx methods for Markdown↔DOCX conversion, with error handling and temporary file management.
backend/apps/microsoft_onedrive/functions.json | Added two public connector functions for Markdown→DOCX upload and DOCX→Markdown extraction, with strict parameter schemas.
backend/pyproject.toml | Added md2docx-python dependency and mypy override for the new module.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant MicrosoftOnedrive
    participant OneDrive
    participant md2docx-python

    Client->>MicrosoftOnedrive: create_docx_from_markdown(markdown_data, folder_id, filename)
    MicrosoftOnedrive->>md2docx-python: Convert Markdown to DOCX (temp files)
    md2docx-python-->>MicrosoftOnedrive: DOCX file
    MicrosoftOnedrive->>OneDrive: Upload DOCX file
    OneDrive-->>MicrosoftOnedrive: Upload metadata
    MicrosoftOnedrive-->>Client: Return metadata and stats

    Client->>MicrosoftOnedrive: read_markdown_from_docx(item_id)
    MicrosoftOnedrive->>OneDrive: Download DOCX file
    OneDrive-->>MicrosoftOnedrive: DOCX file
    MicrosoftOnedrive->>md2docx-python: Convert DOCX to Markdown (temp files)
    md2docx-python-->>MicrosoftOnedrive: Markdown content
    MicrosoftOnedrive-->>Client: Return Markdown and metadata
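
The first half of the diagram (Markdown in, DOCX uploaded, metadata out) can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `create_docx_flow`, `Converter`, `Uploader`, and the two stand-in functions are hypothetical names, and the converter is only assumed to take an input path and an output path, as the calls in this PR suggest.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical signatures: a path-based converter and an uploader that
# returns whatever metadata the storage backend reports.
Converter = Callable[[str, str], None]
Uploader = Callable[[str, bytes], dict]


def create_docx_flow(
    markdown_data: str, filename: str, convert: Converter, upload: Uploader
) -> dict:
    """Markdown -> temp files -> converter -> upload -> metadata plus stats."""
    if not filename.endswith(".docx"):
        filename += ".docx"
    # TemporaryDirectory removes both files on exit, even if a step raises.
    with tempfile.TemporaryDirectory() as tmp:
        md_path = Path(tmp) / "input.md"
        docx_path = Path(tmp) / "output.docx"
        md_path.write_text(markdown_data, encoding="utf-8")
        convert(str(md_path), str(docx_path))
        metadata = upload(filename, docx_path.read_bytes())
    return {
        **metadata,
        "lines_converted": len(markdown_data.split("\n")),
        "word_count": len(markdown_data.split()),
    }


# Stand-ins so the sketch runs without OneDrive or a conversion library.
def fake_convert(src: str, dst: str) -> None:
    Path(dst).write_bytes(Path(src).read_bytes())


def fake_upload(name: str, data: bytes) -> dict:
    return {"id": "hypothetical-id", "name": name, "size": len(data)}


result = create_docx_flow("# Title\n\nSome text.", "notes", fake_convert, fake_upload)
print(result)
```

Passing the converter and uploader in as callables keeps the sequence testable without network access; the real methods call the library and the Graph API directly.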

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Poem

🐇✨
From Markdown to DOCX, and back again,
The rabbit leaps through digital rain.
With temp files swept and formats spun,
OneDrive’s dance is deftly done.
Dependencies added, schemas defined—
In every hop, new features aligned!

recurseml bot commented Jul 25, 2025

✨ No issues found! Your code is sparkling clean! ✨

@cubic-dev-ai cubic-dev-ai bot left a comment

cubic analysis

No issues found across 4 files. Review in cubic

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/aci/server/app_connectors/microsoft_onedrive.py (1)

251-259: Improve error handling in temporary file cleanup

The current cleanup silently ignores OSError exceptions. Consider logging these errors for debugging purposes.

                 try:
                     os.unlink(md_file_path)
                     os.unlink(docx_file_path)
                 except OSError:
-                    pass  # Files already cleaned up
+                    logger.debug(f"Failed to clean up temporary files: {md_file_path}, {docx_file_path}")
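
One possible variant of this suggestion, sketched under the assumption that Python 3.8+ is available: `Path.unlink(missing_ok=True)` treats an already-deleted file as success, so only genuine failures are logged, and each file is handled on its own so a failure on the first does not skip the second. `remove_temp_files` is a hypothetical helper name, not something in the PR.

```python
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def remove_temp_files(*paths: str) -> list[str]:
    """Delete each path independently; return the ones that could not be removed."""
    failed = []
    for path in paths:
        try:
            # missing_ok=True: a file that is already gone is not an error.
            Path(path).unlink(missing_ok=True)
        except OSError as exc:
            logger.debug("Failed to clean up temporary file %s: %s", path, exc)
            failed.append(path)
    return failed


# Demo: one real temp file and one path that never existed.
fd, existing = tempfile.mkstemp(suffix=".md")
os.close(fd)
leftover = remove_temp_files(existing, existing + ".missing")
```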
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebc1474 and 99e9cf8.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • backend/aci/server/app_connectors/microsoft_onedrive.py (2 hunks)
  • backend/apps/microsoft_onedrive/functions.json (1 hunks)
  • backend/pyproject.toml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Format, Lint, and Test
  • GitHub Check: Compose Tests
  • GitHub Check: Format & Lint
🔇 Additional comments (4)
backend/pyproject.toml (2)

145-148: LGTM! MyPy override follows existing patterns

The mypy override for md2docx_python.* is correctly configured and follows the established pattern in the file.


38-38: md2docx-python dependency verified

  • md2docx-python is actively maintained (latest release v1.0.0 on 2025-03-14).
  • No known security vulnerabilities are reported in its PyPI metadata.

No further action required.

backend/apps/microsoft_onedrive/functions.json (1)

1131-1180: LGTM! Well-structured function definitions

The new function definitions for MICROSOFT_ONEDRIVE__CREATE_DOCX_FROM_MARKDOWN and MICROSOFT_ONEDRIVE__READ_MARKDOWN_FROM_DOCX are properly structured with:

  • Clear descriptions
  • Appropriate parameter schemas with validation
  • Consistent pattern with other connector functions
backend/aci/server/app_connectors/microsoft_onedrive.py (1)

295-296: Good validation of file extensions

The case-insensitive check for Word document extensions is appropriate and helps prevent processing non-Word files.
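
A small sketch of that check in isolation, with one caveat worth weighing: the PR's check also accepts `.doc`, which is the legacy binary Word format rather than the zip-based `.docx` format, so a DOCX-oriented converter may not be able to read it. Whether md2docx-python handles `.doc` is not established in this thread; the stricter version below is an assumption, and `is_docx_name` is a hypothetical name.

```python
def is_docx_name(name: str) -> bool:
    """Case-insensitive check that a file name ends in .docx (not legacy .doc)."""
    return name.lower().endswith(".docx")


for candidate in ("Report.DOCX", "notes.docx", "legacy.doc", "data.csv", ""):
    print(f"{candidate!r}: {is_docx_name(candidate)}")
```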

Comment on lines +166 to +263
    def create_docx_from_markdown(
        self, markdown_data: str, parent_folder_id: str, filename: str | None = None
    ) -> dict[str, str | int]:
        """
        Convert Markdown text to a formatted DOCX document and save it to OneDrive.
        Uses the md2docx-python library for robust conversion.
        Args:
            markdown_data: The Markdown text as a string to convert
            parent_folder_id: The identifier of the parent folder where the DOCX file will be created
            filename: Optional custom name for the DOCX file (without .docx extension)
        Returns:
            dict: Response containing the created DOCX file metadata
        """
        logger.info(f"Creating DOCX file from Markdown on OneDrive, folder: {parent_folder_id}")

        try:
            from md2docx_python.src.md2docx_python import markdown_to_word

            # Determine filename
            if not filename:
                filename = "converted_document"

            # Ensure .docx extension
            if not filename.endswith(".docx"):
                filename += ".docx"

            # Create temporary files for conversion
            with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as md_file:
                md_file.write(markdown_data)
                md_file_path = md_file.name

            with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as docx_file:
                docx_file_path = docx_file.name

            try:
                # Convert markdown to DOCX using the well-maintained library
                markdown_to_word(md_file_path, docx_file_path)

                # Read the generated DOCX file
                with open(docx_file_path, "rb") as docx_file:
                    docx_bytes = docx_file.read()

                # Upload DOCX file to OneDrive
                upload_url = (
                    f"{self.base_url}/me/drive/items/{parent_folder_id}:/{filename}:/content"
                )

                headers = {
                    "Authorization": f"Bearer {self.access_token}",
                    "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                }

                upload_response = requests.put(
                    upload_url, headers=headers, data=docx_bytes, timeout=60
                )
                upload_response.raise_for_status()

                result = upload_response.json()

                # Count some basic stats for the response
                lines = markdown_data.split("\n")
                word_count = len(markdown_data.split())

                logger.info(
                    f"Successfully created DOCX file: {filename}, ID: {result.get('id', '')}"
                )

                return {
                    "id": result.get("id", ""),
                    "name": result.get("name", ""),
                    "path": result.get("parentReference", {}).get("path", "")
                    + "/"
                    + result.get("name", ""),
                    "size": result.get("size", 0),
                    "mime_type": result.get("file", {}).get("mimeType", ""),
                    "created_datetime": result.get("createdDateTime", ""),
                    "modified_datetime": result.get("lastModifiedDateTime", ""),
                    "download_url": result.get("@microsoft.graph.downloadUrl", ""),
                    "lines_converted": len(lines),
                    "word_count": word_count,
                    "note": "DOCX file created successfully from Markdown using md2docx-python library.",
                }

            finally:
                # Clean up temporary files
                import os

                try:
                    os.unlink(md_file_path)
                    os.unlink(docx_file_path)
                except OSError:
                    pass  # Files already cleaned up

        except Exception as e:
            logger.error(f"Failed to create DOCX file from Markdown data: {e}")
            raise Exception(f"Failed to create DOCX file: {e}") from e

💡 Verification agent

🧩 Analysis chain

Move imports to module level and verify import paths

Several issues with imports and code organization:

  1. Import statements should be at the module level, not inside methods
  2. The import path md2docx_python.src.md2docx_python seems unusual - typically src directories are not part of the import path
  3. The os module is imported inside the finally block but should be at module level

Apply this refactor to move imports to module level:

import csv
import io
+import os
import tempfile
from typing import override

import requests
+from md2docx_python import markdown_to_word
+from md2docx_python import word_to_markdown

And update the method to remove internal imports:

-            from md2docx_python.src.md2docx_python import markdown_to_word
             
             # Determine filename
             finally:
                 # Clean up temporary files
-                import os
-
                 try:

🏁 Script executed:

#!/bin/bash
# Verify the correct import path for md2docx-python library
pip show md2docx-python | grep Location || echo "Package not found"

# Try to find the correct import structure
python -c "import md2docx_python; print(dir(md2docx_python))" 2>/dev/null || echo "Failed to import md2docx_python"

# Check if the src path is part of the package
find . -name "*.py" -type f -exec grep -l "from md2docx_python" {} \; 2>/dev/null | head -5

Length of output: 412
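
The same question (which import path actually resolves) can also be checked from Python with the standard library, which avoids guessing from `pip show` output. This is a sketch: `module_available` is a hypothetical helper, and the candidate paths are simply the ones mentioned in this thread, none of which is confirmed here.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the module can be located.

    find_spec imports parent packages but not the target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or is not a package.
        return False


candidates = [
    "md2docx_python",
    "md2docx_python.src.md2docx_python",
    "md2docx_python.src.docx2md_python",
]
for name in candidates:
    print(f"{name}: {'found' if module_available(name) else 'not found'}")
```

Run inside the same environment as the runner container; the result only reflects whatever version of the package is installed there.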


Ensure module-level imports and correct md2docx-python dependency

The inline import and unusual path for the md2docx-python library will break at runtime (the package isn’t found in your environment). Please:

  • Add md2docx-python to your project’s dependencies (e.g. requirements.txt or pyproject.toml) so it can be installed.
  • Move all imports out of the method to the top of backend/aci/server/app_connectors/microsoft_onedrive.py:
    import os
    import tempfile
    import requests
    
    # Adjust the import to match the installed package’s API:
    # e.g. from md2docx_python import markdown_to_word
    # or from md2docx_python.src.md2docx_python import markdown_to_word
    from md2docx_python import markdown_to_word
  • Remove the from md2docx_python.src.md2docx_python import markdown_to_word inside create_docx_from_markdown and the import os in the finally block.
  • Verify the correct top-level import path by consulting the library’s documentation and your installed package.

These changes will prevent import errors, follow best practices, and make the code easier to maintain.

🤖 Prompt for AI Agents
In backend/aci/server/app_connectors/microsoft_onedrive.py around lines 166 to
263, the import of md2docx-python is done inside the create_docx_from_markdown
method using an unusual path that will cause runtime errors. To fix this, add
md2docx-python to your project dependencies (requirements.txt or
pyproject.toml), move all imports including os, tempfile, requests, and the
correct import of markdown_to_word from md2docx_python to the top of the file,
remove the inline import inside the method and the os import in the finally
block, and verify the correct import path for markdown_to_word by checking the
installed package or its documentation.
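
Separately from the import question, the upload URL in create_docx_from_markdown interpolates filename directly into the path, so names containing spaces, `#`, or `?` may not round-trip as intended. A hedged sketch of percent-encoding the name first: `build_upload_url` is a hypothetical helper, and the base URL and folder ID used in the demo are assumptions (the connector's actual `base_url` value is not shown in this PR).

```python
from urllib.parse import quote


def build_upload_url(base_url: str, parent_folder_id: str, filename: str) -> str:
    """Build the path-addressed upload URL with a percent-encoded file name."""
    # Case-insensitive, unlike the PR's check, so "a.DOCX" is not renamed "a.DOCX.docx".
    if not filename.lower().endswith(".docx"):
        filename += ".docx"
    # safe="" also encodes "/" so the name cannot introduce extra path segments.
    encoded = quote(filename, safe="")
    return f"{base_url}/me/drive/items/{parent_folder_id}:/{encoded}:/content"


# Assumed values, for illustration only.
url = build_upload_url("https://graph.microsoft.com/v1.0", "01ABCDEF", "Q3 report #2")
print(url)
```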

Comment on lines +265 to +350
    def read_markdown_from_docx(self, item_id: str) -> dict[str, str | int]:
        """
        Convert a DOCX file from OneDrive to Markdown text.
        Uses the md2docx-python library for robust conversion.
        Args:
            item_id: The identifier of the DOCX file in OneDrive to convert
        Returns:
            dict: Response containing the markdown content and metadata
        """
        logger.info(f"Converting DOCX file to Markdown from OneDrive: {item_id}")

        try:
            from md2docx_python.src.docx2md_python import word_to_markdown

            # Download the DOCX file from OneDrive
            download_url = f"{self.base_url}/me/drive/items/{item_id}/content"
            headers = {"Authorization": f"Bearer {self.access_token}"}

            download_response = requests.get(download_url, headers=headers, timeout=30)
            download_response.raise_for_status()

            # Get file metadata for response details
            metadata_url = f"{self.base_url}/me/drive/items/{item_id}"
            metadata_response = requests.get(metadata_url, headers=headers, timeout=30)
            metadata_response.raise_for_status()
            metadata = metadata_response.json()

            # Verify it's a DOCX file
            if not metadata.get("name", "").lower().endswith((".docx", ".doc")):
                raise Exception(f"File '{metadata.get('name', '')}' is not a Word document")

            # Create temporary files for conversion
            with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as docx_file:
                docx_file.write(download_response.content)
                docx_file_path = docx_file.name

            with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as md_file:
                md_file_path = md_file.name

            try:
                # Convert DOCX to Markdown using the well-maintained library
                word_to_markdown(docx_file_path, md_file_path)

                # Read the generated Markdown file
                with open(md_file_path, encoding="utf-8") as md_file:
                    markdown_content = md_file.read()

                # Count some basic stats for the response
                lines = markdown_content.split("\n")
                word_count = len(markdown_content.split())

                logger.info(
                    f"Successfully converted DOCX to Markdown: {item_id}, {len(markdown_content)} characters"
                )

                return {
                    "content": markdown_content,
                    "id": metadata.get("id", ""),
                    "name": metadata.get("name", ""),
                    "path": metadata.get("parentReference", {}).get("path", "")
                    + "/"
                    + metadata.get("name", ""),
                    "size": metadata.get("size", 0),
                    "mime_type": metadata.get("file", {}).get("mimeType", ""),
                    "created_datetime": metadata.get("createdDateTime", ""),
                    "modified_datetime": metadata.get("lastModifiedDateTime", ""),
                    "lines_extracted": len(lines),
                    "word_count": word_count,
                    "note": "DOCX file successfully converted to Markdown using md2docx-python library.",
                }

            finally:
                # Clean up temporary files
                import os

                try:
                    os.unlink(docx_file_path)
                    os.unlink(md_file_path)
                except OSError:
                    pass  # Files already cleaned up

        except Exception as e:
            logger.error(f"Failed to convert DOCX file to Markdown: {item_id}, error: {e}")
            raise Exception(f"Failed to convert DOCX file: {e}") from e

🛠️ Refactor suggestion

Refactor duplicate code patterns

This method has the same issues as create_docx_from_markdown:

  1. Import statements inside the method
  2. Silent exception handling in cleanup
  3. Duplicate cleanup logic

Consider extracting the temporary file cleanup logic into a helper method:

def _cleanup_temp_files(self, *file_paths: str) -> None:
    """Clean up temporary files, logging any errors."""
    for file_path in file_paths:
        try:
            os.unlink(file_path)
        except OSError as e:
            logger.debug(f"Failed to clean up temporary file {file_path}: {e}")

Then use it in both methods:

             finally:
-                # Clean up temporary files
-                import os
-
-                try:
-                    os.unlink(docx_file_path)
-                    os.unlink(md_file_path)
-                except OSError:
-                    pass  # Files already cleaned up
+                self._cleanup_temp_files(docx_file_path, md_file_path)
🤖 Prompt for AI Agents
In backend/aci/server/app_connectors/microsoft_onedrive.py around lines 265 to
350, the method read_markdown_from_docx has import statements inside the method,
silent exception handling during temporary file cleanup, and duplicate cleanup
logic similar to create_docx_from_markdown. To fix this, move the import
statements to the top of the file, extract the temporary file cleanup code into
a new helper method _cleanup_temp_files that accepts file paths and logs any
cleanup errors, then replace the existing cleanup code in
read_markdown_from_docx and create_docx_from_markdown with calls to this helper
method.
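
An alternative shape for the same refactor, offered as a sketch rather than the reviewer's proposal: wrapping creation and cleanup in one context manager also covers the case where the second temporary file fails to be created, since in both methods the files are created before the inner try block begins. `temp_file_pair` is a hypothetical name.

```python
import logging
import os
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def temp_file_pair(first_suffix: str, second_suffix: str) -> Iterator[tuple[str, str]]:
    """Yield two temp file paths and remove each one independently on exit."""
    paths: list[str] = []
    try:
        for suffix in (first_suffix, second_suffix):
            fd, path = tempfile.mkstemp(suffix=suffix)
            os.close(fd)  # only the path is needed; callers reopen the file
            paths.append(path)
        yield paths[0], paths[1]
    finally:
        for path in paths:
            try:
                os.unlink(path)
            except OSError as exc:
                logger.debug("Failed to clean up temporary file %s: %s", path, exc)


# Demo: in the connector, the conversion call would sit inside this block.
with temp_file_pair(".docx", ".md") as (docx_path, md_path):
    with open(docx_path, "wb") as fh:
        fh.write(b"placeholder bytes")
    existed_inside = os.path.exists(docx_path) and os.path.exists(md_path)
```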
