Add OneNote support #55

Open
wants to merge 2 commits into
base: main

HendricksJudy

Fixes #47

Add support for OneNote file conversion.

  • README.md: Add OneNote to the list of supported file formats. Add a note about using one-extract for OneNote support. Provide an example of converting OneNote files.
  • pyproject.toml: Add onenote to the list of dependencies. Add a note about OneNote support.
  • src/markitdown/_markitdown.py: Import one_extract as onenote. Add a new class OneNoteConverter to handle OneNote files. Register the OneNoteConverter in the MarkItDown class.
  • tests/test_markitdown.py: Add test strings for OneNote. Add a test case for OneNote file conversion.
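The converter described in the list above could look roughly like the following. This is a minimal sketch, not the code from this PR: `DocumentConverterResult` is a local stand-in defined here for illustration, the `convert(local_path, **kwargs)` shape and the `file_extension` keyword are assumptions, and `extract_text` is a hypothetical injected callable, because the API of the OneNote library is not shown in this conversation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DocumentConverterResult:
    """Local stand-in for a conversion result (assumption, not imported from markitdown)."""

    title: Optional[str] = None
    text_content: str = ""


class OneNoteConverter:
    """Hypothetical converter that accepts only .one files and delegates text extraction."""

    def __init__(self, extract_text: Callable[[bytes], str]) -> None:
        # extract_text is a hypothetical hook (bytes -> str); a real converter
        # would wrap whichever OneNote parsing library the project settles on.
        self._extract_text = extract_text

    def convert(self, local_path: str, **kwargs: Any) -> Optional[DocumentConverterResult]:
        # Decline anything that is not a OneNote file so another converter can try.
        extension = str(kwargs.get("file_extension", ""))
        if extension.lower() != ".one":
            return None

        # Read the raw bytes and hand them to the injected extractor.
        with open(local_path, "rb") as handle:
            raw = handle.read()

        return DocumentConverterResult(
            title=None,
            text_content=self._extract_text(raw).strip(),
        )
```

Returning `None` for other extensions keeps the converter safe to place in a chain of converters. Registration in the `MarkItDown` class is not shown here.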

@HendricksJudy
Author

@microsoft-github-policy-service agree [company="{your company}"]

@HendricksJudy
Author

@microsoft-github-policy-service agree

@@ -179,7 +194,7 @@ def test_markitdown_exiftool() -> None:
assert target in result.text_content


-if __name__ == "__main__":
+if __name__main__":

Contributor

It looks like there's a small typo in the `if __name__main__":` line.

Author

Thanks for checking. I have already fixed it.

@@ -164,6 +173,12 @@ def test_markitdown_local() -> None:
for test_string in SERP_TEST_STRINGS:
assert test_string in text_content

# Test OneNote processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.one"))

Contributor

Did you forget to add the test file, test.one?
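One way to catch a missing fixture like this before the conversion call fails is a small pre-check. This is a sketch only: `missing_fixtures` is a hypothetical helper name, and it assumes the fixtures live directly under the directory the tests refer to as `TEST_FILES_DIR`.

```python
import os
from typing import Iterable, List


def missing_fixtures(test_files_dir: str, names: Iterable[str]) -> List[str]:
    """Return the fixture names that do not exist as files in test_files_dir."""
    return [
        name
        for name in names
        if not os.path.isfile(os.path.join(test_files_dir, name))
    ]
```

A test could assert that `missing_fixtures(TEST_FILES_DIR, ["test.one"])` is empty, which turns a forgotten file into a clear failure message rather than an error inside the converter.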

@@ -12,6 +12,9 @@ It presently supports:
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- OneNote (.one)

Note: OneNote is not supported.

Contributor

Is this a typo? The note doesn't seem consistent with the list entry above it.

@gagb self-assigned this on Dec 17, 2024
@afourney
Member

@HendricksJudy there are now some small merge conflicts (should be easy to address).

Let's fix those conflicts, add the test.one file, and address @gagb's comment, and we can likely include this in the next alpha release.

@@ -51,6 +54,18 @@ result = md.convert("example.jpg")
print(result.text_content)
```

To convert OneNote files, you can use the following example:

Member

This example is redundant; there’s already one for converting files.

@HendricksJudy
Author

> @HendricksJudy there are now some small merge conflicts (should be easy to address).
>
> Let's fix those conflicts, add the test.one file, and address @gagb's comment, and we can likely include this in the next alpha release.

Thanks for your review, I will fix those conflicts ASAP.

@gagb added the "awaiting op response" label (The PR is awaiting response/edits from the original poster.) on Dec 20, 2024
@@ -38,6 +38,7 @@ dependencies = [
"youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
"onenote",
Member

@lalo, Dec 20, 2024

Is this the correct package name? I could not find it on PyPI, but found this one instead: https://pypi.org/project/one-extract/ Can you link the appropriate one?
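The question above turns on the difference between a distribution name (what goes in `pyproject.toml` and what `pip` installs) and a module name (what the code imports). A minimal sketch for checking both locally, using only the standard library; `describe_dependency` is a hypothetical helper name, and pairing `one-extract` with `one_extract` is taken from the PR description rather than verified here.

```python
from importlib import metadata, util
from typing import Dict, Optional, Union


def describe_dependency(
    dist_name: str, module_name: str
) -> Dict[str, Union[str, bool, None]]:
    """Report whether a distribution is installed and a top-level module is importable."""
    try:
        # Looks up installed package metadata by distribution name.
        version: Optional[str] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None

    return {
        "distribution": dist_name,
        "installed_version": version,
        # find_spec returns None when a top-level module cannot be located.
        "importable": util.find_spec(module_name) is not None,
    }
```

For example, `describe_dependency("one-extract", "one_extract")` reports whether the dependency string and the import statement agree in the current environment. It does not confirm which package the PR intended.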

Labels
awaiting op response The PR is awaiting response/edits from the original poster.
Development

Successfully merging this pull request may close these issues.

Any plan on onenote?
6 participants