Support arxiv html papers #209

Closed
dai-shuo opened this issue Nov 3, 2024 · 1 comment
Comments

dai-shuo commented Nov 3, 2024

arXiv provides a static HTML version of most papers, generated with LaTeXML. The HTML is well structured, with rich ltx_xxxx CSS class names, so parsing these pages should be lightning fast and yield very precise information.
It would be cool to support arXiv HTML parsing, either as a much faster branch or as a strong hint for the pipeline.
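
To illustrate the idea, here is a minimal sketch of class-based extraction using only the Python standard library. It is not how docling works; the `LtxExtractor` class and `extract_ltx` helper are hypothetical names, the sample markup is made up, and the class names `ltx_title_document`, `ltx_abstract` and `ltx_title_section` are assumed to match what LaTeXML emits on arXiv pages.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LtxExtractor(HTMLParser):
    """Collect the text of elements whose class list contains a target class."""

    def __init__(self, targets):
        super().__init__(convert_charrefs=True)
        self.targets = set(targets)
        self.found = {t: [] for t in targets}
        self._stack = []    # (tag, matched class or None) for each open element
        self._buffers = []  # (matched class, text chunks) for each open match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.targets), None)
        self._stack.append((tag, match))
        if match:
            self._buffers.append((match, []))

    def handle_endtag(self, tag):
        # Ignore stray end tags that have no matching open element.
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return
        # Pop back to the nearest open element with this tag name,
        # which tolerates elements that were left unclosed.
        while self._stack:
            open_tag, match = self._stack.pop()
            if match:
                name, chunks = self._buffers.pop()
                self.found[name].append(" ".join("".join(chunks).split()))
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for _, chunks in self._buffers:
            chunks.append(data)


def extract_ltx(html, targets):
    parser = LtxExtractor(targets)
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up sample in the style of LaTeXML output (not a real arXiv page).
SAMPLE = """
<article class="ltx_document">
  <h1 class="ltx_title ltx_title_document">A Sample Paper</h1>
  <div class="ltx_abstract"><p class="ltx_p">We parse <em>HTML</em> quickly.</p></div>
  <section class="ltx_section">
    <h2 class="ltx_title ltx_title_section">1 Introduction</h2>
  </section>
</article>
"""

result = extract_ltx(
    SAMPLE, ["ltx_title_document", "ltx_abstract", "ltx_title_section"]
)
print(result)
```

Because the structure is already encoded in the class names, a single pass over the markup is enough; no layout analysis or OCR is involved, which is where the speed advantage would come from.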

@PeterStaar-IBM
Contributor

@dai-shuo Excellent point! This is indeed possible with my latest PR (#240).

For example, if you run this:

poetry run docling --from html --to md "https://arxiv.org/html/2408.09869v3" --output ./scratch/

you will get:

[Screenshot 2024-11-05 at 07 23 10]
