Support arxiv html papers #209

dai-shuo · 2024-11-03T01:43:34Z

arXiv provides a static HTML version of most papers, generated with LaTeXML. The HTML is well structured and annotated with rich ltx_xxxx CSS class names, so parsing these pages should be lightning fast and yield very precise information.
It would be cool to support arXiv HTML parsing, either as a much faster branch or as a strong hint for the pipeline.
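
To illustrate the idea, here is a minimal sketch of class-based extraction using only Python's standard library. It is not how docling does it. The class names `ltx_title` and `ltx_p` and the `SAMPLE` markup are assumptions chosen for illustration, and `LtxExtractor` and `extract` are hypothetical helper names. Real arXiv pages are much richer than this.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are kept off the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LtxExtractor(HTMLParser):
    """Collect the text of elements whose class list contains a target class."""

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.found = {t: [] for t in targets}
        # One entry per open element: (tag, matched target or None, text parts).
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        target = next((c for c in classes if c in self.targets), None)
        self._stack.append((tag, target, []))

    def handle_data(self, data):
        # Text belongs to every open element that matched a target class.
        for _, target, parts in self._stack:
            if target is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Find the most recent open element with this tag, tolerating
        # children that were never explicitly closed.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                break
        else:
            return  # stray closing tag, ignore it
        closed = self._stack[i:]
        del self._stack[i:]
        for _, target, parts in closed:
            if target is not None:
                # Collapse runs of whitespace into single spaces.
                self.found[target].append(" ".join("".join(parts).split()))


def extract(html, targets=("ltx_title", "ltx_p")):
    """Return a dict mapping each target class to the texts found under it."""
    parser = LtxExtractor(targets)
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical markup in the style of LaTeXML output (an assumption).
SAMPLE = """
<article class="ltx_document">
  <h1 class="ltx_title ltx_title_document">A Sample Paper</h1>
  <div class="ltx_abstract"><p class="ltx_p">We study <em>things</em>.</p></div>
  <div class="ltx_para"><p class="ltx_p">First paragraph.</p></div>
</article>
"""

if __name__ == "__main__":
    for name, texts in extract(SAMPLE).items():
        print(name, texts)
```

Because the structure is already encoded in the class names, a single pass over the markup is enough here, with no layout analysis or OCR involved.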

PeterStaar-IBM · 2024-11-05T06:23:50Z

@dai-shuo Excellent point! With my latest PR (#240), this is indeed possible.

For example, if you run this,

poetry run docling --from html --to md "https://arxiv.org/html/2408.09869v3" --output ./scratch/

you will get,

PeterStaar-IBM closed this as completed Nov 6, 2024
