Open
An exception like this can be raised by functions from extruct.utils:
    document = parse_xmldom_html(html_string, encoding=encoding)
  File "/usr/local/lib/python3.6/dist-packages/extruct/utils.py", line 16, in parse_xmldom_html
    return lxml.html.fromstring(html, parser=parser)
  File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 765, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
parsel works around this by handling empty documents explicitly. It also handles a related issue with null bytes in the input. I think we should bring similar fixes to extruct. See https://github.com/scrapy/parsel/blob/e01093cf6342c90445028de28034b3cc3d2ead8b/parsel/selector.py#L38.
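
To make the proposal concrete, here is a rough sketch of what such a fix could look like. It is only an illustration of the idea (drop null bytes, fall back to a minimal document when nothing is left), not parsel's or extruct's actual code. The names `sanitize_html` and `parse_html_safely` are hypothetical, and always encoding as UTF-8 is an assumption.

```python
def sanitize_html(html, fallback="<html/>"):
    """Hypothetical helper: remove null bytes and surrounding whitespace.

    Returns `fallback` when nothing is left, so the parser never
    receives an empty document.
    """
    cleaned = html.replace("\x00", "").strip()
    return cleaned or fallback


def parse_html_safely(html):
    """Hypothetical wrapper around lxml.html.fromstring (not extruct's API).

    Assumes `html` is a str; it is encoded as UTF-8 before parsing.
    """
    # Imported here so sanitize_html can be used without lxml installed.
    import lxml.html

    parser = lxml.html.HTMLParser(encoding="utf-8")
    body = sanitize_html(html).encode("utf-8")
    return lxml.html.fromstring(body, parser=parser)
```

With this, `parse_html_safely("")` would parse the `<html/>` fallback instead of being handed an empty string. Other degenerate inputs (for example a document containing only a comment) may still need separate handling.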