HTML parsing fails on empty documents #112

An exception like this can be raised by functions from `extruct.utils` when the input document is empty:
```
document = parse_xmldom_html(html_string, encoding=encoding)
File "/usr/local/lib/python3.6/dist-packages/extruct/utils.py", line 16, in parse_xmldom_html
return lxml.html.fromstring(html, parser=parser)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
```

parsel works around this by handling empty documents explicitly, and it also handles a related issue with null bytes in the input. I think we should bring similar fixes to extruct. See https://github.com/scrapy/parsel/blob/e01093cf6342c90445028de28034b3cc3d2ead8b/parsel/selector.py#L38.
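
A minimal sketch of what such a fix could look like, loosely modelled on the approach described above (strip null bytes, substitute a placeholder document when nothing is left). The names `PLACEHOLDER_HTML`, `sanitize_html_input` and `parse_html_safely` are hypothetical and not part of extruct or parsel; this is an illustration under those assumptions, not the actual patch.

```python
from typing import Optional

# Hypothetical fallback document used when the input is effectively empty.
PLACEHOLDER_HTML = "<html/>"


def sanitize_html_input(html: str) -> str:
    """Remove null bytes and surrounding whitespace; fall back to a placeholder."""
    cleaned = html.replace("\x00", "").strip()
    return cleaned or PLACEHOLDER_HTML


def parse_html_safely(html: str, base_url: Optional[str] = None):
    """Parse HTML without raising on empty input (illustrative only)."""
    # Imported lazily so the sanitizing helper stays usable without lxml.
    from lxml import etree

    parser = etree.HTMLParser(recover=True, encoding="utf8")
    body = sanitize_html_input(html).encode("utf8")
    root = etree.fromstring(body, parser=parser, base_url=base_url)
    if root is None:
        # Defensive: if the parser still produces no root element,
        # parse the placeholder instead of failing.
        root = etree.fromstring(
            PLACEHOLDER_HTML.encode("utf8"), parser=parser, base_url=base_url
        )
    return root
```

With something like this in place, a call such as `parse_html_safely("")` would be expected to return an element rather than raise `lxml.etree.ParserError`. Whether the placeholder should be an empty `<html/>` or something else is a design choice for extruct to make.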
