Pagelib is currently under development and is not yet ready for use in production or development environments.
Pagelib turns nasty HTML strings into friendly HTML objects.
An HtmlPage object is constructed from an HTML string:
>>> from pagelib import HtmlPage
>>> page = HtmlPage('''<html><head><title>Hello</title><meta name="description" content="Some page you've downloaded from the web and now have to parse."></meta></head><body><p>Hello, world!</p></body></html>''')
>>> page
HtmlPage(title=Hello, bytes=121)
Components of the page can be accessed through its properties:
>>> page.title
'Hello'
>>> page.description
"Some page you've downloaded from the web and now have to parse."
>>> page.language_code
'en'
>>> page.language
'English'
>>> page.text
'Hello, world!'
Pagelib exposes a parsel selector that can be used to extract further elements from the page using XPath or CSS expressions:
>>> page.selector.xpath('//p/text()').extract()
['Hello, world!']
$ pip install pagelib
Pagelib depends on libicu-dev, which can be installed on Debian or Ubuntu by running the following command:
$ sudo apt install libicu-dev