Pagelib

Pagelib is currently under development and is not ready for production or development environments.

Introduction

Pagelib turns nasty HTML strings into friendly HTML objects.

An HtmlPage object is constructed from an HTML string:

>>> from pagelib import HtmlPage
>>> page = HtmlPage('<html><head><title>Hello</title><meta name="description" content="Some page you\'ve downloaded from the web and now have to parse."></meta></head><body><p>Hello, world!</p></body></html>')
>>> page
HtmlPage(title=Hello, bytes=121)

Components of the page can be accessed through its properties:

>>> page.title
'Hello'
>>> page.description
"Some page you've downloaded from the web and now have to parse."
>>> page.language_code
'en'
>>> page.language
'English'
>>> page.text
'Hello, world!'
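
To give a feel for what properties such as `title` and `description` involve, here is a minimal sketch that uses only the standard library's `html.parser`. It is not Pagelib's implementation: the `MetaExtractor` class and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Hypothetical helper (not part of Pagelib): collects the <title>
    text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") == "description":
                self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title = (self.title or "") + data


html = (
    '<html><head><title>Hello</title>'
    '<meta name="description" content="A short example page.">'
    '</head><body><p>Hello, world!</p></body></html>'
)

extractor = MetaExtractor()
extractor.feed(html)
extractor.close()

print(extractor.title)
print(extractor.description)
```

A hand-rolled parser like this only covers the simplest cases, which is the kind of boilerplate an object with ready-made properties is meant to save.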

Pagelib exposes a parsel selector that can be used to extract further elements from the page using XPath or CSS expressions:

>>> page.selector.xpath('//p/text()').extract()
['Hello, world!']

Installation

Installing from PyPI

$ pip install pagelib

Dependencies

Pagelib depends on libicu-dev, which can be installed on Debian-based systems by running the following command:

$ sudo apt install libicu-dev
