urlclean provides functions:
- to follow an HTTP redirect,
- to follow an HTML META redirect,
- to remove Urchin and Facebook tracker URL parameters,
- plugins for further cleaning power,
- a function that combines all of these to unshorten and resolve various URLs.
Try it out from the command line:
python -m urlclean <some url>
urlcleaner: a module that resolves redirected URLs and removes tracking URL parameters
urlclean.weedparams(url)
Removes Urchin Tracker and Facebook surveillance parameters from URLs.
Args:
url (str): The URL to scrub
Returns: (str). The cleaned URL
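The parameter scrubbing described above can be sketched with the standard library alone. This is a minimal illustration, not urlclean's actual implementation: the function name strip_tracking_params and the prefix list ("utm_" for Urchin, "fb_" for Facebook) are assumptions, and the set of parameters urlclean really removes may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed prefixes for illustration; urlclean's real list may differ.
TRACKER_PREFIXES = ("utm_", "fb_")


def strip_tracking_params(url):
    """Return url without query parameters whose names start with a tracker prefix."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith(TRACKER_PREFIXES)
    ]
    # Rebuild the URL with only the surviving parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params("http://example.com/a?id=1&utm_source=x&fb_ref=y"))
```

Parsing the query string instead of applying a regular expression to the raw URL keeps the remaining parameters intact and in their original order.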
urlclean.httpresolve(url, ua=None, proxyhost='', proxyport='')
Resolves one redirection of an HTTP request.
Args:
url (str): The url to follow one redirect
ua (fn): A function returning a User Agent string (optional)
proxyhost (str): http proxy server (optional)
proxyport (int): http proxy server port (optional)
Returns: (str, http.client.response). The resolved URL and the response from the HTTP query
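Following a single redirect hop could look roughly like the sketch below. It is not urlclean's code: the names next_url and follow_once are hypothetical, the use of a HEAD request is an assumption, and the User-Agent and proxy options are left out for brevity. The redirect target is resolved with urljoin so that relative Location headers work too.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_url(url, status, location):
    """Return the redirect target, or url unchanged if the response is not a redirect."""
    if status in REDIRECT_CODES and location:
        # urljoin resolves relative Location values against the request URL.
        return urljoin(url, location)
    return url


def follow_once(url, timeout=10):
    """Issue one HEAD request and return (resolved_url, response)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        res = conn.getresponse()
        res.read()  # drain the (empty) body before closing the connection
        return next_url(url, res.status, res.getheader("Location")), res
    finally:
        conn.close()
```

Keeping the status and Location handling in a separate pure function makes that part easy to check without network access.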
urlclean.unmeta(url, res)
Finds any META redirects in an http.client.response object that has text/html as its content type.
Args:
url (str): The url to follow one redirect
res (http.client.response): a http response object
Returns: (str). The resolved URL
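To illustrate what finding a META redirect involves, here is a small sketch built on the standard library's html.parser. It works on an HTML string rather than on a response object, and the names MetaRefreshParser and meta_redirect are hypothetical; urlclean's own parsing may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://example.com/".
        _, _, rest = (attrs.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def meta_redirect(base_url, html):
    """Return the META refresh target resolved against base_url, or base_url if none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return urljoin(base_url, parser.target) if parser.target else base_url


page = '<html><head><meta http-equiv="refresh" content="0; url=/next"></head></html>'
print(meta_redirect("http://example.com/a", page))
```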
urlclean.unshorten(url, cache=None, ua=None, **kwargs)
Resolves all HTTP/META redirects and optionally caches the results in any object that supports the __getitem__/__setitem__ interface.
Args:
url (str): The URL to resolve
cache (PersistentCryptoDict): an optional PersistentCryptoDict instance
ua (fn): A function returning a User Agent string (optional); the default is googlebot.
**kwargs (dict): optional proxy args for urlclean.httpresolve (default: localhost:8118)
Returns: (str). The final cleaned URL.
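The resolve-until-stable loop with an optional cache can be sketched as follows. This is only an illustration of the idea: resolve_all, resolve_once and max_hops are hypothetical names, the single-hop resolver is passed in as a function so the example runs without network access, and the cache is assumed to raise KeyError on a miss, as a plain dict does.

```python
def resolve_all(url, resolve_once, cache=None, max_hops=10):
    """Apply resolve_once repeatedly until the URL stops changing.

    cache may be any object with __getitem__ and __setitem__; a missing
    key is assumed to raise KeyError.
    """
    if cache is not None:
        try:
            return cache[url]
        except KeyError:
            pass
    current = url
    seen = {url}
    for _ in range(max_hops):
        nxt = resolve_once(current)
        if nxt == current or nxt in seen:
            break  # nothing left to follow, or a redirect loop
        seen.add(nxt)
        current = nxt
    if cache is not None:
        cache[url] = current
    return current


# Stand-in for real HTTP/META resolution: a fixed table of redirects.
hops = {"http://a.example/1": "http://b.example/2", "http://b.example/2": "http://c.example/3"}
cache = {}
print(resolve_all("http://a.example/1", lambda u: hops.get(u, u), cache))
```

A plain dict is enough to try this out; any persistent mapping with the same two methods can be swapped in.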
Plugins should have a convert function that receives and returns a URL. In case of an error an unchanged URL should be returned.
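A plugin following this contract might look like the sketch below. The example is hypothetical and not one of urlclean's bundled plugins: it expands youtu.be short links to the full watch URL, and it returns the input untouched for any other URL or on any error. How plugins are registered or discovered is not covered here.

```python
from urllib.parse import urlsplit


def convert(url):
    """Expand a youtu.be short link; return url unchanged otherwise or on error."""
    try:
        parts = urlsplit(url)
        video_id = parts.path.lstrip("/")
        if parts.netloc.lower() == "youtu.be" and video_id:
            # Any query parameters on the short link are dropped in this sketch.
            return "https://www.youtube.com/watch?v=" + video_id
    except Exception:
        pass  # per the contract above, fall through and return the original URL
    return url


print(convert("https://youtu.be/abc123"))
```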
- v0.6.0 - migrated to Python 3
- v0.5.4 - fixed httpresolve for relative urls
- v0.5.1 - install/doc fixes
- v0.5 - added plugins