prepare v0.9.5 (#66)
* prepare v0.9.5

* update readme

* update readme

* update readme
adbar committed Nov 28, 2023
1 parent b61b1b3 commit 28b7425
Showing 4 changed files with 51 additions and 18 deletions.
8 changes: 8 additions & 0 deletions HISTORY.md
@@ -1,6 +1,14 @@
 ## History / Changelog
 
+
+### 0.9.5
+
+- IRI to URI normalization: encode path, query and fragments (#58, #60)
+- normalization: strip common trackers (#65)
+- new function `is_valid_url()` (#63)
+- hardening of domain filter (#64)
+
 
 ### 0.9.4
 
 - new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains`
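The IRI-to-URI normalization noted in the changelog can be illustrated with the standard library alone. A minimal sketch, not courlan's actual implementation (`iri_to_uri` is a hypothetical helper name); it percent-encodes the path, query and fragment while leaving scheme and host untouched:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Percent-encode path, query and fragment of an IRI (rough sketch)."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # internationalized domain names would also need IDNA handling
        quote(parts.path, safe="/%"),    # keep slashes and existing escapes
        quote(parts.query, safe="=&%"),  # keep key=value structure
        quote(parts.fragment, safe="%"),
    ))

print(iri_to_uri("https://example.org/béton?q=café"))
# https://example.org/b%C3%A9ton?q=caf%C3%A9
```

Keeping `%` in the `safe` sets avoids double-encoding inputs that are already partially escaped.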
54 changes: 37 additions & 17 deletions README.rst
@@ -27,31 +27,33 @@ Why coURLan?
“Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)


-This library provides an additional “brain” for web crawling, scraping and management of web archives:
+This library provides an additional “brain” for web crawling, scraping and document management. It facilitates web navigation through a set of filters, enhancing the quality of resulting document collections:
 
-- Avoid loosing bandwidth capacity and processing time for webpages which are probably not worth the effort.
-- Stay away from pages with little text content or explicitly target synoptic pages to gather links.
+- Save bandwidth and processing time by steering clear of pages deemed low-value
+- Identify specific pages based on language or text content
+- Pinpoint pages relevant for efficient link gathering
 
-Using content and language-focused filters, Courlan helps navigating the Web so as to improve the resulting document collections. Additional functions include straightforward domain name extraction and URL sampling.
+Additional utilities needed include URL storage, filtering, and deduplication.


Features
--------

-Separate `the wheat from the chaff <https://en.wiktionary.org/wiki/separate_the_wheat_from_the_chaff>`_ and optimize crawls by focusing on non-spam HTML pages containing primarily text.
+Separate the wheat from the chaff and optimize document discovery and retrieval:
 
-- Heuristics for triage of links
-- Targeting spam and unsuitable content-types
-- Language-aware filtering
-- Crawl management
 - URL handling
 - Validation
-- Canonicalization/Normalization
+- Normalization
 - Sampling
+- Heuristics for link filtering
+- Spam, trackers, and content-types
+- Language/Locale-aware processing
+- Web crawling (frontier, scheduling)
+- Data store specifically designed for URLs
+- Usable with Python or on the command-line
 
 
-**Let the coURLan fish out juicy bits for you!**
+**Let the coURLan fish up juicy bits for you!**

.. image:: courlan_harns-march.jpg
:alt: Courlan
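The sampling listed among the features can be approximated with the standard library. A rough sketch under stated assumptions: `sample_per_domain` and the draw size `k` are illustrative names, not courlan's API, and drawing a bounded number of URLs per domain is only one possible sampling policy:

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def sample_per_domain(urls, k, seed=42):
    """Draw at most k URLs per domain (illustrative sketch, not courlan's API)."""
    rng = random.Random(seed)  # seeded for reproducible draws
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlsplit(url).netloc].append(url)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(k, len(bucket))))
    return sample

urls = [f"https://example.org/page{i}" for i in range(10)] + ["https://example.net/about"]
print(len(sample_per_domain(urls, 3)))  # 3 from example.org + 1 from example.net
```

Capping the per-domain count keeps a single large site from dominating a crawl frontier.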
@@ -91,29 +93,43 @@ All useful operations chained in ``check_url(url)``:
.. code-block:: python
>>> from courlan import check_url
-# returns url and domain name
+# return url and domain name
 >>> check_url('https://github.com/adbar/courlan')
 ('https://github.com/adbar/courlan', 'github.com')
-# noisy query parameters can be removed
-my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
+# filter out bogus domains
+>>> check_url('http://666.0.0.1/')
+>>>
+# tracker removal
+>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
+('http://test.net/foo.html', 'test.net')
+# use strict for further trimming
+>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
 >>> check_url(my_url, strict=True)
 ('https://httpbin.org/redirect-to', 'httpbin.org')
-# Check for redirects (HEAD request)
+# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
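The tracker removal shown in the example above can be sketched with the standard library; an illustrative approximation, not the library's actual rule set (`strip_trackers` is a hypothetical name and `TRACKERS` is a small subset of real-world tracking parameters):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKERS = {"gclid", "fbclid", "msclkid"}  # small illustrative subset

def strip_trackers(url):
    """Drop common tracking parameters from the query string (sketch)."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key not in TRACKERS and not key.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

print(strip_trackers("http://test.net/foo.html?utm_source=twitter&page=2"))
# http://test.net/foo.html?page=2
```

Round-tripping through `parse_qsl`/`urlencode` preserves the remaining parameters in order while dropping the flagged ones.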
Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)``:

.. code-block:: python
-# optional argument targeting webpages in English or German
+# optional language argument
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
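The language filter demonstrated above relies on internationalization markers in URLs. A toy approximation of the idea, not courlan's actual `lang_filter` logic: accept the URL unless a two-letter path segment contradicts the target language (`path_language_ok` is an illustrative name):

```python
from urllib.parse import urlsplit

def path_language_ok(url, language):
    """Accept the URL unless a two-letter path segment contradicts `language` (sketch)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    codes = [s for s in segments if len(s) == 2 and s.isalpha()]
    return not codes or language in codes

print(path_language_ok("https://www.un.org/en/about-us", "en"))  # True
print(path_language_ok("https://www.un.org/en/about-us", "de"))  # False
```

URLs without any language marker pass by default, mirroring the lenient behavior shown for `https://en.wikipedia.org/` (whose marker sits in the subdomain, not the path).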
@@ -176,12 +192,16 @@ Other useful functions dedicated to URL handling:
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
 >>> get_base_url(url)
+'https://www.un.org'
 >>> get_host_and_path(url)
+('https://www.un.org', '/en/about-us')
 >>> get_hostinfo(url)
+('un.org', 'https://www.un.org')
 >>> fix_relative_urls('https://www.un.org', 'en/about-us')
+'https://www.un.org/en/about-us'
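For comparison, parts of the URL handling shown above map onto the standard library. A sketch reproducing the same outputs with urllib (`base_url` is our own helper name, not courlan's):

```python
from urllib.parse import urljoin, urlsplit

def base_url(url):
    """Return scheme://host, similar in spirit to get_base_url (sketch)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

print(base_url("https://www.un.org/en/about-us"))    # https://www.un.org
print(urljoin("https://www.un.org", "en/about-us"))  # https://www.un.org/en/about-us
```

The library bundles such operations with validation and domain extraction, which plain `urllib` does not provide.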
@@ -390,7 +410,7 @@ Software ecosystem: see `this graphic <https://github.com/adbar/trafilatura/blob
Similar work
------------

-These Python libraries perform similar normalization tasks but do not entail language or content filters. They also do not focus on crawl optimization:
+These Python libraries perform similar handling and normalization tasks but do not entail language or content filters. They also do not primarily focus on crawl optimization:

- `furl <https://github.com/gruns/furl>`_
- `ural <https://github.com/medialab/ural>`_
2 changes: 1 addition & 1 deletion courlan/__init__.py
@@ -8,7 +8,7 @@
__author__ = "Adrien Barbaresi"
__license__ = "GNU GPL v3+"
__copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
-__version__ = "0.9.4"
+__version__ = "0.9.5"


# imports
5 changes: 5 additions & 0 deletions tests/unit_tests.py
@@ -1135,6 +1135,11 @@ def test_examples():
         "https://github.com/adbar/courlan",
         "github.com",
     )
+    assert check_url("http://666.0.0.1/") is None
+    assert check_url("http://test.net/foo.html?utm_source=twitter#gclid=123") == (
+        "http://test.net/foo.html",
+        "test.net",
+    )
     assert check_url(
         "https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org", strict=True
     ) == ("https://httpbin.org/redirect-to", "httpbin.org")