prepare v0.9.4 (#53)

* prepare v0.9.4
* update history
* update changelog
adbar · Sep 6, 2023 · 869912c
1 parent ac6589e
commit 869912c

Showing 4 changed files with 18 additions and 1 deletion.
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,5 +1,17 @@
 ## History / Changelog
 
+
+### 0.9.4
+
+- new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains`
+- CLI: removed `--samplesize`, use `--sample` with an integer instead (#54)
+- added plausibility filter for domains/hosts (#48)
+- speedups and more efficient processing (#47, #49, #50)
+- fixed handling of relative URLs with @feltcat in #46
+- fixed bugs and ensured compatibility (#41, #43, #51, #56)
+- official support for Python 3.12
+
+
 ### 0.9.3
 
 - more efficient URL parsing (#33)

diff --git a/README.rst b/README.rst
@@ -264,11 +264,14 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,
 
 - URL management
    - ``add_urls(urls=[], appendleft=None, visited=False)``: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
+   - ``add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)``: Extract and filter links in a HTML string.
+   - ``discard(domains)``: Declare domains void and prune the store.
    - ``dump_urls()``: Return a list of all known URLs.
    - ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
    - ``print_unvisited_urls()``: Print all unvisited URLs in store.
    - ``get_all_counts()``: Return all download counts for the hosts in store.
    - ``get_known_domains()``: Return all known domains as a list.
+   - ``get_unvisited_domains()``: Find all domains for which there are unvisited URLs.
    - ``total_url_number()``: Find number of all URLs in store.
    - ``is_known(url)``: Check if the given URL has already been stored.
    - ``has_been_visited(url)``: Check if the given URL has already been visited.
@@ -281,6 +284,7 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,
 - Crawling and downloads
    - ``get_url(domain)``: Retrieve a single URL and consider it to be visited (with corresponding timestamp).
    - ``get_rules(domain)``: Return the stored crawling rules for the given website.
+   - ``store_rules(website, rules=None)``: Store crawling rules for a given website.
    - ``get_crawl_delay()``: Return the delay as extracted from robots.txt, or a given default.
    - ``get_download_urls(timelimit=10)``: Get a list of immediately downloadable URLs according to the given time limit per domain.
    - ``establish_download_schedule(max_urls=100, time_limit=10)``: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).

diff --git a/courlan/__init__.py b/courlan/__init__.py
@@ -8,7 +8,7 @@
 __author__ = "Adrien Barbaresi"
 __license__ = "GNU GPL v3+"
 __copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
-__version__ = "0.9.3"
+__version__ = "0.9.4"
 
 
 # imports

diff --git a/setup.py b/setup.py
@@ -77,6 +77,7 @@ def get_long_description():
         "Programming Language :: Python :: 3.9",
         "Programming Language :: Python :: 3.10",
         "Programming Language :: Python :: 3.11",
+        "Programming Language :: Python :: 3.12",
         "Topic :: Internet :: WWW/HTTP",
         "Topic :: Scientific/Engineering :: Information Analysis",
         "Topic :: Text Processing :: Filters",