diff --git a/HISTORY.md b/HISTORY.md
index 6c43ceb..ce91407 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -1,5 +1,17 @@
 ## History / Changelog
+
+### 0.9.4
+
+- new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains`
+- CLI: removed `--samplesize`, use `--sample` with an integer instead (#54)
+- added plausibility filter for domains/hosts (#48)
+- speedups and more efficient processing (#47, #49, #50)
+- fixed handling of relative URLs with @feltcat in #46
+- fixed bugs and ensured compatibility (#41, #43, #51, #56)
+- official support for Python 3.12
+
+
 ### 0.9.3
 
 - more efficient URL parsing (#33)
diff --git a/README.rst b/README.rst
index cf7757e..316ac81 100644
--- a/README.rst
+++ b/README.rst
@@ -264,11 +264,14 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,
 - URL management
   - ``add_urls(urls=[], appendleft=None, visited=False)``: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
+  - ``add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)``: Extract and filter links in a HTML string.
+  - ``discard(domains)``: Declare domains void and prune the store.
   - ``dump_urls()``: Return a list of all known URLs.
   - ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
   - ``print_unvisited_urls()``: Print all unvisited URLs in store.
   - ``get_all_counts()``: Return all download counts for the hosts in store.
   - ``get_known_domains()``: Return all known domains as a list.
+  - ``get_unvisited_domains()``: Find all domains for which there are unvisited URLs.
   - ``total_url_number()``: Find number of all URLs in store.
   - ``is_known(url)``: Check if the given URL has already been stored.
   - ``has_been_visited(url)``: Check if the given URL has already been visited.
@@ -281,6 +284,7 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,
 - Crawling and downloads
   - ``get_url(domain)``: Retrieve a single URL and consider it to be visited (with corresponding timestamp).
   - ``get_rules(domain)``: Return the stored crawling rules for the given website.
+  - ``store_rules(website, rules=None)``: Store crawling rules for a given website.
   - ``get_crawl_delay()``: Return the delay as extracted from robots.txt, or a given default.
   - ``get_download_urls(timelimit=10)``: Get a list of immediately downloadable URLs according to the given time limit per domain.
   - ``establish_download_schedule(max_urls=100, time_limit=10)``: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
diff --git a/courlan/__init__.py b/courlan/__init__.py
index facb8f8..62c51f2 100644
--- a/courlan/__init__.py
+++ b/courlan/__init__.py
@@ -8,7 +8,7 @@
 __author__ = "Adrien Barbaresi"
 __license__ = "GNU GPL v3+"
 __copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
-__version__ = "0.9.3"
+__version__ = "0.9.4"
 
 
 # imports
diff --git a/setup.py b/setup.py
index 095bfbb..011cf42 100644
--- a/setup.py
+++ b/setup.py
@@ -77,6 +77,7 @@ def get_long_description():
         "Programming Language :: Python :: 3.9",
         "Programming Language :: Python :: 3.10",
         "Programming Language :: Python :: 3.11",
+        "Programming Language :: Python :: 3.12",
         "Topic :: Internet :: WWW/HTTP",
         "Topic :: Scientific/Engineering :: Information Analysis",
         "Topic :: Text Processing :: Filters",
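
For context, a minimal usage sketch of the ``UrlStore`` additions documented in the README diff above (``add_from_html()``, ``discard()``, ``get_unvisited_domains()``, ``store_rules()``). The HTML snippet and URLs are illustrative placeholders, and the use of a ``urllib.robotparser.RobotFileParser`` instance for ``rules`` as well as the ``"https://example.org"`` domain key format are assumptions inferred from the surrounding API descriptions, not confirmed by the diff.

```python
# Sketch of the 0.9.4 UrlStore additions; signatures taken from the README diff above.
from urllib import robotparser

from courlan import UrlStore

url_store = UrlStore()

# seed the store with a list of URLs (pre-existing API)
url_store.add_urls(["https://example.org/page1", "https://example.org/page2"])

# new in 0.9.4: extract and filter links found in an HTML string
html = '<html><body><a href="/page3">more</a></body></html>'
url_store.add_from_html(html, "https://example.org/")

# new in 0.9.4: list domains for which unvisited URLs remain
print(url_store.get_unvisited_domains())

# new in 0.9.4: store crawling rules for a website
# assumption: rules is a urllib.robotparser.RobotFileParser instance,
# in line with get_rules()/get_crawl_delay() described in the same section
rules = robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Allow: /"])
url_store.store_rules("https://example.org", rules)

# new in 0.9.4: declare domains void and prune them from the store
url_store.discard(["https://example.org"])
```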