prepare v0.9.4 (#53)
* prepare v0.9.4

* update history

* update changelog
adbar committed Sep 6, 2023
1 parent ac6589e commit 869912c
Showing 4 changed files with 18 additions and 1 deletion.
12 changes: 12 additions & 0 deletions HISTORY.md
@@ -1,5 +1,17 @@
## History / Changelog


### 0.9.4

- new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains()`
- CLI: removed `--samplesize`, use `--sample` with an integer instead (#54)
- added plausibility filter for domains/hosts (#48)
- speedups and more efficient processing (#47, #49, #50)
- fixed handling of relative URLs with @feltcat in #46
- fixed bugs and ensured compatibility (#41, #43, #51, #56)
- official support for Python 3.12


### 0.9.3

- more efficient URL parsing (#33)
4 changes: 4 additions & 0 deletions README.rst
@@ -264,11 +264,14 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,

- URL management
- ``add_urls(urls=[], appendleft=None, visited=False)``: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
- ``add_from_html(htmlstring, url, external=False, lang=None, with_nav=True)``: Extract and filter links in a HTML string.
- ``discard(domains)``: Declare domains void and prune the store.
- ``dump_urls()``: Return a list of all known URLs.
- ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
- ``print_unvisited_urls()``: Print all unvisited URLs in store.
- ``get_all_counts()``: Return all download counts for the hosts in store.
- ``get_known_domains()``: Return all known domains as a list.
- ``get_unvisited_domains()``: Find all domains for which there are unvisited URLs.
- ``total_url_number()``: Find number of all URLs in store.
- ``is_known(url)``: Check if the given URL has already been stored.
- ``has_been_visited(url)``: Check if the given URL has already been visited.
@@ -281,6 +284,7 @@ The ``UrlStore`` class allow for storing and retrieving domain-classified URLs,
- Crawling and downloads
- ``get_url(domain)``: Retrieve a single URL and consider it to be visited (with corresponding timestamp).
- ``get_rules(domain)``: Return the stored crawling rules for the given website.
- ``store_rules(website, rules=None)``: Store crawling rules for a given website.
- ``get_crawl_delay()``: Return the delay as extracted from robots.txt, or a given default.
- ``get_download_urls(timelimit=10)``: Get a list of immediately downloadable URLs according to the given time limit per domain.
- ``establish_download_schedule(max_urls=100, time_limit=10)``: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
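
To make the semantics of the newly documented ``discard(domains)`` and ``get_unvisited_domains()`` entries concrete, here is a toy, standard-library-only model of a domain-keyed URL store. It is a sketch under assumptions, not courlan's implementation: the class name ``MiniUrlStore``, its internal layout, and the choice of ``scheme://host`` as the domain key are all hypothetical and only mirror the behaviour described in the list above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class MiniUrlStore:
    """Hypothetical stand-in for a domain-classified URL store (not courlan's code)."""

    def __init__(self):
        # Assumed layout: "scheme://host" -> {path: visited flag}
        self._hosts = defaultdict(dict)

    def add_urls(self, urls, visited=False):
        """Classify URLs by host; the query string is ignored in this toy model."""
        for url in urls:
            parts = urlsplit(url)
            host = f"{parts.scheme}://{parts.netloc}"
            path = parts.path or "/"
            # Keep an existing "visited" mark if the URL is already known.
            self._hosts[host][path] = self._hosts[host].get(path, False) or visited

    def get_url(self, host):
        """Return one unvisited URL for the host and mark it as visited."""
        for path, seen in self._hosts.get(host, {}).items():
            if not seen:
                self._hosts[host][path] = True
                return host + path
        return None

    def get_unvisited_domains(self):
        """List hosts that still have at least one unvisited URL."""
        return [h for h, paths in self._hosts.items() if not all(paths.values())]

    def discard(self, hosts):
        """Drop the given hosts and all of their URLs from the store."""
        for host in hosts:
            self._hosts.pop(host, None)

    def total_url_number(self):
        """Count all URLs currently held."""
        return sum(len(paths) for paths in self._hosts.values())


store = MiniUrlStore()
store.add_urls(["https://a.example/1", "https://a.example/2", "https://b.example/x"])
store.get_url("https://b.example")        # consumes the only URL of b.example
unvisited = store.get_unvisited_domains()  # only a.example still has work left
store.discard(["https://a.example"])       # prune a.example entirely
remaining = store.total_url_number()
```

In this model, discarding a host removes its URLs outright, which is one plausible reading of "declare domains void and prune the store"; the real class may additionally remember discarded domains so that they are not re-added.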
2 changes: 1 addition & 1 deletion courlan/__init__.py
@@ -8,7 +8,7 @@
__author__ = "Adrien Barbaresi"
__license__ = "GNU GPL v3+"
__copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
-__version__ = "0.9.3"
+__version__ = "0.9.4"


# imports
1 change: 1 addition & 0 deletions setup.py
@@ -77,6 +77,7 @@ def get_long_description():
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Internet :: WWW/HTTP",
"Topic :: Scientific/Engineering :: Information Analysis",
"Topic :: Text Processing :: Filters",
