prepare version 1 (#84)
adbar authored Feb 1, 2024
1 parent 9dde0f0 commit 1cfb7db
Showing 4 changed files with 21 additions and 2 deletions.
8 changes: 8 additions & 0 deletions HISTORY.md
@@ -1,6 +1,14 @@
## History / Changelog


### 1.0.0

- license change from GPLv3+ to Apache 2.0 (#81)
- UrlStore: `write()` method and `load_store()` function added (#83)
- add parameter `trailing_slash` to keep or discard slashes at the end of URLs (#52)
- maintenance: fix whitespace in `clean_url()` (#77), simplify code (#79)


### 0.9.5

- IRI to URI normalization: encode path, query and fragments (#58, #60)
10 changes: 10 additions & 0 deletions README.rst
@@ -114,6 +114,12 @@ All useful operations chained in ``check_url(url)``:
# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)
# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
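As a rough illustration of what trimming a trailing slash involves, here is a standard-library sketch. It is not courlan's implementation: `strip_trailing_slash` is a hypothetical helper, and how `check_url` itself handles edge cases such as a bare root path is not stated in this commit.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url: str) -> str:
    """Hypothetical helper: remove trailing slashes from the path only."""
    parts = urlsplit(url)
    # Only the path component is trimmed; query and fragment are left untouched.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(strip_trailing_slash("https://github.com/adbar/courlan/"))
print(strip_trailing_slash("https://example.org/a/?x=1"))
```

Splitting the URL first keeps the trimming confined to the path, so a slash that happens to end a query string or fragment is not affected.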
Language-aware heuristics, notably internationalization in URLs, are available in ``lang_filter(url, language)``:

@@ -311,6 +317,10 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- ``download_threshold_reached(threshold)``: Find out if the download limit (in seconds) has been reached for one of the websites in store.
- ``unvisited_websites_number()``: Return the number of websites for which there are still URLs to visit.
- ``is_exhausted_domain(domain)``: Tell if all known URLs for the website have been visited.
- Persistence
- ``write(filename)``: Save the store to disk.
- ``load_store(filename)``: Read a UrlStore from disk (separate function, not class method).
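The split between a `write()` method and a module-level `load_store()` function can be sketched with a toy class. Everything below is an assumption made for illustration: `TinyStore`, its `add()` method and the JSON file format are hypothetical and do not reflect how `UrlStore` actually serializes its data.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical stand-in for a URL store: maps a domain to its known paths."""

    def __init__(self):
        self.urls = {}

    def add(self, domain, path):
        self.urls.setdefault(domain, []).append(path)

    def write(self, filename):
        # Saving is a method: it needs an existing instance to serialize.
        with open(filename, "w", encoding="utf-8") as outputfile:
            json.dump(self.urls, outputfile)


def load_store(filename):
    # Loading is a plain function: no instance exists before the file is read.
    store = TinyStore()
    with open(filename, "r", encoding="utf-8") as inputfile:
        store.urls = json.load(inputfile)
    return store


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "store.json")
    original = TinyStore()
    original.add("https://example.org", "/page/1")
    original.write(target)
    restored = load_store(target)

print(restored.urls == original.urls)
```

Keeping the loader outside the class means callers do not have to construct an empty store just to replace its contents.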


Optional settings:
- ``compressed=True``: activate compression of URLs and rules
4 changes: 2 additions & 2 deletions courlan/__init__.py
@@ -6,8 +6,8 @@
__title__ = "courlan"
__author__ = "Adrien Barbaresi"
__license__ = "Apache-2.0"
__copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
__version__ = "0.9.5"
__copyright__ = "Copyright 2020-2024, Adrien Barbaresi"
__version__ = "1.0.0"


# imports
1 change: 1 addition & 0 deletions courlan/core.py
@@ -54,6 +54,7 @@ def check_url(
with_redirects: set to True for redirection test (per HTTP HEAD request)
language: set target language (ISO 639-1 codes)
with_nav: set to True to include navigation pages instead of discarding them
trailing_slash: set to False to trim trailing slashes
Returns:
A tuple consisting of canonical URL and extracted domain