roundup before release 0.9.0
adbar committed Mar 7, 2023
1 parent 23e415c commit b13571f
Showing 6 changed files with 23 additions and 7 deletions.
10 changes: 10 additions & 0 deletions HISTORY.md
@@ -1,6 +1,16 @@
## History / Changelog


### 0.9.0

- hardening of filters and URL parsers (#14)
- normalize punycode to unicode
- methods added to `UrlStore`: `get_crawl_delay()`, `print_unvisited_urls()`
- `UrlStore` now triggers exit code 1 when interrupted
- argument added to `extract_links()`: `no_filter`
- code refactoring: simplifications

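The punycode normalization listed above can be illustrated with Python's built-in `idna` codec; this is a minimal sketch, not courlan's actual implementation, and the helper name `to_unicode` is hypothetical:

```python
def to_unicode(hostname: str) -> str:
    """Decode a punycode (IDNA-encoded) hostname back to Unicode."""
    try:
        # Python's "idna" codec applies ToUnicode label by label
        return hostname.encode("ascii").decode("idna")
    except UnicodeError:
        # not ASCII or not valid IDNA: return the name unchanged
        return hostname

print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```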

### 0.8.3

- fixed bug in domain name extraction
2 changes: 2 additions & 0 deletions README.rst
@@ -254,6 +254,7 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- ``add_urls(urls=[], appendleft=None, visited=False)``: Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
- ``dump_urls()``: Return a list of all known URLs.
- ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
- ``print_unvisited_urls()``: Print all unvisited URLs in store.
- ``get_known_domains()``: Return all known domains as a list.
- ``total_url_number()``: Return the total number of URLs in store.
- ``is_known(url)``: Check if the given URL has already been stored.
@@ -265,6 +266,7 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- Crawling and downloads
- ``get_url(domain)``: Retrieve a single URL and consider it to be visited (with corresponding timestamp).
- ``get_rules(domain)``: Return the stored crawling rules for the given website.
- ``get_crawl_delay()``: Return the delay as extracted from robots.txt, or a given default.
- ``get_download_urls(timelimit=10)``: Get a list of immediately downloadable URLs according to the given time limit per domain.
- ``establish_download_schedule(max_urls=100, time_limit=10)``: Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
- ``download_threshold_reached(threshold)``: Find out if the download limit (in seconds) has been reached for one of the websites in store.
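As a rough sketch of how such a store fits together, here is a stdlib-only toy model; the method names follow the list above, but the implementation is illustrative and far simpler than courlan's actual ``UrlStore`` (the class name ``MiniUrlStore`` is made up):

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class MiniUrlStore:
    """Toy domain-classified URL store (illustration only)."""

    def __init__(self):
        self.urls = {}   # domain -> deque of [url, visited] pairs
        self.rules = {}  # domain -> RobotFileParser

    def add_urls(self, urls):
        for url in urls:
            domain = urlsplit(url).netloc
            self.urls.setdefault(domain, deque()).append([url, False])

    def get_url(self, domain):
        # retrieve one URL and consider it visited
        for entry in self.urls.get(domain, ()):
            if not entry[1]:
                entry[1] = True
                return entry[0]
        return None

    def unvisited_urls(self, domain):
        return [url for url, visited in self.urls.get(domain, ()) if not visited]

    def get_crawl_delay(self, domain, default=5):
        # delay from stored robots.txt rules, or a given default
        rules = self.rules.get(domain)
        delay = rules.crawl_delay("*") if rules else None
        return delay if delay is not None else default

store = MiniUrlStore()
store.add_urls(["https://example.org/a", "https://example.org/b"])
rules = RobotFileParser()
rules.parse("User-agent: *\nCrawl-delay: 2".splitlines())
store.rules["example.org"] = rules
print(store.get_url("example.org"))          # https://example.org/a
print(store.get_crawl_delay("example.org"))  # 2
```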
4 changes: 2 additions & 2 deletions courlan/__init__.py
@@ -7,8 +7,8 @@
__title__ = "courlan"
__author__ = "Adrien Barbaresi"
__license__ = "GNU GPL v3+"
-__copyright__ = "Copyright 2020-2022, Adrien Barbaresi"
-__version__ = "0.8.3"
+__copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
+__version__ = "0.9.0"


# imports
2 changes: 1 addition & 1 deletion courlan/clean.py
@@ -9,7 +9,7 @@
import re

from typing import Optional, Union
-from urllib.parse import parse_qs, urlencode, urlparse, ParseResult
+from urllib.parse import parse_qs, urlencode, ParseResult

from .filters import validate_url
from .settings import ALLOWED_PARAMS, CONTROL_PARAMS, TARGET_LANG_DE, TARGET_LANG_EN
6 changes: 4 additions & 2 deletions courlan/langinfo.py
@@ -2,8 +2,10 @@
Constants containing info about languages and countries.
"""

+from typing import Set

-LANGUAGE_CODES = {
+LANGUAGE_CODES: Set[str] = {
"aa",
"ab",
"ae",
@@ -191,7 +193,7 @@
}


-COUNTRY_CODES = {
+COUNTRY_CODES: Set[str] = {
"aw",
"af",
"ao",
6 changes: 4 additions & 2 deletions setup.py
@@ -12,7 +12,7 @@

def get_version(package):
    "Return package version as listed in `__version__` in `__init__.py`"
-    initfile = Path(package, '__init__.py').read_text()  # Python >= 3.5
+    initfile = Path(package, "__init__.py").read_text(encoding="utf-8")
    return re.search("__version__ = ['\"]([^'\"]+)['\"]", initfile)[1]
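The version-extraction pattern used by this helper can be checked in isolation; the sample text below is made up for illustration:

```python
import re

# same pattern as in get_version(), applied to sample file contents
sample = '__version__ = "0.9.0"\n'
version = re.search("__version__ = ['\"]([^'\"]+)['\"]", sample)[1]
print(version)  # 0.9.0
```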


@@ -38,6 +38,7 @@ def get_long_description():
"courlan/clean.py",
"courlan/core.py",
"courlan/filters.py",
"courlan/langinfo.py",
"courlan/settings.py",
"courlan/urlstore.py",
"courlan/urlutils.py",
@@ -105,7 +106,8 @@ def get_long_description():
python_requires=">=3.6",
install_requires=[
"langcodes >= 3.3.0",
-"tld >= 0.12.6",
+"tld == 0.12.6; python_version < '3.7'",
+"tld >= 0.13; python_version >= '3.7'",
"urllib3 >= 1.26, < 2",
],
# extras_require=extras,
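The split `tld` requirement relies on PEP 508 environment markers, which pip evaluates at install time; the runtime equivalent looks roughly like this (illustrative only):

```python
import sys

# what the markers "python_version < '3.7'" / "python_version >= '3.7'" select
if sys.version_info < (3, 7):
    required_tld = "tld == 0.12.6"  # older pin kept for legacy interpreters
else:
    required_tld = "tld >= 0.13"
print(required_tld)
```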
