diff --git a/.github/ISSUE_TEMPLATE/01-bug_report.yml b/.github/ISSUE_TEMPLATE/01-bug_report.yml index 6a34895..0340da1 100644 --- a/.github/ISSUE_TEMPLATE/01-bug_report.yml +++ b/.github/ISSUE_TEMPLATE/01-bug_report.yml @@ -65,7 +65,7 @@ body: - type: textarea attributes: - label: "Actual behavior (Remember to use `debug` parameter)" + label: "Actual behavior" validations: required: true diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 473435b..0376d52 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -17,10 +17,6 @@ jobs: fail-fast: false matrix: include: - - python-version: "3.8" - os: ubuntu-latest - env: - TOXENV: py - python-version: "3.9" os: ubuntu-latest env: diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 9e3cf04..4b90529 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -16,4 +16,4 @@ repos: rev: v1.6.0 hooks: - id: vermin - args: ['-t=3.8-', '--violations', '--eval-annotations', '--no-tips'] + args: ['-t=3.9-', '--violations', '--eval-annotations', '--no-tips'] diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 2e3b3b2..8adf4d1 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -19,7 +19,11 @@ tests/test_parser_functions.py ................ [100%] =============================== 16 passed in 0.22s ================================ ``` -Also, consider setting `debug` to `True` while initializing the Adaptor object so it's easier to know what's happening in the background. +Also, consider setting the scrapling logging level to `debug` so it's easier to know what's happening in the background. +```python +>>> import logging +>>> logging.getLogger("scrapling").setLevel(logging.DEBUG) +``` ### The process is straight-forward. diff --git a/README.md b/README.md index 9f7a86e..92c76b2 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity. ```python ->> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher +>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher # Fetch websites' source under the radar! >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) >> print(page.status) @@ -35,7 +35,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha ## Table of content * [Key Features](#key-features) - * [Fetch websites as you prefer](#fetch-websites-as-you-prefer) + * [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support) * [Adaptive Scraping](#adaptive-scraping) * [Performance](#performance) * [Developing Experience](#developing-experience) @@ -76,7 +76,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha ## Key Features -### Fetch websites as you prefer +### Fetch websites as you prefer with async support - **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher` - **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` with default configuration! 
- **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`! @@ -167,7 +167,7 @@ Scrapling can find elements with more methods and it returns full element `Adapt > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons. ## Installation -Scrapling is a breeze to get started with - Starting from version 0.2, we require at least Python 3.8 to work. +Scrapling is a breeze to get started with - Starting from version 0.2.9, we require at least Python 3.9 to work. ```bash pip3 install scrapling ``` @@ -219,11 +219,11 @@ You might be slightly confused by now so let me clear things up. All fetcher-typ ```python from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher ``` -All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class. +All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class. If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code: ```python -from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher +from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher ``` then use it right away without initializing like: ```python @@ -236,21 +236,32 @@ Also, the `Response` object returned from all fetchers is the same as the `Adapt ### Fetcher This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests. -For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default. +For all methods, you have `stealthy_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default. You can also set the number of retries with the argument `retries` for all methods and this will make httpx retry requests if it failed for any reason. The default number of retries for all `Fetcher` methods is 3. 
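For example, a minimal sketch of tuning both options per request (the values here are purely illustrative):
```python
>> page = Fetcher().get('https://httpbin.org/get', retries=5)  # retry up to 5 times instead of the default 3
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=False)  # opt out of the generated browser headers
```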
You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
```python
->> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
+>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
+For Async requests, you will just replace the import like below:
+```python
+>> from scrapling import AsyncFetcher
+>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
+>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
+>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
+>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
+```
### StealthyFetcher
This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
```python
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
>> page.status == 200
True
+>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
+>> page.status == 200
+True
```
> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
@@ -268,7 +279,8 @@ True
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
-| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
+| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check whether WebGL is enabled. | ✔️ |
+| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
| disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. 
| ✔️ | @@ -287,6 +299,9 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option >> page.css_first("#search a::attr(href)") 'https://github.com/D4Vinci/Scrapling' +>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch +>> page.css_first("#search a::attr(href)") +'https://github.com/D4Vinci/Scrapling' ``` > Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :) @@ -391,6 +406,9 @@ You can select elements by their text content in multiple ways, here's a full ex >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text +>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href` +'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html' + >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more [] @@ -804,7 +822,6 @@ This project includes code adapted from: ## Known Issues - In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone. -- Currently, Scrapling is not compatible with async/await. ---
Designed & crafted with ❤️ by Karim Shoair.

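Since the README changes above document `AsyncFetcher` and the new `async_fetch` variants, here is a minimal, self-contained sketch of driving them from a plain script (illustrative only; it assumes the public imports shown in the README):
```python
import asyncio

from scrapling import AsyncFetcher, StealthyFetcher


async def main():
    # Plain async HTTP request, mirroring the AsyncFetcher examples above
    page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True)
    print(page.status)

    # Async version of the stealthy Camoufox-based fetch
    page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')
    print(page.status)


asyncio.run(main())
```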
diff --git a/benchmarks.py b/benchmarks.py index de647d6..28af680 100644 --- a/benchmarks.py +++ b/benchmarks.py @@ -64,9 +64,9 @@ def test_pyquery(): @benchmark def test_scrapling(): # No need to do `.extract()` like parsel to extract text - # Also, this is faster than `[t.text for t in Adaptor(large_html, auto_match=False, debug=False).css('.item')]` + # Also, this is faster than `[t.text for t in Adaptor(large_html, auto_match=False).css('.item')]` # for obvious reasons, of course. - return Adaptor(large_html, auto_match=False, debug=False).css('.item::text') + return Adaptor(large_html, auto_match=False).css('.item::text') @benchmark @@ -103,7 +103,7 @@ def test_scrapling_text(request_html): # Will loop over resulted elements to get text too to make comparison even more fair otherwise Scrapling will be even faster return [ element.text for element in Adaptor( - request_html, auto_match=False, debug=False + request_html, auto_match=False ).find_by_text('Tipping the Velvet', first_match=True).find_similar(ignore_attributes=['title']) ] diff --git a/images/CapSolver.png b/images/CapSolver.png deleted file mode 100644 index 6f5b3bd..0000000 Binary files a/images/CapSolver.png and /dev/null differ diff --git a/pytest.ini b/pytest.ini index df7eb7e..11c2331 100644 --- a/pytest.ini +++ b/pytest.ini @@ -1,2 +1,4 @@ [pytest] +asyncio_mode = auto +asyncio_default_fixture_loop_scope = function addopts = -p no:warnings --doctest-modules --ignore=setup.py --verbose \ No newline at end of file diff --git a/scrapling/__init__.py b/scrapling/__init__.py index 9240821..b8c7396 100644 --- a/scrapling/__init__.py +++ b/scrapling/__init__.py @@ -1,12 +1,12 @@ # Declare top-level shortcuts from scrapling.core.custom_types import AttributesHandler, TextHandler -from scrapling.fetchers import (CustomFetcher, Fetcher, PlayWrightFetcher, - StealthyFetcher) +from scrapling.fetchers import (AsyncFetcher, CustomFetcher, Fetcher, + PlayWrightFetcher, StealthyFetcher) from scrapling.parser import Adaptor, Adaptors __author__ = "Karim Shoair (karim.shoair@pm.me)" -__version__ = "0.2.8" +__version__ = "0.2.9" __copyright__ = "Copyright (c) 2024 Karim Shoair" -__all__ = ['Adaptor', 'Fetcher', 'StealthyFetcher', 'PlayWrightFetcher'] +__all__ = ['Adaptor', 'Fetcher', 'AsyncFetcher', 'StealthyFetcher', 'PlayWrightFetcher'] diff --git a/scrapling/core/custom_types.py b/scrapling/core/custom_types.py index b8cb44f..0419406 100644 --- a/scrapling/core/custom_types.py +++ b/scrapling/core/custom_types.py @@ -14,11 +14,70 @@ class TextHandler(str): __slots__ = () def __new__(cls, string): - # Because str is immutable and we can't override __init__ - if type(string) is str: + if isinstance(string, str): return super().__new__(cls, string) - else: - return super().__new__(cls, '') + return super().__new__(cls, '') + + # Make methods from original `str` class return `TextHandler` instead of returning `str` again + # Of course, this stupid workaround is only so we can keep the auto-completion working without issues in your IDE + # and I made sonnet write it for me :) + def strip(self, chars=None): + return TextHandler(super().strip(chars)) + + def lstrip(self, chars=None): + return TextHandler(super().lstrip(chars)) + + def rstrip(self, chars=None): + return TextHandler(super().rstrip(chars)) + + def capitalize(self): + return TextHandler(super().capitalize()) + + def casefold(self): + return TextHandler(super().casefold()) + + def center(self, width, fillchar=' '): + return TextHandler(super().center(width, fillchar)) + + def 
expandtabs(self, tabsize=8): + return TextHandler(super().expandtabs(tabsize)) + + def format(self, *args, **kwargs): + return TextHandler(super().format(*args, **kwargs)) + + def format_map(self, mapping): + return TextHandler(super().format_map(mapping)) + + def join(self, iterable): + return TextHandler(super().join(iterable)) + + def ljust(self, width, fillchar=' '): + return TextHandler(super().ljust(width, fillchar)) + + def rjust(self, width, fillchar=' '): + return TextHandler(super().rjust(width, fillchar)) + + def swapcase(self): + return TextHandler(super().swapcase()) + + def title(self): + return TextHandler(super().title()) + + def translate(self, table): + return TextHandler(super().translate(table)) + + def zfill(self, width): + return TextHandler(super().zfill(width)) + + def replace(self, old, new, count=-1): + return TextHandler(super().replace(old, new, count)) + + def upper(self): + return TextHandler(super().upper()) + + def lower(self): + return TextHandler(super().lower()) + ############## def sort(self, reverse: bool = False) -> str: """Return a sorted version of the string""" @@ -30,11 +89,21 @@ def clean(self) -> str: data = re.sub(' +', ' ', data) return self.__class__(data.strip()) + # For easy copy-paste from Scrapy/parsel code when needed :) + def get(self, default=None): + return self + + def get_all(self): + return self + + extract = get_all + extract_first = get + def json(self) -> Dict: """Return json response if the response is jsonable otherwise throw error""" - # Using __str__ function as a workaround for orjson issue with subclasses of str + # Using str function as a workaround for orjson issue with subclasses of str # Check this out: https://github.com/ijl/orjson/issues/445 - return loads(self.__str__()) + return loads(str(self)) def re( self, regex: Union[str, Pattern[str]], replace_entities: bool = True, clean_match: bool = False, @@ -127,6 +196,19 @@ def re_first(self, regex: Union[str, Pattern[str]], default=None, replace_entiti return result return default + # For easy copy-paste from Scrapy/parsel code when needed :) + def get(self, default=None): + """Returns the first item of the current list + :param default: the default value to return if the current list is empty + """ + return self[0] if len(self) > 0 else default + + def extract(self): + return self + + extract_first = get + get_all = extract + class AttributesHandler(Mapping): """A read-only mapping to use instead of the standard dictionary for the speed boost but at the same time I use it to add more functionalities. 
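To make the intent of the new parsel-style shims on `TextHandler`/`TextHandlers` concrete, a small illustrative sketch (assuming an `Adaptor` built from an inline HTML snippet):
```python
from scrapling import Adaptor

page = Adaptor('<ul><li>a</li><li>b</li></ul>', auto_match=False)
items = page.css('li::text')   # a TextHandlers list of TextHandler strings

items.get()            # 'a'        -- first item, like parsel's .get()
items.extract_first()  # 'a'        -- alias of get()
items.extract()        # ['a', 'b'] -- the list itself
items.get_all()        # ['a', 'b'] -- alias of extract()

# On a single TextHandler, get()/extract() simply return the string itself,
# so code copied over from Scrapy/parsel keeps working unchanged.
items.get().get()      # 'a'
```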
diff --git a/scrapling/core/storage_adaptors.py b/scrapling/core/storage_adaptors.py index 983e863..c614643 100644 --- a/scrapling/core/storage_adaptors.py +++ b/scrapling/core/storage_adaptors.py @@ -1,4 +1,3 @@ -import logging import sqlite3 import threading from abc import ABC, abstractmethod @@ -9,7 +8,7 @@ from tldextract import extract as tld from scrapling.core._types import Dict, Optional, Union -from scrapling.core.utils import _StorageTools, cache +from scrapling.core.utils import _StorageTools, log, lru_cache class StorageSystemMixin(ABC): @@ -20,7 +19,7 @@ def __init__(self, url: Union[str, None] = None): """ self.url = url - @cache(None, typed=True) + @lru_cache(None, typed=True) def _get_base_url(self, default_value: str = 'default') -> str: if not self.url or type(self.url) is not str: return default_value @@ -52,7 +51,7 @@ def retrieve(self, identifier: str) -> Optional[Dict]: raise NotImplementedError('Storage system must implement `save` method') @staticmethod - @cache(None, typed=True) + @lru_cache(None, typed=True) def _get_hash(identifier: str) -> str: """If you want to hash identifier in your storage system, use this safer""" identifier = identifier.lower().strip() @@ -64,7 +63,7 @@ def _get_hash(identifier: str) -> str: return f"{hash_value}_{len(identifier)}" # Length to reduce collision chance -@cache(None, typed=True) +@lru_cache(None, typed=True) class SQLiteStorageSystem(StorageSystemMixin): """The recommended system to use, it's race condition safe and thread safe. Mainly built so the library can run in threaded frameworks like scrapy or threaded tools @@ -86,7 +85,7 @@ def __init__(self, storage_file: str, url: Union[str, None] = None): self.connection.execute("PRAGMA journal_mode=WAL") self.cursor = self.connection.cursor() self._setup_database() - logging.debug( + log.debug( f'Storage system loaded with arguments (storage_file="{storage_file}", url="{url}")' ) diff --git a/scrapling/core/translator.py b/scrapling/core/translator.py index aa6211e..263a24a 100644 --- a/scrapling/core/translator.py +++ b/scrapling/core/translator.py @@ -17,7 +17,7 @@ from w3lib.html import HTML5_WHITESPACE from scrapling.core._types import Any, Optional, Protocol, Self -from scrapling.core.utils import cache +from scrapling.core.utils import lru_cache regex = f"[{HTML5_WHITESPACE}]+" replace_html5_whitespaces = re.compile(regex).sub @@ -139,6 +139,6 @@ def xpath_text_simple_pseudo_element(xpath: OriginalXPathExpr) -> XPathExpr: class HTMLTranslator(TranslatorMixin, OriginalHTMLTranslator): - @cache(maxsize=256) + @lru_cache(maxsize=256) def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: return super().css_to_xpath(css, prefix) diff --git a/scrapling/core/utils.py b/scrapling/core/utils.py index 35f8d0a..27cb884 100644 --- a/scrapling/core/utils.py +++ b/scrapling/core/utils.py @@ -9,17 +9,36 @@ # Using cache on top of a class is brilliant way to achieve Singleton design pattern without much code # functools.cache is available on Python 3.9+ only so let's keep lru_cache -from functools import lru_cache as cache # isort:skip - +from functools import lru_cache # isort:skip html_forbidden = {html.HtmlComment, } -logging.basicConfig( - level=logging.ERROR, - format='%(asctime)s - %(levelname)s - %(message)s', - handlers=[ - logging.StreamHandler() - ] -) + + +@lru_cache(1, typed=True) +def setup_logger(): + """Create and configure a logger with a standard format. 
+ + :returns: logging.Logger: Configured logger instance + """ + logger = logging.getLogger('scrapling') + logger.setLevel(logging.INFO) + + formatter = logging.Formatter( + fmt="[%(asctime)s] %(levelname)s: %(message)s", + datefmt="%Y-%m-%d %H:%M:%S" + ) + + console_handler = logging.StreamHandler() + console_handler.setFormatter(formatter) + + # Add handler to logger (if not already added) + if not logger.handlers: + logger.addHandler(console_handler) + + return logger + + +log = setup_logger() def is_jsonable(content: Union[bytes, str]) -> bool: @@ -33,23 +52,6 @@ def is_jsonable(content: Union[bytes, str]) -> bool: return False -@cache(None, typed=True) -def setup_basic_logging(level: str = 'debug'): - levels = { - 'debug': logging.DEBUG, - 'info': logging.INFO, - 'warning': logging.WARNING, - 'error': logging.ERROR, - 'critical': logging.CRITICAL - } - formatter = logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s", "%Y-%m-%d %H:%M:%S") - lvl = levels[level.lower()] - handler = logging.StreamHandler() - handler.setFormatter(formatter) - # Configure the root logger - logging.basicConfig(level=lvl, handlers=[handler]) - - def flatten(lst: Iterable): return list(chain.from_iterable(lst)) @@ -113,7 +115,7 @@ def _get_element_path(cls, element: html.HtmlElement): # return _impl -@cache(None, typed=True) +@lru_cache(None, typed=True) def clean_spaces(string): string = string.replace('\t', ' ') string = re.sub('[\n|\r]', '', string) diff --git a/scrapling/defaults.py b/scrapling/defaults.py index 73618a4..64fd1b7 100644 --- a/scrapling/defaults.py +++ b/scrapling/defaults.py @@ -1,6 +1,7 @@ -from .fetchers import Fetcher, PlayWrightFetcher, StealthyFetcher +from .fetchers import AsyncFetcher, Fetcher, PlayWrightFetcher, StealthyFetcher # If you are going to use Fetchers with the default settings, import them from this file instead for a cleaner looking code Fetcher = Fetcher() +AsyncFetcher = AsyncFetcher() StealthyFetcher = StealthyFetcher() PlayWrightFetcher = PlayWrightFetcher() diff --git a/scrapling/engines/camo.py b/scrapling/engines/camo.py index 2741206..1eb9976 100644 --- a/scrapling/engines/camo.py +++ b/scrapling/engines/camo.py @@ -1,13 +1,14 @@ -import logging - from camoufox import DefaultAddons +from camoufox.async_api import AsyncCamoufox from camoufox.sync_api import Camoufox from scrapling.core._types import (Callable, Dict, List, Literal, Optional, Union) +from scrapling.core.utils import log from scrapling.engines.toolbelt import (Response, StatusText, + async_intercept_route, check_type_validity, - construct_proxy_dict, do_nothing, + construct_proxy_dict, generate_convincing_referer, get_os_name, intercept_route) @@ -15,10 +16,11 @@ class CamoufoxEngine: def __init__( self, headless: Optional[Union[bool, Literal['virtual']]] = True, block_images: Optional[bool] = False, disable_resources: Optional[bool] = False, - block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True, - timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None, + block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = True, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True, + timeout: Optional[float] = 30000, page_action: Callable = None, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None, wait_selector_state: str = 'attached', google_search: 
Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None, proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True, + geoip: Optional[bool] = False, adaptor_arguments: Dict = None, ): """An engine that utilizes Camoufox library, check the `StealthyFetcher` class for more documentation. @@ -32,13 +34,15 @@ def __init__( :param block_webrtc: Blocks WebRTC entirely. :param addons: List of Firefox addons to use. Must be paths to extracted addons. :param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. - :param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases. + :param allow_webgl: Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. :param network_idle: Wait for the page until there are no network connections for at least 500 ms. :param disable_ads: Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000 :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. :param wait_selector: Wait for a specific css selector to be in a specific state. + :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. + It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`. :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. :param extra_headers: A dictionary of extra headers to add to the request. 
_The referer set by the `google_search` argument takes priority over the referer set here if used together._ @@ -54,16 +58,20 @@ def __init__( self.google_search = bool(google_search) self.os_randomize = bool(os_randomize) self.disable_ads = bool(disable_ads) + self.geoip = bool(geoip) self.extra_headers = extra_headers or {} self.proxy = construct_proxy_dict(proxy) self.addons = addons or [] self.humanize = humanize self.timeout = check_type_validity(timeout, [int, float], 30000) - if callable(page_action): - self.page_action = page_action + if page_action is not None: + if callable(page_action): + self.page_action = page_action + else: + self.page_action = None + log.error('[Ignored] Argument "page_action" must be callable') else: - self.page_action = do_nothing - logging.error('[Ignored] Argument "page_action" must be callable') + self.page_action = None self.wait_selector = wait_selector self.wait_selector_state = wait_selector_state @@ -77,6 +85,7 @@ def fetch(self, url: str) -> Response: """ addons = [] if self.disable_ads else [DefaultAddons.UBO] with Camoufox( + geoip=self.geoip, proxy=self.proxy, addons=self.addons, exclude_addons=addons, @@ -102,7 +111,8 @@ def fetch(self, url: str) -> Response: if self.network_idle: page.wait_for_load_state('networkidle') - page = self.page_action(page) + if self.page_action is not None: + page = self.page_action(page) if self.wait_selector and type(self.wait_selector) is str: waiter = page.locator(self.wait_selector) @@ -115,11 +125,8 @@ def fetch(self, url: str) -> Response: # This will be parsed inside `Response` encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding - - status_text = res.status_text # PlayWright API sometimes give empty status text for some reason! - if not status_text: - status_text = StatusText.get(res.status) + status_text = res.status_text or StatusText.get(res.status) response = Response( url=res.url, @@ -136,3 +143,70 @@ def fetch(self, url: str) -> Response: page.close() return response + + async def async_fetch(self, url: str) -> Response: + """Opens up the browser and do your request based on your chosen options. + + :param url: Target url. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + addons = [] if self.disable_ads else [DefaultAddons.UBO] + async with AsyncCamoufox( + geoip=self.geoip, + proxy=self.proxy, + addons=self.addons, + exclude_addons=addons, + headless=self.headless, + humanize=self.humanize, + i_know_what_im_doing=True, # To turn warnings off with the user configurations + allow_webgl=self.allow_webgl, + block_webrtc=self.block_webrtc, + block_images=self.block_images, # Careful! 
it makes some websites doesn't finish loading at all like stackoverflow even in headful + os=None if self.os_randomize else get_os_name(), + ) as browser: + page = await browser.new_page() + page.set_default_navigation_timeout(self.timeout) + page.set_default_timeout(self.timeout) + if self.disable_resources: + await page.route("**/*", async_intercept_route) + + if self.extra_headers: + await page.set_extra_http_headers(self.extra_headers) + + res = await page.goto(url, referer=generate_convincing_referer(url) if self.google_search else None) + await page.wait_for_load_state(state="domcontentloaded") + if self.network_idle: + await page.wait_for_load_state('networkidle') + + if self.page_action is not None: + page = await self.page_action(page) + + if self.wait_selector and type(self.wait_selector) is str: + waiter = page.locator(self.wait_selector) + await waiter.first.wait_for(state=self.wait_selector_state) + # Wait again after waiting for the selector, helpful with protections like Cloudflare + await page.wait_for_load_state(state="load") + await page.wait_for_load_state(state="domcontentloaded") + if self.network_idle: + await page.wait_for_load_state('networkidle') + + # This will be parsed inside `Response` + encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding + # PlayWright API sometimes give empty status text for some reason! + status_text = res.status_text or StatusText.get(res.status) + + response = Response( + url=res.url, + text=await page.content(), + body=(await page.content()).encode('utf-8'), + status=res.status, + reason=status_text, + encoding=encoding, + cookies={cookie['name']: cookie['value'] for cookie in await page.context.cookies()}, + headers=await res.all_headers(), + request_headers=await res.request.all_headers(), + **self.adaptor_arguments + ) + await page.close() + + return response diff --git a/scrapling/engines/constants.py b/scrapling/engines/constants.py index 926e238..e26c460 100644 --- a/scrapling/engines/constants.py +++ b/scrapling/engines/constants.py @@ -1,5 +1,5 @@ # Disable loading these resources for speed -DEFAULT_DISABLED_RESOURCES = [ +DEFAULT_DISABLED_RESOURCES = { 'font', 'image', 'media', @@ -10,9 +10,9 @@ 'websocket', 'csp_report', 'stylesheet', -] +} -DEFAULT_STEALTH_FLAGS = [ +DEFAULT_STEALTH_FLAGS = ( # Explanation: https://peter.sh/experiments/chromium-command-line-switches/ # Generally this will make the browser faster and less detectable '--no-pings', @@ -87,7 +87,7 @@ '--enable-features=NetworkService,NetworkServiceInProcess,TrustTokens,TrustTokensAlwaysAllowIssuance', '--blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4', '--disable-features=AudioServiceOutOfProcess,IsolateOrigins,site-per-process,TranslateUI,BlinkGenPropertyTrees', -] +) # Defaulting to the docker mode, token doesn't matter in it as it's passed for the container NSTBROWSER_DEFAULT_QUERY = { diff --git a/scrapling/engines/pw.py b/scrapling/engines/pw.py index 7d15174..e4c80b7 100644 --- a/scrapling/engines/pw.py +++ b/scrapling/engines/pw.py @@ -1,12 +1,13 @@ import json -import logging -from scrapling.core._types import Callable, Dict, List, Optional, Union +from scrapling.core._types import Callable, Dict, Optional, Union +from scrapling.core.utils import log, lru_cache from scrapling.engines.constants import (DEFAULT_STEALTH_FLAGS, NSTBROWSER_DEFAULT_QUERY) from scrapling.engines.toolbelt import (Response, StatusText, + async_intercept_route, check_type_validity, construct_cdp_url, - 
construct_proxy_dict, do_nothing, + construct_proxy_dict, generate_convincing_referer, generate_headers, intercept_route, js_bypass_path) @@ -19,7 +20,7 @@ def __init__( useragent: Optional[str] = None, network_idle: Optional[bool] = False, timeout: Optional[float] = 30000, - page_action: Callable = do_nothing, + page_action: Callable = None, wait_selector: Optional[str] = None, locale: Optional[str] = 'en-US', wait_selector_state: Optional[str] = 'attached', @@ -74,11 +75,14 @@ def __init__( self.cdp_url = cdp_url self.useragent = useragent self.timeout = check_type_validity(timeout, [int, float], 30000) - if callable(page_action): - self.page_action = page_action + if page_action is not None: + if callable(page_action): + self.page_action = page_action + else: + self.page_action = None + log.error('[Ignored] Argument "page_action" must be callable') else: - self.page_action = do_nothing - logging.error('[Ignored] Argument "page_action" must be callable') + self.page_action = None self.wait_selector = wait_selector self.wait_selector_state = wait_selector_state @@ -94,10 +98,8 @@ def __init__( # '--disable-extensions', ] - def _cdp_url_logic(self, flags: Optional[List] = None) -> str: + def _cdp_url_logic(self) -> str: """Constructs new CDP URL if NSTBrowser is enabled otherwise return CDP URL as it is - - :param flags: Chrome flags to be added to NSTBrowser query :return: CDP URL """ cdp_url = self.cdp_url @@ -106,7 +108,8 @@ def _cdp_url_logic(self, flags: Optional[List] = None) -> str: config = self.nstbrowser_config else: query = NSTBROWSER_DEFAULT_QUERY.copy() - if flags: + if self.stealth: + flags = self.__set_flags() query.update({ "args": dict(zip(flags, [''] * len(flags))), # browser args should be a dictionary }) @@ -122,6 +125,68 @@ def _cdp_url_logic(self, flags: Optional[List] = None) -> str: return cdp_url + @lru_cache(typed=True) + def __set_flags(self): + """Returns the flags that will be used while launching the browser if stealth mode is enabled""" + flags = DEFAULT_STEALTH_FLAGS + if self.hide_canvas: + flags += ('--fingerprinting-canvas-image-data-noise',) + if self.disable_webgl: + flags += ('--disable-webgl', '--disable-webgl-image-chromium', '--disable-webgl2',) + + return flags + + def __launch_kwargs(self): + """Creates the arguments we will use while launching playwright's browser""" + launch_kwargs = {'headless': self.headless, 'ignore_default_args': self.harmful_default_args, 'channel': 'chrome' if self.real_chrome else 'chromium'} + if self.stealth: + launch_kwargs.update({'args': self.__set_flags(), 'chromium_sandbox': True}) + + return launch_kwargs + + def __context_kwargs(self): + """Creates the arguments for the browser context""" + context_kwargs = { + "proxy": self.proxy, + "locale": self.locale, + "color_scheme": 'dark', # Bypasses the 'prefersLightColor' check in creepjs + "device_scale_factor": 2, + "extra_http_headers": self.extra_headers if self.extra_headers else {}, + "user_agent": self.useragent if self.useragent else generate_headers(browser_mode=True).get('User-Agent'), + } + if self.stealth: + context_kwargs.update({ + 'is_mobile': False, + 'has_touch': False, + # I'm thinking about disabling it to rest from all Service Workers headache but let's keep it as it is for now + 'service_workers': 'allow', + 'ignore_https_errors': True, + 'screen': {'width': 1920, 'height': 1080}, + 'viewport': {'width': 1920, 'height': 1080}, + 'permissions': ['geolocation', 'notifications'] + }) + + return context_kwargs + + @lru_cache() + def 
__stealth_scripts(self): + # Basic bypasses nothing fancy as I'm still working on it + # But with adding these bypasses to the above config, it bypasses many online tests like + # https://bot.sannysoft.com/ + # https://kaliiiiiiiiii.github.io/brotector/ + # https://pixelscan.net/ + # https://iphey.com/ + # https://www.browserscan.net/bot-detection <== this one also checks for the CDP runtime fingerprint + # https://arh.antoinevastel.com/bots/areyouheadless/ + # https://prescience-data.github.io/execution-monitor.html + return tuple( + js_bypass_path(script) for script in ( + # Order is important + 'webdriver_fully.js', 'window_chrome.js', 'navigator_plugins.js', 'pdf_viewer.js', + 'notification_permission.js', 'screen_props.js', 'playwright_fingerprint.js' + ) + ) + def fetch(self, url: str) -> Response: """Opens up the browser and do your request based on your chosen options. @@ -135,61 +200,14 @@ def fetch(self, url: str) -> Response: from rebrowser_playwright.sync_api import sync_playwright with sync_playwright() as p: - # Handle the UserAgent early - if self.useragent: - extra_headers = {} - useragent = self.useragent - else: - extra_headers = {} - useragent = generate_headers(browser_mode=True).get('User-Agent') - - # Prepare the flags before diving - flags = DEFAULT_STEALTH_FLAGS - if self.hide_canvas: - flags += ['--fingerprinting-canvas-image-data-noise'] - if self.disable_webgl: - flags += ['--disable-webgl', '--disable-webgl-image-chromium', '--disable-webgl2'] - # Creating the browser if self.cdp_url: - cdp_url = self._cdp_url_logic(flags if self.stealth else None) + cdp_url = self._cdp_url_logic() browser = p.chromium.connect_over_cdp(endpoint_url=cdp_url) else: - if self.stealth: - browser = p.chromium.launch( - headless=self.headless, args=flags, ignore_default_args=self.harmful_default_args, chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium' - ) - else: - browser = p.chromium.launch(headless=self.headless, ignore_default_args=self.harmful_default_args, channel='chrome' if self.real_chrome else 'chromium') - - # Creating the context - if self.stealth: - context = browser.new_context( - locale=self.locale, - is_mobile=False, - has_touch=False, - proxy=self.proxy, - color_scheme='dark', # Bypasses the 'prefersLightColor' check in creepjs - user_agent=useragent, - device_scale_factor=2, - # I'm thinking about disabling it to rest from all Service Workers headache but let's keep it as it is for now - service_workers="allow", - ignore_https_errors=True, - extra_http_headers=extra_headers, - screen={"width": 1920, "height": 1080}, - viewport={"width": 1920, "height": 1080}, - permissions=["geolocation", 'notifications'], - ) - else: - context = browser.new_context( - locale=self.locale, - proxy=self.proxy, - color_scheme='dark', - user_agent=useragent, - device_scale_factor=2, - extra_http_headers=extra_headers - ) + browser = p.chromium.launch(**self.__launch_kwargs()) + context = browser.new_context(**self.__context_kwargs()) # Finally we are in business page = context.new_page() page.set_default_navigation_timeout(self.timeout) @@ -202,29 +220,16 @@ def fetch(self, url: str) -> Response: page.route("**/*", intercept_route) if self.stealth: - # Basic bypasses nothing fancy as I'm still working on it - # But with adding these bypasses to the above config, it bypasses many online tests like - # https://bot.sannysoft.com/ - # https://kaliiiiiiiiii.github.io/brotector/ - # https://pixelscan.net/ - # https://iphey.com/ - # 
https://www.browserscan.net/bot-detection <== this one also checks for the CDP runtime fingerprint - # https://arh.antoinevastel.com/bots/areyouheadless/ - # https://prescience-data.github.io/execution-monitor.html - page.add_init_script(path=js_bypass_path('webdriver_fully.js')) - page.add_init_script(path=js_bypass_path('window_chrome.js')) - page.add_init_script(path=js_bypass_path('navigator_plugins.js')) - page.add_init_script(path=js_bypass_path('pdf_viewer.js')) - page.add_init_script(path=js_bypass_path('notification_permission.js')) - page.add_init_script(path=js_bypass_path('screen_props.js')) - page.add_init_script(path=js_bypass_path('playwright_fingerprint.js')) + for script in self.__stealth_scripts(): + page.add_init_script(path=script) res = page.goto(url, referer=generate_convincing_referer(url) if self.google_search else None) page.wait_for_load_state(state="domcontentloaded") if self.network_idle: page.wait_for_load_state('networkidle') - page = self.page_action(page) + if self.page_action is not None: + page = self.page_action(page) if self.wait_selector and type(self.wait_selector) is str: waiter = page.locator(self.wait_selector) @@ -237,11 +242,8 @@ def fetch(self, url: str) -> Response: # This will be parsed inside `Response` encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding - - status_text = res.status_text # PlayWright API sometimes give empty status text for some reason! - if not status_text: - status_text = StatusText.get(res.status) + status_text = res.status_text or StatusText.get(res.status) response = Response( url=res.url, @@ -257,3 +259,76 @@ def fetch(self, url: str) -> Response: ) page.close() return response + + async def async_fetch(self, url: str) -> Response: + """Async version of `fetch` + + :param url: Target url. 
+ :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + if not self.stealth or self.real_chrome: + # Because rebrowser_playwright doesn't play well with real browsers + from playwright.async_api import async_playwright + else: + from rebrowser_playwright.async_api import async_playwright + + async with async_playwright() as p: + # Creating the browser + if self.cdp_url: + cdp_url = self._cdp_url_logic() + browser = await p.chromium.connect_over_cdp(endpoint_url=cdp_url) + else: + browser = await p.chromium.launch(**self.__launch_kwargs()) + + context = await browser.new_context(**self.__context_kwargs()) + # Finally we are in business + page = await context.new_page() + page.set_default_navigation_timeout(self.timeout) + page.set_default_timeout(self.timeout) + + if self.extra_headers: + await page.set_extra_http_headers(self.extra_headers) + + if self.disable_resources: + await page.route("**/*", async_intercept_route) + + if self.stealth: + for script in self.__stealth_scripts(): + await page.add_init_script(path=script) + + res = await page.goto(url, referer=generate_convincing_referer(url) if self.google_search else None) + await page.wait_for_load_state(state="domcontentloaded") + if self.network_idle: + await page.wait_for_load_state('networkidle') + + if self.page_action is not None: + page = await self.page_action(page) + + if self.wait_selector and type(self.wait_selector) is str: + waiter = page.locator(self.wait_selector) + await waiter.first.wait_for(state=self.wait_selector_state) + # Wait again after waiting for the selector, helpful with protections like Cloudflare + await page.wait_for_load_state(state="load") + await page.wait_for_load_state(state="domcontentloaded") + if self.network_idle: + await page.wait_for_load_state('networkidle') + + # This will be parsed inside `Response` + encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding + # PlayWright API sometimes give empty status text for some reason! 
+ status_text = res.status_text or StatusText.get(res.status) + + response = Response( + url=res.url, + text=await page.content(), + body=(await page.content()).encode('utf-8'), + status=res.status, + reason=status_text, + encoding=encoding, + cookies={cookie['name']: cookie['value'] for cookie in await page.context.cookies()}, + headers=await res.all_headers(), + request_headers=await res.request.all_headers(), + **self.adaptor_arguments + ) + await page.close() + return response diff --git a/scrapling/engines/static.py b/scrapling/engines/static.py index a091c4f..9d5bed7 100644 --- a/scrapling/engines/static.py +++ b/scrapling/engines/static.py @@ -1,34 +1,44 @@ -import logging - import httpx from httpx._models import Response as httpxResponse -from scrapling.core._types import Dict, Optional, Union +from scrapling.core._types import Dict, Optional, Tuple, Union +from scrapling.core.utils import log, lru_cache from .toolbelt import Response, generate_convincing_referer, generate_headers +@lru_cache(typed=True) class StaticEngine: - def __init__(self, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = None, adaptor_arguments: Dict = None): + def __init__( + self, url: str, proxy: Optional[str] = None, stealthy_headers: Optional[bool] = True, follow_redirects: bool = True, + timeout: Optional[Union[int, float]] = None, retries: Optional[int] = 3, adaptor_arguments: Tuple = None + ): """An engine that utilizes httpx library, check the `Fetcher` class for more documentation. + :param url: Target url. + :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and + create a referer header as if this request had came from Google's search of this URL's domain. + :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` :param follow_redirects: As the name says -- if enabled (default), redirects will be followed. :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds. :param adaptor_arguments: The arguments that will be passed in the end while creating the final Adaptor's class. """ + self.url = url + self.proxy = proxy + self.stealth = stealthy_headers self.timeout = timeout self.follow_redirects = bool(follow_redirects) + self.retries = retries self._extra_headers = generate_headers(browser_mode=False) - self.adaptor_arguments = adaptor_arguments if adaptor_arguments else {} + # Because we are using `lru_cache` for a slight optimization but both dict/dict_items are not hashable so they can't be cached + # So my solution here was to convert it to tuple then convert it back to dictionary again here as tuples are hashable, ofc `tuple().__hash__()` + self.adaptor_arguments = dict(adaptor_arguments) if adaptor_arguments else {} - @staticmethod - def _headers_job(headers: Optional[Dict], url: str, stealth: bool) -> Dict: + def _headers_job(self, headers: Optional[Dict]) -> Dict: """Adds useragent to headers if it doesn't exist, generates real headers and append it to current headers, and finally generates a referer header that looks like if this request came from Google's search of the current URL's domain. :param headers: Current headers in the request if the user passed any - :param url: The Target URL. - :param stealth: Whether stealth mode is enabled or not. :return: A dictionary of the new headers. 
""" headers = headers or {} @@ -36,12 +46,12 @@ def _headers_job(headers: Optional[Dict], url: str, stealth: bool) -> Dict: # Validate headers if not headers.get('user-agent') and not headers.get('User-Agent'): headers['User-Agent'] = generate_headers(browser_mode=False).get('User-Agent') - logging.info(f"Can't find useragent in headers so '{headers['User-Agent']}' was used.") + log.debug(f"Can't find useragent in headers so '{headers['User-Agent']}' was used.") - if stealth: + if self.stealth: extra_headers = generate_headers(browser_mode=False) headers.update(extra_headers) - headers.update({'referer': generate_convincing_referer(url)}) + headers.update({'referer': generate_convincing_referer(self.url)}) return headers @@ -61,69 +71,102 @@ def _prepare_response(self, response: httpxResponse) -> Response: cookies=dict(response.cookies), headers=dict(response.headers), request_headers=dict(response.request.headers), + method=response.request.method, **self.adaptor_arguments ) - def get(self, url: str, proxy: Optional[str] = None, stealthy_headers: Optional[bool] = True, **kwargs: Dict) -> Response: + def get(self, **kwargs: Dict) -> Response: """Make basic HTTP GET request for you but with some added flavors. - :param url: Target url. - :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and - create a referer header as if this request had came from Google's search of this URL's domain. - :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` - :param kwargs: Any additional keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details. + :param kwargs: Any keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details. :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - headers = self._headers_job(kwargs.pop('headers', {}), url, stealthy_headers) - with httpx.Client(proxy=proxy) as client: - request = client.get(url=url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + headers = self._headers_job(kwargs.pop('headers', {})) + with httpx.Client(proxy=self.proxy, transport=httpx.HTTPTransport(retries=self.retries)) as client: + request = client.get(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) return self._prepare_response(request) - def post(self, url: str, proxy: Optional[str] = None, stealthy_headers: Optional[bool] = True, **kwargs: Dict) -> Response: + async def async_get(self, **kwargs: Dict) -> Response: + """Make basic async HTTP GET request for you but with some added flavors. + + :param kwargs: Any keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details. 
+ :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + headers = self._headers_job(kwargs.pop('headers', {})) + async with httpx.AsyncClient(proxy=self.proxy) as client: + request = await client.get(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + + return self._prepare_response(request) + + def post(self, **kwargs: Dict) -> Response: """Make basic HTTP POST request for you but with some added flavors. - :param url: Target url. - :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and - create a referer header as if this request had came from Google's search of this URL's domain. - :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` - :param kwargs: Any additional keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details. + :param kwargs: Any keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details. :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - headers = self._headers_job(kwargs.pop('headers', {}), url, stealthy_headers) - with httpx.Client(proxy=proxy) as client: - request = client.post(url=url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + headers = self._headers_job(kwargs.pop('headers', {})) + with httpx.Client(proxy=self.proxy, transport=httpx.HTTPTransport(retries=self.retries)) as client: + request = client.post(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) return self._prepare_response(request) - def delete(self, url: str, proxy: Optional[str] = None, stealthy_headers: Optional[bool] = True, **kwargs: Dict) -> Response: + async def async_post(self, **kwargs: Dict) -> Response: + """Make basic async HTTP POST request for you but with some added flavors. + + :param kwargs: Any keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + headers = self._headers_job(kwargs.pop('headers', {})) + async with httpx.AsyncClient(proxy=self.proxy) as client: + request = await client.post(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + + return self._prepare_response(request) + + def delete(self, **kwargs: Dict) -> Response: """Make basic HTTP DELETE request for you but with some added flavors. - :param url: Target url. - :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and - create a referer header as if this request had came from Google's search of this URL's domain. - :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` - :param kwargs: Any additional keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details. 
+ :param kwargs: Any keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + headers = self._headers_job(kwargs.pop('headers', {})) + with httpx.Client(proxy=self.proxy, transport=httpx.HTTPTransport(retries=self.retries)) as client: + request = client.delete(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + + return self._prepare_response(request) + + async def async_delete(self, **kwargs: Dict) -> Response: + """Make basic async HTTP DELETE request for you but with some added flavors. + + :param kwargs: Any keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details. :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - headers = self._headers_job(kwargs.pop('headers', {}), url, stealthy_headers) - with httpx.Client(proxy=proxy) as client: - request = client.delete(url=url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + headers = self._headers_job(kwargs.pop('headers', {})) + async with httpx.AsyncClient(proxy=self.proxy) as client: + request = await client.delete(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) return self._prepare_response(request) - def put(self, url: str, proxy: Optional[str] = None, stealthy_headers: Optional[bool] = True, **kwargs: Dict) -> Response: + def put(self, **kwargs: Dict) -> Response: """Make basic HTTP PUT request for you but with some added flavors. - :param url: Target url. - :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and - create a referer header as if this request had came from Google's search of this URL's domain. - :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` - :param kwargs: Any additional keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details. + :param kwargs: Any keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + headers = self._headers_job(kwargs.pop('headers', {})) + with httpx.Client(proxy=self.proxy, transport=httpx.HTTPTransport(retries=self.retries)) as client: + request = client.put(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + + return self._prepare_response(request) + + async def async_put(self, **kwargs: Dict) -> Response: + """Make basic async HTTP PUT request for you but with some added flavors. + + :param kwargs: Any keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details. 
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - headers = self._headers_job(kwargs.pop('headers', {}), url, stealthy_headers) - with httpx.Client(proxy=proxy) as client: - request = client.put(url=url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) + headers = self._headers_job(kwargs.pop('headers', {})) + async with httpx.AsyncClient(proxy=self.proxy) as client: + request = await client.put(url=self.url, headers=headers, follow_redirects=self.follow_redirects, timeout=self.timeout, **kwargs) return self._prepare_response(request) diff --git a/scrapling/engines/toolbelt/__init__.py b/scrapling/engines/toolbelt/__init__.py index 595929c..ccf2afa 100644 --- a/scrapling/engines/toolbelt/__init__.py +++ b/scrapling/engines/toolbelt/__init__.py @@ -1,6 +1,6 @@ from .custom import (BaseFetcher, Response, StatusText, check_if_engine_usable, - check_type_validity, do_nothing, get_variable_name) + check_type_validity, get_variable_name) from .fingerprints import (generate_convincing_referer, generate_headers, get_os_name) -from .navigation import (construct_cdp_url, construct_proxy_dict, - intercept_route, js_bypass_path) +from .navigation import (async_intercept_route, construct_cdp_url, + construct_proxy_dict, intercept_route, js_bypass_path) diff --git a/scrapling/engines/toolbelt/custom.py b/scrapling/engines/toolbelt/custom.py index 6e321cc..62c8452 100644 --- a/scrapling/engines/toolbelt/custom.py +++ b/scrapling/engines/toolbelt/custom.py @@ -2,13 +2,12 @@ Functions related to custom types or type checking """ import inspect -import logging from email.message import Message from scrapling.core._types import (Any, Callable, Dict, List, Optional, Tuple, Type, Union) from scrapling.core.custom_types import MappingProxyType -from scrapling.core.utils import cache, setup_basic_logging +from scrapling.core.utils import log, lru_cache from scrapling.parser import Adaptor, SQLiteStorageSystem @@ -17,7 +16,7 @@ class ResponseEncoding: __ISO_8859_1_CONTENT_TYPES = {"text/plain", "text/html", "text/css", "text/javascript"} @classmethod - @cache(maxsize=None) + @lru_cache(maxsize=None) def __parse_content_type(cls, header_value: str) -> Tuple[str, Dict[str, str]]: """Parse content type and parameters from a content-type header value. @@ -39,7 +38,7 @@ def __parse_content_type(cls, header_value: str) -> Tuple[str, Dict[str, str]]: return content_type, params @classmethod - @cache(maxsize=None) + @lru_cache(maxsize=None) def get_value(cls, content_type: Optional[str], text: Optional[str] = 'test') -> str: """Determine the appropriate character encoding from a content-type header. 
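Since both the sync and async code paths above funnel through `StaticEngine`, the practical difference for callers is whether the verb methods are awaited; the `retries` count is applied via `httpx.HTTPTransport` on the sync side. Below is a minimal usage sketch against the public fetcher classes changed later in this diff — the target URL is a placeholder and the keyword values are illustrative only:
```python
import asyncio

from scrapling.fetchers import AsyncFetcher, Fetcher

# Sync path: retries are wired into httpx.HTTPTransport(retries=...)
page = Fetcher(auto_match=False).get('https://example.com', retries=5, timeout=10)
print(page.status)

# Async path: same keyword arguments, but the verb methods are coroutines
async def main():
    page = await AsyncFetcher(auto_match=False).get('https://example.com', retries=5, timeout=10)
    print(page.status, page.reason)

asyncio.run(main())
```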
@@ -85,7 +84,10 @@ def get_value(cls, content_type: Optional[str], text: Optional[str] = 'test') -> class Response(Adaptor): """This class is returned by all engines as a way to unify response type between different libraries.""" - def __init__(self, url: str, text: str, body: bytes, status: int, reason: str, cookies: Dict, headers: Dict, request_headers: Dict, encoding: str = 'utf-8', **adaptor_arguments: Dict): + _is_response_result_logged = False # Class-level flag, initialized to False + + def __init__(self, url: str, text: str, body: bytes, status: int, reason: str, cookies: Dict, headers: Dict, request_headers: Dict, + encoding: str = 'utf-8', method: str = 'GET', **adaptor_arguments: Dict): automatch_domain = adaptor_arguments.pop('automatch_domain', None) self.status = status self.reason = reason @@ -96,6 +98,10 @@ def __init__(self, url: str, text: str, body: bytes, status: int, reason: str, c super().__init__(text=text, body=body, url=automatch_domain or url, encoding=encoding, **adaptor_arguments) # For back-ward compatibility self.adaptor = self + # For easier debugging while working from a Python shell + if not Response._is_response_result_logged: + log.info(f'Fetched ({status}) <{method} {url}> (referer: {request_headers.get("referer")})') + Response._is_response_result_logged = True # def __repr__(self): # return f'<{self.__class__.__name__} [{self.status} {self.reason}]>' @@ -104,8 +110,8 @@ def __init__(self, url: str, text: str, body: bytes, status: int, reason: str, c class BaseFetcher: def __init__( self, huge_tree: bool = True, keep_comments: Optional[bool] = False, auto_match: Optional[bool] = True, - storage: Any = SQLiteStorageSystem, storage_args: Optional[Dict] = None, debug: Optional[bool] = False, - automatch_domain: Optional[str] = None, + storage: Any = SQLiteStorageSystem, storage_args: Optional[Dict] = None, + automatch_domain: Optional[str] = None, keep_cdata: Optional[bool] = False, ): """Arguments below are the same from the Adaptor class so you can pass them directly, the rest of Adaptor's arguments are detected and passed automatically from the Fetcher based on the response for accessibility. @@ -113,6 +119,7 @@ def __init__( :param huge_tree: Enabled by default, should always be enabled when parsing large HTML documents. This controls libxml2 feature that forbids parsing certain large documents to protect from possible memory exhaustion. :param keep_comments: While parsing the HTML body, drop comments or not. Disabled by default for obvious reasons + :param keep_cdata: While parsing the HTML body, drop cdata or not. Disabled by default for cleaner HTML. :param auto_match: Globally turn-off the auto-match feature in all functions, this argument takes higher priority over all auto-match related arguments/functions in the class. :param storage: The storage class to be passed for auto-matching functionalities, see ``Docs`` for more info. @@ -120,23 +127,20 @@ def __init__( If empty, default values will be used. :param automatch_domain: For cases where you want to automatch selectors across different websites as if they were on the same website, use this argument to unify them. Otherwise, the domain of the request is used by default. 
- :param debug: Enable debug mode """ # Adaptor class parameters # I won't validate Adaptor's class parameters here again, I will leave it to be validated later self.adaptor_arguments = dict( huge_tree=huge_tree, keep_comments=keep_comments, + keep_cdata=keep_cdata, auto_match=auto_match, storage=storage, - storage_args=storage_args, - debug=debug, + storage_args=storage_args ) - # If the user used fetchers first, then configure the logger from here instead of the `Adaptor` class - setup_basic_logging(level='debug' if debug else 'info') if automatch_domain: if type(automatch_domain) is not str: - logging.warning('[Ignored] The argument "automatch_domain" must be of string type') + log.warning('[Ignored] The argument "automatch_domain" must be of string type') else: self.adaptor_arguments.update({'automatch_domain': automatch_domain}) @@ -212,7 +216,7 @@ class StatusText: }) @classmethod - @cache(maxsize=128) + @lru_cache(maxsize=128) def get(cls, status_code: int) -> str: """Get the phrase for a given HTTP status code.""" return cls._phrases.get(status_code, "Unknown Status Code") @@ -279,7 +283,7 @@ def check_type_validity(variable: Any, valid_types: Union[List[Type], None], def error_msg = f'Argument "{var_name}" cannot be None' if critical: raise TypeError(error_msg) - logging.error(f'[Ignored] {error_msg}') + log.error(f'[Ignored] {error_msg}') return default_value # If no valid_types specified and variable has a value, return it @@ -292,13 +296,7 @@ def check_type_validity(variable: Any, valid_types: Union[List[Type], None], def error_msg = f'Argument "{var_name}" must be of type {" or ".join(type_names)}' if critical: raise TypeError(error_msg) - logging.error(f'[Ignored] {error_msg}') + log.error(f'[Ignored] {error_msg}') return default_value return variable - - -# Pew Pew -def do_nothing(page): - # Just works as a filler for `page_action` argument in browser engines - return page diff --git a/scrapling/engines/toolbelt/fingerprints.py b/scrapling/engines/toolbelt/fingerprints.py index 5600003..a7bf633 100644 --- a/scrapling/engines/toolbelt/fingerprints.py +++ b/scrapling/engines/toolbelt/fingerprints.py @@ -9,10 +9,10 @@ from tldextract import extract from scrapling.core._types import Dict, Union -from scrapling.core.utils import cache +from scrapling.core.utils import lru_cache -@cache(None, typed=True) +@lru_cache(None, typed=True) def generate_convincing_referer(url: str) -> str: """Takes the domain from the URL without the subdomain/suffix and make it look like you were searching google for this website @@ -26,7 +26,7 @@ def generate_convincing_referer(url: str) -> str: return f'https://www.google.com/search?q={website_name}' -@cache(None, typed=True) +@lru_cache(None, typed=True) def get_os_name() -> Union[str, None]: """Get the current OS name in the same format needed for browserforge diff --git a/scrapling/engines/toolbelt/navigation.py b/scrapling/engines/toolbelt/navigation.py index 2d24cac..14a26d0 100644 --- a/scrapling/engines/toolbelt/navigation.py +++ b/scrapling/engines/toolbelt/navigation.py @@ -1,28 +1,41 @@ """ Functions related to files and URLs """ - -import logging import os from urllib.parse import urlencode, urlparse +from playwright.async_api import Route as async_Route from playwright.sync_api import Route from scrapling.core._types import Dict, Optional, Union -from scrapling.core.utils import cache +from scrapling.core.utils import log, lru_cache from scrapling.engines.constants import DEFAULT_DISABLED_RESOURCES -def intercept_route(route: Route) -> 
Union[Route, None]: +def intercept_route(route: Route): + """This is just a route handler but it drops requests that its type falls in `DEFAULT_DISABLED_RESOURCES` + + :param route: PlayWright `Route` object of the current page + :return: PlayWright `Route` object + """ + if route.request.resource_type in DEFAULT_DISABLED_RESOURCES: + log.debug(f'Blocking background resource "{route.request.url}" of type "{route.request.resource_type}"') + route.abort() + else: + route.continue_() + + +async def async_intercept_route(route: async_Route): """This is just a route handler but it drops requests that its type falls in `DEFAULT_DISABLED_RESOURCES` :param route: PlayWright `Route` object of the current page :return: PlayWright `Route` object """ if route.request.resource_type in DEFAULT_DISABLED_RESOURCES: - logging.debug(f'Blocking background resource "{route.request.url}" of type "{route.request.resource_type}"') - return route.abort() - return route.continue_() + log.debug(f'Blocking background resource "{route.request.url}" of type "{route.request.resource_type}"') + await route.abort() + else: + await route.continue_() def construct_proxy_dict(proxy_string: Union[str, Dict[str, str]]) -> Union[Dict, None]: @@ -97,7 +110,7 @@ def construct_cdp_url(cdp_url: str, query_params: Optional[Dict] = None) -> str: raise ValueError(f"Invalid CDP URL: {str(e)}") -@cache(None, typed=True) +@lru_cache(None, typed=True) def js_bypass_path(filename: str) -> str: """Takes the base filename of JS file inside the `bypasses` folder then return the full path of it diff --git a/scrapling/fetchers.py b/scrapling/fetchers.py index 619f2f8..86f77ae 100644 --- a/scrapling/fetchers.py +++ b/scrapling/fetchers.py @@ -2,7 +2,7 @@ Union) from scrapling.engines import (CamoufoxEngine, PlaywrightEngine, StaticEngine, check_if_engine_usable) -from scrapling.engines.toolbelt import BaseFetcher, Response, do_nothing +from scrapling.engines.toolbelt import BaseFetcher, Response class Fetcher(BaseFetcher): @@ -10,7 +10,9 @@ class Fetcher(BaseFetcher): Any additional keyword arguments passed to the methods below are passed to the respective httpx's method directly. """ - def get(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, proxy: Optional[str] = None, **kwargs: Dict) -> Response: + def get( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: """Make basic HTTP GET request for you but with some added flavors. :param url: Target url. @@ -19,13 +21,17 @@ def get(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[i :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and create a referer header as if this request had came from Google's search of this URL's domain. :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. :param kwargs: Any additional keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details. 
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - response_object = StaticEngine(follow_redirects, timeout, adaptor_arguments=self.adaptor_arguments).get(url, proxy, stealthy_headers, **kwargs) + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries, adaptor_arguments=adaptor_arguments).get(**kwargs) return response_object - def post(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, proxy: Optional[str] = None, **kwargs: Dict) -> Response: + def post( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: """Make basic HTTP POST request for you but with some added flavors. :param url: Target url. @@ -34,13 +40,17 @@ def post(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[ :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and create a referer header as if this request came from Google's search of this URL's domain. :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. :param kwargs: Any additional keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details. :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - response_object = StaticEngine(follow_redirects, timeout, adaptor_arguments=self.adaptor_arguments).post(url, proxy, stealthy_headers, **kwargs) + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries, adaptor_arguments=adaptor_arguments).post(**kwargs) return response_object - def put(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, proxy: Optional[str] = None, **kwargs: Dict) -> Response: + def put( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: """Make basic HTTP PUT request for you but with some added flavors. :param url: Target url @@ -49,14 +59,18 @@ def put(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[i :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and create a referer header as if this request came from Google's search of this URL's domain. :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. :param kwargs: Any additional keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details. 
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - response_object = StaticEngine(follow_redirects, timeout, adaptor_arguments=self.adaptor_arguments).put(url, proxy, stealthy_headers, **kwargs) + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries, adaptor_arguments=adaptor_arguments).put(**kwargs) return response_object - def delete(self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, proxy: Optional[str] = None, **kwargs: Dict) -> Response: + def delete( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: """Make basic HTTP DELETE request for you but with some added flavors. :param url: Target url @@ -65,10 +79,90 @@ def delete(self, url: str, follow_redirects: bool = True, timeout: Optional[Unio :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and create a referer header as if this request came from Google's search of this URL's domain. :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. :param kwargs: Any additional keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details. :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` """ - response_object = StaticEngine(follow_redirects, timeout, adaptor_arguments=self.adaptor_arguments).delete(url, proxy, stealthy_headers, **kwargs) + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries, adaptor_arguments=adaptor_arguments).delete(**kwargs) + return response_object + + +class AsyncFetcher(Fetcher): + async def get( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: + """Make basic HTTP GET request for you but with some added flavors. + + :param url: Target url. + :param follow_redirects: As the name says -- if enabled (default), redirects will be followed. + :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds. + :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and + create a referer header as if this request had came from Google's search of this URL's domain. + :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. + :param kwargs: Any additional keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details. 
+ :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = await StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries=retries, adaptor_arguments=adaptor_arguments).async_get(**kwargs) + return response_object + + async def post( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: + """Make basic HTTP POST request for you but with some added flavors. + + :param url: Target url. + :param follow_redirects: As the name says -- if enabled (default), redirects will be followed. + :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds. + :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and + create a referer header as if this request came from Google's search of this URL's domain. + :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. + :param kwargs: Any additional keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + adaptor_arguments = tuple(self.adaptor_arguments.items()) + response_object = await StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries=retries, adaptor_arguments=adaptor_arguments).async_post(**kwargs) + return response_object + + async def put( + self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True, + proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response: + """Make basic HTTP PUT request for you but with some added flavors. + + :param url: Target url + :param follow_redirects: As the name says -- if enabled (default), redirects will be followed. + :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds. + :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and + create a referer header as if this request came from Google's search of this URL's domain. + :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030` + :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries. + :param kwargs: Any additional keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details. 
+        :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
+        """
+        adaptor_arguments = tuple(self.adaptor_arguments.items())
+        response_object = await StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries=retries, adaptor_arguments=adaptor_arguments).async_put(**kwargs)
+        return response_object
+
+    async def delete(
+            self, url: str, follow_redirects: bool = True, timeout: Optional[Union[int, float]] = 10, stealthy_headers: Optional[bool] = True,
+            proxy: Optional[str] = None, retries: Optional[int] = 3, **kwargs: Dict) -> Response:
+        """Make basic HTTP DELETE request for you but with some added flavors.
+
+        :param url: Target url
+        :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
+        :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds.
+        :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
+            create a referer header as if this request came from Google's search of this URL's domain.
+        :param proxy: A string of a proxy to use for http and https requests, the format accepted is `http://username:password@localhost:8030`
+        :param retries: The number of retries to do through httpx if the request failed for any reason. The default is 3 retries.
+        :param kwargs: Any additional keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details.
+        :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
+        """
+        adaptor_arguments = tuple(self.adaptor_arguments.items())
+        response_object = await StaticEngine(url, proxy, stealthy_headers, follow_redirects, timeout, retries=retries, adaptor_arguments=adaptor_arguments).async_delete(**kwargs)
+        return response_object
@@ -80,10 +174,10 @@ class StealthyFetcher(BaseFetcher):
     """
     def fetch(
             self, url: str, headless: Optional[Union[bool, Literal['virtual']]] = True, block_images: Optional[bool] = False, disable_resources: Optional[bool] = False,
-            block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, addons: Optional[List[str]] = None,
-            timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, humanize: Optional[Union[bool, float]] = True,
+            block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = True, network_idle: Optional[bool] = False, addons: Optional[List[str]] = None,
+            timeout: Optional[float] = 30000, page_action: Callable = None, wait_selector: Optional[str] = None, humanize: Optional[Union[bool, float]] = True,
             wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None, proxy: Optional[Union[str, Dict[str, str]]] = None,
-            os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True,
+            os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True, geoip: Optional[bool] = False,
     ) -> Response:
         """
         Opens up a browser and do your request based on your chosen options below.
@@ -99,7 +193,9 @@ def fetch(
         :param addons: List of Firefox addons to use. Must be paths to extracted addons.
         :param disable_ads: Enabled by default, this installs `uBlock Origin` addon on the browser if enabled.
         :param humanize: Humanize the cursor movement.
Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. - :param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases. + :param allow_webgl: Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. + :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. + It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. :param network_idle: Wait for the page until there are no network connections for at least 500 ms. :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000 @@ -113,6 +209,7 @@ def fetch( """ engine = CamoufoxEngine( proxy=proxy, + geoip=geoip, addons=addons, timeout=timeout, headless=headless, @@ -133,6 +230,64 @@ def fetch( ) return engine.fetch(url) + async def async_fetch( + self, url: str, headless: Optional[Union[bool, Literal['virtual']]] = True, block_images: Optional[bool] = False, disable_resources: Optional[bool] = False, + block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = True, network_idle: Optional[bool] = False, addons: Optional[List[str]] = None, + timeout: Optional[float] = 30000, page_action: Callable = None, wait_selector: Optional[str] = None, humanize: Optional[Union[bool, float]] = True, + wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None, proxy: Optional[Union[str, Dict[str, str]]] = None, + os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True, geoip: Optional[bool] = False, + ) -> Response: + """ + Opens up a browser and do your request based on your chosen options below. + + :param url: Target url. + :param headless: Run the browser in headless/hidden (default), 'virtual' screen mode, or headful/visible mode. + :param block_images: Prevent the loading of images through Firefox preferences. + This can help save your proxy usage but be careful with this option as it makes some websites never finish loading. + :param disable_resources: Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites. + Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. + This can help save your proxy usage but be careful with this option as it makes some websites never finish loading. + :param block_webrtc: Blocks WebRTC entirely. + :param addons: List of Firefox addons to use. Must be paths to extracted addons. + :param disable_ads: Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. + :param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. + :param allow_webgl: Enabled by default. Disabling it WebGL not recommended as many WAFs now checks if WebGL is enabled. 
+ :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. + It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. + :param network_idle: Wait for the page until there are no network connections for at least 500 ms. + :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. + :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000 + :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. + :param wait_selector: Wait for a specific css selector to be in a specific state. + :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`. + :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. + :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ + :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + engine = CamoufoxEngine( + proxy=proxy, + geoip=geoip, + addons=addons, + timeout=timeout, + headless=headless, + humanize=humanize, + disable_ads=disable_ads, + allow_webgl=allow_webgl, + page_action=page_action, + network_idle=network_idle, + block_images=block_images, + block_webrtc=block_webrtc, + os_randomize=os_randomize, + wait_selector=wait_selector, + google_search=google_search, + extra_headers=extra_headers, + disable_resources=disable_resources, + wait_selector_state=wait_selector_state, + adaptor_arguments=self.adaptor_arguments, + ) + return await engine.async_fetch(url) + class PlayWrightFetcher(BaseFetcher): """A `Fetcher` class type that provide many options, all of them are based on PlayWright. 
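Because `async_fetch` takes the same options as `fetch`, the new `geoip` flag and the now-optional `page_action` callback work the same on both paths; the only difference is that the callback should be a coroutine when used asynchronously. A rough sketch under those assumptions — the URL and proxy below are placeholders:
```python
import asyncio

from scrapling import StealthyFetcher

async def scroll_page(page):
    # Illustrative async page_action: async_fetch awaits whatever automation you pass in
    await page.mouse.wheel(10, 0)
    return page

async def main():
    fetcher = StealthyFetcher(auto_match=False)
    page = await fetcher.async_fetch(
        'https://example.com',                             # placeholder target
        geoip=True,                                        # derive timezone/locale/WebRTC spoofing from the exit IP
        proxy='http://username:password@localhost:8030',   # placeholder proxy, same format as the sync fetchers
        page_action=scroll_page,
    )
    print(page.status)

asyncio.run(main())
```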
@@ -153,7 +308,7 @@ class PlayWrightFetcher(BaseFetcher): def fetch( self, url: str, headless: Union[bool, str] = True, disable_resources: bool = None, useragent: Optional[str] = None, network_idle: Optional[bool] = False, timeout: Optional[float] = 30000, - page_action: Optional[Callable] = do_nothing, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached', + page_action: Optional[Callable] = None, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached', hide_canvas: Optional[bool] = False, disable_webgl: Optional[bool] = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True, proxy: Optional[Union[str, Dict[str, str]]] = None, locale: Optional[str] = 'en-US', stealth: Optional[bool] = False, real_chrome: Optional[bool] = False, @@ -210,6 +365,66 @@ def fetch( ) return engine.fetch(url) + async def async_fetch( + self, url: str, headless: Union[bool, str] = True, disable_resources: bool = None, + useragent: Optional[str] = None, network_idle: Optional[bool] = False, timeout: Optional[float] = 30000, + page_action: Optional[Callable] = None, wait_selector: Optional[str] = None, wait_selector_state: Optional[str] = 'attached', + hide_canvas: Optional[bool] = False, disable_webgl: Optional[bool] = False, extra_headers: Optional[Dict[str, str]] = None, google_search: Optional[bool] = True, + proxy: Optional[Union[str, Dict[str, str]]] = None, locale: Optional[str] = 'en-US', + stealth: Optional[bool] = False, real_chrome: Optional[bool] = False, + cdp_url: Optional[str] = None, + nstbrowser_mode: Optional[bool] = False, nstbrowser_config: Optional[Dict] = None, + ) -> Response: + """Opens up a browser and do your request based on your chosen options below. + + :param url: Target url. + :param headless: Run the browser in headless/hidden (default), or headful/visible mode. + :param disable_resources: Drop requests of unnecessary resources for speed boost. It depends but it made requests ~25% faster in my tests for some websites. + Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. + This can help save your proxy usage but be careful with this option as it makes some websites never finish loading. + :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it. + :param network_idle: Wait for the page until there are no network connections for at least 500 ms. + :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000 + :param locale: Set the locale for the browser if wanted. The default value is `en-US`. + :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. + :param wait_selector: Wait for a specific css selector to be in a specific state. + :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`. + :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently. + :param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. + :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting. + :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely. 
+ :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. + :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ + :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. + :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. + :param nstbrowser_mode: Enables NSTBrowser mode, it have to be used with `cdp_url` argument or it will get completely ignored. + :param nstbrowser_config: The config you want to send with requests to the NSTBrowser. If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config. + :return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers` + """ + engine = PlaywrightEngine( + proxy=proxy, + locale=locale, + timeout=timeout, + stealth=stealth, + cdp_url=cdp_url, + headless=headless, + useragent=useragent, + real_chrome=real_chrome, + page_action=page_action, + hide_canvas=hide_canvas, + network_idle=network_idle, + google_search=google_search, + extra_headers=extra_headers, + wait_selector=wait_selector, + disable_webgl=disable_webgl, + nstbrowser_mode=nstbrowser_mode, + nstbrowser_config=nstbrowser_config, + disable_resources=disable_resources, + wait_selector_state=wait_selector_state, + adaptor_arguments=self.adaptor_arguments, + ) + return await engine.async_fetch(url) + class CustomFetcher(BaseFetcher): def fetch(self, url: str, browser_engine, **kwargs) -> Response: diff --git a/scrapling/parser.py b/scrapling/parser.py index daaa8c4..5440a84 100644 --- a/scrapling/parser.py +++ b/scrapling/parser.py @@ -2,6 +2,7 @@ import os import re from difflib import SequenceMatcher +from urllib.parse import urljoin from cssselect import SelectorError, SelectorSyntaxError from cssselect import parse as split_selectors @@ -17,13 +18,14 @@ StorageSystemMixin, _StorageTools) from scrapling.core.translator import HTMLTranslator from scrapling.core.utils import (clean_spaces, flatten, html_forbidden, - is_jsonable, logging, setup_basic_logging) + is_jsonable, log) class Adaptor(SelectorsGeneration): __slots__ = ( - 'url', 'encoding', '__auto_match_enabled', '_root', '_storage', '__debug', + 'url', 'encoding', '__auto_match_enabled', '_root', '_storage', '__keep_comments', '__huge_tree_enabled', '__attributes', '__text', '__tag', + '__keep_cdata', '__raw_body' ) def __init__( @@ -35,10 +37,10 @@ def __init__( huge_tree: bool = True, root: Optional[html.HtmlElement] = None, keep_comments: Optional[bool] = False, + keep_cdata: Optional[bool] = False, auto_match: Optional[bool] = True, storage: Any = SQLiteStorageSystem, storage_args: Optional[Dict] = None, - debug: Optional[bool] = True, **kwargs ): """The main class that works as a wrapper for the HTML input data. Using this class, you can search for elements @@ -58,33 +60,36 @@ def __init__( :param root: Used internally to pass etree objects instead of text/body arguments, it takes highest priority. Don't use it unless you know what you are doing! :param keep_comments: While parsing the HTML body, drop comments or not. 
Disabled by default for obvious reasons + :param keep_cdata: While parsing the HTML body, drop cdata or not. Disabled by default for cleaner HTML. :param auto_match: Globally turn-off the auto-match feature in all functions, this argument takes higher priority over all auto-match related arguments/functions in the class. :param storage: The storage class to be passed for auto-matching functionalities, see ``Docs`` for more info. :param storage_args: A dictionary of ``argument->value`` pairs to be passed for the storage class. If empty, default values will be used. - :param debug: Enable debug mode """ if root is None and not body and text is None: raise ValueError("Adaptor class needs text, body, or root arguments to work") self.__text = None + self.__raw_body = '' if root is None: if text is None: if not body or not isinstance(body, bytes): raise TypeError(f"body argument must be valid and of type bytes, got {body.__class__}") body = body.replace(b"\x00", b"").strip() + self.__raw_body = body.replace(b"\x00", b"").strip().decode() else: if not isinstance(text, str): raise TypeError(f"text argument must be of type str, got {text.__class__}") body = text.strip().replace("\x00", "").encode(encoding) or b"" + self.__raw_body = text.strip() # https://lxml.de/api/lxml.etree.HTMLParser-class.html parser = html.HTMLParser( - recover=True, remove_blank_text=True, remove_comments=(keep_comments is False), encoding=encoding, - compact=True, huge_tree=huge_tree, default_doctype=True + recover=True, remove_blank_text=True, remove_comments=(not keep_comments), encoding=encoding, + compact=True, huge_tree=huge_tree, default_doctype=True, strip_cdata=(not keep_cdata), ) self._root = etree.fromstring(body, parser=parser, base_url=url) if is_jsonable(text or body.decode()): @@ -99,7 +104,6 @@ def __init__( self._root = root - setup_basic_logging(level='debug' if debug else 'info') self.__auto_match_enabled = auto_match if self.__auto_match_enabled: @@ -110,7 +114,7 @@ def __init__( } if not hasattr(storage, '__wrapped__'): - raise ValueError("Storage class must be wrapped with cache decorator, see docs for info") + raise ValueError("Storage class must be wrapped with lru_cache decorator, see docs for info") if not issubclass(storage.__wrapped__, StorageSystemMixin): raise ValueError("Storage system must be inherited from class `StorageSystemMixin`") @@ -118,13 +122,13 @@ def __init__( self._storage = storage(**storage_args) self.__keep_comments = keep_comments + self.__keep_cdata = keep_cdata self.__huge_tree_enabled = huge_tree self.encoding = encoding self.url = url # For selector stuff self.__attributes = None self.__tag = None - self.__debug = debug # No need to check if all response attributes exist or not because if `status` exist, then the rest exist (Save some CPU cycles for speed) self.__response_data = { key: getattr(self, key) for key in ('status', 'reason', 'cookies', 'headers', 'request_headers',) @@ -155,8 +159,8 @@ def __get_correct_result( root=element, text='', body=b'', # Since root argument is provided, both `text` and `body` will be ignored so this is just a filler url=self.url, encoding=self.encoding, auto_match=self.__auto_match_enabled, - keep_comments=True, # if the comments are already removed in initialization, no need to try to delete them in sub-elements - huge_tree=self.__huge_tree_enabled, debug=self.__debug, + keep_comments=self.__keep_comments, keep_cdata=self.__keep_cdata, + huge_tree=self.__huge_tree_enabled, **self.__response_data ) return element @@ -243,6 +247,10 @@ def 
_traverse(node: html.HtmlElement) -> None: return TextHandler(separator.join([s for s in _all_strings])) + def urljoin(self, relative_url: str) -> str: + """Join this Adaptor's url with a relative url to form an absolute full URL.""" + return urljoin(self.url, relative_url) + @property def attrib(self) -> AttributesHandler: """Get attributes of the element""" @@ -255,7 +263,10 @@ def html_content(self) -> str: """Return the inner html code of the element""" return etree.tostring(self._root, encoding='unicode', method='html', with_tail=False) - body = html_content + @property + def body(self) -> str: + """Return raw HTML code of the element/page without any processing when possible or return `Adaptor.html_content`""" + return self.__raw_body or self.html_content def prettify(self) -> str: """Return a prettified version of the element's inner html-code""" @@ -330,6 +341,16 @@ def previous(self) -> Union['Adaptor', None]: return self.__convert_results(prev_element) + # For easy copy-paste from Scrapy/parsel code when needed :) + def get(self, default=None): + return self + + def get_all(self): + return self + + extract = get_all + extract_first = get + def __str__(self) -> str: return self.html_content @@ -392,10 +413,10 @@ def _traverse(node: html.HtmlElement, ele: Dict) -> None: if score_table: highest_probability = max(score_table.keys()) if score_table[highest_probability] and highest_probability >= percentage: - logging.debug(f'Highest probability was {highest_probability}%') - logging.debug('Top 5 best matching elements are: ') + log.debug(f'Highest probability was {highest_probability}%') + log.debug('Top 5 best matching elements are: ') for percent in tuple(sorted(score_table.keys(), reverse=True))[:5]: - logging.debug(f'{percent} -> {self.__convert_results(score_table[percent])}') + log.debug(f'{percent} -> {self.__convert_results(score_table[percent])}') if not adaptor_type: return score_table[highest_probability] return self.__convert_results(score_table[highest_probability]) @@ -521,7 +542,7 @@ def xpath(self, selector: str, identifier: str = '', if selected_elements: if not self.__auto_match_enabled and auto_save: - logging.warning("Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.") + log.warning("Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.") elif self.__auto_match_enabled and auto_save: self.save(selected_elements[0], identifier or selector) @@ -540,7 +561,7 @@ def xpath(self, selector: str, identifier: str = '', return self.__convert_results(selected_elements) elif not self.__auto_match_enabled and auto_match: - logging.warning("Argument `auto_match` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.") + log.warning("Argument `auto_match` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info.") return self.__convert_results(selected_elements) @@ -744,8 +765,8 @@ def save(self, element: Union['Adaptor', html.HtmlElement], identifier: str) -> self._storage.save(element, identifier) else: - logging.critical( - "Can't use Auto-match features with disabled globally, you have to start a new class instance." + log.critical( + "Can't use Auto-match features while disabled globally, you have to start a new class instance." 
) def retrieve(self, identifier: str) -> Optional[Dict]: @@ -758,8 +779,8 @@ def retrieve(self, identifier: str) -> Optional[Dict]: if self.__auto_match_enabled: return self._storage.retrieve(identifier) - logging.critical( - "Can't use Auto-match features with disabled globally, you have to start a new class instance." + log.critical( + "Can't use Auto-match features while disabled globally, you have to start a new class instance." ) # Operations on text functions @@ -1073,12 +1094,19 @@ def filter(self, func: Callable[['Adaptor'], bool]) -> Union['Adaptors', List]: ] return self.__class__(results) if results else results + # For easy copy-paste from Scrapy/parsel code when needed :) def get(self, default=None): """Returns the first item of the current list :param default: the default value to return if the current list is empty """ return self[0] if len(self) > 0 else default + def extract(self): + return self + + extract_first = get + get_all = extract + @property def first(self): """Returns the first item of the current list or `None` if the list is empty""" diff --git a/setup.cfg b/setup.cfg index 84169d5..b8b9227 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,6 +1,6 @@ [metadata] name = scrapling -version = 0.2.8 +version = 0.2.9 author = Karim Shoair author_email = karim.shoair@pm.me description = Scrapling is an undetectable, powerful, flexible, adaptive, and high-performance web scraping library for Python. diff --git a/setup.py b/setup.py index 0a29929..baa9429 100644 --- a/setup.py +++ b/setup.py @@ -6,7 +6,7 @@ setup( name="scrapling", - version="0.2.8", + version="0.2.9", description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It simplifies the process of extracting data from websites, even when they undergo structural changes, and offers impressive speed improvements over many popular scraping tools.""", @@ -55,12 +55,11 @@ "orjson>=3", "tldextract", 'httpx[brotli,zstd]', - 'playwright==1.48', # Temporary because currently All libraries that provide CDP patches doesn't support playwright 1.49 yet - 'rebrowser-playwright', - 'camoufox>=0.4.4', - 'browserforge', + 'playwright>=1.49.1', + 'rebrowser-playwright>=1.49.1', + 'camoufox[geoip]>=0.4.9' ], - python_requires=">=3.8", + python_requires=">=3.9", url="https://github.com/D4Vinci/Scrapling", project_urls={ "Documentation": "https://github.com/D4Vinci/Scrapling/tree/main/docs", # For now diff --git a/tests/fetchers/async/__init__.py b/tests/fetchers/async/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/tests/fetchers/async/test_camoufox.py b/tests/fetchers/async/test_camoufox.py new file mode 100644 index 0000000..f2f495d --- /dev/null +++ b/tests/fetchers/async/test_camoufox.py @@ -0,0 +1,95 @@ +import pytest +import pytest_httpbin + +from scrapling import StealthyFetcher + + +@pytest_httpbin.use_class_based_httpbin +@pytest.mark.asyncio +class TestStealthyFetcher: + @pytest.fixture(scope="class") + def fetcher(self): + return StealthyFetcher(auto_match=False) + + @pytest.fixture(scope="class") + def urls(self, httpbin): + url = httpbin.url + return { + 'status_200': f'{url}/status/200', + 'status_404': f'{url}/status/404', + 'status_501': f'{url}/status/501', + 'basic_url': f'{url}/get', + 'html_url': f'{url}/html', + 'delayed_url': f'{url}/delay/10', # 10 Seconds delay response + 'cookies_url': f"{url}/cookies/set/test/value" + } + + async def test_basic_fetch(self, fetcher, urls): + """Test doing basic fetch request with multiple statuses""" + assert 
(await fetcher.async_fetch(urls['status_200'])).status == 200
+        assert (await fetcher.async_fetch(urls['status_404'])).status == 404
+        assert (await fetcher.async_fetch(urls['status_501'])).status == 501
+
+    async def test_networkidle(self, fetcher, urls):
+        """Test that waiting for `networkidle` does not prevent the page from finishing loading"""
+        assert (await fetcher.async_fetch(urls['basic_url'], network_idle=True)).status == 200
+
+    async def test_blocking_resources(self, fetcher, urls):
+        """Test that blocking resources does not prevent the page from finishing loading"""
+        assert (await fetcher.async_fetch(urls['basic_url'], block_images=True)).status == 200
+        assert (await fetcher.async_fetch(urls['basic_url'], disable_resources=True)).status == 200
+
+    async def test_waiting_selector(self, fetcher, urls):
+        """Test that waiting for a selector does not prevent the page from finishing loading"""
+        assert (await fetcher.async_fetch(urls['html_url'], wait_selector='h1')).status == 200
+        assert (await fetcher.async_fetch(
+            urls['html_url'],
+            wait_selector='h1',
+            wait_selector_state='visible'
+        )).status == 200
+
+    async def test_cookies_loading(self, fetcher, urls):
+        """Test if cookies are set after the request"""
+        response = await fetcher.async_fetch(urls['cookies_url'])
+        assert response.cookies == {'test': 'value'}
+
+    async def test_automation(self, fetcher, urls):
+        """Test that running page automation does not break the request"""
+
+        async def scroll_page(page):
+            await page.mouse.wheel(10, 0)
+            await page.mouse.move(100, 400)
+            await page.mouse.up()
+            return page
+
+        assert (await fetcher.async_fetch(urls['html_url'], page_action=scroll_page)).status == 200
+
+    async def test_properties(self, fetcher, urls):
+        """Test that different argument combinations do not break the request"""
+        assert (await fetcher.async_fetch(
+            urls['html_url'],
+            block_webrtc=True,
+            allow_webgl=True
+        )).status == 200
+
+        assert (await fetcher.async_fetch(
+            urls['html_url'],
+            block_webrtc=False,
+            allow_webgl=True
+        )).status == 200
+
+        assert (await fetcher.async_fetch(
+            urls['html_url'],
+            block_webrtc=True,
+            allow_webgl=False
+        )).status == 200
+
+        assert (await fetcher.async_fetch(
+            urls['html_url'],
+            extra_headers={'ayo': ''},
+            os_randomize=True
+        )).status == 200
+
+    async def test_infinite_timeout(self, fetcher, urls):
+        """Test that an infinite timeout does not break the request"""
+        assert (await fetcher.async_fetch(urls['delayed_url'], timeout=None)).status == 200
diff --git a/tests/fetchers/async/test_httpx.py b/tests/fetchers/async/test_httpx.py
new file mode 100644
index 0000000..67cc037
--- /dev/null
+++ b/tests/fetchers/async/test_httpx.py
@@ -0,0 +1,83 @@
+import pytest
+import pytest_httpbin
+
+from scrapling.fetchers import AsyncFetcher
+
+
+@pytest_httpbin.use_class_based_httpbin
+@pytest.mark.asyncio
+class TestAsyncFetcher:
+    @pytest.fixture(scope="class")
+    def fetcher(self):
+        return AsyncFetcher(auto_match=True)
+
+    @pytest.fixture(scope="class")
+    def urls(self, httpbin):
+        return {
+            'status_200': f'{httpbin.url}/status/200',
+            'status_404': f'{httpbin.url}/status/404',
+            'status_501': f'{httpbin.url}/status/501',
+            'basic_url': f'{httpbin.url}/get',
+            'post_url': f'{httpbin.url}/post',
+            'put_url': f'{httpbin.url}/put',
+            'delete_url': f'{httpbin.url}/delete',
+            'html_url': f'{httpbin.url}/html'
+        }
+
+    async def test_basic_get(self, fetcher, urls):
+        """Test basic GET requests against multiple status codes"""
+        assert (await fetcher.get(urls['status_200'])).status == 200
+        assert (await fetcher.get(urls['status_404'])).status == 404
+        assert (await fetcher.get(urls['status_501'])).status == 501
+
+    async def test_get_properties(self, fetcher, urls):
+        """Test that different arguments do not break GET requests"""
+        assert (await fetcher.get(urls['status_200'], stealthy_headers=True)).status == 200
+        assert (await fetcher.get(urls['status_200'], follow_redirects=True)).status == 200
+        assert (await fetcher.get(urls['status_200'], timeout=None)).status == 200
+        assert (await fetcher.get(
+            urls['status_200'],
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        )).status == 200
+
+    async def test_post_properties(self, fetcher, urls):
+        """Test that different arguments do not break POST requests"""
+        assert (await fetcher.post(urls['post_url'], data={'key': 'value'})).status == 200
+        assert (await fetcher.post(urls['post_url'], data={'key': 'value'}, stealthy_headers=True)).status == 200
+        assert (await fetcher.post(urls['post_url'], data={'key': 'value'}, follow_redirects=True)).status == 200
+        assert (await fetcher.post(urls['post_url'], data={'key': 'value'}, timeout=None)).status == 200
+        assert (await fetcher.post(
+            urls['post_url'],
+            data={'key': 'value'},
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        )).status == 200
+
+    async def test_put_properties(self, fetcher, urls):
+        """Test that different arguments do not break PUT requests"""
+        assert (await fetcher.put(urls['put_url'], data={'key': 'value'})).status in [200, 405]
+        assert (await fetcher.put(urls['put_url'], data={'key': 'value'}, stealthy_headers=True)).status in [200, 405]
+        assert (await fetcher.put(urls['put_url'], data={'key': 'value'}, follow_redirects=True)).status in [200, 405]
+        assert (await fetcher.put(urls['put_url'], data={'key': 'value'}, timeout=None)).status in [200, 405]
+        assert (await fetcher.put(
+            urls['put_url'],
+            data={'key': 'value'},
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        )).status in [200, 405]
+
+    async def test_delete_properties(self, fetcher, urls):
+        """Test that different arguments do not break DELETE requests"""
+        assert (await fetcher.delete(urls['delete_url'], stealthy_headers=True)).status == 200
+        assert (await fetcher.delete(urls['delete_url'], follow_redirects=True)).status == 200
+        assert (await fetcher.delete(urls['delete_url'], timeout=None)).status == 200
+        assert (await fetcher.delete(
+            urls['delete_url'],
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        )).status == 200
diff --git a/tests/fetchers/async/test_playwright.py b/tests/fetchers/async/test_playwright.py
new file mode 100644
index 0000000..a8b4ef4
--- /dev/null
+++ b/tests/fetchers/async/test_playwright.py
@@ -0,0 +1,99 @@
+import pytest
+import pytest_httpbin
+
+from scrapling import PlayWrightFetcher
+
+
+@pytest_httpbin.use_class_based_httpbin
+class TestPlayWrightFetcherAsync:
+    @pytest.fixture
+    def fetcher(self):
+        return PlayWrightFetcher(auto_match=False)
+
+    @pytest.fixture
+    def urls(self, httpbin):
+        return {
+            'status_200': f'{httpbin.url}/status/200',
+            'status_404': f'{httpbin.url}/status/404',
+            'status_501': f'{httpbin.url}/status/501',
+            'basic_url': f'{httpbin.url}/get',
+            'html_url': f'{httpbin.url}/html',
+            'delayed_url': f'{httpbin.url}/delay/10',
+            'cookies_url': f"{httpbin.url}/cookies/set/test/value"
+        }
+
+    @pytest.mark.asyncio
+    async def test_basic_fetch(self, fetcher, urls):
+        """Test basic fetch requests against multiple status codes"""
+        response = await fetcher.async_fetch(urls['status_200'])
+        assert response.status == 200
+
+    @pytest.mark.asyncio
+    async def test_networkidle(self, fetcher, urls):
+        """Test that waiting for `networkidle` does not prevent the page from finishing loading"""
+        response = await fetcher.async_fetch(urls['basic_url'], network_idle=True)
+        assert response.status == 200
+
+    @pytest.mark.asyncio
+    async def test_blocking_resources(self, fetcher, urls):
+        """Test that blocking resources does not prevent the page from finishing loading"""
+        response = await fetcher.async_fetch(urls['basic_url'], disable_resources=True)
+        assert response.status == 200
+
+    @pytest.mark.asyncio
+    async def test_waiting_selector(self, fetcher, urls):
+        """Test that waiting for a selector does not prevent the page from finishing loading"""
+        response1 = await fetcher.async_fetch(urls['html_url'], wait_selector='h1')
+        assert response1.status == 200
+
+        response2 = await fetcher.async_fetch(urls['html_url'], wait_selector='h1', wait_selector_state='visible')
+        assert response2.status == 200
+
+    @pytest.mark.asyncio
+    async def test_cookies_loading(self, fetcher, urls):
+        """Test if cookies are set after the request"""
+        response = await fetcher.async_fetch(urls['cookies_url'])
+        assert response.cookies == {'test': 'value'}
+
+    @pytest.mark.asyncio
+    async def test_automation(self, fetcher, urls):
+        """Test that running page automation does not break the request"""
+        async def scroll_page(page):
+            await page.mouse.wheel(10, 0)
+            await page.mouse.move(100, 400)
+            await page.mouse.up()
+            return page
+
+        response = await fetcher.async_fetch(urls['html_url'], page_action=scroll_page)
+        assert response.status == 200
+
+    @pytest.mark.parametrize("kwargs", [
+        {"disable_webgl": True, "hide_canvas": False},
+        {"disable_webgl": False, "hide_canvas": True},
+        {"stealth": True},
+        {"useragent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'},
+        {"extra_headers": {'ayo': ''}}
+    ])
+    @pytest.mark.asyncio
+    async def test_properties(self, fetcher, urls, kwargs):
+        """Test that different argument combinations do not break the request"""
+        response = await fetcher.async_fetch(urls['html_url'], **kwargs)
+        assert response.status == 200
+
+    @pytest.mark.asyncio
+    async def test_cdp_url_invalid(self, fetcher, urls):
+        """Test if invalid CDP URLs raise appropriate exceptions"""
+        with pytest.raises(ValueError):
+            await fetcher.async_fetch(urls['html_url'], cdp_url='blahblah')
+
+        with pytest.raises(ValueError):
+            await fetcher.async_fetch(urls['html_url'], cdp_url='blahblah', nstbrowser_mode=True)
+
+        with pytest.raises(Exception):
+            await fetcher.async_fetch(urls['html_url'], cdp_url='ws://blahblah')
+
+    @pytest.mark.asyncio
+    async def test_infinite_timeout(self, fetcher, urls):
+        """Test that an infinite timeout does not break the request"""
+        response = await fetcher.async_fetch(urls['delayed_url'], timeout=None)
+        assert response.status == 200
diff --git a/tests/fetchers/sync/__init__.py b/tests/fetchers/sync/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/fetchers/sync/test_camoufox.py b/tests/fetchers/sync/test_camoufox.py
new file mode 100644
index 0000000..33800f4
--- /dev/null
+++ b/tests/fetchers/sync/test_camoufox.py
@@ -0,0 +1,68 @@
+import pytest
+import pytest_httpbin
+
+from scrapling import StealthyFetcher
+
+
+@pytest_httpbin.use_class_based_httpbin
+class TestStealthyFetcher:
+    @pytest.fixture(scope="class")
+    def fetcher(self):
+        """Fixture to create a StealthyFetcher instance for the entire test class"""
+        return StealthyFetcher(auto_match=False)
+
+    @pytest.fixture(autouse=True)
+    def setup_urls(self, httpbin):
+        """Fixture to set up URLs for testing"""
+        self.status_200 = f'{httpbin.url}/status/200'
+        self.status_404 = f'{httpbin.url}/status/404'
+        self.status_501 = f'{httpbin.url}/status/501'
+        self.basic_url = f'{httpbin.url}/get'
+        self.html_url = f'{httpbin.url}/html'
+        self.delayed_url = f'{httpbin.url}/delay/10'  # 10-second delayed response
+        self.cookies_url = f"{httpbin.url}/cookies/set/test/value"
+
+    def test_basic_fetch(self, fetcher):
+        """Test basic fetch requests against multiple status codes"""
+        assert fetcher.fetch(self.status_200).status == 200
+        assert fetcher.fetch(self.status_404).status == 404
+        assert fetcher.fetch(self.status_501).status == 501
+
+    def test_networkidle(self, fetcher):
+        """Test that waiting for `networkidle` does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.basic_url, network_idle=True).status == 200
+
+    def test_blocking_resources(self, fetcher):
+        """Test that blocking resources does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.basic_url, block_images=True).status == 200
+        assert fetcher.fetch(self.basic_url, disable_resources=True).status == 200
+
+    def test_waiting_selector(self, fetcher):
+        """Test that waiting for a selector does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.html_url, wait_selector='h1').status == 200
+        assert fetcher.fetch(self.html_url, wait_selector='h1', wait_selector_state='visible').status == 200
+
+    def test_cookies_loading(self, fetcher):
+        """Test if cookies are set after the request"""
+        assert fetcher.fetch(self.cookies_url).cookies == {'test': 'value'}
+
+    def test_automation(self, fetcher):
+        """Test that running page automation does not break the request"""
+        def scroll_page(page):
+            page.mouse.wheel(10, 0)
+            page.mouse.move(100, 400)
+            page.mouse.up()
+            return page
+
+        assert fetcher.fetch(self.html_url, page_action=scroll_page).status == 200
+
+    def test_properties(self, fetcher):
+        """Test that different argument combinations do not break the request"""
+        assert fetcher.fetch(self.html_url, block_webrtc=True, allow_webgl=True).status == 200
+        assert fetcher.fetch(self.html_url, block_webrtc=False, allow_webgl=True).status == 200
+        assert fetcher.fetch(self.html_url, block_webrtc=True, allow_webgl=False).status == 200
+        assert fetcher.fetch(self.html_url, extra_headers={'ayo': ''}, os_randomize=True).status == 200
+
+    def test_infinite_timeout(self, fetcher):
+        """Test that an infinite timeout does not break the request"""
+        assert fetcher.fetch(self.delayed_url, timeout=None).status == 200
diff --git a/tests/fetchers/sync/test_httpx.py b/tests/fetchers/sync/test_httpx.py
new file mode 100644
index 0000000..9f5ca80
--- /dev/null
+++ b/tests/fetchers/sync/test_httpx.py
@@ -0,0 +1,82 @@
+import pytest
+import pytest_httpbin
+
+from scrapling import Fetcher
+
+
+@pytest_httpbin.use_class_based_httpbin
+class TestFetcher:
+    @pytest.fixture(scope="class")
+    def fetcher(self):
+        """Fixture to create a Fetcher instance for the entire test class"""
+        return Fetcher(auto_match=False)
+
+    @pytest.fixture(autouse=True)
+    def setup_urls(self, httpbin):
+        """Fixture to set up URLs for testing"""
+        self.status_200 = f'{httpbin.url}/status/200'
+        self.status_404 = f'{httpbin.url}/status/404'
+        self.status_501 = f'{httpbin.url}/status/501'
+        self.basic_url = f'{httpbin.url}/get'
+        self.post_url = f'{httpbin.url}/post'
+        self.put_url = f'{httpbin.url}/put'
+        self.delete_url = f'{httpbin.url}/delete'
+        self.html_url = f'{httpbin.url}/html'
+
+    def test_basic_get(self, fetcher):
+        """Test basic GET requests against multiple status codes"""
+        assert fetcher.get(self.status_200).status == 200
+        assert fetcher.get(self.status_404).status == 404
+        assert fetcher.get(self.status_501).status == 501
+
+    def test_get_properties(self, fetcher):
+        """Test that different arguments do not break GET requests"""
+        assert fetcher.get(self.status_200, stealthy_headers=True).status == 200
+        assert fetcher.get(self.status_200, follow_redirects=True).status == 200
+        assert fetcher.get(self.status_200, timeout=None).status == 200
+        assert fetcher.get(
+            self.status_200,
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        ).status == 200
+
+    def test_post_properties(self, fetcher):
+        """Test that different arguments do not break POST requests"""
+        assert fetcher.post(self.post_url, data={'key': 'value'}).status == 200
+        assert fetcher.post(self.post_url, data={'key': 'value'}, stealthy_headers=True).status == 200
+        assert fetcher.post(self.post_url, data={'key': 'value'}, follow_redirects=True).status == 200
+        assert fetcher.post(self.post_url, data={'key': 'value'}, timeout=None).status == 200
+        assert fetcher.post(
+            self.post_url,
+            data={'key': 'value'},
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        ).status == 200
+
+    def test_put_properties(self, fetcher):
+        """Test that different arguments do not break PUT requests"""
+        assert fetcher.put(self.put_url, data={'key': 'value'}).status == 200
+        assert fetcher.put(self.put_url, data={'key': 'value'}, stealthy_headers=True).status == 200
+        assert fetcher.put(self.put_url, data={'key': 'value'}, follow_redirects=True).status == 200
+        assert fetcher.put(self.put_url, data={'key': 'value'}, timeout=None).status == 200
+        assert fetcher.put(
+            self.put_url,
+            data={'key': 'value'},
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        ).status == 200
+
+    def test_delete_properties(self, fetcher):
+        """Test that different arguments do not break DELETE requests"""
+        assert fetcher.delete(self.delete_url, stealthy_headers=True).status == 200
+        assert fetcher.delete(self.delete_url, follow_redirects=True).status == 200
+        assert fetcher.delete(self.delete_url, timeout=None).status == 200
+        assert fetcher.delete(
+            self.delete_url,
+            stealthy_headers=True,
+            follow_redirects=True,
+            timeout=None
+        ).status == 200
diff --git a/tests/fetchers/sync/test_playwright.py b/tests/fetchers/sync/test_playwright.py
new file mode 100644
index 0000000..e1f424c
--- /dev/null
+++ b/tests/fetchers/sync/test_playwright.py
@@ -0,0 +1,87 @@
+import pytest
+import pytest_httpbin
+
+from scrapling import PlayWrightFetcher
+
+
+@pytest_httpbin.use_class_based_httpbin
+class TestPlayWrightFetcher:
+
+    @pytest.fixture(scope="class")
+    def fetcher(self):
+        """Fixture to create a PlayWrightFetcher instance for the entire test class"""
+        return PlayWrightFetcher(auto_match=False)
+
+    @pytest.fixture(autouse=True)
+    def setup_urls(self, httpbin):
+        """Fixture to set up URLs for testing"""
+        self.status_200 = f'{httpbin.url}/status/200'
+        self.status_404 = f'{httpbin.url}/status/404'
+        self.status_501 = f'{httpbin.url}/status/501'
+        self.basic_url = f'{httpbin.url}/get'
+        self.html_url = f'{httpbin.url}/html'
+        self.delayed_url = f'{httpbin.url}/delay/10'  # 10-second delayed response
+        self.cookies_url = f"{httpbin.url}/cookies/set/test/value"
+
+    def test_basic_fetch(self, fetcher):
+        """Test basic fetch requests against multiple status codes"""
+        assert fetcher.fetch(self.status_200).status == 200
+        # There's a bug in Playwright that makes it crash when a URL returns a 4xx/5xx status code without a body; keep these disabled until the reported issue is resolved
+        # assert fetcher.fetch(self.status_404).status == 404
+        # assert fetcher.fetch(self.status_501).status == 501
+
+    def test_networkidle(self, fetcher):
+        """Test that waiting for `networkidle` does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.basic_url, network_idle=True).status == 200
+
+    def test_blocking_resources(self, fetcher):
+        """Test that blocking resources does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.basic_url, disable_resources=True).status == 200
+
+    def test_waiting_selector(self, fetcher):
+        """Test that waiting for a selector does not prevent the page from finishing loading"""
+        assert fetcher.fetch(self.html_url, wait_selector='h1').status == 200
+        assert fetcher.fetch(self.html_url, wait_selector='h1', wait_selector_state='visible').status == 200
+
+    def test_cookies_loading(self, fetcher):
+        """Test if cookies are set after the request"""
+        assert fetcher.fetch(self.cookies_url).cookies == {'test': 'value'}
+
+    def test_automation(self, fetcher):
+        """Test that running page automation does not break the request"""
+
+        def scroll_page(page):
+            page.mouse.wheel(10, 0)
+            page.mouse.move(100, 400)
+            page.mouse.up()
+            return page
+
+        assert fetcher.fetch(self.html_url, page_action=scroll_page).status == 200
+
+    @pytest.mark.parametrize("kwargs", [
+        {"disable_webgl": True, "hide_canvas": False},
+        {"disable_webgl": False, "hide_canvas": True},
+        {"stealth": True},
+        {"useragent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0'},
+        {"extra_headers": {'ayo': ''}}
+    ])
+    def test_properties(self, fetcher, kwargs):
+        """Test that different argument combinations do not break the request"""
+        response = fetcher.fetch(self.html_url, **kwargs)
+        assert response.status == 200
+
+    def test_cdp_url_invalid(self, fetcher):
+        """Test if invalid CDP URLs raise appropriate exceptions"""
+        with pytest.raises(ValueError):
+            fetcher.fetch(self.html_url, cdp_url='blahblah')
+
+        with pytest.raises(ValueError):
+            fetcher.fetch(self.html_url, cdp_url='blahblah', nstbrowser_mode=True)
+
+        with pytest.raises(Exception):
+            fetcher.fetch(self.html_url, cdp_url='ws://blahblah')
+
+    def test_infinite_timeout(self, fetcher):
+        """Test that an infinite timeout does not break the request"""
+        response = fetcher.fetch(self.delayed_url, timeout=None)
+        assert response.status == 200
diff --git a/tests/fetchers/test_camoufox.py b/tests/fetchers/test_camoufox.py
deleted file mode 100644
index fcbf3b7..0000000
--- a/tests/fetchers/test_camoufox.py
+++ /dev/null
@@ -1,65 +0,0 @@
-import unittest
-
-import pytest_httpbin
-
-from scrapling import StealthyFetcher
-
-
-@pytest_httpbin.use_class_based_httpbin
-# @pytest_httpbin.use_class_based_httpbin_secure
-class TestStealthyFetcher(unittest.TestCase):
-    def setUp(self):
-        self.fetcher = StealthyFetcher(auto_match=False)
-        url = self.httpbin.url
-        self.status_200 = f'{url}/status/200'
-        self.status_404 = f'{url}/status/404'
-        self.status_501 = f'{url}/status/501'
-        self.basic_url = f'{url}/get'
-        self.html_url = f'{url}/html'
-        self.delayed_url = f'{url}/delay/10'  # 10 Seconds delay response
-        self.cookies_url = f"{url}/cookies/set/test/value"
-
-    def test_basic_fetch(self):
-        """Test doing basic fetch request with multiple statuses"""
-        self.assertEqual(self.fetcher.fetch(self.status_200).status, 200)
-        self.assertEqual(self.fetcher.fetch(self.status_404).status, 404)
-
self.assertEqual(self.fetcher.fetch(self.status_501).status, 501) - - def test_networkidle(self): - """Test if waiting for `networkidle` make page does not finish loading or not""" - self.assertEqual(self.fetcher.fetch(self.basic_url, network_idle=True).status, 200) - - def test_blocking_resources(self): - """Test if blocking resources make page does not finish loading or not""" - self.assertEqual(self.fetcher.fetch(self.basic_url, block_images=True).status, 200) - self.assertEqual(self.fetcher.fetch(self.basic_url, disable_resources=True).status, 200) - - def test_waiting_selector(self): - """Test if waiting for a selector make page does not finish loading or not""" - self.assertEqual(self.fetcher.fetch(self.html_url, wait_selector='h1').status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, wait_selector='h1', wait_selector_state='visible').status, 200) - - def test_cookies_loading(self): - """Test if cookies are set after the request""" - self.assertEqual(self.fetcher.fetch(self.cookies_url).cookies, {'test': 'value'}) - - def test_automation(self): - """Test if automation break the code or not""" - def scroll_page(page): - page.mouse.wheel(10, 0) - page.mouse.move(100, 400) - page.mouse.up() - return page - - self.assertEqual(self.fetcher.fetch(self.html_url, page_action=scroll_page).status, 200) - - def test_properties(self): - """Test if different arguments breaks the code or not""" - self.assertEqual(self.fetcher.fetch(self.html_url, block_webrtc=True, allow_webgl=True).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, block_webrtc=False, allow_webgl=True).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, block_webrtc=True, allow_webgl=False).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, extra_headers={'ayo': ''}, os_randomize=True).status, 200) - - def test_infinite_timeout(self): - """Test if infinite timeout breaks the code or not""" - self.assertEqual(self.fetcher.fetch(self.delayed_url, timeout=None).status, 200) diff --git a/tests/fetchers/test_httpx.py b/tests/fetchers/test_httpx.py deleted file mode 100644 index 1a5cc02..0000000 --- a/tests/fetchers/test_httpx.py +++ /dev/null @@ -1,68 +0,0 @@ -import unittest - -import pytest_httpbin - -from scrapling import Fetcher - - -@pytest_httpbin.use_class_based_httpbin -class TestFetcher(unittest.TestCase): - def setUp(self): - self.fetcher = Fetcher(auto_match=False) - url = self.httpbin.url - self.status_200 = f'{url}/status/200' - self.status_404 = f'{url}/status/404' - self.status_501 = f'{url}/status/501' - self.basic_url = f'{url}/get' - self.post_url = f'{url}/post' - self.put_url = f'{url}/put' - self.delete_url = f'{url}/delete' - self.html_url = f'{url}/html' - - def test_basic_get(self): - """Test doing basic get request with multiple statuses""" - self.assertEqual(self.fetcher.get(self.status_200).status, 200) - self.assertEqual(self.fetcher.get(self.status_404).status, 404) - self.assertEqual(self.fetcher.get(self.status_501).status, 501) - - def test_get_properties(self): - """Test if different arguments with GET request breaks the code or not""" - self.assertEqual(self.fetcher.get(self.status_200, stealthy_headers=True).status, 200) - self.assertEqual(self.fetcher.get(self.status_200, follow_redirects=True).status, 200) - self.assertEqual(self.fetcher.get(self.status_200, timeout=None).status, 200) - self.assertEqual( - self.fetcher.get(self.status_200, stealthy_headers=True, follow_redirects=True, timeout=None).status, - 200 - ) - - def 
test_post_properties(self): - """Test if different arguments with POST request breaks the code or not""" - self.assertEqual(self.fetcher.post(self.post_url, data={'key': 'value'}).status, 200) - self.assertEqual(self.fetcher.post(self.post_url, data={'key': 'value'}, stealthy_headers=True).status, 200) - self.assertEqual(self.fetcher.post(self.post_url, data={'key': 'value'}, follow_redirects=True).status, 200) - self.assertEqual(self.fetcher.post(self.post_url, data={'key': 'value'}, timeout=None).status, 200) - self.assertEqual( - self.fetcher.post(self.post_url, data={'key': 'value'}, stealthy_headers=True, follow_redirects=True, timeout=None).status, - 200 - ) - - def test_put_properties(self): - """Test if different arguments with PUT request breaks the code or not""" - self.assertEqual(self.fetcher.put(self.put_url, data={'key': 'value'}).status, 200) - self.assertEqual(self.fetcher.put(self.put_url, data={'key': 'value'}, stealthy_headers=True).status, 200) - self.assertEqual(self.fetcher.put(self.put_url, data={'key': 'value'}, follow_redirects=True).status, 200) - self.assertEqual(self.fetcher.put(self.put_url, data={'key': 'value'}, timeout=None).status, 200) - self.assertEqual( - self.fetcher.put(self.put_url, data={'key': 'value'}, stealthy_headers=True, follow_redirects=True, timeout=None).status, - 200 - ) - - def test_delete_properties(self): - """Test if different arguments with DELETE request breaks the code or not""" - self.assertEqual(self.fetcher.delete(self.delete_url, stealthy_headers=True).status, 200) - self.assertEqual(self.fetcher.delete(self.delete_url, follow_redirects=True).status, 200) - self.assertEqual(self.fetcher.delete(self.delete_url, timeout=None).status, 200) - self.assertEqual( - self.fetcher.delete(self.delete_url, stealthy_headers=True, follow_redirects=True, timeout=None).status, - 200 - ) diff --git a/tests/fetchers/test_playwright.py b/tests/fetchers/test_playwright.py deleted file mode 100644 index dda30e0..0000000 --- a/tests/fetchers/test_playwright.py +++ /dev/null @@ -1,77 +0,0 @@ -import unittest - -import pytest_httpbin - -from scrapling import PlayWrightFetcher - - -@pytest_httpbin.use_class_based_httpbin -# @pytest_httpbin.use_class_based_httpbin_secure -class TestPlayWrightFetcher(unittest.TestCase): - def setUp(self): - self.fetcher = PlayWrightFetcher(auto_match=False) - url = self.httpbin.url - self.status_200 = f'{url}/status/200' - self.status_404 = f'{url}/status/404' - self.status_501 = f'{url}/status/501' - self.basic_url = f'{url}/get' - self.html_url = f'{url}/html' - self.delayed_url = f'{url}/delay/10' # 10 Seconds delay response - self.cookies_url = f"{url}/cookies/set/test/value" - - def test_basic_fetch(self): - """Test doing basic fetch request with multiple statuses""" - self.assertEqual(self.fetcher.fetch(self.status_200).status, 200) - self.assertEqual(self.fetcher.fetch(self.status_404).status, 404) - self.assertEqual(self.fetcher.fetch(self.status_501).status, 501) - - def test_networkidle(self): - """Test if waiting for `networkidle` make page does not finish loading or not""" - self.assertEqual(self.fetcher.fetch(self.basic_url, network_idle=True).status, 200) - - def test_blocking_resources(self): - """Test if blocking resources make page does not finish loading or not""" - self.assertEqual(self.fetcher.fetch(self.basic_url, disable_resources=True).status, 200) - - def test_waiting_selector(self): - """Test if waiting for a selector make page does not finish loading or not""" - 
self.assertEqual(self.fetcher.fetch(self.html_url, wait_selector='h1').status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, wait_selector='h1', wait_selector_state='visible').status, 200) - - def test_cookies_loading(self): - """Test if cookies are set after the request""" - self.assertEqual(self.fetcher.fetch(self.cookies_url).cookies, {'test': 'value'}) - - def test_automation(self): - """Test if automation break the code or not""" - def scroll_page(page): - page.mouse.wheel(10, 0) - page.mouse.move(100, 400) - page.mouse.up() - return page - - self.assertEqual(self.fetcher.fetch(self.html_url, page_action=scroll_page).status, 200) - - def test_properties(self): - """Test if different arguments breaks the code or not""" - self.assertEqual(self.fetcher.fetch(self.html_url, disable_webgl=True, hide_canvas=False).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, disable_webgl=False, hide_canvas=True).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, stealth=True).status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, useragent='Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0').status, 200) - self.assertEqual(self.fetcher.fetch(self.html_url, extra_headers={'ayo': ''}).status, 200) - - def test_cdp_url(self): - """Test if it's going to try to connect to cdp url or not""" - with self.assertRaises(ValueError): - _ = self.fetcher.fetch(self.html_url, cdp_url='blahblah') - - with self.assertRaises(ValueError): - _ = self.fetcher.fetch(self.html_url, cdp_url='blahblah', nstbrowser_mode=True) - - with self.assertRaises(Exception): - # There's no type for this error in PlayWright, it's just `Error` - _ = self.fetcher.fetch(self.html_url, cdp_url='ws://blahblah') - - def test_infinite_timeout(self): - """Test if infinite timeout breaks the code or not""" - self.assertEqual(self.fetcher.fetch(self.delayed_url, timeout=None).status, 200) diff --git a/tests/fetchers/test_utils.py b/tests/fetchers/test_utils.py index 5fc1906..044c9b5 100644 --- a/tests/fetchers/test_utils.py +++ b/tests/fetchers/test_utils.py @@ -1,129 +1,97 @@ -import unittest +import pytest from scrapling.engines.toolbelt.custom import ResponseEncoding, StatusText -class TestPlayWrightFetcher(unittest.TestCase): - def setUp(self): - self.content_type_map = { - # A map generated by ChatGPT for most possible `content_type` values and the expected outcome - 'text/html; charset=UTF-8': 'UTF-8', - 'text/html; charset=ISO-8859-1': 'ISO-8859-1', - 'text/html': 'ISO-8859-1', - 'application/json; charset=UTF-8': 'UTF-8', - 'application/json': 'utf-8', - 'text/json': 'utf-8', - 'application/javascript; charset=UTF-8': 'UTF-8', - 'application/javascript': 'utf-8', - 'text/plain; charset=UTF-8': 'UTF-8', - 'text/plain; charset=ISO-8859-1': 'ISO-8859-1', - 'text/plain': 'ISO-8859-1', - 'application/xhtml+xml; charset=UTF-8': 'UTF-8', - 'application/xhtml+xml': 'utf-8', - 'text/html; charset=windows-1252': 'windows-1252', - 'application/json; charset=windows-1252': 'windows-1252', - 'text/plain; charset=windows-1252': 'windows-1252', - 'text/html; charset="UTF-8"': 'UTF-8', - 'text/html; charset="ISO-8859-1"': 'ISO-8859-1', - 'text/html; charset="windows-1252"': 'windows-1252', - 'application/json; charset="UTF-8"': 'UTF-8', - 'application/json; charset="ISO-8859-1"': 'ISO-8859-1', - 'application/json; charset="windows-1252"': 'windows-1252', - 'text/json; charset="UTF-8"': 'UTF-8', - 'application/javascript; charset="UTF-8"': 'UTF-8', - 
'application/javascript; charset="ISO-8859-1"': 'ISO-8859-1', - 'text/plain; charset="UTF-8"': 'UTF-8', - 'text/plain; charset="ISO-8859-1"': 'ISO-8859-1', - 'text/plain; charset="windows-1252"': 'windows-1252', - 'application/xhtml+xml; charset="UTF-8"': 'UTF-8', - 'application/xhtml+xml; charset="ISO-8859-1"': 'ISO-8859-1', - 'application/xhtml+xml; charset="windows-1252"': 'windows-1252', - 'text/html; charset="US-ASCII"': 'US-ASCII', - 'application/json; charset="US-ASCII"': 'US-ASCII', - 'text/plain; charset="US-ASCII"': 'US-ASCII', - 'text/html; charset="Shift_JIS"': 'Shift_JIS', - 'application/json; charset="Shift_JIS"': 'Shift_JIS', - 'text/plain; charset="Shift_JIS"': 'Shift_JIS', - 'application/xml; charset="UTF-8"': 'UTF-8', - 'application/xml; charset="ISO-8859-1"': 'ISO-8859-1', - 'application/xml': 'utf-8', - 'text/xml; charset="UTF-8"': 'UTF-8', - 'text/xml; charset="ISO-8859-1"': 'ISO-8859-1', - 'text/xml': 'utf-8' - } - self.status_map = { - 100: "Continue", - 101: "Switching Protocols", - 102: "Processing", - 103: "Early Hints", - 200: "OK", - 201: "Created", - 202: "Accepted", - 203: "Non-Authoritative Information", - 204: "No Content", - 205: "Reset Content", - 206: "Partial Content", - 207: "Multi-Status", - 208: "Already Reported", - 226: "IM Used", - 300: "Multiple Choices", - 301: "Moved Permanently", - 302: "Found", - 303: "See Other", - 304: "Not Modified", - 305: "Use Proxy", - 307: "Temporary Redirect", - 308: "Permanent Redirect", - 400: "Bad Request", - 401: "Unauthorized", - 402: "Payment Required", - 403: "Forbidden", - 404: "Not Found", - 405: "Method Not Allowed", - 406: "Not Acceptable", - 407: "Proxy Authentication Required", - 408: "Request Timeout", - 409: "Conflict", - 410: "Gone", - 411: "Length Required", - 412: "Precondition Failed", - 413: "Payload Too Large", - 414: "URI Too Long", - 415: "Unsupported Media Type", - 416: "Range Not Satisfiable", - 417: "Expectation Failed", - 418: "I'm a teapot", - 421: "Misdirected Request", - 422: "Unprocessable Entity", - 423: "Locked", - 424: "Failed Dependency", - 425: "Too Early", - 426: "Upgrade Required", - 428: "Precondition Required", - 429: "Too Many Requests", - 431: "Request Header Fields Too Large", - 451: "Unavailable For Legal Reasons", - 500: "Internal Server Error", - 501: "Not Implemented", - 502: "Bad Gateway", - 503: "Service Unavailable", - 504: "Gateway Timeout", - 505: "HTTP Version Not Supported", - 506: "Variant Also Negotiates", - 507: "Insufficient Storage", - 508: "Loop Detected", - 510: "Not Extended", - 511: "Network Authentication Required" - } +@pytest.fixture +def content_type_map(): + return { + # A map generated by ChatGPT for most possible `content_type` values and the expected outcome + 'text/html; charset=UTF-8': 'UTF-8', + 'text/html; charset=ISO-8859-1': 'ISO-8859-1', + 'text/html': 'ISO-8859-1', + 'application/json; charset=UTF-8': 'UTF-8', + 'application/json': 'utf-8', + 'text/json': 'utf-8', + 'application/javascript; charset=UTF-8': 'UTF-8', + 'application/javascript': 'utf-8', + 'text/plain; charset=UTF-8': 'UTF-8', + 'text/plain; charset=ISO-8859-1': 'ISO-8859-1', + 'text/plain': 'ISO-8859-1', + 'application/xhtml+xml; charset=UTF-8': 'UTF-8', + 'application/xhtml+xml': 'utf-8', + 'text/html; charset=windows-1252': 'windows-1252', + 'application/json; charset=windows-1252': 'windows-1252', + 'text/plain; charset=windows-1252': 'windows-1252', + 'text/html; charset="UTF-8"': 'UTF-8', + 'text/html; charset="ISO-8859-1"': 'ISO-8859-1', + 'text/html; 
charset="windows-1252"': 'windows-1252', + 'application/json; charset="UTF-8"': 'UTF-8', + 'application/json; charset="ISO-8859-1"': 'ISO-8859-1', + 'application/json; charset="windows-1252"': 'windows-1252', + 'text/json; charset="UTF-8"': 'UTF-8', + 'application/javascript; charset="UTF-8"': 'UTF-8', + 'application/javascript; charset="ISO-8859-1"': 'ISO-8859-1', + 'text/plain; charset="UTF-8"': 'UTF-8', + 'text/plain; charset="ISO-8859-1"': 'ISO-8859-1', + 'text/plain; charset="windows-1252"': 'windows-1252', + 'application/xhtml+xml; charset="UTF-8"': 'UTF-8', + 'application/xhtml+xml; charset="ISO-8859-1"': 'ISO-8859-1', + 'application/xhtml+xml; charset="windows-1252"': 'windows-1252', + 'text/html; charset="US-ASCII"': 'US-ASCII', + 'application/json; charset="US-ASCII"': 'US-ASCII', + 'text/plain; charset="US-ASCII"': 'US-ASCII', + 'text/html; charset="Shift_JIS"': 'Shift_JIS', + 'application/json; charset="Shift_JIS"': 'Shift_JIS', + 'text/plain; charset="Shift_JIS"': 'Shift_JIS', + 'application/xml; charset="UTF-8"': 'UTF-8', + 'application/xml; charset="ISO-8859-1"': 'ISO-8859-1', + 'application/xml': 'utf-8', + 'text/xml; charset="UTF-8"': 'UTF-8', + 'text/xml; charset="ISO-8859-1"': 'ISO-8859-1', + 'text/xml': 'utf-8' + } - def test_parsing_content_type(self): - """Test if parsing different types of content-type returns the expected result""" - for header_value, expected_encoding in self.content_type_map.items(): - self.assertEqual(ResponseEncoding.get_value(header_value), expected_encoding) - def test_parsing_response_status(self): - """Test if using different http responses' status codes returns the expected result""" - for status_code, expected_status_text in self.status_map.items(): - self.assertEqual(StatusText.get(status_code), expected_status_text) +@pytest.fixture +def status_map(): + return { + 100: "Continue", 101: "Switching Protocols", 102: "Processing", 103: "Early Hints", + 200: "OK", 201: "Created", 202: "Accepted", 203: "Non-Authoritative Information", + 204: "No Content", 205: "Reset Content", 206: "Partial Content", 207: "Multi-Status", + 208: "Already Reported", 226: "IM Used", 300: "Multiple Choices", + 301: "Moved Permanently", 302: "Found", 303: "See Other", 304: "Not Modified", + 305: "Use Proxy", 307: "Temporary Redirect", 308: "Permanent Redirect", + 400: "Bad Request", 401: "Unauthorized", 402: "Payment Required", 403: "Forbidden", + 404: "Not Found", 405: "Method Not Allowed", 406: "Not Acceptable", + 407: "Proxy Authentication Required", 408: "Request Timeout", 409: "Conflict", + 410: "Gone", 411: "Length Required", 412: "Precondition Failed", + 413: "Payload Too Large", 414: "URI Too Long", 415: "Unsupported Media Type", + 416: "Range Not Satisfiable", 417: "Expectation Failed", 418: "I'm a teapot", + 421: "Misdirected Request", 422: "Unprocessable Entity", 423: "Locked", + 424: "Failed Dependency", 425: "Too Early", 426: "Upgrade Required", + 428: "Precondition Required", 429: "Too Many Requests", + 431: "Request Header Fields Too Large", 451: "Unavailable For Legal Reasons", + 500: "Internal Server Error", 501: "Not Implemented", 502: "Bad Gateway", + 503: "Service Unavailable", 504: "Gateway Timeout", + 505: "HTTP Version Not Supported", 506: "Variant Also Negotiates", + 507: "Insufficient Storage", 508: "Loop Detected", 510: "Not Extended", + 511: "Network Authentication Required" + } - self.assertEqual(StatusText.get(1000), "Unknown Status Code") + +def test_parsing_content_type(content_type_map): + """Test if parsing different types of 
content-type returns the expected result""" + for header_value, expected_encoding in content_type_map.items(): + assert ResponseEncoding.get_value(header_value) == expected_encoding + + +def test_parsing_response_status(status_map): + """Test if using different http responses' status codes returns the expected result""" + for status_code, expected_status_text in status_map.items(): + assert StatusText.get(status_code) == expected_status_text + + +def test_unknown_status_code(): + """Test handling of an unknown status code""" + assert StatusText.get(1000) == "Unknown Status Code" diff --git a/tests/parser/test_automatch.py b/tests/parser/test_automatch.py index 1e78e87..38ad88b 100644 --- a/tests/parser/test_automatch.py +++ b/tests/parser/test_automatch.py @@ -1,10 +1,11 @@ -import unittest +import asyncio -from scrapling import Adaptor +import pytest +from scrapling import Adaptor -class TestParserAutoMatch(unittest.TestCase): +class TestParserAutoMatch: def test_element_relocation(self): """Test relocating element after structure change""" original_html = ''' @@ -42,15 +43,69 @@ def test_element_relocation(self): ''' - old_page = Adaptor(original_html, url='example.com', auto_match=True, debug=True) - new_page = Adaptor(changed_html, url='example.com', auto_match=True, debug=True) + old_page = Adaptor(original_html, url='example.com', auto_match=True) + new_page = Adaptor(changed_html, url='example.com', auto_match=True) + + # 'p1' was used as ID and now it's not and all the path elements have changes + # Also at the same time testing auto-match vs combined selectors + _ = old_page.css('#p1, #p2', auto_save=True)[0] + relocated = new_page.css('#p1', auto_match=True) + + assert relocated is not None + assert relocated[0].attrib['data-id'] == 'p1' + assert relocated[0].has_class('new-class') + assert relocated[0].css('.new-description')[0].text == 'Description 1' + + @pytest.mark.asyncio + async def test_element_relocation_async(self): + """Test relocating element after structure change in async mode""" + original_html = ''' +
+
+
+

Product 1

+

Description 1

+
+
+

Product 2

+

Description 2

+
+
+
+ ''' + changed_html = ''' +
+
+
+
+
+

Product 1

+

Description 1

+
+
+
+
+

Product 2

+

Description 2

+
+
+
+
+
+ ''' + + # Simulate async operation + await asyncio.sleep(0.1) # Minimal async operation + + old_page = Adaptor(original_html, url='example.com', auto_match=True) + new_page = Adaptor(changed_html, url='example.com', auto_match=True) # 'p1' was used as ID and now it's not and all the path elements have changes # Also at the same time testing auto-match vs combined selectors _ = old_page.css('#p1, #p2', auto_save=True)[0] relocated = new_page.css('#p1', auto_match=True) - self.assertIsNotNone(relocated) - self.assertEqual(relocated[0].attrib['data-id'], 'p1') - self.assertTrue(relocated[0].has_class('new-class')) - self.assertEqual(relocated[0].css('.new-description')[0].text, 'Description 1') + assert relocated is not None + assert relocated[0].attrib['data-id'] == 'p1' + assert relocated[0].has_class('new-class') + assert relocated[0].css('.new-description')[0].text == 'Description 1' diff --git a/tests/parser/test_general.py b/tests/parser/test_general.py index ea1fb78..62c9fde 100644 --- a/tests/parser/test_general.py +++ b/tests/parser/test_general.py @@ -1,288 +1,330 @@ - import pickle -import unittest +import time +import pytest from cssselect import SelectorError, SelectorSyntaxError from scrapling import Adaptor -class TestParser(unittest.TestCase): - def setUp(self): - self.html = ''' - - - Complex Web Page - - - -
- -
-
-
-

Products

-
-
-

Product 1

-

This is product 1

- $10.99 - -
-
-

Product 2

-

This is product 2

- $20.99 - -
-
-

Product 3

-

This is product 3

- $15.99 - -
+@pytest.fixture +def html_content(): + return ''' + + + Complex Web Page + + + +
+ +
+
+
+

Products

+
+
+

Product 1

+

This is product 1

+ $10.99 + +
+
+

Product 2

+

This is product 2

+ $20.99 + +
+
+

Product 3

+

This is product 3

+ $15.99 + +
+
+
+
+

Customer Reviews

+
+
+

Great product!

+ John Doe
-
-
-

Customer Reviews

-
-
-

Great product!

- John Doe -
-
-

Good value for money.

- Jane Smith -
+
+

Good value for money.

+ Jane Smith
-
-
-
-

© 2024 Our Company

-
- - - - ''' - self.page = Adaptor(self.html, auto_match=False, debug=False) - - def test_css_selector(self): - """Test Selecting elements with complex CSS selectors""" - elements = self.page.css('main #products .product-list article.product') - self.assertEqual(len(elements), 3) - - in_stock_products = self.page.css( +
+
+
+
+

© 2024 Our Company

+
+ + + + ''' + + +@pytest.fixture +def page(html_content): + return Adaptor(html_content, auto_match=False) + + +# CSS Selector Tests +class TestCSSSelectors: + def test_basic_product_selection(self, page): + """Test selecting all product elements""" + elements = page.css('main #products .product-list article.product') + assert len(elements) == 3 + + def test_in_stock_product_selection(self, page): + """Test selecting in-stock products""" + in_stock_products = page.css( 'main #products .product-list article.product:not(:contains("Out of stock"))') - self.assertEqual(len(in_stock_products), 2) + assert len(in_stock_products) == 2 + - def test_xpath_selector(self): - """Test Selecting elements with Complex XPath selectors""" - reviews = self.page.xpath( +# XPath Selector Tests +class TestXPathSelectors: + def test_high_rating_reviews(self, page): + """Test selecting reviews with high ratings""" + reviews = page.xpath( '//section[@id="reviews"]//div[contains(@class, "review") and @data-rating >= 4]' ) - self.assertEqual(len(reviews), 2) + assert len(reviews) == 2 - high_priced_products = self.page.xpath( + def test_high_priced_products(self, page): + """Test selecting products above a certain price""" + high_priced_products = page.xpath( '//article[contains(@class, "product")]' '[number(translate(substring-after(.//span[@class="price"], "$"), ",", "")) > 15]' ) - self.assertEqual(len(high_priced_products), 2) + assert len(high_priced_products) == 2 + + +# Text Matching Tests +class TestTextMatching: + def test_regex_multiple_matches(self, page): + """Test finding multiple matches with regex""" + stock_info = page.find_by_regex(r'In stock: \d+', first_match=False) + assert len(stock_info) == 2 - def test_find_by_text(self): - """Test Selecting elements with Text matching""" - stock_info = self.page.find_by_regex(r'In stock: \d+', first_match=False) - self.assertEqual(len(stock_info), 2) + def test_regex_first_match(self, page): + """Test finding the first match with regex""" + stock_info = page.find_by_regex(r'In stock: \d+', first_match=True, case_sensitive=True) + assert stock_info.text == 'In stock: 5' - stock_info = self.page.find_by_regex(r'In stock: \d+', first_match=True, case_sensitive=True) - self.assertEqual(stock_info.text, 'In stock: 5') + def test_partial_text_match(self, page): + """Test finding elements with partial text match""" + stock_info = page.find_by_text(r'In stock:', partial=True, first_match=False) + assert len(stock_info) == 2 - stock_info = self.page.find_by_text(r'In stock:', partial=True, first_match=False) - self.assertEqual(len(stock_info), 2) + def test_exact_text_match(self, page): + """Test finding elements with exact text match""" + out_of_stock = page.find_by_text('Out of stock', partial=False, first_match=False) + assert len(out_of_stock) == 1 - out_of_stock = self.page.find_by_text('Out of stock', partial=False, first_match=False) - self.assertEqual(len(out_of_stock), 1) - def test_find_similar_elements(self): - """Test Finding similar elements of an element""" - first_product = self.page.css_first('.product') +# Similar Elements Tests +class TestSimilarElements: + def test_finding_similar_products(self, page): + """Test finding similar product elements""" + first_product = page.css_first('.product') similar_products = first_product.find_similar() - self.assertEqual(len(similar_products), 2) + assert len(similar_products) == 2 - first_review = self.page.find('div', class_='review') + def test_finding_similar_reviews(self, page): + """Test finding similar 
review elements with additional filtering""" + first_review = page.find('div', class_='review') similar_high_rated_reviews = [ review for review in first_review.find_similar() if int(review.attrib.get('data-rating', 0)) >= 4 ] - self.assertEqual(len(similar_high_rated_reviews), 1) + assert len(similar_high_rated_reviews) == 1 - def test_expected_errors(self): - """Test errors that should raised if it does""" - with self.assertRaises(ValueError): + +# Error Handling Tests +class TestErrorHandling: + def test_invalid_adaptor_initialization(self): + """Test various invalid Adaptor initializations""" + # No arguments + with pytest.raises(ValueError): _ = Adaptor(auto_match=False) - with self.assertRaises(TypeError): + # Invalid argument types + with pytest.raises(TypeError): _ = Adaptor(root="ayo", auto_match=False) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): _ = Adaptor(text=1, auto_match=False) - with self.assertRaises(TypeError): + with pytest.raises(TypeError): _ = Adaptor(body=1, auto_match=False) - with self.assertRaises(ValueError): - _ = Adaptor(self.html, storage=object, auto_match=True) - - def test_pickleable(self): - """Test that objects aren't pickleable""" - table = self.page.css('.product-list')[0] - with self.assertRaises(TypeError): # Adaptors - pickle.dumps(table) - - with self.assertRaises(TypeError): # Adaptor - pickle.dumps(table[0]) - - def test_overridden(self): - """Test overridden functions""" - table = self.page.css('.product-list')[0] - self.assertTrue(issubclass(type(table.__str__()), str)) - self.assertTrue(issubclass(type(table.__repr__()), str)) - self.assertTrue(issubclass(type(table.attrib.__str__()), str)) - self.assertTrue(issubclass(type(table.attrib.__repr__()), str)) - - def test_bad_selector(self): - """Test object can handle bad selector""" - with self.assertRaises((SelectorError, SelectorSyntaxError,)): - self.page.css('4 ayo') + def test_invalid_storage(self, page, html_content): + """Test invalid storage parameter""" + with pytest.raises(ValueError): + _ = Adaptor(html_content, storage=object, auto_match=True) - with self.assertRaises((SelectorError, SelectorSyntaxError,)): - self.page.xpath('4 ayo') + def test_bad_selectors(self, page): + """Test handling of invalid selectors""" + with pytest.raises((SelectorError, SelectorSyntaxError)): + page.css('4 ayo') - def test_selectors_generation(self): - """Try to create selectors for all elements in the page""" - def _traverse(element: Adaptor): - self.assertTrue(type(element.generate_css_selector) is str) - self.assertTrue(type(element.generate_xpath_selector) is str) - for branch in element.children: - _traverse(branch) + with pytest.raises((SelectorError, SelectorSyntaxError)): + page.xpath('4 ayo') - _traverse(self.page) - def test_getting_all_text(self): - """Test getting all text""" - self.assertNotEqual(self.page.get_all_text(), '') - - def test_element_navigation(self): - """Test moving in the page from selected element""" - table = self.page.css('.product-list')[0] +# Pickling and Object Representation Tests +class TestPicklingAndRepresentation: + def test_unpickleable_objects(self, page): + """Test that Adaptor objects cannot be pickled""" + table = page.css('.product-list')[0] + with pytest.raises(TypeError): + pickle.dumps(table) - self.assertIsNot(table.path, []) - self.assertNotEqual(table.html_content, '') - self.assertNotEqual(table.prettify(), '') + with pytest.raises(TypeError): + pickle.dumps(table[0]) + def test_string_representations(self, page): + """Test 
custom string representations of objects""" + table = page.css('.product-list')[0] + assert issubclass(type(table.__str__()), str) + assert issubclass(type(table.__repr__()), str) + assert issubclass(type(table.attrib.__str__()), str) + assert issubclass(type(table.attrib.__repr__()), str) + + +# Navigation and Traversal Tests +class TestElementNavigation: + def test_basic_navigation_properties(self, page): + """Test basic navigation properties of elements""" + table = page.css('.product-list')[0] + assert table.path is not None + assert table.html_content != '' + assert table.prettify() != '' + + def test_parent_and_sibling_navigation(self, page): + """Test parent and sibling navigation""" + table = page.css('.product-list')[0] parent = table.parent - self.assertEqual(parent.attrib['id'], 'products') - - children = table.children - self.assertEqual(len(children), 3) + assert parent.attrib['id'] == 'products' parent_siblings = parent.siblings - self.assertEqual(len(parent_siblings), 1) + assert len(parent_siblings) == 1 + + def test_child_navigation(self, page): + """Test child navigation""" + table = page.css('.product-list')[0] + children = table.children + assert len(children) == 3 - child = table.find({'data-id': "1"}) + def test_next_and_previous_navigation(self, page): + """Test next and previous element navigation""" + child = page.css('.product-list')[0].find({'data-id': "1"}) next_element = child.next - self.assertEqual(next_element.attrib['data-id'], '2') + assert next_element.attrib['data-id'] == '2' prev_element = next_element.previous - self.assertEqual(prev_element.tag, child.tag) + assert prev_element.tag == child.tag - all_prices = self.page.css('.price') + def test_ancestor_finding(self, page): + """Test finding ancestors of elements""" + all_prices = page.css('.price') products_with_prices = [ price.find_ancestor(lambda p: p.has_class('product')) for price in all_prices ] - self.assertEqual(len(products_with_prices), 3) - - def test_empty_return(self): - """Test cases where functions shouldn't have results""" - test_html = """ - - - - """ - soup = Adaptor(test_html, auto_match=False, keep_comments=False) - html_tag = soup.css('html')[0] - self.assertEqual(html_tag.path, []) - self.assertEqual(html_tag.siblings, []) - self.assertEqual(html_tag.parent, None) - self.assertEqual(html_tag.find_ancestor(lambda e: e), None) - - self.assertEqual(soup.css('#a a')[0].next, None) - self.assertEqual(soup.css('#b a')[0].previous, None) - - def test_text_to_json(self): - """Test converting text to json""" - script_content = self.page.css('#page-data::text')[0] - self.assertTrue(issubclass(type(script_content.sort()), str)) + assert len(products_with_prices) == 3 + + +# JSON and Attribute Tests +class TestJSONAndAttributes: + def test_json_conversion(self, page): + """Test converting content to JSON""" + script_content = page.css('#page-data::text')[0] + assert issubclass(type(script_content.sort()), str) page_data = script_content.json() - self.assertEqual(page_data['totalProducts'], 3) - self.assertTrue('lastUpdated' in page_data) - - def test_regex_on_text(self): - """Test doing regex on a selected text""" - element = self.page.css('[data-id="1"] .price')[0] - match = element.re_first(r'[\.\d]+') - self.assertEqual(match, '10.99') - match = element.text.re(r'(\d+)', replace_entities=False) - self.assertEqual(len(match), 2) - - def test_attribute_operations(self): - """Test operations on elements attributes""" - products = self.page.css('.product') + assert page_data['totalProducts'] 
== 3 + assert 'lastUpdated' in page_data + + def test_attribute_operations(self, page): + """Test various attribute-related operations""" + # Product ID extraction + products = page.css('.product') product_ids = [product.attrib['data-id'] for product in products] - self.assertEqual(product_ids, ['1', '2', '3']) - self.assertTrue('data-id' in products[0].attrib) + assert product_ids == ['1', '2', '3'] + assert 'data-id' in products[0].attrib - reviews = self.page.css('.review') + # Review rating calculations + reviews = page.css('.review') review_ratings = [int(review.attrib['data-rating']) for review in reviews] - self.assertEqual(sum(review_ratings) / len(review_ratings), 4.5) + assert sum(review_ratings) / len(review_ratings) == 4.5 + # Attribute searching key_value = list(products[0].attrib.search_values('1', partial=False)) - self.assertEqual(list(key_value[0].keys()), ['data-id']) + assert list(key_value[0].keys()) == ['data-id'] key_value = list(products[0].attrib.search_values('1', partial=True)) - self.assertEqual(list(key_value[0].keys()), ['data-id']) + assert list(key_value[0].keys()) == ['data-id'] + + # JSON attribute conversion + attr_json = page.css_first('#products').attrib['schema'].json() + assert attr_json == {'jsonable': 'data'} + assert isinstance(page.css('#products')[0].attrib.json_string, bytes) + + +# Performance Test +def test_large_html_parsing_performance(): + """Test parsing and selecting performance on large HTML""" + large_html = '' + '
' * 5000 + '
' * 5000 + '' + + start_time = time.time() + parsed = Adaptor(large_html, auto_match=False) + elements = parsed.css('.item') + end_time = time.time() + + assert len(elements) == 5000 + # Converting 5000 elements to a class and doing operations on them will take time + # Based on my tests with 100 runs, 1 loop each Scrapling (given the extra work/features) takes 10.4ms on average + assert end_time - start_time < 0.5 # Locally I test on 0.1 but on GitHub actions with browsers and threading sometimes closing adds fractions of seconds + + +# Selector Generation Test +def test_selectors_generation(page): + """Try to create selectors for all elements in the page""" - attr_json = self.page.css_first('#products').attrib['schema'].json() - self.assertEqual(attr_json, {'jsonable': 'data'}) - self.assertEqual(type(self.page.css('#products')[0].attrib.json_string), bytes) + def _traverse(element: Adaptor): + assert isinstance(element.generate_css_selector, str) + assert isinstance(element.generate_xpath_selector, str) + for branch in element.children: + _traverse(branch) - def test_performance(self): - """Test parsing and selecting speed""" - import time - large_html = '' + '
' * 5000 + '
' * 5000 + '' + _traverse(page) - start_time = time.time() - parsed = Adaptor(large_html, auto_match=False, debug=False) - elements = parsed.css('.item') - end_time = time.time() - self.assertEqual(len(elements), 5000) - # Converting 5000 elements to a class and doing operations on them will take time - # Based on my tests with 100 runs, 1 loop each Scrapling (given the extra work/features) takes 10.4ms on average - self.assertLess(end_time - start_time, 0.5) # Locally I test on 0.1 but on GitHub actions with browsers and threading sometimes closing adds fractions of seconds +# Miscellaneous Tests +def test_getting_all_text(page): + """Test getting all text from the page""" + assert page.get_all_text() != '' -# Use `coverage run -m unittest --verbose tests/test_parser_functions.py` instead for the coverage report -# if __name__ == '__main__': -# unittest.main(verbosity=2) +def test_regex_on_text(page): + """Test regex operations on text""" + element = page.css('[data-id="1"] .price')[0] + match = element.re_first(r'[\.\d]+') + assert match == '10.99' + match = element.text.re(r'(\d+)', replace_entities=False) + assert len(match) == 2 diff --git a/tests/requirements.txt b/tests/requirements.txt index 394a2ea..52f672c 100644 --- a/tests/requirements.txt +++ b/tests/requirements.txt @@ -4,5 +4,6 @@ playwright camoufox werkzeug<3.0.0 pytest-httpbin==2.1.0 +pytest-asyncio httpbin~=0.10.0 pytest-xdist diff --git a/tox.ini b/tox.ini index 28b09e1..b78af38 100644 --- a/tox.ini +++ b/tox.ini @@ -4,7 +4,7 @@ # and then run "tox" from this directory. [tox] -envlist = pre-commit,py{38,39,310,311,312,313} +envlist = pre-commit,py{39,310,311,312,313} [testenv] usedevelop = True @@ -15,8 +15,7 @@ commands = playwright install chromium playwright install-deps chromium firefox camoufox fetch --browserforge - py38: pytest --config-file=pytest.ini --cov=scrapling --cov-report=xml - py{39,310,311,312,313}: pytest --config-file=pytest.ini --cov=scrapling --cov-report=xml -n auto + pytest --cov=scrapling --cov-report=xml -n auto [testenv:pre-commit] basepython = python3