
v0.2.7 #18

Merged (10 commits) on Nov 26, 2024
README.md: 20 changes (14 additions, 6 deletions)
@@ -44,10 +44,11 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
* [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
* [Extraction By Text Speed Test](#extraction-by-text-speed-test)
* [Installation](#installation)
* [Fetching Websites Features](#fetching-websites-features)
* [Fetcher](#fetcher)
* [StealthyFetcher](#stealthyfetcher)
* [PlayWrightFetcher](#playwrightfetcher)
* [Fetching Websites](#fetching-websites)
* [Features](#features)
* [Fetcher class](#fetcher)
* [StealthyFetcher class](#stealthyfetcher)
* [PlayWrightFetcher class](#playwrightfetcher)
* [Advanced Parsing Features](#advanced-parsing-features)
* [Smart Navigation](#smart-navigation)
* [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
@@ -210,7 +211,10 @@ playwright install chromium
python -m browserforge update
```

## Fetching Websites Features
## Fetching Websites
Fetchers are interfaces that make the request or fetch the page for you in a single-request fashion and then return an `Adaptor` object. This feature was introduced because, previously, the only option was to fetch the page however you wanted, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page. The sketch below contrasts the two workflows.
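To make the difference concrete, here is a minimal sketch of both workflows (the `httpx` call and the exact `Adaptor` constructor arguments here are assumptions for illustration):
```python
import httpx
from scrapling import Adaptor, Fetcher

# The old way: fetch the page yourself, then wrap the raw HTML manually
raw = httpx.get('https://example.com')
page = Adaptor(raw.text, url='https://example.com')

# With a fetcher: one call makes the request and returns an Adaptor-like object
page = Fetcher().get('https://example.com')
```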

### Features
You might be a little confused by now, so let me clear things up: all fetcher-type classes are imported the same way
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
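# Each one is then used in the same single-call fashion, for example:
#   page = Fetcher().get('https://example.com')
#   page = StealthyFetcher().fetch('https://example.com')
#   page = PlayWrightFetcher().fetch('https://example.com')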
@@ -233,9 +237,11 @@ Also, the `Response` object returned from all fetchers is the same as `Adaptor`
This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options. Here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have `stealth_headers`, which makes `Fetcher` create and use a real browser's headers, then create a referer header as if this request came from a Google search for this URL's domain. It's enabled by default.

You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format: `http://username:password@localhost:8030`
```python
>> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
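Since the returned `Response` object behaves like an `Adaptor`, you can parse what you fetched right away. A small sketch (the target site and selector are just examples):
```python
>> page = Fetcher().get('https://quotes.toscrape.com/')
>> page.status
200
>> quotes = page.css('.quote .text::text')  # select from the fetched page like any Adaptor
```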
@@ -263,6 +269,7 @@ True
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
| disable_ads | Enabled by default; when enabled, this installs the `uBlock Origin` addon in the browser to block ads. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -317,6 +324,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
| real_chrome | If you have the Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
| locale | Set the browser's locale if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
| nstbrowser_mode | Enables NSTBrowser mode, **it has to be used with the `cdp_url` argument or it will be completely ignored.** | ✔️ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
scrapling/__init__.py: 2 changes (1 addition, 1 deletion)
@@ -4,7 +4,7 @@
from scrapling.core.custom_types import TextHandler, AttributesHandler

__author__ = "Karim Shoair ([email protected])"
__version__ = "0.2.6"
__version__ = "0.2.7"
__copyright__ = "Copyright (c) 2024 Karim Shoair"


scrapling/engines/camo.py: 13 changes (12 additions, 1 deletion)
@@ -12,6 +12,7 @@
generate_convincing_referer,
)

from camoufox import DefaultAddons
from camoufox.sync_api import Camoufox


@@ -21,7 +22,8 @@ def __init__(
block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True,
timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None,
wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None,
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, adaptor_arguments: Dict = None
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True,
adaptor_arguments: Dict = None,
):
"""An engine that utilizes Camoufox library, check the `StealthyFetcher` class for more documentation.

@@ -36,6 +38,7 @@ def __init__(
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
:param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
:param disable_ads: Enabled by default; when enabled, this installs the `uBlock Origin` addon in the browser to block ads.
:param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
@@ -54,6 +57,7 @@ def __init__(
self.network_idle = bool(network_idle)
self.google_search = bool(google_search)
self.os_randomize = bool(os_randomize)
self.disable_ads = bool(disable_ads)
self.extra_headers = extra_headers or {}
self.proxy = construct_proxy_dict(proxy)
self.addons = addons or []
@@ -75,9 +79,11 @@ def fetch(self, url: str) -> Response:
:param url: Target url.
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
"""
addons = [] if self.disable_ads else [DefaultAddons.UBO]
with Camoufox(
proxy=self.proxy,
addons=self.addons,
exclude_addons=addons,
headless=self.headless,
humanize=self.humanize,
i_know_what_im_doing=True, # To turn warnings off with the user configurations
@@ -105,6 +111,11 @@ def fetch(self, url: str) -> Response:
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding
scrapling/engines/constants.py: 2 changes (1 addition, 1 deletion)
@@ -44,7 +44,7 @@
'--disable-default-apps',
'--disable-print-preview',
'--disable-dev-shm-usage',
'--disable-popup-blocking',
# '--disable-popup-blocking',
'--metrics-recording-only',
'--disable-crash-reporter',
'--disable-partial-raster',
scrapling/engines/pw.py: 24 changes (21 additions, 3 deletions)
@@ -26,6 +26,7 @@ def __init__(
timeout: Optional[float] = 30000,
page_action: Callable = do_nothing,
wait_selector: Optional[str] = None,
locale: Optional[str] = 'en-US',
wait_selector_state: Optional[str] = 'attached',
stealth: Optional[bool] = False,
real_chrome: Optional[bool] = False,
@@ -50,6 +51,7 @@ def __init__(
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
:param wait_selector: Wait for a specific css selector to be in a specific state.
:param locale: Set the browser's locale if wanted. The default value is `en-US`.
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
:param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
:param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
@@ -64,6 +66,7 @@ def __init__(
:param adaptor_arguments: The arguments that will be passed in the end while creating the final Adaptor's class.
"""
self.headless = headless
self.locale = check_type_validity(locale, [str], 'en-US', param_name='locale')
self.disable_resources = disable_resources
self.network_idle = bool(network_idle)
self.stealth = bool(stealth)
@@ -87,6 +90,14 @@ def __init__(
self.nstbrowser_mode = bool(nstbrowser_mode)
self.nstbrowser_config = nstbrowser_config
self.adaptor_arguments = adaptor_arguments if adaptor_arguments else {}
self.harmful_default_args = [
# These are ignored to make detection harder and possibly avoid abuse of the popup crashing bug: https://issues.chromium.org/issues/340836884
'--enable-automation',
'--disable-popup-blocking',
# '--disable-component-update',
# '--disable-default-apps',
# '--disable-extensions',
]

def _cdp_url_logic(self, flags: Optional[List] = None) -> str:
"""Constructs new CDP URL if NSTBrowser is enabled otherwise return CDP URL as it is
@@ -151,15 +162,15 @@ def fetch(self, url: str) -> Response:
else:
if self.stealth:
browser = p.chromium.launch(
headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
headless=self.headless, args=flags, ignore_default_args=self.harmful_default_args, chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
)
else:
browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'], channel='chrome' if self.real_chrome else 'chromium')
browser = p.chromium.launch(headless=self.headless, ignore_default_args=self.harmful_default_args, channel='chrome' if self.real_chrome else 'chromium')

# Creating the context
if self.stealth:
context = browser.new_context(
locale='en-US',
locale=self.locale,
is_mobile=False,
has_touch=False,
proxy=self.proxy,
@@ -176,6 +187,8 @@
)
else:
context = browser.new_context(
locale=self.locale,
proxy=self.proxy,
color_scheme='dark',
user_agent=useragent,
device_scale_factor=2,
@@ -221,6 +234,11 @@
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding