Merge pull request #18 from D4Vinci/dev
v0.2.7
D4Vinci authored Nov 26, 2024
2 parents 468d9b8 + 06a47f9 commit 26aebba
Showing 10 changed files with 101 additions and 40 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -44,10 +44,11 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
* [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
* [Extraction By Text Speed Test](#extraction-by-text-speed-test)
* [Installation](#installation)
* [Fetching Websites Features](#fetching-websites-features)
* [Fetcher](#fetcher)
* [StealthyFetcher](#stealthyfetcher)
* [PlayWrightFetcher](#playwrightfetcher)
* [Fetching Websites](#fetching-websites)
* [Features](#features)
* [Fetcher class](#fetcher)
* [StealthyFetcher class](#stealthyfetcher)
* [PlayWrightFetcher class](#playwrightfetcher)
* [Advanced Parsing Features](#advanced-parsing-features)
* [Smart Navigation](#smart-navigation)
* [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
@@ -210,7 +211,10 @@ playwright install chromium
python -m browserforge update
```
## Fetching Websites Features
## Fetching Websites
Fetchers are basically interfaces that do requests or fetch pages for you in a single request fashion then return an `Adaptor` object for you. This feature was introduced because the only option we had before was to fetch the page as you want then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.
### Features
You might be a little bit confused by now so let me clear things up. All fetcher-type classes are imported in the same way
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
@@ -233,9 +237,11 @@ Also, the `Response` object returned from all fetchers is the same as `Adaptor`
This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default.

You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
```python
>> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
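The README hunk above gives the proxy format as `http://username:password@localhost:8030`, and `camo.py` passes the value through `construct_proxy_dict`. That helper's body is not shown in this diff, so the following is only a sketch of how such a string could be split into a dictionary with `server`, `username`, and `password` keys. The name `split_proxy_url` and the returned shape are assumptions for illustration, not Scrapling's API.

```python
from urllib.parse import urlsplit


def split_proxy_url(proxy: str) -> dict:
    """Hypothetical helper: split a proxy URL into server and credentials."""
    parts = urlsplit(proxy)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"Not a valid proxy URL: {proxy!r}")

    # Rebuild the server part without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    return {
        "server": server,
        "username": parts.username or "",
        "password": parts.password or "",
    }
```

Keeping the credentials out of the `server` value makes it easier to log the proxy in use without leaking the password.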
@@ -263,6 +269,7 @@ True
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
| disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -317,6 +324,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
| real_chrome | If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
| nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | ✔️ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
2 changes: 1 addition & 1 deletion scrapling/__init__.py
@@ -4,7 +4,7 @@
from scrapling.core.custom_types import TextHandler, AttributesHandler

__author__ = "Karim Shoair ([email protected])"
__version__ = "0.2.6"
__version__ = "0.2.7"
__copyright__ = "Copyright (c) 2024 Karim Shoair"


13 changes: 12 additions & 1 deletion scrapling/engines/camo.py
@@ -12,6 +12,7 @@
generate_convincing_referer,
)

from camoufox import DefaultAddons
from camoufox.sync_api import Camoufox


@@ -21,7 +22,8 @@ def __init__(
block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True,
timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None,
wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None,
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, adaptor_arguments: Dict = None
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True,
adaptor_arguments: Dict = None,
):
"""An engine that utilizes Camoufox library, check the `StealthyFetcher` class for more documentation.
@@ -36,6 +38,7 @@ def __init__(
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
:param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
:param disable_ads: Enabled by default, this installs `uBlock Origin` addon on the browser if enabled.
:param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
@@ -54,6 +57,7 @@ def __init__(
self.network_idle = bool(network_idle)
self.google_search = bool(google_search)
self.os_randomize = bool(os_randomize)
self.disable_ads = bool(disable_ads)
self.extra_headers = extra_headers or {}
self.proxy = construct_proxy_dict(proxy)
self.addons = addons or []
@@ -75,9 +79,11 @@ def fetch(self, url: str) -> Response:
:param url: Target url.
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
"""
addons = [] if self.disable_ads else [DefaultAddons.UBO]
with Camoufox(
proxy=self.proxy,
addons=self.addons,
exclude_addons=addons,
headless=self.headless,
humanize=self.humanize,
i_know_what_im_doing=True, # To turn warnings off with the user configurations
@@ -105,6 +111,11 @@ def fetch(self, url: str) -> Response:
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding
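The last `camo.py` hunk adds a second round of load-state waits after the selector wait, which the commit comment says helps with protections like Cloudflare. As a rough, standalone sketch of that sequence: `wait_then_settle` is a hypothetical name, and `page` is assumed to be any object exposing the Playwright sync `Page` methods `locator` and `wait_for_load_state`.

```python
from typing import Any, Optional


def wait_then_settle(page: Any, selector: Optional[str],
                     state: str = "attached", network_idle: bool = False) -> None:
    """Hypothetical helper mirroring the wait order shown in the hunk."""
    if not selector or not isinstance(selector, str):
        return  # Nothing to wait for.

    # Wait for the first element matching the selector to reach the given state.
    page.locator(selector).first.wait_for(state=state)

    # The page may have reloaded or redirected meanwhile, so wait again.
    page.wait_for_load_state(state="load")
    page.wait_for_load_state(state="domcontentloaded")
    if network_idle:
        page.wait_for_load_state("networkidle")
```

Taking `page` as a parameter keeps the ordering testable with a stub object, without launching a browser.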
2 changes: 1 addition & 1 deletion scrapling/engines/constants.py
@@ -44,7 +44,7 @@
'--disable-default-apps',
'--disable-print-preview',
'--disable-dev-shm-usage',
'--disable-popup-blocking',
# '--disable-popup-blocking',
'--metrics-recording-only',
'--disable-crash-reporter',
'--disable-partial-raster',
24 changes: 21 additions & 3 deletions scrapling/engines/pw.py
@@ -26,6 +26,7 @@ def __init__(
timeout: Optional[float] = 30000,
page_action: Callable = do_nothing,
wait_selector: Optional[str] = None,
locale: Optional[str] = 'en-US',
wait_selector_state: Optional[str] = 'attached',
stealth: Optional[bool] = False,
real_chrome: Optional[bool] = False,
@@ -50,6 +51,7 @@ def __init__(
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
:param wait_selector: Wait for a specific css selector to be in a specific state.
:param locale: Set the locale for the browser if wanted. The default value is `en-US`.
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
:param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
:param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
@@ -64,6 +66,7 @@ def __init__(
:param adaptor_arguments: The arguments that will be passed in the end while creating the final Adaptor's class.
"""
self.headless = headless
self.locale = check_type_validity(locale, [str], 'en-US', param_name='locale')
self.disable_resources = disable_resources
self.network_idle = bool(network_idle)
self.stealth = bool(stealth)
@@ -87,6 +90,14 @@ def __init__(
self.nstbrowser_mode = bool(nstbrowser_mode)
self.nstbrowser_config = nstbrowser_config
self.adaptor_arguments = adaptor_arguments if adaptor_arguments else {}
self.harmful_default_args = [
# This will be ignored to avoid detection more and possibly avoid the popup crashing bug abuse: https://issues.chromium.org/issues/340836884
'--enable-automation',
'--disable-popup-blocking',
# '--disable-component-update',
# '--disable-default-apps',
# '--disable-extensions',
]

def _cdp_url_logic(self, flags: Optional[List] = None) -> str:
"""Constructs new CDP URL if NSTBrowser is enabled otherwise return CDP URL as it is
@@ -151,15 +162,15 @@ def fetch(self, url: str) -> Response:
else:
if self.stealth:
browser = p.chromium.launch(
headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
headless=self.headless, args=flags, ignore_default_args=self.harmful_default_args, chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
)
else:
browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'], channel='chrome' if self.real_chrome else 'chromium')
browser = p.chromium.launch(headless=self.headless, ignore_default_args=self.harmful_default_args, channel='chrome' if self.real_chrome else 'chromium')

# Creating the context
if self.stealth:
context = browser.new_context(
locale='en-US',
locale=self.locale,
is_mobile=False,
has_touch=False,
proxy=self.proxy,
@@ -176,6 +187,8 @@ def fetch(self, url: str) -> Response:
)
else:
context = browser.new_context(
locale=self.locale,
proxy=self.proxy,
color_scheme='dark',
user_agent=useragent,
device_scale_factor=2,
@@ -221,6 +234,11 @@ def fetch(self, url: str) -> Response:
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding
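In `pw.py`, both launch branches now pass the shared `harmful_default_args` list to `ignore_default_args` instead of a hard-coded `['--enable-automation']`. A minimal sketch of how the two branches could be expressed as one keyword-argument builder follows. `build_launch_kwargs` and `HARMFUL_DEFAULT_ARGS` are hypothetical names, and only the two uncommented flags from the hunk are included.

```python
from typing import Any, Dict, List, Optional

# The two active entries of `harmful_default_args` in the hunk.
HARMFUL_DEFAULT_ARGS = ["--enable-automation", "--disable-popup-blocking"]


def build_launch_kwargs(headless: bool, stealth: bool, real_chrome: bool,
                        flags: Optional[List[str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments for `chromium.launch`."""
    kwargs: Dict[str, Any] = {
        "headless": headless,
        # Ask Playwright to drop these from its default launch flags.
        "ignore_default_args": list(HARMFUL_DEFAULT_ARGS),
        "channel": "chrome" if real_chrome else "chromium",
    }
    if stealth:
        # Only the stealth branch in the diff passes extra flags and the sandbox.
        kwargs["args"] = flags or []
        kwargs["chromium_sandbox"] = True
    return kwargs
```

With Playwright installed, the result could be passed along as `p.chromium.launch(**build_launch_kwargs(True, True, False, flags))`, where `p` is the object returned by `sync_playwright()`.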