v0.2.92 #27

Merged
merged 6 commits into from
Dec 26, 2024
2 changes: 2 additions & 0 deletions .bandit.yml
@@ -5,3 +5,5 @@ skips:
 - B410
 - B113 # `Requests call without timeout` these requests are done in the benchmark and examples scripts only
 - B403 # We are using pickle for tests only
+- B404 # Using subprocess library
+- B602 # subprocess call with shell=True identified
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -4,7 +4,10 @@ include *.js
 include scrapling/engines/toolbelt/bypasses/*.js
 include scrapling/*.db
 include scrapling/*.db*
+include scrapling/*.db-*
 include scrapling/py.typed
+include scrapling/.scrapling_dependencies_installed
+include .scrapling_dependencies_installed

 recursive-exclude * __pycache__
 recursive-exclude * *.py[co]
44 changes: 5 additions & 39 deletions README.md
@@ -167,52 +167,18 @@ Scrapling can find elements with more methods and it returns full element `Adapt
 > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.

 ## Installation
-Scrapling is a breeze to get started with - Starting from version 0.2.9, we require at least Python 3.9 to work.
+Scrapling is a breeze to get started with; Starting from version 0.2.9, we require at least Python 3.9 to work.
 ```bash
 pip3 install scrapling
 ```
-- For using the `StealthyFetcher`, go to the command line and download the browser with
-<details><summary>Windows OS</summary>
-
-```bash
-camoufox fetch --browserforge
-```
-</details>
-<details><summary>MacOS</summary>
-
-```bash
-python3 -m camoufox fetch --browserforge
-```
-</details>
-<details><summary>Linux</summary>
-
+Then run this command to install browsers' dependencies needed to use Fetcher classes
 ```bash
-python -m camoufox fetch --browserforge
-```
-On a fresh installation of Linux, you may also need the following Firefox dependencies:
-- Debian-based distros
-```bash
-sudo apt install -y libgtk-3-0 libx11-xcb1 libasound2
-```
-- Arch-based distros
-```bash
-sudo pacman -S gtk3 libx11 libxcb cairo libasound alsa-lib
-```
-</details>
-
-<small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
-
-- If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
-```commandline
-playwright install chromium
-```
-- If you are going to use normal requests only with the `Fetcher` class then update the fingerprints files with:
-```commandline
-python -m browserforge update
+scrapling install
 ```
+If you have any installation issues, please open an issue.

 ## Fetching Websites
-Fetchers are basically interfaces that do requests or fetch pages for you in a single request fashion and then return an `Adaptor` object for you. This feature was introduced because the only option we had before was to fetch the page as you wanted it, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.
+Fetchers are interfaces built on top of other libraries with added features that do requests or fetch pages for you in a single request fashion and then return an `Adaptor` object. This feature was introduced because the only option we had before was to fetch the page as you wanted it, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.

 ### Features
 You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
2 changes: 1 addition & 1 deletion scrapling/__init__.py
@@ -5,7 +5,7 @@
 from scrapling.parser import Adaptor, Adaptors

 __author__ = "Karim Shoair ([email protected])"
-__version__ = "0.2.91"
+__version__ = "0.2.92"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"


37 changes: 37 additions & 0 deletions scrapling/cli.py
@@ -0,0 +1,37 @@
+import os
+import subprocess
+import sys
+from pathlib import Path
+
+import click
+
+
+def get_package_dir():
+    return Path(os.path.dirname(__file__))
+
+
+def run_command(command, line):
+    print(f"Installing {line}...")
+    _ = subprocess.check_call(command, shell=True)
+    # I meant to not use try except here
+
+
+@click.command(help="Install all Scrapling's Fetchers dependencies")
+def install():
+    if not get_package_dir().joinpath(".scrapling_dependencies_installed").exists():
+        run_command([sys.executable, "-m", "playwright", "install", 'chromium'], 'Playwright browsers')
+        run_command([sys.executable, "-m", "playwright", "install-deps", 'chromium', 'firefox'], 'Playwright dependencies')
+        run_command([sys.executable, "-m", "camoufox", "fetch", '--browserforge'], 'Camoufox browser and databases')
+        # if no errors raised by above commands, then we add below file
+        get_package_dir().joinpath(".scrapling_dependencies_installed").touch()
+    else:
+        print('The dependencies are already installed')
+
+
+@click.group()
+def main():
+    pass
+
+
+# Adding commands
+main.add_command(install)
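The installer above passes an argument list to `subprocess.check_call` together with `shell=True`. Python's `subprocess` documentation notes that on POSIX, when `shell=True` is given a sequence, the first item is the command string and the remaining items are treated as arguments to the shell itself, so the extra arguments may not reach the intended program. A minimal sketch of the same marker-file flow without a shell could look like this; `install_once`, its `commands` parameter, the marker name `.deps_installed`, and the demo command are hypothetical names for illustration, not part of Scrapling.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_command(command, label):
    """Run an argument list directly (no shell); raises CalledProcessError on failure."""
    print(f"Installing {label}...")
    subprocess.check_call(command)


def install_once(package_dir, commands, marker_name=".deps_installed"):
    """Run every (command, label) pair once, then record success with a marker file.

    Returns True if the commands were run, False if the marker already existed.
    """
    marker = Path(package_dir) / marker_name
    if marker.exists():
        print("The dependencies are already installed")
        return False
    for command, label in commands:
        run_command(command, label)
    # Only reached when no command raised, mirroring the flow in the diff
    marker.touch()
    return True


# Hypothetical stand-in command so the sketch runs without Playwright or Camoufox
demo_commands = [([sys.executable, "-c", "print('ok')"], "demo step")]

with tempfile.TemporaryDirectory() as tmp:
    first_run = install_once(tmp, demo_commands)
    second_run = install_once(tmp, demo_commands)
```

Because the marker is written only after every command succeeds, a failed step leaves no marker behind and the next invocation retries the whole sequence.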
22 changes: 10 additions & 12 deletions scrapling/engines/camo.py
@@ -89,7 +89,7 @@ def fetch(self, url: str) -> Response:

         def handle_response(finished_response):
             nonlocal final_response
-            if finished_response.request.resource_type == "document":
+            if finished_response.request.resource_type == "document" and finished_response.request.is_navigation_request():
                 final_response = finished_response

         with Camoufox(
@@ -133,7 +133,6 @@ def handle_response(finished_response):
                 if self.network_idle:
                     page.wait_for_load_state('networkidle')

-            response_bytes = final_response.body() if final_response else page.content().encode('utf-8')
             # In case we didn't catch a document type somehow
             final_response = final_response if final_response else first_response
             # This will be parsed inside `Response`
@@ -142,15 +141,15 @@ def handle_response(finished_response):
             status_text = final_response.status_text or StatusText.get(final_response.status)

             response = Response(
-                url=final_response.url,
+                url=page.url,
                 text=page.content(),
-                body=response_bytes,
+                body=page.content().encode('utf-8'),
                 status=final_response.status,
                 reason=status_text,
                 encoding=encoding,
                 cookies={cookie['name']: cookie['value'] for cookie in page.context.cookies()},
-                headers=final_response.all_headers(),
-                request_headers=final_response.request.all_headers(),
+                headers=first_response.all_headers(),
+                request_headers=first_response.request.all_headers(),
                 **self.adaptor_arguments
             )
             page.close()
@@ -169,7 +168,7 @@ async def async_fetch(self, url: str) -> Response:

         async def handle_response(finished_response):
             nonlocal final_response
-            if finished_response.request.resource_type == "document":
+            if finished_response.request.resource_type == "document" and finished_response.request.is_navigation_request():
                 final_response = finished_response

         async with AsyncCamoufox(
@@ -213,7 +212,6 @@ async def handle_response(finished_response):
                 if self.network_idle:
                     await page.wait_for_load_state('networkidle')

-            response_bytes = await final_response.body() if final_response else (await page.content()).encode('utf-8')
             # In case we didn't catch a document type somehow
             final_response = final_response if final_response else first_response
             # This will be parsed inside `Response`
@@ -222,15 +220,15 @@ async def handle_response(finished_response):
             status_text = final_response.status_text or StatusText.get(final_response.status)

             response = Response(
-                url=final_response.url,
+                url=page.url,
                 text=await page.content(),
-                body=response_bytes,
+                body=(await page.content()).encode('utf-8'),
                 status=final_response.status,
                 reason=status_text,
                 encoding=encoding,
                 cookies={cookie['name']: cookie['value'] for cookie in await page.context.cookies()},
-                headers=await final_response.all_headers(),
-                request_headers=await final_response.request.all_headers(),
+                headers=await first_response.all_headers(),
+                request_headers=await first_response.request.all_headers(),
                 **self.adaptor_arguments
             )
             await page.close()
22 changes: 10 additions & 12 deletions scrapling/engines/pw.py
@@ -206,7 +206,7 @@ def fetch(self, url: str) -> Response:

         def handle_response(finished_response: PlaywrightResponse):
             nonlocal final_response
-            if finished_response.request.resource_type == "document":
+            if finished_response.request.resource_type == "document" and finished_response.request.is_navigation_request():
                 final_response = finished_response

         with sync_playwright() as p:
@@ -252,7 +252,6 @@ def handle_response(finished_response: PlaywrightResponse):
                 if self.network_idle:
                     page.wait_for_load_state('networkidle')

-            response_bytes = final_response.body() if final_response else page.content().encode('utf-8')
             # In case we didn't catch a document type somehow
             final_response = final_response if final_response else first_response
             # This will be parsed inside `Response`
@@ -261,15 +260,15 @@ def handle_response(finished_response: PlaywrightResponse):
             status_text = final_response.status_text or StatusText.get(final_response.status)

             response = Response(
-                url=final_response.url,
+                url=page.url,
                 text=page.content(),
-                body=response_bytes,
+                body=page.content().encode('utf-8'),
                 status=final_response.status,
                 reason=status_text,
                 encoding=encoding,
                 cookies={cookie['name']: cookie['value'] for cookie in page.context.cookies()},
-                headers=final_response.all_headers(),
-                request_headers=final_response.request.all_headers(),
+                headers=first_response.all_headers(),
+                request_headers=first_response.request.all_headers(),
                 **self.adaptor_arguments
             )
             page.close()
@@ -293,7 +292,7 @@ async def async_fetch(self, url: str) -> Response:

         async def handle_response(finished_response: PlaywrightResponse):
             nonlocal final_response
-            if finished_response.request.resource_type == "document":
+            if finished_response.request.resource_type == "document" and finished_response.request.is_navigation_request():
                 final_response = finished_response

         async with async_playwright() as p:
@@ -339,7 +338,6 @@ async def handle_response(finished_response: PlaywrightResponse):
                 if self.network_idle:
                     await page.wait_for_load_state('networkidle')

-            response_bytes = await final_response.body() if final_response else (await page.content()).encode('utf-8')
             # In case we didn't catch a document type somehow
             final_response = final_response if final_response else first_response
             # This will be parsed inside `Response`
@@ -348,15 +346,15 @@ async def handle_response(finished_response: PlaywrightResponse):
             status_text = final_response.status_text or StatusText.get(final_response.status)

             response = Response(
-                url=final_response.url,
+                url=page.url,
                 text=await page.content(),
-                body=response_bytes,
+                body=(await page.content()).encode('utf-8'),
                 status=final_response.status,
                 reason=status_text,
                 encoding=encoding,
                 cookies={cookie['name']: cookie['value'] for cookie in await page.context.cookies()},
-                headers=await final_response.all_headers(),
-                request_headers=await final_response.request.all_headers(),
+                headers=await first_response.all_headers(),
+                request_headers=await first_response.request.all_headers(),
                 **self.adaptor_arguments
             )
             await page.close()
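In both engine diffs, a response is recorded as `final_response` only when its request has the `document` resource type and is also a navigation request, with `first_response` kept as the fallback. The selection rule can be sketched without a browser; `FakeRequest`, `FakeResponse`, and `pick_final_response` below are hypothetical stand-ins that only mimic the `resource_type` attribute and `is_navigation_request()` method used in the diff, and the URLs are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeRequest:
    """Stand-in for a browser request; only the two members the filter reads."""
    resource_type: str
    navigation: bool

    def is_navigation_request(self) -> bool:
        return self.navigation


@dataclass
class FakeResponse:
    """Stand-in for a browser response."""
    url: str
    status: int
    request: FakeRequest


def pick_final_response(responses: Iterable[FakeResponse],
                        first_response: Optional[FakeResponse]) -> Optional[FakeResponse]:
    """Keep the last response that is a navigating document, else fall back."""
    final_response = None
    for response in responses:
        request = response.request
        if request.resource_type == "document" and request.is_navigation_request():
            final_response = response
    # Same fallback expression as in the diff
    return final_response if final_response else first_response


first = FakeResponse("https://example.com/", 301, FakeRequest("document", True))
events = [
    first,
    FakeResponse("https://example.com/app.js", 200, FakeRequest("script", False)),
    FakeResponse("https://example.com/home", 200, FakeRequest("document", True)),
    FakeResponse("https://example.com/data", 200, FakeRequest("document", False)),
]

chosen = pick_final_response(events, first)
fallback = pick_final_response(events[1:2], first)
```

Here `chosen` is the `/home` response, because the later `/data` response is a document but not a navigation, and `fallback` is `first` because the only event seen was a script.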
4 changes: 2 additions & 2 deletions scrapling/parser.py
@@ -474,7 +474,7 @@ def xpath_first(self, selector: str, identifier: str = '',

     def css(self, selector: str, identifier: str = '',
             auto_match: bool = False, auto_save: bool = False, percentage: int = 0
-            ) -> Union['Adaptors[Adaptor]', List]:
+            ) -> Union['Adaptors[Adaptor]', List, 'TextHandlers[TextHandler]']:
         """Search current tree with CSS3 selectors

         **Important:
@@ -517,7 +517,7 @@ def css(self, selector: str, identifier: str = '',

     def xpath(self, selector: str, identifier: str = '',
               auto_match: bool = False, auto_save: bool = False, percentage: int = 0, **kwargs: Any
-              ) -> Union['Adaptors[Adaptor]', List]:
+              ) -> Union['Adaptors[Adaptor]', List, 'TextHandlers[TextHandler]']:
         """Search current tree with XPath selectors

         **Important:
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,6 +1,6 @@
 [metadata]
 name = scrapling
-version = 0.2.91
+version = 0.2.92
 author = Karim Shoair
 author_email = [email protected]
 description = Scrapling is an undetectable, powerful, flexible, adaptive, and high-performance web scraping library for Python.
8 changes: 7 additions & 1 deletion setup.py
@@ -6,7 +6,7 @@

 setup(
     name="scrapling",
-    version="0.2.91",
+    version="0.2.92",
     description="""Scrapling is a powerful, flexible, and high-performance web scraping library for Python. It
     simplifies the process of extracting data from websites, even when they undergo structural changes, and offers
     impressive speed improvements over many popular scraping tools.""",
@@ -20,6 +20,11 @@
     package_dir={
         "scrapling": "scrapling",
     },
+    entry_points={
+        'console_scripts': [
+            'scrapling=scrapling.cli:main'
+        ],
+    },
     include_package_data=True,
     classifiers=[
         "Operating System :: OS Independent",
@@ -50,6 +55,7 @@
         "requests>=2.3",
         "lxml>=4.5",
         "cssselect>=1.2",
+        'click',
         "w3lib",
         "orjson>=3",
         "tldextract",
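The `console_scripts` entry `scrapling=scrapling.cli:main` maps a command name to a `module:attribute` reference, which is what makes the `scrapling install` command in the README available after installation. As a rough illustration of how such a reference resolves, the sketch below uses `importlib.metadata.EntryPoint` (Python 3.9+) with a hypothetical entry that points at the standard library's `json.dumps`, so it runs without Scrapling installed.

```python
from importlib.metadata import EntryPoint

# Hypothetical entry for illustration; the one in setup.py is 'scrapling=scrapling.cli:main'
entry = EntryPoint(name="demo", value="json:dumps", group="console_scripts")

module_name = entry.module  # part before the colon
attr_name = entry.attr      # part after the colon
target = entry.load()       # imports the module and returns the attribute

result = target({"installed": True})
```

For the real entry, the same resolution would import `scrapling.cli` and return its `main` click group.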