v0.2.9 #25

Merged · 46 commits · Dec 16, 2024

Commits
31e838c
Make standard string methods return TextHandler again instead of str
D4Vinci Dec 4, 2024
45e86f5
Adding empty methods (get/get_all/extract/extract_all)
D4Vinci Dec 4, 2024
d0f1895
fix: Enable WebGL by default
D4Vinci Dec 4, 2024
5f9c398
Adding `urljoin` method to Adaptors and Responses
D4Vinci Dec 10, 2024
b4f9061
Adding `keep_cdata` argument for `Adaptor` and `Response` classes
D4Vinci Dec 10, 2024
93ed768
Bumping up the libraries versions for better stealth and speed!
D4Vinci Dec 10, 2024
e60a57c
Python 3.8 is not supported anymore
D4Vinci Dec 10, 2024
ba11c43
Preparing to release 0.2.9 soon
D4Vinci Dec 10, 2024
70b5424
Update tox.ini
D4Vinci Dec 10, 2024
e332444
Update tests.yml
D4Vinci Dec 10, 2024
9f0001a
feat: logging for response status
D4Vinci Dec 11, 2024
dcf5187
fix: Adaptor.body returns raw HTML without processing
D4Vinci Dec 11, 2024
f30eb6a
build: disable the 404 error test for playwright
D4Vinci Dec 11, 2024
193827e
refactor(api)!: Unifying log under 1 logger and removing debug parameter
D4Vinci Dec 11, 2024
e254341
fix: forgot to stage it with last commit
D4Vinci Dec 11, 2024
6f87420
build: disable the 501 error test for playwright
D4Vinci Dec 11, 2024
bfe9063
feat: adding `geoip` parameter to the StealthyFetcher
D4Vinci Dec 12, 2024
f9bee4c
build: Pumping up camoufox version to solve browserforge issue
D4Vinci Dec 12, 2024
838dd62
build: pumping up camoufox
D4Vinci Dec 13, 2024
299793a
feat: adding the `retries` argument for all methods of `Fetcher` class
D4Vinci Dec 15, 2024
445af3c
perf: Give repeated usage of `Fetcher` a slight performance increase
D4Vinci Dec 15, 2024
faf728a
style: moving repeated arguments from inside the functions to __init__
D4Vinci Dec 15, 2024
b10cfd3
feat: Adding `AsyncFetcher` class version of `Fetcher`
D4Vinci Dec 15, 2024
a90192d
build: adding tests for `AsyncFetcher` class
D4Vinci Dec 15, 2024
6c17bd8
style: using better data structures for constants
D4Vinci Dec 15, 2024
889c111
refactor(Playwright Engine): Separate what we can for cleaner code an…
D4Vinci Dec 15, 2024
af4f2c0
feat(PlaywrightFetcher): Add async support for PlaywrightFetcher
D4Vinci Dec 15, 2024
aac77e4
test: add test for PlaywrightFetcher Async support
D4Vinci Dec 15, 2024
de015f2
chore: Restructuring fetchers into cleaner structure
D4Vinci Dec 15, 2024
361ee44
test: adding `pytest-asyncio` plugin to tests requirements file
D4Vinci Dec 15, 2024
efb3270
style: Rewrite asyncFetcher tests to a cleaner version
D4Vinci Dec 15, 2024
79a911e
feat(StealthyFetcher): Add async fetch support
D4Vinci Dec 15, 2024
448587f
test: add tests for StealthyFetcher async fetch
D4Vinci Dec 15, 2024
2606f7a
test: clearer naming for async tests
D4Vinci Dec 15, 2024
3ecffcb
style: Rewrite sync Fetchers tests to a cleaner version
D4Vinci Dec 15, 2024
eceef48
fix(Fetchers/page_action): Fixing logic
D4Vinci Dec 16, 2024
d66ebc1
fix(Fetchers/disable_resources): Fixing the logic for intercepting re…
D4Vinci Dec 16, 2024
05c6eeb
fix: add AsyncFetcher to top-level shortcuts
D4Vinci Dec 16, 2024
6cf5ce9
docs: Adding async examples and fixing some typos
D4Vinci Dec 16, 2024
43b41b3
docs: fix TOC header
D4Vinci Dec 16, 2024
3ff0d55
chore: Delete former-sponsor banner
D4Vinci Dec 16, 2024
69e3161
test: Rewrite parser tests to a cleaner version and adding more tests
D4Vinci Dec 16, 2024
20ef453
test: Rewrite automatch tests to a cleaner version and adding async test
D4Vinci Dec 16, 2024
ac95600
docs: add an example that uses `page.urljoin`
D4Vinci Dec 16, 2024
c282cb3
fix: Stopped the log spamming that happens with multiple instances cr…
D4Vinci Dec 16, 2024
838b8b1
style: fixing a typo in logging
D4Vinci Dec 16, 2024
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/01-bug_report.yml
@@ -65,7 +65,7 @@ body:

- type: textarea
attributes:
label: "Actual behavior (Remember to use `debug` parameter)"
label: "Actual behavior"
validations:
required: true

4 changes: 0 additions & 4 deletions .github/workflows/tests.yml
@@ -17,10 +17,6 @@ jobs:
fail-fast: false
matrix:
include:
- python-version: "3.8"
os: ubuntu-latest
env:
TOXENV: py
- python-version: "3.9"
os: ubuntu-latest
env:
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -16,4 +16,4 @@ repos:
rev: v1.6.0
hooks:
- id: vermin
args: ['-t=3.8-', '--violations', '--eval-annotations', '--no-tips']
args: ['-t=3.9-', '--violations', '--eval-annotations', '--no-tips']
6 changes: 5 additions & 1 deletion CONTRIBUTING.md
@@ -19,7 +19,11 @@ tests/test_parser_functions.py ................ [100%]

=============================== 16 passed in 0.22s ================================
```
Also, consider setting `debug` to `True` while initializing the Adaptor object so it's easier to know what's happening in the background.
Also, consider setting the scrapling logging level to `debug` so it's easier to know what's happening in the background.
```python
>>> import logging
>>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
```

### The process is straight-forward.

37 changes: 27 additions & 10 deletions README.md
@@ -6,7 +6,7 @@ Dealing with failing web scrapers due to anti-bot protections or website changes
Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

```python
>> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
@@ -35,7 +35,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha

## Table of content
* [Key Features](#key-features)
* [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
* [Fetch websites as you prefer](#fetch-websites-as-you-prefer-with-async-support)
* [Adaptive Scraping](#adaptive-scraping)
* [Performance](#performance)
* [Developing Experience](#developing-experience)
@@ -76,7 +76,7 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha

## Key Features

### Fetch websites as you prefer
### Fetch websites as you prefer with async support
- **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
- **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` with default configuration!
- **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`!
@@ -167,7 +167,7 @@ Scrapling can find elements with more methods and it returns full element `Adapt
> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.

## Installation
Scrapling is a breeze to get started with - Starting from version 0.2, we require at least Python 3.8 to work.
Scrapling is a breeze to get started with - Starting from version 0.2.9, we require at least Python 3.9 to work.
```bash
pip3 install scrapling
```
@@ -219,11 +219,11 @@ You might be slightly confused by now so let me clear things up. All fetcher-typ
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
```
All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
All of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class.

If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
```python
from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
```
then use it right away without initializing like:
```python
@@ -236,21 +236,32 @@ Also, the `Response` object returned from all fetchers is the same as the `Adapt
### Fetcher
This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default.
For all methods, you have `stealthy_headers`, which makes `Fetcher` create and use real browser headers, then create a referer header as if the request came from a Google search for this URL's domain. It's enabled by default. You can also set the number of retries with the `retries` argument on all methods, which makes httpx retry requests if they fail for any reason. The default number of retries for all `Fetcher` methods is 3.

You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format `http://username:password@localhost:8030`
```python
>> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
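For illustration, a minimal sketch of the `retries` argument mentioned above (the keyword name is as described in the paragraph; the value here is arbitrary):
```python
>> # Retry up to 5 times instead of the default 3 if the request fails for any reason
>> page = Fetcher().get('https://httpbin.org/get', retries=5)
```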
For Async requests, you will just replace the import like below:
```python
>> from scrapling import AsyncFetcher
>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
```
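The `await` calls above assume an async context; a minimal sketch of driving `AsyncFetcher` from a plain script (the `main` coroutine is illustrative):
```python
import asyncio

from scrapling import AsyncFetcher


async def main():
    # Arguments mirror the sync Fetcher shown above
    page = await AsyncFetcher().get('https://httpbin.org/get', follow_redirects=True)
    print(page.status)


asyncio.run(main())
```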
### StealthyFetcher
This class is built on top of [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
```python
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
>> page.status == 200
True
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
>> page.status == 200
True
```
> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)
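For illustration, a minimal sketch of the new `geoip` flag described in the argument table below (the target URL is a placeholder):
```python
>> # geoip spoofs the timezone, locale, and WebRTC IP address based on the IP in use
>> page = await StealthyFetcher().async_fetch('https://example.com', geoip=True)
```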

Expand All @@ -268,7 +279,8 @@ True
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
| allow_webgl | Enabled by default. Disabling WebGL is not recommended as many WAFs now check whether WebGL is enabled. | ✔️ |
| geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
| disable_ads | Enabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
@@ -287,6 +299,9 @@ This class is built on top of [Playwright](https://playwright.dev/python/) which
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
> Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)

@@ -391,6 +406,9 @@ You can select elements by their text content in multiple ways, here's a full ex
>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'

>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

@@ -804,7 +822,6 @@ This project includes code adapted from:

## Known Issues
- In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.
- Currently, Scrapling is not compatible with async/await.

---
<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>
6 changes: 3 additions & 3 deletions benchmarks.py
@@ -64,9 +64,9 @@ def test_pyquery():
@benchmark
def test_scrapling():
# No need to do `.extract()` like parsel to extract text
# Also, this is faster than `[t.text for t in Adaptor(large_html, auto_match=False, debug=False).css('.item')]`
# Also, this is faster than `[t.text for t in Adaptor(large_html, auto_match=False).css('.item')]`
# for obvious reasons, of course.
return Adaptor(large_html, auto_match=False, debug=False).css('.item::text')
return Adaptor(large_html, auto_match=False).css('.item::text')


@benchmark
@@ -103,7 +103,7 @@ def test_scrapling_text(request_html):
# Will loop over resulted elements to get text too to make comparison even more fair otherwise Scrapling will be even faster
return [
element.text for element in Adaptor(
request_html, auto_match=False, debug=False
request_html, auto_match=False
).find_by_text('Tipping the Velvet', first_match=True).find_similar(ignore_attributes=['title'])
]

Binary file removed images/CapSolver.png
Binary file not shown.
2 changes: 2 additions & 0 deletions pytest.ini
@@ -1,2 +1,4 @@
[pytest]
asyncio_mode = auto
asyncio_default_fixture_loop_scope = function
addopts = -p no:warnings --doctest-modules --ignore=setup.py --verbose
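With `asyncio_mode = auto`, pytest-asyncio collects plain `async def` tests without per-test markers; a hypothetical minimal test under this configuration might look like:
```python
from scrapling import AsyncFetcher


# Hypothetical test: with asyncio_mode = auto, no @pytest.mark.asyncio marker is needed
async def test_async_get_status():
    page = await AsyncFetcher().get('https://httpbin.org/get')
    assert page.status == 200
```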
8 changes: 4 additions & 4 deletions scrapling/__init__.py
@@ -1,12 +1,12 @@
# Declare top-level shortcuts
from scrapling.core.custom_types import AttributesHandler, TextHandler
from scrapling.fetchers import (CustomFetcher, Fetcher, PlayWrightFetcher,
StealthyFetcher)
from scrapling.fetchers import (AsyncFetcher, CustomFetcher, Fetcher,
PlayWrightFetcher, StealthyFetcher)
from scrapling.parser import Adaptor, Adaptors

__author__ = "Karim Shoair ([email protected])"
__version__ = "0.2.8"
__version__ = "0.2.9"
__copyright__ = "Copyright (c) 2024 Karim Shoair"


__all__ = ['Adaptor', 'Fetcher', 'StealthyFetcher', 'PlayWrightFetcher']
__all__ = ['Adaptor', 'Fetcher', 'AsyncFetcher', 'StealthyFetcher', 'PlayWrightFetcher']
94 changes: 88 additions & 6 deletions scrapling/core/custom_types.py
@@ -14,11 +14,70 @@ class TextHandler(str):
__slots__ = ()

def __new__(cls, string):
# Because str is immutable and we can't override __init__
if type(string) is str:
if isinstance(string, str):
return super().__new__(cls, string)
else:
return super().__new__(cls, '')
return super().__new__(cls, '')

# Make methods from original `str` class return `TextHandler` instead of returning `str` again
# Of course, this stupid workaround is only so we can keep the auto-completion working without issues in your IDE
# and I made sonnet write it for me :)
def strip(self, chars=None):
return TextHandler(super().strip(chars))

def lstrip(self, chars=None):
return TextHandler(super().lstrip(chars))

def rstrip(self, chars=None):
return TextHandler(super().rstrip(chars))

def capitalize(self):
return TextHandler(super().capitalize())

def casefold(self):
return TextHandler(super().casefold())

def center(self, width, fillchar=' '):
return TextHandler(super().center(width, fillchar))

def expandtabs(self, tabsize=8):
return TextHandler(super().expandtabs(tabsize))

def format(self, *args, **kwargs):
return TextHandler(super().format(*args, **kwargs))

def format_map(self, mapping):
return TextHandler(super().format_map(mapping))

def join(self, iterable):
return TextHandler(super().join(iterable))

def ljust(self, width, fillchar=' '):
return TextHandler(super().ljust(width, fillchar))

def rjust(self, width, fillchar=' '):
return TextHandler(super().rjust(width, fillchar))

def swapcase(self):
return TextHandler(super().swapcase())

def title(self):
return TextHandler(super().title())

def translate(self, table):
return TextHandler(super().translate(table))

def zfill(self, width):
return TextHandler(super().zfill(width))

def replace(self, old, new, count=-1):
return TextHandler(super().replace(old, new, count))

def upper(self):
return TextHandler(super().upper())

def lower(self):
return TextHandler(super().lower())
##############

def sort(self, reverse: bool = False) -> str:
"""Return a sorted version of the string"""
@@ -30,11 +89,21 @@ def clean(self) -> str:
data = re.sub(' +', ' ', data)
return self.__class__(data.strip())

# For easy copy-paste from Scrapy/parsel code when needed :)
def get(self, default=None):
return self

def get_all(self):
return self

extract = get_all
extract_first = get

def json(self) -> Dict:
"""Return json response if the response is jsonable otherwise throw error"""
# Using __str__ function as a workaround for orjson issue with subclasses of str
# Using str function as a workaround for orjson issue with subclasses of str
# Check this out: https://github.com/ijl/orjson/issues/445
return loads(self.__str__())
return loads(str(self))

def re(
self, regex: Union[str, Pattern[str]], replace_entities: bool = True, clean_match: bool = False,
@@ -127,6 +196,19 @@ def re_first(self, regex: Union[str, Pattern[str]], default=None, replace_entiti
return result
return default

# For easy copy-paste from Scrapy/parsel code when needed :)
def get(self, default=None):
"""Returns the first item of the current list
:param default: the default value to return if the current list is empty
"""
return self[0] if len(self) > 0 else default

def extract(self):
return self

extract_first = get
get_all = extract


class AttributesHandler(Mapping):
"""A read-only mapping to use instead of the standard dictionary for the speed boost but at the same time I use it to add more functionalities.
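Taken together, the parsel-style shortcuts and the `TextHandler`-preserving string methods added above allow chains like the following sketch (the sample HTML and selector are illustrative):
```python
from scrapling import Adaptor

page = Adaptor('<html><body><span class="item"> Tipping the Velvet </span></body></html>', auto_match=False)

title = page.css('.item::text').get()       # first match, parsel-style
titles = page.css('.item::text').extract()  # the whole list, parsel-style
print(title.strip().upper())                # str methods keep returning TextHandler
```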