v0.2.9 #25

Merged
merged 46 commits into from
Dec 16, 2024

D4Vinci
Owner

@D4Vinci D4Vinci commented Dec 16, 2024

What's changed

New features

  1. Scrapling now has the long-awaited async support! The new AsyncFetcher class is the async version of Fetcher, and both StealthyFetcher and PlayWrightFetcher have a new method called async_fetch that takes the same options as fetch.
>> from scrapling import StealthyFetcher
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')  # the async version of fetch
>> page.status == 200
True
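
Outside an async-aware shell, a bare `await` needs a running event loop. The sketch below shows one way to drive several `async_fetch` calls concurrently with `asyncio.gather`. `FakeFetcher` and `FakePage` are hypothetical stand-ins (not part of Scrapling) so the snippet runs without a browser or network; only the `async_fetch(url)` call shape and the `status` attribute are taken from the example above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for the page object a fetcher returns."""
    url: str
    status: int


class FakeFetcher:
    """Hypothetical stand-in that mimics the `async_fetch(url)` call shape."""

    async def async_fetch(self, url: str) -> FakePage:
        await asyncio.sleep(0.01)  # simulate network/browser latency
        return FakePage(url=url, status=200)


async def fetch_all(fetcher, urls):
    # Schedule every fetch at once; gather returns results in input order.
    return await asyncio.gather(*(fetcher.async_fetch(u) for u in urls))


urls = [
    "https://books.toscrape.com/index.html",
    "https://www.browserscan.net/bot-detection",
]
# asyncio.run creates the event loop that a top-level `await` would need.
pages = asyncio.run(fetch_all(FakeFetcher(), urls))
statuses = [page.status for page in pages]
```

Assuming the real fetchers behave as described above, passing `StealthyFetcher()` instead of `FakeFetcher()` should follow the same pattern.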
  1. The StealthyFetcher class now has the geoip argument in its fetch methods. When enabled, the class automatically uses the IP address's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It also calculates and spoofs the browser's language based on the distribution of language speakers in the target region.

  2. Added the retries argument to the Fetcher/AsyncFetcher classes, so you can now set how many times each request made by httpx is retried.

  3. Added the url_join method to Adaptor and Fetchers, which takes a relative URL and joins it with the current URL to generate an absolute URL!

  4. Added the keep_cdata method to Adaptor and Fetchers to stop the parser from removing CDATA when needed.

  5. The Adaptor/Response body method now returns the raw HTML response when possible (without the library processing it).

  6. Added logging to the Response class, so when you use the Fetchers you now get a log line with information about the response you received.
    Example:

    >> from scrapling.defaults import Fetcher
    >> Fetcher.get('https://books.toscrape.com/index.html')
    [2024-12-16 13:33:36] INFO: Fetched (200) <GET https://books.toscrape.com/index.html> (referer: https://www.google.com/search?q=toscrape)
    >> 
  7. Standard string methods called on a TextHandler, like .replace(), now return another TextHandler. Previously they returned a standard string.

  8. Big speed improvements across the library and overall stealth improvements in the Fetcher classes.

  9. Thanks to refactoring a lot of the code and caching in the right places, requests done in bulk now get a big speed increase.
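
To make the url_join item in the list above concrete, here is a minimal sketch of how a relative URL resolves against a current URL. It uses the standard library's `urllib.parse.urljoin` purely as an illustration of the joining rules; that Scrapling's method resolves URLs the same way is an assumption, and the page paths are made up for the example.

```python
from urllib.parse import urljoin

# Assumed "current URL" of a fetched page (illustrative path).
current_url = "https://books.toscrape.com/catalogue/page-1.html"

# A bare file name replaces the last path segment.
sibling = urljoin(current_url, "page-2.html")

# A leading slash resolves against the site root.
root_relative = urljoin(current_url, "/index.html")

# ".." climbs one directory before resolving.
parent_relative = urljoin(current_url, "../index.html")

# An absolute URL is returned unchanged.
already_absolute = urljoin(current_url, "https://example.com/a")
```

This is the same resolution a browser applies to `href` values, which is why joining against the page's own URL is usually what you want when following links.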

Breaking changes

  • Support for Python 3.8 has been dropped. (Mainly because Playwright stopped supporting it, but it was a problematic version anyway.)

  • The debug argument has been removed from the entire library. To enable debug logging, do this after importing the library:

    >>> import logging
    >>> logging.getLogger("scrapling").setLevel(logging.DEBUG)

Bugs Squashed

  1. WebGL is now enabled by default, since a lot of protections now check whether it is enabled.
  2. Fixed some mistakes and typos in the docs/README.

Quality of life changes

  1. All logging is now unified under the logger name scrapling for easier and cleaner control. We were using the root logger before.
  2. Restructured the tests folder into a cleaner structure and added tests for the new features. All the tests were rewritten to a cleaner version and more tests were added for higher coverage.
  3. Refactored a big part of the code to be cleaner and easier to maintain.
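
Because everything is logged under the name scrapling, the library's verbosity can be tuned without touching your own loggers. The sketch below uses only the standard `logging` module. The `myapp` logger name and the log messages are made up for illustration and are emitted by the snippet itself, not by Scrapling.

```python
import io
import logging

# Capture output in memory so the effect is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

# Quiet only the library: drop anything below WARNING from "scrapling".
logging.getLogger("scrapling").setLevel(logging.WARNING)

# Illustrative messages (hypothetical, emitted here by hand).
logging.getLogger("scrapling").info("library info message")        # dropped
logging.getLogger("myapp").info("starting crawl")                  # kept
logging.getLogger("scrapling").warning("library warning message")  # kept

output = stream.getvalue()
root.removeHandler(handler)  # leave the root logger as we found it
```

With the root logger this kind of per-library control is not possible, since changing its level affects every logger that inherits from it.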

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have happily put in for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.

For easy copy-paste from Scrapy/parsel code when needed :)
This will force lxml to keep cdata while parsing html if you want
It's not supported in the new version of Playwright and is problematic in some situations while being slower than newer versions.
If possible, otherwise returns `Adaptor.html_content`
So now you control the logging and the debugging from the shell through the logger with the name 'scrapling'
Dropped browserforge from the requirements here so its version gets controlled by camoufox
For the issue with geoip
By caching the `StaticEngine` class instance
The first step in fully supporting async
This will cause a slight performance increase
…d the async function later

The caching will give a slight performance increase with bulk requests
@D4Vinci D4Vinci merged commit 60df72c into main Dec 16, 2024
5 checks passed