-
Notifications
You must be signed in to change notification settings - Fork 327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Integrate browserforge fingerprints #829
base: master
Are you sure you want to change the base?
Conversation
TODO: Tests, JS page function for injecting fingerprint json.
Todo: Make it work with page.add_init_script
Added dev test TODO: Add proper tests Sync header generation with fingerprint generation (use gen from browserforge)
Use browser_pool_options from crawler to pass fingerprint related stuff to be similar to JS
Good job! Why don't we use browserforge's |
# Navigate to the URL and get response. | ||
response = await context.page.goto(context.request.url) | ||
|
||
response = await context.page.goto(context.request.url, wait_until='domcontentloaded') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, Playwright waits until the load
event (i.e. until the HTML has loaded and all the subresources - images, stylesheets - loaded too).
I don't have a strong opinion here; I'm just curious - did this make a change in some behavior during the development? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I forgot it there when I was testing stuff. It is not needed. Good catch.
I was not aware of that. Wow, that guy did everything. I will try to use that injector, |
@@ -152,8 +143,45 @@ async def _create_browser_context(self, proxy_info: ProxyInfo | None = None) -> | |||
else None | |||
) | |||
|
|||
return await self._browser.new_context( | |||
if self._use_fingerprints: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This can be simpler if we decide to rely purely on browserforge and remove our own header generation. )
raise RuntimeError('Browser context was not set yet.') | ||
return self._browser_context | ||
|
||
@override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused why code below is shown as changed. There was no change to this in any commit in this PR
Description
Integrate
browserforge
package and use it's fingerprint and header generation capabilities to enable fingerprint generation in PlaywrightCrawlerIssues
Open for discussion:
1. Javascript injected script (src/crawlee/fingerprint_suite/_injected_page_function.py)Extracted from utils and fingerprint_injectorRemoved all type-script specific syntax( it does not work when used through Python's Playwright)Merged into standalone file without any imports.Wrapped in function that can be called with single argument -> fingerprintHaving such modified duplicated code in Python repo is not nice and is not practical when the original gets updated.Proposal: Refactor JS code to make this a standalone-file without any dependency and without Typescript syntax. Download this single file from Javascript repo either on installing the crawlee or dynamically when running PlaywrightCrawler with fingerprints.
2. How can users use it? Currently I followed JS implementation https://crawlee.dev/docs/guides/avoid-blocking#customizing-browser-fingerprints
Which in Python looks for exmaple like this
It is a little bit clumsy as we pass so many nested options to internal objects of Crawler. Any better proposals?
3. Fingerprint QA?How do we ensure that generated fingerprints are good? While writing tests I found out that around 5% fingerprints (specifically the headers) generated from the browserforge are not good and they will break the requests. I did a simple function to catch that specific problem, but how many more are there that I haven't found?4. Keep our own header generation or rely on browserforge?
@vdusek already created header generation function. It is different implementation of browserforge HeaderGenerator. Do we keep both?
In this PR I kept our header generation unless 'use_fingerprints': True. If using fingerprints from browserforge, then I use header generation from browserforge as there is some basic logic to make header and fingerprint somewhat consistent (especially UserAgent).
5. Keep browser forge tightly integrated or just create example of how people can integrate it, but not have it available by default?