Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Integrate browserforge fingerprints #829

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

Pijukatel
Copy link
Contributor

@Pijukatel Pijukatel commented Dec 18, 2024

Description

Integrate browserforge package and use it's fingerprint and header generation capabilities to enable fingerprint generation in PlaywrightCrawler

Issues

Open for discussion:

1. Javascript injected script (src/crawlee/fingerprint_suite/_injected_page_function.py)

  • Extracted from utils and fingerprint_injector
  • Removed all type-script specific syntax( it does not work when used through Python's Playwright)
  • Merged into standalone file without any imports.
  • Wrapped in function that can be called with single argument -> fingerprint

Having such modified duplicated code in Python repo is not nice and is not practical when the original gets updated.
Proposal: Refactor JS code to make this a standalone-file without any dependency and without Typescript syntax. Download this single file from Javascript repo either on installing the crawlee or dynamically when running PlaywrightCrawler with fingerprints.

2. How can users use it? Currently I followed JS implementation https://crawlee.dev/docs/guides/avoid-blocking#customizing-browser-fingerprints

Which in Python looks for exmaple like this

crawler = PlaywrightCrawler(
    headless=True,
    browser_pool_options={
        'use_fingerprints': True,
        'fingerprint_generator_options': {
            'browser': 'edge',
            'os': 'android',
            'screen': Screen(
                min_width=min_width, max_width=max_width, min_height=min_height, max_height=max_height
            ),
        },
    },
)

It is a little bit clumsy as we pass so many nested options to internal objects of Crawler. Any better proposals?

3. Fingerprint QA?
How do we ensure that generated fingerprints are good? While writing tests I found out that around 5% fingerprints (specifically the headers) generated from the browserforge are not good and they will break the requests. I did a simple function to catch that specific problem, but how many more are there that I haven't found?

4. Keep our own header generation or rely on browserforge?
@vdusek already created header generation function. It is different implementation of browserforge HeaderGenerator. Do we keep both?
In this PR I kept our header generation unless 'use_fingerprints': True. If using fingerprints from browserforge, then I use header generation from browserforge as there is some basic logic to make header and fingerprint somewhat consistent (especially UserAgent).

5. Keep browser forge tightly integrated or just create example of how people can integrate it, but not have it available by default?

TODO: Tests, JS page function for injecting fingerprint json.
Todo: Make it work with page.add_init_script
Added dev test
TODO:
Add proper tests
Sync header generation with fingerprint generation (use gen from browserforge)
Added more tests.
Use browser_pool_options from crawler to pass fingerprint related stuff to be similar to JS
@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Dec 18, 2024
@barjin
Copy link
Contributor

barjin commented Dec 18, 2024

Good job! Why don't we use browserforge's injector module for this? Do we need more control over the fingerprint than browserforge can offer us?

# Navigate to the URL and get response.
response = await context.page.goto(context.request.url)

response = await context.page.goto(context.request.url, wait_until='domcontentloaded')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, Playwright waits until the load event (i.e. until the HTML has loaded and all the subresources - images, stylesheets - loaded too).

I don't have a strong opinion here; I'm just curious - did this make a change in some behavior during the development? 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I forgot it there when I was testing stuff. It is not needed. Good catch.

@Pijukatel
Copy link
Contributor Author

Good job! Why don't we use browserforge's injector module for this? Do we need more control over the fingerprint than browserforge can offer us?

I was not aware of that. Wow, that guy did everything. I will try to use that injector,

@Pijukatel Pijukatel changed the title Integrate browserforge fingerprints feat: Integrate browserforge fingerprints Dec 19, 2024
@@ -152,8 +143,45 @@ async def _create_browser_context(self, proxy_info: ProxyInfo | None = None) ->
else None
)

return await self._browser.new_context(
if self._use_fingerprints:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This can be simpler if we decide to rely purely on browserforge and remove our own header generation. )

raise RuntimeError('Browser context was not set yet.')
return self._browser_context

@override
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused why code below is shown as changed. There was no change to this in any commit in this PR

@github-actions github-actions bot added this to the 105th sprint - Tooling team milestone Dec 19, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add init script for Playwright browser context
2 participants