How to change User Agent in PlaywrightCrawler? #751

Open
LeMoussel opened this issue Nov 27, 2024 · 5 comments

LeMoussel commented Nov 27, 2024

How can I change the user_agent in PlaywrightCrawler?

Here's what I tried:

    from crawlee.playwright_crawler import PlaywrightCrawler

    crawler = PlaywrightCrawler(browser_options={'user_agent': "My User Agent"})

However, I encountered the following error: BrowserType.launch() got an unexpected keyword argument 'user_agent'.

@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Nov 27, 2024
vdusek added a commit that referenced this issue Nov 27, 2024
- Enhance argument docstrings for `PlaywrightCrawler` (and propagate
them further).
- Mostly `browser_options` and `page_options`, and add links to the PW
docs.
- The previous state was clearly insufficient, see e.g. #751.
@LeMoussel
Author

As described in PR #753, I specified the user_agent to use, but I still get an error:

File "c:\Users\pc\AppData\Local\Programs\Python\Python310\lib\site-packages\crawlee\playwright_crawler\_playwright_crawler.py", line 108, in __init__
    super().__init__(**kwargs)
TypeError: BasicCrawler.__init__() got an unexpected keyword argument 'browser_options'

@janbuchar
Collaborator

Hi @LeMoussel, the browser_options parameter is not released yet, so unless you are using a beta release, this error is expected.

When are we planning to make a new release @vdusek?

@Mantisus
Collaborator

Mantisus commented Nov 27, 2024

Hi @LeMoussel, user_agent is not a valid option for browser_options; it must be passed in when the browser context is created. Keep an eye on #755.

For now, you can set the User-Agent as a header on the request.

But if your goal is only to replace the default Playwright User-Agent, note that HeaderGenerator already does this automatically.
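
To illustrate the point about context creation, here is a minimal sketch in plain Playwright (not crawlee) showing why the original attempt failed: `launch()` accepts browser-level options, while `user_agent` is a parameter of `new_context()`. The names `CONTEXT_ONLY_KEYS`, `split_options`, and `report_user_agent` are hypothetical helpers made up for this example, and the key set is only a small illustrative subset of the context options. How crawlee forwards these options internally is not shown here.

```python
from __future__ import annotations

import asyncio

# Assumption for illustration: a small subset of options that Playwright
# accepts on Browser.new_context() but not on BrowserType.launch().
CONTEXT_ONLY_KEYS = {"user_agent", "viewport", "locale"}


def split_options(options: dict) -> tuple[dict, dict]:
    """Split a mixed options dict into (launch_options, context_options)."""
    launch = {k: v for k, v in options.items() if k not in CONTEXT_ONLY_KEYS}
    context = {k: v for k, v in options.items() if k in CONTEXT_ONLY_KEYS}
    return launch, context


async def report_user_agent(options: dict) -> str:
    """Launch Chromium and return the user agent the page reports."""
    # Imported lazily so split_options() works without Playwright installed.
    from playwright.async_api import async_playwright

    launch_options, context_options = split_options(options)
    async with async_playwright() as p:
        # Passing user_agent here would raise the "unexpected keyword" error.
        browser = await p.chromium.launch(**launch_options)
        # The user agent belongs to the browser context.
        context = await browser.new_context(**context_options)
        page = await context.new_page()
        user_agent = await page.evaluate("() => navigator.userAgent")
        await browser.close()
        return user_agent


if __name__ == "__main__":
    mixed = {"headless": True, "user_agent": "My User Agent"}
    print(asyncio.run(report_user_agent(mixed)))
```

Running the script requires Playwright and a Chromium build to be installed; `split_options` on its own has no such dependency.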

@LeMoussel
Author

LeMoussel commented Nov 28, 2024

It seems I'm missing something here. 🙂
When I set the User-Agent header for the request like this:

    import asyncio
    import json

    from crawlee import Request
    from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


    async def main() -> None:
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=1,
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url} ...")
            # httpbingo echoes the received User-Agent back as JSON.
            response = await context.response.text()
            data = json.loads(response)
            print(data['user-agent'])

        await crawler.run(
            [
                Request.from_url(
                    url="https://httpbingo.org/user-agent",
                    headers={"User-Agent": "Test User Agent"},
                )
            ]
        )


    asyncio.run(main())

I get the following result:

[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 1; cpu = 0.0; mem = 0.0; event_loop = 1.0; client_info = 0.0
[crawlee.playwright_crawler._playwright_crawler] INFO  Processing https://httpbingo.org/user-agent ...
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
[crawlee.playwright_crawler._playwright_crawler] INFO  The crawler has reached its limit of 1 requests per crawl. All ongoing requests have now completed. Total requests processed: 1. The crawler will now shut down.
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.playwright_crawler._playwright_crawler] INFO  Final request statistics:

As you can see, the User-Agent value hasn't changed.

My goal is to replace the default Playwright user agent with my custom one, and I want to achieve this manually in my code, not automatically by crawlee-python.

Note: I tested the same with HttpCrawler, and it correctly passes the User-Agent value.

@Mantisus
Collaborator

Mantisus commented Nov 28, 2024

You didn't miss anything. I completely overlooked that in Playwright, the page.set_extra_http_headers method doesn't overwrite headers that are set at the context level.

And since the User-Agent is set at the context level by HeaderGenerator, we can't overwrite it from the Request.

Once the fix for #755 is ready, it should allow you to set the User-Agent for the context.

But we still need to think about cases where a header should be overridable at the page level from the Request.

@vdusek vdusek self-assigned this Dec 2, 2024
janbuchar pushed a commit that referenced this issue Dec 16, 2024
### Description

- Fix `page_options` for `PlaywrightBrowserPlugin`

### Issues

- Closes: #755, #751 

### Testing

- Add a test checking that `page_options` works in
`PlaywrightBrowserPlugin`

### Checklist

- [ ] CI passed