
Cookies not transferring between page.goto() calls - stay logged in using same playwright page #149

Closed
BenzTivianne opened this issue Dec 15, 2022 · 8 comments

Comments

@BenzTivianne

I am running into an issue where, if I set cookies on a page, the webpage initially loads as if the cookies are there (i.e. the terms and conditions popup has already been accepted), but when I load the same webpage again using the same Playwright page, the page shows the terms and conditions popup as if the cookies are not there.

Unfortunately, the original URL I was using is a page only I have access to, so I cannot share it. I am using PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False} in my Scrapy settings.py. The code below is a simplified version: it requests the URL with cookies set and then reuses the same Playwright page to reload the same URL.
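For context, the relevant part of my settings.py looks roughly like this (a sketch; the download handler and reactor values are the standard ones documented for scrapy-playwright):

# settings.py (sketch; only the scrapy-playwright related settings)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}

The spider: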

import scrapy

class TestSpider(scrapy.Spider):
    name = 'testing'

    def start_requests(self):
        url = "https://www.testwebsite.com/"
        yield scrapy.Request(url,
            callback=self.parse1,
            cookies={'cookie_name': 'cookie_value'},
            meta={
                'playwright': True,
                'playwright_include_page': True,
            }
        )

    def parse1(self, response):
        print('Parse1 step')  # webpage loaded with cookies here. No accept popup
        page = response.meta['playwright_page']

        yield scrapy.Request(response.url,
            callback=self.parse2,
            dont_filter=True,
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page': page,
            }
        )

    def parse2(self, response):
        print('Parse2 step')  # webpage loaded without cookies here. Popup shown

I also tried the above using Playwright on its own, and the webpage loads both times as if it has the cookies set.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    context.add_cookies([{'name': 'cookie_name', 'value': 'cookie_value', 'domain': 'www.testwebsite.com', 'path': '/'}])
    page = context.new_page()
    page.goto('https://www.testwebsite.com/')  # webpage loaded with cookies here. No accept popup
    page2 = page
    page2.goto('https://www.testwebsite.com/')  # webpage loaded with cookies here. No accept popup

    print('done')

I understand I could just add the cookies to the scrapy.Request() yielded from parse1. The reason I am doing it this way is that, in the request yielded from start_requests, I will be using PageMethod calls to log into a website. I want to stay logged in throughout the entire Scrapy session, using a single page to load all the URLs I need.

What I also found interesting is that I went directly into the code in handler.ScrapyPlaywrightDownloadHandler and changed it to make a second page.goto() request right after the original page.goto() request, but I still got the same outcome: no cookies loaded on the second page.goto() call. Shown below, starting at line 296:

page_goto_kwargs = request.meta.get("playwright_page_goto_kwargs") or {}
page_goto_kwargs.pop("url", None)
response = await page.goto(url=request.url, **page_goto_kwargs)
response = await page.goto(url=request.url, **page_goto_kwargs)  # added the same line to reload the webpage with the same Playwright page; the page loads without cookies

Is there a way to log into the initial playwright page and stay logged in? Or is there a part of the code within the handler.py file that is preventing the cookies and login information from staying with each page call?

@elacuesta
Member

By default, headers (including cookies) are not handled by the browser; instead, they are overridden (source) with the headers that come from the Scrapy request. This means cookies are the result of the processing done by Scrapy's built-in CookiesMiddleware. I'd suggest trying the following:

  1. Enable COOKIES_DEBUG to see how Scrapy is dealing with the cookies
  2. Set PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None to hand control of header processing over to Playwright (in which case Scrapy's cookie management is ignored).
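
Both can go in settings.py, roughly like this:

# settings.py (sketch)
COOKIES_DEBUG = True  # log the Cookie/Set-Cookie headers handled by Scrapy's CookiesMiddleware
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None  # let Playwright decide the request headers (Scrapy's cookie handling is bypassed)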

@BenzTivianne
Author

Thank you for your response. I understand now that the cookies are handled by Scrapy and its middlewares, not within the browser. I did try setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None, but that did not seem to work and the cookies were not loaded. But I did try using the PLAYWRIGHT_CONTEXTS option (passing cookies through playwright_context_kwargs). I had to do some rigging, and I am curious as to why it works.

My spider:

import scrapy
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://www.testwebsite.com/sign_in_page/"
        yield scrapy.Request(url,
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context="new",
                playwright_context_kwargs={
                    "storage_state": {
                        "cookies": [{'name': 'cookie_name', 'value': 'cookie_value', 'domain': 'www.testwebsite.com', 'path': '/'}]
                    }
                },
                playwright_page_methods={
                    "type1": PageMethod("type", selector="//input[@id='InputEmail']", text='[email protected]', delay=50),
                    "type2": PageMethod("type", selector="//input[@id='InputPassword']", text='password123', delay=50),
                    "click1": PageMethod("click", selector="//*[@id='login_form']/button"),
                    "load1": PageMethod("wait_for_url", url='https://www.testwebsite.com/main/'),  # added to wait for the new page to properly load all the way
                },
            ),
        )

    async def parse(self, response):
        url = 'https://www.testwebsite.com/random_page/'
        page = response.meta['playwright_page']

        cookies = await page.context.cookies()
        print(cookies)  # used for testing

        yield scrapy.Request(url,
            callback=self.test_parse,
            dont_filter=True,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page=page,
            )
        )

    async def test_parse(self, response):
        print('Here')  # still logged in with all cookies
        page = response.meta['playwright_page']
        cookies = await page.context.cookies()
        print(cookies)  # same cookies printed

The issue I ran into is with the page methods. I had to enter the username and password, then click a button that redirects to the main page of the website (login page redirected to main page). The problem is that the page would not finish loading after the button was clicked; it hung in a loading state until it timed out. I found that by commenting out lines 267-277 in handler.py it would work, but then it would not wait for the page to load, which is why I added: "load1": PageMethod("wait_for_url", url='https://www.testwebsite.com/main/').
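
For what it's worth, in plain Playwright the same login flow would look roughly like the sketch below (placeholder selectors, URLs and credentials, same as above); I include it only to show why an explicit wait after the click is needed:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto('https://www.testwebsite.com/sign_in_page/')
    page.type("//input[@id='InputEmail']", 'placeholder@email.com', delay=50)
    page.type("//input[@id='InputPassword']", 'password123', delay=50)
    page.click("//*[@id='login_form']/button")
    # without an explicit wait here, the script can move on before the
    # post-login redirect to /main/ has finished loading
    page.wait_for_url('https://www.testwebsite.com/main/')
    browser.close()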

Screenshot of handler.py with lines 267-277 commented out (image not included here).

So my question is: is this part of the code needed? Could commenting it out cause issues in the future? I hope this all makes sense; I have been working on this all week and am learning as fast as I can. I appreciate all the help and advice.

-Benz

@elacuesta
Member

That bit of code is necessary because:

  1. AFAICT it is not possible to make POST requests directly; the only way is to intercept the request and override the method (since playwright==1.16 there is APIRequestContext.post, but from what I can see there is no JS evaluation like there is when using Page.goto).
  2. It is how PLAYWRIGHT_PROCESS_REQUEST_HEADERS works. Everything users do with Scrapy should still work, including setting specific request headers in spiders and updating them using middlewares.
  3. It is how PLAYWRIGHT_ABORT_REQUEST works. Users asked for a way to prevent certain automatic background requests from being made, and to my knowledge intercepting the request and aborting it is the only way.

That said, I'd actually expect the cookies from the context not to be sent because of the header overriding that happens in there, but it seems like it is not working the way I thought it was. I've opened microsoft/playwright-python#1686 upstream regarding this.
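
To illustrate the mechanism, here is a rough sketch in plain Playwright (not the actual scrapy-playwright code; the header values and URL are just placeholders):

from playwright.sync_api import sync_playwright

# headers that would come from the Scrapy request (placeholder values)
scrapy_headers = {"Cookie": "cookie_name=cookie_value"}

def override_headers(route, request):
    # replace/merge the browser's own headers with the Scrapy ones,
    # which is why cookies set on the context can end up being ignored
    route.continue_(headers={**request.headers, **scrapy_headers})

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", override_headers)
    page.goto("https://httpbin.org/headers")
    browser.close()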

@fioreagui

Hey, I'm experiencing a similar issue.
I'm currently working on a spider and have attempted two methods of setting the cookies: I tried the cookies argument on scrapy.Request, expecting the CookiesMiddleware to handle them, and I also set them through 'playwright_context_kwargs' in the context specification.
Unfortunately, neither approach seems to work.
Could you guide me on the correct way to apply cookies? Has there been any update on this problem?

@elacuesta
Member

There has been no work related to this issue. If you think there's a bug, please provide a minimal, reproducible example.

@fioreagui

Yes, I think there might be a bug, or there could be another way of applying cookies that I'm not seeing.
My goal is to send a request to the host with the cookies already loaded. Here's a snippet of my spider's code.

# case 1
# Request.cookies

custom_settings = {'COOKIES_DEBUG': True}

def start_requests(self):
    cookies = [
        {'name': 'cookie_name', 'value': 'cookie_value', 'domain': 'example.com', 'path': '/'}
    ]
    yield Request(
        url=url,
        cookies=cookies,
        callback=self.parse,
        meta=dict(
            playwright=True,
            playwright_include_page=True,
        ),
    )

async def parse(self, response):
    page = response.meta["playwright_page"]
    storage_state = await page.context.storage_state()

    print("Cookies sent: ",
        response.request.headers.get('Cookie'))
    # output: b'PHPSESSID=5c89ahogln1s3v6bvr46j0rj8s'
    # my cookie wasn't sent
    print("Response cookies: ",
        response.headers.getlist("Set-Cookie"))
    # output: [b'PHPSESSID=5c89ahogln1s3v6bvr46j0rj8s; path=/']
    print("Page cookies: ",
        storage_state['cookies'])
    # output: [{'name': 'PHPSESSID', 'value': '5c89ahogln1s3v6bvr46j0rj8s', 'domain': 'example.com', 'path': '/', 'expires': -1, 'httpOnly': False, 'secure': False, 'sameSite': 'None'}]

# case 2
# context cookies + PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None

custom_settings = {'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None}

def start_requests(self):
    cookies = [
        {'name': 'cookie_name', 'value': 'cookie_value', 'domain': 'example.com', 'path': '/'}
    ]
    yield Request(
        url=url,
        callback=self.parse,
        meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_context='context',
            playwright_context_kwargs={
                'storage_state': {
                    'cookies': cookies,
                }
            }
        ),
    )

async def parse(self, response):
    page = response.meta["playwright_page"]
    storage_state = await page.context.storage_state()

    print("Cookies sent: ",
        response.request.headers.get('Cookie'))
    # output: None
    print("Response cookies: ",
        response.headers.getlist("Set-Cookie"))
    # output: []
    print("Page cookies: ",
        storage_state['cookies'])
    # output: []

In the first case, some cookies were sent but not the one I was interested in. In the second case, no cookies were sent.
I can't understand why neither of these cases works.

@elacuesta
Member

elacuesta commented Aug 11, 2023

I'm sorry @fiorellaaguirrezabala, I'm not able to reproduce either case. Maybe you're setting an incorrect domain for your cookies? The URL you're using doesn't appear in your snippet.
In both cases it makes sense for the "Response cookies" line to be empty. In case 1, the Page cookies are empty because the "Cookie" header for the Playwright request is overridden with the values from the Scrapy request; the source is not the Page object.
In any case, I want to point out that this doesn't seem related to the current issue if you're not making subsequent requests with the same page, as the OP is doing.

Case 1

import json
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod


class HeadersSpider(Spider):
    name = "headers"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "COOKIES_DEBUG": True
    }

    def start_requests(self):
        cookies = [
            {"name": "cookie_name", "value": "cookie_value", "domain": "httpbin.org", "path": "/"}
        ]
        yield Request(
            url="https://httpbin.org/headers",
            cookies=cookies,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("screenshot", path="headers1.png", full_page=True),
                ],
            ),
        )

    async def parse(self, response):
        headers = json.loads(response.css("pre::text").get())["headers"]
        yield {"url": response.url, "headers": headers}
        page = response.meta["playwright_page"]
        storage_state = await page.context.storage_state()
        await page.close()

        print("Cookies sent: ", response.request.headers.get("Cookie"))
        print("Response cookies: ", response.headers.getlist("Set-Cookie"))
        print("Page cookies: ", storage_state["cookies"])
2023-08-11 14:43:47 [scrapy.core.engine] INFO: Spider opened
2023-08-11 14:43:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-11 14:43:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-11 14:43:47 [scrapy-playwright] INFO: Starting download handler
2023-08-11 14:43:52 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://httpbin.org/headers>
Cookie: cookie_name=cookie_value

2023-08-11 14:43:52 [scrapy-playwright] INFO: Launching browser chromium
2023-08-11 14:43:52 [scrapy-playwright] INFO: Browser chromium launched
2023-08-11 14:43:52 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False)
2023-08-11 14:43:52 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-08-11 14:43:52 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/headers> (resource type: document, referrer: None)
2023-08-11 14:44:19 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/headers> (referrer: None)
2023-08-11 14:44:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/headers> (referer: None) ['playwright']
2023-08-11 14:44:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://httpbin.org/headers>
{'url': 'https://httpbin.org/headers', 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Cache-Control': 'no-cache', 'Cookie': 'cookie_name=cookie_value', 'Host': 'httpbin.org', 'Pragma': 'no-cache', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-Mode': 'navigate', 'Sec-Fetch-Site': 'none', 'Sec-Fetch-User': '?1', 'User-Agent': 'Scrapy/2.10.0 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-64d67359-6b1bd1835e14d949563f4a78'}}
Cookies sent:  b'cookie_name=cookie_value'
Response cookies:  []
Page cookies:  []
2023-08-11 14:44:20 [scrapy.core.engine] INFO: Closing spider (finished)

The screenshot from PageMethod("screenshot") is headers1.png (image not included here).

Case 2

import json
from scrapy import Spider, Request
from scrapy_playwright.page import PageMethod


class HeadersSpider(Spider):
    name = "headers"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
    }

    def start_requests(self):
        cookies = [
            {"name": "cookie_name", "value": "cookie_value", "domain": "httpbin.org", "path": "/"}
        ]
        yield Request(
            url="https://httpbin.org/headers",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context="context",
                playwright_context_kwargs={
                    "storage_state": {
                        "cookies": cookies,
                    }
                },
                playwright_page_methods=[
                    PageMethod("screenshot", path="headers2.png", full_page=True),
                ],
            ),
        )

    async def parse(self, response):
        headers = json.loads(response.css("pre::text").get())["headers"]
        yield {"url": response.url, "headers": headers}
        page = response.meta["playwright_page"]
        storage_state = await page.context.storage_state()
        await page.close()

        print("Cookies sent: ", response.request.headers.get("Cookie"))
        print("Response cookies: ", response.headers.getlist("Set-Cookie"))
        print("Page cookies: ", storage_state["cookies"])
2023-08-11 14:30:36 [scrapy-playwright] DEBUG: [Context=context] Request: <GET https://httpbin.org/headers> (resource type: document, referrer: None)
2023-08-11 14:30:39 [scrapy-playwright] DEBUG: [Context=context] Response: <200 https://httpbin.org/headers> (referrer: None)
2023-08-11 14:30:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/headers> (referer: None) ['playwright']
2023-08-11 14:30:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://httpbin.org/headers>
{'url': 'https://httpbin.org/headers', 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7', 'Accept-Encoding': 'gzip, deflate, br', 'Cache-Control': 'no-cache', 'Cookie': 'cookie_name=cookie_value', 'Host': 'httpbin.org', 'Pragma': 'no-cache', 'Sec-Ch-Ua': '"Not/A)Brand";v="99", "HeadlessChrome";v="115", "Chromium";v="115"', 'Sec-Ch-Ua-Mobile': '?0', 'Sec-Ch-Ua-Platform': '"Linux"', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-Mode': 'navigate', 'Sec-Fetch-Site': 'none', 'Sec-Fetch-User': '?1', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/115.0.5790.75 Safari/537.36', 'X-Amzn-Trace-Id': 'Root=1-64d6703c-51ba29326fdd239d10e42c87'}}
Cookies sent:  b'cookie_name=cookie_value'
Response cookies:  []
Page cookies:  [{'name': 'cookie_name', 'value': 'cookie_value', 'domain': 'httpbin.org', 'path': '/', 'expires': -1, 'httpOnly': False, 'secure': False, 'sameSite': 'Lax'}]
2023-08-11 14:30:40 [scrapy.core.engine] INFO: Closing spider (finished)

The screenshot from PageMethod("screenshot") is headers2.png (image not included here).

@elacuesta
Member

Closing due to inactivity.

@elacuesta closed this as not planned on Dec 31, 2023