Implement/document a way how to pass custom information to handlers #531
Replies: 2 comments 6 replies
-
I believe that what you want to achieve can be done using the language itself, and there's no need to implement a robust dependency injection mechanism in Crawlee. I'm open to anything that might change my mind, though. If I understand this correctly, you receive some parameters on the command line and handle them with:

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.router import Router

# Label (an enum of handler labels) and LENGTH_RE (a compiled regex) are
# assumed to be defined elsewhere in the project.


async def scrape(slug: str):
    crawler = create_crawler(slug)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data(f"{slug}.json", dataset_name=slug)


def create_crawler(slug: str):
    router = Router[BeautifulSoupCrawlingContext]()
    crawler = BeautifulSoupCrawler(request_handler=router)

    @router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext):
        await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)

    @router.handler(Label.DETAIL)
    async def detail_handler(context: BeautifulSoupCrawlingContext):
        context.log.info(f"Scraping {context.request.url}")
        description = context.soup.select_one(".filmy_page .desc3").text
        length_min = LENGTH_RE.search(description).group(1)
        # TODO get starts_at, then calculate ends_at
        await context.push_data(
            {
                "url": context.request.url,
                "title": context.soup.select_one(".filmy_page h1").text.strip(),
                "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
            },
            dataset_name=slug,  # the handler closes over slug from create_crawler
        )

    return crawler
```
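Because the handlers are defined inside `create_crawler`, they close over `slug`; no framework support is needed. To connect this to the `click` options mentioned in the question, a minimal sketch (assuming the `scrape` coroutine above; the command and argument names are illustrative) could look like:

```python
import asyncio

import click


@click.command()
@click.argument("slug")
def main(slug: str) -> None:
    """CLI entry point; slug travels down as an ordinary function argument."""
    asyncio.run(scrape(slug))


if __name__ == "__main__":
    main()
```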
-
Yes, I'm also looking into this. I would like to add some metadata to the request, just so that I can have that data at save time. I wish this happened automatically with kwargs; it looks semi-easy to implement (a TODO?).

```python
from typing import Any

from crawlee import Request
from crawlee._types import HttpHeaders, HttpMethod, HttpPayload


class ExtendedRequest(Request):
    @classmethod
    def from_url(
        cls,
        url: str,
        *,
        method: HttpMethod = 'GET',
        headers: HttpHeaders | dict[str, str] | None = None,
        payload: HttpPayload | str | None = None,
        label: str | None = None,
        unique_key: str | None = None,
        id: str | None = None,
        keep_url_fragment: bool = False,
        use_extended_unique_key: bool = False,
        always_enqueue: bool = False,
        metadata: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> 'ExtendedRequest':
        # Build the request via the parent classmethod; metadata is not a
        # Request field, so it must not be forwarded there.
        request = super().from_url(
            url=url,
            method=method,
            headers=headers,
            payload=payload,
            label=label,
            unique_key=unique_key,
            id=id,
            keep_url_fragment=keep_url_fragment,
            use_extended_unique_key=use_extended_unique_key,
            always_enqueue=always_enqueue,
            **kwargs,
        )
        # Stash the metadata in user_data so handlers can read it back later.
        if metadata:
            request.user_data.update(metadata)
        return request
```
```python
from crawlee.crawlers import BeautifulSoupCrawler  # older versions: crawlee.beautifulsoup_crawler


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,  # a Router instance defined elsewhere
        max_requests_per_crawl=1,
    )
    await crawler.run(
        [
            ExtendedRequest.from_url(
                url='https://www.myurl.com',
                metadata={'my_extra_value': 1},
            ),
        ]
    )
```

Then in my router I can parse that value out with:

```python
my_extra_value = context.request.user_data.model_extra.get('my_extra_value')
```
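If I'm reading the `Request` model right, the subclass may not be necessary at all: `from_url` forwards extra keyword arguments to the model, and `user_data` is a regular field. A minimal sketch, assuming that behavior (unverified against every crawlee version):

```python
from crawlee import Request

# Assumption: Request.from_url accepts user_data through its **kwargs.
request = Request.from_url(
    'https://www.myurl.com',
    user_data={'my_extra_value': 1},
)

# user_data behaves like a mapping, so a plain .get() should work in a
# handler as well as .model_extra.get():
# my_extra_value = context.request.user_data.get('my_extra_value')
```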
-
For the purposes of testability and nice code structure, I thought I'd have some information dependency-injected top-down from the main function, but I don't know how to do that. I'll illustrate my problem on a constant, but imagine there are some `click` options to my program which affect how the scraper behaves, so the value doesn't have to be immutable and the issue is the same. In the main function of my program, I have certain information I want to pass down; for example, I want the dataset name `"edison"` to be an argument. Making it an argument of the main function is easy. But then, how do I pass that `slug` down to the handlers? I have no idea. What do you suggest as the best approach?