Implement/document a way how to pass custom information to handlers #531
Replies: 2 comments 6 replies
-
I believe that what you want to achieve can be done using the language itself, and there's no need to implement a robust dependency injection mechanism in Crawlee. I'm open to anything that might change my mind, though. If I understand this correctly, you receive some parameters on the command line and handle them with:

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.router import Router

# Label (an enum of handler labels) and LENGTH_RE (a compiled regex) are
# assumed to be defined elsewhere in the project.


async def scrape(slug: str):
    crawler = create_crawler(slug)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data(f"{slug}.json", dataset_name=slug)


def create_crawler(slug: str):
    router = Router[BeautifulSoupCrawlingContext]()
    crawler = BeautifulSoupCrawler(request_handler=router)

    @router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext):
        await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)

    @router.handler(Label.DETAIL)
    async def detail_handler(context: BeautifulSoupCrawlingContext):
        context.log.info(f"Scraping {context.request.url}")
        description = context.soup.select_one(".filmy_page .desc3").text
        length_min = LENGTH_RE.search(description).group(1)
        # TODO get starts_at, then calculate ends_at
        await context.push_data(
            {
                "url": context.request.url,
                "title": context.soup.select_one(".filmy_page h1").text.strip(),
                "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
            },
            dataset_name=slug,  # the handler closes over slug from create_crawler
        )

    return crawler
```
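Because the handlers are defined inside `create_crawler`, they close over `slug`; no framework support is needed. To connect this to the `click` options mentioned in the question, a minimal sketch (assuming the `scrape` coroutine above; the command and argument names are illustrative) could look like:

```python
import asyncio

import click


@click.command()
@click.argument("slug")
def main(slug: str) -> None:
    """CLI entry point; slug travels down as an ordinary function argument."""
    asyncio.run(scrape(slug))


if __name__ == "__main__":
    main()
```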
-
Yes, I'm also looking into this. I would like to add some metadata to the request, just so that I can have that data at save time. I wish this happened automatically with kwargs; it looks semi-easy to implement (a TODO?).

```python
from typing import Any

from crawlee import Request
from crawlee._types import HttpHeaders, HttpMethod, HttpPayload


class ExtendedRequest(Request):
    @classmethod
    def from_url(
        cls,
        url: str,
        *,
        method: HttpMethod = 'GET',
        headers: HttpHeaders | dict[str, str] | None = None,
        payload: HttpPayload | str | None = None,
        label: str | None = None,
        unique_key: str | None = None,
        id: str | None = None,
        keep_url_fragment: bool = False,
        use_extended_unique_key: bool = False,
        always_enqueue: bool = False,
        metadata: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> 'ExtendedRequest':
        # Build the request via the parent classmethod; metadata is not a
        # Request field, so it must not be forwarded there.
        request = super().from_url(
            url=url,
            method=method,
            headers=headers,
            payload=payload,
            label=label,
            unique_key=unique_key,
            id=id,
            keep_url_fragment=keep_url_fragment,
            use_extended_unique_key=use_extended_unique_key,
            always_enqueue=always_enqueue,
            **kwargs,
        )
        # Stash the metadata in user_data so handlers can read it back later.
        if metadata:
            request.user_data.update(metadata)
        return request
```
```python
from crawlee.crawlers import BeautifulSoupCrawler  # older versions: crawlee.beautifulsoup_crawler


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,  # a Router instance defined elsewhere
        max_requests_per_crawl=1,
    )
    await crawler.run(
        [
            ExtendedRequest.from_url(
                url='https://www.myurl.com',
                metadata={'my_extra_value': 1},
            ),
        ]
    )
```

Then in my router I can parse that value out with:

```python
my_extra_value = context.request.user_data.model_extra.get('my_extra_value')
```
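If I'm reading the `Request` model right, the subclass may not be necessary at all: `from_url` forwards extra keyword arguments to the model, and `user_data` is a regular field. A minimal sketch, assuming that behavior (unverified against every crawlee version):

```python
from crawlee import Request

# Assumption: Request.from_url accepts user_data through its **kwargs.
request = Request.from_url(
    'https://www.myurl.com',
    user_data={'my_extra_value': 1},
)

# user_data behaves like a mapping, so a plain .get() should work in a
# handler as well as .model_extra.get():
# my_extra_value = context.request.user_data.get('my_extra_value')
```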
-
For the purposes of testability and nice code structure, I thought I'd have some information dependency-injected top-down from the main function, but I don't know how to do that. I'll illustrate my problem on a constant, but imagine there are some `click` options to my program which affect how the scraper behaves, so the value doesn't have to be immutable and the issue is the same. In the main function of my program, I have certain information I want to pass down; for example, I want the dataset name `"edison"` to be an argument. Making it an argument of the main function is easy. But then, how do I pass that `slug` down to the handlers? I have no idea. What do you suggest as the best approach?