Running different requests with different crawlers? #573

honzajavorek · 2024-10-06T09:28:09Z

honzajavorek
Oct 6, 2024

I'm trying to solve a situation where I want to make the initial request with a plain crawler (because it's an API or something), but continue with subsequent requests to detail pages with a BS4 crawler (because they're regular HTML pages).

A specific example would be requesting an RSS feed, having a default handler with a feedparser instead of BS4 (i.e. just feedparser.parse(context.http_response.read()) with no BS4 parsing taking place), but then requesting individual linked articles and scraping them with BS4.

Since the type of the crawler is set kinda globally for the whole program, I don't know how to do this. It would make more sense to be able to specify how the response gets parsed per handler or per request. I can imagine scrapers where I want to start with BS4, but then jump to Playwright for product detail pages or if BS4 fails to deliver.

What's the best approach to switch crawler types on the fly like this?

Answered by janbuchar

Oct 7, 2024

Well, the best approach to this would be to have separate RequestQueue instances for the separate crawlers and to add requests directly to the queue of the right crawler in your request handlers. There are however some challenges:

As of now, you need to make named queues, because Request.open() will always resolve to the same unnamed queue. This may or may not be a problem if you're running on Apify. Locally, you'll probably need to purge the named queues manually before each run.
Just waiting for both crawlers using something like await asyncio.gather(crawler_1.run(), crawler_2.run()) also won't work right off the bat - I assume that only one of your crawlers will have some start urls a…

View full answer

janbuchar · 2024-10-07T10:42:44Z

janbuchar
Oct 7, 2024
Maintainer

Well, the best approach to this would be to have separate RequestQueue instances for the separate crawlers and to add requests directly to the queue of the right crawler in your request handlers. There are however some challenges:

As of now, you need to make named queues, because Request.open() will always resolve to the same unnamed queue. This may or may not be a problem if you're running on Apify. Locally, you'll probably need to purge the named queues manually before each run.
Just waiting for both crawlers using something like await asyncio.gather(crawler_1.run(), crawler_2.run()) also won't work right off the bat - I assume that only one of your crawlers will have some start urls and the other one will exit immediately. In JS version, we have a keepAlive option, but that is not available in crawlee-python yet.

We are aware of this shortcoming though, and we'd like to enable running multiple interconnected crawlers in the future.

2 replies

honzajavorek Oct 7, 2024
Author

Thanks! So for simple cases, it's probably just easier to httpx the first request myself and only then feed Crawlee with the links I get out of that. Not ideal, but whatever works, right? 😄

janbuchar Oct 7, 2024
Maintainer

Absolutely, that's the KISS solution!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running different requests with different crawlers? #573

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

Running different requests with different crawlers? #573

honzajavorek Oct 6, 2024

Replies: 1 comment · 2 replies

janbuchar Oct 7, 2024 Maintainer

honzajavorek Oct 7, 2024 Author

janbuchar Oct 7, 2024 Maintainer

honzajavorek
Oct 6, 2024

Replies: 1 comment 2 replies

janbuchar
Oct 7, 2024
Maintainer

honzajavorek Oct 7, 2024
Author

janbuchar Oct 7, 2024
Maintainer