How do I use Apify proxies with Crawlee? #575
My code:

```python
async with Actor:
    proxy_configuration = await Actor.create_proxy_configuration()
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        proxy_configuration=proxy_configuration,
        configuration=Configuration(log_level="DEBUG" if debug else "INFO"),
    )
    await crawler.run(links)
```

It doesn't seem to work: there's no feedback in the Actor run log, and I have no idea what happens underneath. I couldn't figure out how to debug this; I cannot even print. I went through several guides in the docs, but there is no single example that shows both `Actor` and a Crawlee crawler together. And I'm getting type errors, too, even when running the crawler with the code above.

I tried to figure out more about what I can pass to the crawler, but the docs are very brief. What are all those things? How do I use them? What are they good for? Compare with e.g. Click's documentation. I guess the kwargs are propagated to `BasicCrawler`, but there the docs only say "Initialize the BasicCrawler". There's no information on what the parameters do, and I'm lost in a rabbit hole of reading code and missing docs.

For some reason my requests get blocked with HTTP 403. The same requests work fine from my laptop. I suppose it's anti-scraping protection, also because the target site is Reddit. But I don't know whether it's failing despite running through the proxies, or whether I set up the proxies wrong. I tried both basic proxies and residential ones.
Figured out that

```python
await Actor.create_proxy_configuration()
```

throws if it doesn't get the correct info from environment variables. Also, there is an "Inspecting current proxy in crawlers" section in the docs after all. And all the parameters are described, but in code rather than in the docs, which renders the API docs useless; it's just much better to read the code directly. Not sure why the type checking doesn't work, but at least I'm not stuck anymore.