Solutioning follow-up from #1911.
What happened: a plain GET worked with requests but 404'd through HttpCrawler. This is not a Crawlee bug: all of Crawlee's HTTP clients inject an Accept-Language header by default (browser impersonation), and this particular API returns 404 whenever that header is present.
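For anyone who wants to see the failure mode without hitting the real API, here is a minimal local sketch using only the standard library. The PickyHandler server is a hypothetical stand-in that mimics the behaviour described above (404 whenever Accept-Language is present); it is not the API from #1911, and fetch_status and run_demo are illustrative helper names.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: 404 if Accept-Language is sent, else 200."""

    def do_GET(self):
        status = 404 if self.headers.get("Accept-Language") else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


def fetch_status(url, headers=None):
    """Return the HTTP status code for a GET, with optional extra headers."""
    # An empty ProxyHandler keeps environment proxy settings out of the demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_demo():
    """Start the stand-in server and fetch it with and without the header."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        plain = fetch_status(url)
        browser_like = fetch_status(url, {"Accept-Language": "en-US,en;q=0.9"})
        return plain, browser_like
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The first status corresponds to the plain request, the second to the request carrying a browser-like Accept-Language header.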
The friction: turning impersonation off today means importing a client, disabling it there, and passing it back to the crawler — and the opt-out is named differently per client:
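from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import ImpitHttpClient
# Build the client with impersonation disabled, then hand it to the crawler
crawler = HttpCrawler(http_client=ImpitHttpClient(browser=None))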
# or HttpxHttpClient(header_generator=None)
# or CurlImpersonateHttpClient(impersonate=None)
That's a lot of ceremony for a common need.
Proposal — expose it on the HTTP crawlers. Add a simple flag so it's a one-liner with no client import:
crawler = HttpCrawler(impersonate=False)
Applies to HttpCrawler, BeautifulSoupCrawler, ParselCrawler (the AbstractHttpCrawler family). It would configure the default HTTP client under the hood. If the user passes their own http_client, the flag doesn't apply — they configure impersonation on that client directly (we should document this clearly, or raise on conflict).
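To make the "raise on conflict" option concrete, here is a rough sketch of how the flag could be resolved against a user-supplied client. Everything in it is hypothetical: StubHttpClient stands in for a real client class, resolve_http_client is an illustrative helper name, and treating impersonate=None as "not set" is an assumption made so that a conflict can be detected at all. It is not existing Crawlee code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubHttpClient:
    """Hypothetical stand-in for an HTTP client; not a Crawlee class."""

    impersonation_enabled: bool = True


def resolve_http_client(
    http_client: Optional[StubHttpClient] = None,
    impersonate: Optional[bool] = None,
) -> StubHttpClient:
    """Pick the client a crawler would use, given the proposed flag.

    impersonate=None means the caller did not set the flag (an assumption
    of this sketch, so explicit True/False can be told apart from the default).
    """
    if http_client is not None:
        if impersonate is not None:
            # The "raise on conflict" variant of the proposal.
            raise ValueError(
                "Pass either http_client or impersonate, not both; "
                "configure impersonation on the client directly."
            )
        return http_client
    # No custom client: configure the default one from the flag.
    enabled = True if impersonate is None else impersonate
    return StubHttpClient(impersonation_enabled=enabled)


if __name__ == "__main__":
    print(resolve_http_client(impersonate=False))
```

The alternative from the proposal, documenting that the flag is ignored when http_client is given, would replace the raise with a plain return of the supplied client. Raising is stricter but avoids a silently ignored flag.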
Also: a short docs guide on default browser-like headers (esp. Accept-Language) — when impersonation helps vs. hurts.
Related: #1683, #1685, #1911