-
Notifications
You must be signed in to change notification settings - Fork 1.1k
refactor: confusing timeout handling #3130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v4
Are you sure you want to change the base?
Conversation
BREAKING CHANGE: The project is now native ESM without a CJS alternative. This is fine since all supported node versions allow `require(esm)`. Also all the dependencies are updated to the latest versions, including cheerio v1.
BREAKING CHANGE: The crawler following options are removed: - `handleRequestFunction` -> `requestHandler` - `handlePageFunction` -> `requestHandler` - `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs` - `handleFailedRequestFunction` -> `failedRequestHandler`
BREAKING CHANGE: The crawling context no longer includes the `Error` object for failed requests. Use the second parameter of the `errorHandler` or `failedRequestHandler` callbacks to access the error. Previously, the crawling context extended a `Record` type, allowing to access any property. This was changed to a strict type, which means that you can only access properties that are defined in the context.
….retireOnBlockedStatusCodes` BREAKING CHANGE: `additionalBlockedStatusCodes` parameter of `Session.retireOnBlockedStatusCodes` method is removed. Use the `blockedStatusCodes` crawler option instead.
….retireOnBlockedStatusCodes` BREAKING CHANGE: `additionalBlockedStatusCodes` parameter of `Session.retireOnBlockedStatusCodes` method is removed. Use the `blockedStatusCodes` crawler option instead.
also tries to bump better-sqlite3 to latest version to have prebuilds for node 22
… Brought in nav hooks into the nav timeout and changed the combined timeout description
…rry a timeout if the input is zero or less this is in essence the refactoring of the combined timeout without getting rid of it.
…w test file for http crawler
barjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution!
I have a few questions and code quality ideas for now. Other contributors have worked with the timeouts deeper in the past and might have better insight into this part of Crawlee.
| try { | ||
| crawlingContext.response = (await this._navigationHandler(crawlingContext, gotoOptions)) ?? undefined; | ||
| // Wrap the entire navigation phase in one timeout | ||
| await addTimeoutToPromise( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random thought - it would make more sense to me to do this timeout one level higher in _runRequestHandler (where _handleNavigation is called - i.e. here). The syntax would require less nesting and would align with how we deal with timeouts in user defined request handlers - see below:
crawlee/packages/browser-crawler/src/internals/browser-crawler.ts
Lines 506 to 510 in 2b74503
| await addTimeoutToPromise( | |
| async () => Promise.resolve(this.userProvidedRequestHandler(crawlingContext as LoadedContext<Context>)), | |
| this.requestHandlerTimeoutInnerMillis, | |
| `requestHandler timed out after ${this.requestHandlerTimeoutInnerMillis / 1000} seconds.`, | |
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would align with how you guys are currently handling the user request handler but the current implementation makes it so _handleNavigation has built in protection if you ever decide to reuse it and _runRequestHandler stays slimmer. If you believe it to be necessary I can make the change. I will flatten _handleNavigation to improve readability.
Co-authored-by: Jindřich Bär <[email protected]>
…eout tests up to standard (maintainer suggestions)
…eout tests up to standard (maintainer suggestions) pt2
Problem:
Confusing timeout logic that causes a user sees a 60s request handler timeout, but the errors will mention 130s (or 100s for HTTP crawlers where navigation timeout is only 30s).
What this PR changes
Impact / Compatibility
Closes #2951
Contributors:
@ezequiel38
@anghel9